Strange Memory Consumption Phenomenon in vLLM #89
Comments
Sorry for the late reply. We'll reproduce it first.

Glad to see your reply! Hope to see the progress!

As for the code @wangxiyuan linked above, here is the memory occupation in my situation:

Thanks for your response! The vllm I used is https://github.com/wangshuai09/vllm/tree/npu_support, which may be the main reason. I tried to test vllm-ascend with v0.7.1rc1, but encountered some problems. I have submitted an issue here.

To summarize, the memory issue occurs with the version of vllm from https://github.com/wangshuai09/vllm/tree/npu_support, which is rather old. @MengqingCao has checked it and found no problem. Closing this issue.
Today, when I tested inference of Qwen2.5-Math-7B-Instruct on one card (TP=PP=1), it reported an OOM error.

I'm curious why this happened, because the weights of the 7B model occupy only 14GB of NPU memory, leaving about 50GB free. I then found that the OOM could be avoided by reducing `gpu_memory_utilization` from 0.96 to 0.8 (a minimal sketch of the two configurations is below). I still don't understand this, even though I set `max_tokens` to 1024 in the LLM. In inference mode, memory is mainly occupied by the model weights, activations, and the KV cache. Searching the vLLM docs, I found the description of `gpu_memory_utilization`.
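For concreteness, this is roughly the setup I mean, as a minimal sketch using vLLM's offline `LLM` API (the model path and prompt are placeholders, not my exact script):

```python
from vllm import LLM, SamplingParams

sampling = SamplingParams(max_tokens=1024)

# Reports OOM on a single card in my environment:
# llm = LLM(model="Qwen/Qwen2.5-Math-7B-Instruct",
#           tensor_parallel_size=1,
#           gpu_memory_utilization=0.96)

# Runs fine after lowering the threshold:
llm = LLM(model="Qwen/Qwen2.5-Math-7B-Instruct",
          tensor_parallel_size=1,
          gpu_memory_utilization=0.8)

outputs = llm.generate(["What is 1 + 1?"], sampling)
print(outputs[0].outputs[0].text)
```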
I was confused that simply decreasing `gpu_memory_utilization` solved the problem, so I did some experiments (a rough sketch of the setup follows the list):

- 7B model, max_tokens = 1K, with the `cpu_offload_gb` parameter disabled
- 7B model, max_tokens = 32K, with the `cpu_offload_gb` parameter disabled
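The two runs correspond roughly to the following script (a sketch assuming the offline `LLM` API and the `torch_npu` memory counters; the model path and prompt are placeholders):

```python
import sys

import torch
import torch_npu  # registers the torch.npu backend on Ascend
from vllm import LLM, SamplingParams

# Run once per configuration, each in a fresh process:
#   python mem_probe.py 1024
#   python mem_probe.py 32768
max_tokens = int(sys.argv[1])

llm = LLM(
    model="Qwen/Qwen2.5-Math-7B-Instruct",  # placeholder path
    gpu_memory_utilization=0.8,
    cpu_offload_gb=0,  # cpu_offload_gb disabled
)
llm.generate(["Solve: 2x + 3 = 11"], SamplingParams(max_tokens=max_tokens))

# Device memory reserved by this process's caching allocator on the NPU.
print(f"max_tokens={max_tokens}: "
      f"reserved {torch.npu.memory_reserved() / 1024**3:.1f} GiB")
```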
So my questions are:

1. What does `gpu_memory_utilization` actually mean? When I set the value, the real memory occupation is usually higher than the threshold.
2. Why is the real occupation above the `gpu_memory_utilization` threshold by about 10GB, regardless of the `gpu_memory_utilization` and `max_tokens` settings?
3. Why doesn't the memory occupation change with `max_tokens`? Even when `max_tokens` is 32x bigger, the memory is unchanged.
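To make the numbers behind these questions concrete, this is the back-of-the-envelope breakdown I have in mind (rough estimates only, based on my understanding that vLLM reserves whatever is left of the budget for the KV cache when the engine starts):

```python
# Back-of-the-envelope memory breakdown for Qwen2.5-Math-7B-Instruct on one
# ~64 GiB card. All values are rough estimates used only to frame the question.
total_gib = 64.0
weights_gib = 14.0  # ~7B params * 2 bytes (bf16/fp16)

for util in (0.96, 0.8):
    budget_gib = util * total_gib
    # As far as I can tell, whatever remains after the weights and the
    # profiling-run activation peak is pre-allocated for the KV cache,
    # independent of max_tokens.
    leftover_gib = budget_gib - weights_gib
    print(f"gpu_memory_utilization={util}: budget {budget_gib:.1f} GiB, "
          f"left for activations + KV cache {leftover_gib:.1f} GiB")
```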