[Performance]: The measured concurrency value is twice as high as the calculated value in the formula, why? #14350

Open
1 task done
xwzheng1020 opened this issue Mar 6, 2025 · 0 comments
Labels
performance Performance-related issues

Comments

@xwzheng1020

Proposal to improve performance

I want to estimate the achievable concurrency from KV cache usage. I use the following calculation:

KVCache Size = 2×L×b×n×d×(Byte count per element)

where L, b, n, and d are the number of hidden layers, the concurrency value, the sequence length, and the hidden layer size, respectively.

Applying this to the following models gives a theoretical concurrency value for each:

| Model | Parameters (Billion) | Layers | Hidden size | Quantization type | Total Mem (GB) | Reserved Mem (GB) | Sequence Length | Concurrency (Theoretical) | Concurrency (Measured) |
|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5 | 28 | 1536 | fp16 | 16 | 2 | 4096 | 16.76 | 33.52 |
| DeepSeek-R1-Distill-Qwen-7B | 7 | 28 | 3584 | fp16 | 32 | 2 | 4096 | 10.45 | 20.90 |
| DeepSeek-R1-Distill-Llama-8B-q4 | 8 | 32 | 4096 | int4 | 32 | 2 | 4096 | 48.60 | 97.21 |
| DeepSeek-R1-Distill-Qwen-14B | 14 | 48 | 5120 | fp16 | 48 | 2 | 4096 | 4.80 | 9.60 |
| DeepSeek-R1-Distill-Qwen-32B | 32 | 64 | 5120 | fp16 | 96 | 4 | 4096 | 5.60 | 11.20 |
| DeepSeek-R1-Distill-Llama-70B | 70 | 80 | 8192 | fp16 | 160 | 4 | 4096 | 1.60 | 3.20 |
| DeepSeek-V3-q2 | 671 | 61 | 7168 | int2 | 220 | 4 | 4096 | 12.03 | 24.06 |
| DeepSeek-R1 | 671 | 61 | 7168 | fp8 | 1128 | 4 | 4096 | 135.79 | 271.59 |
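
For reference, here is a minimal sketch of how the theoretical column can be derived from the formula above. The memory accounting is an assumption not stated in the issue: the memory available for the KV cache is taken as total memory minus reserved memory minus the model weights (parameters × bytes per element), with all sizes treated as GiB. The function names are hypothetical. Under these assumptions the sketch reproduces the fp16 rows checked here (1.5B and 14B); the quantized rows may rely on different assumptions.

```python
GIB = 1024 ** 3  # assumption: the table's "GB" values are treated as GiB


def kv_cache_bytes_per_sequence(num_layers, seq_len, hidden_size, bytes_per_element):
    """KV cache size for ONE sequence (b = 1) using the formula from the issue.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * seq_len * hidden_size * bytes_per_element


def theoretical_concurrency(total_mem, reserved_mem, params_billion,
                            num_layers, hidden_size, seq_len, bytes_per_element):
    """Number of full-length sequences whose KV cache fits in the leftover memory.

    Assumption: weights occupy params_billion * bytes_per_element (in GiB),
    and everything that is neither reserved nor weights goes to the KV cache.
    """
    weights = params_billion * bytes_per_element
    free_bytes = (total_mem - reserved_mem - weights) * GIB
    per_seq = kv_cache_bytes_per_sequence(num_layers, seq_len, hidden_size,
                                          bytes_per_element)
    return free_bytes / per_seq


# DeepSeek-R1-Distill-Qwen-1.5B row: fp16 means 2 bytes per element
qwen_1_5b = theoretical_concurrency(
    total_mem=16, reserved_mem=2, params_billion=1.5,
    num_layers=28, hidden_size=1536, seq_len=4096, bytes_per_element=2,
)

# DeepSeek-R1-Distill-Qwen-14B row
qwen_14b = theoretical_concurrency(
    total_mem=48, reserved_mem=2, params_billion=14,
    num_layers=48, hidden_size=5120, seq_len=4096, bytes_per_element=2,
)

print(f"{qwen_1_5b:.2f}")  # matches the table's 16.76
print(f"{qwen_14b:.2f}")   # matches the table's 4.80
```

Because the formula uses the full hidden size d for both keys and values, any change to the per-token KV width scales the result proportionally, which makes d the first term to check when measured and calculated values differ by a constant factor.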

However, the concurrency I actually measured is twice the calculated value.

Where did I make a calculation error? Or does vLLM apply special optimizations?

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
xwzheng1020 added the performance (Performance-related issues) label on Mar 6, 2025