I want to estimate the concurrency that can be achieved for a given KV cache budget. I use the following formula:
KVCache Size = 2×L×b×n×d×(bytes per element)
where L, b, n and d are the number of hidden layers, the concurrency, the sequence length, and the hidden size, respectively.
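As a quick sanity check of the formula, a single sequence (b = 1) for DeepSeek-R1-Distill-Qwen-1.5B, using the values listed below for that model (L = 28, n = 4096, d = 1536) and assuming 2 bytes per element for fp16, works out to roughly:

$$
2 \times 28 \times 1 \times 4096 \times 1536 \times 2 = 704{,}643{,}072 \ \text{bytes} \approx 0.656 \ \text{GiB}
$$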
For the following models, this gives the theoretical concurrency values below:
| Model | Parameters (Billion) | Layers | Hidden size | Quantization type | Total Mem (GB) | Reserved Mem (GB) | Sequence Length | Concurrency (Theoretical) | Concurrency (Measured) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5 | 28 | 1536 | fp16 | 16 | 2 | 4096 | 16.76 | 33.52 |
| DeepSeek-R1-Distill-Qwen-7B | 7 | 28 | 3584 | fp16 | 32 | 2 | 4096 | 10.45 | 20.90 |
| DeepSeek-R1-Distill-Llama-8B-q4 | 8 | 32 | 4096 | int4 | 32 | 2 | 4096 | 48.60 | 97.21 |
| DeepSeek-R1-Distill-Qwen-14B | 14 | 48 | 5120 | fp16 | 48 | 2 | 4096 | 4.80 | 9.60 |
| DeepSeek-R1-Distill-Qwen-32B | 32 | 64 | 5120 | fp16 | 96 | 4 | 4096 | 5.60 | 11.20 |
| DeepSeek-R1-Distill-Llama-70B | 70 | 80 | 8192 | fp16 | 160 | 4 | 4096 | 1.60 | 3.20 |
| DeepSeek-V3-q2 | 671 | 61 | 7168 | int2 | 220 | 4 | 4096 | 12.03 | 24.06 |
| DeepSeek-R1 | 671 | 61 | 7168 | fp8 | 1128 | 4 | 4096 | 135.79 | 271.59 |
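The post does not say how the memory budget is split, so the following is only a sketch of one reading that appears to reproduce the theoretical values for the fp16 rows: the memory left for the KV cache is total minus reserved minus the model weights, with weights taken as parameters (in billions) times bytes per element, and with GB and GiB used interchangeably. The function names and the weight-size assumption are mine, not part of the original calculation.

```python
def kv_cache_gib_per_sequence(layers, seq_len, hidden_size, bytes_per_elem):
    """KV cache for one sequence (b = 1): 2 * L * n * d * bytes, in GiB."""
    return 2 * layers * seq_len * hidden_size * bytes_per_elem / 1024**3


def theoretical_concurrency(params_billion, layers, hidden_size, bytes_per_elem,
                            total_mem_gb, reserved_mem_gb, seq_len):
    """Number of full-length sequences that fit in the leftover memory.

    Assumption (not stated in the post): weights occupy
    params_billion * bytes_per_elem GB, and GB is not distinguished from GiB.
    """
    weights_gb = params_billion * bytes_per_elem
    kv_budget_gb = total_mem_gb - reserved_mem_gb - weights_gb
    per_seq_gib = kv_cache_gib_per_sequence(layers, seq_len, hidden_size,
                                            bytes_per_elem)
    return kv_budget_gb / per_seq_gib


if __name__ == "__main__":
    # DeepSeek-R1-Distill-Qwen-1.5B row: fp16 -> 2 bytes per element
    print(round(theoretical_concurrency(1.5, 28, 1536, 2, 16, 2, 4096), 2))  # 16.76
    # DeepSeek-R1-Distill-Llama-70B row
    print(round(theoretical_concurrency(70, 80, 8192, 2, 160, 4, 4096), 2))  # 1.6
```

Note that the formula uses the full hidden size d for both keys and values in every layer, so the result depends directly on that choice.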
However, the concurrency I actually measured is twice the calculated value.
Where is the error in my calculation? Or does vLLM apply special optimizations?
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.