[Performance]: The measured concurrency value is twice as high as the calculated value in the formula, why? #14350

xwzheng1020 · 2025-03-06T10:06:44Z

I want to estimate the achievable concurrency from KV cache usage. I use the following formula:

KV cache size = 2 × L × b × n × d × (bytes per element)

where L, b, n, and d are the number of hidden layers, the concurrency value, the sequence length, and the hidden size, respectively.
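As a minimal sketch, the formula above can be written out in Python. The function and parameter names are illustrative and not taken from vLLM; the leading factor of 2 accounts for storing both keys and values.

```python
def kv_cache_bytes(num_layers, batch, seq_len, hidden_size, bytes_per_element):
    """KV cache size = 2 * L * b * n * d * bytes_per_element (illustrative helper)."""
    return 2 * num_layers * batch * seq_len * hidden_size * bytes_per_element


# One 4096-token sequence for a 28-layer model with hidden size 1536 in fp16
# (the DeepSeek-R1-Distill-Qwen-1.5B figures used in this issue).
size = kv_cache_bytes(num_layers=28, batch=1, seq_len=4096, hidden_size=1536,
                      bytes_per_element=2)
print(size / 1024**3)  # 0.65625 (GiB per sequence)
```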

For the following models, this gives a theoretical concurrency value:

| Model | Parameters (billions) | Layers | Hidden size | Quantization type | Total mem (GB) | Reserved mem (GB) | Sequence length | Concurrency (theoretical) | Concurrency (measured) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5 | 28 | 1536 | fp16 | 16 | 2 | 4096 | 16.76 | 33.52 |
| DeepSeek-R1-Distill-Qwen-7B | 7 | 28 | 3584 | fp16 | 32 | 2 | 4096 | 10.45 | 20.90 |
| DeepSeek-R1-Distill-Llama-8B-q4 | 8 | 32 | 4096 | int4 | 32 | 2 | 4096 | 48.60 | 97.21 |
| DeepSeek-R1-Distill-Qwen-14B | 14 | 48 | 5120 | fp16 | 48 | 2 | 4096 | 4.80 | 9.60 |
| DeepSeek-R1-Distill-Qwen-32B | 32 | 64 | 5120 | fp16 | 96 | 4 | 4096 | 5.60 | 11.20 |
| DeepSeek-R1-Distill-Llama-70B | 70 | 80 | 8192 | fp16 | 160 | 4 | 4096 | 1.60 | 3.20 |
| DeepSeek-V3-q2 | 671 | 61 | 7168 | int2 | 220 | 4 | 4096 | 12.03 | 24.06 |
| DeepSeek-R1 | 671 | 61 | 7168 | fp8 | 1128 | 4 | 4096 | 135.79 | 271.59 |
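The post does not say how the theoretical column is derived from the memory figures. The sketch below is one reading that appears to reproduce the fp16 rows, under two assumptions that are not stated in the issue: the weights occupy parameters × bytes per element, and all GB figures are treated as GiB. The function name and its parameters are illustrative only.

```python
GIB = 1024 ** 3


def theoretical_concurrency(total_mem_gib, reserved_mem_gib, params_billion,
                            num_layers, hidden_size, seq_len, bytes_per_element):
    # Assumption: weight memory in GiB = parameters (billions) * bytes per element.
    weights_gib = params_billion * bytes_per_element
    # Assumption: everything left after weights and reserved memory holds KV cache.
    free_bytes = (total_mem_gib - reserved_mem_gib - weights_gib) * GIB
    # Per-sequence KV cache from the formula in this issue, with b = 1.
    per_sequence = 2 * num_layers * seq_len * hidden_size * bytes_per_element
    return free_bytes / per_sequence


# DeepSeek-R1-Distill-Qwen-1.5B row (fp16, 2 bytes per element)
print(round(theoretical_concurrency(16, 2, 1.5, 28, 1536, 4096, 2), 2))  # 16.76
# DeepSeek-R1-Distill-Qwen-7B row (fp16)
print(round(theoretical_concurrency(32, 2, 7, 28, 3584, 4096, 2), 2))    # 10.45
```

With the same assumptions, the int4 row does not come out to the value in the table, so the quantized rows may have been computed differently.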

However, the concurrency value I actually measured is twice as large as the calculated value.

Where did I make a calculation error? Or does vLLM apply special optimizations?


xwzheng1020 added the performance Performance-related issues label Mar 6, 2025
