Why does a 640MB quantized model take almost 3GB of VRAM? #1229

Bakht-Ullah · 2025-03-06T04:16:44Z

I quantized the openai/whisper-medium model with the W4A16 GPTQ method, reducing it from 2.8GB to 640MB. However, when I load two of these quantized models simultaneously, they use almost 6GB of VRAM. Does anyone know why they use so much memory?
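The sizes quoted above can be sanity-checked with simple arithmetic. This is a minimal sketch under stated assumptions: about 769M parameters for whisper-medium (the published figure), 4-bit weights with one fp16 scale per group of 128 and no zero-points (an assumption for symmetric W4A16), and every weight quantized. `quantized_weight_bytes` is a hypothetical helper, not part of any library.

```python
def quantized_weight_bytes(num_params: int, bits: int = 4, group_size: int = 128,
                           scale_bits: int = 16) -> float:
    """Approximate storage for group-quantized weights.

    Each group of `group_size` weights stores `bits` per weight plus one
    `scale_bits` scale. Zero-points are assumed absent (symmetric scheme).
    """
    bits_per_group = bits * group_size + scale_bits
    return num_params * bits_per_group / (8 * group_size)


NUM_PARAMS = 769_000_000  # assumed whisper-medium parameter count

fp32_bytes = NUM_PARAMS * 4                    # full-precision checkpoint
w4_bytes = quantized_weight_bytes(NUM_PARAMS)  # if every weight were W4, G128

print(f"fp32: {fp32_bytes / 2**30:.2f} GiB")
print(f"W4A16 G128: {w4_bytes / 2**20:.0f} MiB")
```

The fp32 figure lines up with the 2.8GB original. The fully quantized estimate comes out below 640MB, which would be consistent with some tensors (for example embeddings) staying in 16-bit, though that is a guess rather than something the checkpoint confirms. Either way, the weights alone account for well under 1GB.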

```python
from vllm import LLM

# loading model
# Two independent vLLM engines, each loading the same W4A16 GPTQ checkpoint
# on the same GPU.
llm = LLM(
    model="Bakht123/whisper-medium-gptq-W4A16-G128",
    max_model_len=448,
    max_num_seqs=64,
    limit_mm_per_prompt={"audio": 1},
)

llm1 = LLM(
    model="Bakht123/whisper-medium-gptq-W4A16-G128",
    max_model_len=448,
    max_num_seqs=64,
    limit_mm_per_prompt={"audio": 1},
)
```
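One likely contributor, offered as a rough estimate rather than a diagnosis: a vLLM engine reserves GPU memory for the KV cache in addition to the weights, sizing that cache from whatever remains under `gpu_memory_utilization` (default 0.9) after loading weights and profiling activations. The sketch below estimates how much decoder self-attention KV cache it would take to hold `max_num_seqs=64` sequences of `max_model_len=448` tokens at once. The dimensions (24 decoder layers, d_model 1024, fp16 cache) are assumptions taken from the published whisper-medium configuration; cross-attention cache for the encoder output and activation buffers are ignored, and the helper names are hypothetical.

```python
def kv_cache_bytes_per_token(num_layers: int, hidden_size: int, dtype_bytes: int = 2) -> int:
    """Bytes of self-attention K and V cache per token across all decoder layers.

    Assumes standard multi-head attention, so num_heads * head_dim == hidden_size.
    """
    # Two tensors (K and V) per layer, each hidden_size wide.
    return 2 * num_layers * hidden_size * dtype_bytes


def kv_cache_bytes(num_seqs: int, seq_len: int, num_layers: int,
                   hidden_size: int, dtype_bytes: int = 2) -> int:
    """Total bytes needed to hold num_seqs full-length sequences simultaneously."""
    return num_seqs * seq_len * kv_cache_bytes_per_token(num_layers, hidden_size, dtype_bytes)


# Assumed whisper-medium decoder dims: 24 layers, d_model = 1024, fp16 cache.
per_token = kv_cache_bytes_per_token(num_layers=24, hidden_size=1024)
total = kv_cache_bytes(num_seqs=64, seq_len=448, num_layers=24, hidden_size=1024)

print(per_token)       # 98304 bytes (96 KiB) per token
print(total / 2**30)   # 2.625 GiB for 64 x 448 tokens
```

Even this partial estimate is several times the size of the 640MB weight file, so a per-engine footprint far above the checkpoint size is plausible. Passing a smaller `gpu_memory_utilization` to each `LLM(...)` is the usual way to cap how much memory each engine reserves.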

#memory utilization

dsikka added the question (Further information is requested) and vllm (Using vLLM) labels on Mar 6, 2025
