Quantization Memory Requirements #1228

Open
sneha5gsm opened this issue Mar 5, 2025 · 0 comments
Labels: question (Further information is requested)

@sneha5gsm

Hello!

I have been trying the various quantization recipes for quantizing a 70B Llama 3 based model to FP8, INT8, and INT4 (A16) precisions, as described in the vLLM quantization docs.

  1. Could you help me understand the memory requirements of the quantization recipes, i.e. SmoothQuant (SmoothQuantModifier), GPTQ (GPTQModifier), and RTN (QuantizationModifier)? A calculation/formula would help, for example like the one we use for the KV cache:
     memory in bytes for KV cache = 80 (layers) * 8 (KV heads) * 128 (head_dim) * 8192 (seq length) * 2 (K and V) * 2 (bytes per fp16 value)

I understand that calculate_offload_device_map creates a custom device map that reserves memory for GPTQ (reserve_for_hessians), but I would still like to understand the memory requirements so that I can utilize the GPU memory efficiently, see where all the GPU memory goes, and make sure there are no bugs.
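For context, this is roughly how I am estimating things at the moment. It is only a back-of-the-envelope sketch based on my own assumptions (in particular, that GPTQ accumulates one fp32 Hessian of shape [in_features, in_features] per quantized linear projection, and that Llama-3-70B-like dimensions apply), so please correct me wherever the library's actual accounting differs:

```python
# Rough memory estimates, assuming Llama-3-70B-like dims:
# 80 layers, 8 KV heads, head_dim 128, hidden_size 8192, intermediate_size 28672.

def kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192, dtype_bytes=2):
    # K and V tensors per layer, stored in fp16 (2 bytes each)
    return layers * kv_heads * head_dim * seq_len * 2 * dtype_bytes

def gptq_hessian_bytes_per_layer(hidden=8192, intermediate=28672, dtype_bytes=4):
    # Assumption (not verified): GPTQ keeps one fp32 Hessian of shape
    # [in_features, in_features] per quantized linear projection.
    # q/k/v/o and gate/up have in_features = hidden; down has in_features = intermediate.
    attn = 4 * hidden * hidden * dtype_bytes                                  # q, k, v, o projections
    mlp = 2 * hidden * hidden * dtype_bytes + intermediate * intermediate * dtype_bytes  # gate, up, down
    return attn + mlp

print(f"KV cache (8192 ctx):      {kv_cache_bytes() / 1e9:.2f} GB")
print(f"GPTQ Hessians, one layer: {gptq_hessian_bytes_per_layer() / 1e9:.2f} GB")
```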

  2. Also, I understand that currently, for quantization of large models, the model is split across the GPUs on the instance in a pipeline-parallel fashion.
  • Since the only GPU in use at any given time is the one holding the layer currently being quantized, would the time taken to quantize the model on multiple GPUs be similar to the time taken on a single GPU?
  • Is it possible to split the model in a tensor-parallel way instead?
  • I understand that non-sequential GPTQ is deprecated, but how much memory does it require? I think the memory calculation above would help answer this. Also, how much of a speed-up would we see with the non-sequential approach compared to the sequential one? (A rough estimate of what I expect is sketched just after this list.)
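To make the last point concrete, here is the same rough estimate extended to the non-sequential case. It rests on my assumption (which may be wrong) that non-sequential GPTQ keeps the Hessians for every decoder layer resident at once, while sequential GPTQ only needs those of the layer currently being quantized:

```python
# Assumption (not verified): non-sequential GPTQ holds all layers' Hessians at once;
# sequential GPTQ only holds the Hessians of the layer currently being quantized.
layers = 80
per_layer_hessian_gb = 4.9  # from the per-layer estimate in the sketch above

print(f"sequential:     ~{per_layer_hessian_gb:.1f} GB of Hessians resident at a time")
print(f"non-sequential: ~{layers * per_layer_hessian_gb:.0f} GB of Hessians resident at once")
```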

Thank you!

@dsikka added the question (Further information is requested) label on Mar 5, 2025