
Loading Quantized Model on 2 GPUs #137

Open
mxtsai opened this issue Dec 26, 2024 · 6 comments

mxtsai commented Dec 26, 2024

Hi,

Is it possible to load a quantized model onto 2 GPUs?

I tried loading a quantized model like this:

model = AutoHQQHFModel.from_quantized(save_dir, compute_dtype=torch.bfloat16, device=[torch.device('cuda:0'), torch.device('cuda:1')])

But the model only loads onto the 1st GPU.

Thanks!

mobicham (Collaborator) commented

I think multi-GPU via the hqq lib is only supported when you quantize on-the-fly, meaning via AutoHQQHFModel.quantize_model with device=['cuda:0','cuda:1']: https://github.com/mobiusml/hqq/blob/master/hqq/models/base.py#L266
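
Roughly, something like this (untested sketch; the model id and quant settings are placeholders, not from this thread):

```python
# Sketch: quantize on-the-fly with hqq and shard the layers across 2 GPUs.
# Model id and quantization settings are placeholders.
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
AutoHQQHFModel.quantize_model(
    model,
    quant_config=quant_config,
    compute_dtype=torch.bfloat16,
    device=['cuda:0', 'cuda:1'],  # layers get split across both GPUs
)
```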

Otherwise, you can simply do it directly via transformers.
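
For the transformers route, a rough sketch (quant settings are placeholders; device_map="auto" spreads the layers over the available GPUs):

```python
# Sketch: quantize with HQQ through transformers and let device_map="auto"
# place the layers across the available GPUs. Settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    torch_dtype=torch.bfloat16,
    device_map="auto",                   # shards across cuda:0 / cuda:1
    quantization_config=quant_config,
)
```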

hqq's multi-GPU implementation is a bit faster than HF's, though: HF relies on accelerate, whereas hqq doesn't use any of that and instead minimizes the number of transfers across layers.

The other option is to use gpt-fast and hqq via torchao, but they don't support all models; I think only Llama-like models are covered.

mxtsai (Author) commented Dec 26, 2024

Thanks for the response. Do you happen to know of any examples (or similar ones) that show how to use gpt-fast and hqq via torchao?

mobicham (Collaborator) commented Dec 26, 2024

Or you could simply use VLLM with HQQ, probably the best option for serving. You can use this script:
https://gist.github.com/mobicham/6efb1f7af3bf5b24fdc88f1edbcacd9a
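
For the serving side, a minimal sketch of loading a saved HQQ model with vLLM across 2 GPUs (path and parameters are placeholders, not the gist's exact contents; depending on the vLLM version you may also need to pass quantization="hqq" explicitly):

```python
# Sketch: load an HQQ-quantized model directory with vLLM, tensor-parallel over 2 GPUs.
# The path and generation settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/hqq_quantized_model",  # directory produced by the quantize+save script
    tensor_parallel_size=2,               # split across 2 GPUs
    dtype="bfloat16",
    # quantization="hqq",                 # may be required explicitly on some vLLM versions
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```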

mxtsai (Author) commented Dec 27, 2024

Thank you for the links. From what I understand, it seems like gpt-fast and torchao only support single-device quantization + inference? They provide methods to compile the model, but I'm not sure whether these are applicable to multi-GPU model inference (i.e., tensor parallel).

If I were to use VLLM with HQQ, is there support for HQQ models in the OpenAI-compatible server API? Or is the current support only available when running Huggingface Transformers -> VLLM (as in your script)?

Btw, I see that hqq is one of the --quantization parameters in the VLLM OpenAI-compatible server (link), but I can't seem to find any script that converts an existing HF model into the format that can be used for serving in VLLM. The only model I found that works is this Llama 8B model, but no scripts are provided to show how it was generated.

mobicham (Collaborator) commented

The script I shared with you takes an unquantized huggingface model -> quantizes it with hqq -> saves it. Then you can load that model in VLLM directly. That should work for any huggingface model.
