Loading Quantized Model on 2 GPUs #137
I think multi-GPU via the hqq lib is only supported when you quantize on-the-fly. Otherwise, you can simply do it directly via transformers. hqq's multi-GPU implementation is a bit faster than HF's, though, because HF uses accelerate, while I don't use any of that and instead minimize the number of transfers across layers. The other option is to use gpt-fast and hqq via torchao, but they don't support all models, I think only Llama-like models.
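To make those two paths concrete, here is a minimal sketch, assuming a hypothetical Llama-style model id and 4-bit settings. The list-of-devices argument to quantize_model is an assumption mirroring the from_quantized() call in this issue, and the exact keyword names may differ in your installed hqq version.

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical example model

# Option 1: quantize on-the-fly with the hqq lib, spreading layers over 2 GPUs.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
AutoHQQHFModel.quantize_model(
    model,
    quant_config=BaseQuantizeConfig(nbits=4, group_size=64),
    compute_dtype=torch.bfloat16,
    device=["cuda:0", "cuda:1"],  # assumed multi-GPU form, mirroring from_quantized() in this issue
)

# Option 2: quantize directly via transformers; device_map="auto" lets accelerate
# shard the layers across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=HqqConfig(nbits=4, group_size=64),
)
```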
Thanks for the response. Do you happen to know of any examples (or similar examples) that show how to use gpt-fast and hqq via torchao?
I think it's this for gpt-fast:
Or you could simply use VLLM with HQQ, probably the best option for serving. You can use this script:
Thank you for the links. If I were to use VLLM with HQQ, is there support for HQQ models in the OpenAI-compatible server API? Or is the current support only available when running from Huggingface Transformers -> VLLM (as in your script)? Btw, I see that
The script I shared with you takes an unquantized huggingface model -> quantizes it with hqq -> saves it. Then you can load that model in VLLM directly. That should work for any huggingface model.
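A rough sketch of that flow (not the actual script linked above): the model id, quantization settings, and output path are placeholders, and it assumes, as the comment states, that vLLM can load the saved HQQ folder directly; tensor_parallel_size=2 is just one way to spread the served model over 2 GPUs.

```python
import torch
from transformers import AutoModelForCausalLM

from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

model_id = "meta-llama/Llama-2-7b-hf"   # placeholder HF model
save_dir = "llama2-7b-hqq-4bit"         # placeholder output folder

# 1) Load the unquantized huggingface model.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# 2) Quantize it with hqq and save the result to disk.
AutoHQQHFModel.quantize_model(
    model,
    quant_config=BaseQuantizeConfig(nbits=4, group_size=64),
    compute_dtype=torch.bfloat16,
    device="cuda:0",
)
AutoHQQHFModel.save_quantized(model, save_dir)

# 3) Serve the saved folder with vLLM, splitting it over 2 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(model=save_dir, tensor_parallel_size=2)
out = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```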
Hi,
Is it possible to load a quantized model onto 2 GPUs?
I tried loading a quantized model like this:
model = AutoHQQHFModel.from_quantized(save_dir, compute_dtype=torch.bfloat16, device=[torch.device('cuda:0'), torch.device('cuda:1')])
But the model only loads onto the 1st GPU.
Thanks!