Merging problem #35
Comments
Hi, you can use this version of the AutoGPTQ code: https://github.com/xxw11/AutoGPTQ_QALoRA. We've adjusted the original quantization process so that the quantized qzeros are stored in floating-point format. If you encounter any issues, please feel free to reach out.
Thanks for replying, but is it possible to make the new matrix B an integer matrix?
@xxw11 Hello, I believe that storing qzeros in floating-point format does not improve hardware inference performance. During dequantization, the weights have already been quantized to INT4 and need to be converted to fp16 before they can be combined with the qzeros. Unless the qzeros are themselves quantized to INT4, the paper seems to have overlooked this step. @yuhuixu1993 What do you think?
Hi @LuletterSoul, inference efficiency is not related to the format of the qzeros. The insight of our paper is that our tuned models are still in INT4, which can be inferenced efficiently, while the tuned models of other LoRA-based methods are in fp16. Even though weight-only quantization needs to be dequantized during inference, it is still faster than fp16 models because of the so-called memory-I/O bottleneck of LLMs. Besides, kernels such as Marlin have been released that make weight-only quantization much faster.
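For a rough sense of the memory-I/O argument above, here is a back-of-the-envelope sketch; the 7B parameter count and the bytes-per-weight figures are illustrative assumptions, not numbers from the paper:

```python
# Back-of-the-envelope: bytes of weights that must be streamed for one
# full forward pass, fp16 vs. 4-bit weight-only quantization (illustrative).
params = 7e9                 # assumed 7B-parameter LLaMA-style model
fp16_bytes = params * 2      # 2 bytes per weight in fp16
int4_bytes = params * 0.5    # 4 bits per weight, ignoring scale/zero overhead

print(f"fp16 weights: {fp16_bytes / 1e9:.1f} GB")
print(f"int4 weights: {int4_bytes / 1e9:.1f} GB")
print(f"ratio       : {fp16_bytes / int4_bytes:.0f}x less data to move per token")
```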
@yuhuixu1993
@LuletterSoul, weight-only quantization needs to be dequantized during inference regardless of whether the qzeros are int or float. By the way, the scales of the quantized weights are float; in the original GPTQ code, the zeros are also float.
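To make the dequantization step both sides are referring to concrete, here is a minimal sketch of group-wise weight-only dequantization. The tensor names (`q`, `scales`, `zeros`) are illustrative and ignore AutoGPTQ's packed storage format; the point is only that the scales are already float, so float zeros do not change the structure of the computation:

```python
import torch

def dequantize(q: torch.Tensor, scales: torch.Tensor, zeros: torch.Tensor) -> torch.Tensor:
    """Minimal group-wise dequantization sketch: w_fp16 = scale * (q - zero).

    q      : unpacked integer codes in [0, 15]
    scales : floating-point per-(output, group) scales, broadcast to q's shape
    zeros  : per-(output, group) zero points; integer-valued in vanilla GPTQ,
             floating-point after the QA-LoRA merge discussed in this thread
    """
    return scales * (q.to(scales.dtype) - zeros.to(scales.dtype))

# Toy example: one output row, group size 4
q = torch.tensor([[3, 7, 12, 0]], dtype=torch.int32)
s = torch.full((1, 4), 0.05, dtype=torch.float16)
z_int = torch.full((1, 4), 8.0, dtype=torch.float16)    # integer-valued zero
z_float = torch.full((1, 4), 8.37, dtype=torch.float16)  # merged, non-integer zero

print(dequantize(q, s, z_int))
print(dequantize(q, s, z_float))
```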
@xxw11 Thank you for sharing the related code. I have two questions about it:
1. Changes to the CUDA files
2. Modification of qlinear_cuda.py
Hello, this repository primarily focuses on modifying the types in the GPTQ algorithm. The original QA-LoRA code path doesn't involve these CUDA files, so they weren't modified. For a comprehensive modification, changes would need to be made to the forward passes of all three backends: CUDA, Triton, and PyTorch.
@xxw11 Thank you for your response. However, if I fine-tune with QA-LoRA and then merge into the qzeros for model inference, wouldn't I also need to modify the CUDA files called by the forward function in GPTQ? Could I get some guidance on how to perform inference with the merged parameters?
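For reference, loading a GPTQ-quantized checkpoint for inference with AutoGPTQ generally looks like the sketch below. The checkpoint path is a placeholder, and whether the merged floating-point qzeros work end-to-end still depends on the kernel path that gets selected, which is exactly the open question above:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "path/to/merged-qalora-llama-7b-4bit"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
# Whichever backend ends up running (CUDA, Triton, or plain PyTorch) must
# have been adapted to accept floating-point qzeros, as discussed above.
model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0", use_triton=False)

inputs = tokenizer("Hello, QA-LoRA!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```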
@xxw11 I have another question. If I use the version of AutoGPTQ that you shared, do I also need to apply the changes to the auto-gptq installed in my Python environment in order to perform quantization? The repository says:

> Change the peft_utils.py in your own auto_gptq path (python path/auto_gptq/utils/peft_utils.py) with the new one. For the users of [GPTQLORA](https://github.com/qwopqwop200/gptqlora), you only need to change the peft_utils.py file.

When I tried to quantize the LLaMA-7B model with AutoGPTQ_QALoRA, the error below occurred and the quantization did not proceed:

```
2024-10-29 15:29:34 INFO [auto_gptq.modeling._base] Start quantizing layer 1/32
2024-10-29 15:29:35 INFO [auto_gptq.modeling._base] Quantizing self_attn.k_proj in layer 1/32...
2024-10-29 15:29:36 INFO [auto_gptq.quantization.gptq] duration: 1.2955830097198486
2024-10-29 15:29:36 INFO [auto_gptq.quantization.gptq] avg loss: 765204325466112.0
2024-10-29 15:29:36 INFO [auto_gptq.modeling._base] Quantizing self_attn.v_proj in layer 1/32...
2024-10-29 15:29:38 INFO [auto_gptq.quantization.gptq] duration: 1.7773809432983398
2024-10-29 15:29:38 INFO [auto_gptq.quantization.gptq] avg loss: 766996199243776.0
2024-10-29 15:29:38 INFO [auto_gptq.modeling._base] Quantizing self_attn.q_proj in layer 1/32...
2024-10-29 15:29:40 INFO [auto_gptq.quantization.gptq] duration: 1.6986031532287598
2024-10-29 15:29:40 INFO [auto_gptq.quantization.gptq] avg loss: 764688862281728.0
2024-10-29 15:29:40 INFO [auto_gptq.modeling._base] Quantizing self_attn.o_proj in layer 1/32...
Traceback (most recent call last):
  File "quant_with_alpaca.py", line 178, in <module>
    main()
  File "quant_with_alpaca.py", line 121, in main
    model.quantize(
  File "/home/***/.conda/envs/qalora/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/***/.conda/envs/qalora/lib/python3.8/site-packages/auto_gptq/modeling/_base.py", line 361, in quantize
    scale, zero, g_idx = gptq[name].fasterquant(
  File "/home/***/.conda/envs/qalora/lib/python3.8/site-packages/auto_gptq/quantization/gptq.py", line 94, in fasterquant
    H = torch.linalg.cholesky(H)
torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 1 is not positive-definite).
```
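For context on the Cholesky error: the public GPTQ/AutoGPTQ implementations dampen the Hessian diagonal (the `percdamp` step) before this factorization, so a failure here, together with the enormous average losses above, may point at the calibration data or the Hessian accumulation rather than the factorization itself. Below is a simplified sketch of that dampening step, based on the open-source GPTQ algorithm rather than this fork's exact code:

```python
import torch

def damped_cholesky(H: torch.Tensor, percdamp: float = 0.01) -> torch.Tensor:
    """Simplified sketch of GPTQ's Hessian dampening before Cholesky.

    Adding percdamp * mean(diag(H)) to the diagonal usually makes a nearly
    singular Hessian factorizable; if H contains zero rows (dead input
    channels) or NaNs from a broken calibration pass, the factorization
    can still fail with the error shown above.
    """
    H = H.clone()
    damp = percdamp * torch.mean(torch.diag(H))
    idx = torch.arange(H.shape[0], device=H.device)
    H[idx, idx] += damp
    return torch.linalg.cholesky(H)
```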
I'm very confused about the merging step. In Appendix B the proof is solid; however, there is no guarantee that the new matrix B is in integer format. In standard linear quantization, zeros are represented by integers, so you can't force the qzeros to be a floating-point matrix. If I misunderstood, how do you do it? Thanks
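To make the question concrete, here is one reading of the merge in Appendix B (the shapes, the group-size scaling, and the variable names are illustrative, not code from the paper or this repository): because QA-LoRA pools the input over each quantization group, the LoRA product B·A contributes one value per (output row, input group), and it can be folded into the zero points as z' = z - (B·A)/s. Nothing in that expression keeps z' integer, which appears to be why the shared AutoGPTQ fork stores the qzeros in floating point.

```python
import torch

torch.manual_seed(0)
out_features, in_features, group_size, rank = 4, 8, 4, 2
n_groups = in_features // group_size

# Group-wise quantization parameters (one scale/zero per output row per group).
q = torch.randint(0, 16, (out_features, in_features)).float()   # int4 codes
scale = torch.rand(out_features, n_groups) * 0.1 + 0.01          # float scales
zero = torch.randint(0, 16, (out_features, n_groups)).float()    # integer zeros

# LoRA factors. With group-wise input pooling, the effective weight update
# collapses to one value per (output row, input group).
B = torch.randn(out_features, rank)
A = torch.randn(rank, n_groups)
delta = (B @ A) / group_size                                      # (out, n_groups)

# Folding the update into the zeros:
#   s * (q - z) + delta  ==  s * (q - (z - delta / s))
new_zero = zero - delta / scale
print(new_zero)   # generally non-integer, hence floating-point qzeros
```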