Merging problem #35
Comments
Hi, you can use this version of the AutoGPTQ code: https://github.com/xxw11/AutoGPTQ_QALoRA. We've adjusted the original quantization process so that the quantized qzeros are stored in floating-point format. If you encounter any issues, please feel free to reach out.
Thanks for replying, but is it possible to make the new matrix B an integer matrix?
@xxw11 Hello, I believe that storing qzeros in floating-point format does not improve hardware inference performance. During dequantization, the weights have already been quantized to INT4 and need to be converted to fp16 before they can be combined with the qzeros. Unless the qzeros are themselves quantized to INT4, the paper seems to have overlooked this step. @yuhuixu1993 What do you think?
Hi @LuletterSoul, inference efficiency is not related to the format of the qzeros. The insight of our paper is that our tuned models are still in INT4, which can be inferenced efficiently, while the tuned models of other LoRA-based methods are in fp16. Even though weight-only quantization needs to be dequantized during inference, it is still faster than fp16 models because of the so-called memory-I/O bottleneck of LLMs. Besides, kernels such as Marlin have been released that make weight-only quantization much faster.
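For a rough sense of the memory-I/O argument above, here is a back-of-the-envelope sketch; the 7B parameter count and the bytes-per-weight figures are illustrative assumptions, not numbers from the paper:

```python
# Back-of-the-envelope: bytes of weights that must be streamed for one
# full forward pass, fp16 vs. 4-bit weight-only quantization (illustrative).
params = 7e9                 # assumed 7B-parameter LLaMA-style model
fp16_bytes = params * 2      # 2 bytes per weight in fp16
int4_bytes = params * 0.5    # 4 bits per weight, ignoring scale/zero overhead

print(f"fp16 weights: {fp16_bytes / 1e9:.1f} GB")
print(f"int4 weights: {int4_bytes / 1e9:.1f} GB")
print(f"ratio       : {fp16_bytes / int4_bytes:.0f}x less data to move per token")
```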
@yuhuixu1993
@LuletterSoul, weight-only quantization needs to be dequantized during inference regardless of whether the qzeros are int or float. By the way, the scales of the quantized weights are float; in the original GPTQ code, the zeros are also float.
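To make the dequantization step both sides are referring to concrete, here is a minimal sketch of group-wise weight-only dequantization. The tensor names (`q`, `scales`, `zeros`) are illustrative and ignore AutoGPTQ's packed storage format; the point is only that the scales are already float, so float zeros do not change the structure of the computation:

```python
import torch

def dequantize(q: torch.Tensor, scales: torch.Tensor, zeros: torch.Tensor) -> torch.Tensor:
    """Minimal group-wise dequantization sketch: w_fp16 = scale * (q - zero).

    q      : unpacked integer codes in [0, 15]
    scales : floating-point per-(output, group) scales, broadcast to q's shape
    zeros  : per-(output, group) zero points; integer-valued in vanilla GPTQ,
             floating-point after the QA-LoRA merge discussed in this thread
    """
    return scales * (q.to(scales.dtype) - zeros.to(scales.dtype))

# Toy example: one output row, group size 4
q = torch.tensor([[3, 7, 12, 0]], dtype=torch.int32)
s = torch.full((1, 4), 0.05, dtype=torch.float16)
z_int = torch.full((1, 4), 8.0, dtype=torch.float16)    # integer-valued zero
z_float = torch.full((1, 4), 8.37, dtype=torch.float16)  # merged, non-integer zero

print(dequantize(q, s, z_int))
print(dequantize(q, s, z_float))
```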
@xxw11 Thank you for sharing the related code. I have two questions about it:
1. Changes to the CUDA files
2. Modification of qlinear_cuda.py
Hello, this repository primarily focuses on modifying the types in the GPTQ algorithm. The original QA-LoRA code path doesn't involve these CUDA files, so they weren't modified. For a comprehensive modification, changes would need to be made to the forward passes of all three backends: CUDA, Triton, and PyTorch.
@xxw11 Thank you for your response. However, if I fine-tune with QA-LoRA and then merge into the qzeros for model inference, wouldn't I also need to modify the CUDA files called by the forward function in GPTQ? Could I get some guidance on how to perform inference with the merged parameters?
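For reference, loading a GPTQ-quantized checkpoint for inference with AutoGPTQ generally looks like the sketch below. The checkpoint path is a placeholder, and whether the merged floating-point qzeros work end-to-end still depends on the kernel path that gets selected, which is exactly the open question above:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "path/to/merged-qalora-llama-7b-4bit"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
# Whichever backend ends up running (CUDA, Triton, or plain PyTorch) must
# have been adapted to accept floating-point qzeros, as discussed above.
model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0", use_triton=False)

inputs = tokenizer("Hello, QA-LoRA!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```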
@xxw11 I have another question. If I use the version of AutoGPTQ that you shared, do I also need to apply the changes to the auto-gptq installed in my Python environment in order to perform quantization? The repository says:

> Change the peft_utils.py in your own auto_gptq path (python path/auto_gptq/utils/peft_utils.py) with the new one. For the users of [GPTQLORA](https://github.com/qwopqwop200/gptqlora), you only need to change the peft_utils.py file.

When I tried to quantize the LLaMA-7B model with AutoGPTQ_QALoRA, the error below occurred and the quantization did not proceed:

```
2024-10-29 15:29:34 INFO [auto_gptq.modeling._base] Start quantizing layer 1/32
2024-10-29 15:29:35 INFO [auto_gptq.modeling._base] Quantizing self_attn.k_proj in layer 1/32...
2024-10-29 15:29:36 INFO [auto_gptq.quantization.gptq] duration: 1.2955830097198486
2024-10-29 15:29:36 INFO [auto_gptq.quantization.gptq] avg loss: 765204325466112.0
2024-10-29 15:29:36 INFO [auto_gptq.modeling._base] Quantizing self_attn.v_proj in layer 1/32...
2024-10-29 15:29:38 INFO [auto_gptq.quantization.gptq] duration: 1.7773809432983398
2024-10-29 15:29:38 INFO [auto_gptq.quantization.gptq] avg loss: 766996199243776.0
2024-10-29 15:29:38 INFO [auto_gptq.modeling._base] Quantizing self_attn.q_proj in layer 1/32...
2024-10-29 15:29:40 INFO [auto_gptq.quantization.gptq] duration: 1.6986031532287598
2024-10-29 15:29:40 INFO [auto_gptq.quantization.gptq] avg loss: 764688862281728.0
2024-10-29 15:29:40 INFO [auto_gptq.modeling._base] Quantizing self_attn.o_proj in layer 1/32...
Traceback (most recent call last):
  File "quant_with_alpaca.py", line 178, in <module>
    main()
  File "quant_with_alpaca.py", line 121, in main
    model.quantize(
  File "/home/***/.conda/envs/qalora/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/***/.conda/envs/qalora/lib/python3.8/site-packages/auto_gptq/modeling/_base.py", line 361, in quantize
    scale, zero, g_idx = gptq[name].fasterquant(
  File "/home/***/.conda/envs/qalora/lib/python3.8/site-packages/auto_gptq/quantization/gptq.py", line 94, in fasterquant
    H = torch.linalg.cholesky(H)
torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 1 is not positive-definite).
```
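For context on the Cholesky error: the public GPTQ/AutoGPTQ implementations dampen the Hessian diagonal (the `percdamp` step) before this factorization, so a failure here, together with the enormous average losses above, may point at the calibration data or the Hessian accumulation rather than the factorization itself. Below is a simplified sketch of that dampening step, based on the open-source GPTQ algorithm rather than this fork's exact code:

```python
import torch

def damped_cholesky(H: torch.Tensor, percdamp: float = 0.01) -> torch.Tensor:
    """Simplified sketch of GPTQ's Hessian dampening before Cholesky.

    Adding percdamp * mean(diag(H)) to the diagonal usually makes a nearly
    singular Hessian factorizable; if H contains zero rows (dead input
    channels) or NaNs from a broken calibration pass, the factorization
    can still fail with the error shown above.
    """
    H = H.clone()
    damp = percdamp * torch.mean(torch.diag(H))
    idx = torch.arange(H.shape[0], device=H.device)
    H[idx, idx] += damp
    return torch.linalg.cholesky(H)
```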
I'm very confused about the merging step. In Appendix B the proof is solid; however, there is no guarantee that the new matrix B is in integer format. In standard linear quantization, zeros are represented by integers, so you can't force the qzeros to be a floating-point matrix. If I misunderstood, how do you do it? Thanks
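To make the question concrete, here is one reading of the merge in Appendix B (the shapes, the group-size scaling, and the variable names are illustrative, not code from the paper or this repository): because QA-LoRA pools the input over each quantization group, the LoRA product B·A contributes one value per (output row, input group), and it can be folded into the zero points as z' = z - (B·A)/s. Nothing in that expression keeps z' integer, which appears to be why the shared AutoGPTQ fork stores the qzeros in floating point.

```python
import torch

torch.manual_seed(0)
out_features, in_features, group_size, rank = 4, 8, 4, 2
n_groups = in_features // group_size

# Group-wise quantization parameters (one scale/zero per output row per group).
q = torch.randint(0, 16, (out_features, in_features)).float()   # int4 codes
scale = torch.rand(out_features, n_groups) * 0.1 + 0.01          # float scales
zero = torch.randint(0, 16, (out_features, n_groups)).float()    # integer zeros

# LoRA factors. With group-wise input pooling, the effective weight update
# collapses to one value per (output row, input group).
B = torch.randn(out_features, rank)
A = torch.randn(rank, n_groups)
delta = (B @ A) / group_size                                      # (out, n_groups)

# Folding the update into the zeros:
#   s * (q - z) + delta  ==  s * (q - (z - delta / s))
new_zero = zero - delta / scale
print(new_zero)   # generally non-integer, hence floating-point qzeros
```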