How can I just quantize a matrix W? #149

Open
bxren opened this issue Feb 11, 2025 · 7 comments

bxren commented Feb 11, 2025

I am trying to apply hqq to fields other than LLMs or deep learning, such as approximate nearest neighbor search. I want to quantize an n-by-d matrix whose elements are FP32 into an n-by-d matrix whose elements are 4-bit/8-bit integers. How can I use the hqq API to achieve this? Thanks a lot!

mobicham (Collaborator) commented Feb 11, 2025

Hey, just follow the basic usage section:

import torch
from hqq.core.quantize import *

# Quantization settings
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=0) # use axis=0 for better quality

# Wrap your matrix in a temporary linear layer
tmp_linear_layer = torch.nn.Linear(W.shape[1], W.shape[0], bias=False)
tmp_linear_layer.weight.data = W # your matrix

# Quantize the layer
hqq_layer = HQQLinear(tmp_linear_layer, # torch.nn.Linear or None
                      quant_config=quant_config, # quantization configuration
                      compute_dtype=W.dtype, # compute dtype
                      device='cuda', # cuda device
                      initialize=True, # use False to quantize later
                      del_orig=True # if True, delete the original layer
                      )

W_r = hqq_layer.dequantize() # reconstructed matrix

You can also do the following:

hqq_layer = HQQLinear.from_weights(W, bias=None, quant_config=quant_config, compute_dtype=W.dtype, device="cuda")
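
To sanity-check the reconstruction, you can compare the dequantized matrix against the original (a minimal sketch; the random matrix and sizes are just illustrative):

import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

W = torch.randn(1024, 128, dtype=torch.float16) # illustrative n-by-d matrix
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=0)

hqq_layer = HQQLinear.from_weights(W, bias=None, quant_config=quant_config, compute_dtype=W.dtype, device="cuda")
W_r = hqq_layer.dequantize() # reconstructed n-by-d matrix on the GPU
print((W.cuda() - W_r).abs().mean()) # mean absolute quantization error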

bxren (Author) commented Feb 11, 2025

I got it, thank you! I still have a question: could you tell me how to use the parameters s and z generated from W (n-by-d) to quantize another query vector q (1-by-d)?

mobicham (Collaborator) commented:

I am not sure I understand; s and z are specific to W, so quantizing q would require re-quantizing. Can you provide a detailed example?

bxren (Author) commented Feb 11, 2025

W can be seen as n d-dimensional base vectors, and q is a d-dimensional query vector. In ANNS (approximate nearest neighbor search), we need to compute the Euclidean distance between q and some of the base vectors and find the closest k vectors. If we apply the same transform parameters s and z to q as to W, we can compute the distances between q_q and W_q directly, which would match the distances between q and W. That is why I want the s and z specific to W. I am not sure if I made it clear.

mobicham (Collaborator) commented Feb 11, 2025

You can't use the same s and z; these are quantization parameters specific to each data group, as defined by group_size.
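
To make that concrete, here is a generic sketch of group-wise affine quantization (illustrative only, not hqq's exact internals): each group of group_size elements gets its own (s, z), so the parameters are tied to W's per-group statistics and have nothing to match in an unrelated vector q.

import torch

W = torch.randn(1024, 128)
group_size = 64
Wg = W.reshape(-1, group_size) # one row per group of 64 values
w_min = Wg.min(dim=1, keepdim=True).values
w_max = Wg.max(dim=1, keepdim=True).values
s = 15.0 / (w_max - w_min) # 4-bit range: 0..15
z = -w_min * s
W_q = torch.clamp(torch.round(Wg * s + z), 0, 15) # quantized groups
print(s.shape) # torch.Size([2048, 1]): one (s, z) pair per group of W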

What you can do instead is use the dot-product directly with the quantized weights via the HQQLinear module. For example, if you are calculating the cosine distance via the dot-product between W and q, you can quantize W in chunks (whichever sizes fit in VRAM), and for each chunk W_i quantized in hqq_linear[i], compute hqq_linear[i](q) -> distance_scores_i.

Note that, by convention, hqq_linear[i](x) actually computes x @ dequantized().T, so you should transpose before quantizing if you want x @ W.

By default, HQQLinear will dequantize first then do the dot-product, but you can use other backends like torchao_int4 that will do a fused dot-product without dequantization.
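
Putting this together, here is a minimal sketch of the chunked dot-product scoring (the chunk size, shapes, and k are illustrative):

import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=0)

n, d = 100_000, 128
W = torch.randn(n, d, dtype=torch.float16) # base vectors as rows
q = torch.randn(1, d, dtype=torch.float16, device='cuda') # query vector

# Quantize W chunk by chunk (pick a chunk size that fits in VRAM).
# Since the base vectors are the rows of each chunk, hqq_chunks[i](q) computes
# q @ W_i.T, i.e. one dot-product score per base vector; no transpose needed here.
chunk_size = 25_000
hqq_chunks = [HQQLinear.from_weights(W[i:i + chunk_size], bias=None,
                                     quant_config=quant_config,
                                     compute_dtype=W.dtype, device='cuda')
              for i in range(0, n, chunk_size)]

scores = torch.cat([layer(q) for layer in hqq_chunks], dim=1) # (1, n)
topk = scores.topk(k=10, dim=1) # most similar base vectors by dot-product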

bxren (Author) commented Feb 11, 2025

Thanks a lot! I really appreciate your help.

mobicham (Collaborator) commented:

Happy to help. If you have a toy example with code, I can help you out.
