W4A8 model larger than W4A16 #1215
Comments
Hi @chmeyers - can you share the config produced as well as the recipe that you applied?
Recipes were:
config.json of the W4A8 model was:
Hi @chmeyers, the variation that you're seeing is because of the compressor that is applied when saving the quantized model to disk. When doing weight-only quantization (W4A16/W8A16), we select the packed_quantized compressor. When adding in activation quantization, we select the int-quantized or float-quantized compressor. You can see further details on how the compressor is selected by referring to the docstring and functions listed here.
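A quick way to see which compressor was applied is to look at the `format` field under `quantization_config` in the saved model's config.json. A minimal sketch, assuming the key names typically written by llm-compressor/compressed-tensors and a hypothetical output directory; adjust to your layout:

```python
import json

# Inspect which compressed-tensors format was used when the model was saved.
# NOTE: the key names ("quantization_config", "format") and the directory name
# below are assumptions for illustration; check your own config.json.
with open("Llama-3.1-8B-Instruct-W4A8/config.json") as f:
    config = json.load(f)

qcfg = config.get("quantization_config", {})
print("compressor format:", qcfg.get("format"))
# Weight-only (W4A16) models are typically saved in a packed 4-bit format,
# while W4A8 models are saved in an int-quantized format that stores weights at 8 bits.
```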
Describe the bug
I ran the example w4a16 script on meta-llama/Llama-3.1-8B-Instruct with both W4A16 and W4A8 schemes, and the W4A8 model was much larger. Specifically, the W4A16 model came out to 5,700,595,200 bytes, and the W4A8 model was 9,190,252,544 bytes. (5.7GB vs 9.2GB; values taken from model.safetensors.index.json but they seem to match the size on disk)
The W4A16 model seems to be the correct size, but the W4A8 model seems to be of similar size to a W8A8 model. Maybe the weight tensors are being saved using 8 bits?
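A rough back-of-envelope check is consistent with that hypothesis. The parameter counts below are approximations for illustration (roughly 7B parameters in the quantized Linear layers and about 1B in the unquantized embedding/lm_head kept in bf16), not exact figures:

```python
# Back-of-envelope on-disk size estimate for Llama-3.1-8B.
# Parameter counts are assumptions for illustration only.
linear_params = 7.0e9     # params in quantized Linear layers
other_params = 1.05e9     # embed_tokens + lm_head, kept in bf16 (2 bytes each)

packed_4bit = linear_params * 0.5 + other_params * 2  # weights packed at 4 bits
stored_8bit = linear_params * 1.0 + other_params * 2  # weights stored as int8

print(f"4-bit packed: ~{packed_4bit / 1e9:.1f} GB")  # ~5.6 GB, close to the 5.7 GB observed
print(f"8-bit stored: ~{stored_8bit / 1e9:.1f} GB")  # ~9.1 GB, close to the 9.2 GB observed
```

The observed sizes line up much better with the weights being written at 8 bits than with a packed 4-bit layout.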
Expected behavior
W4A8 should be smaller, right?
Environment
Include all relevant environment information:
OS: Amazon Linux 2023.6.20250203
Python version: 3.10
llmcompressor==0.4.1
ML framework version(s) [e.g. torch 2.3.1]:
Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]:
compressed-tensors==0.9.0
accelerate==1.1.1
onnx==1.17.0
optimum==1.23.3
transformers==4.47.0
torch==2.5.1
vllm==0.7.0
ray==2.40.0
numpy==1.26.4
Other relevant environment information [e.g. hardware, CUDA version]:
Ran on a Ray node on an AWS g6e.24xlarge (4x L40 GPUs, but only one was used for this model).
To Reproduce
Exact steps to reproduce the behavior:
I used this example: https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py
Ran it twice: once as-is (W4A16) and once with the scheme changed to W4A8 (see the sketch below).
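For reference, the only change to the linked example is the `scheme` argument on the recipe. A minimal sketch, with the recipe line reproduced from the example as I recall it, so treat the exact arguments as an assumption:

```python
from llmcompressor.modifiers.quantization import GPTQModifier

# W4A16 run (the example as shipped):
# recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# W4A8 run -- only the scheme string changes:
recipe = GPTQModifier(targets="Linear", scheme="W4A8", ignore=["lm_head"])
```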
Errors
N/A