
Lazy Loading of Weights for Large Model Quantization #1216

Open
zjnyly opened this issue Mar 1, 2025 · 2 comments
Labels: enhancement (New feature or request)


zjnyly commented Mar 1, 2025

I noticed that LLM-Compressor needs to fully load the model weights onto the CPU/GPU before starting the pruning or quantization process. However, when dealing with very large models on the CPU, the pruning process becomes extremely slow and could take an impractically long time.

Would it be possible to implement a layer-by-layer pruning approach on the GPU to improve efficiency and reduce memory overhead? This would significantly speed up the process and make it more feasible for large-scale models.

zjnyly added the enhancement label Mar 1, 2025
kylesayrs (Collaborator) commented

Hi @zjnyly!

Actually, layer-by-layer GPU onloading is already the default strategy implemented by LLM Compressor. You can verify this by monitoring GPU utilization while compressing CPU-offloaded modules.

# Hessians are allocated on the module's execution device (the GPU when the
# layer is onloaded) unless offload_hessians is set, in which case they are
# kept on the CPU.
init_device = (
    "cpu" if self.offload_hessians else get_execution_device(module)
)
self._hessians[module] = make_empty_hessian(module, device=init_device)
self._num_samples[module] = 0

LLM Compressor performs compression operations on the same execution device as the module being compressed. The execution device can be controlled through the Hugging Face device map.
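As a minimal sketch (the model ID and the "auto" placement below are illustrative assumptions, not values from this thread), the execution device can be steered through the device_map argument of from_pretrained:

from transformers import AutoModelForCausalLM

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model for illustration

# device_map="auto" lets accelerate place as many layers as fit on the GPU and
# offload the remainder to CPU; each module's execution device then determines
# where its compression math runs.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)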


zjnyly commented Mar 2, 2025

offload_hessians

Hi, if the model is too large to load on a single GPU, I have to switch to the CPU. I set offload_hessians to True, but it still takes a very long time:

from transformers import AutoModelForCausalLM
from llmcompressor.modifiers.quantization import GPTQModifier

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="cpu", torch_dtype="auto",
)
recipe = [
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], offload_hessians=True),
]
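(For context, the calibration run that produces the log below would be launched with something like the following; this is only a sketch assuming llmcompressor's standard oneshot entry point, and the dataset name and sequence length are placeholders rather than values from this thread.)

from llmcompressor import oneshot  # entry point assumed; older releases expose it under llmcompressor.transformers

oneshot(
    model=model,
    dataset="open_platypus",        # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,            # placeholder
    num_calibration_samples=512,    # matches the 512 samples shown in the log below
)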
2025-03-02T14:21:04.286499+0800 | apply_compression | INFO - Running GPTQModifier calibration with 512 samples...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [00:01<00:00, 395.78it/s]
2025-03-02T14:21:05.581816+0800 | apply_compression | INFO - 
===== Compressing layer 1/32  =====
2025-03-02T14:21:05.581898+0800 | apply_compression | INFO - Calibrating model.layers.0...
  0%|▌                                                                                                                               | 2/512 [02:15<9:46:16, 68.97s/it]
