
Lazy Loading of Weights for Large Model Quantization #1216

Open
zjnyly opened this issue Mar 1, 2025 · 2 comments
Labels: enhancement (New feature or request)


zjnyly commented Mar 1, 2025

I noticed that LLM-Compressor needs to fully load the model weights onto the CPU/GPU before starting the pruning or quantization process. However, when dealing with very large models on the CPU, the pruning process becomes extremely slow and could take an impractically long time.

Would it be possible to implement a layer-by-layer pruning approach on the GPU to improve efficiency and reduce memory overhead? This would significantly speed up the process and make it more feasible for large-scale models.

zjnyly added the enhancement label Mar 1, 2025
kylesayrs (Collaborator) commented

Hi @zjnyly!

Actually, layer-by-layer GPU onloading is already the default strategy implemented by LLM Compressor. You can verify this by monitoring GPU utilization while compressing CPU-offloaded modules.

# Hessians are allocated on the module's execution device (the GPU when the
# layer is onloaded) unless offload_hessians is set, in which case they are
# kept on the CPU.
init_device = (
    "cpu" if self.offload_hessians else get_execution_device(module)
)
self._hessians[module] = make_empty_hessian(module, device=init_device)
self._num_samples[module] = 0

LLM Compressor performs compression operations on the same execution device as the module being compressed. The execution device can be controlled through the Hugging Face device map.
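As a minimal sketch (the model ID and the "auto" placement below are illustrative assumptions, not values from this thread), the execution device can be steered through the device_map argument of from_pretrained:

from transformers import AutoModelForCausalLM

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model for illustration

# device_map="auto" lets accelerate place as many layers as fit on the GPU and
# offload the remainder to CPU; each module's execution device then determines
# where its compression math runs.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)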


zjnyly commented Mar 2, 2025

offload_hessians

Hi, if the model is too large to load on a single GPU, I have to switch to the CPU. I set offload_hessians to True, but it still takes a very long time:

from transformers import AutoModelForCausalLM
from llmcompressor.modifiers.quantization import GPTQModifier

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="cpu", torch_dtype="auto",
)
recipe = [
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], offload_hessians=True),
]
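(For context, the calibration run that produces the log below would be launched with something like the following; this is only a sketch assuming llmcompressor's standard oneshot entry point, and the dataset name and sequence length are placeholders rather than values from this thread.)

from llmcompressor import oneshot  # entry point assumed; older releases expose it under llmcompressor.transformers

oneshot(
    model=model,
    dataset="open_platypus",        # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,            # placeholder
    num_calibration_samples=512,    # matches the 512 samples shown in the log below
)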
2025-03-02T14:21:04.286499+0800 | apply_compression | INFO - Running GPTQModifier calibration with 512 samples...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [00:01<00:00, 395.78it/s]
2025-03-02T14:21:05.581816+0800 | apply_compression | INFO - 
===== Compressing layer 1/32  =====
2025-03-02T14:21:05.581898+0800 | apply_compression | INFO - Calibrating model.layers.0...
  0%|▌                                                                                                                               | 2/512 [02:15<9:46:16, 68.97s/it]
