I noticed that LLM-Compressor needs to fully load the model weights onto the CPU/GPU before starting the pruning or quantization process. However, for very large models kept on the CPU, pruning becomes extremely slow, often taking an impractically long time.
Would it be possible to implement a layer-by-layer pruning approach on the GPU to improve efficiency and reduce memory overhead? This would significantly speed up the process and make it more feasible for large-scale models.
Actually, layer-by-layer GPU onloading is already the default strategy implemented by LLM Compressor. You can verify this by monitoring GPU utilization while compressing CPU-offloaded modules.
LLM Compressor performs compression operations on the same execution device as the module being compressed. The execution device can be controlled via the HF device map.
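As a minimal sketch of the workflow described above (not taken from this issue, and assuming the `oneshot` entrypoint and `SparseGPTModifier` exposed by the library, plus a placeholder model ID and calibration dataset), loading the model with an HF device map and then running one-shot pruning looks roughly like this:

```python
# Sketch only: exact modifier arguments and dataset names may differ by version.
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # placeholder model

# device_map="auto" lets accelerate place layers across GPU and CPU;
# CPU-offloaded layers are onloaded to the GPU one at a time during compression.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)

# 50% unstructured sparsity on Linear layers, leaving the LM head dense.
recipe = SparseGPTModifier(sparsity=0.5, targets="Linear", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",  # assumed calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

With this setup you should see GPU utilization rise and fall as each layer is moved to the GPU, compressed, and offloaded again, which is the layer-by-layer behavior being asked about.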