OOM during save_pretrained of compressed model #1183
Labels: bug
Comments
Can you share the recipe you're applying that is showing the memory spike?

Any progress on this? I've run into the same issue.

Hi, we will prioritize investigating what you're seeing in the next week or so.
Describe the bug
The OOM was in CPU RAM. GPU RAM usage was normal; the model takes up less than half of the GPU's memory.
This was hitting llm-compressor's modified save_pretrained_wrapper from llm-compressor/src/llmcompressor/transformers/sparsification/compressed_tensors_utils.py (line 122 at commit 1101723).
I ran save_pretrained with skip_compression_stats=True
After some investigation, it seems the main offender was:
save_pretrained_wrapper uses get_state_dict_offloaded_model(), which pulls some of the tensors off of the GPU into CPU memory.
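For illustration, here is a minimal sketch of how the CPU-RAM growth around that gather could be measured. The psutil measurement, the accelerate.utils import path for get_state_dict_offloaded_model, and the model variable (the quantized model from the reproduction sketch under "To Reproduce" below) are assumptions, not excerpts from llm-compressor:

```python
# Minimal sketch; assumptions: the import path of get_state_dict_offloaded_model,
# and `model` being the quantized model from the reproduction sketch below.
import os

import psutil
from accelerate.utils import get_state_dict_offloaded_model


def rss_gib() -> float:
    # Resident set size of the current process, in GiB.
    return psutil.Process(os.getpid()).memory_info().rss / 2**30


before = rss_gib()
# This mirrors the gather that save_pretrained_wrapper performs: the full
# state dict is materialized on CPU before any shard is written to disk.
state_dict = get_state_dict_offloaded_model(model)
print(f"Gathered {len(state_dict)} tensors; CPU RSS grew by {rss_gib() - before:.1f} GiB")
```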
Other things I noticed while investigating
In get_model_compressor, SparsityConfigMetadata.infer_sparsity_structure() is called even though its result is never used when skip_compression_stats==True, sparsity_config==None, and save_compressed==False. There didn't seem to be a way to disable it even though I knew my model wasn't sparse. This sparsity inference seemed to use a lot of RAM, but I did not test whether it was still a problem after working around the get_state_dict_offloaded_model issue.
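For concreteness, a minimal sketch of the kind of guard this suggests is below; the names (skip_compression_stats, sparsity_config, save_compressed, SparsityConfigMetadata) mirror this report, and the real get_model_compressor in llm-compressor may be structured differently:

```python
# Sketch only; names mirror this report, not the actual llm-compressor code.
# SparsityConfigMetadata is the class named above; its import is omitted here.
def infer_sparsity_if_needed(skip_compression_stats, sparsity_config, save_compressed):
    # Skip the RAM-hungry sparsity scan when its result can never be used:
    # stats are skipped, no explicit sparsity config, and no compressed save.
    if skip_compression_stats and sparsity_config is None and not save_compressed:
        return None
    return SparsityConfigMetadata.infer_sparsity_structure()
```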
Expected behavior
save_pretrained() should not require more CPU RAM than the max_shard_size for the safetensor files.
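To illustrate that expectation, a shard-by-shard write keeps peak CPU usage near one shard rather than the whole model. The shard_iterator helper below is hypothetical (not part of transformers or llm-compressor); it stands in for any logic that yields at most max_shard_size worth of tensors at a time:

```python
# Illustrative sketch only; shard_iterator is a hypothetical helper that yields
# (filename, {tensor_name: tensor}) pairs no larger than max_shard_size.
import os

from safetensors.torch import save_file


def save_sharded(shard_iterator, save_dir: str) -> None:
    os.makedirs(save_dir, exist_ok=True)
    for filename, shard in shard_iterator:
        # Only one shard's worth of tensors has to be resident in CPU RAM here.
        save_file(shard, os.path.join(save_dir, filename))
        del shard  # release this shard before materializing the next one
```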
Environment
llm-compressor 0.4.1
This was during a W4A16 quantization of meta-llama/Llama-3.1-8B-Instruct on an AWS g6e.xlarge (L40 GPU with 48GB VRAM, 32 GB CPU RAM, running on Ray such that 28 GB of CPU RAM was available to the worker.)
To Reproduce
Exact steps to reproduce the behavior:
It basically reproed with the example script:
https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py
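For reference, here is a paraphrased sketch of that example's flow using the model from the Environment section above, plus the skip_compression_stats=True save described earlier. It is not a verbatim copy of the script; import paths, the calibration-dataset handling, and default arguments may differ between llm-compressor versions. This also answers the recipe question in the comments above:

```python
# Paraphrased from examples/quantization_w4a16/llama3_example.py; details may
# differ from the actual script and across llm-compressor versions.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot  # newer releases also expose llmcompressor.oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# W4A16 GPTQ recipe, as in the example script.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",  # any small registered calibration set; the example builds its own
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# The CPU OOM happens during this save; the example passes save_compressed=True,
# and skip_compression_stats=True (used in this report) did not avoid the OOM.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, skip_compression_stats=True)
tokenizer.save_pretrained(SAVE_DIR)
```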
Errors
OOM
Additional context
n/a