OOM during save_pretrained of compressed model #1183
Labels: bug
Comments
Can you share the recipe you're applying that is showing the memory spike?

Any progress on this? I've run into the same issue.

Hi, we will prioritize investigating what you're seeing in the next week or so.
Describe the bug
The OOM was in CPU RAM. GPU RAM usage was normal; the model takes up less than half of the GPU's memory.
This was hitting llm-compressor's modified save_pretrained_wrapper from llm-compressor/src/llmcompressor/transformers/sparsification/compressed_tensors_utils.py (line 122 at commit 1101723).
I ran save_pretrained with skip_compression_stats=True
After some investigation, it seems the main offender was:
save_pretrained_wrapper uses get_state_dict_offloaded_model(), which pulls some of the tensors off of the GPU into CPU memory.
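For illustration, here is a minimal sketch of how the CPU-RAM growth around that gather could be measured. The psutil measurement, the accelerate.utils import path for get_state_dict_offloaded_model, and the model variable (the quantized model from the reproduction sketch under "To Reproduce" below) are assumptions, not excerpts from llm-compressor:

```python
# Minimal sketch; assumptions: the import path of get_state_dict_offloaded_model,
# and `model` being the quantized model from the reproduction sketch below.
import os

import psutil
from accelerate.utils import get_state_dict_offloaded_model


def rss_gib() -> float:
    # Resident set size of the current process, in GiB.
    return psutil.Process(os.getpid()).memory_info().rss / 2**30


before = rss_gib()
# This mirrors the gather that save_pretrained_wrapper performs: the full
# state dict is materialized on CPU before any shard is written to disk.
state_dict = get_state_dict_offloaded_model(model)
print(f"Gathered {len(state_dict)} tensors; CPU RSS grew by {rss_gib() - before:.1f} GiB")
```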
Other things I noticed while investigating
In get_model_compressor, SparsityConfigMetadata.infer_sparsity_structure() is called even though its result is never used when skip_compression_stats==True, sparsity_config==None, and save_compressed==False. There didn't seem to be a way to disable it even though I knew my model wasn't sparse. This sparsity inference seemed to use a lot of RAM, but I did not test whether it was still a problem after working around the get_state_dict_offloaded_model issue.
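For concreteness, a minimal sketch of the kind of guard this suggests is below; the names (skip_compression_stats, sparsity_config, save_compressed, SparsityConfigMetadata) mirror this report, and the real get_model_compressor in llm-compressor may be structured differently:

```python
# Sketch only; names mirror this report, not the actual llm-compressor code.
# SparsityConfigMetadata is the class named above; its import is omitted here.
def infer_sparsity_if_needed(skip_compression_stats, sparsity_config, save_compressed):
    # Skip the RAM-hungry sparsity scan when its result can never be used:
    # stats are skipped, no explicit sparsity config, and no compressed save.
    if skip_compression_stats and sparsity_config is None and not save_compressed:
        return None
    return SparsityConfigMetadata.infer_sparsity_structure()
```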
Expected behavior
save_pretrained() should not require more CPU RAM than the max_shard_size for the safetensor files.
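To illustrate that expectation, a shard-by-shard write keeps peak CPU usage near one shard rather than the whole model. The shard_iterator helper below is hypothetical (not part of transformers or llm-compressor); it stands in for any logic that yields at most max_shard_size worth of tensors at a time:

```python
# Illustrative sketch only; shard_iterator is a hypothetical helper that yields
# (filename, {tensor_name: tensor}) pairs no larger than max_shard_size.
import os

from safetensors.torch import save_file


def save_sharded(shard_iterator, save_dir: str) -> None:
    os.makedirs(save_dir, exist_ok=True)
    for filename, shard in shard_iterator:
        # Only one shard's worth of tensors has to be resident in CPU RAM here.
        save_file(shard, os.path.join(save_dir, filename))
        del shard  # release this shard before materializing the next one
```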
Environment
llm-compressor 0.4.1
This was during a W4A16 quantization of meta-llama/Llama-3.1-8B-Instruct on an AWS g6e.xlarge (L40 GPU with 48GB VRAM, 32 GB CPU RAM, running on Ray such that 28 GB of CPU RAM was available to the worker.)
To Reproduce
Exact steps to reproduce the behavior:
It basically reproed with the example script:
https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py
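For reference, here is a paraphrased sketch of that example's flow using the model from the Environment section above, plus the skip_compression_stats=True save described earlier. It is not a verbatim copy of the script; import paths, the calibration-dataset handling, and default arguments may differ between llm-compressor versions. This also answers the recipe question in the comments above:

```python
# Paraphrased from examples/quantization_w4a16/llama3_example.py; details may
# differ from the actual script and across llm-compressor versions.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot  # newer releases also expose llmcompressor.oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# W4A16 GPTQ recipe, as in the example script.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",  # any small registered calibration set; the example builds its own
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# The CPU OOM happens during this save; the example passes save_compressed=True,
# and skip_compression_stats=True (used in this report) did not avoid the OOM.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, skip_compression_stats=True)
tokenizer.save_pretrained(SAVE_DIR)
```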
Errors
OOM
Additional context
n/a