
OOM during save_pretrained of compressed model #1183

Open
chmeyers opened this issue Feb 22, 2025 · 7 comments
Assignees
Labels
bug Something isn't working

Comments

@chmeyers

Describe the bug
The OOM was for CPU RAM; GPU RAM usage was normal, and the model takes up less than half of the GPU's memory.

This was hitting llmcompressor's modified save_pretrained_wrapper.

I ran save_pretrained with skip_compression_stats=True.

After some investigation, the main offender appears to be:
save_pretrained_wrapper uses get_state_dict_offloaded_model(), which pulls some of the tensors off of the GPU.
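
Roughly speaking, gathering the full state dict on the CPU behaves like the following conceptual sketch (this is not the actual get_state_dict_offloaded_model() implementation; model stands for the model being saved), which is why peak CPU RAM approaches the full model size rather than one shard:

# Conceptual sketch only -- not the real get_state_dict_offloaded_model().
# Collecting every parameter into a single CPU-side dict means tensors that
# live on the GPU (or are offloaded) all get copied into host memory at once,
# so peak CPU RAM approaches the full model size instead of one shard.
state_dict = {
    name: param.detach().to("cpu")
    for name, param in model.named_parameters()
}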

Other things I noticed while investigating:
In get_model_compressor, SparsityConfigMetadata.infer_sparsity_structure() is called even when it's never used, i.e. when skip_compression_stats==True, sparsity_config==None, and save_compressed==False. There didn't seem to be a way to disable it even though I knew my model wasn't sparse. This sparsity inference appeared to be using a lot of RAM, but I did not test whether it was still a problem after working around the get_state_dict_offloaded_model issue.
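
One way to check how much CPU RAM the save path consumes is to measure process RSS around the call. A minimal sketch, assuming psutil is installed and that model and SAVE_DIR come from the surrounding script; skip_compression_stats is the kwarg mentioned above:

import psutil

def rss_gb():
    # Resident set size of the current Python process, in GiB.
    return psutil.Process().memory_info().rss / 1024**3

# Note: this shows growth across the call, not the transient peak inside it.
print(f"CPU RSS before save: {rss_gb():.1f} GiB")
model.save_pretrained(SAVE_DIR, skip_compression_stats=True)
print(f"CPU RSS after save:  {rss_gb():.1f} GiB")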

Expected behavior
save_pretrained() should not require more CPU RAM than the max_shard_size for the safetensor files.
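
That is, with sharded serialization the peak during the save should be on the order of a single shard. A minimal sketch (max_shard_size is the standard transformers save_pretrained argument; the "2GB" value is illustrative):

# Expectation: peak CPU RAM during save should track one shard ("2GB" here is
# illustrative), not the full unsharded state dict.
model.save_pretrained(SAVE_DIR, max_shard_size="2GB")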

Environment
llm-compressor 0.4.1
This was during a W4A16 quantization of meta-llama/Llama-3.1-8B-Instruct on an AWS g6e.xlarge (L40 GPU with 48 GB VRAM, 32 GB CPU RAM, running on Ray such that 28 GB of CPU RAM was available to the worker).

To Reproduce
Exact steps to reproduce the behavior:
It basically reproduces with the example script:
https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py
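
For reference, a condensed sketch of what that example does (dataset preprocessing and exact constants may differ from the linked script; this is not a verbatim copy):

# Condensed sketch of the referenced W4A16 example -- not a verbatim copy.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: a small shuffled slice of ultrachat, chat-templated and tokenized.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda s: tokenizer(s["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True,
                        padding=False, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# 4-bit weight, 16-bit activation GPTQ quantization of all Linear layers except lm_head.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# The CPU OOM reported above happens here.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)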

Errors
OOM

Additional context
n/a

@chmeyers chmeyers added the bug Something isn't working label Feb 22, 2025
@HelloCard

I also noticed that after updating llm-compressor to 0.4.1, memory consumption increased again, and even using the calculate_offload_device_map method still caused crashes.

@HelloCard

Lowering NUM_CALIBRATION_SAMPLES to 1024 still crashes; something that was a breeze in v0.4.0 is so hard now.

@HelloCard

After rolling back to version 0.4.0, the memory usage of the exact same script is as shown in the attached screenshots.

@dsikka dsikka self-assigned this Feb 23, 2025
@dsikka
Collaborator

dsikka commented Feb 24, 2025

@HelloCard

Can you share the recipe you're applying that is showing the spike in the memory?

@HelloCard

HelloCard commented Feb 24, 2025

@dsikka

from transformers import AutoTokenizer, AutoModelForCausalLM
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

MODEL_ID = "/root/autodl-tmp/Cydonia-24B-v2"
device_map = calculate_offload_device_map(
    MODEL_ID, num_gpus=1, reserve_for_hessians=True, torch_dtype="auto")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)



from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 1024
MAX_SEQUENCE_LENGTH = 4096


# Load and preprocess the dataset
ds = load_dataset("/root/autodl-tmp/ultrachat_2k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )
ds = ds.map(tokenize, remove_columns=ds.column_names)


from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Configure the quantization algorithms

recipe = """
DEFAULT_stage:
  DEFAULT_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.7
      mappings:
      - - ['re:.*q_proj', 're:.*k_proj', 're:.*v_proj']
        - re:.*input_layernorm
      - - ['re:.*gate_proj', 're:.*up_proj']
        - re:.*post_attention_layernorm
      - - ['re:.*down_proj']
        - re:.*up_proj
    GPTQModifier:
      sequential_update: true
      dampening_frac: 0.1
      quantize: true
      config_groups:
        group_0:
          targets: [Linear]
          weights: {num_bits: 8, type: int, symmetric: true, strategy: channel, observer: mse}
          input_activations: {num_bits: 8, type: int, symmetric: true, strategy: token, dynamic: true,
            observer: null}
      ignore: [lm_head]
"""

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed model
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W8A8-Dynamic-Per-Token"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

@DoubleRedX

Any progress on this? I've hit the same issue.

@dsikka
Collaborator

dsikka commented Mar 5, 2025

Hi, we will prioritize investigating what you're seeing in the next week or so.
