
[Question] Has anyone successfully quantized Deepseek-V3 to int4-w4a16? #1203

Open
halexan opened this issue Feb 27, 2025 · 5 comments

halexan commented Feb 27, 2025

Has anyone successfully quantized Deepseek-V3 to int4-w4a16?
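For context, a minimal W4A16 recipe sketch with llm-compressor's `GPTQModifier` might look like the following (`scheme="W4A16"` is one of llm-compressor's preset schemes; the MoE-gate ignore pattern follows the library's DeepSeek/MoE examples):

```python
from llmcompressor.modifiers.quantization import GPTQModifier

# Sketch only: W4A16 = 4-bit weights, 16-bit activations.
# MoE gate layers and lm_head are kept at full precision,
# following llm-compressor's MoE examples.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:.*mlp.gate$"],
)
```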

@JeffRody

`KeyError: 'model.layers.61.self_attn.q_a_proj.weight'`
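A hypothetical sanity check (assuming the standard HF sharded-safetensors layout, with the checkpoint path taken from the script below): the index file maps every tensor name to its shard, so one can see whether the tensor the `KeyError` names is present in the checkpoint at all. In the original DeepSeek-V3/R1 checkpoints, `model.layers.61` is the MTP module, which some bf16 conversion scripts drop.

```python
import json

# Path from the script posted below.
ckpt = "/home/wanglch/data/DeepSeek-R1-bf16"

# The sharded-safetensors index maps tensor name -> shard file.
with open(f"{ckpt}/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

# Is the tensor the KeyError complains about actually in the checkpoint?
print("model.layers.61.self_attn.q_a_proj.weight" in weight_map)

# Which layer-61 tensors exist, if any?
print(sorted(k for k in weight_map if k.startswith("model.layers.61.")))
```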

@JeffRody

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.entrypoints import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# NOTE: transformers 4.49.0 has an attribute error with DeepSeek.
# Please consider either downgrading your transformers version to a
# previous version or upgrading to a version where this bug is fixed.

# Select a Mixture of Experts model for quantization.
MODEL_ID = "/home/wanglch/data/DeepSeek-R1-bf16"

# Adjust based on the number of desired GPUs.
# If not enough memory is available, some layers will automatically be offloaded to CPU.
device_map = calculate_offload_device_map(
    MODEL_ID,
    reserve_for_hessians=True,
    num_gpus=6,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
# It's recommended to use more calibration samples for MoE models so each expert is hit.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 2048
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)

# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

# Define a llmcompressor recipe for INT8 W8A8 quantization.
# Since the MoE gate layers are sensitive to quantization, we add them to the
# ignore list so they remain at full precision.
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head", "re:.*mlp.gate$"],
    ),
]

# [-1]: last path component ("DeepSeek-R1-bf16"); [1] would yield "home" here.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W8A8"

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    save_compressed=True,
    output_dir=SAVE_DIR,
)

print("========== SAMPLE GENERATION ==============")
SAMPLE_INPUT = ["I love quantization because"]
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
inputs = tokenizer(SAMPLE_INPUT, return_tensors="pt", padding=True).to(model.device)
output = model.generate(**inputs, max_length=50)
text_output = tokenizer.batch_decode(output)
print(text_output)
```
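As a follow-up sketch (assuming vLLM's compressed-tensors support, per llm-compressor's README examples), the directory written by `oneshot()` with `save_compressed=True` could then be served with vLLM:

```python
from vllm import LLM, SamplingParams

# Load the compressed checkpoint saved by oneshot(); vLLM reads the
# compressed-tensors config produced by save_compressed=True.
llm = LLM(model="DeepSeek-R1-bf16-W8A8", trust_remote_code=True)

outputs = llm.generate(
    ["I love quantization because"],
    SamplingParams(max_tokens=50),
)
print(outputs[0].outputs[0].text)
```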


halexan commented Feb 27, 2025

@JeffRody

You can use a markdown code block to show your code. It's easier to read.

brian-dellabetta (Collaborator) commented Feb 28, 2025

Hi, is your stack trace the same as the one shown in #1204? If so, please see this comment. If not, please reply.

@liu316484231

> `KeyError: 'model.layers.61.self_attn.q_a_proj.weight'`

Hello, did you solve the problem? I encountered the same error.
