
[Question] Has anyone successfully quantized Deepseek-V3 to int4-w4a16? #1203

Open
halexan opened this issue Feb 27, 2025 · 5 comments

halexan commented Feb 27, 2025

Has anyone successfully quantized Deepseek-V3 to int4-w4a16?
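For context, a minimal W4A16 recipe sketch with llm-compressor's `GPTQModifier` might look like the following (`scheme="W4A16"` is one of llm-compressor's preset schemes; the MoE-gate ignore pattern follows the library's DeepSeek/MoE examples):

```python
from llmcompressor.modifiers.quantization import GPTQModifier

# Sketch only: W4A16 = 4-bit weights, 16-bit activations.
# MoE gate layers and lm_head are kept at full precision,
# following llm-compressor's MoE examples.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:.*mlp.gate$"],
)
```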

@JeffRody

`KeyError: 'model.layers.61.self_attn.q_a_proj.weight'`
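A hypothetical sanity check (assuming the standard HF sharded-safetensors layout, with the checkpoint path taken from the script below): the index file maps every tensor name to its shard, so one can see whether the tensor the `KeyError` names is present in the checkpoint at all. In the original DeepSeek-V3/R1 checkpoints, `model.layers.61` is the MTP module, which some bf16 conversion scripts drop.

```python
import json

# Path from the script posted below.
ckpt = "/home/wanglch/data/DeepSeek-R1-bf16"

# The sharded-safetensors index maps tensor name -> shard file.
with open(f"{ckpt}/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

# Is the tensor the KeyError complains about actually in the checkpoint?
print("model.layers.61.self_attn.q_a_proj.weight" in weight_map)

# Which layer-61 tensors exist, if any?
print(sorted(k for k in weight_map if k.startswith("model.layers.61.")))
```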

@JeffRody

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.entrypoints import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# NOTE: transformers 4.49.0 has an attribute error with DeepSeek.
# Please consider either downgrading your transformers version to a
# previous version or upgrading to a version where this bug is fixed.

# Select a Mixture of Experts model for quantization.
MODEL_ID = "/home/wanglch/data/DeepSeek-R1-bf16"

# Adjust based on the number of desired GPUs.
# If not enough memory is available, some layers will automatically be offloaded to CPU.
device_map = calculate_offload_device_map(
    MODEL_ID,
    reserve_for_hessians=True,
    num_gpus=6,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
# It's recommended to use more calibration samples for MoE models so each expert is hit.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 2048
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)

# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

# Define a llmcompressor recipe for INT8 W8A8 quantization.
# Since the MoE gate layers are sensitive to quantization, we add them to the
# ignore list so they remain at full precision.
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head", "re:.*mlp.gate$"],
    ),
]

# [-1]: last path component ("DeepSeek-R1-bf16"); [1] would yield "home" here.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W8A8"

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    save_compressed=True,
    output_dir=SAVE_DIR,
)

print("========== SAMPLE GENERATION ==============")
SAMPLE_INPUT = ["I love quantization because"]
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
inputs = tokenizer(SAMPLE_INPUT, return_tensors="pt", padding=True).to(model.device)
output = model.generate(**inputs, max_length=50)
text_output = tokenizer.batch_decode(output)
print(text_output)
```
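As a follow-up sketch (assuming vLLM's compressed-tensors support, per llm-compressor's README examples), the directory written by `oneshot()` with `save_compressed=True` could then be served with vLLM:

```python
from vllm import LLM, SamplingParams

# Load the compressed checkpoint saved by oneshot(); vLLM reads the
# compressed-tensors config produced by save_compressed=True.
llm = LLM(model="DeepSeek-R1-bf16-W8A8", trust_remote_code=True)

outputs = llm.generate(
    ["I love quantization because"],
    SamplingParams(max_tokens=50),
)
print(outputs[0].outputs[0].text)
```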


halexan commented Feb 27, 2025

@JeffRody

You can use a markdown code block to show your code. It's easier to read.

brian-dellabetta (Collaborator) commented Feb 28, 2025

Hi, is your stack trace the same as the one shown in #1204? If so, please see this comment. If not, please reply.

@liu316484231

> `KeyError: 'model.layers.61.self_attn.q_a_proj.weight'`

Hello, did you solve the problem? I encountered the same error.
