DeepSeekV3-671B-BF16 Lora Finetune #6843

xs1997zju · 2025-02-07T02:37:45Z

What does this PR do?

DeepSeekV3-671B-BF16 Lora Finetune

Fixes #6824
Fixes #6829

Before submitting

Did you read the contributor guideline?
Did you write any new necessary tests?

lxg2015 · 2025-02-08T02:52:23Z

@xs1997zju Hi, what hardware configuration did you use to get this running?

hiyouga · 2025-02-08T16:59:36Z

src/llamafactory/model/model_utils/moe.py

+        # deepseek_v3 moe module set as leaf node
+        for layer in model.model.layers:
+            if 'DeepseekV3MoE' in str(type(layer.mlp)):
+                layer.mlp._z3_leaf=True


How about

if model_type == "deepseek_v3": _set_z3_leaf_modules(model, [model.model.layers[0].mlp.__class__])

?

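xs1997zju · 2025-02-10

> How about
> if model_type == "deepseek_v3": _set_z3_leaf_modules(model, [model.model.layers[0].mlp.__class__])
> ?

Writing it this way also looks fine; it calls the _set_z3_leaf_modules interface directly.

flyinghu123 · 2025-02-13

The first three layers use a regular dense MLP rather than MoE, so _set_z3_leaf_modules(model, [model.model.layers[0].mlp.__class__]) does not appear to reach DeepseekV3MoE.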
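The concern about the first layers can be illustrated with a minimal sketch. The two classes below are stand-ins rather than the real transformers types, and `collect_moe_classes` is a hypothetical helper that exists only for this example. The only assumption taken from the discussion is that the leading layers carry a dense MLP while later layers carry `DeepseekV3MoE`.

```python
from types import SimpleNamespace


class DeepseekV3MLP:  # stand-in for the dense MLP used by the first layers
    pass


class DeepseekV3MoE:  # stand-in for the MoE block used by later layers
    pass


def collect_moe_classes(layers, name_hint="MoE"):
    """Return the distinct mlp classes whose class name contains name_hint."""
    found = []
    for layer in layers:
        cls = type(layer.mlp)
        if name_hint in cls.__name__ and cls not in found:
            found.append(cls)
    return found


# Toy stack: three dense layers followed by two MoE layers.
layers = [SimpleNamespace(mlp=DeepseekV3MLP()) for _ in range(3)]
layers += [SimpleNamespace(mlp=DeepseekV3MoE()) for _ in range(2)]

first_layer_class = type(layers[0].mlp)  # what layers[0].mlp.__class__ yields
moe_classes = collect_moe_classes(layers)

print(first_layer_class.__name__)          # DeepseekV3MLP
print([c.__name__ for c in moe_classes])   # ['DeepseekV3MoE']
```

A list collected this way could be passed to `_set_z3_leaf_modules` instead of `[model.model.layers[0].mlp.__class__]`, so that the MoE class rather than the dense MLP class is marked as a leaf.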

xs1997zju · 2025-02-10T03:27:27Z

> @xs1997zju Hi, what hardware configuration did you use to get this running?

Hi, 4 nodes with 32 GPUs in total, A-series cards.

Harryjun · 2025-02-10T10:15:18Z

@xs1997 Was that LoRA training? 4 nodes of A100-80G?

xs1997zju · 2025-02-10T10:59:46Z

@hiyouga Have you also run LoRA on V3-671B? What is the maximum sequence length you could train?

hiyouga · 2025-02-10T11:06:41Z

@xs1997zju It depends on the number of machines. Training probably requires modifying the modeling file, right?

xs1997zju · 2025-02-10T11:08:14Z

> @xs1997zju It depends on the number of machines. Training probably requires modifying the modeling file, right?

How many nodes did you actually use?

xs1997zju · 2025-02-10T11:10:18Z

> How many nodes did you actually use?

@hiyouga

xs1997zju · 2025-02-10T11:11:40Z

> 80 GPUs

So 10 nodes x 8 GPUs. A-series or H-series cards? What is the longest seq-len you have tested? 1024? 2048? @hiyouga

Harryjun · 2025-02-10T11:43:24Z

> What is the longest seq-len you have tested? 1024? 2048? @hiyouga

What LoRA rank?
Is ZeRO-3 with CPU offload enabled?

Harryjun · 2025-02-10T11:44:06Z

@hiyouga @xs1997zju Is the code still not merged?

xs1997zju · 2025-02-10T11:57:15Z

> 80 GB cards, 4k seqlen

@hiyouga BF16?

Han-Huaqiao · 2025-02-10T13:06:33Z

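@hiyouga As shown in the image above, the forward of DeepSeek V3's MoEGate module contains assert not self.training, which prevents the model from being fine-tuned. How did you modify this part of the code?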

xs1997zju · 2025-02-10T13:15:52Z

> @hiyouga BF16?

@hiyouga Could you share your exact configuration?

Han-Huaqiao · 2025-02-10T13:24:11Z

> @hiyouga Could you share your exact configuration?

This commit: https://github.com/hiyouga/LLaMA-Factory/pull/6843/files/be21531bab79793c5fc87928d63d793ca6dd8e98

xs1997zju · 2025-02-10T14:47:40Z

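> 80 GB cards, 4k seqlen

Just to confirm: was this tested with data that actually fills 4096 tokens, or was max_seq_len merely set to 4096 while the real samples were shorter?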

xs1997zju · 2025-02-10T14:47:55Z

> Just to confirm: was this tested with data that actually fills 4096 tokens, or was max_seq_len merely set to 4096 while the real samples were shorter?

@hiyouga

flyinghu123 · 2025-02-11T03:04:00Z

What changes does the modeling file need? I currently see the two problems below.

Cccei000 · 2025-02-11T03:25:04Z

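> What changes does the modeling file need? I currently see the two problems below.

The assert not self.training can probably just be removed. That section runs after the routing scores for all experts are already available: it computes the top-k, then gathers the routing scores by the top-k indices. Everything in between only computes the top-k indices. For training, it is enough that the gathered scores (the top-k weights) propagate gradients.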

Han-Huaqiao · 2025-02-11T03:31:00Z

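> The assert not self.training can probably just be removed. That section runs after the routing scores for all experts are already available: it computes the top-k, then gathers the routing scores by the top-k indices. Everything in between only computes the top-k indices. For training, it is enough that the gathered scores (the top-k weights) propagate gradients.

If it is simply ignored, the whole MoE module will not update its parameters. That is tolerable when loading an existing model, but it definitely won't work if you want to reuse this file to train from scratch.

That said, DeepSeek V2's MoE module does include training code; I'm checking whether it can be reused.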

Cccei000 · 2025-02-11T05:54:05Z

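> If it is simply ignored, the whole MoE module will not update its parameters. That is tolerable when loading an existing model, but it definitely won't work if you want to reuse this file to train from scratch.
>
> That said, DeepSeek V2's MoE module does include training code; I'm checking whether it can be reused.

Why would ignoring it stop the parameters from updating? Is it that topk_weight has no gradient in this implementation? I don't have the resources right now; did you actually run it and observe this? Also, I was only talking about MoEGate; I'm still looking at moe_infer in DeepseekV3MoE.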

flyinghu123 · 2025-02-11T05:54:25Z

> If it is simply ignored, the whole MoE module will not update its parameters. That is tolerable when loading an existing model, but it definitely won't work if you want to reuse this file to train from scratch.
>
> That said, DeepSeek V2's MoE module does include training code; I'm checking whether it can be reused.

I don't understand why ignoring assert not self.training would stop the whole MoE module from updating its parameters. For the DeepseekV3MoE forward below, I found the corresponding training code in V2. It looks reusable, though the aux-loss line should be removed.

Han-Huaqiao · 2025-02-11T05:59:40Z

> I don't understand why ignoring assert not self.training would stop the whole MoE module from updating its parameters. For the DeepseekV3MoE forward below, I found the corresponding training code in V2. It looks reusable, though the aux-loss line should be removed.

Just tried it: after modifying the code based on V2's MoE implementation, it does run.

Han-Huaqiao · 2025-02-11T06:03:07Z

> Why would ignoring it stop the parameters from updating? Is it that topk_weight has no gradient in this implementation? I don't have the resources right now; did you actually run it and observe this? Also, I was only talking about MoEGate; I'm still looking at moe_infer in DeepseekV3MoE.

Take a look at the forward function of DeepseekV3MoE: with the current code, the return value y during training is not actually computed from the return values of self.gate.
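The two positions in this exchange can be made concrete with a toy router in plain Python. This is a sketch under stated assumptions, not the code from the DeepSeek modeling file: sigmoid scoring, weight normalisation, the scalar "experts", and every function name here are illustrative. It checks by finite differences whether a router logit influences the output when the gathered top-k weights are used, and when they are not.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(logits, k):
    """Toy router: pick the k highest-scoring experts and normalise their scores."""
    scores = [sigmoid(z) for z in logits]
    # The index selection is discrete and carries no gradient.
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in idx)
    # The gathered weights remain a smooth function of the logits.
    weights = [scores[i] / total for i in idx]
    return idx, weights


def moe_forward(x, logits, experts, k=2, use_gate_weights=True):
    idx, weights = gate(logits, k)
    if use_gate_weights:
        return sum(w * experts[i](x) for i, w in zip(idx, weights))
    # Output that ignores the gate weights: the router gets no learning signal.
    return sum(experts[i](x) for i in idx) / k


def sensitivity(use_gate_weights, eps=1e-6):
    """Finite-difference derivative of the output w.r.t. the first router logit."""
    experts = [lambda x: 1.0 * x, lambda x: 2.0 * x, lambda x: 3.0 * x]
    logits = [0.5, 1.5, -2.0]
    base = moe_forward(1.0, logits, experts, use_gate_weights=use_gate_weights)
    bumped = [logits[0] + eps] + logits[1:]
    moved = moe_forward(1.0, bumped, experts, use_gate_weights=use_gate_weights)
    return (moved - base) / eps


weighted = sensitivity(True)     # non-zero: the logit changes the output
unweighted = sensitivity(False)  # zero: the logit has no effect on the output
print(weighted, unweighted)
```

In the weighted case the output moves with the logit even though the top-k selection is discrete, which matches the argument that only the gathered weights need gradients. In the unweighted case nothing reaches the router, which matches the objection about a training path that does not use the gate's return values.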

Cccei000 · 2025-02-11T06:05:08Z

We're talking about different things.

flyinghu123 · 2025-02-11T06:06:46Z

> Just tried it: after modifying the code based on V2's MoE implementation, it does run.

Could you share the complete changes? I tested with a trimmed-down model, and it runs with the modifications from this PR. As I recall, the _set_z3_leaf_modules call below raises an error.

Han-Huaqiao · 2025-02-11T06:17:28Z

> Could you share the complete changes? I tested with a trimmed-down model, and it runs with the modifications from this PR. As I recall, the _set_z3_leaf_modules call below raises an error.

I checked, and my changes match yours. However, I'm currently running on a single machine with a single GPU, with hidden_size and num_hidden_layers modified in the config, as follows:

flyinghu123 · 2025-02-15T08:26:32Z

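> Keep waiting; it is probably DeepSpeed's optimizer check.
> Oh, that doesn't feel normal. I've already waited 1 h 20 min.

It should finish if you wait a bit longer. DeepSpeed's _check_for_duplicates is written with plain loops, so being slow is expected. If you don't want to wait, you can do the same as above and add the following before load_model:

from deepspeed.runtime.engine import DeepSpeedEngine
DeepSpeedEngine._check_for_duplicates = lambda self, optimizer: None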

Mryangkaitong · 2025-02-15T09:08:42Z

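> It should finish if you wait a bit longer. DeepSpeed's _check_for_duplicates is written with plain loops, so being slow is expected. If you don't want to wait, you can do the same as above and add the following before load_model:
> from deepspeed.runtime.engine import DeepSpeedEngine
> DeepSpeedEngine._check_for_duplicates = lambda self, optimizer: None

Thanks. After adding this code I restarted, and it still hangs at the same place. GPU memory stays at 14619 MiB, the same as above, with no sign of change (if parameters were still being loaded, I would at least see it moving). I've waited 20 minutes so far.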

beichengus · 2025-02-15T13:03:49Z

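It seems to OOM when saving the model after training finishes. I have 2 TB of RAM here. Were you able to save and merge the LoRA successfully? Roughly how much RAM did you use at peak?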

RechardWong · 2025-02-16T06:48:32Z

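> It seems to OOM when saving the model after training finishes. I have 2 TB of RAM here. Were you able to save and merge the LoRA successfully? Roughly how much RAM did you use at peak?

I also have 2 TB of RAM per node, and the process was likewise killed by the system for excessive memory use when training finished.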
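A rough estimate helps put the 2 TB figure in context. The sketch below is only arithmetic on the nominal 671B parameter count from the model name at 2 bytes per BF16 weight; it is not a measurement, and it does not model anything else the save path may allocate.

```python
def bf16_weight_bytes(num_params):
    """Bytes needed to hold num_params weights at 2 bytes each (BF16)."""
    return num_params * 2


NUM_PARAMS = 671_000_000_000  # nominal size taken from the model name

# One full BF16 copy of the weights, in decimal terabytes.
total_tb = bf16_weight_bytes(NUM_PARAMS) / 1e12

# The same weights split evenly over 80 GPUs, in decimal gigabytes.
per_gpu_gb = bf16_weight_bytes(NUM_PARAMS) / 80 / 1e9

print(f"one full BF16 copy: {total_tb:.3f} TB")
print(f"even shard over 80 GPUs: {per_gpu_gb:.2f} GB")
```

By this estimate a single full copy of the weights already takes well over 1 TB of host RAM, so any step that materialises the whole model on one node leaves little headroom on a 2 TB machine and none on a 1 TB machine.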

beichengus · 2025-02-16T08:20:04Z

> What did you set num_hidden_layers to? Normally, with more than 4 layers you should still hit a float32 * bfloat16 error here.
>
> 8
>
> That's a bit odd. Did you print the dtype of each decoder layer's output?
>
> Not yet, but my grad_norm currently comes out as nan.

Mine is nan too. Have you solved this problem?

Harryjun · 2025-02-17T03:44:17Z

> Does the modeling file need any changes? 10 nodes x 8 A100s still hit a GPU out-of-memory error.

@jiefisher How did this go for you?

jiefisher · 2025-02-17T05:09:35Z

> @jiefisher How did this go for you?

Host RAM ran out when saving the model; training itself was fine.

yxliu0903 · 2025-02-17T13:33:50Z

> DeepSeekV3-671B-BF16 Lora Finetune

I can't get this to run with your changes. Could you tell me how you ran it?

xs1997zju · 2025-02-18T02:06:24Z

@hiyouga Hi, have you tried full-parameter fine-tuning of the 671B model? How many nodes did you use?

ygxw0909 · 2025-02-18T03:04:17Z

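> Have you tried full-parameter fine-tuning of the 671B model? How many nodes did you use?

@hiyouga I ran into this problem too. I have 12 nodes x 8 GPUs with 1 TB of RAM per node, and host RAM ran out when saving the model. You mentioned above that you got it running on 10 x 8; were you able to save the model successfully?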

Skywuuuu · 2025-02-18T07:54:44Z

> Host RAM ran out when saving the model; training itself was fine.

@jiefisher I hit OOM when saving the model. Is there a fix?

xs1997zju · 2025-02-18T15:05:07Z

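> @hiyouga I ran into this problem too. I have 12 nodes x 8 GPUs with 1 TB of RAM per node, and host RAM ran out when saving the model. You mentioned above that you got it running on 10 x 8; were you able to save the model successfully?

@ygxw0909 With 12 nodes and full-parameter training, how many k tokens of sequence length can you reach?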

xs1997zju · 2025-02-19T03:01:51Z

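> @hiyouga I ran into this problem too. I have 12 nodes x 8 GPUs with 1 TB of RAM per node, and host RAM ran out when saving the model. You mentioned above that you got it running on 10 x 8; were you able to save the model successfully?

@ygxw0909 Can 12 nodes with ZeRO-3 really carry full-parameter training of the 671B model?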

Han-Huaqiao · 2025-02-20T07:53:31Z

@beichengus

The nan appeared on domestically made (Chinese) accelerator cards on my side; on GPUs I have not run into the nan problem.

ygxw0909 · 2025-02-21T03:18:44Z

@xs1997zju Yes, it works. I set 4k, and with gradient checkpointing enabled GPU memory is entirely sufficient, but the speed... is extremely slow... (I'm doing LoRA)

ygxw0909 · 2025-02-21T03:21:47Z

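> @jiefisher I hit OOM when saving the model. Is there a fix?

There is a workaround: set "stage3_gather_16bit_weights_on_model_save": false in the DeepSpeed config. The model is then no longer gathered in full into the master node's RAM, but the saved files will be sharded. @hiyouga For LoRA training I set --save_only_model, so in theory only the adapter itself should take up memory at save time. Yet everyone here runs out of RAM because the master node loads the entire BF16 R1 into memory.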
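The config change described above can be sketched as a small helper. Only the key name `stage3_gather_16bit_weights_on_model_save` and its place under `zero_optimization` come from this thread; `disable_gather_on_save` is a hypothetical function for this example, and the sample dict is a trimmed stand-in for a real DeepSpeed config file.

```python
import json


def disable_gather_on_save(config):
    """Return a copy of a DeepSpeed config dict with 16-bit gathering on save turned off."""
    updated = json.loads(json.dumps(config))  # cheap deep copy for JSON-like data
    zero = updated.setdefault("zero_optimization", {})
    zero["stage3_gather_16bit_weights_on_model_save"] = False
    return updated


# Trimmed stand-in for the contents of a ZeRO-3 config file.
original = {
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,
    }
}

patched = disable_gather_on_save(original)
print(json.dumps(patched, indent=2))
```

Returning a copy keeps the original dict untouched, so the same base config can still be used for runs where a single consolidated checkpoint fits in host RAM.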

yxliu0903 · 2025-02-21T03:52:42Z

> DeepSeekV3-671B-BF16 Lora Finetune

What is the maximum sequence length this can handle? It seems to OOM at 4096. How should this be solved?

xs1997zju · 2025-02-21T06:15:56Z

> @ygxw0909 Can 12 nodes with ZeRO-3 really carry full-parameter training of the 671B model?

@ygxw0909 With 12 nodes, are you on BF16 with A800 or FP8 with H800? How many TB of RAM does each machine have?

ygxw0909 · 2025-02-23T06:42:55Z

> @ygxw0909 With 12 nodes, are you on BF16 with A800 or FP8 with H800? How many TB of RAM does each machine have?

Oh, sorry, I misread the earlier reply. What I got running is LoRA, not full-parameter SFT. I'm on A100s; I don't have any H-series cards QWQ

xs1997zju · 2025-02-24T06:28:23Z

> Oh, sorry, I misread the earlier reply. What I got running is LoRA, not full-parameter SFT. I'm on A100s; I don't have any H-series cards QWQ

@hiyouga Have you run full-parameter training on the 671B model?

Tongmengfei · 2025-02-26T11:49:42Z

Is everyone running the deepseek_r1_671B model? During a LoRA fine-tuning test I hit this error: raise ValueError(ValueError: Unknown quantization type, got fp8 - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto', 'eetq', 'hqq', 'compressed-tensors', 'fbgemm_fp8', 'torchao']

Do I need to convert the deepseek_R1_671B model from FP8 to BF16 with fp8_cast_bf16.py before training?

Does LLaMA-Factory currently not support training models in FP8 format?

matthew-hippocratic · 2025-02-26T22:26:29Z

> Are you running the deepseek_r1_671B model? When I was testing LoRA fine-tuning, this problem occurred: raise ValueError(ValueError: Unknown quantization type, got fp8 - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto', 'eetq', 'hqq', 'compressed-tensors', 'fbgemm_fp8', 'torchao']
>
> Do I need to convert the deepseek_R1_671B model from FP8 format to BF16 format using fp8_cast_bf16.py before training?
>
> Does LLaMA-Factory currently not support model training in FP8 format?

@Tongmengfei It doesn't support FP8 training. It's shown in the config, but everyone is using the BF16 version at https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main
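A quick check on a checkpoint's config can show in advance whether it would trigger the "Unknown quantization type, got fp8" error quoted above. The sketch assumes the config is a Hugging Face style dict whose `quantization_config` carries a `quant_method` field; that layout is inferred from the error message, and `is_fp8_checkpoint` is a hypothetical helper, not part of LLaMA-Factory or transformers.

```python
def is_fp8_checkpoint(config):
    """True if a Hugging Face style config dict declares FP8 quantization."""
    quant = config.get("quantization_config") or {}
    return str(quant.get("quant_method", "")).lower() == "fp8"


# Example dicts; the field layout is an assumption modelled on the error message.
fp8_config = {
    "model_type": "deepseek_v3",
    "quantization_config": {"quant_method": "fp8"},
}
bf16_config = {"model_type": "deepseek_v3", "torch_dtype": "bfloat16"}

print(is_fp8_checkpoint(fp8_config))   # True
print(is_fp8_checkpoint(bf16_config))  # False
```

In practice the dict would come from reading the checkpoint's config.json with `json.load`; a `True` result suggests converting to BF16 first or switching to an already converted BF16 release.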

Tongmengfei · 2025-02-27T05:25:01Z

> @Tongmengfei It doesn't support FP8 training. It's shown in the config, but everyone is using the BF16 version at https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main

When using the DeepSeek-V3-bf16 model, did you run into the following problem?
if p.startswith(start_prefix) and param_device_map[p[len(start_prefix) :]] == "disk"
KeyError: 'model.layers.61.self_attn.q_a_proj.weight'
It looks like the weights of the MTP layer are not being recognized.

matthew-hippocratic · 2025-02-27T17:50:01Z

> When using the DeepSeek-V3-bf16 model, did you encounter the following problem?
> if p.startswith(start_prefix) and param_device_map[p[len(start_prefix) :]] == "disk"
> KeyError: 'model.layers.61.self_attn.q_a_proj.weight'
> It seems that the weights of the MTP layer are not recognized.

I did not get this error. LLaMA-Factory doesn't support MTP training so it should just ignore the model.layers.61 weights

ygxw0909 · 2025-02-28T10:09:54Z

> I did not get this error. LLaMA-Factory doesn't support MTP training so it should just ignore the model.layers.61 weights

I hit the same problem when loading the model to merge the LoRA adapter. How should it be solved?

Tongmengfei · 2025-02-28T10:30:29Z

> I hit the same problem when loading the model to merge the LoRA adapter. How should it be solved?

Hi, I ran into the problem above with gpu_num=4. After raising it to gpu_num=8 to give loading more resources, the problem no longer appeared.
If you hit it as well, you can try these two options:
1) Add more GPU resources.
2) The model.layers.61 weights belong to the MTP layer; skip that layer when loading.
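Option 2 can be sketched as a filter over checkpoint key names. This is an illustration only: `drop_extra_layers` is a hypothetical helper, the key list is made up apart from the name in the KeyError, and the value `num_hidden_layers=61` (layers 0 to 60 in the main stack, with index 61 being the MTP layer) is an assumption that should be checked against the model's config.json.

```python
import re

# Matches keys such as "model.layers.61.self_attn.q_a_proj.weight".
LAYER_KEY = re.compile(r"^model\.layers\.(\d+)\.")


def drop_extra_layers(keys, num_hidden_layers):
    """Keep keys that do not belong to a layer index >= num_hidden_layers."""
    kept = []
    for key in keys:
        match = LAYER_KEY.match(key)
        if match and int(match.group(1)) >= num_hidden_layers:
            continue  # e.g. model.layers.61.* when only layers 0..60 exist
        kept.append(key)
    return kept


keys = [
    "model.embed_tokens.weight",
    "model.layers.60.self_attn.q_a_proj.weight",
    "model.layers.61.self_attn.q_a_proj.weight",
]
filtered = drop_extra_layers(keys, num_hidden_layers=61)
print(filtered)
```

The same predicate could be applied to the entries of a sharded checkpoint's weight index before loading, so that the loader never sees a key for which the model has no matching module.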

IshiKura-a · 2025-03-03T10:50:42Z

> Thanks. After adding this code I restarted, and it still hangs at the same place. GPU memory stays at 14619 MiB, the same as above, with no sign of change (if parameters were still being loaded, I would at least see it moving). I've waited 20 minutes so far.

Hi, has this problem been solved?

1245244103 · 2025-03-06T08:00:02Z

> There is a workaround: set "stage3_gather_16bit_weights_on_model_save": false in the DeepSpeed config. The model is then no longer gathered in full into the master node's RAM, but the saved files will be sharded.

I saved sharded checkpoints with this setting, but my existing inference framework cannot load and run the weight files. Do I need to merge them with the merge script that DeepSpeed officially provides?

Tongmengfei · 2025-03-06T10:20:49Z

> Hi, has this problem been solved?

Tongmengfei · 2025-03-06T10:22:08Z

When training the DeepSeek R1 671B model with DeepSpeed ZeRO-3, has anyone run into the "Some NCCL operations have failed or timed out." problem?

The full error is as follows:
hgpu8077: [rank7]:[E305 21:36:52.669123121 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=512650, OpType=_ALLGATHER_BASE, NumelIn=180224, NumelOut=2883584, Timeout(ms)=1800000) ran for 1800004 milliseconds before timing out.
hgpu8092: [rank10]:[E305 21:36:52.594230057 ProcessGroupNCCL.cpp:616] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=512650, OpType=_ALLGATHER_BASE, NumelIn=360448, NumelOut=5767168, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
hgpu8077: [rank7]:[E305 21:36:52.669455337 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 512650, last enqueued NCCL work: 512653, last completed NCCL work: 512649.
hgpu8092: [rank10]:[E305 21:36:52.594651945 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 10] Exception (either an error or timeout) detected by watchdog at work: 512650, last enqueued NCCL work: 512653, last completed NCCL work: 512649.
hgpu8077: [rank7]:[E305 21:36:52.669469959 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 7] Timeout at NCCL work: 512650, last enqueued NCCL work: 512653, last completed NCCL work: 512649.
hgpu8092: [rank10]:[E305 21:36:52.594665866 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 10] Timeout at NCCL work: 512650, last enqueued NCCL work: 512653, last completed NCCL work: 512649.
hgpu8077: [rank7]:[E305 21:36:52.669477433 ProcessGroupNCCL.cpp:630] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
hgpu8092: [rank10]:[E305 21:36:52.594672482 ProcessGroupNCCL.cpp:630] [Rank 10] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.

The config I used is ds_z3_config.json:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
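When reading watchdog lines like the ones above, it can help to pull the numeric fields out and compare them across ranks. The parser below is a minimal sketch: the regular expression was written against the two timeout lines in this comment only, and `parse_watchdog_line` is a hypothetical helper, not a PyTorch or NCCL API.

```python
import re

PATTERN = re.compile(
    r"\[Rank (\d+)\].*?SeqNum=(\d+), OpType=(\w+), "
    r"NumelIn=(\d+), NumelOut=(\d+), Timeout\(ms\)=(\d+)"
)


def parse_watchdog_line(line):
    """Extract rank, sequence number, sizes, and timeout from one watchdog line."""
    m = PATTERN.search(line)
    if m is None:
        return None
    rank, seq, op, numel_in, numel_out, timeout_ms = m.groups()
    return {
        "rank": int(rank),
        "seq": int(seq),
        "op": op,
        "numel_in": int(numel_in),
        "numel_out": int(numel_out),
        "ratio": int(numel_out) // int(numel_in),  # output size / input size
        "timeout_min": int(timeout_ms) / 60000,
    }


# Shortened copies of the two timeout lines from the log above.
line_rank7 = (
    "[Rank 7] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=512650, OpType=_ALLGATHER_BASE, NumelIn=180224, "
    "NumelOut=2883584, Timeout(ms)=1800000)"
)
line_rank10 = (
    "[Rank 10] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=512650, OpType=_ALLGATHER_BASE, NumelIn=360448, "
    "NumelOut=5767168, Timeout(ms)=1800000)"
)

r7 = parse_watchdog_line(line_rank7)
r10 = parse_watchdog_line(line_rank10)
print(r7)
print(r10)
```

For these two lines the timeout works out to 30 minutes and the output-to-input ratio is 16 on both ranks, while the input sizes differ between rank 7 and rank 10 for the same sequence number. Whether that mismatch is the cause of the hang is not established in this thread, but it is a concrete detail to compare across ranks when debugging.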

DeepSeekV3-671B-BF16 Lora Finetune

be21531

hiyouga self-requested a review February 8, 2025 16:53

hiyouga reviewed Feb 8, 2025

hiyouga added the pending This problem is yet to be addressed label Feb 8, 2025
