[Bug]: deepseek-r1 multi-node crash #13136
Comments
+1
+1
+1
+1, waiting online, this is urgent
+1
cc @youkaichao if you have any suggestions, but given this is a harder-to-reproduce NCCL segfault, I recommend setting up fault tolerance for the service for now.
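Since the reporter deploys on Kubernetes (see the bug description below), "fault tolerance for the service" can be approximated with a restart policy plus a liveness probe. This is a minimal sketch, not taken from the thread: the deployment name, image, port, and probe path are all assumptions you would adapt to your own setup.

```yaml
# Hypothetical sketch — names, image, port, and probe path are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      restartPolicy: Always        # restart the container when the process crashes
      containers:
        - name: vllm
          image: vllm/vllm-openai  # assumed image
          livenessProbe:           # restart the pod if the server stops responding
            httpGet:
              path: /health        # assumed health endpoint
              port: 8000           # assumed serving port
            periodSeconds: 10
            failureThreshold: 3
```

With this in place, a crashed server is restarted automatically instead of leaving the service down until manual intervention.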
@simon-mo thanks for your reply. How do I apply the settings you recommend? Another question: if deepseek-r1 is started on 2 nodes, could you tell me how to run a profiler? I tried the torch profiler but it failed. Thanks a lot.
This seems to be an NCCL error. Can you please first run the sanity-check script at https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#incorrect-hardware-driver to see whether NCCL works as expected?
Upgrading the NCCL version to 2.25.1 solved this problem.
I upgraded NCCL to 2.25.1-1+cuda12.8 and that fixed it, thank you.
@Sisphyus @YejinHwang909 Thanks a lot, will try it again.
When I set NCCL_MIN_NCHANNELS=24 and NCCL_IB_QPS_PER_CONNECTION=8, the same error occurred again. 😩 2*8 H20
multiple users reported that upgrading NCCL fixed this
@youkaichao after upgrading nvidia-nccl to v2.25.1, pip reports that torch v2.5.1 requires nvidia-nccl v2.21. Does that mean v2.25.1 is incompatible with the torch v2.5.1 that vLLM uses?
you can install pytorch first, and then upgrade only the nccl package afterwards
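As a setup fragment, that sequence might look like the following. This is a sketch, not a command from the thread: the exact version strings are taken from the earlier comments (torch 2.5.1, NCCL 2.25.1), and the wheel name `nvidia-nccl-cu12` assumes a CUDA 12 environment.

```shell
# Sketch, assuming a pip-managed CUDA 12 environment.
pip install torch==2.5.1                      # install PyTorch (pulls its pinned NCCL)
pip install -U "nvidia-nccl-cu12==2.25.1.*"   # then bump only the NCCL wheel
```

pip will warn that the pinned torch requirement is no longer satisfied; per the comments above, the newer NCCL still works in practice.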
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
I use 2x8 H100 to deploy the deepseek-r1 model on Kubernetes, but when I run the bbh test set with a concurrency of 3, the service runs for a while and then crashes. nvidia-smi shows that the GPU memory of one process has dropped from more than 60G to 2G, and the server log contains the following error.