Assertion failed: internode_ll.cu:259, condition: num_recv_tokens != 0 #49

Open
abcdabcd987 opened this issue Mar 5, 2025 · 3 comments

Comments


abcdabcd987 commented Mar 5, 2025

Hi, I ran into the following assertion when running the low latency kernel test over 4 nodes.

$ RANK=0 MASTER_ADDR=192.168.0.117 WORLD_SIZE=4 python tests/test_low_latency.py                                                                  
Allocating buffer size: 2116.290944 MB ...                                                                                                                                                                         
Assertion failed: /mnt/fs/DeepEP/csrc/kernels/internode_ll.cu:259, condition: num_recv_tokens != 0                                                                                                                 
Assertion failed: /mnt/fs/DeepEP/csrc/kernels/internode_ll.cu:259, condition: num_recv_tokens != 0                                                                                                                 
Assertion failed: /mnt/fs/DeepEP/csrc/kernels/internode_ll.cu:259, condition: num_recv_tokens != 0                                                                                                                 
Assertion failed: /mnt/fs/DeepEP/csrc/kernels/internode_ll.cu:259, condition: num_recv_tokens != 0                                                                                                                 
Assertion failed: /mnt/fs/DeepEP/csrc/kernels/internode_ll.cu:259, condition: num_recv_tokens != 0                                                                                                                 
Assertion failed: /mnt/fs/DeepEP/csrc/kernels/internode_ll.cu:259, condition: num_recv_tokens != 0                                                                                                                 
Assertion failed: /mnt/fs/DeepEP/csrc/kernels/internode_ll.cu:259, condition: num_recv_tokens != 0                                                                                                                 
[rank5]:[E305 23:13:27.395915667 ProcessGroupNCCL.cpp:1895] [PG ID 1 PG GUID 1 Rank 5] Process group watchdog thread terminated with exception: CUDA error: unspecified launch failure                             
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                                                            
For debugging consider passing CUDA_LAUNCH_BLOCKING=1                                                                                                                                                              
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                                                                
                                                                                                                                                                                                                   
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):                                                                                            
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f47afd7d788 in /mnt/fs/conda/envs/DeepEP/lib/python3.10/site-packages/torch/lib/libc10.so)                                                                                                                                                                                               
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f47afd26ea8 in /mnt/fs/conda/envs/DeepEP/lib/python3.10/site-packages/torch/lib/libc10.so)                                                                                                                                                           
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f47b017e3d2 in /mnt/fs/conda/envs/DeepEP/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)           
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f475f5ce176 in /mnt/fs/conda/envs/DeepEP/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)                      
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa8 (0x7f475f5de658 in /mnt/fs/conda/envs/DeepEP/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)                                             
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x63d (0x7f475f5dfbfd in /mnt/fs/conda/envs/DeepEP/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)                                                  
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f475f5e0c4d in /mnt/fs/conda/envs/DeepEP/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)                                                 
frame #7: <unknown function> + 0xdc253 (0x7f474f8b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)                                                                                                                    
frame #8: <unknown function> + 0x94ac3 (0x7f47b0dc1ac3 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                         
frame #9: <unknown function> + 0x126850 (0x7f47b0e53850 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                        
                                                                                                                                                                                                                   
terminate called after throwing an instance of 'c10::DistBackendError'                                                                                                                                             
  what():  [PG ID 1 PG GUID 1 Rank 5] Process group watchdog thread terminated with exception: CUDA error: unspecified launch failure                                                                              
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                                                            
For debugging consider passing CUDA_LAUNCH_BLOCKING=1                                                                                                                                                              
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                                                                                                                 

I tried building the kernels both with and without DISABLE_AGGRESSIVE_PTX_INSTRS=1; it did not help.

The GPUs are H100s, the NICs are ConnectX-7 (MT_0000000838), and the link layer is InfiniBand.

The error is the same even on a single node (WORLD_SIZE=1).
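
For context, here is my rough reading of the check that fires. This is a paraphrased sketch, not the actual DeepEP source: the indexing expression is what the assertion points at, while the surrounding function and the exact helper/assert names are my assumptions.

```cuda
// Paraphrased sketch of the receiver-side check around internode_ll.cu:259.
// ld_acquire_global and EP_DEVICE_ASSERT are assumed to be DeepEP's own helpers.
__device__ void check_recv_count(const int *rdma_recv_count,
                                 int local_expert_idx, int num_ranks, int src_rank) {
    // The sending rank should have RDMA-written a non-zero (specially encoded)
    // token count for this (local expert, source rank) pair before we get here.
    int num_recv_tokens = ld_acquire_global(
        rdma_recv_count + local_expert_idx * num_ranks + src_rank);
    // Reading 0 means that write was never observed on the receiving GPU,
    // which is exactly what fails in my runs.
    EP_DEVICE_ASSERT(num_recv_tokens != 0);
}
```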

Do you have any idea where things might be going wrong? Thanks!

@abcdabcd987 (Author)

Here's the full log with NVSHMEM_DEBUG=TRACE in case it helps: test_low_latency.txt

@LyricZhao (Collaborator)

Can you confirm that the NVSHMEM patch is correctly installed? This looks like a problem with network adaptive routing (AR). You could also try toggling AR on or off for your network.

cc @sphish

@LyricZhao (Collaborator)

Or could you please try changing `num_recv_tokens = ld_acquire_global(rdma_recv_count + local_expert_idx * num_ranks + src_rank);` to use `ld_acquire_sys_global` instead?
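
For reference, the difference between the two is just the memory scope of the underlying acquire load. A minimal sketch, assuming the helpers are thin wrappers over PTX `ld.acquire` (the real implementations live in DeepEP's utility headers and may differ in detail):

```cuda
// Sketch only: assumed shapes of the two acquire-load helpers.
__device__ __forceinline__ int ld_acquire_global(const int *ptr) {
    int ret;
    // GPU scope: the acquire only orders this load against writes
    // made by other threads on the same GPU.
    asm volatile("ld.acquire.gpu.global.s32 %0, [%1];" : "=r"(ret) : "l"(ptr));
    return ret;
}

__device__ __forceinline__ int ld_acquire_sys_global(const int *ptr) {
    int ret;
    // System scope: also covers writes arriving from other agents in the
    // system, e.g. the NIC delivering the RDMA'd receive count.
    asm volatile("ld.acquire.sys.global.s32 %0, [%1];" : "=r"(ret) : "l"(ptr));
    return ret;
}

// Proposed change at internode_ll.cu:259:
// num_recv_tokens = ld_acquire_sys_global(
//     rdma_recv_count + local_expert_idx * num_ranks + src_rank);
```

If the count is written by the NIC rather than by a local SM, a GPU-scoped acquire may not be strong enough to observe it, which would be consistent with the zero reads you are seeing.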
