Assertion failed: internode_ll.cu:259, condition: num_recv_tokens != 0 #49

Open
abcdabcd987 opened this issue Mar 5, 2025 · 3 comments

Comments


abcdabcd987 commented Mar 5, 2025

Hi, I ran into the following assertion when running the low latency kernel test over 4 nodes.

$ RANK=0 MASTER_ADDR=192.168.0.117 WORLD_SIZE=4 python tests/test_low_latency.py                                                                  
Allocating buffer size: 2116.290944 MB ...                                                                                                                                                                         
Assertion failed: /mnt/fs/DeepEP/csrc/kernels/internode_ll.cu:259, condition: num_recv_tokens != 0                                                                                                                 
Assertion failed: /mnt/fs/DeepEP/csrc/kernels/internode_ll.cu:259, condition: num_recv_tokens != 0                                                                                                                 
Assertion failed: /mnt/fs/DeepEP/csrc/kernels/internode_ll.cu:259, condition: num_recv_tokens != 0                                                                                                                 
Assertion failed: /mnt/fs/DeepEP/csrc/kernels/internode_ll.cu:259, condition: num_recv_tokens != 0                                                                                                                 
Assertion failed: /mnt/fs/DeepEP/csrc/kernels/internode_ll.cu:259, condition: num_recv_tokens != 0                                                                                                                 
Assertion failed: /mnt/fs/DeepEP/csrc/kernels/internode_ll.cu:259, condition: num_recv_tokens != 0                                                                                                                 
Assertion failed: /mnt/fs/DeepEP/csrc/kernels/internode_ll.cu:259, condition: num_recv_tokens != 0                                                                                                                 
[rank5]:[E305 23:13:27.395915667 ProcessGroupNCCL.cpp:1895] [PG ID 1 PG GUID 1 Rank 5] Process group watchdog thread terminated with exception: CUDA error: unspecified launch failure                             
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                                                            
For debugging consider passing CUDA_LAUNCH_BLOCKING=1                                                                                                                                                              
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                                                                
                                                                                                                                                                                                                   
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):                                                                                            
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f47afd7d788 in /mnt/fs/conda/envs/DeepEP/lib/python3.10/site-packages/torch/lib/libc10.so)                                                                                                                                                                                               
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f47afd26ea8 in /mnt/fs/conda/envs/DeepEP/lib/python3.10/site-packages/torch/lib/libc10.so)                                                                                                                                                           
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f47b017e3d2 in /mnt/fs/conda/envs/DeepEP/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)           
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f475f5ce176 in /mnt/fs/conda/envs/DeepEP/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)                      
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa8 (0x7f475f5de658 in /mnt/fs/conda/envs/DeepEP/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)                                             
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x63d (0x7f475f5dfbfd in /mnt/fs/conda/envs/DeepEP/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)                                                  
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f475f5e0c4d in /mnt/fs/conda/envs/DeepEP/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)                                                 
frame #7: <unknown function> + 0xdc253 (0x7f474f8b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)                                                                                                                    
frame #8: <unknown function> + 0x94ac3 (0x7f47b0dc1ac3 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                         
frame #9: <unknown function> + 0x126850 (0x7f47b0e53850 in /lib/x86_64-linux-gnu/libc.so.6)                                                                                                                        
                                                                                                                                                                                                                   
terminate called after throwing an instance of 'c10::DistBackendError'                                                                                                                                             
  what():  [PG ID 1 PG GUID 1 Rank 5] Process group watchdog thread terminated with exception: CUDA error: unspecified launch failure                                                                              
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                                                            
For debugging consider passing CUDA_LAUNCH_BLOCKING=1                                                                                                                                                              
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                                                                                                                 

I tried building the kernels both with and without DISABLE_AGGRESSIVE_PTX_INSTRS=1; it did not help.

The GPUs are H100s, the NICs are ConnectX-7 (MT_0000000838), and the link layer is InfiniBand.

The error is the same even on a single node (WORLD_SIZE=1).
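
For context, here is my rough reading of the check that fires. This is a paraphrased sketch, not the actual DeepEP source: the indexing expression is what the assertion points at, while the surrounding function and the exact helper/assert names are my assumptions.

```cuda
// Paraphrased sketch of the receiver-side check around internode_ll.cu:259.
// ld_acquire_global and EP_DEVICE_ASSERT are assumed to be DeepEP's own helpers.
__device__ void check_recv_count(const int *rdma_recv_count,
                                 int local_expert_idx, int num_ranks, int src_rank) {
    // The sending rank should have RDMA-written a non-zero (specially encoded)
    // token count for this (local expert, source rank) pair before we get here.
    int num_recv_tokens = ld_acquire_global(
        rdma_recv_count + local_expert_idx * num_ranks + src_rank);
    // Reading 0 means that write was never observed on the receiving GPU,
    // which is exactly what fails in my runs.
    EP_DEVICE_ASSERT(num_recv_tokens != 0);
}
```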

Do you have any idea where things might be going wrong? Thanks!

@abcdabcd987 (Author)

Here's the full log with NVSHMEM_DEBUG=TRACE in case it helps: test_low_latency.txt

@LyricZhao (Collaborator)

Can you confirm that the NVSHMEM patch is correctly installed? This looks like a problem with network adaptive routing (AR). You could also try toggling AR on or off for your network.

cc @sphish

@LyricZhao (Collaborator)

Or could you please try changing `num_recv_tokens = ld_acquire_global(rdma_recv_count + local_expert_idx * num_ranks + src_rank);` to use `ld_acquire_sys_global` instead?
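
For reference, the difference between the two is just the memory scope of the underlying acquire load. A minimal sketch, assuming the helpers are thin wrappers over PTX `ld.acquire` (the real implementations live in DeepEP's utility headers and may differ in detail):

```cuda
// Sketch only: assumed shapes of the two acquire-load helpers.
__device__ __forceinline__ int ld_acquire_global(const int *ptr) {
    int ret;
    // GPU scope: the acquire only orders this load against writes
    // made by other threads on the same GPU.
    asm volatile("ld.acquire.gpu.global.s32 %0, [%1];" : "=r"(ret) : "l"(ptr));
    return ret;
}

__device__ __forceinline__ int ld_acquire_sys_global(const int *ptr) {
    int ret;
    // System scope: also covers writes arriving from other agents in the
    // system, e.g. the NIC delivering the RDMA'd receive count.
    asm volatile("ld.acquire.sys.global.s32 %0, [%1];" : "=r"(ret) : "l"(ptr));
    return ret;
}

// Proposed change at internode_ll.cu:259:
// num_recv_tokens = ld_acquire_sys_global(
//     rdma_recv_count + local_expert_idx * num_ranks + src_rank);
```

If the count is written by the NIC rather than by a local SM, a GPU-scoped acquire may not be strong enough to observe it, which would be consistent with the zero reads you are seeing.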
