Can you confirm that the NVSHMEM patch is correctly installed? This looks like a problem with network adaptive routing (AR). You could also try switching the AR setting on or off for your network.
Or could you please try changing `num_recv_tokens = ld_acquire_global(rdma_recv_count + local_expert_idx * num_ranks + src_rank);` to use `ld_acquire_sys_global` instead?
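To make the suggested change concrete, here is a minimal sketch of what the two loads might look like. It assumes both helpers are thin wrappers around PTX acquire loads that differ only in scope (`.gpu` versus `.sys`). The wrapper bodies, the `wait_for_count` kernel, and the host code are illustrative stand-ins, not the project's actual source. The idea is that a counter written from outside the GPU, for example by the NIC, may need a system-scope acquire to be observed reliably.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed shape of the GPU-scope helper: the acquire is ordered only
// against writes made by threads on the same GPU.
__device__ __forceinline__ int ld_acquire_global(const int* ptr) {
    int ret;
    asm volatile("ld.acquire.gpu.global.s32 %0, [%1];"
                 : "=r"(ret) : "l"(ptr) : "memory");
    return ret;
}

// Assumed shape of the system-scope helper: the acquire is also ordered
// against writes from outside the GPU (host, peer GPUs, other devices).
__device__ __forceinline__ int ld_acquire_sys_global(const int* ptr) {
    int ret;
    asm volatile("ld.acquire.sys.global.s32 %0, [%1];"
                 : "=r"(ret) : "l"(ptr) : "memory");
    return ret;
}

// Hypothetical stand-in for the receive loop: spin until the counter
// becomes non-zero, then publish the value that was read.
__global__ void wait_for_count(const int* recv_count, int* out) {
    int n;
    while ((n = ld_acquire_sys_global(recv_count)) == 0) {
    }
    *out = n;
}

int main() {
    int* recv_count = nullptr;
    int* out = nullptr;
    cudaMalloc(&recv_count, sizeof(int));
    cudaMalloc(&out, sizeof(int));

    // Pre-set the counter so the kernel cannot spin forever in this demo.
    int initial = 7;
    cudaMemcpy(recv_count, &initial, sizeof(int), cudaMemcpyHostToDevice);

    wait_for_count<<<1, 1>>>(recv_count, out);

    int result = 0;
    cudaMemcpy(&result, out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("counter read by kernel: %d\n", result);

    cudaFree(recv_count);
    cudaFree(out);
    return 0;
}
```

The scoped `ld.acquire` forms require compute capability 7.0 or newer, so the sketch has to be compiled for a matching target (for example `nvcc -arch=sm_90` on an H100).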
Hi, I ran into the following assertion when running the low latency kernel test over 4 nodes.
I tried building the kernels both with and without `DISABLE_AGGRESSIVE_PTX_INSTRS=1`. It did not help. The GPU is an H100. The NIC is a ConnectX-7 (MT_0000000838). The link layer is InfiniBand.
The error is the same even on a single node (`WORLD_SIZE=1`). Do you have any clue where things might be going wrong? Thanks!