distributed inference multi-node communication bug #14340

Open
leo4678 opened this issue Mar 6, 2025 · 2 comments

leo4678 commented Mar 6, 2025

I set NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME, then ran run_cluster.sh:

# master
bash run_cluster.sh vllm/vllm-openai xx.xx.xx.12 --head /root/path -e VLLM_HOST_IP=xx.xx.xx.12 -e NCCL_SOCKET_IFNAME=ens49f0 -e GLOO_SOCKET_IFNAME=ens49f0

# worker
bash run_cluster.sh vllm/vllm-openai xx.xx.xx.12 --worker /root/path -e VLLM_HOST_IP=xx.xx.xx.15 -e NCCL_SOCKET_IFNAME=ens49f0 -e GLOO_SOCKET_IFNAME=ens49f0

Then I ran NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=xx.xx.xx.12 test.py
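
For context, test.py here is the multi-node communication sanity check referenced in the vLLM troubleshooting doc; the exact script is not reproduced in this issue, but a minimal sketch along those lines (assuming it simply verifies an all_reduce over both NCCL and Gloo) looks like this:

# test.py (sketch): verify GPU (NCCL) and CPU (Gloo) all_reduce across all ranks
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# NCCL all_reduce on GPU: each rank contributes ones, so the mean equals world_size
data = torch.ones(128, device=f"cuda:{local_rank}")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
assert data.mean().item() == dist.get_world_size()

# Gloo all_reduce on CPU over the same ranks
gloo_group = dist.new_group(ranks=list(range(dist.get_world_size())), backend="gloo")
cpu_data = torch.ones(128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
assert cpu_data.mean().item() == dist.get_world_size()

print("sanity check is successful!")
dist.destroy_process_group()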

The result is:

W0306 00:02:25.387000 473 torch/distributed/run.py:793] 
W0306 00:02:25.387000 473 torch/distributed/run.py:793] *****************************************
W0306 00:02:25.387000 473 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0306 00:02:25.387000 473 torch/distributed/run.py:793] *****************************************
[E306 00:03:07.842961975 socket.cpp:1011] [c10d] The client socket has timed out after 60000ms while trying to connect to (xx.xx.xx.12, 29400).
[W306 00:03:07.843593274 TCPStore.cpp:358] [c10d] TCP client failed to connect/validate to host xx.xx.xx.12:29400 - retrying (try=0, timeout=60000ms, delay=35098ms): The client socket has timed out after 60000ms while trying to connect to (xx.xx.xx.12, 29400).
Exception raised from throwTimeoutError at ../torch/csrc/distributed/c10d/socket.cpp:1013 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbd9576c446 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x15e04c6 (0x7fbd809464c6 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x6029d95 (0x7fbd8538fd95 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x6029f36 (0x7fbd8538ff36 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x602a3a4 (0x7fbd853903a4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x5fe8016 (0x7fbd8534e016 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::TCPStore(std::string, c10d::TCPStoreOptions const&) + 0x20c (0x7fbd85350f7c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xda37d9 (0x7fbd94d3c7d9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x4cc1e3 (0x7fbd944651e3 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #9: /usr/bin/python3() [0x57856d]
frame #10: _PyObject_MakeTpCall + 0x2db (0x547cdb in /usr/bin/python3)
frame #11: /usr/bin/python3() [0x5a60ae]
frame #12: _PyObject_Call + 0xed (0x58a5ad in /usr/bin/python3)
frame #13: /usr/bin/python3() [0x587a15]
frame #14: /usr/bin/python3() [0x5483a4]
frame #15: <unknown function> + 0x4ca90b (0x7fbd9446390b in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #16: _PyObject_MakeTpCall + 0x2db (0x547cdb in /usr/bin/python3)
frame #17: _PyEval_EvalFrameDefault + 0x705 (0x5532e5 in /usr/bin/python3)
frame #18: _PyObject_FastCallDictTstate + 0x1d8 (0x54a6d8 in /usr/bin/python3)
frame #19: _PyObject_Call_Prepend + 0x59 (0x587c99 in /usr/bin/python3)
frame #20: /usr/bin/python3() [0x671a6d]
frame #21: _PyObject_Call + 0x93 (0x58a553 in /usr/bin/python3)
frame #22: _PyEval_EvalFrameDefault + 0x54dd (0x5580bd in /usr/bin/python3)
frame #23: PyEval_EvalCode + 0x99 (0x626089 in /usr/bin/python3)
frame #24: /usr/bin/python3() [0x64c84b]
frame #25: /usr/bin/python3() [0x647ad6]
frame #26: /usr/bin/python3() [0x65fc35]
frame #27: _PyRun_SimpleFileObject + 0x1a5 (0x65f205 in /usr/bin/python3)
frame #28: _PyRun_AnyFileObject + 0x47 (0x65ee97 in /usr/bin/python3)
frame #29: Py_RunMain + 0x2e8 (0x658008 in /usr/bin/python3)
frame #30: Py_BytesMain + 0x2d (0x61143d in /usr/bin/python3)
frame #31: <unknown function> + 0x29d90 (0x7fbd9635dd90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #32: __libc_start_main + 0x80 (0x7fbd9635de40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: _start + 0x25 (0x6112b5 in /usr/bin/python3)

[E306 00:04:31.007128890 socket.cpp:1011] [c10d] The client socket has timed out after 60000ms while trying to connect to (xx.xx.xx.12, 29400).
[E306 00:04:31.007413747 TCPStore.cpp:346] [c10d] TCP client failed to connect/validate to host xx.xx.xx.12:29400 - timed out (try=1, timeout=60000ms): The client socket has timed out after 60000ms while trying to connect to (xx.xx.xx.12, 29400).
Exception raised from throwTimeoutError at ../torch/csrc/distributed/c10d/socket.cpp:1013 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbd9576c446 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x15e04c6 (0x7fbd809464c6 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x6029d95 (0x7fbd8538fd95 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x6029f36 (0x7fbd8538ff36 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x602a3a4 (0x7fbd853903a4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x5fe8016 (0x7fbd8534e016 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::TCPStore(std::string, c10d::TCPStoreOptions const&) + 0x20c (0x7fbd85350f7c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xda37d9 (0x7fbd94d3c7d9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x4cc1e3 (0x7fbd944651e3 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #9: /usr/bin/python3() [0x57856d]
frame #10: _PyObject_MakeTpCall + 0x2db (0x547cdb in /usr/bin/python3)
frame #11: /usr/bin/python3() [0x5a60ae]
frame #12: _PyObject_Call + 0xed (0x58a5ad in /usr/bin/python3)
frame #13: /usr/bin/python3() [0x587a15]
frame #14: /usr/bin/python3() [0x5483a4]
frame #15: <unknown function> + 0x4ca90b (0x7fbd9446390b in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #16: _PyObject_MakeTpCall + 0x2db (0x547cdb in /usr/bin/python3)
frame #17: _PyEval_EvalFrameDefault + 0x705 (0x5532e5 in /usr/bin/python3)
frame #18: _PyObject_FastCallDictTstate + 0x1d8 (0x54a6d8 in /usr/bin/python3)
frame #19: _PyObject_Call_Prepend + 0x59 (0x587c99 in /usr/bin/python3)
frame #20: /usr/bin/python3() [0x671a6d]
frame #21: _PyObject_Call + 0x93 (0x58a553 in /usr/bin/python3)
frame #22: _PyEval_EvalFrameDefault + 0x54dd (0x5580bd in /usr/bin/python3)
frame #23: PyEval_EvalCode + 0x99 (0x626089 in /usr/bin/python3)
frame #24: /usr/bin/python3() [0x64c84b]
frame #25: /usr/bin/python3() [0x647ad6]
frame #26: /usr/bin/python3() [0x65fc35]
frame #27: _PyRun_SimpleFileObject + 0x1a5 (0x65f205 in /usr/bin/python3)
frame #28: _PyRun_AnyFileObject + 0x47 (0x65ee97 in /usr/bin/python3)
frame #29: Py_RunMain + 0x2e8 (0x658008 in /usr/bin/python3)
frame #30: Py_BytesMain + 0x2d (0x61143d in /usr/bin/python3)
frame #31: <unknown function> + 0x29d90 (0x7fbd9635dd90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #32: __libc_start_main + 0x80 (0x7fbd9635de40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #33: _start + 0x25 (0x6112b5 in /usr/bin/python3)

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 159, in _create_tcp_store
    store = TCPStore(
            ^^^^^^^^^
torch.distributed.DistNetworkError: The client socket has timed out after 60000ms while trying to connect to (xx.xx.xx.12, 29400).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    rdzv_handler=rdzv_registry.get_rendezvous_handler(rdzv_parameters),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/rendezvous/registry.py", line 71, in get_rendezvous_handler
    return handler_registry.create_handler(params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/rendezvous/api.py", line 365, in create_handler
    handler = creator(params)
              ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/rendezvous/registry.py", line 41, in _create_c10d_handler
    backend, store = create_backend(params)
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 257, in create_backend
    store = _create_tcp_store(params)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 183, in _create_tcp_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.

The ray status output is OK:

Node status
---------------------------------------------------------------
Active:
 1 node_3ac81a1a5811692c1453abcb4134db744f4e54da56f1a8b08e92331e
 1 node_01f7663c1735266e66143ee685fc0a408584d25d916f3703b1087e32
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/352.0 CPU
 0.0/16.0 GPU
 0B/3.91TiB memory
 0B/19.46GiB object_store_memory

Demands:
 (no resource demands)

My environment:

Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.4
Libc version: glibc-2.35

Python version: 3.12.9 (main, Feb  5 2025, 08:49:00) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-208-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA H20
GPU 1: NVIDIA H20
GPU 2: NVIDIA H20
GPU 3: NVIDIA H20
GPU 4: NVIDIA H20
GPU 5: NVIDIA H20
GPU 6: NVIDIA H20
GPU 7: NVIDIA H20

Nvidia driver version: 550.54.14
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             128
On-line CPU(s) list:                0-127
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8462Y+
CPU family:                         6
Model:                              143
Thread(s) per core:                 2
Core(s) per socket:                 32
Socket(s):                          2
Stepping:                           8
Frequency boost:                    enabled
CPU max MHz:                        2801.0000
CPU min MHz:                        800.0000
BogoMIPS:                           5600.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear pconfig flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          3 MiB (64 instances)
L1i cache:                          2 MiB (64 instances)
L2 cache:                           128 MiB (64 instances)
L3 cache:                           120 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-31,64-95
NUMA node1 CPU(s):                  32-63,96-127
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] flashinfer-python==0.2.1.post1+cu124torch2.5
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1
[pip3] torchaudio==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.49.0
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.7.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  CPU Affinity  NUMA Affinity GPU NUMA ID
GPU0   X  NV18  NV18  NV18  NV18  NV18  NV18  NV18  SYS SYS SYS 0-31,64-95  0 N/A
GPU1  NV18   X  NV18  NV18  NV18  NV18  NV18  NV18  SYS SYS SYS 0-31,64-95  0 N/A
GPU2  NV18  NV18   X  NV18  NV18  NV18  NV18  NV18  SYS SYS SYS 0-31,64-95  0 N/A
GPU3  NV18  NV18  NV18   X  NV18  NV18  NV18  NV18  SYS SYS SYS 0-31,64-95  0 N/A
GPU4  NV18  NV18  NV18  NV18   X  NV18  NV18  NV18  SYS SYS SYS 32-63,96-127  1 N/A
GPU5  NV18  NV18  NV18  NV18  NV18   X  NV18  NV18  SYS SYS SYS 32-63,96-127  1 N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18   X  NV18  SYS SYS SYS 32-63,96-127  1 N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18   X  SYS SYS SYS 32-63,96-127  1 N/A
NIC0  SYS SYS SYS SYS SYS SYS SYS SYS  X  PIX PHB       
NIC1  SYS SYS SYS SYS SYS SYS SYS SYS PIX  X  PHB       
NIC2  SYS SYS SYS SYS SYS SYS SYS SYS PHB PHB  X        

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2

NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526
NCCL_VERSION=2.17.1-1
NCCL_SOCKET_IFNAME=ens49f0
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
VLLM_USAGE_SOURCE=production-docker-image
NVIDIA_CUDA_END_OF_LIFE=1
CUDA_VERSION=12.1.0
LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
VLLM_HOST_IP=xx.xx.xx.12
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

Output of ip addr show:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens13f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 10:ff:e0:0b:4a:46 brd ff:ff:ff:ff:ff:ff
3: ens13f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 10:ff:e0:0b:4a:47 brd ff:ff:ff:ff:ff:ff
4: ens49f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether e8:eb:d3:53:7c:60 brd ff:ff:ff:ff:ff:ff
    inet xx.xx.xx.12/24 brd xx.xx.xx.255 scope global ens49f0
       valid_lft forever preferred_lft forever
    inet6 fe80::eaeb:d3ff:fe53:7c60/64 scope link 
       valid_lft forever preferred_lft forever
5: ens49f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether e8:eb:d3:53:7c:61 brd ff:ff:ff:ff:ff:ff
6: ens53: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 98:03:9b:a3:cc:8c brd ff:ff:ff:ff:ff:ff
    inet6 fe80::9a03:9bff:fea3:cc8c/64 scope link 
       valid_lft forever preferred_lft forever
8: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:42:55:fe:f9 brd ff:ff:ff:ff:ff:ff
    inet xx.xx.xx.1/16 brd xx.xx.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:42ff:fe55:fef9/64 scope link 
       valid_lft forever preferred_lft forever
10: vethb2407d0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default 
    link/ether 52:2a:e6:b5:22:4c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::502a:e6ff:feb5:224c/64 scope link 
       valid_lft forever preferred_lft forever

Originally posted by @leo4678 in #6775

youkaichao (Member) commented:

Please check the doc: https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#incorrect-network-setup

A multi-node environment is more complicated than a single-node one. If you see errors such as torch.distributed.DistNetworkError, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:

On the first node, run NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py.

On the second node, run NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py.

Adjust --nproc-per-node, --nnodes, and --node-rank according to your setup, being sure to execute different commands (with different --node-rank) on different nodes.

leo4678 (Author) commented Mar 6, 2025

It has been resolved. I had made three mistakes (a quick way to verify the fix is sketched after this list):

  1. On the master node, --rdzv_endpoint must be 127.0.0.1.
  2. The command must be run simultaneously on the master node and the worker node:
    master: NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1 test.py
    worker: NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=xxxx.12 test.py
  3. --nproc-per-node means the number of GPU cards per node, not the total number of nodes.
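
To double-check that both nodes actually join the rendezvous after these fixes, a tiny helper can be launched with the same torchrun commands in place of test.py. This is a hypothetical script (not the test.py from this issue); with --nnodes 2 and --nproc-per-node=8 it should report world_size == 16 on every rank:

# check_rendezvous.py (hypothetical helper): confirm all ranks from both nodes joined
import os
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # CPU backend is enough to verify the rendezvous
rank = dist.get_rank()
world_size = dist.get_world_size()
print(f"rank {rank}/{world_size} joined on host {os.uname().nodename}")
# With 2 nodes x 8 processes, this should print sixteen lines with world_size == 16.
dist.destroy_process_group()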
