A multi-node environment is more complicated than a single-node one. If you see errors such as torch.distributed.DistNetworkError, it is likely that the network or DNS setup is incorrect. In that case, you can manually assign each node's rank and specify the master IP via command-line arguments:
In the first node, run NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py.
In the second node, run NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py.
Adjust --nproc-per-node, --nnodes, and --node-rank according to your setup, where $MASTER_ADDR is an IP address of the first node (rank 0) that is reachable from all other nodes. Be sure to execute different commands (with different --node-rank) on different nodes.
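The source does not show the contents of test.py, but a minimal distributed sanity check along these lines would exercise the same NCCL path. This is a sketch under the assumption that test.py just verifies collective communication: each rank contributes its rank number to an all_reduce, so every rank should end up with the sum 0 + 1 + ... + (world_size - 1).

```python
def expected_allreduce_sum(world_size: int) -> int:
    """Sum of ranks 0..world_size-1 -- the value every rank should see."""
    return world_size * (world_size - 1) // 2

if __name__ == "__main__":
    # torch is imported here so the helper above stays importable without it.
    import torch
    import torch.distributed as dist

    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT in the env,
    # so init_process_group needs no explicit arguments beyond the backend.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    t = torch.tensor([float(rank)], device="cuda")
    dist.all_reduce(t)  # default op is SUM across all ranks
    assert t.item() == expected_allreduce_sum(world_size)
    print(f"rank {rank}/{world_size}: all_reduce OK, sum = {t.item()}")
    dist.destroy_process_group()
```

If this script hangs or raises a network error, the NCCL_DEBUG=TRACE output from the commands above usually points at the interface or address that NCCL failed to use.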
Originally posted by @leo4678 in #6775