[Feature]: Support torch.distributed as the runtime for multi-node inference #12511

Open
gaocegege opened this issue Jan 28, 2025 · 8 comments
Labels: feature request

Comments

@gaocegege
Contributor

gaocegege commented Jan 28, 2025

🚀 The feature, motivation and pitch

We currently support multi-node distributed inference via Ray, which requires setting up a Ray cluster. This issue requests supporting torch.distributed as the runtime for multi-node inference.

Usage Example:

# Server 1
vllm serve model_tag --nnodes 2 --rank 0 --dist-init-addr 192.168.0.1:5000 

# Server 2
vllm serve model_tag --nnodes 2 --rank 1 --dist-init-addr 192.168.0.1:5000
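
For context, these proposed flags would roughly map onto a torch.distributed process-group initialization with a TCP rendezvous on the rank-0 node. A minimal sketch of that mapping (the flag-to-argument correspondence is illustrative, not an actual vLLM API):

import torch.distributed as dist

# Illustrative only: how --nnodes / --rank / --dist-init-addr could translate
# into torch.distributed arguments, assuming one process per node.
dist.init_process_group(
    backend="nccl",                        # GPU collectives; "gloo" for CPU-only tests
    init_method="tcp://192.168.0.1:5000",  # --dist-init-addr (the rank-0 address)
    rank=0,                                # --rank (0 on server 1, 1 on server 2)
    world_size=2,                          # --nnodes
)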

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
gaocegege added the feature request label on Jan 28, 2025
@plops655

plops655 commented Feb 2, 2025

Why not simply use TorchTrainer in the Ray Train library?

@gaocegege
Contributor Author

Why not simply use TorchTrainer in the Ray Train library?

I aim to simplify the deployment of multi-node inference using the vLLM Production Stack instead of configuring a Ray cluster on Kubernetes. I'm concerned that TorchTrainer may not be beneficial for this purpose.

@tsaoyu

tsaoyu commented Feb 10, 2025

I am in favor of this proposal: the knowledge required to debug a Ray setup when anything goes wrong is substantial. Providing a Ray-free version for those who just want inference, and a Ray-based SPMD version for advanced users such as OpenRLHF, makes sense.

@Jeffwan
Contributor

Jeffwan commented Feb 10, 2025

Yeah, this is reasonable. I raised a similar issue earlier: #3902

@gaocegege
Contributor Author

I’ll give it a try, though I don’t have much time to dedicate to it. We could adopt a design similar to this PR. The key difference is that workers (excluding rank 0) should enter a loop and wait for inputs from the driver (rank 0 worker).
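
A minimal sketch of this driver/worker pattern, assuming torch.distributed is already initialized; execute_model and the request object are placeholders here, not vLLM's actual worker API:

import torch.distributed as dist

def worker_busy_loop(worker):
    # Non-driver workers (rank != 0) block here, waiting for the driver
    # (rank 0) to broadcast the next batch of inputs.
    while True:
        payload = [None]
        dist.broadcast_object_list(payload, src=0)
        request = payload[0]
        if request is None:               # sentinel from the driver: shut down
            break
        worker.execute_model(request)     # placeholder for the real worker call

def driver_dispatch(worker, request):
    # The driver (rank 0) broadcasts the inputs to everyone, then runs its
    # own share of the forward pass.
    dist.broadcast_object_list([request], src=0)
    return worker.execute_model(request)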

@plops655

I am working on this and have a question. The main PyTorch API for distributing inference across multiple nodes is FSDP. However, FSDP shards model data across GPUs by taking the full model as input.

Having to shard the full model that way seems orthogonal to our current implementation of multi-node inference using Ray and multiprocessing (for single-node).

I have not looked at the Ray distributed executor yet, but when looking over the mp_distributed_executor I noticed that sharding of the model happens at a much lower level: in _init_executor, we call _run_workers("load_model", ...), which calls load_model in gpu_model_runner.py, which in turn calls load_model in (for example) the ShardedStateLoader class in loader.py.

I am assuming we want to use FSDP for multi-node inference, but then the architecture would be very different from the Ray-based distributed inference.

Am I overthinking this?

@gaocegege
Contributor Author

From my perspective, I do not think we can use FSDP, since we ask the workers themselves to load the model (each loading its own shard) rather than sharding a fully loaded model.

@gaocegege
Contributor Author

gaocegege commented Feb 23, 2025

After chatting with @youkaichao, we agreed that torchrun might be a better fit for launching the processes than having vllm launch them itself.

With torchrun, it would look like #3902 (comment):

# single node, multi-gpu
torchrun --nproc-per-node=n -m vllm.entrypoints.openai.api_server $args

# multi node, on node 0
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip:port} -m vllm.entrypoints.openai.api_server $args
# multi node, on node 1
torchrun --nnodes 2 --nproc-per-node=n --rdzv_backend=c10d --rdzv_endpoint=${node_0_ip:port} -m vllm.entrypoints.openai.api_server $args

torchrun is a well-established launcher with a robust ecosystem; for instance, it supports multiple rendezvous backends such as c10d and etcd, making it highly versatile.
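
Under torchrun, each api_server process can pick up its distributed configuration from the environment variables torchrun sets (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). A rough sketch of that wiring, not vLLM's actual startup code:

import os
import torch.distributed as dist

# torchrun exports these for every process it launches.
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# With MASTER_ADDR/MASTER_PORT already in the environment,
# "env://" is enough to rendezvous.
dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {rank}/{world_size} (local rank {local_rank}) is ready")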
