[Feature]: Support torch.distributed as the runtime for multi-node inference #12511
Comments
Why not simply use TorchTrainer from the Ray Train library?
I aim to simplify the deployment of multi-node inference using the vLLM Production Stack instead of configuring a Ray cluster on Kubernetes. I'm concerned that TorchTrainer may not be beneficial for this purpose.
I am in favor of this proposal: the knowledge required to set up Ray, and to debug it when anything goes wrong, is substantial. Providing a Ray-free version for users who just want inference, and a Ray-based SPMD version for advanced users such as OpenRLHF, is a valid split.
Yeah, this is reasonable; I raised a similar issue earlier: #3902
I'll give it a try, though I don't have much time to dedicate to it. We could adopt a design similar to this PR. The key difference is that workers (excluding rank 0) should enter a loop and wait for inputs from the driver (the rank 0 worker).
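A minimal sketch of that control flow, assuming a plain `torch.distributed` process group; the executor object, `execute_model` call, and the `None` shutdown sentinel are illustrative placeholders, not vLLM's actual worker API:

```python
# Illustrative only: rank 0 acts as the driver, all other ranks block in a
# loop waiting for broadcast work items. `executor.execute_model` and the
# None shutdown sentinel are placeholders, not vLLM's real worker interface.
import torch.distributed as dist

def worker_loop(executor):
    """Run on every rank except 0: wait for inputs from the driver."""
    while True:
        payload = [None]
        dist.broadcast_object_list(payload, src=0)   # receive from rank 0
        request = payload[0]
        if request is None:                          # shutdown sentinel
            break
        executor.execute_model(request)

def driver_step(executor, request):
    """Run on rank 0: broadcast the request, then execute the same step."""
    dist.broadcast_object_list([request], src=0)     # send to all other ranks
    return executor.execute_model(request)
```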
I am working on this and have a question. The main PyTorch API for multi-node distributed inference is FSDP, but FSDP shards model data across GPUs by taking the full model as input. Having to shard the full model manually seems orthogonal to our current implementation of multi-node inference using Ray and multiprocessing (for single-node). I have not looked at the Ray distributed executor yet, but when looking over `mp_distributed_executor` I noticed that sharding of the model happens at a much lower level: in `_init_executor` we call `_run_workers("load_model", ...)`, which calls `load_model` in `gpu_model_runner.py`, which in turn calls `load_model` in (WLOG) the `ShardedStateLoader` class in `loader.py`. I am assuming we want to use FSDP for multi-node inference, but the architecture would be very different from Ray-based distributed inference. Am I overthinking this?
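For context, this is roughly the shape FSDP expects: the full module is constructed first, and the wrapper shards its parameters afterwards. A minimal standalone PyTorch sketch, not vLLM code:

```python
# Standalone PyTorch sketch, not vLLM code: the full module exists first,
# and FSDP shards its parameters across the process group afterwards.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")      # e.g. launched via torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for a full model
model = FSDP(model)                          # parameters get sharded here

with torch.no_grad():
    out = model(torch.randn(1, 4096, device="cuda"))
```

That contrast, wrap-then-shard versus each worker loading already-sharded weights, is what makes it awkward to slot FSDP into the existing `load_model` path.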
From my perspective, I do not think we can use FSDP, since we call the workers to load the model (each worker loads its own weights rather than receiving a shard of a fully constructed model).
After chatting with @youkaichao, we agreed that torchrun might be a better fit for launching the processes than vllm itself; vLLM under torchrun would look like #3902 (comment)
torchrun has a robust ecosystem and is a well-established launcher. For instance, it supports different rendezvous (rdzv) backends such as c10d and etcd, making it highly versatile.
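For illustration, a torchrun-launched script simply reads the rendezvous results from the environment; the node counts and rendezvous endpoint in the comment below are example values, not a prescription:

```python
# Example launch (one command per node; all values are illustrative):
#   torchrun --nnodes=2 --nproc-per-node=8 \
#            --rdzv-backend=c10d --rdzv-endpoint=head-node:29500 worker.py
#
# torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
# so the script only needs the env:// init method.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} ready on local GPU {local_rank}")
```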
🚀 The feature, motivation and pitch
We currently support Ray-based distributed inference, which requires Ray. This issue requests multi-node support based on `torch.distributed`.

Usage Example:
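The original example is not preserved in this copy of the issue. Purely as an illustration of the kind of workflow being requested, one could imagine each rank being launched by torchrun and constructing the engine itself; the backend name below is a hypothetical placeholder, not an existing vLLM option:

```python
# Purely hypothetical sketch: the issue's original example is not preserved,
# and the "torch_distributed" backend value is a placeholder, not a real
# vLLM flag. Imagined launch (one command per node):
#   torchrun --nnodes=2 --nproc-per-node=8 run_inference.py
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",                          # example model only
    tensor_parallel_size=16,
    distributed_executor_backend="torch_distributed",   # placeholder backend name
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
```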
Alternatives
No response
Additional context
No response