[Feature]: Support torch.distributed as the runtime for multi-node inference #12511
Comments
Why not simply use TorchTrainer from the Ray Train library?
I aim to simplify the deployment of multi-node inference using the vLLM Production Stack instead of configuring a Ray cluster on Kubernetes. I'm concerned that TorchTrainer may not be beneficial for this purpose.
I am in favor of this proposal: the knowledge required to set up Ray, and to debug it when anything goes wrong, is substantial. Providing a Ray-free version for users who just want inference, and a Ray-based SPMD version for advanced users such as OpenRLHF, is a valid split.
Yeah, this is reasonable; I raised a similar issue earlier: #3902
I'll give it a try, though I don't have much time to dedicate to it. We could adopt a design similar to this PR. The key difference is that workers (excluding rank 0) should enter a loop and wait for inputs from the driver (the rank 0 worker).
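A minimal sketch of that control flow, assuming a plain `torch.distributed` process group; the executor object, `execute_model` call, and the `None` shutdown sentinel are illustrative placeholders, not vLLM's actual worker API:

```python
# Illustrative only: rank 0 acts as the driver, all other ranks block in a
# loop waiting for broadcast work items. `executor.execute_model` and the
# None shutdown sentinel are placeholders, not vLLM's real worker interface.
import torch.distributed as dist

def worker_loop(executor):
    """Run on every rank except 0: wait for inputs from the driver."""
    while True:
        payload = [None]
        dist.broadcast_object_list(payload, src=0)   # receive from rank 0
        request = payload[0]
        if request is None:                          # shutdown sentinel
            break
        executor.execute_model(request)

def driver_step(executor, request):
    """Run on rank 0: broadcast the request, then execute the same step."""
    dist.broadcast_object_list([request], src=0)     # send to all other ranks
    return executor.execute_model(request)
```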
I am working on this and have a question. The main PyTorch API for multi-node distributed inference is FSDP, but FSDP shards model data across GPUs by taking the full model as input. Having to shard the full model manually seems orthogonal to our current implementation of multi-node inference using Ray and multiprocessing (for single-node). I have not looked at the Ray distributed executor yet, but when looking over `mp_distributed_executor` I noticed that sharding of the model happens at a much lower level: in `_init_executor` we call `_run_workers("load_model", ...)`, which calls `load_model` in `gpu_model_runner.py`, which in turn calls `load_model` in (WLOG) the `ShardedStateLoader` class in `loader.py`. I am assuming we want to use FSDP for multi-node inference, but the architecture would be very different from Ray-based distributed inference. Am I overthinking this?
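For context, this is roughly the shape FSDP expects: the full module is constructed first, and the wrapper shards its parameters afterwards. A minimal standalone PyTorch sketch, not vLLM code:

```python
# Standalone PyTorch sketch, not vLLM code: the full module exists first,
# and FSDP shards its parameters across the process group afterwards.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")      # e.g. launched via torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for a full model
model = FSDP(model)                          # parameters get sharded here

with torch.no_grad():
    out = model(torch.randn(1, 4096, device="cuda"))
```

That contrast, wrap-then-shard versus each worker loading already-sharded weights, is what makes it awkward to slot FSDP into the existing `load_model` path.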
From my perspective, I do not think we can use FSDP, since we call the workers to load the model (each worker loads its own weights rather than receiving a shard of a fully constructed model).
After chatting with @youkaichao, we agreed that torchrun might be a better fit for launching the processes than vllm itself; vLLM under torchrun would look like #3902 (comment)
torchrun has a robust ecosystem and is a well-established launcher. For instance, it supports different rendezvous (rdzv) backends such as c10d and etcd, making it highly versatile.
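For illustration, a torchrun-launched script simply reads the rendezvous results from the environment; the node counts and rendezvous endpoint in the comment below are example values, not a prescription:

```python
# Example launch (one command per node; all values are illustrative):
#   torchrun --nnodes=2 --nproc-per-node=8 \
#            --rdzv-backend=c10d --rdzv-endpoint=head-node:29500 worker.py
#
# torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
# so the script only needs the env:// init method.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} ready on local GPU {local_rank}")
```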
🚀 The feature, motivation and pitch
We currently support Ray-based distributed inference, which requires Ray. This issue requests multi-node support based on `torch.distributed`.

Usage Example:
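The original example is not preserved in this copy of the issue. Purely as an illustration of the kind of workflow being requested, one could imagine each rank being launched by torchrun and constructing the engine itself; the backend name below is a hypothetical placeholder, not an existing vLLM option:

```python
# Purely hypothetical sketch: the issue's original example is not preserved,
# and the "torch_distributed" backend value is a placeholder, not a real
# vLLM flag. Imagined launch (one command per node):
#   torchrun --nnodes=2 --nproc-per-node=8 run_inference.py
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",                          # example model only
    tensor_parallel_size=16,
    distributed_executor_backend="torch_distributed",   # placeholder backend name
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
```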
Alternatives
No response
Additional context
No response