[WIP, RFC] Production Stack on Ray Serve #195
Note that Ray Serve just open-sourced their LLM serving stack here: https://github.com/ray-project/ray/tree/master/python/ray/llm. The design is similar but does not include smart routing. We will keep our original design but also try to reuse some code from that link.
Hi @Hanchenli, do you have a roadmap or design doc for Ray LLM?
By the way, do we have a plan for SkyPilot support?
Not currently. We do plan to add Terraform support to cover multiple clouds.
This issue explains the structure of the upcoming Production Stack on Ray Serve.
The router will be reached through a DeploymentHandle, with FastAPI set as the Ray Serve ingress for OpenAI API compatibility.
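A minimal sketch of what that ingress could look like, assuming the standard Ray Serve FastAPI integration. The class name `Router`, the `inference_node` handle, and the `streaming_response` method it calls (described below) are illustrative, not the actual implementation:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from ray import serve
from ray.serve.handle import DeploymentHandle

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class Router:
    def __init__(self, inference_node: DeploymentHandle):
        # Handle to a downstream inference-node deployment.
        self.inference_node = inference_node

    @app.post("/v1/completions")
    async def completions(self, request: dict):
        # Forward the OpenAI-style request body to the inference node and
        # stream the generated chunks back to the client.
        gen = self.inference_node.options(stream=True).streaming_response.remote(request)

        async def stream():
            async for chunk in gen:
                yield chunk

        return StreamingResponse(stream(), media_type="text/event-stream")
```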
The inference nodes will each initialize with a subprocess running a vLLM + LMCache OpenAI-compatible server. The current design has the following functions (see the sketch after the list):
report_status, which returns the server status, including model_name.
streaming_response, which returns a streaming response to the router for POST requests such as "v1/completions".
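A hedged sketch of such an inference-node deployment. The exact command for the vLLM + LMCache server is an assumption; plain vLLM's OpenAI entrypoint is used as a stand-in, and the port and class name are illustrative:

```python
import subprocess

import httpx
from ray import serve


@serve.deployment
class InferenceNode:
    def __init__(self, model_name: str, port: int = 8100):
        self.model_name = model_name
        self.port = port
        # Launch an OpenAI-compatible vLLM server as a subprocess.
        # (In the actual stack this would be the vLLM + LMCache server.)
        self.proc = subprocess.Popen(
            ["python", "-m", "vllm.entrypoints.openai.api_server",
             "--model", model_name, "--port", str(port)]
        )

    def report_status(self) -> dict:
        # Return basic server status, including the served model name.
        return {"model_name": self.model_name, "alive": self.proc.poll() is None}

    async def streaming_response(self, request: dict):
        # Proxy the request to the local OpenAI-compatible server and
        # stream the response chunks back to the router.
        url = f"http://localhost:{self.port}/v1/completions"
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream("POST", url, json=request) as resp:
                async for chunk in resp.aiter_bytes():
                    yield chunk
```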
Session-based routing will be implemented with a dictionary in the router.
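A minimal sketch of that dictionary-based session affinity, assuming requests carry a session identifier (the field name `session_id` and the round-robin fallback are assumptions for illustration):

```python
from ray.serve.handle import DeploymentHandle


class SessionRouter:
    def __init__(self, inference_nodes: list[DeploymentHandle]):
        self.inference_nodes = inference_nodes
        # Maps session id -> inference node handle that served it before.
        self.session_to_node: dict[str, DeploymentHandle] = {}
        self._next = 0

    def pick_node(self, session_id: str) -> DeploymentHandle:
        # Reuse the node previously assigned to this session so its KV cache
        # can be reused; otherwise assign a node round-robin and remember it.
        if session_id not in self.session_to_node:
            node = self.inference_nodes[self._next % len(self.inference_nodes)]
            self._next += 1
            self.session_to_node[session_id] = node
        return self.session_to_node[session_id]
```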