feat: Support Disaggregated Prefilling (experimental) #7

Closed

gaocegege opened this issue Jan 23, 2025 · 5 comments
@gaocegege
Collaborator

Thanks for the project.

The documentation at https://docs.vllm.ai/en/latest/features/disagg_prefill.html introduces a proxy server along with prefill and decode instances. I am uncertain whether the proxy server overlaps with the router in this project.

However, I am confident that it is not compatible with the Helm chart. Ideally, a Kubernetes Custom Resource Definition (CRD) should be implemented instead of a Helm chart to accommodate more complex deployment configurations.

Just raising this for discussion—it shouldn't be considered a high priority at this time.
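
For context, here is a minimal sketch of what I understand that proxy to do: run the request through a prefill instance first (with max_tokens=1, so it only builds and transfers the KV cache), then replay it against a decode instance. The URLs, ports, and model name below are placeholders for illustration, not the actual vLLM example code.

```python
# Sketch of the proxy behaviour described in the vLLM disaggregated-prefill
# docs. Endpoints and ports are assumptions for illustration only.
import requests

PREFILL_URL = "http://localhost:8100/v1/completions"  # assumed prefill instance
DECODE_URL = "http://localhost:8200/v1/completions"   # assumed decode instance


def proxy_completion(payload: dict) -> dict:
    # Step 1: prefill pass; max_tokens=1 means the instance effectively only
    # computes the prompt's KV cache and hands it off.
    prefill_payload = {**payload, "max_tokens": 1}
    requests.post(PREFILL_URL, json=prefill_payload, timeout=300).raise_for_status()

    # Step 2: decode pass; the decode instance reuses the transferred KV cache
    # and generates the full completion.
    response = requests.post(DECODE_URL, json=payload, timeout=300)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    print(proxy_completion({
        "model": "facebook/opt-125m",
        "prompt": "Disaggregated prefilling separates",
        "max_tokens": 32,
    }))
```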

@ApostaC
Collaborator

ApostaC commented Jan 23, 2025

Thanks @gaocegege! We are currently discussing potential solutions with @KuntaiDu (the main contributor of the vLLM disaggregated prefill functionality).

One potential solution is to integrate the proxy server functionality into the router, so it does not need extra k8s-level configuration.

@ApostaC
Collaborator

ApostaC commented Jan 23, 2025

Will create an RFC issue once we have a more concrete design.

@gaocegege
Collaborator Author

Thanks, I am closing this since there will be an RFC.

@KuntaiDu
Collaborator

@gaocegege The router in vLLM currently does not overlap with the existing router in this project, but in the future we will rewrite the router in vLLM so that it interacts with the router of this project.

The router's job splits roughly into two layers:

  • Global router, which handles global request orchestration (e.g. fault tolerance, service discovery, prefix-cache-aware routing) and should be implemented by exterior code (e.g. this project).
  • Local router, which handles the inference of a single request; the vLLM project provides an example implementation. Of course, the local router needs some information from the global router, so we will rewrite this router in the future (see the sketch below).
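
Roughly, something like the following. All names here (GlobalRouter, LocalRouter, InstancePair) are made up purely to illustrate the split in responsibilities; they are not an existing API of vLLM or this project.

```python
# Hypothetical sketch of the two-layer router split; names are illustrative only.
from dataclasses import dataclass

import requests


@dataclass
class InstancePair:
    prefill_url: str
    decode_url: str


class GlobalRouter:
    """Global layer: fault tolerance, service discovery, prefix-cache-aware
    routing across all instances (the part this project would own)."""

    def __init__(self, pairs: list[InstancePair]):
        self.pairs = pairs

    def pick(self, prompt: str) -> InstancePair:
        # Placeholder policy; a real router would prefer instances that already
        # hold a matching prefix cache and would balance load.
        return self.pairs[hash(prompt) % len(self.pairs)]


class LocalRouter:
    """Local layer: drives a single request through prefill then decode
    (vLLM provides an example implementation of this part)."""

    def infer(self, pair: InstancePair, payload: dict) -> dict:
        # Prefill pass only builds/transfers the KV cache for the prompt.
        requests.post(pair.prefill_url, json={**payload, "max_tokens": 1},
                      timeout=300).raise_for_status()
        # Decode pass reuses that cache and generates the tokens.
        return requests.post(pair.decode_url, json=payload, timeout=300).json()
```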

@gaocegege
Collaborator Author

Sounds reasonable. Thanks for the explanation! I am looking forward to the RFC, then.
