Is prefix-aware routing implemented in this repo? #19

Closed
tdoublep opened this issue Jan 24, 2025 · 8 comments

Comments

@tdoublep
Member

It says "(WIP) prefix-aware routing" in the README here:
https://github.com/vllm-project/production-stack/tree/main/src/router

After a very quick scan through the Python files there, I can't see anything that implements it.

Can the results in the blog be reproduced using the code that is currently in this repo?

@gaocegege
Collaborator

From what I’ve seen in the vllm-dev.slack.com discussions, this feature might roll out in the next few weeks. I’m guessing the performance boost could come from session-id-based routing, which seems similar to how prefix caching works in regular QA tasks—basically, requests from the same session share the same prefix. Just my two cents, though.

@KuntaiDu
Collaborator

In the current results shown in the blog post, we use session-ID-based routing, which can achieve most of the prefix caching potential in general chat applications. We are working on prefix-aware routing (it will be on the roadmap). cc @ApostaC
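
For readers unfamiliar with the idea, here is a minimal sketch of session-ID-based routing. It is an illustration only, not the router's actual code: the helper name `pick_backend` and the backend URLs are hypothetical. The point is that a deterministic mapping sends every request from one session to the same engine, so that engine's prefix cache already holds the earlier turns of the conversation.

```python
import hashlib


def pick_backend(session_id: str, backends: list[str]) -> str:
    """Deterministically map a session ID to one backend (hypothetical helper)."""
    # Use a stable hash; Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Hypothetical engine endpoints, for illustration only.
backends = ["http://engine-0:8000", "http://engine-1:8000", "http://engine-2:8000"]

# Two requests from the same session land on the same engine.
first = pick_backend("session-abc", backends)
second = pick_backend("session-abc", backends)
```

Plain modulo hashing remaps most sessions whenever the backend list changes; a consistent-hash ring would reduce that, at the cost of more code.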

@gaocegege
Collaborator

I'm curious about how to implement this feature. Will we maintain a radix tree in the router to simulate the KV cache status, similar to the SGLang router?
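
To make the question concrete, here is a rough sketch of what a router-side prefix structure could look like. It assumes an uncompressed token-level trie rather than a true radix tree, and all names (`TrieNode`, `PrefixTrie`, `longest_match`) are hypothetical; it does not reproduce the SGLang router or anything in this repo. Each node records which backends have already served a prompt passing through it.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict[int, "TrieNode"] = field(default_factory=dict)
    backends: set[str] = field(default_factory=set)


class PrefixTrie:
    """Tracks which backend has (probably) cached which token prefix."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: list[int], backend: str) -> None:
        # Mark every prefix of `tokens` as seen by `backend`.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
            node.backends.add(backend)

    def longest_match(self, tokens: list[int]) -> tuple[int, set[str]]:
        # Walk as deep as possible; return the match length and the
        # backends that have seen that longest matching prefix.
        node = self.root
        matched = 0
        owners: set[str] = set()
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
            owners = node.backends
        return matched, set(owners)


trie = PrefixTrie()
trie.insert([1, 2, 3, 4], "engine-0")
trie.insert([1, 2, 9], "engine-1")
match_len, owners = trie.longest_match([1, 2, 3, 7])
```

This sketch never removes entries, so it drifts away from the engines' real KV cache state as they evict blocks; keeping the two in sync is the hard part of the design.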

@ApostaC
Collaborator

ApostaC commented Jan 30, 2025

> I'm curious about how to implement this feature. Will we maintain a radix tree in the router to simulate the KV cache status, similar to the SGLang router?

Good question. I think an alternative way is to simulate the "page" and "eviction" logic at a coarser granularity.
@KuntaiDu is currently working on the design, and there should be an RFC soon.
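
One possible reading of "page and eviction logic at a coarser granularity" is sketched below. This is a guess for illustration, not the design from the RFC: the chunk size, the LRU policy, and every name (`chunk_hashes`, `BackendCacheModel`, `route`) are assumptions. The router splits each prompt into large fixed-size chunks, hashes them as a chain so that each hash identifies a whole prefix, and keeps a bounded LRU set of chunk hashes per backend as a coarse model of that backend's cache.

```python
import hashlib
from collections import OrderedDict

CHUNK_CHARS = 256  # assumed coarse "page" size, in characters


def chunk_hashes(prompt: str, chunk_chars: int = CHUNK_CHARS) -> list[str]:
    """Hash each full chunk chained with the previous hash (prefix identity)."""
    hashes: list[str] = []
    prev = ""
    for start in range(0, len(prompt) - chunk_chars + 1, chunk_chars):
        chunk = prompt[start:start + chunk_chars]
        prev = hashlib.sha256((prev + chunk).encode("utf-8")).hexdigest()
        hashes.append(prev)
    return hashes


class BackendCacheModel:
    """Router-side guess at which chunks a backend still holds (LRU)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.chunks = OrderedDict()

    def hit_count(self, hashes: list[str]) -> int:
        # Count leading chunks that are modeled as cached.
        count = 0
        for h in hashes:
            if h not in self.chunks:
                break
            count += 1
        return count

    def record(self, hashes: list[str]) -> None:
        # Touch tail chunks first so the head of the prefix is evicted last.
        for h in reversed(hashes):
            self.chunks[h] = None
            self.chunks.move_to_end(h)
        while len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)  # drop least recently used


def route(prompt: str, models: dict[str, BackendCacheModel]) -> str:
    hashes = chunk_hashes(prompt)
    # Ties go to the first backend; a real router would also weigh load.
    best = max(models, key=lambda name: models[name].hit_count(hashes))
    models[best].record(hashes)
    return best
```

Compared with a per-token tree, this trades matching precision for a much smaller and cheaper router-side state.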

@gaocegege
Collaborator

> Good question. I think an alternative way is to simulate the "page" and "eviction" logic at a coarser granularity.

Makes a lot of sense. Will keep an eye on it.

@KuntaiDu
Collaborator

KuntaiDu commented Feb 4, 2025

Please check #59 for the initial design of prefix-aware routing.

@gaocegege
Collaborator

I think we could close this issue since we've got the RFC to keep track of it. What do you think?

@tdoublep
Member Author

tdoublep commented Feb 7, 2025

Sure thing.

@tdoublep closed this as completed Feb 7, 2025