Is prefix-aware routing implemented in this repo? #19
Comments
From what I’ve seen in the vllm-dev.slack.com discussions, this feature might roll out in the next few weeks. My guess is that the performance boost comes from session-ID-based routing, which seems similar to how prefix caching works in regular QA tasks: requests from the same session share the same prefix, so keeping a session on one engine keeps that prefix cached. Just my two cents, though.
For the current results shown in the blog post, we use session-ID-based routing, which can achieve most of the prefix-caching potential in general chatting applications. We are working on prefix-aware routing (it's on the roadmap). cc @ApostaC
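For context, here is a minimal sketch of what session-ID-based routing could look like. The backend URLs and function names are illustrative placeholders, not the actual production-stack API; the point is just that hashing the session ID pins every turn of a session to one engine, so that engine's prefix (KV) cache keeps serving the session's shared prefix:

```python
import hashlib

# Hypothetical backend list; a real router would read its endpoints
# from configuration. These URLs are placeholders.
BACKENDS = [
    "http://vllm-0:8000",
    "http://vllm-1:8000",
    "http://vllm-2:8000",
]

def route_by_session(session_id: str) -> str:
    """Pin all requests of a session to one engine so that engine's
    prefix (KV) cache can serve the session's shared prefix."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]

# Every turn of session "abc" lands on the same backend:
assert route_by_session("abc") == route_by_session("abc")
```

A consistent-hashing scheme instead of a plain modulo would reduce cache churn when engines are added or removed, but the pinning idea is the same.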
I'm curious about how to implement this feature. Will the router maintain a radix tree to simulate the engines' KV-cache state, similar to the SGLang router?
Good question. I think an alternative way is to simulate the "page" and "eviction" logic at a coarser granularity. |
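To make the coarser-granularity idea concrete, here is a hypothetical sketch (all names and constants below are made up; #59 describes the actual design). The router chunks each prompt into fixed-size blocks, hashes each prefix of blocks, and models every engine's KV cache as an LRU set of those hashes, routing each request to the engine with the longest matched prefix:

```python
from collections import OrderedDict
from typing import List

BLOCK_SIZE = 256            # tokens per simulated "page" (made-up value)
BLOCKS_PER_ENGINE = 10_000  # per-engine cache budget in blocks (made-up)

def prefix_block_hashes(tokens: List[int]) -> List[int]:
    """Hash each block together with the hash of everything before it,
    so each entry identifies an entire prefix, not just one chunk."""
    hashes, running = [], 0
    for i in range(0, len(tokens), BLOCK_SIZE):
        running = hash((running, tuple(tokens[i:i + BLOCK_SIZE])))
        hashes.append(running)
    return hashes

class EngineCacheModel:
    """Approximates one engine's KV cache as an LRU set of block hashes.

    Instead of a radix tree over tokens, the router only tracks which
    coarse blocks an engine has likely cached and evicts in LRU order,
    which mirrors the engine's real paged eviction only approximately.
    """

    def __init__(self):
        self.blocks = OrderedDict()  # block hash -> None, in LRU order

    def matched_prefix_blocks(self, hashes: List[int]) -> int:
        """Length (in blocks) of the longest simulated-cached prefix."""
        n = 0
        for h in hashes:
            if h not in self.blocks:
                break
            n += 1
        return n

    def record(self, hashes: List[int]) -> None:
        """Mark these blocks as cached, evicting LRU blocks over budget."""
        for h in hashes:
            self.blocks[h] = None
            self.blocks.move_to_end(h)
        while len(self.blocks) > BLOCKS_PER_ENGINE:
            self.blocks.popitem(last=False)

def pick_engine(engines: List[EngineCacheModel], tokens: List[int]) -> int:
    """Route to the engine whose simulated cache matches the longest prefix."""
    hashes = prefix_block_hashes(tokens)
    best = max(range(len(engines)),
               key=lambda i: engines[i].matched_prefix_blocks(hashes))
    engines[best].record(hashes)
    return best
```

Compared with a full radix tree, this trades exact per-token prefix matching for one O(1) lookup per block and a simple LRU policy, so the router's view of each engine's cache stays cheap to maintain.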
Makes a lot of sense. Will keep an eye on it. |
Please check #59 for the initial design of prefix-aware routing.
I think we could close this issue since we've got the RFC to keep track of it. What do you think? |
Sure thing. |
The README here says "(WIP) prefix-aware routing":
https://github.com/vllm-project/production-stack/tree/main/src/router
From a very quick scan of the Python files there, I can't see anything that implements it.
Can the results in the blog be reproduced using the code that is currently in this repo?