🚀 Feature Description and Motivation

Preble (https://arxiv.org/abs/2407.00023) did solid work on prefix-cache and load-aware routing.

The prefix-cache aware version we are implementing is somewhat different from Preble's, though we borrow some ideas from their metadata design.

We use hash blocks (or linked hash blocks) rather than a radix tree, setting the efficiency trade-offs aside for now. vLLM already maintains its own local "logical" tree, so we do not need to replicate it at the router level.
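Below is a minimal sketch of the hash-block idea, assuming tokenized prompts and a per-endpoint table of previously seen block hashes; `BLOCK_SIZE`, `HashBlockIndex`, `record`, and `match` are illustrative names, not the actual router API.

```python
import hashlib
from collections import defaultdict
from typing import Iterable

BLOCK_SIZE = 16  # tokens per hash block (assumed value)

def chunk_hashes(token_ids: Iterable[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Split token ids into fixed-size blocks and chain-hash them, so each
    block hash also encodes all preceding blocks (linked hash blocks)."""
    tokens = list(token_ids)
    hashes, prev = [], ""
    for i in range(0, len(tokens) - len(tokens) % block_size, block_size):
        block = tokens[i : i + block_size]
        h = hashlib.sha256((prev + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(h)
        prev = h
    return hashes

class HashBlockIndex:
    """Tracks which endpoints are likely to hold the KV cache for each block hash."""

    def __init__(self) -> None:
        self._block_to_endpoints: dict[str, set[str]] = defaultdict(set)

    def record(self, endpoint: str, token_ids: Iterable[int]) -> None:
        """Remember that `endpoint` served a request containing these prefix blocks."""
        for h in chunk_hashes(token_ids):
            self._block_to_endpoints[h].add(endpoint)

    def match(self, token_ids: Iterable[int]) -> dict[str, int]:
        """Return, per endpoint, how many leading blocks of this request it has cached."""
        matched = defaultdict(int)
        alive = None  # endpoints that still match every block seen so far
        for h in chunk_hashes(token_ids):
            owners = self._block_to_endpoints.get(h, set())
            alive = set(owners) if alive is None else alive & owners
            if not alive:
                break
            for ep in alive:
                matched[ep] += 1
        return dict(matched)
```

Compared with a radix tree, this trades exact longest-prefix matching for flat hash lookups, which keeps the router-side metadata simple; that seems acceptable here since vLLM keeps the authoritative prefix structure locally.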
We should consider the "load-aware" part in the new PR. Load cost calculation is essential and should be used together with prefix-cache aware routing. Preble implements load-awareness in three steps: (a) historical load cost, (b) eviction cost, and (c) current request cost. We can implement as many load-aware strategies as we like; existing ones such as least-request and least-kv-cache all fit under this category and can be reused as well.
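As a hedged sketch of how those three cost terms could combine with the prefix-match score above: `EndpointStats`, the individual cost formulas, and `pick_endpoint` are assumptions for illustration, not a settled routing interface.

```python
from dataclasses import dataclass

@dataclass
class EndpointStats:
    historical_load: float  # e.g. a moving average of recent request cost (assumed metric)
    pending_requests: int   # in-flight requests, as a least-request policy would track
    kv_cache_usage: float   # fraction of KV blocks in use (0..1), as least-kv-cache would track

def estimate_cost(stats: EndpointStats, matched_blocks: int, total_blocks: int) -> float:
    """Lower is better; each term stands in for one of the three load-aware steps."""
    # (a) historical load cost: what this endpoint has been costing us recently
    historical = stats.historical_load
    # (b) eviction cost: routing here may evict cached blocks when KV usage is high
    eviction = stats.kv_cache_usage ** 2
    # (c) current request cost: prefill work shrinks with the matched prefix
    hit_ratio = matched_blocks / total_blocks if total_blocks else 0.0
    current = (1.0 - hit_ratio) * (1 + stats.pending_requests)
    return historical + eviction + current

def pick_endpoint(matches: dict, total_blocks: int, stats: dict) -> str:
    """Route to the endpoint with the lowest combined cost."""
    return min(stats, key=lambda ep: estimate_cost(stats[ep], matches.get(ep, 0), total_blocks))
```

Dropping the historical and eviction terms and scoring only by `pending_requests` or `kv_cache_usage` recovers least-request and least-kv-cache respectively, which is why those existing strategies should be reusable under the same category.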
Use Case
No response
Proposed Solution
No response