🚀 Feature Description and Motivation
Existing large language model (LLM) serving engines are inefficient for multi-turn conversations: they repeatedly recompute the key-value (KV) caches of historical tokens, which drives up serving cost.
Some inference engines, such as vLLM, already implement prefix caching. However, fully reusing KV caches across the turns of a conversation remains unresolved, and the cache hit rate stays low because the routing layer is unaware of where caches live. Technically, there are two main ways to improve the hit rate:

- Utilize a distributed KV cache store, so that an instance without a local cache can fetch it from a remote instance.
- Implement a smart, cache-aware routing algorithm that load-balances with locality in mind and tries to schedule the second through n-th turn of a conversation onto the instance that already holds the cache (a minimal sketch follows below).

These two options are orthogonal and can be used together.
If routing takes locality into account, however, there are a few drawbacks:

- The cached volume for a conversation can grow rapidly and occupy the KV cache of a single host.
- The cold start is long if the pod is terminated (due to service upgrade, downscaling, failure, etc.).

Determining how to design the routing policy and how to allocate the KV cache is the challenge. Let's use this RFC to track it.
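To make the routing option concrete, here is a minimal, hypothetical sketch of a cache-aware router: it hashes the prompt block by block (mirroring how prefix caching keys paged blocks) and prefers the replica believed to hold the longest matched prefix. All names here (`CacheAwareRouter`, `block_hashes`, the optimistic bookkeeping) are illustrative assumptions, not an existing vLLM or gateway API.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 16  # tokens per KV-cache block, mirroring vLLM-style paged blocks


def block_hashes(token_ids):
    """Hash the prompt block by block; because each hash folds in all earlier
    blocks, two prompts sharing a prefix produce identical leading hashes."""
    hashes = []
    running = hashlib.sha256()
    full_blocks = len(token_ids) // BLOCK_SIZE
    for i in range(full_blocks):
        block = token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        running.update(str(block).encode("utf-8"))
        hashes.append(running.hexdigest())
    return hashes


class CacheAwareRouter:
    """Pick the replica believed to hold the longest cached prefix of a prompt.
    (Hypothetical sketch; not part of any existing project.)"""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        # replica -> set of block hashes we optimistically assume it has cached
        self.cached_blocks = defaultdict(set)

    def _matched_prefix(self, replica, hashes):
        count = 0
        for h in hashes:
            if h in self.cached_blocks[replica]:
                count += 1
            else:
                break
        return count

    def route(self, token_ids):
        hashes = block_hashes(token_ids)
        # Longest matched prefix wins; on a complete miss this degenerates to
        # the first replica (a real router would fall back to load balancing,
        # or to a lookup in a distributed KV-cache store).
        best = max(self.replicas, key=lambda r: self._matched_prefix(r, hashes))
        self.cached_blocks[best].update(hashes)
        return best


# Example: turn 2 of a conversation is routed to the replica that served turn 1.
router = CacheAwareRouter(["pod-a", "pod-b"])
turn1 = list(range(64))               # first prompt
turn2 = turn1 + list(range(64, 96))   # same prefix plus new tokens
assert router.route(turn1) == router.route(turn2)
```

The optimistic bookkeeping above would have to be reconciled with the actual eviction state of each engine and combined with a load signal, so that hot prefixes do not pile up on a single host, which is exactly the locality drawback listed above.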
Paper list
[CachedAttention](https://arxiv.org/pdf/2403.19708v3)
Use Case
I want multi-turn conversations to be persisted in the KV cache so that long prompts do not need to be processed repeatedly.
Proposed Solution
TODO: