🚀 Feature Description and Motivation
Existing large language model (LLM) serving engines are inefficient for multi-turn conversations: they repeatedly recompute the key-value (KV) caches of historical tokens, which drives up serving cost.
Some inference engines, such as vLLM, already implement prefix caching. However, fully reusing KV caches across the turns of a conversation remains unresolved, and the cache hit rate stays low because the routing layer is unaware of where caches live. Technically, there are two main ways to improve the hit rate:

- Utilize a distributed KV cache store, so that an instance without a local cache can fetch it from a remote instance.
- Implement a smart, cache-aware routing algorithm that load-balances with locality in mind and tries to schedule the second through n-th turn of a conversation onto the instance that already holds the cache (a minimal sketch follows below).

These two options are orthogonal and can be used together.
If routing takes locality into account, however, there are a few drawbacks:

- The cached volume for a conversation can grow rapidly and occupy the KV cache of a single host.
- The cold start is long if the pod is terminated (due to service upgrade, downscaling, failure, etc.).

Determining how to design the routing policy and how to allocate the KV cache is the challenge. Let's use this RFC to track it.
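To make the routing option concrete, here is a minimal, hypothetical sketch of a cache-aware router: it hashes the prompt block by block (mirroring how prefix caching keys paged blocks) and prefers the replica believed to hold the longest matched prefix. All names here (`CacheAwareRouter`, `block_hashes`, the optimistic bookkeeping) are illustrative assumptions, not an existing vLLM or gateway API.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 16  # tokens per KV-cache block, mirroring vLLM-style paged blocks


def block_hashes(token_ids):
    """Hash the prompt block by block; because each hash folds in all earlier
    blocks, two prompts sharing a prefix produce identical leading hashes."""
    hashes = []
    running = hashlib.sha256()
    full_blocks = len(token_ids) // BLOCK_SIZE
    for i in range(full_blocks):
        block = token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        running.update(str(block).encode("utf-8"))
        hashes.append(running.hexdigest())
    return hashes


class CacheAwareRouter:
    """Pick the replica believed to hold the longest cached prefix of a prompt.
    (Hypothetical sketch; not part of any existing project.)"""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        # replica -> set of block hashes we optimistically assume it has cached
        self.cached_blocks = defaultdict(set)

    def _matched_prefix(self, replica, hashes):
        count = 0
        for h in hashes:
            if h in self.cached_blocks[replica]:
                count += 1
            else:
                break
        return count

    def route(self, token_ids):
        hashes = block_hashes(token_ids)
        # Longest matched prefix wins; on a complete miss this degenerates to
        # the first replica (a real router would fall back to load balancing,
        # or to a lookup in a distributed KV-cache store).
        best = max(self.replicas, key=lambda r: self._matched_prefix(r, hashes))
        self.cached_blocks[best].update(hashes)
        return best


# Example: turn 2 of a conversation is routed to the replica that served turn 1.
router = CacheAwareRouter(["pod-a", "pod-b"])
turn1 = list(range(64))               # first prompt
turn2 = turn1 + list(range(64, 96))   # same prefix plus new tokens
assert router.route(turn1) == router.route(turn2)
```

The optimistic bookkeeping above would have to be reconciled with the actual eviction state of each engine and combined with a load signal, so that hot prefixes do not pile up on a single host, which is exactly the locality drawback listed above.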
Paper list
[CachedAttention](https://arxiv.org/pdf/2403.19708v3)
Use Case
I want multi-turn conversations to be persisted in the KV cache so that long prompts do not need to be processed repeatedly.
Proposed Solution
TODO: