
[RFC] Support KV-cache reuse within same multi-turn conversation window #99

Open
Tracked by #698
Jeffwan opened this issue Aug 27, 2024 · 0 comments

Jeffwan (Collaborator) commented Aug 27, 2024

🚀 Feature Description and Motivation

Existing large language model (LLM) serving engines are inefficient when executing multi-turn conversations: they repeatedly recompute the key-value (KV) caches of historical tokens, resulting in high serving costs.

Some inference engines, such as vLLM, have already implemented features like prefix caching. However, fully reusing KV caches across multi-turn conversations remains an unresolved challenge: the cache hit rate is still low because there is no special handling on the routing side. Technically, there are a few ways to improve the cache hit rate:

  1. Utilize a distributed KV cache store, so that even if the local instance does not hold the cache, it can be fetched from remote instances.
  2. Implement smart routing algorithms that enable cache-aware load balancing and try to schedule the second through n-th turns onto the instance that already holds the existing cache (a minimal sketch follows below).

These two options are orthogonal and can be used simultaneously.
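As a rough illustration of option 2, below is a minimal sketch of a prefix-hash-based, cache-aware router. All names here (`CacheAwareRouter`, `block_size`, the chained hashing scheme, and the least-loaded fallback) are hypothetical and only meant to show the idea under simplified assumptions, not an actual implementation in any engine or gateway.

```python
import hashlib
from collections import defaultdict


class CacheAwareRouter:
    """Route a request to the instance most likely to hold its KV cache."""

    def __init__(self, instances, block_size=16):
        self.instances = list(instances)
        self.block_size = block_size
        self.block_owner = {}         # chained prefix hash -> instance name
        self.load = defaultdict(int)  # instance name -> requests routed so far

    def _prefix_hashes(self, token_ids):
        # Hash fixed-size token blocks, chaining each block with its prefix so
        # the i-th hash identifies the whole prefix up to block i.
        hashes, running = [], hashlib.sha256()
        for i in range(0, len(token_ids) - self.block_size + 1, self.block_size):
            running.update(str(token_ids[i:i + self.block_size]).encode())
            hashes.append(running.hexdigest())
        return hashes

    def route(self, token_ids):
        prefix_hashes = self._prefix_hashes(token_ids)
        # Count how many leading blocks each instance is believed to hold.
        matched = defaultdict(int)
        for h in prefix_hashes:
            owner = self.block_owner.get(h)
            if owner is None:
                break
            matched[owner] += 1
        if matched:
            # Prefer the instance with the longest cached prefix.
            target = max(matched, key=matched.get)
        else:
            # Cold prefix: fall back to the least-loaded instance.
            target = min(self.instances, key=lambda name: self.load[name])
        # Remember which instance now (likely) holds these prefix blocks.
        for h in prefix_hashes:
            self.block_owner.setdefault(h, target)
        self.load[target] += 1
        return target


# Usage: the second turn shares the first turn's prefix, so it is routed
# back to the same (warm) instance.
router = CacheAwareRouter(["pod-a", "pod-b"])
turn1 = list(range(64))
turn2 = turn1 + list(range(64, 128))
assert router.route(turn1) == router.route(turn2)
```

A real router would also need to evict stale entries when blocks are freed on an instance, and would likely combine the prefix-match score with load and memory pressure rather than using a hard preference.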

However, if routing considers locality, there are a few drawbacks:

  • The cache volume may grow rapidly and exhaust the KV cache capacity on a single host.
  • The cold start would be long if the pod is terminated (due to a service upgrade, downscaling, failure, etc.).

Determining how to design routing and allocate KV cache is a challenge. Let's use this RFC to track it.

Paper list

Use Case

I want multi-turn conversations to be persisted in the KV cache so that long prompts do not need to be processed repeatedly.

Proposed Solution

TODO:
