[Feat] Add remote shared storage with LMCache (#188)
* add remote server

Signed-off-by: YaoJiayi <[email protected]>

* add yaml

Signed-off-by: YaoJiayi <[email protected]>

* add tutorial md

Signed-off-by: YaoJiayi <[email protected]>

* fix replica count

Signed-off-by: YaoJiayi <[email protected]>

* Update 06-remote-shared-kv-cache.md

Signed-off-by: YaoJiayi <[email protected]>

* add signature

Signed-off-by: YaoJiayi <[email protected]>

---------

Signed-off-by: YaoJiayi <[email protected]>
YaoJiayi authored Mar 1, 2025
1 parent 0c30f10 commit 09d5c10
Showing 7 changed files with 288 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -97,3 +97,5 @@ helm/examples

# version files
src/vllm_router/_version.py

/tutorials/assets/private.yaml
17 changes: 17 additions & 0 deletions helm/templates/_helpers.tpl
@@ -128,6 +128,15 @@ limits:
{{- end }}
{{- end }}

{{/*
Define labels for cache server and its service
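For example, with the tutorial values this renders the map under cacheserverSpec.labels
(environment: "cacheserver" and release: "cacheserver").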
*/}}
{{- define "chart.cacheserverLabels" -}}
{{- with .Values.cacheserverSpec.labels -}}
{{ toYaml . }}
{{- end }}
{{- end }}

{{/*
Define helper function to convert labels to a comma separated list
*/}}
@@ -140,3 +149,11 @@ limits:
{{- $result = "," -}}
{{- end -}}
{{- end -}}


{{/*
Define helper function to format remote cache url
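For example, (dict "service_name" "vllm-cache-server-service" "port" 81) renders as
"lm://vllm-cache-server-service:81".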
*/}}
{{- define "cacheserver.formatRemoteUrl" -}}
lm://{{ .service_name }}:{{ .port }}
{{- end -}}
52 changes: 52 additions & 0 deletions helm/templates/deployment-cache-server.yaml
@@ -0,0 +1,52 @@
{{- if .Values.cacheserverSpec -}}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "{{ .Release.Name }}-deployment-cache-server"
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "chart.cacheserverLabels" . | nindent 4 }}
spec:
  replicas: 1
  selector:
    matchLabels:
      {{- include "chart.cacheserverLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "chart.cacheserverLabels" . | nindent 8 }}
    spec:
      containers:
      - name: "lmcache-server"
        image: "{{ required "Required value 'cacheserverSpec.repository' must be defined !" .Values.cacheserverSpec.repository }}:{{ required "Required value 'cacheserverSpec.tag' must be defined !" .Values.cacheserverSpec.tag }}"
        command:
          - "lmcache_experimental_server"
          - "0.0.0.0"
          - "{{ .Values.cacheserverSpec.containerPort }}"
        {{- if .Values.cacheserverSpec.resources }}
        resources:
          {{- if .Values.cacheserverSpec.resources.requests }}
          requests:
            cpu: "{{ .Values.cacheserverSpec.resources.requests.cpu }}"
            memory: "{{ .Values.cacheserverSpec.resources.requests.memory }}"
          {{- end }}
          {{- if .Values.cacheserverSpec.resources.limits }}
          limits:
            cpu: "{{ .Values.cacheserverSpec.resources.limits.cpu }}"
            memory: "{{ .Values.cacheserverSpec.resources.limits.memory }}"
          {{- end }}
        {{- end }}
        ports:
          - name: "caserver-cport"
            containerPort: {{ .Values.cacheserverSpec.containerPort }}
        imagePullPolicy: IfNotPresent

        # TODO(Jiayi): add health check for lmcache server
        # livenessProbe:
        #   initialDelaySeconds: 30
        #   periodSeconds: 5
        #   failureThreshold: 3
        #   httpGet:
        #     path: /health
        #     port: {{ .Values.cacheserverSpec.containerPort }}
{{- end -}}
6 changes: 6 additions & 0 deletions helm/templates/deployment-vllm-multi.yaml
@@ -119,6 +119,12 @@ spec:
        - name: LMCACHE_MAX_LOCAL_DISK_SIZE
          value: "{{ $modelSpec.lmcacheConfig.diskOffloadingBufferSize }}"
        {{- end }}
        {{- if .Values.cacheserverSpec }}
        - name: LMCACHE_REMOTE_URL
          value: "{{ include "cacheserver.formatRemoteUrl" (dict "service_name" (print .Release.Name "-cache-server-service") "port" .Values.cacheserverSpec.servicePort) }}"
        - name: LMCACHE_REMOTE_SERDE
          value: "{{ .Values.cacheserverSpec.serde }}"
        {{- end }}
        {{- end }}
        {{- if .Values.servingEngineSpec.configs }}
        envFrom:
18 changes: 18 additions & 0 deletions helm/templates/service-cache-server.yaml
@@ -0,0 +1,18 @@
{{- if .Values.cacheserverSpec -}}
apiVersion: v1
kind: Service
metadata:
  name: "{{ .Release.Name }}-cache-server-service"
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "chart.cacheserverLabels" . | nindent 4 }}
spec:
  type: ClusterIP
  ports:
    - name: "cacheserver-sport"
      port: {{ .Values.cacheserverSpec.servicePort }}
      targetPort: {{ .Values.cacheserverSpec.containerPort }}
      protocol: TCP
  selector:
    {{- include "chart.cacheserverLabels" . | nindent 4 }}
{{- end -}}
139 changes: 139 additions & 0 deletions tutorials/06-remote-shared-kv-cache.md
@@ -0,0 +1,139 @@
# Tutorial: Shared Remote KV Cache Storage with LMCache

## Introduction

This tutorial demonstrates how to enable remote KV cache storage using LMCache in a vLLM deployment. Remote KV cache sharing offloads large KV caches from GPU memory to remote shared storage, so cached entries can be reused across serving engine replicas, increasing the KV cache hit rate and potentially making the deployment more fault tolerant.
vLLM Production Stack uses LMCache for remote KV cache sharing. For more details, see the [LMCache GitHub repository](https://github.com/LMCache/LMCache).

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Step 1: Configuring KV Cache Shared Storage](#step-1-configuring-kv-cache-shared-storage)
3. [Step 2: Deploying the Helm Chart](#step-2-deploying-the-helm-chart)
4. [Step 3: Verifying the Installation](#step-3-verifying-the-installation)
5. [Benchmark the Performance Gain of Remote Shared Storage (Work in Progress)](#benchmark-the-performance-gain-of-remote-shared-storage-work-in-progress)

## Prerequisites

- Completion of the following tutorials:
- [00-install-kubernetes-env.md](00-install-kubernetes-env.md)
- [01-minimal-helm-installation.md](01-minimal-helm-installation.md)
- [02-basic-vllm-config.md](02-basic-vllm-config.md)
- A Kubernetes environment with GPU support.

## Step 1: Configuring KV Cache Shared Storage

Locate the file `tutorials/assets/values-06-shared-storage.yaml` with the following content:

```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "mistral"
    repository: "lmcache/vllm-openai"
    tag: "latest"
    modelURL: "mistralai/Mistral-7B-Instruct-v0.2"
    replicaCount: 2
    requestCPU: 10
    requestMemory: "40Gi"
    requestGPU: 1
    pvcStorage: "50Gi"
    vllmConfig:
      enableChunkedPrefill: false
      enablePrefixCaching: false
      maxModelLen: 16384

    lmcacheConfig:
      enabled: true
      cpuOffloadingBufferSize: "20"

    hf_token: <YOUR HF TOKEN>

cacheserverSpec:
  replicaCount: 1
  containerPort: 8080
  servicePort: 81
  serde: "naive"

  repository: "lmcache/vllm-openai"
  tag: "latest"
  resources:
    requests:
      cpu: "4"
      memory: "8G"
    limits:
      cpu: "4"
      memory: "10G"

  labels:
    environment: "cacheserver"
    release: "cacheserver"
```

> **Note:** Replace `<YOUR HF TOKEN>` with your actual Hugging Face token.

The `cacheserverSpec` section enables the remote shared KV cache server.
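When `cacheserverSpec` is present, the chart renders a dedicated cache server `Deployment` and `Service` and points every serving engine at it through LMCache environment variables. Below is a minimal sketch of what is expected to be injected into each vLLM container, assuming the release is named `vllm` and the values above are used:

```yaml
# Sketch of the LMCache environment variables rendered into each vLLM container
# (release name "vllm" and the values above are assumed):
env:
  - name: LMCACHE_REMOTE_URL
    value: "lm://vllm-cache-server-service:81"  # lm://<release>-cache-server-service:<servicePort>
  - name: LMCACHE_REMOTE_SERDE
    value: "naive"                              # from cacheserverSpec.serde
```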

## Step 2: Deploying the Helm Chart

Deploy the Helm chart using the customized values file:

```bash
sudo helm install vllm vllm/vllm-stack -f tutorials/assets/values-06-shared-storage.yaml
```

## Step 3: Verifying the Installation

1. Check the pod logs to verify LMCache is active:

```bash
sudo kubectl get pods
```

Identify the pod name for the vLLM deployment (e.g., `vllm-mistral-deployment-vllm-xxxx-xxxx`). You should also see a cache server pod (e.g., `vllm-deployment-cache-server-xxxx-xxxx`). Then run:

```bash
sudo kubectl logs -f <pod-name>
```

Look for log entries indicating that LMCache is enabled and operational. Example output showing LMCache being initialized:

```plaintext
INFO 01-21 20:16:58 lmcache_connector.py:41] Initializing LMCacheConfig under kv_transfer_config kv_connector='LMCacheConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579
INFO LMCache: Creating LMCacheEngine instance vllm-instance [2025-01-21 20:16:58,732] -- /usr/local/lib/python3.12/dist-packages/lmcache/experimental/cache_engine.py:237
```

2. Forward the router service port to access the stack locally:

```bash
sudo kubectl port-forward svc/vllm-router-service 30080:80
```

3. Send a request to the stack and observe the logs:

```bash
curl -X POST http://localhost:30080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.2",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 10
}'
```

Expected output:

The response from the stack should contain the completion result, and the logs should show LMCache activity, for example:

```plaintext
DEBUG LMCache: Store skips 0 tokens and then stores 13 tokens [2025-01-21 20:23:45,113] -- /usr/local/lib/python3.12/dist-packages/lmcache/integration/vllm/vllm_adapter.py:490
```

## Benchmark the Performance Gain of Remote Shared Storage (Work in Progress)

In this section, we will benchmark the performance improvement when using LMCache for remote KV cache shared storage. Stay tuned for updates.

## Conclusion

This tutorial demonstrated how to enable shared remote KV cache storage across multiple vLLM serving engines using LMCache. By storing KV caches in remote shared storage, you can improve the KV cache hit rate and potentially make the deployment more fault tolerant. Explore further configurations to tailor LMCache to your workloads.
54 changes: 54 additions & 0 deletions tutorials/assets/values-06-shared-storage.yaml
@@ -0,0 +1,54 @@
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "mistral"
    repository: "lmcache/vllm-openai"
    tag: "latest"
    modelURL: "mistralai/Mistral-7B-Instruct-v0.2"
    replicaCount: 2
    requestCPU: 10
    requestMemory: "40Gi"
    requestGPU: 1
    pvcStorage: "50Gi"
    vllmConfig:
      enableChunkedPrefill: false
      enablePrefixCaching: false
      maxModelLen: 16384

    lmcacheConfig:
      enabled: true
      cpuOffloadingBufferSize: "20"

    hf_token: <YOUR HF TOKEN>

cacheserverSpec:
  # -- Number of replicas
  replicaCount: 1

  # -- Container port
  containerPort: 8080

  # -- Service port
  servicePort: 81

  # -- Serializer/Deserializer type
  serde: "naive"

  # -- Cache server image (reusing the vllm image)
  repository: "lmcache/vllm-openai"
  tag: "latest"

  # TODO (Jiayi): please adjust this once we have evictor
  # -- Cache server resource requests and limits
  resources:
    requests:
      cpu: "4"
      memory: "8G"
    limits:
      cpu: "4"
      memory: "10G"

  # -- Customized labels for the cache server deployment
  labels:
    environment: "cacheserver"
    release: "cacheserver"
