
Support distributed kv cache orchestration #583

Merged
9 commits merged from jiaxin/kv-cache-orchestrator into main on Jan 21, 2025

Conversation

Jeffwan
Collaborator

@Jeffwan Jeffwan commented Jan 21, 2025

Pull Request Description

Support distributed kv cache orchestration #570

  1. Introduce a new API called KVCache. It creates the etcd instance, the cache dataplane, and the services the vLLM engine needs to communicate with the cache.
  2. This is an initial version. Some of the API design ideas come from the vineyard operator, but we did a few things differently:
  • Simplify the original design: vineyard has many APIs we do not need, so we keep a slim version.
  • Improve reliability: the original v6d uses a template framework instead of managing resources natively, which causes stability problems. For example, if the etcd instance goes down or is deleted, the controller won't create a new one, which is unacceptable from an HA perspective (see the sketch after this list).
  • Add GPU and workload affinity. This comes from my original change in v6d: Add SchedulingConfig for Enhanced GPU and Pod Affinity Scheduling aibrix/v6d#7 (an illustrative affinity sketch appears after the example output below).
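
For the reliability point above, here is a minimal sketch of the native "recreate if missing" pattern using controller-runtime; the function name, helper shape, and surrounding wiring are assumptions for illustration, not this PR's actual code.

// Sketch only: recreate a missing etcd Pod natively instead of relying on a
// template framework. Assumed imports:
//   "context"
//   corev1 "k8s.io/api/core/v1"
//   apierrors "k8s.io/apimachinery/pkg/api/errors"
//   "sigs.k8s.io/controller-runtime/pkg/client"
func reconcileEtcdPod(ctx context.Context, c client.Client, desired *corev1.Pod) error {
	var existing corev1.Pod
	err := c.Get(ctx, client.ObjectKeyFromObject(desired), &existing)
	if apierrors.IsNotFound(err) {
		// The etcd Pod is gone (node failure, manual deletion, ...):
		// create it again so the KVCache keeps its metadata store.
		return c.Create(ctx, desired)
	}
	// Any other error is surfaced to the caller for requeue.
	return err
}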

Example

apiVersion: orchestration.aibrix.ai/v1alpha1
kind: KVCache
metadata:
  name: aibrix-deepseek-33b-kvcache
  namespace: aibrix-system
  annotations:
    kvcache.orchestration.aibrix.ai/node-affinity-gpu-type: NVIDIA-L20
    kvcache.orchestration.aibrix.ai/pod-affinity-workload: aibrix-deepseek-33b
spec:
  replicas: 1
  service:
    type: ClusterIP
    port: 9600
  cacheSpec:
    image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vineyardd:20241120
    imagePullPolicy: IfNotPresent

It will create the following resources:

cache deployment

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
test-aibrix-model-deepseek-coder-33b-kvcache   1/1     1            1           8m54s

individual pod

test-aibrix-model-deepseek-coder-33b-kvcache-etcd-0            1/1     Running             0            51m

services

NAME                                                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
test-aibrix-model-deepseek-coder-33b-kvcache-etcd-0         ClusterIP   10.96.153.146    <none>        2379/TCP,2380/TCP   49m
test-aibrix-model-deepseek-coder-33b-kvcache-etcd-service   ClusterIP   10.100.223.243   <none>        2379/TCP            49m
test-aibrix-model-deepseek-coder-33b-kvcache-rpc            ClusterIP   10.108.217.57    <none>        9600/TCP            49m
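
For illustration of the affinity annotations in the example above: the controller presumably turns them into standard Kubernetes scheduling constraints on the cache pods. The fragment below is a rough sketch of what such constraints could look like; the label keys and the required/preferred split are assumptions, not taken from this PR.

# Hypothetical pod-template fragment derived from the two annotations;
# the node and workload label keys here are assumed, not defined by this PR.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.product        # assumed GPU-type node label
              operator: In
              values:
                - NVIDIA-L20
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              model.aibrix.ai/name: aibrix-deepseek-33b   # assumed workload label
          topologyKey: kubernetes.io/hostname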

Related Issues

Resolves: #570

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

@Jeffwan Jeffwan requested a review from varungup90 January 21, 2025 19:07
@Jeffwan Jeffwan force-pushed the jiaxin/kv-cache-orchestrator branch from 9d497e0 to 787c672 on January 21, 2025 22:05
func needsUpdateDeployment(deployment *appsv1.Deployment, found *appsv1.Deployment) bool {
	imageChanged := false
	for i, container := range found.Spec.Template.Spec.Containers {
		if len(deployment.Spec.Template.Spec.Containers) > i {

varungup90 (Collaborator) commented:
We can also have a scenario where the number of containers is not the same. Second, it would be better to have a second loop that matches containers by name rather than depending on the container index (removing that assumption). This can be a future TODO.
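
For illustration, a minimal sketch of the name-based matching suggested above; the function name and structure are hypothetical, not code from this PR (appsv1 is k8s.io/api/apps/v1, as in the snippet above):

// imageChangedByName reports whether any container in the found Deployment
// runs a different image than the desired Deployment's container with the
// same name, removing the dependency on container ordering.
func imageChangedByName(desired, found *appsv1.Deployment) bool {
	want := make(map[string]string, len(desired.Spec.Template.Spec.Containers))
	for _, c := range desired.Spec.Template.Spec.Containers {
		want[c.Name] = c.Image
	}
	for _, c := range found.Spec.Template.Spec.Containers {
		if image, ok := want[c.Name]; ok && image != c.Image {
			return true
		}
	}
	// A differing number of containers could also be treated as a change,
	// which covers the first scenario mentioned above.
	return false
}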

Jeffwan (Collaborator, Author) replied:
These are great suggestions. Sure, I created a new issue to track your proposed improvements: #587.

@Jeffwan
Collaborator Author

Jeffwan commented Jan 21, 2025

I will merge this PR now and address Varun's suggestions in later PRs.

@Jeffwan Jeffwan merged commit ece609e into main Jan 21, 2025
10 checks passed
@Jeffwan Jeffwan deleted the jiaxin/kv-cache-orchestrator branch January 21, 2025 23:43
gangmuk pushed a commit that referenced this pull request Jan 25, 2025
* Add kv cache api for distributed kv orchestration

* Update the kvcache api spec

* Add KV Cache controller initial implementation

* Support affinity Node & Pod settings

* Adjust manifest to orchestration folders

* fix the ci check

* Address review feedback

* Update code based on rebase refactor

* Fix the linter issue

---------

Signed-off-by: Jiaxin Shan <[email protected]>
Development

Successfully merging this pull request may close these issues.

Support cache orchestrator to support distributed kv cache scenarios