In this tutorial, you will learn how to load a model from a Persistent Volume (PV) in Kubernetes to optimize deployment performance. The steps include creating a PV, matching it using `pvcMatchLabels`, and deploying the Helm chart to utilize the PV. You will also verify the setup by examining the contents of the volume and measuring the performance improvement.
- Prerequisites
- Step 1: Creating a Persistent Volume
- Step 2: Deploying with Helm Using the PV
- Step 3: Verifying the Deployment
## Prerequisites

- A running Kubernetes cluster with GPU support.
- Completion of the previous tutorials.
- Basic understanding of Kubernetes PV and PVC concepts.
## Step 1: Creating a Persistent Volume

1. Locate the Persistent Volume manifest file at `tutorials/assets/pv-03.yaml` with the following content:

   ```yaml
   apiVersion: v1
   kind: PersistentVolume
   metadata:
     name: test-vllm-pv
     labels:
       model: "llama3-pv"
   spec:
     capacity:
       storage: 50Gi
     accessModes:
       - ReadWriteOnce
     persistentVolumeReclaimPolicy: Retain
     storageClassName: standard
     hostPath:
       path: /data/llama3
   ```

   Note: You can change the path specified in the `hostPath` field to any valid directory on your Kubernetes node.
2. Apply the manifest:

   ```bash
   sudo kubectl apply -f tutorials/assets/pv-03.yaml
   ```
3. Verify that the PV was created:

   ```bash
   sudo kubectl get pv
   ```

   Expected output:

   ```
   NAME           CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   AGE
   test-vllm-pv   50Gi       RWO            Retain           Available           standard       2m
   ```
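Before deploying, you can also confirm the label that `pvcMatchLabels` will later match against. A quick check using a standard `kubectl` flag:

```bash
# Show the PV together with its labels; "model=llama3-pv" should appear
# in the LABELS column
sudo kubectl get pv test-vllm-pv --show-labels
```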
## Step 2: Deploying with Helm Using the PV

1. Locate the example values file at `tutorials/assets/values-03-match-pv.yaml` with the following content:

   ```yaml
   servingEngineSpec:
     modelSpec:
       - name: "llama3"
         repository: "vllm/vllm-openai"
         tag: "latest"
         modelURL: "meta-llama/Llama-3.1-8B-Instruct"
         replicaCount: 1
         requestCPU: 10
         requestMemory: "16Gi"
         requestGPU: 1
         pvcStorage: "50Gi"
         pvcMatchLabels:
           model: "llama3-pv"
         vllmConfig:
           maxModelLen: 4096
         hf_token: <YOUR_HF_TOKEN>
   ```

   Explanation: The `pvcMatchLabels` field specifies the labels used to match an existing Persistent Volume. In this example, it ensures that the deployment uses the PV with the label `model: "llama3-pv"`. This provides a way to link a specific PV to your application.

   Note: Make sure to replace `<YOUR_HF_TOKEN>` with your actual Hugging Face token in the YAML.
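Under the hood, `pvcMatchLabels` corresponds to a label selector on the PersistentVolumeClaim that the chart creates. As a rough sketch (the exact PVC name and layout are generated by the chart, so treat the names here as illustrative), the resulting claim behaves like:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama3-pvc          # hypothetical name; the chart generates its own
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi         # from pvcStorage
  storageClassName: standard
  selector:
    matchLabels:
      model: "llama3-pv"    # from pvcMatchLabels
```

Kubernetes binds a claim with `spec.selector.matchLabels` only to a PV whose labels include every listed key/value pair, which is how the deployment ends up on the PV created in Step 1.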
2. Deploy the Helm chart:

   ```bash
   helm install vllm vllm/vllm-stack -f tutorials/assets/values-03-match-pv.yaml
   ```
## Step 3: Verifying the Deployment

1. Verify the deployment:

   ```bash
   sudo kubectl get pods
   ```

   Expected output:

   ```
   NAME                                    READY   STATUS    RESTARTS   AGE
   vllm-deployment-router-xxxx-xxxx        1/1     Running   0          1m
   vllm-llama3-deployment-vllm-xxxx-xxxx   1/1     Running   0          1m
   ```
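If a pod stays in a non-`Running` state, or you want to watch the model being loaded, standard `kubectl` inspection commands help (substitute the actual pod name from `kubectl get pods`):

```bash
# Stream logs from the serving engine pod (replace the suffix with yours)
sudo kubectl logs -f vllm-llama3-deployment-vllm-xxxx-xxxx

# Inspect scheduling and volume-binding events if the pod is stuck in Pending
sudo kubectl describe pod vllm-llama3-deployment-vllm-xxxx-xxxx
```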
2. Check the contents of the host directory:

   - If using a standard Kubernetes node:

     ```bash
     sudo ls /data/llama3
     ```

   - If using Minikube, check the path inside the Minikube VM:

     ```bash
     sudo minikube ssh ls /data/llama3/hub
     ```

   Expected output: you should see the model files loaded into the directory:

   ```
   models--meta-llama--Llama-3.1-8B-Instruct  version.txt
   ```
3. Uninstall and reinstall the deployment to observe the faster startup:

   ```bash
   sudo helm uninstall vllm
   sudo kubectl delete -f tutorials/assets/pv-03.yaml && sudo kubectl apply -f tutorials/assets/pv-03.yaml
   helm install vllm vllm/vllm-stack -f tutorials/assets/values-03-match-pv.yaml
   ```

   During the second installation, the serving engine starts faster because the model files are already present in the Persistent Volume, so the download step is skipped.
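To put a rough number on the improvement, you can time how long the pods take to become ready after each `helm install`. A minimal sketch using `kubectl wait` (run it once after the first install and once after the reinstall; the second run should finish noticeably faster):

```bash
# Time how long it takes for every pod in the current namespace to become ready
time sudo kubectl wait --for=condition=Ready pod --all --timeout=600s
```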
In this tutorial, you learned how to utilize a Persistent Volume to store model weights for a vLLM serving engine. This approach optimizes deployment performance and demonstrates the benefits of Kubernetes storage resources. Continue exploring advanced configurations in future tutorials.