Tutorial: Loading Model Weights from Persistent Volume

Introduction

In this tutorial, you will learn how to load model weights from a Kubernetes Persistent Volume (PV) so that redeployments can reuse previously downloaded files. The steps include creating a PV, matching it with pvcMatchLabels, and deploying the Helm chart so that it uses the PV. You will then verify the setup by inspecting the volume's contents and comparing startup times.

Prerequisites

- A running Kubernetes cluster with GPU support.
- Completion of the previous tutorials in this series.
- Basic understanding of Kubernetes PV and PVC concepts.

Step 1: Creating a Persistent Volume

Locate the Persistent Volume manifest file at tutorials/assets/pv-03.yaml with the following content:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-vllm-pv
  labels:
    model: "llama3-pv"
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: standard
  hostPath:
    path: /data/llama3

Note: You can change the path specified in the hostPath field to any valid directory on your Kubernetes node.

Apply the manifest:

sudo kubectl apply -f tutorials/assets/pv-03.yaml

Verify the PV is created:

sudo kubectl get pv

Expected output:

NAME           CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   AGE
test-vllm-pv   50Gi       RWO            Retain           Available           standard       2m
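
If you would rather check the PV status from a script than read the table, the following is a minimal sketch. It assumes kubectl is on the PATH and works for the current user (prepend sudo inside the function if your setup requires it, as in the commands above). The function name pv_phase is hypothetical and not part of this tutorial's assets.

```shell
#!/usr/bin/env bash
# pv_phase (hypothetical helper): print the phase of a PersistentVolume,
# for example Available, Bound, or Released.
# Assumption: kubectl can reach the cluster without sudo.
pv_phase() {
  kubectl get pv "$1" -o jsonpath='{.status.phase}'
}
```

Running `pv_phase test-vllm-pv` before the Helm install should report the same status as the STATUS column above.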

Step 2: Deploying with Helm Using the PV

Locate the example values file at tutorials/assets/values-03-match-pv.yaml with the following content:

servingEngineSpec:
  modelSpec:
  - name: "llama3"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    replicaCount: 1

    requestCPU: 10
    requestMemory: "16Gi"
    requestGPU: 1

    pvcStorage: "50Gi"
    pvcMatchLabels:
      model: "llama3-pv"

    vllmConfig:
      maxModelLen: 4096

    hf_token: <YOUR HF TOKEN>

Explanation: The pvcMatchLabels field specifies the labels to match an existing Persistent Volume. In this example, it ensures that the deployment uses the PV with the label model: "llama3-pv". This provides a way to link a specific PV to your application.
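
To make the matching concrete, the sketch below writes out the kind of PersistentVolumeClaim that selects a PV by label. It illustrates the Kubernetes selector mechanism only and is not necessarily the exact manifest the Helm chart renders. The claim name example-llama3-pvc and the file name pvc-sketch.yaml are hypothetical.

```shell
#!/usr/bin/env bash
# Write an illustrative PVC manifest to a local file.
# The claim name and file name are hypothetical; the selector, size,
# access mode, and storage class mirror the PV from Step 1.
cat > pvc-sketch.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-llama3-pvc     # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce            # must be offered by the PV
  storageClassName: standard   # must equal the PV's storageClassName
  resources:
    requests:
      storage: 50Gi            # must not exceed the PV's capacity
  selector:
    matchLabels:
      model: "llama3-pv"       # same label as on test-vllm-pv
EOF
```

A claim binds only to a PV that carries every label under matchLabels and that also satisfies the requested size, access mode, and storage class. The file is meant for reading; applying it next to the chart would compete with the chart's own claim for the same PV.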

Note: Replace <YOUR HF TOKEN> in the values file with your actual Hugging Face token.

Deploy the Helm chart:

sudo helm install vllm vllm/vllm-stack -f tutorials/assets/values-03-match-pv.yaml

Verify the deployment:

sudo kubectl get pods

Expected output:

NAME                                    READY   STATUS    RESTARTS   AGE
vllm-deployment-router-xxxx-xxxx        1/1     Running   0          1m
vllm-llama3-deployment-vllm-xxxx-xxxx   1/1     Running   0          1m

Step 3: Verifying the Deployment

Check the contents of the host directory. The downloaded model files are stored in its hub subdirectory:
- If using a standard Kubernetes node:
```
sudo ls /data/llama3/hub
```
- If using Minikube, access the Minikube VM and check the path:
```
sudo minikube ssh
ls /data/llama3/hub
```
Expected output (the model files loaded into the directory):
```
models--meta-llama--Llama-3.1-8B-Instruct  version.txt
```

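Uninstall and reinstall the deployment to observe faster startup. Because the reclaim policy is Retain, the PV is left in the Released state once its claim is removed, so the second command deletes and re-creates the PV object to make it Available again. The files under /data/llama3 on the node are not affected:

sudo helm uninstall vllm
sudo kubectl delete -f tutorials/assets/pv-03.yaml && sudo kubectl apply -f tutorials/assets/pv-03.yaml
sudo helm install vllm vllm/vllm-stack -f tutorials/assets/values-03-match-pv.yaml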

Explanation

During the second installation, the serving engine starts faster because the model files are already present on the Persistent Volume and do not need to be downloaded again.
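
One way to put a number on the difference is to time how long the pods take to become Ready after each install. The helper below is a minimal sketch. The name elapsed_seconds is hypothetical, and the example in the comments assumes kubectl works without sudo and that the pods already exist when the wait starts.

```shell
#!/usr/bin/env bash
# elapsed_seconds (hypothetical helper): run the given command, discard its
# standard output, and print how many whole seconds it took.
elapsed_seconds() {
  local start end
  start=$(date +%s)
  "$@" >/dev/null
  end=$(date +%s)
  echo $((end - start))
}

# Example (needs a cluster; run right after each helm install):
#   elapsed_seconds kubectl wait --for=condition=Ready pod --all --timeout=1800s
```

Comparing the value printed after the first install with the one printed after the reinstall gives a rough measure of the time saved. The resolution is one second, which is adequate for a download that takes minutes.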

Conclusion

In this tutorial, you learned how to use a Persistent Volume to store model weights for a vLLM serving engine. Because the weights survive uninstalling the release, later deployments can skip the download and start faster. Continue exploring advanced configurations in future tutorials.
