This tutorial guides you through the basic configuration required to deploy a vLLM serving engine in a Kubernetes environment with GPU support. You will learn how to specify the model details, set up necessary environment variables (like `HF_TOKEN`), and launch the vLLM serving engine.
## Table of Contents

- Prerequisites
- Step 1: Preparing the Configuration File
- Step 2: Applying the Configuration
- Step 3: Verifying the Deployment
- Step 4 (Optional): Multi-GPU Deployment
## Prerequisites

- A Kubernetes environment with GPU support, as set up in the 00-install-kubernetes-env tutorial.
- Helm installed on your system.
- Access to a Hugging Face token (`HF_TOKEN`).
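Before continuing, you may want to sanity-check that the prerequisites are in place. The commands below are a quick check only; the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin installed in the 00-install-kubernetes-env tutorial:

```bash
# Confirm Helm and kubectl are available
helm version
sudo kubectl get nodes

# Optional: check that at least one node advertises GPUs
# (assumes the NVIDIA device plugin, which exposes the nvidia.com/gpu resource)
sudo kubectl describe nodes | grep -i "nvidia.com/gpu"
```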
## Step 1: Preparing the Configuration File

- Locate the example configuration file `tutorials/assets/values-02-basic-config.yaml`.
- Open the file and update the following fields:
  - Replace `hf_token: <YOUR HF TOKEN>` in the YAML file with your actual Hugging Face token.
Explanation of the key fields in `values-02-basic-config.yaml`:

- `name`: The unique identifier for your model deployment.
- `repository`: The Docker repository containing the model's serving engine image.
- `tag`: Specifies the version of the model image to use.
- `modelURL`: The URL pointing to the model on Hugging Face or another hosting service.
- `replicaCount`: The number of replicas for the deployment, allowing scaling for load.
- `requestCPU`: The amount of CPU resources requested per replica.
- `requestMemory`: Memory allocation for the deployment; sufficient memory is required to load the model.
- `requestGPU`: Specifies the number of GPUs to allocate for the deployment.
- `pvcStorage`: Defines the Persistent Volume Claim size for model storage.
- `vllmConfig`: Contains model-specific configurations:
  - `enableChunkedPrefill`: Splits long prefills into smaller chunks so they can be batched with decode requests, improving responsiveness under load.
  - `enablePrefixCaching`: Speeds up response times for queries that share common prefixes by reusing cached KV state.
  - `maxModelLen`: The maximum sequence length the model can handle.
  - `dtype`: Data type for computations, e.g., `bfloat16` for faster performance on modern GPUs.
  - `extraArgs`: Additional arguments passed to the vLLM engine for fine-tuning behavior.
- `hf_token`: The Hugging Face token for authenticating with the Hugging Face model hub.
- `env`: Extra environment variables to pass to the model-serving engine.

The complete example configuration looks like this:
```yaml
servingEngineSpec:
  modelSpec:
  - name: "llama3"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    replicaCount: 1
    requestCPU: 10
    requestMemory: "16Gi"
    requestGPU: 1
    pvcStorage: "50Gi"
    vllmConfig:
      enableChunkedPrefill: false
      enablePrefixCaching: false
      maxModelLen: 16384
      dtype: "bfloat16"
      extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"]
    hf_token: <YOUR HF TOKEN>
```
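Optionally, before installing anything, you can render the chart locally to confirm that your values file is well-formed. This is a standard Helm dry-run step, not something specific to this stack:

```bash
# Add the repo (also done in Step 2) and render the manifests without installing
helm repo add vllm https://vllm-project.github.io/production-stack
helm template vllm vllm/vllm-stack -f tutorials/assets/values-02-basic-config.yaml
```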
## Step 2: Applying the Configuration

Deploy the configuration using Helm:

```bash
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f tutorials/assets/values-02-basic-config.yaml
```

Expected output:

You should see output indicating the successful deployment of the Helm chart:

```plaintext
Release "vllm" has been deployed. Happy Helming!
NAME: vllm
LAST DEPLOYED: <timestamp>
NAMESPACE: default
STATUS: deployed
REVISION: 1
```
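If you want to re-check the release later (or the install output has scrolled away), the standard Helm status commands show the same information:

```bash
# List releases in the current namespace and show details for this one
helm list
helm status vllm
```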
## Step 3: Verifying the Deployment

- Check the status of the pods:

  ```bash
  sudo kubectl get pods
  ```

  Expected output:

  You should see the following pods:

  ```plaintext
  NAME                                    READY   STATUS    RESTARTS   AGE
  vllm-deployment-router-xxxx-xxxx        1/1     Running   0          3m23s
  vllm-llama3-deployment-vllm-xxxx-xxxx   1/1     Running   0          3m23s
  ```

  - The `vllm-deployment-router` pod acts as the router, managing requests and routing them to the appropriate model-serving pod.
  - The `vllm-llama3-deployment-vllm` pod serves the actual model for inference.

  If a pod stays in a non-`Running` state, see the troubleshooting commands after this list.
- Verify the service is exposed correctly:

  ```bash
  sudo kubectl get services
  ```

  Expected output:

  Ensure there are services for both the serving engine and the router:

  ```plaintext
  NAME                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
  vllm-engine-service   ClusterIP   10.103.98.170    <none>        80/TCP    4m
  vllm-router-service   ClusterIP   10.103.110.107   <none>        80/TCP    4m
  ```

  - The `vllm-engine-service` exposes the serving engine.
  - The `vllm-router-service` handles routing and load balancing across model-serving pods.
- Test the health endpoint:

  ```bash
  curl http://<SERVICE_IP>/health
  ```

  Replace `<SERVICE_IP>` with the cluster IP of the service from the `kubectl get services` output above. The ClusterIP is only reachable from inside the cluster; if you are testing from outside, see the port-forward example after this list. If everything is configured correctly, you will get:

  ```json
  {"status":"healthy"}
  ```
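If something in the checks above does not look right, the usual kubectl inspection commands apply. Because the services are of type ClusterIP, a port-forward is one convenient way to reach them from outside the cluster; the service name below comes from this tutorial's output, and the local port `30080` is just an arbitrary free port:

```bash
# Inspect a pod that is not reaching the Running state
sudo kubectl describe pod <POD_NAME>
sudo kubectl logs <POD_NAME>

# Forward the router service to a local port, then (in a second terminal)
# hit the health endpoint through the forwarded port
sudo kubectl port-forward svc/vllm-router-service 30080:80
curl http://localhost:30080/health
```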
Please refer to Step 3 in the 01-minimal-helm-installation tutorial for querying the deployed vLLM service.
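As a quick illustration (the full walkthrough is in that tutorial), with the port-forward from the previous step still running, the OpenAI-compatible API served by the `vllm/vllm-openai` image can be queried roughly like this; the model name must match what `/v1/models` reports:

```bash
# List the models served behind the router
curl http://localhost:30080/v1/models

# Send a simple completion request
curl http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Kubernetes is", "max_tokens": 32}'
```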
## Step 4 (Optional): Multi-GPU Deployment

So far, you have configured and deployed the vLLM serving engine on a single GPU. You can also deploy the serving engine across multiple GPUs with the following example configuration snippet:
```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "llama3"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    replicaCount: 1
    requestCPU: 10
    requestMemory: "16Gi"
    requestGPU: 2
    pvcStorage: "50Gi"
    pvcAccessMode:
      - ReadWriteOnce
    vllmConfig:
      enableChunkedPrefill: false
      enablePrefixCaching: false
      maxModelLen: 4096
      tensorParallelSize: 2
      dtype: "bfloat16"
      extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"]
    hf_token: <YOUR HF TOKEN>
    shmSize: "20Gi"
```
Note that only tensor parallelism is supported for now. The `shmSize` field must be configured whenever `requestGPU` is greater than one, so that the processes running tensor parallelism have enough shared memory.
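To apply the multi-GPU configuration to the existing release and confirm that both GPUs are visible to the serving engine, something like the following should work; the values file name is whatever file you saved the snippet above to, and the pod name comes from `kubectl get pods`:

```bash
# Upgrade the existing release with the multi-GPU values file
helm upgrade vllm vllm/vllm-stack -f <YOUR_MULTI_GPU_VALUES_FILE>.yaml

# Once the new serving pod is Running, check GPU visibility inside it
sudo kubectl exec -it <VLLM_POD_NAME> -- nvidia-smi
```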
In this tutorial, you configured and deployed a vLLM serving engine with GPU support (on either a single GPU or multiple GPUs) in a Kubernetes environment. You also learned how to verify the deployment and ensure it is running as expected. For further customization, refer to the `values.yaml` file and the Helm chart documentation.