This tutorial guides you through setting up horizontal pod autoscaling (HPA) for vLLM deployments using Prometheus metrics. By the end of this tutorial, you'll have a vLLM deployment that automatically scales based on the number of waiting requests in the queue.
- A working vLLM deployment on Kubernetes (follow 01-minimal-helm-installation)
- Kubernetes environment with 2 GPUs
- `kubectl` and `helm` installed
- Basic understanding of Kubernetes and metrics
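If you want to double-check GPU availability before starting, one quick way (assuming the NVIDIA device plugin is installed on your nodes) is to look at the capacity each node reports:

```bash
# Check that the cluster reports GPU capacity on its nodes
kubectl describe nodes | grep -i "nvidia.com/gpu"
```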
Follow the instructions in 02-basic-vllm-config.md to install the vLLM Production Stack with a single Pod.
The observability stack is based on kube-prometheus-stack and includes Prometheus, Grafana, and other monitoring tools.
```bash
# Navigate to the observability directory
cd production-stack/observability

# Install the observability stack
sudo bash install.sh
```
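Once the script finishes, you can verify that the monitoring components came up; the namespace below assumes the default used by the install script, so adjust it if yours differs:

```bash
# Check that Prometheus, Grafana, and the Prometheus Adapter pods are running
kubectl get pods -n monitoring
```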
The Prometheus Adapter is automatically configured during installation to export vLLM metrics. The key metric we'll use for autoscaling in this tutorial is `vllm_num_requests_waiting`.
You can learn more about the Prometheus Adapter in the Prometheus Adapter README.
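For reference, an adapter rule for this metric might look roughly like the sketch below; the raw series name and query here are assumptions, since the install script writes the actual configuration for you:

```yaml
# Illustrative Prometheus Adapter rule (series name and query are assumptions)
rules:
  - seriesQuery: '{__name__="vllm:num_requests_waiting"}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
    name:
      matches: "vllm:num_requests_waiting"
      as: "vllm_num_requests_waiting"
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```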
Check if the metrics are being exported correctly:
```bash
# Check if the metric is available
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | grep vllm_num_requests_waiting -C 10

# Get the current value of the metric
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/metrics/vllm_num_requests_waiting | jq
```
Expected output should show the metric and its current value:
```json
{
  "items": [
    {
      "describedObject": {
        "kind": "Namespace",
        "name": "default",
        "apiVersion": "/v1"
      },
      "metricName": "vllm_num_requests_waiting",
      "value": "0"
    }
  ]
}
```
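If the metric does not show up, it can help to query Prometheus directly and confirm the raw series is being scraped. A rough check, assuming the stack runs in the `monitoring` namespace and the raw metric keeps vLLM's `vllm:` prefix:

```bash
# Port-forward the Prometheus service created by the Prometheus Operator
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &

# Query the raw vLLM queue metric
curl -s 'http://localhost:9090/api/v1/query?query=vllm:num_requests_waiting' | jq
```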
Locate the file assets/hpa-10.yaml with the following content:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3-deployment-vllm # Name of the deployment to scale
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Object
      object:
        metric:
          name: vllm_num_requests_waiting
        describedObject:
          apiVersion: v1
          kind: Namespace
          name: default # The namespace where the metric is collected
        target:
          type: Value
          value: 1 # Scale up if the metric exceeds 1
```
Apply the HPA to your Kubernetes cluster:
```bash
kubectl apply -f assets/hpa-10.yaml
```
Explanation of the HPA configuration:
- `minReplicas`: The minimum number of replicas to scale down to
- `maxReplicas`: The maximum number of replicas to scale up to
- `metric`: The metric to scale on
- `target`: The target value of the metric
The above HPA will:
- Maintain between 1 and 2 replicas
- Scale up when more than 1 request is waiting in the queue
- Scale down when the queue length decreases (see the calculation below)
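Under the hood, the HPA controller applies the standard scaling formula `desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue)`, clamped to the `minReplicas`/`maxReplicas` bounds. For example, with 1 current replica, 5 waiting requests, and a target value of 1, the controller computes `ceil(1 × 5 / 1) = 5` desired replicas, which is then capped at `maxReplicas = 2`.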
Monitor the HPA status:
```bash
kubectl get hpa vllm-hpa -w
```
The output should show the HPA status and the current number of replicas.
```plaintext
NAME       REFERENCE                                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
vllm-hpa   Deployment/vllm-llama3-deployment-vllm    0/1       1         2         1          34s
```
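You can also inspect the conditions and scaling events the controller records:

```bash
# Show HPA conditions and recent scaling events
kubectl describe hpa vllm-hpa
```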
We provide a load test script in assets/example-10-load-generator.py to test the autoscaling.
```bash
# In the production-stack/tutorials directory
kubectl port-forward svc/vllm-engine-service 30080:80 &
python3 assets/example-10-load-generator.py --num-requests 100 --prompt-len 10000
```
You should see the HPA scale the deployment up to 2 replicas, with a new vLLM pod being created.
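If you prefer not to use the script, any sustained burst of concurrent requests against the OpenAI-compatible endpoint will have the same effect. A rough sketch (the model name below is an assumption; use the model your deployment actually serves, which you can look up with `curl http://localhost:30080/v1/models`):

```bash
# Fire 100 concurrent completion requests to build up the waiting queue
for i in $(seq 1 100); do
  curl -s http://localhost:30080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 200}' \
    > /dev/null &
done
wait
```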
To remove the observability stack and HPA:
```bash
# Remove HPA
kubectl delete -f assets/hpa-10.yaml

# Uninstall observability stack (in the production-stack/tutorials directory)
cd ../observability # Go back to the observability directory
sudo bash uninstall.sh
```
- Support CRD-based HPA configuration