[Document, Feat] basic HPA support and tutorials #209

Merged 2 commits on Mar 2, 2025

64 changes: 64 additions & 0 deletions observability/README.md
@@ -41,3 +41,67 @@ sudo kubectl --namespace monitoring port-forward prometheus-kube-prom-stack-kube
Open the webpage at `http://<IP of your node>:3000` to access the Grafana web page. The default user name is `admin` and the password can be configured in `values.yaml` (default is `prom-operator`).

Import the dashboard using the `vllm-dashboard.json` in this folder.

## Use Prometheus Adapter to export vLLM metrics

The [Prometheus Adapter](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-adapter) exposes the vLLM metrics collected by Prometheus through the Kubernetes custom metrics API. When you run the `install.sh` script, the Prometheus Adapter is installed and configured to export the vLLM metrics.

We provide a minimal example of this configuration; see [prom-adapter.yaml](prom-adapter.yaml) for details.

The exported metrics can be used for different purposes, such as horizontal scaling of the vLLM deployments.
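Before querying anything, you can confirm that the adapter has registered itself with the Kubernetes aggregation layer. The APIService name below is the standard one created by the Prometheus Adapter chart (verify it with `kubectl get apiservices` if your install differs):

```bash
# The custom metrics APIService should report Available=True once the adapter is ready
kubectl get apiservice v1beta1.custom.metrics.k8s.io
```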

To verify the metrics are being exported, you can use the following command:

```bash
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | grep vllm_num_requests_waiting -C 10
```

You should see something like the following:

```json
{
  "name": "namespaces/vllm_num_requests_waiting",
  "singularName": "",
  "namespaced": false,
  "kind": "MetricValueList",
  "verbs": [
    "get"
  ]
}
```

The following command will show the current value of the metric:

```bash
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/metrics/vllm_num_requests_waiting | jq
```

The output should look like the following:

```json
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "describedObject": {
        "kind": "Namespace",
        "name": "default",
        "apiVersion": "/v1"
      },
      "metricName": "vllm_num_requests_waiting",
      "timestamp": "2025-03-02T01:56:01Z",
      "value": "0",
      "selector": null
    }
  ]
}
```
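If you only need the raw number, for scripting or a quick watch loop, you can filter the same endpoint with `jq`; this is just a convenience sketch on top of the command above:

```bash
# Print only the current value of the metric, refreshed every 5 seconds
watch -n 5 'kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/metrics/vllm_num_requests_waiting | jq -r ".items[0].value"'
```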

## Uninstall the observability stack

```bash
sudo bash uninstall.sh
```
9 changes: 7 additions & 2 deletions observability/install.sh
100644 → 100755
@@ -1,7 +1,12 @@
 #!/bin/bash

+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

 helm upgrade --install kube-prom-stack prometheus-community/kube-prometheus-stack \
   --namespace monitoring \
   --create-namespace \
-  -f values.yaml
+  -f kube-prom-stack.yaml --wait
+
+helm install prometheus-adapter prometheus-community/prometheus-adapter \
+  --namespace monitoring \
+  -f "$SCRIPT_DIR/prom-adapter.yaml"
64 changes: 2 additions & 62 deletions observability/values.yaml → observability/kube-prom-stack.yaml
@@ -91,11 +91,9 @@ prometheusOperator:
 prometheus:
   enabled: true

-  ## Starter Kit components service monitors
-  #
-  # Uncomment the following section to enable emojivoto service monitoring
+  # Monitor vLLM pods using ServiceMonitor
   additionalServiceMonitors:
-    - name: "test-vllm-monitor2"
+    - name: "vllm-monitor"
       selector:
         matchLabels:
           app.kubernetes.io/managed-by: Helm
@@ -106,61 +104,3 @@
           - default
       endpoints:
         - port: "service-port"
-
-  # # Uncomment the following section to enable ingress-nginx service monitoring
-  # - name: "ingress-nginx-monitor"
-  #   selector:
-  #     matchLabels:
-  #       app.kubernetes.io/name: ingress-nginx
-  #   namespaceSelector:
-  #     matchNames:
-  #       - ingress-nginx
-  #   endpoints:
-  #     - port: "metrics"
-
-  # # Uncomment the following section to enable Loki service monitoring
-  # - name: "loki-monitor"
-  #   selector:
-  #     matchLabels:
-  #       app: loki
-  #       release: loki
-  #   namespaceSelector:
-  #     matchNames:
-  #       - loki-stack
-  #   endpoints:
-  #     - port: "http-metrics"
-
-  # # Uncomment the following section to enable Promtail service monitoring
-  # - name: "promtail-monitor"
-  #   selector:
-  #     matchLabels:
-  #       app: promtail
-  #       release: loki
-  #   namespaceSelector:
-  #     matchNames:
-  #       - loki-stack
-  #   endpoints:
-  #     - port: "http-metrics"
-
-  ## Prometheus StorageSpec for persistent data
-  ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/storage.md
-  ##
-  # prometheusSpec:
-  #   affinity:
-  #     nodeAffinity:
-  #       preferredDuringSchedulingIgnoredDuringExecution:
-  #         - weight: 1
-  #           preference:
-  #             matchExpressions:
-  #               - key: preferred
-  #                 operator: In
-  #                 values:
-  #                   - observability
-  #   storageSpec:
-  #     volumeClaimTemplate:
-  #       spec:
-  #         storageClassName: do-block-storage
-  #         accessModes: ["ReadWriteOnce"]
-  #         resources:
-  #           requests:
-  #             storage: 5Gi
20 changes: 20 additions & 0 deletions observability/prom-adapter.yaml
@@ -0,0 +1,20 @@
loglevel: 1

prometheus:
  url: http://kube-prom-stack-kube-prome-prometheus.monitoring.svc
  port: 9090

rules:
  default: true
  custom:

    # Example metric to export for HPA
    - seriesQuery: '{__name__=~"^vllm:num_requests_waiting$"}'
      resources:
        overrides:
          namespace:
            resource: "namespace"
      name:
        matches: ""
        as: "vllm_num_requests_waiting"
      metricsQuery: sum by(namespace) (vllm:num_requests_waiting)
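If the custom metric never appears, it helps to check that the underlying series exists in Prometheus itself. The sketch below port-forwards the Prometheus service named in `prometheus.url` above and evaluates the rule's `metricsQuery` directly:

```bash
# Forward the Prometheus API to localhost
kubectl --namespace monitoring port-forward svc/kube-prom-stack-kube-prome-prometheus 9090:9090 &

# Evaluate the same query the adapter rule uses
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by(namespace) (vllm:num_requests_waiting)' | jq
```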
3 changes: 3 additions & 0 deletions observability/uninstall.sh
@@ -0,0 +1,3 @@
#!/bin/bash
helm uninstall prometheus-adapter -n monitoring
helm uninstall -n monitoring kube-prom-stack
5 changes: 0 additions & 5 deletions observability/upgrade.sh

This file was deleted.

177 changes: 177 additions & 0 deletions tutorials/10-horizontal-autoscaling.md
@@ -0,0 +1,177 @@
# Tutorial: Scale Your vLLM Deployments Using the vLLM Production Stack

## Introduction

This tutorial guides you through setting up horizontal pod autoscaling (HPA) for vLLM deployments using Prometheus metrics. By the end of this tutorial, you'll have a vLLM deployment that automatically scales based on the number of waiting requests in the queue.

## Table of Contents

- [Introduction](#introduction)
- [Table of Contents](#table-of-contents)
- [Prerequisites](#prerequisites)
- [Steps](#steps)
  - [1. Install the Production Stack with a single Pod](#1-install-the-production-stack-with-a-single-pod)
  - [2. Deploy the Observability Stack](#2-deploy-the-observability-stack)
  - [3. Configure Prometheus Adapter](#3-configure-prometheus-adapter)
  - [4. Verify Metrics Export](#4-verify-metrics-export)
  - [5. Set Up Horizontal Pod Autoscaling](#5-set-up-horizontal-pod-autoscaling)
  - [6. Test the Autoscaling](#6-test-the-autoscaling)
  - [7. Cleanup](#7-cleanup)

## Prerequisites

1. A working vLLM deployment on Kubernetes (follow [01-minimal-helm-installation](01-minimal-helm-installation.md))
2. Kubernetes environment with 2 GPUs
3. `kubectl` and `helm` installed
4. Basic understanding of Kubernetes and metrics

## Steps

### 1. Install the Production Stack with a single Pod

Follow the instructions in [02-basic-vllm-config.md](02-basic-vllm-config.md) to install the vLLM Production Stack with a single Pod.

### 2. Deploy the Observability Stack

The observability stack is based on kube-prometheus-stack and includes Prometheus, Grafana, and other monitoring tools.

```bash
# Navigate to the observability directory
cd production-stack/observability

# Install the observability stack
sudo bash install.sh
```
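The installation can take a few minutes while images are pulled. You can watch the monitoring pods come up before moving on:

```bash
# All kube-prom-stack and prometheus-adapter pods should eventually reach Running
kubectl get pods -n monitoring -w
```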

### 3. Configure Prometheus Adapter

The Prometheus Adapter is automatically configured during installation to export vLLM metrics. The key metric we'll use for autoscaling in this tutorial is `vllm_num_requests_waiting`.

You can learn more about the Prometheus Adapter in the [Prometheus Adapter README](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-adapter).
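If the metric does not show up in the next step, the adapter's logs are the first place to look. The label selector below is the one the prometheus-adapter chart typically applies to its pods; confirm it with `kubectl get pods -n monitoring --show-labels` if it does not match your install:

```bash
# Look for rule-loading or Prometheus connection errors
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-adapter --tail=50
```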

### 4. Verify Metrics Export

Check if the metrics are being exported correctly:

```bash
# Check if the metric is available
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | grep vllm_num_requests_waiting -C 10

# Get the current value of the metric
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/metrics/vllm_num_requests_waiting | jq
```

Expected output should show the metric and its current value:

```json
{
  "items": [
    {
      "describedObject": {
        "kind": "Namespace",
        "name": "default",
        "apiVersion": "/v1"
      },
      "metricName": "vllm_num_requests_waiting",
      "value": "0"
    }
  ]
}
```

### 5. Set Up Horizontal Pod Autoscaling

Locate the file [assets/hpa-10.yaml](assets/hpa-10.yaml) with the following content:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3-deployment-vllm # Name of the deployment to scale
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Object
      object:
        metric:
          name: vllm_num_requests_waiting
        describedObject:
          apiVersion: v1
          kind: Namespace
          name: default # The namespace where the metric is collected
        target:
          type: Value
          value: 1 # Scale up if the metric exceeds 1
```

Apply the HPA to your Kubernetes cluster:

```bash
kubectl apply -f assets/hpa-10.yaml
```

Explanation of the HPA configuration:

- `minReplicas`: The minimum number of replicas to scale down to
- `maxReplicas`: The maximum number of replicas to scale up to
- `metric`: The metric to scale on
- `target`: The target value of the metric

The above HPA will:

- Maintain between 1 and 2 replicas
- Scale up when there is more than one request waiting in the queue
- Scale down when the queue length decreases
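For intuition: with a `Value` target on an `Object` metric, the HPA computes `desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue)` (see the Kubernetes HPA documentation). With 1 replica, a target of `1`, and 5 waiting requests, the HPA would ask for 5 replicas and then be capped at `maxReplicas: 2`.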

### 6. Test the Autoscaling

Monitor the HPA status:

```bash
kubectl get hpa vllm-hpa -w
```

The output should show the HPA status and the current number of replicas.

```plaintext
NAME       REFERENCE                                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
vllm-hpa   Deployment/vllm-llama3-deployment-vllm   0/1       1         2         1          34s
```

We provide a load test script in [assets/example-10-load-generator.py](assets/example-10-load-generator.py) to test the autoscaling.

```bash
# In the production-stack/tutorials directory
kubectl port-forward svc/vllm-engine-service 30080:80 &
python3 assets/example-10-load-generator.py --num-requests 100 --prompt-len 10000
```

You should see the HPA scale the deployment up to 2 replicas, with a new vLLM pod being created.
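In a second terminal, you can watch the scale-up happen at the pod level; the deployment name matches the `scaleTargetRef` in the HPA above:

```bash
# Watch pods being created as the HPA reacts
kubectl get pods -w

# Or watch the deployment's replica count directly
kubectl get deployment vllm-llama3-deployment-vllm -w
```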

### 7. Cleanup

To remove the observability stack and HPA:

```bash
# Remove HPA
kubectl delete -f assets/hpa-10.yaml

# Uninstall the observability stack (run from the production-stack/tutorials directory)
cd ../observability
sudo bash uninstall.sh
```

## Upcoming Features for HPA in vLLM Production Stack

- Support for CRD-based HPA configuration

## Additional Resources

- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
- [Prometheus Adapter Documentation](https://github.com/kubernetes-sigs/prometheus-adapter)
- [vLLM Production Stack Repository](https://github.com/vllm-project/production-stack)