[Document, Feat] basic HPA support and tutorials #209

Merged 2 commits on Mar 2, 2025

64 changes: 64 additions & 0 deletions observability/README.md
@@ -41,3 +41,67 @@ sudo kubectl --namespace monitoring port-forward prometheus-kube-prom-stack-kube
Open the webpage at `http://<IP of your node>:3000` to access the Grafana web page. The default user name is `admin` and the password can be configured in `values.yaml` (default is `prom-operator`).

Import the dashboard using the `vllm-dashboard.json` in this folder.

## Use Prometheus Adapter to export vLLM metrics

The [Prometheus Adapter](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-adapter) exposes the vLLM metrics collected by Prometheus through the Kubernetes custom metrics API. When you run the `install.sh` script, the Prometheus Adapter is installed and configured to export the vLLM metrics.

We provide a minimal example of this configuration; see [prom-adapter.yaml](prom-adapter.yaml) for details.

The exported metrics can be used for different purposes, such as horizontal scaling of the vLLM deployments.
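Before querying anything, you can confirm that the adapter has registered itself with the Kubernetes aggregation layer. The APIService name below is the standard one created by the Prometheus Adapter chart (verify it with `kubectl get apiservices` if your install differs):

```bash
# The custom metrics APIService should report Available=True once the adapter is ready
kubectl get apiservice v1beta1.custom.metrics.k8s.io
```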

To verify the metrics are being exported, you can use the following command:

```bash
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | grep vllm_num_requests_waiting -C 10
```

You should see something like the following:

```json
{
  "name": "namespaces/vllm_num_requests_waiting",
  "singularName": "",
  "namespaced": false,
  "kind": "MetricValueList",
  "verbs": [
    "get"
  ]
}
```

The following command will show the current value of the metric:

```bash
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/metrics/vllm_num_requests_waiting | jq
```

The output should look like the following:

```json
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "describedObject": {
        "kind": "Namespace",
        "name": "default",
        "apiVersion": "/v1"
      },
      "metricName": "vllm_num_requests_waiting",
      "timestamp": "2025-03-02T01:56:01Z",
      "value": "0",
      "selector": null
    }
  ]
}
```
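If you only need the raw number, for scripting or a quick watch loop, you can filter the same endpoint with `jq`; this is just a convenience sketch on top of the command above:

```bash
# Print only the current value of the metric, refreshed every 5 seconds
watch -n 5 'kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/metrics/vllm_num_requests_waiting | jq -r ".items[0].value"'
```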

## Uninstall the observability stack

```bash
sudo bash uninstall.sh
```
9 changes: 7 additions & 2 deletions observability/install.sh
100644 → 100755
@@ -1,7 +1,12 @@
 #!/bin/bash

+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

 helm upgrade --install kube-prom-stack prometheus-community/kube-prometheus-stack \
   --namespace monitoring \
   --create-namespace \
-  -f values.yaml
+  -f kube-prom-stack.yaml --wait
+
+helm install prometheus-adapter prometheus-community/prometheus-adapter \
+  --namespace monitoring \
+  -f "$SCRIPT_DIR/prom-adapter.yaml"
64 changes: 2 additions & 62 deletions observability/values.yaml → observability/kube-prom-stack.yaml
@@ -91,11 +91,9 @@ prometheusOperator:
 prometheus:
   enabled: true

-  ## Starter Kit components service monitors
-  #
-  # Uncomment the following section to enable emojivoto service monitoring
+  # Monitor vLLM pods using ServiceMonitor
   additionalServiceMonitors:
-    - name: "test-vllm-monitor2"
+    - name: "vllm-monitor"
       selector:
         matchLabels:
           app.kubernetes.io/managed-by: Helm
@@ -106,61 +104,3 @@
           - default
       endpoints:
         - port: "service-port"
-
-  # # Uncomment the following section to enable ingress-nginx service monitoring
-  # - name: "ingress-nginx-monitor"
-  #   selector:
-  #     matchLabels:
-  #       app.kubernetes.io/name: ingress-nginx
-  #   namespaceSelector:
-  #     matchNames:
-  #       - ingress-nginx
-  #   endpoints:
-  #     - port: "metrics"
-
-  # # Uncomment the following section to enable Loki service monitoring
-  # - name: "loki-monitor"
-  #   selector:
-  #     matchLabels:
-  #       app: loki
-  #       release: loki
-  #   namespaceSelector:
-  #     matchNames:
-  #       - loki-stack
-  #   endpoints:
-  #     - port: "http-metrics"
-
-  # # Uncomment the following section to enable Promtail service monitoring
-  # - name: "promtail-monitor"
-  #   selector:
-  #     matchLabels:
-  #       app: promtail
-  #       release: loki
-  #   namespaceSelector:
-  #     matchNames:
-  #       - loki-stack
-  #   endpoints:
-  #     - port: "http-metrics"
-
-  ## Prometheus StorageSpec for persistent data
-  ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/storage.md
-  ##
-  # prometheusSpec:
-  #   affinity:
-  #     nodeAffinity:
-  #       preferredDuringSchedulingIgnoredDuringExecution:
-  #         - weight: 1
-  #           preference:
-  #             matchExpressions:
-  #               - key: preferred
-  #                 operator: In
-  #                 values:
-  #                   - observability
-  #   storageSpec:
-  #     volumeClaimTemplate:
-  #       spec:
-  #         storageClassName: do-block-storage
-  #         accessModes: ["ReadWriteOnce"]
-  #         resources:
-  #           requests:
-  #             storage: 5Gi
20 changes: 20 additions & 0 deletions observability/prom-adapter.yaml
@@ -0,0 +1,20 @@
loglevel: 1

prometheus:
  url: http://kube-prom-stack-kube-prome-prometheus.monitoring.svc
  port: 9090

rules:
  default: true
  custom:

    # Example metric to export for HPA
    - seriesQuery: '{__name__=~"^vllm:num_requests_waiting$"}'
      resources:
        overrides:
          namespace:
            resource: "namespace"
      name:
        matches: ""
        as: "vllm_num_requests_waiting"
      metricsQuery: sum by(namespace) (vllm:num_requests_waiting)
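If the custom metric never appears, it helps to check that the underlying series exists in Prometheus itself. The sketch below port-forwards the Prometheus service named in `prometheus.url` above and evaluates the rule's `metricsQuery` directly:

```bash
# Forward the Prometheus API to localhost
kubectl --namespace monitoring port-forward svc/kube-prom-stack-kube-prome-prometheus 9090:9090 &

# Evaluate the same query the adapter rule uses
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by(namespace) (vllm:num_requests_waiting)' | jq
```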
3 changes: 3 additions & 0 deletions observability/uninstall.sh
@@ -0,0 +1,3 @@
#!/bin/bash
helm uninstall prometheus-adapter -n monitoring
helm uninstall -n monitoring kube-prom-stack
5 changes: 0 additions & 5 deletions observability/upgrade.sh

This file was deleted.

177 changes: 177 additions & 0 deletions tutorials/10-horizontal-autoscaling.md
@@ -0,0 +1,177 @@
# Tutorial: Scale Your vLLM Deployments Using the vLLM Production Stack

## Introduction

This tutorial guides you through setting up horizontal pod autoscaling (HPA) for vLLM deployments using Prometheus metrics. By the end of this tutorial, you'll have a vLLM deployment that automatically scales based on the number of waiting requests in the queue.

## Table of Contents

- [Introduction](#introduction)
- [Table of Contents](#table-of-contents)
- [Prerequisites](#prerequisites)
- [Steps](#steps)
  - [1. Install the Production Stack with a single Pod](#1-install-the-production-stack-with-a-single-pod)
  - [2. Deploy the Observability Stack](#2-deploy-the-observability-stack)
  - [3. Configure Prometheus Adapter](#3-configure-prometheus-adapter)
  - [4. Verify Metrics Export](#4-verify-metrics-export)
  - [5. Set Up Horizontal Pod Autoscaling](#5-set-up-horizontal-pod-autoscaling)
  - [6. Test the Autoscaling](#6-test-the-autoscaling)
  - [7. Cleanup](#7-cleanup)

## Prerequisites

1. A working vLLM deployment on Kubernetes (follow [01-minimal-helm-installation](01-minimal-helm-installation.md))
2. Kubernetes environment with 2 GPUs
3. `kubectl` and `helm` installed
4. Basic understanding of Kubernetes and metrics

## Steps

### 1. Install the Production Stack with a single Pod

Follow the instructions in [02-basic-vllm-config.md](02-basic-vllm-config.md) to install the vLLM Production Stack with a single Pod.

### 2. Deploy the Observability Stack

The observability stack is based on kube-prometheus-stack and includes Prometheus, Grafana, and other monitoring tools.

```bash
# Navigate to the observability directory
cd production-stack/observability

# Install the observability stack
sudo bash install.sh
```
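The installation can take a few minutes while images are pulled. You can watch the monitoring pods come up before moving on:

```bash
# All kube-prom-stack and prometheus-adapter pods should eventually reach Running
kubectl get pods -n monitoring -w
```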

### 3. Configure Prometheus Adapter

The Prometheus Adapter is automatically configured during installation to export vLLM metrics. The key metric we'll use for autoscaling in this tutorial is `vllm_num_requests_waiting`.

You can learn more about the Prometheus Adapter in the [Prometheus Adapter README](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-adapter).
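If the metric does not show up in the next step, the adapter's logs are the first place to look. The label selector below is the one the prometheus-adapter chart typically applies to its pods; confirm it with `kubectl get pods -n monitoring --show-labels` if it does not match your install:

```bash
# Look for rule-loading or Prometheus connection errors
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-adapter --tail=50
```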

### 4. Verify Metrics Export

Check if the metrics are being exported correctly:

```bash
# Check if the metric is available
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | grep vllm_num_requests_waiting -C 10

# Get the current value of the metric
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/metrics/vllm_num_requests_waiting | jq
```

Expected output should show the metric and its current value:

```json
{
  "items": [
    {
      "describedObject": {
        "kind": "Namespace",
        "name": "default",
        "apiVersion": "/v1"
      },
      "metricName": "vllm_num_requests_waiting",
      "value": "0"
    }
  ]
}
```

### 5. Set Up Horizontal Pod Autoscaling

Locate the file [assets/hpa-10.yaml](assets/hpa-10.yaml) with the following content:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3-deployment-vllm # Name of the deployment to scale
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Object
      object:
        metric:
          name: vllm_num_requests_waiting
        describedObject:
          apiVersion: v1
          kind: Namespace
          name: default # The namespace where the metric is collected
        target:
          type: Value
          value: 1 # Scale up if the metric exceeds 1
```

Apply the HPA to your Kubernetes cluster:

```bash
kubectl apply -f assets/hpa-10.yaml
```

Explanation of the HPA configuration:

- `minReplicas`: The minimum number of replicas to scale down to
- `maxReplicas`: The maximum number of replicas to scale up to
- `metric`: The metric to scale on
- `target`: The target value of the metric

The above HPA will:

- Maintain between 1 and 2 replicas
- Scale up when there is more than one request waiting in the queue
- Scale down when the queue length decreases
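For intuition: with a `Value` target on an `Object` metric, the HPA computes `desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue)` (see the Kubernetes HPA documentation). With 1 replica, a target of `1`, and 5 waiting requests, the HPA would ask for 5 replicas and then be capped at `maxReplicas: 2`.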

### 6. Test the Autoscaling

Monitor the HPA status:

```bash
kubectl get hpa vllm-hpa -w
```

The output should show the HPA status and the current number of replicas.

```plaintext
NAME       REFERENCE                                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
vllm-hpa   Deployment/vllm-llama3-deployment-vllm   0/1       1         2         1          34s
```

We provide a load test script in [assets/example-10-load-generator.py](assets/example-10-load-generator.py) to test the autoscaling.

```bash
# In the production-stack/tutorials directory
kubectl port-forward svc/vllm-engine-service 30080:80 &
python3 assets/example-10-load-generator.py --num-requests 100 --prompt-len 10000
```

You should see the HPA scale the deployment up to 2 replicas, with a new vLLM pod being created.
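In a second terminal, you can watch the scale-up happen at the pod level; the deployment name matches the `scaleTargetRef` in the HPA above:

```bash
# Watch pods being created as the HPA reacts
kubectl get pods -w

# Or watch the deployment's replica count directly
kubectl get deployment vllm-llama3-deployment-vllm -w
```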

### 7. Cleanup

To remove the observability stack and HPA:

```bash
# Remove HPA
kubectl delete -f assets/hpa-10.yaml

# Uninstall the observability stack (run from the production-stack/tutorials directory)
cd ../observability
sudo bash uninstall.sh
```

## Upcoming Features for HPA in vLLM Production Stack

- Support for CRD-based HPA configuration

## Additional Resources

- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
- [Prometheus Adapter Documentation](https://github.com/kubernetes-sigs/prometheus-adapter)
- [vLLM Production Stack Repository](https://github.com/vllm-project/production-stack)