Merge branch 'main' into kuntai-router
Showing 200 changed files with 10,461 additions and 1,172 deletions.
@@ -97,3 +97,5 @@ helm/examples

# version files
src/vllm_router/_version.py

/tutorials/assets/private.yaml
@@ -1,8 +1,10 @@
# vLLM Production Stack: reference stack for production vLLM deployment

| [**Blog**](https://lmcache.github.io) | [**Production-Stack Slack Channel**](https://vllm-dev.slack.com/archives/C089SMEAKRA) | [**LMCache Slack**](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ) | [**Interest Form**](https://forms.gle/mQfQDUXbKfp2St1z7) | [**Official Email**]([email protected]) |

## Latest News

- ✨ Join us in the #production-stack channel of the vLLM [slack](https://slack.vllm.ai/), the LMCache [slack](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ), or fill out this [interest form](https://forms.gle/wSoeNpncmPVdXppg8) for a chat!
- ✨ Cloud Deployment Tutorials for Lambda Labs, AWS EKS, and GCP are out! [Link](https://github.com/vllm-project/production-stack/blob/main/tutorials)
- 🛤️ 2025 Q1 Road Map Released! Join the discussion [here](https://github.com/vllm-project/production-stack/issues/26)!
- 🔥 vLLM Production Stack is released! Check out our [release blogs](https://blog.lmcache.ai/2025-01-21-stack-release) [01-22-2025]
@@ -17,7 +19,7 @@
## Step-By-Step Tutorials

0. How To [*Install Kubernetes (kubectl, helm, minikube, etc)*](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md)?
1. How to [*Deploy Production Stack on Major Cloud Platforms (AWS, GCP, Azure)*](https://github.com/vllm-project/production-stack/blob/main/tutorials/cloud_deployments)?
1. How to [*Deploy Production Stack on Major Cloud Platforms (AWS, GCP, Lambda Labs)*](https://github.com/vllm-project/production-stack/blob/main/tutorials/cloud_deployments)?
2. How To [*Setup a Minimal vLLM Production Stack*](https://github.com/vllm-project/production-stack/blob/main/tutorials/01-minimal-helm-installation.md)?
3. How To [*Customize vLLM Configs (optional)*](https://github.com/vllm-project/production-stack/blob/main/tutorials/02-basic-vllm-config.md)?
4. How to [*Load Your LLM Weights*](https://github.com/vllm-project/production-stack/blob/main/tutorials/03-load-model-from-pv.md)?
@@ -117,7 +119,7 @@ We welcome and value any contributions and collaborations. Please check out [CO

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.
This project is licensed under Apache License 2.0. See the `LICENSE` file for details.

---
@@ -0,0 +1,111 @@
# Benchmarking vLLM Production Stack Performance with multi-round QA

## Overview

This repository contains benchmarking tools for evaluating vLLM Production Stack's performance (e.g., latency, throughput). The initial focus of this benchmark is the multi-round QA (question answering) use case. The script `multi-round-qa.py` simulates multiple users interacting with a language model concurrently over multiple rounds, allowing you to analyze the serving engine's throughput and latency.

The overall workflow of this script is illustrated in the accompanying workflow figure.
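As a rough complement to that figure, the following is a simplified sketch of the request pattern the benchmark generates: N concurrent users, each starting from a shared system prompt plus its own dummy chat history, then asking M rounds of questions against an OpenAI-compatible endpoint. The names and structure below are illustrative only, not the actual internals of `multi-round-qa.py`.

```python
# Illustrative sketch only -- not the actual multi-round-qa.py implementation.
# Assumes an OpenAI-compatible endpoint (e.g., the vLLM router) at base_url.
import asyncio
from openai import AsyncOpenAI

async def run_user(client, model, system_prompt, user_history, num_rounds, answer_len):
    # Each simulated user starts from the shared system prompt plus its own
    # (dummy) chat history, then asks num_rounds follow-up questions.
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": user_history}]
    for i in range(num_rounds):
        messages.append({"role": "user", "content": f"Round {i}: please elaborate further."})
        resp = await client.chat.completions.create(
            model=model, messages=messages, max_tokens=answer_len)
        # Keep the answer in the conversation so later rounds reuse the
        # growing context, which is what stresses KV-cache reuse.
        messages.append({"role": "assistant", "content": resp.choices[0].message.content})

async def main():
    client = AsyncOpenAI(base_url="http://localhost:30080/v1", api_key="EMPTY")
    await asyncio.gather(*[
        run_user(client, "meta-llama/Llama-3.1-8B-Instruct",
                 "shared system prompt " * 200,       # roughly --shared-system-prompt
                 f"chat history of user {u} " * 400,  # roughly --user-history-prompt
                 num_rounds=5, answer_len=100)
        for u in range(10)                            # roughly --num-users
    ])

asyncio.run(main())
```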
## Setup

Install the required packages needed to run the benchmark:

```bash
pip install -r requirements.txt
```
## Running benchmarks

To run the multi-round QA benchmark, use the following example command:

```bash
python3 multi-round-qa.py \
    --num-users 10 \
    --num-rounds 5 \
    --qps 0.5 \
    --shared-system-prompt 1000 \
    --user-history-prompt 2000 \
    --answer-len 100 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --base-url http://localhost:30080/v1
```

Use Ctrl-C to terminate the benchmark at any time; the script will then write each request's detailed stats to `summary.csv`.
Note: the above command requires a serving engine with the `meta-llama/Llama-3.1-8B-Instruct` model served locally at `http://localhost:30080/v1`. Here's an example command to launch the serving engine with vLLM Production Stack:

```bash
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f model.yaml
```

Then set up port-forwarding with the following command:

```bash
sudo kubectl port-forward svc/vllm-router-service 30080:80
```
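Before launching the benchmark, it can help to verify that the endpoint is reachable. Below is a minimal check, assuming the router exposes the standard OpenAI-compatible `/v1/models` route:

```python
# Minimal reachability check against the (assumed) OpenAI-compatible endpoint.
import requests

resp = requests.get("http://localhost:30080/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json().get("data", [])])  # served model names
```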
### Explanation of the arguments

#### Configuring the workload

- `--num-users <int>`: The maximum number of concurrent users in the system (N in the above figure).
- `--num-rounds <int>`: The number of rounds per user (M in the above figure).
- `--qps <float>`: The overall queries-per-second (QPS) rate for the system (see the sketch after this list).
- `--shared-system-prompt <int>`: Length of the system prompt shared across all users (in tokens).
- `--user-history-prompt <int>`: Length of the user-specific context simulating existing chat history (in tokens).
- `--answer-len <int>`: Expected length of each answer (in tokens).
- `--init-user-id <int>`: The initial user ID to start the benchmark from (default = 0). This is useful when you want to resume the benchmark from a specific user ID, or to avoid the serving engine reusing cached requests from previous runs.
- `--request-with-user-id`: If this option is present, the script will include the user ID in the request header.
- `--sharegpt`: If this option is present, the script will use the ShareGPT workload instead of dummy context.
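To make the `--qps` setting concrete, here is a small sketch of how a target QPS is commonly turned into request launch times using exponentially distributed (Poisson) inter-arrival gaps. This illustrates the concept only; the actual scheduling policy of `multi-round-qa.py` may differ.

```python
# Sketch: spacing request launch times to hit a target QPS.
# Assumes Poisson arrivals (exponential gaps with mean 1/qps seconds);
# the actual script's scheduler may differ.
import random

def launch_times(qps: float, duration_s: float) -> list[float]:
    t, times = 0.0, []
    while t < duration_s:
        t += random.expovariate(qps)
        times.append(t)
    return times

# At 0.5 QPS over 10 minutes, expect roughly 300 launches.
print(len(launch_times(0.5, 600)))
```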
_Note:_ If you use the ShareGPT dataset, the expected answer length (in tokens) is determined by the minimum of the dataset response length and `--answer-len`. You also need to follow the instructions in **ShareGPT Datasets** first.
#### Configuring the serving engine connection

- `--model <str>`: The model name (e.g., `mistralai/Mistral-7B-Instruct-v0.2`).
- `--base-url <str>`: The URL endpoint for the language model server.
#### Configuring the experiment (Optional)

- `--output <str>`: The CSV file to dump the detailed stats for each query (default = `summary.csv`).
- `--log-interval <float>`: Time between each performance summary log, in seconds (default = 30).
- `--time <float>`: Total time to run the experiment (default = run forever).
#### Processing previous outputs only (Optional)

- `--process-summary <filename>`: If this option is present, the script will only process the existing output CSV and print the summary, without running any experiment.
## Benchmark Metrics

- **Queries Per Second (QPS)**: The average number of queries processed by the model per second.
- **Average Prompt Throughput**: Prompt tokens processed per second.
- **Average Generation Throughput**: Response tokens generated per second.
- **Average TTFT (Time to First Token)**: Average time taken for the model to generate the first token of a response.
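To illustrate how these metrics relate to the per-request stats, here is a sketch that aggregates them from a CSV. The column names used below (`launch_time`, `finish_time`, `ttft`, `prompt_tokens`, `generation_tokens`) are hypothetical and may not match what `multi-round-qa.py` actually writes to `summary.csv`.

```python
# Sketch: aggregating benchmark metrics from per-request stats.
# Column names are assumptions, not the script's documented schema.
import pandas as pd

df = pd.read_csv("summary.csv")
duration = df["finish_time"].max() - df["launch_time"].min()  # seconds

qps = len(df) / duration
prompt_tput = df["prompt_tokens"].sum() / duration
gen_tput = df["generation_tokens"].sum() / duration
avg_ttft = df["ttft"].mean()

print(f"QPS: {qps:.2f}, prompt tok/s: {prompt_tput:.1f}, "
      f"gen tok/s: {gen_tput:.1f}, avg TTFT: {avg_ttft:.3f}s")
```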
## ShareGPT Datasets

1. Download and prepare the ShareGPT dataset.
   You can specify the proportion of data to process by providing a number between 0 and 1 as an argument to the script.

   ```bash
   bash prepare_sharegpt_data.sh 1
   ```

   In this example, `1` indicates processing 100% of the dataset. You can adjust this value as needed.
2. Run the benchmark.
   Example:

   ```bash
   python3 multi-round-qa.py \
       --num-users 10 \
       --num-rounds 5 \
       --qps 0.3 \
       --shared-system-prompt 1000 \
       --user-history-prompt 2000 \
       --answer-len 100 \
       --model mistralai/Mistral-7B-Instruct-v0.2 \
       --base-url http://localhost:8000/v1 \
       --sharegpt
   ```
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "llama3"
    repository: "lmcache/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    replicaCount: 1

    requestCPU: 10
    requestMemory: "50Gi"
    requestGPU: 1

    pvcStorage: "50Gi"
    pvcAccessMode:
    - ReadWriteOnce

    vllmConfig:
      enableChunkedPrefill: false
      enablePrefixCaching: false
      maxModelLen: 4096
      dtype: "bfloat16"
      extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"]

    lmcacheConfig:
      enabled: true
      cpuOffloadingBufferSize: "30"

    hf_token: <YOUR HUGGINGFACE TOKEN>