Commit

Merge branch 'main' into kuntai-router
KuntaiDu committed Mar 5, 2025
2 parents a6693c3 + cb5ebb2 commit 8e39751
Showing 200 changed files with 10,461 additions and 1,172 deletions.
3 changes: 0 additions & 3 deletions .github/curl-01-minimal-example.sh
@@ -1,8 +1,5 @@
#!/bin/bash

# Curl and save output
[ ! -d "output-01-minimal-example" ] && mkdir output-01-minimal-example
chmod -R 777 output-01-minimal-example
# shellcheck disable=SC2034 # result_model appears unused. Verify it or export it.
result_model=$(curl -s http://"$1":"$2"/v1/models | tee output-01-minimal-example/models-01-minimal-example.json)
# shellcheck disable=SC2034 # result_query appears unused. Verify it or export it.
3 changes: 0 additions & 3 deletions .github/curl-02-two-pods.sh
@@ -1,8 +1,5 @@
#!/bin/bash

# Curl and save output
[ ! -d "output-02-two-pods" ] && mkdir output-02-two-pods
chmod -R 777 output-02-two-pods
# shellcheck disable=SC2034 # result_model appears unused. Verify it or export it.
result_model=$(curl -s http://"$1":"$2"/v1/models | tee output-02-two-pods/models-02-two-pods.json)
# shellcheck disable=SC2034 # result_query appears unused. Verify it or export it.
3 changes: 0 additions & 3 deletions .github/curl-04-multiple-models.sh
@@ -1,8 +1,5 @@
#!/bin/bash

# Curl and save output
[ ! -d "output-04-multiple-models" ] && mkdir output-04-multiple-models
chmod -R 777 output-04-multiple-models
# shellcheck disable=SC2034 # result_model appears unused. Verify it or export it.
result_model=$(curl -s http://"$1":"$2"/v1/models | tee output-04-multiple-models/models-04-multiple-models.json)

11 changes: 11 additions & 0 deletions .github/port-forward.sh
@@ -8,6 +8,17 @@ fi

echo "Waiting for all llmstack pods to be in Running state..."

# Save output
VAR="${1#curl-}"
[ ! -d "output-$VAR" ] && mkdir "output-$VAR"
chmod -R 777 "output-$VAR"

# Print router logs
POD_NAME=$(kubectl get pods --no-headers -o custom-columns=":metadata.name" | grep '^vllm-deployment-router')
kubectl wait --for=condition=ready pod/"$POD_NAME" --timeout=120s
sudo kubectl logs -f "$POD_NAME" 2>&1 | sudo tee "output-$VAR/router.log" &


# Loop to check if all llmstack-related pods are in the Running state
while true; do
# Get all pods containing "vllm" in their name and extract their STATUS column
6 changes: 3 additions & 3 deletions .github/workflows/functionality-helm-chart.yml
Expand Up @@ -5,13 +5,13 @@ on:
branches:
- main
paths:
- '.github/workflows/**'
- '.github/**'
- '**.py'
- 'setup.py'
- 'helm/**'
pull_request:
paths:
- '.github/workflows/**'
- '.github/**'
- '**.py'
- 'setup.py'
- 'helm/**'
@@ -32,7 +32,7 @@ jobs:
DOCKER_BUILDKIT: 1
run: |
cd ${{ github.workspace }}
sudo docker build -t localhost:5000/git-act-router -f docker/Dockerfile .
sudo docker build --build-arg INSTALL_SENTENCE_TRANSFORMERS=false -t localhost:5000/git-act-router -f docker/Dockerfile .
sudo docker push localhost:5000/git-act-router
sudo sysctl fs.protected_regular=0
sudo minikube image load localhost:5000/git-act-router
2 changes: 2 additions & 0 deletions .gitignore
@@ -97,3 +97,5 @@ helm/examples

# version files
src/vllm_router/_version.py

/tutorials/assets/private.yaml
8 changes: 5 additions & 3 deletions README.md
@@ -1,8 +1,10 @@
# vLLM Production Stack: reference stack for production vLLM deployment

| [**Blog**](https://lmcache.github.io) | [**Production-Stack Slack Channel**](https://vllm-dev.slack.com/archives/C089SMEAKRA) | [**LMCache Slack**](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ) | [**Interest Form**](https://forms.gle/mQfQDUXbKfp2St1z7) | [**Official Email**]([email protected]) |

## Latest News

-Join us at #production-stack channel of vLLM [slack](https://slack.vllm.ai/), LMCache [slack](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ), or fill out this [interest form](https://forms.gle/wSoeNpncmPVdXppg8) for a chat!
-Cloud Deployment Tutorials for Lambda Labs, AWS EKS, Google GCP are out! [Link](https://github.com/vllm-project/production-stack/blob/main/tutorials)
- 🛤️ 2025 Q1 Road Map Released! Join the discussion [here](https://github.com/vllm-project/production-stack/issues/26)!
- 🔥 vLLM Production Stack is released! Checkout our [release blogs](https://blog.lmcache.ai/2025-01-21-stack-release) [01-22-2025]

@@ -17,7 +19,7 @@
## Step-By-Step Tutorials

0. How To [*Install Kubernetes (kubectl, helm, minikube, etc)*](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md)?
1. How to [*Deploy Production Stack on Major Cloud Platforms (AWS, GCP, Azure)*](https://github.com/vllm-project/production-stack/blob/main/tutorials/cloud_deployments)?
1. How to [*Deploy Production Stack on Major Cloud Platforms (AWS, GCP, Lambda Labs)*](https://github.com/vllm-project/production-stack/blob/main/tutorials/cloud_deployments)?
2. How To [*Setup a Minimal vLLM Production Stack*](https://github.com/vllm-project/production-stack/blob/main/tutorials/01-minimal-helm-installation.md)?
3. How To [*Customize vLLM Configs (optional)*](https://github.com/vllm-project/production-stack/blob/main/tutorials/02-basic-vllm-config.md)?
4. How to [*Load Your LLM Weights*](https://github.com/vllm-project/production-stack/blob/main/tutorials/03-load-model-from-pv.md)?
@@ -117,7 +119,7 @@ We welcome and value any contributions and collaborations. Please check out [CO

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.
This project is licensed under Apache License 2.0. See the `LICENSE` file for details.

---

111 changes: 111 additions & 0 deletions benchmarks/multi-round-qa/README.md
@@ -0,0 +1,111 @@
# Benchmarking vLLM Production Stack Performance with multi-round QA

## Overview

This repository contains benchmarking tools for evaluating vLLM Production Stack's performance (e.g., latency, throughput). The initial focus of this benchmark is on the multi-round QA (Question Answering) use case. The script `multi-round-qa.py` simulates multiple users interacting with a language model concurrently for multiple rounds, allowing you to analyze the serving engine's throughput and latency.

The overall workflow of this script is shown in the figure below. ![Illustration](multi-round.png)

## Setup

Install the required packages needed to run the benchmark:

```bash
pip install -r requirements.txt
```

## Running benchmarks

To run the multi-round QA benchmark, use the following example command:

```bash
python3 multi-round-qa.py \
--num-users 10 \
--num-rounds 5 \
--qps 0.5 \
--shared-system-prompt 1000 \
--user-history-prompt 2000 \
--answer-len 100 \
--model meta-llama/Llama-3.1-8B-Instruct \
--base-url http://localhost:30080/v1
```

Use Ctrl-C to terminate the benchmark at any time; the script will then write each request's detailed stats to `summary.csv`.
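
To take a quick look at the recorded stats after a run, you can inspect the CSV directly (a minimal sketch; the exact column layout is whatever `multi-round-qa.py` writes and is not assumed here):

```bash
# Pretty-print the first few rows of the per-request stats
column -s, -t < summary.csv | head -n 5
```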

Note: the above command requires a serving engine with the `meta-llama/Llama-3.1-8B-Instruct` model available locally at `http://localhost:30080/v1`. Here's an example command to launch the serving engine with vLLM Production Stack:

```bash
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f model.yaml
```

Then set up port forwarding with the following command:

```bash
sudo kubectl port-forward svc/vllm-router-service 30080:80
```
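
Once the port-forward is active, you can sanity-check the endpoint before starting the benchmark (assuming the router exposes the standard OpenAI-compatible `/v1/models` route, which the CI curl scripts above also query):

```bash
# Should list meta-llama/Llama-3.1-8B-Instruct once the deployment is ready
curl -s http://localhost:30080/v1/models
```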

### Explanation of the arguments

#### Configuring the workload

- `--num-users <int>`: The maximum number of concurrent users in the system (N in the above figure).
- `--num-rounds <int>`: The number of rounds per user (M in the above figure).
- `--qps <float>`: The overall queries per second (QPS) rate for the system.
- `--shared-system-prompt <int>`: Length of the system prompt shared across all users (in tokens).
- `--user-history-prompt <int>`: Length of the user-specific context (simulating existing chat history) (in tokens).
- `--answer-len <int>`: Length of the answer expected (in tokens).
- `--init-user-id <int>`: The initial user ID to start the benchmark from (default = 0). This is useful when you want to resume the benchmark from a specific user ID, or to keep the serving engine from reusing cached requests from previous runs.
- `--request-with-user-id`: If this option is present, the script will include the user ID in the request header.
- `--sharegpt`: If this option is present, the script will use ShareGPT workload instead of dummy context.

_Note:_ If you use the ShareGPT dataset, the expected answer length (in tokens) is the minimum of the dataset response length and `--answer-len`. You also need to follow the instructions in the **ShareGPT Datasets** section below first.
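
For example, to start a fresh run that avoids cache hits from a previous run and tags each request with its user ID (a sketch that reuses the flags described above):

```bash
python3 multi-round-qa.py \
    --num-users 10 \
    --num-rounds 5 \
    --qps 0.5 \
    --shared-system-prompt 1000 \
    --user-history-prompt 2000 \
    --answer-len 100 \
    --init-user-id 100 \
    --request-with-user-id \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --base-url http://localhost:30080/v1
```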

#### Configuring the serving engine connection

- `--model <str>`: The model name (e.g., `mistralai/Mistral-7B-Instruct-v0.2`).
- `--base-url <str>`: The URL endpoint for the language model server.

#### Configuring the experiment (Optional)

- `--output <str>`: The CSV file to which the detailed stats for each query are written (default = `summary.csv`).
- `--log-interval <float>`: Time between performance summary logs, in seconds (default = 30).
- `--time <float>`: Total time to run the experiment (default = run forever).
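
For instance, to cap a run at ten minutes, log a summary every 60 seconds, and write the stats to a custom file (a sketch combining the options above; treating `--time` as seconds is an assumption):

```bash
python3 multi-round-qa.py \
    --num-users 10 \
    --num-rounds 5 \
    --qps 0.5 \
    --shared-system-prompt 1000 \
    --user-history-prompt 2000 \
    --answer-len 100 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --base-url http://localhost:30080/v1 \
    --output run1.csv \
    --log-interval 60 \
    --time 600
```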

#### Processing previous outputs only (Optional)

- `--process-summary <filename>`: If this option is present, the script will only process the existing output CSV and print the summary without running any experiment.
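
For example, to re-print the summary from an earlier run without sending any new requests:

```bash
python3 multi-round-qa.py --process-summary summary.csv
```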

## Benchmark Metrics

- **Queries Per Second (QPS)**: The average number of queries processed by the model per second.
- **Average Prompt Throughput**: Prompt tokens processed per second.
- **Average Generation Throughput**: Tokens generated as part of the response per second.
- **Average TTFT (Time to First Token)**: Average time taken for the model to generate the first token of a response.

## ShareGPT Datasets

1. Download and prepare the ShareGPT dataset
You can specify the proportion of data to process by providing a number between 0 and 1 as an argument to the script.

```bash
bash prepare_sharegpt_data.sh 1
```

In this example, 1 indicates processing 100% of the dataset. You can adjust this value as needed.
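
For instance, to prepare only a tenth of the dataset for a quicker trial run:

```bash
bash prepare_sharegpt_data.sh 0.1
```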

2. Run the benchmark
Example:

```bash
python3 multi-round-qa.py \
--num-users 10 \
--num-rounds 5 \
--qps 0.3 \
--shared-system-prompt 1000 \
--user-history-prompt 2000 \
--answer-len 100 \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--base-url http://localhost:8000/v1 \
--sharegpt
```
29 changes: 29 additions & 0 deletions benchmarks/multi-round-qa/model.yaml
@@ -0,0 +1,29 @@
servingEngineSpec:
runtimeClassName: ""
modelSpec:
- name: "llama3"
repository: "lmcache/vllm-openai"
tag: "latest"
modelURL: "meta-llama/Llama-3.1-8B-Instruct"
replicaCount: 1

requestCPU: 10
requestMemory: "50Gi"
requestGPU: 1

pvcStorage: "50Gi"
pvcAccessMode:
- ReadWriteOnce

vllmConfig:
enableChunkedPrefill: false
enablePrefixCaching: false
maxModelLen: 4096
dtype: "bfloat16"
extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"]

lmcacheConfig:
enabled: true
cpuOffloadingBufferSize: "30"

hf_token: <YOUR HUGGINGFACE TOKEN>