Merge branch 'main' into kuntai-router
Showing 200 changed files with 10,461 additions and 1,172 deletions.
@@ -97,3 +97,5 @@ helm/examples

# version files
src/vllm_router/_version.py

/tutorials/assets/private.yaml
@@ -1,8 +1,10 @@
# vLLM Production Stack: reference stack for production vLLM deployment

| [**Blog**](https://lmcache.github.io) | [**Production-Stack Slack Channel**](https://vllm-dev.slack.com/archives/C089SMEAKRA) | [**LMCache Slack**](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ) | [**Interest Form**](https://forms.gle/mQfQDUXbKfp2St1z7) | [**Official Email**]([email protected]) |

## Latest News

- ✨ Join us in the #production-stack channel of the vLLM [slack](https://slack.vllm.ai/), the LMCache [slack](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ), or fill out this [interest form](https://forms.gle/wSoeNpncmPVdXppg8) for a chat!
- ✨ Cloud Deployment Tutorials for Lambda Labs, AWS EKS, and GCP are out! [Link](https://github.com/vllm-project/production-stack/blob/main/tutorials)
- 🛤️ 2025 Q1 Road Map Released! Join the discussion [here](https://github.com/vllm-project/production-stack/issues/26)!
- 🔥 vLLM Production Stack is released! Check out our [release blogs](https://blog.lmcache.ai/2025-01-21-stack-release) [01-22-2025]
@@ -17,7 +19,7 @@
## Step-By-Step Tutorials

0. How To [*Install Kubernetes (kubectl, helm, minikube, etc)*](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md)?
1. How to [*Deploy Production Stack on Major Cloud Platforms (AWS, GCP, Azure)*](https://github.com/vllm-project/production-stack/blob/main/tutorials/cloud_deployments)?
1. How to [*Deploy Production Stack on Major Cloud Platforms (AWS, GCP, Lambda Labs)*](https://github.com/vllm-project/production-stack/blob/main/tutorials/cloud_deployments)?
2. How To [*Setup a Minimal vLLM Production Stack*](https://github.com/vllm-project/production-stack/blob/main/tutorials/01-minimal-helm-installation.md)?
3. How To [*Customize vLLM Configs (optional)*](https://github.com/vllm-project/production-stack/blob/main/tutorials/02-basic-vllm-config.md)?
4. How to [*Load Your LLM Weights*](https://github.com/vllm-project/production-stack/blob/main/tutorials/03-load-model-from-pv.md)?
@@ -117,7 +119,7 @@ We welcome and value any contributions and collaborations. Please check out [CO

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.
This project is licensed under Apache License 2.0. See the `LICENSE` file for details.

---
@@ -0,0 +1,111 @@
# Benchmarking vLLM Production Stack Performance with multi-round QA

## Overview

This repository contains benchmarking tools for evaluating vLLM Production Stack's performance (e.g., latency, throughput). The initial focus of this benchmark is the multi-round QA (question answering) use case. The script `multi-round-qa.py` simulates multiple users interacting with a language model concurrently over multiple rounds, allowing you to analyze the serving engine's throughput and latency.

The overall workflow of this script is illustrated in the accompanying workflow figure.
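As a rough complement to that figure, the following is a simplified sketch of the request pattern the benchmark generates: N concurrent users, each starting from a shared system prompt plus its own dummy chat history, then asking M rounds of questions against an OpenAI-compatible endpoint. The names and structure below are illustrative only, not the actual internals of `multi-round-qa.py`.

```python
# Illustrative sketch only -- not the actual multi-round-qa.py implementation.
# Assumes an OpenAI-compatible endpoint (e.g., the vLLM router) at base_url.
import asyncio
from openai import AsyncOpenAI

async def run_user(client, model, system_prompt, user_history, num_rounds, answer_len):
    # Each simulated user starts from the shared system prompt plus its own
    # (dummy) chat history, then asks num_rounds follow-up questions.
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": user_history}]
    for i in range(num_rounds):
        messages.append({"role": "user", "content": f"Round {i}: please elaborate further."})
        resp = await client.chat.completions.create(
            model=model, messages=messages, max_tokens=answer_len)
        # Keep the answer in the conversation so later rounds reuse the
        # growing context, which is what stresses KV-cache reuse.
        messages.append({"role": "assistant", "content": resp.choices[0].message.content})

async def main():
    client = AsyncOpenAI(base_url="http://localhost:30080/v1", api_key="EMPTY")
    await asyncio.gather(*[
        run_user(client, "meta-llama/Llama-3.1-8B-Instruct",
                 "shared system prompt " * 200,       # roughly --shared-system-prompt
                 f"chat history of user {u} " * 400,  # roughly --user-history-prompt
                 num_rounds=5, answer_len=100)
        for u in range(10)                            # roughly --num-users
    ])

asyncio.run(main())
```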
## Setup

Install the required packages needed to run the benchmark:

```bash
pip install -r requirements.txt
```
## Running benchmarks

To run the multi-round QA benchmark, use the following example command:

```bash
python3 multi-round-qa.py \
    --num-users 10 \
    --num-rounds 5 \
    --qps 0.5 \
    --shared-system-prompt 1000 \
    --user-history-prompt 2000 \
    --answer-len 100 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --base-url http://localhost:30080/v1
```

Use Ctrl-C to terminate the benchmark at any time; the script will then write each request's detailed stats to `summary.csv`.
Note: the above command requires a serving engine with the `meta-llama/Llama-3.1-8B-Instruct` model served locally at `http://localhost:30080/v1`. Here's an example command to launch the serving engine with vLLM Production Stack:

```bash
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f model.yaml
```

Then set up port-forwarding with the following command:

```bash
sudo kubectl port-forward svc/vllm-router-service 30080:80
```
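Before launching the benchmark, it can help to verify that the endpoint is reachable. Below is a minimal check, assuming the router exposes the standard OpenAI-compatible `/v1/models` route:

```python
# Minimal reachability check against the (assumed) OpenAI-compatible endpoint.
import requests

resp = requests.get("http://localhost:30080/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json().get("data", [])])  # served model names
```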
### Explanation of the arguments

#### Configuring the workload

- `--num-users <int>`: The maximum number of concurrent users in the system (N in the above figure).
- `--num-rounds <int>`: The number of rounds per user (M in the above figure).
- `--qps <float>`: The overall queries-per-second (QPS) rate for the system (see the sketch after this list).
- `--shared-system-prompt <int>`: Length of the system prompt shared across all users (in tokens).
- `--user-history-prompt <int>`: Length of the user-specific context simulating existing chat history (in tokens).
- `--answer-len <int>`: Expected length of each answer (in tokens).
- `--init-user-id <int>`: The initial user ID to start the benchmark from (default = 0). This is useful when you want to resume the benchmark from a specific user ID, or to avoid the serving engine reusing cached requests from previous runs.
- `--request-with-user-id`: If this option is present, the script will include the user ID in the request header.
- `--sharegpt`: If this option is present, the script will use the ShareGPT workload instead of dummy context.
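To make the `--qps` setting concrete, here is a small sketch of how a target QPS is commonly turned into request launch times using exponentially distributed (Poisson) inter-arrival gaps. This illustrates the concept only; the actual scheduling policy of `multi-round-qa.py` may differ.

```python
# Sketch: spacing request launch times to hit a target QPS.
# Assumes Poisson arrivals (exponential gaps with mean 1/qps seconds);
# the actual script's scheduler may differ.
import random

def launch_times(qps: float, duration_s: float) -> list[float]:
    t, times = 0.0, []
    while t < duration_s:
        t += random.expovariate(qps)
        times.append(t)
    return times

# At 0.5 QPS over 10 minutes, expect roughly 300 launches.
print(len(launch_times(0.5, 600)))
```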
_Note:_ If you use the ShareGPT dataset, the expected answer length (in tokens) is determined by the minimum of the dataset response length and `--answer-len`. You also need to follow the instructions in **ShareGPT Datasets** first.
#### Configuring the serving engine connection

- `--model <str>`: The model name (e.g., `mistralai/Mistral-7B-Instruct-v0.2`).
- `--base-url <str>`: The URL endpoint for the language model server.
#### Configuring the experiment (Optional)

- `--output <str>`: The CSV file to dump the detailed stats for each query (default = `summary.csv`).
- `--log-interval <float>`: Time between each performance summary log, in seconds (default = 30).
- `--time <float>`: Total time to run the experiment (default = run forever).
#### Processing previous outputs only (Optional)

- `--process-summary <filename>`: If this option is present, the script will only process the existing output CSV and print the summary, without running any experiment.
## Benchmark Metrics

- **Queries Per Second (QPS)**: The average number of queries processed by the model per second.
- **Average Prompt Throughput**: Prompt tokens processed per second.
- **Average Generation Throughput**: Response tokens generated per second.
- **Average TTFT (Time to First Token)**: Average time taken for the model to generate the first token of a response.
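To illustrate how these metrics relate to the per-request stats, here is a sketch that aggregates them from a CSV. The column names used below (`launch_time`, `finish_time`, `ttft`, `prompt_tokens`, `generation_tokens`) are hypothetical and may not match what `multi-round-qa.py` actually writes to `summary.csv`.

```python
# Sketch: aggregating benchmark metrics from per-request stats.
# Column names are assumptions, not the script's documented schema.
import pandas as pd

df = pd.read_csv("summary.csv")
duration = df["finish_time"].max() - df["launch_time"].min()  # seconds

qps = len(df) / duration
prompt_tput = df["prompt_tokens"].sum() / duration
gen_tput = df["generation_tokens"].sum() / duration
avg_ttft = df["ttft"].mean()

print(f"QPS: {qps:.2f}, prompt tok/s: {prompt_tput:.1f}, "
      f"gen tok/s: {gen_tput:.1f}, avg TTFT: {avg_ttft:.3f}s")
```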
## ShareGPT Datasets

1. Download and prepare the ShareGPT dataset.
   You can specify the proportion of data to process by providing a number between 0 and 1 as an argument to the script.

   ```bash
   bash prepare_sharegpt_data.sh 1
   ```

   In this example, `1` indicates processing 100% of the dataset. You can adjust this value as needed.
2. Run the benchmark.
   Example:

   ```bash
   python3 multi-round-qa.py \
       --num-users 10 \
       --num-rounds 5 \
       --qps 0.3 \
       --shared-system-prompt 1000 \
       --user-history-prompt 2000 \
       --answer-len 100 \
       --model mistralai/Mistral-7B-Instruct-v0.2 \
       --base-url http://localhost:8000/v1 \
       --sharegpt
   ```
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "llama3"
    repository: "lmcache/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    replicaCount: 1

    requestCPU: 10
    requestMemory: "50Gi"
    requestGPU: 1

    pvcStorage: "50Gi"
    pvcAccessMode:
    - ReadWriteOnce

    vllmConfig:
      enableChunkedPrefill: false
      enablePrefixCaching: false
      maxModelLen: 4096
      dtype: "bfloat16"
      extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"]

    lmcacheConfig:
      enabled: true
      cpuOffloadingBufferSize: "30"

    hf_token: <YOUR HUGGINGFACE TOKEN>