Make markdownlint happy

Signed-off-by: Harry Mellor <[email protected]>
hmellor committed Jan 29, 2025
1 parent e8a26da commit c953ce2

Showing 12 changed files with 262 additions and 222 deletions.
5 changes: 5 additions & 0 deletions .markdownlint.yaml
@@ -0,0 +1,5 @@
MD013: false # line-length
MD028: false # no-blanks-blockquote
MD029: # ol-prefix
style: ordered
MD033: false # no-inline-html
13 changes: 4 additions & 9 deletions README.md
@@ -1,13 +1,12 @@
# vLLM Production Stack: reference stack for production vLLM deployment


The **vLLM Production Stack** project provides a reference implementation of how to build an inference stack on top of vLLM, which allows you to:

- 🚀 Scale from single vLLM instance to distributed vLLM deployment without changing any application code
- 💻 Monitor the deployment through a web dashboard
- 😄 Enjoy the performance benefits brought by request routing and KV cache offloading

## Latest News:
## Latest News

- 🔥 vLLM Production Stack is released! Check out our [release blogs](https://blog.lmcache.ai/2025-01-21-stack-release) [01-22-2025]
- ✨ Join us in the #production-stack channel of the vLLM [slack](https://slack.vllm.ai/), the LMCache [slack](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ), or fill out this [interest form](https://forms.gle/wSoeNpncmPVdXppg8) for a chat!
@@ -20,7 +19,6 @@ The stack is set up using [Helm](https://helm.sh/docs/), and contains the follow
- **Request router**: Directs requests to appropriate backends based on routing keys or session IDs to maximize KV cache reuse.
- **Observability stack**: Monitors the metrics of the backends through [Prometheus](https://github.com/prometheus/prometheus) + [Grafana](https://grafana.com/)


<img src="https://github.com/user-attachments/assets/8f05e7b9-0513-40a9-9ba9-2d3acca77c0c" alt="Architecture of the stack" width="800"/>

## Roadmap
@@ -42,6 +40,7 @@ We are actively working on this project and will release the following features
### Deployment

The vLLM Production Stack can be deployed via Helm charts. Clone the repository locally and execute the following commands for a minimal deployment:

```bash
git clone https://github.com/vllm-project/production-stack.git
cd production-stack/
@@ -55,21 +54,18 @@ To validate the installation and send a query to the stack, refer to [this tut

For more information about customizing the helm chart, please refer to [values.yaml](https://github.com/vllm-project/production-stack/blob/main/helm/values.yaml) and our other [tutorials](https://github.com/vllm-project/production-stack/tree/main/tutorials).


### Uninstall

```bash
sudo helm uninstall vllm
```


## Grafana Dashboard

### Features

The Grafana dashboard provides the following insights:


1. **Available vLLM Instances**: Displays the number of healthy instances.
2. **Request Latency Distribution**: Visualizes end-to-end request latency.
3. **Time-to-First-Token (TTFT) Distribution**: Monitors response times for token generation.
@@ -98,7 +94,6 @@ The router ensures efficient request distribution among backends. It supports:
- Session-ID based routing
- (WIP) prefix-aware routing
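
To make session-ID based routing concrete, here is a minimal, illustrative sketch of sticky-session routing in Python. It is not the router's actual implementation; the backend URLs and the idea of hashing a session ID are assumptions used purely for illustration.

```python
import hashlib

# Hypothetical backend list; in the real stack these would be the serving
# engine URLs passed to the router (e.g. via --static-backends).
BACKENDS = [
    "http://localhost:9001",
    "http://localhost:9002",
    "http://localhost:9003",
]


def pick_backend(session_id: str) -> str:
    """Deterministically map a session ID to one backend so that requests
    from the same session keep hitting the same vLLM instance and can
    reuse its KV cache."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]


if __name__ == "__main__":
    for sid in ("user-a", "user-b", "user-a"):
        # "user-a" resolves to the same backend on both calls.
        print(sid, "->", pick_backend(sid))
```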


## Contributing

Contributions are welcome! Please follow the standard GitHub flow:
@@ -109,12 +104,12 @@ Contributions are welcome! Please follow the standard GitHub flow:

We use `pre-commit` for formatting; it is installed as follows:

```console
```bash
pip install -r requirements-lint.txt
pre-commit install
```

> You can read more about `pre-commit` at https://pre-commit.com.
> You can read more about `pre-commit` at <https://pre-commit.com>.
## License

4 changes: 2 additions & 2 deletions helm/README.md
@@ -2,14 +2,14 @@

This helm chart lets users deploy multiple serving engines and a router into the Kubernetes cluster.

## Key features:
## Key features

- Support running multiple serving engines with multiple different models
- Load the model weights directly from the existing PersistentVolumes

## Prerequisites

1. A running Kubernetes cluster with GPU. (You can set it up through `minikube`: https://minikube.sigs.k8s.io/docs/tutorials/nvidia/)
1. A running Kubernetes cluster with GPU. (You can set it up through `minikube`: <https://minikube.sigs.k8s.io/docs/tutorials/nvidia/>)
2. [Helm](https://helm.sh/docs/intro/install/)

## Install the helm chart
3 changes: 2 additions & 1 deletion src/tests/README.md
@@ -15,6 +15,7 @@ MODEL = "meta-llama/Llama-3.1-8B-Instruct"
```

Then, execute the following command in a terminal:

```bash
python3 test-openai.py
```
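
For reference, a functionality test of this kind can be as small as the sketch below, which points the `openai` Python client at a locally running server. It is only an illustration of the idea; the actual `test-openai.py` may differ, and the base URL and API key are assumptions.

```python
from openai import OpenAI  # pip install openai

MODEL = "meta-llama/Llama-3.1-8B-Instruct"

# Assumed local endpoint; adjust to wherever your engine or router listens.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```
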
@@ -30,7 +31,7 @@ The `perftest/` folder contains the performance test scripts for the router. Spe
- `run-server.sh` and `run-multi-server.sh`: launch one or multiple mock-up OpenAI API servers
- `clean-up.sh`: kills the mock-up OpenAI API server processes.
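
As a rough illustration of what such a request generator does, the sketch below fires a fixed number of concurrent requests at a mock-up server and reports latency statistics. It is not one of the scripts in `perftest/`; the URL, concurrency, and payload are assumptions for the example.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed mock-up OpenAI-compatible endpoint (e.g. one started by run-server.sh).
URL = "http://localhost:8000/v1/completions"
PAYLOAD = json.dumps({
    "model": "fake-model",
    "prompt": "Hello",
    "max_tokens": 10,
}).encode()


def one_request(_: int) -> float:
    """Send a single completion request and return its end-to-end latency."""
    start = time.perf_counter()
    req = urllib.request.Request(
        URL, data=PAYLOAD, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req).read()
    return time.perf_counter() - start


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=8) as pool:
        latencies = sorted(pool.map(one_request, range(100)))
    # latencies[94] is the 95th of 100 sorted samples, i.e. an approximate p95.
    print(f"mean={sum(latencies) / len(latencies):.3f}s  p95={latencies[94]:.3f}s")
```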

### Example router performance test:
### Example router performance test

Here's an example setup of running the router performance test:

3 changes: 2 additions & 1 deletion src/vllm_router/README.md
@@ -53,7 +53,8 @@ docker build -t <image_name>:<tag> -f docker/Dockerfile .
## Example commands to run the router

**Example 1:** running the router locally at port 8000 in front of multiple serving engines:
```

```bash
python3 router.py --port 8000 \
--service-discovery static \
--static-backends "http://localhost:9001,http://localhost:9002,http://localhost:9003" \
21 changes: 5 additions & 16 deletions tutorials/00-install-kubernetes-env.md
@@ -4,8 +4,6 @@

This tutorial guides you through the process of setting up a Kubernetes environment on a GPU-enabled server. We will install and configure `kubectl`, `helm`, and `minikube`, ensuring GPU compatibility for workloads requiring accelerated computing. By the end of this tutorial, you will have a fully functional Kubernetes environment ready to deploy the vLLM Production Stack.

---

## Table of Contents

- [Introduction](#introduction)
@@ -17,8 +15,6 @@ This tutorial guides you through the process of setting up a Kubernetes environm
- [Step 3: Installing Minikube with GPU Support](#step-3-installing-minikube-with-gpu-support)
- [Step 4: Verifying GPU Configuration](#step-4-verifying-gpu-configuration)

---

## Prerequisites

Before you begin, ensure the following:
@@ -35,8 +31,6 @@ Before you begin, ensure the following:
- A Linux-based operating system (e.g., Ubuntu 20.04 or later).
- Basic understanding of Linux shell commands.

---

## Steps

### Step 1: Installing kubectl
@@ -71,8 +65,6 @@ Before you begin, ensure the following:
Client Version: v1.32.1
```

---

### Step 2: Installing Helm

1. Execute the script `install-helm.sh`:
@@ -99,8 +91,6 @@ Before you begin, ensure the following:
version.BuildInfo{Version:"v3.17.0", GitCommit:"301108edc7ac2a8ba79e4ebf5701b0b6ce6a31e4", GitTreeState:"clean", GoVersion:"go1.23.4"}
```

---

### Step 3: Installing Minikube with GPU Support

1. Execute the script `install-minikube-cluster.sh`:
@@ -116,6 +106,7 @@

3. **Expected Output:**
If everything goes smoothly, you should see example output like the following:

```plaintext
😄 minikube v1.35.0 on Ubuntu 22.04 (kvm/amd64)
❗ minikube skips various validations when --force is supplied; this may lead to unexpected behavior
@@ -135,8 +126,6 @@
TEST SUITE: None
```
---
### Step 4: Verifying GPU Configuration
1. Ensure Minikube is running:
@@ -145,7 +134,7 @@
sudo minikube status
```
Expected Output:
Expected output:
```plaintext
minikube
@@ -162,7 +151,7 @@
sudo kubectl describe nodes | grep -i gpu
```
Expected Output:
Expected output:
```plaintext
nvidia.com/gpu: 1
@@ -181,12 +170,12 @@
sudo kubectl logs gpu-test
```
You should see the nvidia-smi output from the terminal
---
You should see the nvidia-smi output from the terminal
## Conclusion
By following this tutorial, you have successfully set up a Kubernetes environment with GPU support on your server. You are now ready to deploy and test vLLM Production Stack on Kubernetes. For further configuration and workload-specific setups, consult the official documentation for `kubectl`, `helm`, and `minikube`.
What's next:
- [01-minimal-helm-installation](https://github.com/vllm-project/production-stack/blob/main/tutorials/01-minimal-helm-installation.md)
43 changes: 35 additions & 8 deletions tutorials/01-minimal-helm-installation.md
@@ -1,9 +1,11 @@
# Tutorial: Minimal Setup of the vLLM Production Stack

## Introduction

This tutorial guides you through a minimal setup of the vLLM Production Stack using one vLLM instance with the `facebook/opt-125m` model. By the end of this tutorial, you will have a working deployment of vLLM on a Kubernetes environment with GPU.

## Table of Contents

- [Introduction](#introduction)
- [Table of Contents](#table-of-contents)
- [Prerequisites](#prerequisites)
@@ -12,11 +14,12 @@ This tutorial guides you through a minimal setup of the vLLM Production Stack us
- [2. Validate Installation](#2-validate-installation)
- [3. Send a Query to the Stack](#3-send-a-query-to-the-stack)
- [3.1. Forward the Service Port](#31-forward-the-service-port)
- [3.2. Query the OpenAI-Compatible API](#32-query-the-openai-compatible-api)
- [3.2. Query the OpenAI-Compatible API to list the available models](#32-query-the-openai-compatible-api-to-list-the-available-models)
- [3.3. Query the OpenAI Completion Endpoint](#33-query-the-openai-completion-endpoint)
- [4. Uninstall](#4-uninstall)

## Prerequisites

1. A Kubernetes environment with GPU support. If not set up, follow the [00-install-kubernetes-env](00-install-kubernetes-env.md) guide.
2. Helm installed. Refer to the [install-helm.sh](install-helm.sh) script for instructions.
3. kubectl installed. Refer to the [install-kubectl.sh](install-kubectl.sh) script for instructions.
@@ -27,7 +30,8 @@ This tutorial guides you through a minimal setup of the vLLM Production Stack us

### 1. Deploy vLLM Instance

#### Step 1.1: Use Predefined Configuration
#### 1.1: Use Predefined Configuration

The vLLM Production Stack repository provides a predefined configuration file, `values-minimal-example.yaml`, located at `tutorials/assets/values-minimal-example.yaml`. This file contains the following content:

```yaml
@@ -48,6 +52,7 @@ servingEngineSpec:
```
Explanation of the key fields:
- **`modelSpec`**: Defines the model configuration, including:
- `name`: A name for the model deployment.
- `repository`: Docker repository hosting the model image.
@@ -58,47 +63,63 @@ Explanation of the key fields:
- **`requestGPU`**: Specifies the number of GPUs required.
- **`pvcStorage`**: Allocates persistent storage for the model.

#### Step 1.2: Deploy the Helm Chart
#### 1.2: Deploy the Helm Chart

Deploy the Helm chart using the predefined configuration file:

```bash
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/production-stack -f tutorials/assets/values-minimal-example.yaml
```

Explanation of the command:

- `vllm` in the first command: The Helm repository.
- `vllm` in the second command: The name of the Helm release.
- `-f tutorials/assets/values-minimal-example.yaml`: Specifies the predefined configuration file.

### 2. Validate Installation

#### Step 2.1: Monitor Deployment Status
#### 2.1: Monitor Deployment Status

Monitor the deployment status using:

```bash
sudo kubectl get pods
```

Expected output:

- Pods for the `vllm` deployment should transition to `Ready` and the `Running` state.
```

```plaintext
NAME READY STATUS RESTARTS AGE
vllm-deployment-router-859d8fb668-2x2b7 1/1 Running 0 2m38s
vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs 1/1 Running 0 2m38s
```

_Note_: It may take some time for the containers to download the Docker images and LLM weights.
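
If you would rather wait programmatically than re-run the command by hand, a small helper like the sketch below polls the same `kubectl get pods` output until every pod reports `Running`. This is a convenience sketch only; drop `sudo` if your setup does not require it, and adjust the timeout to taste.

```python
import subprocess
import time


def wait_for_pods(timeout_s: int = 600, poll_s: int = 10) -> None:
    """Poll `kubectl get pods` until every pod is Running, or raise on timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        out = subprocess.run(
            ["sudo", "kubectl", "get", "pods", "--no-headers"],
            capture_output=True, text=True, check=True,
        ).stdout
        # The third column of `kubectl get pods` output is the STATUS field.
        statuses = [line.split()[2] for line in out.splitlines() if line.strip()]
        if statuses and all(status == "Running" for status in statuses):
            print("All pods are Running")
            return
        time.sleep(poll_s)
    raise TimeoutError("Pods did not reach the Running state in time")


if __name__ == "__main__":
    wait_for_pods()
```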

### 3. Send a Query to the Stack

#### Step 3.1: Forward the Service Port
#### 3.1: Forward the Service Port

Expose the `vllm-router-service` port to the host machine:

```bash
sudo kubectl port-forward svc/vllm-router-service 30080:80
```

#### Step 3.2: Query the OpenAI-Compatible API to list the available models
#### 3.2: Query the OpenAI-Compatible API to list the available models

Test the stack's OpenAI-compatible API by querying the available models:

```bash
curl -o- http://localhost:30080/models
```

Expected output:

```json
{
"object": "list",
@@ -114,8 +135,10 @@ Expected output:
}
```
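
As an alternative to `curl`, the same check can be done from Python. The short script below is illustrative only and assumes the port-forward from step 3.1 is still active.

```python
import json
import urllib.request

# Assumes `kubectl port-forward svc/vllm-router-service 30080:80` is running.
with urllib.request.urlopen("http://localhost:30080/models") as resp:
    models = json.load(resp)

for model in models.get("data", []):
    print(model["id"])  # e.g. facebook/opt-125m
```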

#### Step 3.3: Query the OpenAI Completion Endpoint
#### 3.3: Query the OpenAI Completion Endpoint

Send a query to the OpenAI `/completions` endpoint to generate a completion for a prompt:

```bash
curl -X POST http://localhost:30080/completions \
-H "Content-Type: application/json" \
@@ -125,7 +148,9 @@ curl -X POST http://localhost:30080/completions \
"max_tokens": 10
}'
```

Expected output:

```json
{
"id": "completion-id",
@@ -141,11 +166,13 @@ Expected output:
]
}
```

This demonstrates the model generating a continuation for the provided prompt.
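
The completion request can likewise be sent from Python. This sketch simply mirrors the `curl` call above and makes the same assumption that the port-forward is active.

```python
import json
import urllib.request

payload = json.dumps({
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10,
}).encode()

request = urllib.request.Request(
    "http://localhost:30080/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    completion = json.load(resp)

print(completion["choices"][0]["text"])
```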

### 4. Uninstall

To remove the deployment, run:

```bash
sudo helm uninstall vllm
```