Commit 4119d15

Manually Enable LoRA Adapters using existing Router and vLLM deployment (#206)

* enable lora in helm chart

Signed-off-by: Chen Wang <[email protected]>

* add values yaml for lora

Signed-off-by: Chen Wang <[email protected]>

* fix format issues

Signed-off-by: Chen Wang <[email protected]>

* add values yaml for lora

    edit cb23434 fix format issues
    edit 03d8e5f add values yaml for lora

Signed-off-by: Chen Wang <[email protected]>

* fix comments

Signed-off-by: Chen Wang <[email protected]>

* pre-commit check

Signed-off-by: Chen Wang <[email protected]>

---------

Signed-off-by: Chen Wang <[email protected]>
wangchen615 authored Mar 2, 2025
1 parent 774c964 commit 4119d15
Showing 3 changed files with 264 additions and 0 deletions.
3 changes: 3 additions & 0 deletions helm/templates/deployment-vllm-multi.yaml
@@ -31,6 +31,9 @@ spec:
- "0.0.0.0"
- "--port"
- {{ include "chart.container-port" . | quote }}
{{- if $modelSpec.enableLoRA }}
- "--enable-lora"
{{- end }}
{{- with $modelSpec.vllmConfig }}
{{- if hasKey . "enableChunkedPrefill" }}
- "--enable-chunked-prefill"
213 changes: 213 additions & 0 deletions tutorials/09-lora-enabled-installation.md
@@ -0,0 +1,213 @@
# Tutorial: Setting up vLLM with Llama-2 and LoRA Support

## Introduction

This tutorial guides you through setting up the vLLM Production Stack with Llama-2-7b and LoRA adapter support. This setup enables you to use and switch between different LoRA adapters at runtime.

## Prerequisites

1. All prerequisites from the [minimal installation tutorial](01-minimal-helm-installation.md)
2. A Hugging Face account with access to Llama-2-7b
3. Accepted terms for meta-llama/Llama-2-7b-hf on Hugging Face
4. A valid Hugging Face token

## Steps

### 1. Set up Hugging Face Credentials

First, create a Kubernetes secret with your Hugging Face token:

```bash
kubectl create secret generic huggingface-credentials \
  --from-literal=HUGGING_FACE_HUB_TOKEN=your_token_here
```
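
To confirm the secret landed correctly, you can read the key back and decode it; the output should match your token:

```bash
# Sanity check: decode the stored token (prints it to the terminal)
kubectl get secret huggingface-credentials \
  -o jsonpath='{.data.HUGGING_FACE_HUB_TOKEN}' | base64 --decode; echo
```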

### 2. Deploy vLLM Instance with LoRA Support

#### 2.1: Create Configuration File

Locate the file at tutorials/assets/values-07-lora-enabled.yaml; it contains the following content:

```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "llama2-7b"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-2-7b-hf"

    # Enable LoRA support
    enableLoRA: true

    # Mount Hugging Face credentials and configure LoRA settings
    env:
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          name: huggingface-credentials
          key: HUGGING_FACE_HUB_TOKEN
    - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
      value: "True"

    replicaCount: 1

    # Resource requirements for Llama-2-7b
    requestCPU: 8
    requestMemory: "32Gi"
    requestGPU: 1

    # Optional: Configure storage for LoRA weights
    volumes:
    - name: lora-storage
      emptyDir: {}
    volumeMounts:
    - name: lora-storage
      mountPath: "/lora-weights"
```

#### 2.2: Deploy the Helm Chart

```bash
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm-lora ./helm -f tutorials/assets/values-07-lora-enabled.yaml
```
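
Once the release is up, you can confirm that the `--enable-lora` flag made it into the rendered container args. The deployment name below follows the pod-name pattern used later in this tutorial; adjust it if your release renders a different name:

```bash
# The args should include "--enable-lora" when enableLoRA is true
kubectl get deployment vllm-lora-llama2-7b-deployment-vllm \
  -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep enable-lora
```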

### 3. Using LoRA Adapters

#### 3.1: Download LoRA Adapters

First, download a LoRA adapter from Hugging Face to your persistent volume:

```bash
# Get into the vLLM pod
kubectl exec -it $(kubectl get pods | grep vllm-lora-llama2-7b-deployment-vllm | awk '{print $1}') -- bash

# Inside the pod, download the adapter using Python
mkdir -p /data/lora-adapters
cd /data/lora-adapters
python3 -c "
from huggingface_hub import snapshot_download
adapter_id = 'yard1/llama-2-7b-sql-lora-test' # Example SQL adapter
sql_lora_path = snapshot_download(
    repo_id=adapter_id,
    local_dir='./sql-lora',
    token=__import__('os').environ['HUGGING_FACE_HUB_TOKEN']
)
"

# Verify the adapter files are downloaded
ls -l /data/lora-adapters/sql-lora
```
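
A quick way to sanity-check the download is to inspect the adapter's metadata; Hugging Face LoRA adapters ship an `adapter_config.json` describing the rank and target modules:

```bash
# Still inside the pod: the config file confirms a well-formed adapter
cat /data/lora-adapters/sql-lora/adapter_config.json
```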

#### 3.2: Access the vLLM API

Set up port forwarding to access the vLLM API:

```bash
# Setup port-forward to the vLLM service
kubectl port-forward svc/vllm-lora-router-service 8000:80

# In a new terminal, verify the connection
curl http://localhost:8000/v1/models
```

Note: The service listens on port 80 and forwards traffic to port 8000 in the pod, so we map local port 8000 to the service's port 80.
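
If you want just the model IDs rather than the full JSON payload, a small inline filter helps; this sketch assumes `python3` is available on your workstation:

```bash
# Print only the model IDs returned by the router
curl -s http://localhost:8000/v1/models \
  | python3 -c "import json,sys; print([m['id'] for m in json.load(sys.stdin)['data']])"
```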

#### 3.3: List and Load Models

In a new terminal, forward a port to the vLLM engine service rather than the router service, since the engine is where LoRA adapters are loaded and unloaded.

```bash
kubectl port-forward svc/vllm-lora-engine-service 8001:80
```

```bash
# List available models before loading adapter
curl http://localhost:8001/v1/models

# Load the SQL LoRA adapter
curl -X POST http://localhost:8001/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "sql_adapter",
    "lora_path": "/data/lora-adapters/sql-lora"
  }'
```
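
If the load succeeded, the adapter is served as an additional model, so its name should now appear in the engine's model list:

```bash
# Expect "sql_adapter" in the output once the adapter is loaded
curl -s http://localhost:8001/v1/models | grep -o 'sql_adapter' | head -1
```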

#### 3.4: Generate Text with LoRA

Make inference requests that target the adapter. vLLM's OpenAI-compatible completions API selects a loaded LoRA adapter when you pass its `lora_name` as the `model`:

```bash
# Pass the adapter's lora_name as the model to route the request through it
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sql_adapter",
    "prompt": "Write a SQL query to select all users who have made a purchase in the last 30 days",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
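
For a quick side-by-side check, you can send the same prompt to the base model and to the adapter and compare the outputs. The inline Python merely extracts `choices[0].text` from each response; this is a convenience sketch that assumes `python3` is available:

```bash
# Helper: pull the completion text out of a response on stdin
extract='import json,sys; print(json.load(sys.stdin)["choices"][0]["text"])'

for model in "meta-llama/Llama-2-7b-hf" "sql_adapter"; do
  echo "=== $model ==="
  curl -s -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\", \"prompt\": \"Write a SQL query to select all users\", \"max_tokens\": 100}" \
    | python3 -c "$extract"
done
```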

#### 3.5: Unload a LoRA Adapter

When finished, you can unload the adapter:

```bash
curl -X POST http://localhost:8001/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "sql_adapter"
  }'
```
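
You can confirm the unload by listing the models again; the adapter name should be gone:

```bash
# Expect empty output (and the fallback message) once the adapter is removed
curl -s http://localhost:8001/v1/models | grep -o 'sql_adapter' || echo "adapter unloaded"
```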

Note: Keep the port-forward terminals running while making these requests; stop them with Ctrl+C when you're done.

### 4. Monitoring and Validation

Monitor the deployment status:

```bash
kubectl get pods
```

Expected output should show the pods running:

```plaintext
NAME READY STATUS RESTARTS AGE
vllm-lora-deployment-router-xxxxxx-yyyy 1/1 Running 0 2m38s
vllm-lora-llama2-7b-deployment-xxxxxx-yyyy 1/1 Running 0 2m38s
```
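
The serving engine's logs are the quickest way to confirm that LoRA support is active and that load/unload requests are being handled; the pod-name pattern below matches the one used earlier in this tutorial:

```bash
# Follow the engine logs and watch for LoRA-related messages
kubectl logs -f $(kubectl get pods | grep vllm-lora-llama2-7b-deployment-vllm | awk '{print $1}') \
  | grep -i lora
```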

### 5. Troubleshooting

Common issues and solutions:

1. **Hugging Face Authentication**:
   - Verify your token is correctly set in the Kubernetes secret
   - Check pod logs for authentication errors

2. **Resource Issues**:
   - Ensure your cluster has sufficient GPU memory
   - Monitor GPU utilization using `nvidia-smi`

3. **LoRA Loading Issues**:
   - Verify LoRA weights are in the correct format
   - Check pod logs for adapter loading errors (see the diagnostic commands below)
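
The commands below cover the most common diagnostics for the issues above; the pod-name pattern is the same one used earlier and should be adjusted if your release differs:

```bash
# Recent events and container state for the serving engine pod
kubectl describe pod $(kubectl get pods | grep vllm-lora-llama2-7b-deployment-vllm | awk '{print $1}')

# Scan the engine logs for authentication or adapter-loading errors
kubectl logs $(kubectl get pods | grep vllm-lora-llama2-7b-deployment-vllm | awk '{print $1}') \
  | grep -iE 'error|unauthorized|lora'

# Check GPU memory headroom from inside the pod
kubectl exec -it $(kubectl get pods | grep vllm-lora-llama2-7b-deployment-vllm | awk '{print $1}') -- nvidia-smi
```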

### 6. Cleanup

To remove the deployment:

```bash
helm uninstall vllm-lora
kubectl delete secret huggingface-credentials
```
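
To verify that everything was removed:

```bash
# Both commands should fall through to the echo once cleanup is complete
helm list | grep vllm-lora || echo "release removed"
kubectl get pods | grep vllm-lora || echo "pods removed"
```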

## Additional Resources

- [vLLM LoRA Documentation](https://docs.vllm.ai)
- [Llama-2 Model Card](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- [LoRA Paper](https://arxiv.org/abs/2106.09685)
48 changes: 48 additions & 0 deletions tutorials/assets/values-07-lora-enabled.yaml
@@ -0,0 +1,48 @@
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "llama2-7b"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-2-7b-hf"
    enableLoRA: true

    # Option 1: Direct token
    # hf_token: "your_huggingface_token_here"

    # OR Option 2: Secret reference
    hf_token:
      secretName: "huggingface-credentials"
      secretKey: "HUGGING_FACE_HUB_TOKEN"

    # Other vLLM configs if needed
    vllmConfig:
      maxModelLen: 4096
      dtype: "bfloat16"

    # Mount Hugging Face credentials and configure LoRA settings
    env:
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          name: huggingface-credentials
          key: HUGGING_FACE_HUB_TOKEN
    - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
      value: "True"

    replicaCount: 1

    # Resource requirements for Llama-2-7b
    requestCPU: 8
    requestMemory: "32Gi"
    requestGPU: 1

    pvcStorage: "10Gi"
    pvcAccessMode:
      - ReadWriteOnce

  # Add longer startup probe settings
  startupProbe:
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 120  # Allow up to 1 hour for startup
