Tutorial: Multi-Round QA Benchmark (Multi-GPU)

Introduction

This tutorial provides a step-by-step guide to setting up and running a multi-round QA benchmark that compares the vLLM Production Stack, Naive Kubernetes, and AIBrix on 8 A100 GPUs (gpu_8x_a100_80gb_sxm4) from Lambda Labs.

Table of Contents

Prerequisites

Step 1: Running Benchmarks with vLLM Production Stack

Step 2: Running Benchmarks with Naive Kubernetes

Step 3: Running Benchmarks with AIBrix

Conclusion

Step 1: Running Benchmarks with vLLM Production Stack

First, start a vLLM Production Stack server.

To begin with, create a stack.yaml configuration file:

servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "llama3"
    repository: "lmcache/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    replicaCount: 8
    requestCPU: 10
    requestMemory: "150Gi"
    requestGPU: 1
    pvcStorage: "50Gi"
    pvcAccessMode:
      - ReadWriteOnce
    vllmConfig:
      enableChunkedPrefill: false
      enablePrefixCaching: false
      maxModelLen: 32000
      dtype: "bfloat16"
      extraArgs: ["--disable-log-requests", "--swap-space", 0]
    lmcacheConfig:
      enabled: true
      cpuOffloadingBufferSize: "120"
    hf_token: <YOUR_HUGGINGFACE_TOKEN>

routerSpec:
  resources:
    requests:
      cpu: "2"
      memory: "8G"
    limits:
      cpu: "2"
      memory: "8G"
  routingLogic: "session"
  sessionKey: "x-user-id"
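
Before deploying, it can help to confirm that all 8 GPUs are visible to both the driver and the cluster. A quick sanity check, assuming the standard NVIDIA device plugin is installed on the node:

# List the GPUs seen by the driver (should show 8x A100 80GB)
nvidia-smi -L
# Check that the node advertises them as schedulable GPU resources
kubectl describe nodes | grep -i "nvidia.com/gpu"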

Deploy the vLLM Production Stack server by running:

sudo helm repo add vllm https://vllm-project.github.io/production-stack
sudo helm install vllm vllm/vllm-stack -f stack.yaml
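
If you prefer not to hard-code the Hugging Face token in stack.yaml, Helm can inject it at install time instead. A sketch, assuming the value path matches the file above:

# Override hf_token on the command line rather than editing stack.yaml
export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>
sudo helm install vllm vllm/vllm-stack -f stack.yaml \
  --set "servingEngineSpec.modelSpec[0].hf_token=$HF_TOKEN"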

Then you can verify the pod readiness:

kubectl get pods
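
Model download and engine startup can take several minutes. Instead of polling manually, you can block until every pod reports Ready (a sketch; adjust the timeout to your network speed):

kubectl wait --for=condition=Ready pod --all --timeout=20m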

Once the pods are ready, run the port forwarding:

sudo kubectl port-forward svc/vllm-router-service 30080:80
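
With the port forward running, you can confirm that the router answers OpenAI-compatible requests before starting the benchmark, for example by listing the served models (assuming the standard /v1/models route):

curl http://localhost:30080/v1/models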

Finally, run the benchmarking code:

bash warmup.sh meta-llama/Llama-3.1-8B-Instruct http://localhost:30080/v1/
bash run.sh meta-llama/Llama-3.1-8B-Instruct http://localhost:30080/v1/ stack
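
Because the router uses session-based routing keyed on the x-user-id header, requests that share the same header value should be sent to the same replica. A minimal request sketch (the header value and prompt are arbitrary placeholders):

curl http://localhost:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-user-id: user-1" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'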

Step 2: Running Benchmarks with Naive Kubernetes

First, start a naive Kubernetes server.

To begin with, create a naive.yaml configuration file:

servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "llama3"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    replicaCount: 1
    requestCPU: 10
    requestMemory: "150Gi"
    requestGPU: 1
    pvcStorage: "50Gi"
    pvcMatchLabels:
      model: "llama3"
    pvcAccessMode:
      - ReadWriteOnce
    vllmConfig:
      enableChunkedPrefill: false
      enablePrefixCaching: true
      maxModelLen: 32000
      extraArgs: ["--disable-log-requests", "--swap-space", 0]

    lmcacheConfig:
      enabled: false

    hf_token: <YOUR_HUGGINGFACE_TOKEN>

routerSpec:
  resources:
    requests:
      cpu: "2"
      memory: "8G"
    limits:
      cpu: "2"
      memory: "8G"
  routingLogic: "roundrobin"
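
If the release from Step 1 is still installed, remove it first: both steps reuse the Helm release name vllm, and a second install under the same name will fail.

sudo helm uninstall vllm
# Optionally watch until the old pods have terminated before redeploying
kubectl get pods -w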

Deploy the Naive K8s stack server:

sudo helm repo add vllm https://vllm-project.github.io/production-stack
sudo helm install vllm vllm/vllm-stack -f naive.yaml

Then you can verify the pod readiness:

kubectl get pods

Once the pods are ready, run the port forwarding:

sudo kubectl port-forward svc/vllm-router-service 30080:80

Finally, run the benchmarking code:

bash warmup.sh meta-llama/Llama-3.1-8B-Instruct http://localhost:30080/v1/
bash run.sh meta-llama/Llama-3.1-8B-Instruct http://localhost:30080/v1/ native

Step 3: Running Benchmarks with AIBrix

We followed the installation steps documented in AIBrix's official repo to install the packages needed to run AIBrix on the Lambda server.

To align with the configurations used in benchmarking the vLLM Production Stack and naive K8s, we changed the configurations documented in AIBrix's official repo to enable AIBrix's KV cache CPU offloading. Specifically, in their deployment configuration YAML file, we changed the model name at lines #4, #6, #17, #21, #38, #81, #86, and #99 from deepseek-coder-7b-instruct to llama3-1-8b; line #36 from deepseek-ai/deepseek-coder-6.7b-instruct to meta-llama/Llama-3.1-8B-Instruct; line #57 from deepseek-coder-7b-kvcache-rpc:9600 to llama3-1-8b-kvcache-rpc:9600; and line #73 from /var/run/vineyard-kubernetes/default/deepseek-coder-7b-kvcache to /var/run/vineyard-kubernetes/default/llama3-1-8b-kvcache. We also changed the CPU offload memory limit at line #47 from 10 to 120 to match the configuration used in Step 1. Finally, we changed the replica count at line #9 from 1 to 8.

We also updated AIBrix's KV cache server config: at line #4, we changed deepseek-coder-7b-kvcache to llama3-1-8b-kvcache; at line #7, deepseek-coder-7b-instruct to llama3-1-8b; and at line #17, we raised the CPU memory limit from 4Gi to 150Gi to align with the configuration used in Step 1.
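
The name substitutions above are mechanical, so they can also be applied with sed. A sketch, assuming the AIBrix deployment and KV cache config files are saved locally as deployment.yaml and kvcache.yaml (hypothetical names; use the actual filenames from the AIBrix repo). The replica count, offload buffer, and memory limit changes touch specific lines and are best edited by hand:

# Substitute the model and cache names in both files (hypothetical filenames)
sed -i \
  -e 's|deepseek-ai/deepseek-coder-6.7b-instruct|meta-llama/Llama-3.1-8B-Instruct|g' \
  -e 's|deepseek-coder-7b-kvcache|llama3-1-8b-kvcache|g' \
  -e 's|deepseek-coder-7b-instruct|llama3-1-8b|g' \
  deployment.yaml kvcache.yaml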

Finally, we followed the steps in AIBrix's official repo to start the AIBrix server and then ran the benchmarking code:

bash warmup.sh llama3 http://localhost:8888/v1/
bash run.sh llama3 http://localhost:8888/v1/ aibrix

Conclusion

This tutorial provides a comprehensive guide to setting up and benchmarking the vLLM Production Stack, Naive Kubernetes, and AIBrix. By following these steps, you can effectively evaluate their performance in your environment.