[Doc] Update aibrix documentation #533

Merged
merged 3 commits on Dec 16, 2024
42 changes: 42 additions & 0 deletions docs/source/development/development.rst
@@ -3,3 +3,45 @@
===========
Development
===========

Build and Run
-------------

We encourage contributors to build and test AIBrix in a local development environment for most cases.
If you use a MacBook, `Docker for Desktop <https://www.docker.com/products/docker-desktop/>`_ is the most convenient tool to use.

The following command builds the ``nightly`` docker images.

.. code-block:: bash

make docker-build-all
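
If your dev cluster runs on kind, you can then load the freshly built images into it. The image names below are illustrative assumptions; match them to the tags that ``make docker-build-all`` actually produces.

.. code-block:: bash

    # load locally built nightly images into the kind cluster (image names are hypothetical)
    kind load docker-image aibrix/controller-manager:nightly
    kind load docker-image aibrix/gateway-plugins:nightly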

Run the following commands to quickly deploy the latest code changes to your dev Kubernetes environment.

.. code-block:: bash

kubectl create -f config/dependency
kubectl create -f config/default
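
To verify the deployment, check that the AIBrix pods become ready:

.. code-block:: bash

    kubectl get pods -n aibrix-system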


If you want to clean up everything and reinstall the latest code:

.. code-block:: bash

kubectl delete -f config/default
kubectl delete -f config/dependency

Mocked CPU App
--------------

To run the control plane and data plane end-to-end in development environments, we built a mocked app that simulates a model server.
It currently supports basic model inference, metrics, and the LoRA feature. Feel free to enrich the features. Check the ``development`` folder for more details.
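
A minimal sketch of how the mocked app might be exercised locally (the Dockerfile path, image tag, and port are assumptions; check the ``development`` folder for the actual layout):

.. code-block:: bash

    # build the mocked model server image (path and tag are hypothetical)
    docker build -t aibrix/vllm-mock:nightly -f development/app/Dockerfile development/app

    # run it and hit the OpenAI-style endpoint it mocks
    docker run -d -p 8000:8000 aibrix/vllm-mock:nightly
    curl http://localhost:8000/v1/models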


Test on GPU Cluster
-------------------

If you need to test models in a real GPU environment, we highly recommend the `Lambda Labs <https://lambdalabs.com/>`_ platform for installing and testing kind-based deployments.

.. attention::
Kind itself doesn't support GPUs yet. To use kind with GPU support, check out `nvkind <https://github.com/klueska/nvkind>`_.
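
As a sketch, creating a GPU-enabled kind cluster with nvkind might look like the following; the exact command and flags are assumptions that may differ across nvkind versions, so consult its README.

.. code-block:: bash

    # create a kind cluster with host GPUs passed through (flags are illustrative)
    nvkind cluster create --name aibrix-gpu

    # confirm the nodes advertise GPU resources
    kubectl describe nodes | grep nvidia.com/gpu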
29 changes: 14 additions & 15 deletions docs/source/development/release.rst
@@ -10,10 +10,17 @@ Release
This process outlines the steps required to create and publish a release for the AIBrix GitHub project.
Follow these steps to ensure a smooth and consistent release cycle.

1. Prepare the code
-----------------------------
Prepare the code
----------------

Option 1: RC version release
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For an RC release like ``v0.2.0-rc.1``, there's no need to check out a new branch; cut the tag & release
directly against the ``main`` branch.
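
For example (the tag name is illustrative):

.. code-block:: bash

    git checkout main
    git pull origin main
    git tag v0.2.0-rc.1
    git push origin v0.2.0-rc.1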

Option 1 minor version release

Option 2: minor version release
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For new minor version release like ``v0.1.0``, please checkout a new branch named ``release-0.1``.
@@ -25,24 +32,16 @@ For new minor version release like ``v0.1.0``, please checkout a new branch name
git push origin release-0.1

.. note::
If origin doesn't points to upstream, let's say you fork the remote, ``upstream`` or other remotes should be right remote to push to.
Here we assume ``origin`` points to upstream; if it doesn't, another remote such as ``upstream`` is the right one to push to.

Option 2: patch version release
Option 3: patch version release
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Cut a PR to sync `main` branch changes to `release-0.1`, a example PR is like `Sync main branch changes to release-0.1 for rc4 release <https://github.com/aibrix/aibrix/pull/312>`_
Bug fixes should be merged into ``main`` first, then cherry-picked to the target release branch such as ``release-0.1``.
Due to changes on ``main``, a fix may not apply cleanly to ``release-0.1``; if that's the case, cut a PR against the release branch directly.
For a patch version like ``v0.1.1``, please reuse the release branch ``release-0.1``, which should have been created earlier for the minor version release.
For a patch release, we do not rebase ``main`` because that would introduce new features; all fixes have to be cherry-picked or submitted as PRs against ``release-0.1`` directly.
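
For example, a typical cherry-pick onto the release branch looks like this (the commit SHA is a placeholder):

.. code-block:: bash

    git checkout release-0.1
    git cherry-pick <bugfix-commit-sha>
    git push origin release-0.1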

.. code-block:: bash

git checkout release-0.1
git fetch origin
git rebase origin/release-0.1

# not need to push, it should be update to date.


Cut a PR
--------

2 changes: 1 addition & 1 deletion docs/source/features/autoscaling.rst
@@ -16,7 +16,7 @@ In the following sections, we will demonstrate how users can create various type
KPA Autoscaler
--------------

The KPA, inspired by Knative, maintains two time windows: a longer "stable window" and a shorter "panic window". It rapidly scales up resources in response to sudden spikes in traffic based on the panic window measurements.
The KPA, inspired by Knative, maintains two time windows: a longer ``stable window`` and a shorter ``panic window``. It rapidly scales up resources in response to sudden spikes in traffic based on the panic window measurements.

Unlike other solutions that might rely on Prometheus for gathering deployment metrics, AIBrix fetches and maintains metrics internally, enabling faster response times.
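
A rough pseudocode sketch of the two-window mechanism (this follows Knative's KPA semantics, from which the design is inspired; AIBrix's exact thresholds and defaults may differ):

.. code-block:: bash

    # panic_ratio = desired_pods(panic_window) / ready_pods
    # if panic_ratio >= panic_threshold (default 200%):
    #     enter panic mode: scale to desired_pods(panic_window), never scale down
    # else:
    #     scale to desired_pods(stable_window)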

9 changes: 5 additions & 4 deletions docs/source/features/gateway-plugins.rst
@@ -53,14 +53,15 @@ To set up rate limiting, add the user header in the request, like this:
Routing Strategies
------------------

Gateway supports two routing strategies right now.
1. least-request: routes request to a pod with least ongoing request.
2. throughput: routes request to a pod which has processed lowest tokens.
Gateway supports three routing strategies right now.

* random: routes the request to a random pod.
* least-request: routes the request to the pod with the fewest ongoing requests.
* throughput: routes the request to the pod that has processed the fewest tokens.

.. code-block:: bash

curl -v http://localhost:8888/v1/chat/completions \
-H "user: your-user-name" \
-H "routing-strategy: least-request" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer any_key" \
33 changes: 19 additions & 14 deletions docs/source/getting_started/installation.rst
@@ -22,8 +22,11 @@ Stable Version

.. code:: bash

kubectl apply -f https://github.com/aibrix/aibrix/releases/download/v0.1.1/aibrix-dependency-v0.2.0-rc.1.yaml
kubectl apply -f https://github.com/aibrix/aibrix/releases/download/v0.1.1/aibrix-core-v0.2.0-rc.1.yaml
# Install component dependencies
kubectl apply -f https://github.com/aibrix/aibrix/releases/download/v0.2.0-rc.1/aibrix-dependency-v0.2.0-rc.1.yaml

# Install aibrix components
kubectl apply -f https://github.com/aibrix/aibrix/releases/download/v0.2.0-rc.1/aibrix-core-v0.2.0-rc.1.yaml
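
After applying both manifests, you can verify that the control plane pods come up (the same check shown in the quickstart):

.. code-block:: bash

    kubectl get pods -n aibrix-system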


Nightly Version
@@ -37,33 +40,35 @@ Nightly Version

# Install component dependencies
kubectl create -k config/dependency

# Install aibrix components
kubectl create -k config/default


Install Individual AIBrix Components
------------------------------------


Autoscaler
^^^^^^^^^^

.. code:: bash

# autoscaler controller
kubectl apply -k config/standalone/autoscaler-controller/

# distributed inference orchestrations controller

Distributed Inference
^^^^^^^^^^^^^^^^^^^^^

.. code:: bash

kubectl apply -k config/standalone/distributed-inference-controller/

# model adapter controller
kubectl apply -k config/standalone/model-adapter-controller


Model Adapter (LoRA)
^^^^^^^^^^^^^^^^^^^^

Install AIBrix on Kind Cluster
------------------------------
.. code:: bash

.. attention::
Kind itself doesn't support GPU yet. In order to use the kind version with GPU support, feel free to checkout `nvkind <https://github.com/klueska/nvkind>`_.
kubectl apply -k config/standalone/model-adapter-controller

We use `Lambda Labs <https://lambdalabs.com/>`_ platform to install and test kind based deployment.

TODO
53 changes: 26 additions & 27 deletions docs/source/getting_started/quickstart.rst
@@ -7,14 +7,28 @@ Quickstart
Install AIBrix
^^^^^^^^^^^^^^

Get your Kubernetes cluster ready, then run the following commands to install AIBrix components in your cluster.

.. note::
If following way doesn't work for you, please check installation guidance for more installation options.
If you just want to install specific components or a specific version, please check the installation guide for more options.

.. code-block:: bash

kubectl apply -f https://github.com/aibrix/aibrix/releases/download/v0.2.0-rc.1/aibrix-dependency-v0.2.0-rc.1.yaml
kubectl apply -f https://github.com/aibrix/aibrix/releases/download/v0.2.0-rc.1/aibrix-core-v0.2.0-rc.1.yaml

Wait a few minutes, then run ``kubectl get pods -n aibrix-system`` and check the pod status until they are ready.

.. code-block:: bash

NAME READY STATUS RESTARTS AGE
aibrix-controller-manager-56576666d6-gsl8s 1/1 Running 0 5h24m
aibrix-gateway-plugins-c6cb7545-r4xwj 1/1 Running 0 5h24m
aibrix-gpu-optimizer-89b9d9895-t8wnq 1/1 Running 0 5h24m
aibrix-kuberay-operator-6dcf94b49f-l4522 1/1 Running 0 5h24m
aibrix-metadata-service-6b4d44d5bd-h5g2r 1/1 Running 0 5h24m
aibrix-redis-master-84769768cb-fsq45 1/1 Running 0 5h24m


Deploy base model
^^^^^^^^^^^^^^^^^
@@ -28,16 +42,16 @@ Save yaml as `deployment.yaml` and run `kubectl apply -f deployment.yaml`.
metadata:
labels:
# Note: The label value `model.aibrix.ai/name` here must match with the service name.
model.aibrix.ai/name: llama-2-7b-hf
model.aibrix.ai/name: qwen25-7b-instruct
model.aibrix.ai/port: "8000"
adapter.model.aibrix.ai/enabled: "true"
name: llama-2-7b-hf
name: qwen25-7b-instruct
namespace: default
spec:
replicas: 1
selector:
matchLabels:
model.aibrix.ai/name: llama-2-7b-hf
model.aibrix.ai/name: qwen25-7b-instruct
strategy:
rollingUpdate:
maxSurge: 25%
@@ -46,7 +60,7 @@ Save yaml as `deployment.yaml` and run `kubectl apply -f deployment.yaml`.
template:
metadata:
labels:
model.aibrix.ai/name: llama-2-7b-hf
model.aibrix.ai/name: qwen25-7b-instruct
spec:
containers:
- command:
@@ -58,10 +72,10 @@ Save yaml as `deployment.yaml` and run `kubectl apply -f deployment.yaml`.
- --port
- "8000"
- --model
- meta-llama/Llama-2-7b-hf
- Qwen/Qwen2.5-7B-Instruct
- --served-model-name
# Note: The `--served-model-name` argument value must also match the Service name and the Deployment label `model.aibrix.ai/name`
- llama-2-7b-hf
- qwen25-7b-instruct
- --trust-remote-code
- --enable-lora
env:
@@ -116,12 +130,12 @@ Save yaml as `service.yaml` and run `kubectl apply -f service.yaml`.
metadata:
labels:
# Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
model.aibrix.ai/name: llama-2-7b-hf
model.aibrix.ai/name: qwen25-7b-instruct
prometheus-discovery: "true"
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
name: llama-2-7b-hf
name: qwen25-7b-instruct
namespace: default
spec:
ports:
@@ -134,7 +148,7 @@ Save yaml as `service.yaml` and run `kubectl apply -f service.yaml`.
protocol: TCP
targetPort: 8080
selector:
model.aibrix.ai/name: llama-2-7b-hf
model.aibrix.ai/name: qwen25-7b-instruct
type: ClusterIP

.. note::
@@ -145,20 +159,6 @@ Save yaml as `service.yaml` and run `kubectl apply -f service.yaml`.
2. The `--served-model-name` argument value in the `Deployment` command is also consistent with the `Service` name and `model.aibrix.ai/name` label.


Register a user to authenticate the gateway
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

kubectl -n aibrix-system port-forward svc/aibrix-gateway-users 8090:8090

.. code-block:: bash

curl http://localhost:8090/CreateUser \
-H "Content-Type: application/json" \
-d '{"name": "test-user","rpm": 100,"tpm": 10000}'



Invoke the model endpoint using gateway api
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -174,10 +174,9 @@

curl -v http://localhost:8888/v1/completions \
-H "Content-Type: application/json" \
-H "user: test-user" \
-H "model: meta-llama/Llama-2-7b-hf" \
-H "model: qwen25-7b-Instruct" \
-d '{
"model": "meta-llama/llama-2-7b-hf",
"model": "qwen25-7b-Instruct",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0