🚀 Feature Description and Motivation
I created the base model following https://aibrix.readthedocs.io/latest/features/lora-dynamic-loading.html#create-base-model, and it works well. But when I create the LoRA model adapter per https://aibrix.readthedocs.io/latest/features/lora-dynamic-loading.html#create-lora-model-adapter, the adapter status is abnormal: the Bound condition reports ModelAdapterLoadingError (see the describe output below). What is happening here?
Separately, a question about the LoRA workflow (https://aibrix.readthedocs.io/latest/_images/lora-controller-workflow.png): is it correct that the controller must call the base model pod to load the new adapter, and then create a Service pointing to the base pod, so that the ModelAdapter is served as a new model by that pod?
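For reference, this is a hedged sketch of how vLLM's dynamic-LoRA endpoints (`/v1/load_lora_adapter` and `/v1/unload_lora_adapter`, visible in the server's route list in the Use Case log) are typically invoked. The host, adapter name, and adapter path below are illustrative assumptions, not values taken from this cluster:

```shell
# Assumed vLLM server address; the server must allow runtime LoRA updates
# (e.g. started with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True).
HOST="http://localhost:8000"

# lora_name is the name the adapter will be served under; lora_path must
# point at a LoRA adapter directory (adapter_config.json + weights),
# not a full base-model checkpoint. Both values here are hypothetical.
PAYLOAD='{"lora_name": "qwen-code-lora", "lora_path": "/models/loras/qwen-code-lora"}'

# Sanity-check the JSON before sending it.
echo "$PAYLOAD" | python3 -m json.tool

# Load the adapter (requires a running server, so left commented out):
# curl -X POST "$HOST/v1/load_lora_adapter" \
#   -H "Content-Type: application/json" -d "$PAYLOAD"

# Unload it again:
# curl -X POST "$HOST/v1/unload_lora_adapter" \
#   -H "Content-Type: application/json" -d '{"lora_name": "qwen-code-lora"}'
```

This is essentially what the AIBrix ModelAdapter controller drives against the selected base-model pod.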
Use Case
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:16<00:00, 2.15it/s]
INFO 03-03 02:22:03 model_runner.py:1562] Graph capturing finished in 16 secs, took 1.89 GiB
INFO 03-03 02:22:03 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 26.85 seconds
INFO 03-03 02:22:04 api_server.py:756] Using supplied chat template:
INFO 03-03 02:22:04 api_server.py:756] None
INFO 03-03 02:22:04 launcher.py:21] Available routes are:
INFO 03-03 02:22:04 launcher.py:29] Route: /openapi.json, Methods: HEAD, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /docs, Methods: HEAD, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /redoc, Methods: HEAD, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /health, Methods: GET
INFO 03-03 02:22:04 launcher.py:29] Route: /ping, Methods: POST, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /tokenize, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /detokenize, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/models, Methods: GET
INFO 03-03 02:22:04 launcher.py:29] Route: /version, Methods: GET
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/chat/completions, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/completions, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/embeddings, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /pooling, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /score, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/score, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /rerank, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/rerank, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v2/rerank, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /invocations, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/load_lora_adapter, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/unload_lora_adapter, Methods: POST
INFO 03-03 02:30:10 chat_utils.py:332] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
INFO 03-03 02:30:10 logger.py:39] Received request chatcmpl-cdb83189b5ff41758a81273eb9bc5e9a: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n使用golang写出快排<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32733, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 03-03 02:30:10 engine.py:275] Added request chatcmpl-cdb83189b5ff41758a81273eb9bc5e9a.
INFO 03-03 02:30:11 metrics.py:455] Avg prompt throughput: 4.1 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 03-03 02:30:16 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 70.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 03-03 02:30:26 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.3 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
Name:         qwen-code-lora-v2
Namespace:    prdsafe
Labels:       model.aibrix.ai/name=qwen-code-lora
              model.aibrix.ai/port=8000
Annotations:
API Version:  model.aibrix.ai/v1alpha1
Kind:         ModelAdapter
Metadata:
  Creation Timestamp:  2025-03-03T11:56:33Z
  Finalizers:
    adapter.model.aibrix.ai/finalizer
  Generation:        1
  Resource Version:  67792616
  UID:               e7781649-0d4e-4f27-b22b-d563754a5a55
Spec:
  Artifact URL:  /models/qwen/Qwen2___5-Coder-7B-Instruct
  Base Model:    qwen-coder-1-5b-instruct
  Pod Selector:
    Match Labels:
      model.aibrix.ai/name: qwen-coder-1-5b-instruct
  Replicas:        1
  Scheduler Name:  default
Status:
  Conditions:
    Last Transition Time:  2025-03-03T11:56:33Z
    Message:               Starting reconciliation
    Reason:                ModelAdapterPending
    Status:                Unknown
    Type:                  Initialized
    Last Transition Time:  2025-03-03T11:56:33Z
    Message:               ModelAdapter prdsafe/qwen-code-lora-v2 has been allocated to pod prdsafe/qwen-coder-1-5b-instruct-78d5894cdf-cq7td
    Reason:                Scheduled
    Status:                True
    Type:                  Scheduled
    Last Transition Time:  2025-03-03T11:56:33Z
    Message:               ModelAdapter prdsafe/qwen-code-lora-v2 is loaded
    Reason:                ModelAdapterLoadingError
    Status:                False
    Type:                  Bound
  Instances:
    qwen-coder-1-5b-instruct-78d5894cdf-cq7td
  Phase:  Bound
Events:
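For comparison, the Spec fields in the describe output above correspond roughly to a manifest like the one below (field names follow the AIBrix `model.aibrix.ai/v1alpha1` examples; this is a reconstruction, not the exact YAML that was applied). Note that `artifactURL` here points at what looks like a full base-model checkpoint directory rather than a LoRA adapter artifact, which may be related to the ModelAdapterLoadingError:

```shell
# Hedged reconstruction of the applied ModelAdapter resource; all values
# are copied from the describe output above.
cat > modeladapter.yaml <<'EOF'
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: qwen-code-lora-v2
  namespace: prdsafe
  labels:
    model.aibrix.ai/name: qwen-code-lora
    model.aibrix.ai/port: "8000"
spec:
  baseModel: qwen-coder-1-5b-instruct
  podSelector:
    matchLabels:
      model.aibrix.ai/name: qwen-coder-1-5b-instruct
  artifactURL: /models/qwen/Qwen2___5-Coder-7B-Instruct
  replicas: 1
  schedulerName: default
EOF
# kubectl apply -f modeladapter.yaml
```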
Proposed Solution
No response