ModelAdapter seems to be working abnormally #801

Open
ying2025 opened this issue Mar 5, 2025 · 1 comment
ying2025 commented Mar 5, 2025

🚀 Feature Description and Motivation

I created the base model following https://aibrix.readthedocs.io/latest/features/lora-dynamic-loading.html#create-base-model, and it works well. But when I create the LoRA model adapter per https://aibrix.readthedocs.io/latest/features/lora-dynamic-loading.html#create-lora-model-adapter, the adapter status looks abnormal. What is happening?
Also, regarding the LoRA workflow (https://aibrix.readthedocs.io/latest/_images/lora-controller-workflow.png): is it correct that the controller calls the base pod to load the new adapter, and then creates a service pointing to the base pod so that the ModelAdapter is served by that pod? See the sketch below.
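
For context, the load step in that workflow presumably boils down to a call like the following against the base pod's vLLM server, using the /v1/load_lora_adapter route that appears in the server logs below (a minimal sketch; the name and path are placeholders, not values from this issue):

# Hypothetical sketch: load a LoRA adapter into the base pod's vLLM server
# via its dynamic-LoRA endpoint.
curl -X POST http://<base-pod-ip>:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "<lora-adapter-name>",
    "lora_path": "<path-to-adapter-artifacts>"
  }'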

Use Case

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:16<00:00, 2.15it/s]
INFO 03-03 02:22:03 model_runner.py:1562] Graph capturing finished in 16 secs, took 1.89 GiB
INFO 03-03 02:22:03 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 26.85 seconds
INFO 03-03 02:22:04 api_server.py:756] Using supplied chat template:
INFO 03-03 02:22:04 api_server.py:756] None
INFO 03-03 02:22:04 launcher.py:21] Available routes are:
INFO 03-03 02:22:04 launcher.py:29] Route: /openapi.json, Methods: HEAD, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /docs, Methods: HEAD, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /redoc, Methods: HEAD, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /health, Methods: GET
INFO 03-03 02:22:04 launcher.py:29] Route: /ping, Methods: POST, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /tokenize, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /detokenize, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/models, Methods: GET
INFO 03-03 02:22:04 launcher.py:29] Route: /version, Methods: GET
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/chat/completions, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/completions, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/embeddings, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /pooling, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /score, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/score, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /rerank, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/rerank, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v2/rerank, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /invocations, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/load_lora_adapter, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/unload_lora_adapter, Methods: POST

INFO 03-03 02:30:10 chat_utils.py:332] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
INFO 03-03 02:30:10 logger.py:39] Received request chatcmpl-cdb83189b5ff41758a81273eb9bc5e9a: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n使用golang写出快排<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32733, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 03-03 02:30:10 engine.py:275] Added request chatcmpl-cdb83189b5ff41758a81273eb9bc5e9a.
INFO 03-03 02:30:11 metrics.py:455] Avg prompt throughput: 4.1 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 03-03 02:30:16 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 70.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 03-03 02:30:26 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.3 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

Name:         qwen-code-lora-v2
Namespace:    prdsafe
Labels:       model.aibrix.ai/name=qwen-code-lora
              model.aibrix.ai/port=8000
Annotations:
API Version:  model.aibrix.ai/v1alpha1
Kind:         ModelAdapter
Metadata:
  Creation Timestamp:  2025-03-03T11:56:33Z
  Finalizers:
    adapter.model.aibrix.ai/finalizer
  Generation:        1
  Resource Version:  67792616
  UID:               e7781649-0d4e-4f27-b22b-d563754a5a55
Spec:
  Artifact URL:  /models/qwen/Qwen2___5-Coder-7B-Instruct
  Base Model:    qwen-coder-1-5b-instruct
  Pod Selector:
    Match Labels:
      model.aibrix.ai/name:  qwen-coder-1-5b-instruct
  Replicas:        1
  Scheduler Name:  default
Status:
  Conditions:
    Last Transition Time:  2025-03-03T11:56:33Z
    Message:               Starting reconciliation
    Reason:                ModelAdapterPending
    Status:                Unknown
    Type:                  Initialized
    Last Transition Time:  2025-03-03T11:56:33Z
    Message:               ModelAdapter prdsafe/qwen-code-lora-v2 has been allocated to pod prdsafe/qwen-coder-1-5b-instruct-78d5894cdf-cq7td
    Reason:                Scheduled
    Status:                True
    Type:                  Scheduled
    Last Transition Time:  2025-03-03T11:56:33Z
    Message:               ModelAdapter prdsafe/qwen-code-lora-v2 is loaded
    Reason:                ModelAdapterLoadingError
    Status:                False
    Type:                  Bound
  Instances:
    qwen-coder-1-5b-instruct-78d5894cdf-cq7td
  Phase:  Bound
Events:
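
For reference, a manifest matching the Spec above would look roughly like this (reconstructed from the describe output; the exact field names are an assumption based on how kubectl renders camelCase fields and on the AIBrix docs):

# Sketch only: ModelAdapter manifest reconstructed from the Spec fields above.
kubectl apply -f - <<EOF
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: qwen-code-lora-v2
  namespace: prdsafe
  labels:
    model.aibrix.ai/name: qwen-code-lora
    model.aibrix.ai/port: "8000"
spec:
  baseModel: qwen-coder-1-5b-instruct
  artifactURL: /models/qwen/Qwen2___5-Coder-7B-Instruct
  podSelector:
    matchLabels:
      model.aibrix.ai/name: qwen-coder-1-5b-instruct
  replicas: 1
  schedulerName: default
EOF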

Proposed Solution

No response

@varungup90 (Collaborator) commented

The model adapter status is correct: the last condition type is Bound, and Instances lists the pod on which the LoRA adapter is loaded.


Can you try running an inference request against the LoRA adapter?

- Set up port forwarding.
kubectl port-forward svc/<base-model-name> 8000:8000 &
kubectl -n envoy-gateway-system port-forward service/envoy-aibrix-system-aibrix-eg-903790dc  8888:80 &

- List models
curl http://localhost:8000/v1/models -H "Authorization: Bearer your-key" | jq .
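
If the adapter loaded correctly, its name should appear in that list alongside the base model. A quick way to check just the ids (the jq filter is illustrative, relying on the OpenAI-compatible response shape):

# Print only the model ids; the LoRA adapter should be listed next to the base model.
curl -s http://localhost:8000/v1/models -H "Authorization: Bearer your-key" | jq '.data[].id'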

- Request
curl -v http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-key" \
  -d '{
     "model": "<lora-adapter-name>",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'
