🚀 Feature Description and Motivation
I created the base model following https://aibrix.readthedocs.io/latest/features/lora-dynamic-loading.html#create-base-model, and it works well. But when I create the LoRA model adapter per https://aibrix.readthedocs.io/latest/features/lora-dynamic-loading.html#create-lora-model-adapter, the adapter status is abnormal: the Bound condition reports ModelAdapterLoadingError (see the describe output below). What is happening here?
Separately, a question about the LoRA workflow (https://aibrix.readthedocs.io/latest/_images/lora-controller-workflow.png): is it correct that the controller must call the base model pod to load the new adapter, and then create a Service pointing to the base pod, so that the ModelAdapter is served as a new model by that pod?
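For reference, this is a hedged sketch of how vLLM's dynamic-LoRA endpoints (`/v1/load_lora_adapter` and `/v1/unload_lora_adapter`, visible in the server's route list in the Use Case log) are typically invoked. The host, adapter name, and adapter path below are illustrative assumptions, not values taken from this cluster:

```shell
# Assumed vLLM server address; the server must allow runtime LoRA updates
# (e.g. started with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True).
HOST="http://localhost:8000"

# lora_name is the name the adapter will be served under; lora_path must
# point at a LoRA adapter directory (adapter_config.json + weights),
# not a full base-model checkpoint. Both values here are hypothetical.
PAYLOAD='{"lora_name": "qwen-code-lora", "lora_path": "/models/loras/qwen-code-lora"}'

# Sanity-check the JSON before sending it.
echo "$PAYLOAD" | python3 -m json.tool

# Load the adapter (requires a running server, so left commented out):
# curl -X POST "$HOST/v1/load_lora_adapter" \
#   -H "Content-Type: application/json" -d "$PAYLOAD"

# Unload it again:
# curl -X POST "$HOST/v1/unload_lora_adapter" \
#   -H "Content-Type: application/json" -d '{"lora_name": "qwen-code-lora"}'
```

This is essentially what the AIBrix ModelAdapter controller drives against the selected base-model pod.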
Use Case
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:16<00:00, 2.15it/s]
INFO 03-03 02:22:03 model_runner.py:1562] Graph capturing finished in 16 secs, took 1.89 GiB
INFO 03-03 02:22:03 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 26.85 seconds
INFO 03-03 02:22:04 api_server.py:756] Using supplied chat template:
INFO 03-03 02:22:04 api_server.py:756] None
INFO 03-03 02:22:04 launcher.py:21] Available routes are:
INFO 03-03 02:22:04 launcher.py:29] Route: /openapi.json, Methods: HEAD, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /docs, Methods: HEAD, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /redoc, Methods: HEAD, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /health, Methods: GET
INFO 03-03 02:22:04 launcher.py:29] Route: /ping, Methods: POST, GET
INFO 03-03 02:22:04 launcher.py:29] Route: /tokenize, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /detokenize, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/models, Methods: GET
INFO 03-03 02:22:04 launcher.py:29] Route: /version, Methods: GET
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/chat/completions, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/completions, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/embeddings, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /pooling, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /score, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/score, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /rerank, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/rerank, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v2/rerank, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /invocations, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/load_lora_adapter, Methods: POST
INFO 03-03 02:22:04 launcher.py:29] Route: /v1/unload_lora_adapter, Methods: POST
INFO 03-03 02:30:10 chat_utils.py:332] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
INFO 03-03 02:30:10 logger.py:39] Received request chatcmpl-cdb83189b5ff41758a81273eb9bc5e9a: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n使用golang写出快排<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32733, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 03-03 02:30:10 engine.py:275] Added request chatcmpl-cdb83189b5ff41758a81273eb9bc5e9a.
INFO 03-03 02:30:11 metrics.py:455] Avg prompt throughput: 4.1 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 03-03 02:30:16 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 70.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 03-03 02:30:26 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.3 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
Name:         qwen-code-lora-v2
Namespace:    prdsafe
Labels:       model.aibrix.ai/name=qwen-code-lora
              model.aibrix.ai/port=8000
Annotations:
API Version:  model.aibrix.ai/v1alpha1
Kind:         ModelAdapter
Metadata:
  Creation Timestamp:  2025-03-03T11:56:33Z
  Finalizers:
    adapter.model.aibrix.ai/finalizer
  Generation:        1
  Resource Version:  67792616
  UID:               e7781649-0d4e-4f27-b22b-d563754a5a55
Spec:
  Artifact URL:  /models/qwen/Qwen2___5-Coder-7B-Instruct
  Base Model:    qwen-coder-1-5b-instruct
  Pod Selector:
    Match Labels:
      model.aibrix.ai/name: qwen-coder-1-5b-instruct
  Replicas:        1
  Scheduler Name:  default
Status:
  Conditions:
    Last Transition Time:  2025-03-03T11:56:33Z
    Message:               Starting reconciliation
    Reason:                ModelAdapterPending
    Status:                Unknown
    Type:                  Initialized
    Last Transition Time:  2025-03-03T11:56:33Z
    Message:               ModelAdapter prdsafe/qwen-code-lora-v2 has been allocated to pod prdsafe/qwen-coder-1-5b-instruct-78d5894cdf-cq7td
    Reason:                Scheduled
    Status:                True
    Type:                  Scheduled
    Last Transition Time:  2025-03-03T11:56:33Z
    Message:               ModelAdapter prdsafe/qwen-code-lora-v2 is loaded
    Reason:                ModelAdapterLoadingError
    Status:                False
    Type:                  Bound
  Instances:
    qwen-coder-1-5b-instruct-78d5894cdf-cq7td
  Phase:  Bound
Events:
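For comparison, the Spec fields in the describe output above correspond roughly to a manifest like the one below (field names follow the AIBrix `model.aibrix.ai/v1alpha1` examples; this is a reconstruction, not the exact YAML that was applied). Note that `artifactURL` here points at what looks like a full base-model checkpoint directory rather than a LoRA adapter artifact, which may be related to the ModelAdapterLoadingError:

```shell
# Hedged reconstruction of the applied ModelAdapter resource; all values
# are copied from the describe output above.
cat > modeladapter.yaml <<'EOF'
apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: qwen-code-lora-v2
  namespace: prdsafe
  labels:
    model.aibrix.ai/name: qwen-code-lora
    model.aibrix.ai/port: "8000"
spec:
  baseModel: qwen-coder-1-5b-instruct
  podSelector:
    matchLabels:
      model.aibrix.ai/name: qwen-coder-1-5b-instruct
  artifactURL: /models/qwen/Qwen2___5-Coder-7B-Instruct
  replicas: 1
  schedulerName: default
EOF
# kubectl apply -f modeladapter.yaml
```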
Proposed Solution
No response