
Inference speed is slow #111

Open
junming-yang opened this issue Feb 19, 2025 · 6 comments

@junming-yang

When testing Qwen/Qwen2.5-0.5B-Instruct, I found that a 3090 reaches around 300 token/s, while a single 910B card stays at 20~30 token/s, which seems noticeably slower.

@Yikun
Collaborator

Yikun commented Feb 19, 2025

Thanks for the feedback. Would you mind also mentioning the version info here?

The current version is still a work in progress; we are actively working on it, and the first release will come later in Q1. That release will bring more performance gains.

You are welcome to contribute and help us improve it.

[1] https://vllm-ascend.readthedocs.io/en/latest/developer_guide/versioning_policy.html#release-cadence
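
If it helps, the version info can be printed with something like the sketch below (illustrative only; attribute names may differ across releases, and the vllm-ascend plugin version can also be checked with pip show vllm-ascend):

# Illustrative sketch for printing the requested version info
# (assumes the packages are installed).
import torch
import vllm

print("vllm:", vllm.__version__)
print("torch:", torch.__version__)
try:
    import torch_npu
    print("torch_npu:", torch_npu.__version__)
except ImportError:
    print("torch_npu: not installed")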

@SLTK1

SLTK1 commented Feb 19, 2025

When testing Qwen/Qwen2.5-0.5B-Instruct, I found that a 3090 reaches around 300 token/s, while a single 910B card stays at 20~30 token/s, which seems noticeably slower.

Testing qwen2vl shows the same problem: with a single 910B card, inference performance is only about half that of an A800.

@311dada

311dada commented Feb 24, 2025

Environment

  • vllm: v0.7.1
  • vllm-ascend: v0.7.1rc1
  • torch: 2.5.1
  • torch_npu: 2.5.1@20250218

Performance

I compared the performance of the 910B and the A800 on Qwen2.5-14B-Instruct using the provided test script below:

from vllm import LLM, SamplingParams
import sys

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0, top_p=0.95)
# Create an LLM.
llm = LLM(model=sys.argv[1])

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The results are as follows:

  • 910B: [screenshot of generation results]

  • A800: [screenshot of generation results]

We can observe that the generated text from both is basically the same, but in terms of speed, inference on the 910B is only about one-third as fast as on the A800.
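
For a rough throughput number in the same setup, one can time llm.generate and count the generated tokens, for example with the sketch below (not part of the original report; the batch size and max_tokens are illustrative):

# Minimal sketch for estimating generation throughput:
# time llm.generate() and count the output tokens.
import sys
import time

from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"] * 32  # illustrative batch
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=128)
llm = LLM(model=sys.argv[1])  # e.g. Qwen/Qwen2.5-14B-Instruct

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} token/s")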

@Yikun
Collaborator

Yikun commented Feb 24, 2025

@311dada Thanks for the feedback.

Currently, the performance of vLLM on Ascend still needs to be improved, and we are working together with the Ascend team on it. The first release will be v0.7.3 in 2025 Q1. Everyone is welcome to join us in improving it.

@311dada

311dada commented Feb 24, 2025

Thank you very much for your contribution! I am also very willing to help improve vLLM's performance on Ascend. Could you please let me know which specific tasks I can work on?

@Yikun
Collaborator

Yikun commented Feb 25, 2025

@311dada

For the performance part, you can refer to the RFC: #156

For other contributions, please feel free to help resolve issues, fix bugs, or test specific models.
