
Inference speed is slow #111

Open
junming-yang opened this issue Feb 19, 2025 · 6 comments

@junming-yang

When testing Qwen/Qwen2.5-0.5B-Instruct, I found that a 3090 reaches around 300 token/s, while a single 910B card stays at 20~30 token/s, which seems noticeably slower.

@Yikun
Collaborator

Yikun commented Feb 19, 2025

Thanks for the feedback. Would you mind also mentioning the version info here?

The current version is still a work in progress; we are actively working on it, and the first release will come later in Q1. That release will bring more performance gains.

You are welcome to contribute and help us improve it.

[1] https://vllm-ascend.readthedocs.io/en/latest/developer_guide/versioning_policy.html#release-cadence
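
If it helps, the version info can be printed with something like the sketch below (illustrative only; attribute names may differ across releases, and the vllm-ascend plugin version can also be checked with pip show vllm-ascend):

# Illustrative sketch for printing the requested version info
# (assumes the packages are installed).
import torch
import vllm

print("vllm:", vllm.__version__)
print("torch:", torch.__version__)
try:
    import torch_npu
    print("torch_npu:", torch_npu.__version__)
except ImportError:
    print("torch_npu: not installed")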

@SLTK1

SLTK1 commented Feb 19, 2025

When testing Qwen/Qwen2.5-0.5B-Instruct, I found that a 3090 reaches around 300 token/s, while a single 910B card stays at 20~30 token/s, which seems noticeably slower.

Testing qwen2vl shows the same problem: with a single 910B card, inference performance is only about half that of an A800.

@311dada

311dada commented Feb 24, 2025

Environment

  • vllm: v0.7.1
  • vllm-ascend: v0.7.1rc1
  • torch: 2.5.1
  • torch_npu: 2.5.1@20250218

Performance

I compared the performance of the 910B and the A800 on Qwen2.5-14B-Instruct using the provided test script below:

from vllm import LLM, SamplingParams
import sys

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0, top_p=0.95)
# Create an LLM.
llm = LLM(model=sys.argv[1])

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The results are as follows:

  • 910B: [screenshot of generation results]

  • A800: [screenshot of generation results]

We can observe that the generated text from both is basically the same, but in terms of speed, inference on the 910B is only about one-third as fast as on the A800.
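
For a rough throughput number in the same setup, one can time llm.generate and count the generated tokens, for example with the sketch below (not part of the original report; the batch size and max_tokens are illustrative):

# Minimal sketch for estimating generation throughput:
# time llm.generate() and count the output tokens.
import sys
import time

from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"] * 32  # illustrative batch
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=128)
llm = LLM(model=sys.argv[1])  # e.g. Qwen/Qwen2.5-14B-Instruct

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} token/s")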

@Yikun
Collaborator

Yikun commented Feb 24, 2025

@311dada Thanks for the feedback.

Currently, the performance of vLLM on Ascend still needs to be improved, and we are working together with the Ascend team on it. The first release will be v0.7.3 in 2025 Q1. Everyone is welcome to join us in improving it.

@311dada

311dada commented Feb 24, 2025

Thank you very much for your contribution! I am also very willing to help improve vLLM's performance on Ascend. Could you please let me know which specific tasks I can work on?

@Yikun
Collaborator

Yikun commented Feb 25, 2025

@311dada

For the performance part, you can refer to the RFC: #156

For other contributions, please feel free to help resolve issues, fix bugs, or test specific models.
