Inference speed is slow #111
Comments
Thanks for the feedback. Would you mind also mentioning the version info here? The current version is still a work in progress; we are actively working on it, and the first release will come in late Q1. There will be more performance gains in that version. You are welcome to contribute and improve it together with us. [1] https://vllm-ascend.readthedocs.io/en/latest/developer_guide/versioning_policy.html#release-cadence
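For anyone gathering the version info requested above, a small hypothetical helper (not from the thread; the package names are assumptions about a typical Ascend setup) could collect the relevant versions for pasting into the issue:

```python
# Hypothetical helper: print the installed versions of the packages
# most relevant to a vllm-ascend performance report.
import importlib.metadata as metadata

for pkg in ("vllm", "vllm-ascend", "torch", "torch-npu"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")  # package name may differ per install
```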
I saw the same problem when testing qwen2vl: inference performance on a single 910B card is only half that of an A800.
Environment
Performance

I compared the performance of the 910B and the A800 on Qwen2.5-14B-Instruct using the following test script:
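The original script is not reproduced in the thread. As a rough stand-in, a minimal throughput benchmark using vLLM's offline `LLM` API might look like the sketch below; the prompt set, batch size, and sampling settings are assumptions, not the author's.

```python
# Minimal throughput sketch (NOT the original test script).
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-14B-Instruct")
prompts = ["Explain the attention mechanism in one paragraph."] * 32  # placeholder workload
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count only generated (output) tokens when reporting decode throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s over {elapsed:.1f}s")
```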
The results are as follows:
We can observe that the outputs from both are basically the same, but the inference speed on the 910B is only one-third of that on the A800.
@311dada Thanks for the feedback. Currently, the performance of vLLM on Ascend still needs to be improved; we are working together with the Ascend team on it. The first release will be v0.7.3, in 2025 Q1. Everyone is welcome to join us in improving it.
Thank you very much for your contribution! I am also very willing to help improve vLLM's performance on Ascend. Could you please let me know which specific tasks I can participate in?
When testing Qwen/Qwen2.5-0.5B-Instruct, I found that a 3090 can reach about 300 token/s, while a single 910B stays at 20–30 token/s, which seems considerably slower.