[Bug]: Qwen2-VL-72B-Instruct Inference failure #115
I notice that Qwen2.5-VL has been updated recently: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct/commit/5d8e171e5ee60e8ca4c6daa380bd29f78fe19021 This may lead to a problem where vLLM 0.7.1 doesn't work with the newest model weights. Just a reminder to double-check. But I think this is not related to this bug. The error is:
It looks like a torch_npu bug. @ganyi1996ppo Please take a look as well. Update:
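Regarding the weight-revision note in the first comment: one way to rule out a weights/version mismatch is to pin the model to an explicit Hugging Face revision instead of tracking main. The following is a minimal sketch, assuming huggingface_hub is installed; the hash shown is simply the commit linked above, and in practice one would pin a known-good earlier revision.

```python
# Minimal sketch: pin Qwen2.5-VL weights to an explicit Hugging Face revision
# instead of tracking "main", so a fixed vLLM version (e.g. 0.7.1) is not fed
# a newer weights/config layout it does not yet understand.
# The hash below is the commit linked in the comment above; an earlier,
# known-good revision would be the actual workaround.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Qwen/Qwen2.5-VL-72B-Instruct",
    revision="5d8e171e5ee60e8ca4c6daa380bd29f78fe19021",
)
print(local_dir)  # the returned local path can be passed to vLLM as the model path
```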
I got the same problem on Qwen2-VL. In the vLLM source code, the attention-mask dimension for SDPA is 3, but the flash attention ops on NPU only support 2 or 4. If it's true that Qwen2 in vLLM only works with torch SDPA on non-GPU platforms, I suggest using the NPU-specific attention op, torch_npu.npu_fusion_attention, which may improve adaptability and inference speed. It works very well on the Ascend-vLLM of Huawei Cloud. This is just a suggestion for your reference; I'm not a specialist in this area.
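For illustration only, here is a minimal plain-PyTorch sketch of the rank mismatch described above (all shapes are hypothetical): the 3-D mask reported for vLLM's SDPA path can be unsqueezed into a broadcastable 4-D form, which is one of the ranks the NPU kernels are said to accept.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes, just to illustrate the rank mismatch described above.
batch, heads, seq, head_dim = 1, 8, 128, 64
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)

# 3-D mask of shape (batch, seq_q, seq_k), as the vLLM SDPA path reportedly builds it.
causal = torch.tril(torch.ones(seq, seq)).bool()        # (seq_q, seq_k) causal pattern
mask_3d = causal.unsqueeze(0).expand(batch, seq, seq)   # (batch, seq_q, seq_k)

# Unsqueeze a head dimension -> (batch, 1, seq_q, seq_k). This 4-D form broadcasts
# over heads and is one of the mask ranks the NPU attention ops are said to accept.
mask_4d = mask_3d.unsqueeze(1)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask_4d)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```

The torch_npu.npu_fusion_attention op suggested above would be the NPU-native path; its exact call signature is not reproduced here since it differs across torch_npu releases.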
@Ziang-Zack-Gao Qwen2-VL using the Ascend attention backend is under development. It will come in the next vllm-ascend release. On the other hand, we'll contribute to vLLM to let Qwen2-VL support more attention backends as well.
npu-smi info
run cmd
full docker logs