Can flash attention be used for inference? #427
Comments
FlashAttention right now isn't very fast for iterative decoding, where Q has seqlen=1 (you can use it for prompt processing). It's ongoing work; we'll eventually make inference fast (not sure when yet).
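For reference, a minimal sketch of the two inference phases being discussed, assuming the `flash_attn` 2 package and its `flash_attn_func` interface (tensor sizes here are just illustrative, not from the thread):

```python
# Prompt processing vs. iterative decoding, assuming the flash_attn 2 package
# (pip install flash-attn). flash_attn_func expects (batch, seqlen, nheads, headdim)
# tensors in fp16/bf16 on CUDA.
import torch
from flash_attn import flash_attn_func

batch, nheads, headdim = 2, 16, 64
device, dtype = "cuda", torch.float16

# Prompt processing: Q, K, V all cover the full prompt (seqlen >> 1).
prompt_len = 512
q = torch.randn(batch, prompt_len, nheads, headdim, device=device, dtype=dtype)
k = torch.randn(batch, prompt_len, nheads, headdim, device=device, dtype=dtype)
v = torch.randn(batch, prompt_len, nheads, headdim, device=device, dtype=dtype)
out_prompt = flash_attn_func(q, k, v, causal=True)  # (batch, prompt_len, nheads, headdim)

# Iterative decoding: Q has seqlen=1 and attends over all keys/values seen so far,
# so no causal mask is needed for the single new query.
q_step = torch.randn(batch, 1, nheads, headdim, device=device, dtype=dtype)
k_past = torch.randn(batch, prompt_len + 1, nheads, headdim, device=device, dtype=dtype)
v_past = torch.randn(batch, prompt_len + 1, nheads, headdim, device=device, dtype=dtype)
out_step = flash_attn_func(q_step, k_past, v_past, causal=False)  # (batch, 1, nheads, headdim)
```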
@tridao Hello, can flash_attn be used for inference only? In other words, if a model was trained with standard attention, can flash_attn replace it during inference? Thanks.
Yes, FlashAttention computes the same attention (up to the usual numerical differences). You can also train with FlashAttention and do inference with standard attention.
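A quick way to convince yourself of this equivalence is to compare the kernel against a plain softmax-attention reference; the sketch below assumes the `flash_attn` 2 package and uses arbitrary shapes, and the gap should be on the order of fp16 rounding noise:

```python
# Numerical-equivalence check: softmax(Q K^T / sqrt(d)) V in fp32 vs. FlashAttention.
import math
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 128, 8, 64
q, k, v = (torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Reference attention in fp32, no mask.
qf, kf, vf = (t.transpose(1, 2).float() for t in (q, k, v))    # (batch, nheads, seqlen, headdim)
scores = qf @ kf.transpose(-2, -1) / math.sqrt(headdim)
ref = (scores.softmax(dim=-1) @ vf).transpose(1, 2)            # (batch, seqlen, nheads, headdim)

out = flash_attn_func(q, k, v, causal=False)
print((out.float() - ref).abs().max())  # expect roughly 1e-3, i.e. fp16-level noise
```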
Want to follow up on this. I tried to do inference on
Yes, you can use FlashAttention 2 for both prompt processing and iterative decoding. It's now optimized for both.
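For the decoding side, recent `flash_attn` 2 releases expose a KV-cache entry point; a minimal decode-step sketch, assuming `flash_attn_with_kvcache` is available in your installed version (sizes are illustrative):

```python
# One decoding step against a preallocated KV cache: the single new query attends
# over the cached keys/values, and the new key/value pair is written into the cache
# in place at position cache_seqlens.
import torch
from flash_attn import flash_attn_with_kvcache

batch, nheads, headdim, max_seqlen = 2, 16, 64, 1024
device, dtype = "cuda", torch.float16

k_cache = torch.zeros(batch, max_seqlen, nheads, headdim, device=device, dtype=dtype)
v_cache = torch.zeros(batch, max_seqlen, nheads, headdim, device=device, dtype=dtype)
cache_seqlens = torch.full((batch,), 512, dtype=torch.int32, device=device)  # tokens already cached

# q, k, v for the single new token.
q = torch.randn(batch, 1, nheads, headdim, device=device, dtype=dtype)
k_new = torch.randn(batch, 1, nheads, headdim, device=device, dtype=dtype)
v_new = torch.randn(batch, 1, nheads, headdim, device=device, dtype=dtype)

out = flash_attn_with_kvcache(
    q, k_cache, v_cache, k=k_new, v=v_new,
    cache_seqlens=cache_seqlens, causal=True,
)  # (batch, 1, nheads, headdim); the caches are updated in place
```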
Thanks for the update. I opened a new issue for my OOM, see here:
Love you my man (a VN hard core here...!) Cheers,
I tried inference with and without flash attention in the Megatron-DeepSpeed code and found a difference in inference speed of only about 0.2 seconds.
In addition, in Hugging Face's OpenLLaMA model implementation, flash attention is also limited to training.

Can flash attention be used for inference acceleration?
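One thing worth checking is whether the end-to-end number is hiding the kernel-level speedup, since other layers often dominate model latency. A rough micro-benchmark sketch that isolates just the attention call, assuming the `flash_attn` 2 package (shapes are hypothetical, not taken from Megatron-DeepSpeed):

```python
# Micro-benchmark: standard PyTorch softmax attention vs. flash_attn_func.
import math
import time
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 4, 2048, 16, 64
q, k, v = (torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
           for _ in range(3))

def standard_attn(q, k, v):
    qt, kt, vt = (t.transpose(1, 2) for t in (q, k, v))        # (batch, nheads, seqlen, headdim)
    scores = qt @ kt.transpose(-2, -1) / math.sqrt(headdim)
    return (scores.softmax(dim=-1) @ vt).transpose(1, 2)

def bench(fn, iters=50):
    for _ in range(5):                 # warmup
        fn()
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - t0) / iters

print("standard attention:", bench(lambda: standard_attn(q, k, v)))
print("flash attention   :", bench(lambda: flash_attn_func(q, k, v)))
```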