
Can flash attention be used for inference? #427

Closed
Shannen3206 opened this issue Aug 7, 2023 · 7 comments

Comments

@Shannen3206

I tried inference with and without flash attention in the Megatron-DeepSpeed code and found a difference in inference speed of only about 0.2 seconds (see the timing sketch after this comment).

In addition, in HuggingFace's OpenLLaMA model implementation, flash attention is restricted to training.
[screenshot of the OpenLLaMA attention code, which enables flash attention only during training]

Can flash attention be used for inference acceleration?
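
For reference, a standalone timing sketch of one seqlen=1 decoding step, comparing `flash_attn_func` from the flash-attn package against a plain PyTorch attention. This is not the Megatron-DeepSpeed integration from the comment above, just an illustration of why the end-to-end gap during decoding can be small; the shapes and sizes are made-up assumptions.

```python
# Illustrative only: time one seqlen=1 decoding step with flash_attn_func
# versus a plain PyTorch attention. Shapes/sizes are arbitrary assumptions.
import time
import torch
from flash_attn import flash_attn_func

batch, nheads, headdim, cache_len = 8, 32, 128, 1024
q = torch.randn(batch, 1, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, cache_len, nheads, headdim, device="cuda", dtype=torch.float16)
v = torch.randn_like(k)

def plain_attention(q, k, v):
    # (batch, seqlen, nheads, headdim) -> (batch, nheads, seqlen, headdim)
    qt, kt, vt = (x.transpose(1, 2) for x in (q, k, v))
    scores = qt @ kt.transpose(-2, -1) / headdim ** 0.5
    return (scores.softmax(dim=-1) @ vt).transpose(1, 2)

def bench(fn, iters=100, warmup=10):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / iters

print("plain attention:", bench(lambda: plain_attention(q, k, v)))
print("flash attention:", bench(lambda: flash_attn_func(q, k, v)))
```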

tridao (Member) commented Aug 7, 2023

FlashAttention right now isn't very fast for iterative decoding, where Q has seqlen=1 (you can still use it for prompt processing). It's ongoing work; we'll eventually make inference fast (I don't know when yet).
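
To make the two regimes concrete, here is a minimal sketch assuming the `flash_attn_func` interface from the flash-attn package (fp16/bf16 CUDA tensors with shape `(batch, seqlen, nheads, headdim)`); sizes are illustrative assumptions.

```python
# A sketch of the two inference phases with flash_attn_func; shapes are
# arbitrary assumptions for illustration.
import torch
from flash_attn import flash_attn_func

batch, nheads, headdim = 2, 16, 64
device, dtype = "cuda", torch.float16

# Prompt processing: Q, K, V all span the full prompt. This is the case
# FlashAttention already handles well, since there is plenty of parallel work.
seqlen = 1024
q = torch.randn(batch, seqlen, nheads, headdim, device=device, dtype=dtype)
k = torch.randn(batch, seqlen, nheads, headdim, device=device, dtype=dtype)
v = torch.randn(batch, seqlen, nheads, headdim, device=device, dtype=dtype)
out_prompt = flash_attn_func(q, k, v, causal=True)

# Iterative decoding: Q has seqlen=1 against the cached K/V. This is the case
# described above as not yet fast at the time of this comment. The single new
# query attends to every cached position, so no causal mask is needed here.
q_step = torch.randn(batch, 1, nheads, headdim, device=device, dtype=dtype)
out_step = flash_attn_func(q_step, k, v, causal=False)
```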

77h2l commented Aug 16, 2023

@tridao Hello, can flash_attn be used only at inference time? In other words, if a model was trained with standard attention, can flash_attn replace it during inference? Thanks.

tridao (Member) commented Aug 16, 2023

> @tridao Hello, can flash_attn be used only at inference time? In other words, if a model was trained with standard attention, can flash_attn replace it during inference? Thanks.

Yes, FlashAttention computes the same attention (up to the usual numerical differences). You can also train with FlashAttention and do inference with standard attention.
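
A quick way to see this equivalence is to compare `flash_attn_func` against a plain PyTorch implementation of softmax(QKᵀ/√d)V. A sketch under the same tensor-layout assumptions as above, not taken from the repository's test suite:

```python
# Sanity check that FlashAttention matches standard attention up to fp16
# rounding (illustrative sketch).
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 1, 128, 8, 64
q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                       device="cuda", dtype=torch.float16) for _ in range(3))

out_flash = flash_attn_func(q, k, v, causal=True)

# Reference: softmax(Q K^T / sqrt(d)) V with a causal mask, computed in fp32.
qt, kt, vt = (x.transpose(1, 2).float() for x in (q, k, v))   # (b, h, s, d)
scores = qt @ kt.transpose(-2, -1) / headdim ** 0.5
mask = torch.triu(torch.ones(seqlen, seqlen, dtype=torch.bool, device="cuda"), 1)
scores = scores.masked_fill(mask, float("-inf"))
out_ref = (scores.softmax(dim=-1) @ vt).transpose(1, 2)

# A small maximum difference means the attention implementations can be
# swapped freely between training and inference.
print((out_flash.float() - out_ref).abs().max().item())
```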

jmzeng commented Oct 4, 2023

I want to follow up on this. I tried to run inference on longchat-v1.5-7b-32k with the flash attention patch and got an OOM on a single inference run with Accelerate. Do you know if there is a way FlashAttention 2 can be used?

tridao (Member) commented Oct 4, 2023

Yes, you can use FlashAttention 2 for both prompt processing and iterative decoding; it's now optimized for both.
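
For the decoding side, here is a minimal sketch using the `flash_attn_with_kvcache` helper from the flash-attn 2.x series, with a pre-allocated KV cache; the helper name and arguments follow that package's API, and the cache sizes are made up for illustration.

```python
# Sketch of iterative decoding with a pre-allocated KV cache using
# flash_attn_with_kvcache from flash-attn 2.x (shapes are illustrative).
import torch
from flash_attn import flash_attn_with_kvcache

batch, nheads, headdim, max_seqlen = 2, 16, 64, 2048
device, dtype = "cuda", torch.float16

k_cache = torch.zeros(batch, max_seqlen, nheads, headdim, device=device, dtype=dtype)
v_cache = torch.zeros(batch, max_seqlen, nheads, headdim, device=device, dtype=dtype)
cache_seqlens = torch.zeros(batch, dtype=torch.int32, device=device)  # tokens cached so far

for step in range(16):  # a few decoding steps
    # New token's q/k/v for this step (normally produced by the model).
    q = torch.randn(batch, 1, nheads, headdim, device=device, dtype=dtype)
    k = torch.randn(batch, 1, nheads, headdim, device=device, dtype=dtype)
    v = torch.randn(batch, 1, nheads, headdim, device=device, dtype=dtype)

    # Appends k/v into the cache at position cache_seqlens and attends to
    # everything cached so far, all inside one kernel.
    out = flash_attn_with_kvcache(q, k_cache, v_cache, k=k, v=v,
                                  cache_seqlens=cache_seqlens, causal=True)
    cache_seqlens += 1
```

For HuggingFace models such as the longchat checkpoint mentioned above, recent transformers releases also expose a flash-attention-2 switch when loading the model (e.g. `attn_implementation="flash_attention_2"`, or `use_flash_attention_2=True` in older releases), assuming the model architecture supports it.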

jmzeng commented Oct 4, 2023

Thanks for the update. I opened a new issue for my OOM, see here:

#590

@thusinh1969

> @tridao Hello, can flash_attn be used only at inference time? In other words, if a model was trained with standard attention, can flash_attn replace it during inference? Thanks.

> Yes, FlashAttention computes the same attention (up to the usual numerical differences). You can also train with FlashAttention and do inference with standard attention.

Love you, my man (a hard-core fan from Vietnam here...!)

Cheers,
Steve
