Can flash attention be used for inference? #427
Comments
FlashAttention right now isn't very fast for iterative decoding, where Q has seqlen=1 (you can use it for prompt processing). It's ongoing work; we'll eventually make inference fast (not sure when yet).
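For reference, a minimal sketch of the two inference phases being discussed, assuming the `flash_attn` 2 package and its `flash_attn_func` interface (tensor sizes here are just illustrative, not from the thread):

```python
# Prompt processing vs. iterative decoding, assuming the flash_attn 2 package
# (pip install flash-attn). flash_attn_func expects (batch, seqlen, nheads, headdim)
# tensors in fp16/bf16 on CUDA.
import torch
from flash_attn import flash_attn_func

batch, nheads, headdim = 2, 16, 64
device, dtype = "cuda", torch.float16

# Prompt processing: Q, K, V all cover the full prompt (seqlen >> 1).
prompt_len = 512
q = torch.randn(batch, prompt_len, nheads, headdim, device=device, dtype=dtype)
k = torch.randn(batch, prompt_len, nheads, headdim, device=device, dtype=dtype)
v = torch.randn(batch, prompt_len, nheads, headdim, device=device, dtype=dtype)
out_prompt = flash_attn_func(q, k, v, causal=True)  # (batch, prompt_len, nheads, headdim)

# Iterative decoding: Q has seqlen=1 and attends over all keys/values seen so far,
# so no causal mask is needed for the single new query.
q_step = torch.randn(batch, 1, nheads, headdim, device=device, dtype=dtype)
k_past = torch.randn(batch, prompt_len + 1, nheads, headdim, device=device, dtype=dtype)
v_past = torch.randn(batch, prompt_len + 1, nheads, headdim, device=device, dtype=dtype)
out_step = flash_attn_func(q_step, k_past, v_past, causal=False)  # (batch, 1, nheads, headdim)
```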
@tridao Hello, can flash_attn be used for inference only? In other words, if a model was trained with standard attention, can flash_attn replace it during inference? Thanks.
Yes, FlashAttention computes the same attention (up to the usual numerical differences). You can also train with FlashAttention and do inference with standard attention.
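A quick way to convince yourself of this equivalence is to compare the kernel against a plain softmax-attention reference; the sketch below assumes the `flash_attn` 2 package and uses arbitrary shapes, and the gap should be on the order of fp16 rounding noise:

```python
# Numerical-equivalence check: softmax(Q K^T / sqrt(d)) V in fp32 vs. FlashAttention.
import math
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 128, 8, 64
q, k, v = (torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Reference attention in fp32, no mask.
qf, kf, vf = (t.transpose(1, 2).float() for t in (q, k, v))    # (batch, nheads, seqlen, headdim)
scores = qf @ kf.transpose(-2, -1) / math.sqrt(headdim)
ref = (scores.softmax(dim=-1) @ vf).transpose(1, 2)            # (batch, seqlen, nheads, headdim)

out = flash_attn_func(q, k, v, causal=False)
print((out.float() - ref).abs().max())  # expect roughly 1e-3, i.e. fp16-level noise
```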
Want to follow up on this. I tried to do inference on
Yes, you can use FlashAttention 2 for both prompt processing and iterative decoding. It's now optimized for both.
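For the decoding side, recent `flash_attn` 2 releases expose a KV-cache entry point; a minimal decode-step sketch, assuming `flash_attn_with_kvcache` is available in your installed version (sizes are illustrative):

```python
# One decoding step against a preallocated KV cache: the single new query attends
# over the cached keys/values, and the new key/value pair is written into the cache
# in place at position cache_seqlens.
import torch
from flash_attn import flash_attn_with_kvcache

batch, nheads, headdim, max_seqlen = 2, 16, 64, 1024
device, dtype = "cuda", torch.float16

k_cache = torch.zeros(batch, max_seqlen, nheads, headdim, device=device, dtype=dtype)
v_cache = torch.zeros(batch, max_seqlen, nheads, headdim, device=device, dtype=dtype)
cache_seqlens = torch.full((batch,), 512, dtype=torch.int32, device=device)  # tokens already cached

# q, k, v for the single new token.
q = torch.randn(batch, 1, nheads, headdim, device=device, dtype=dtype)
k_new = torch.randn(batch, 1, nheads, headdim, device=device, dtype=dtype)
v_new = torch.randn(batch, 1, nheads, headdim, device=device, dtype=dtype)

out = flash_attn_with_kvcache(
    q, k_cache, v_cache, k=k_new, v=v_new,
    cache_seqlens=cache_seqlens, causal=True,
)  # (batch, 1, nheads, headdim); the caches are updated in place
```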
Thanks for the update. I opened a new issue for my OOM, see here:
Love you my man (a VN hard core here...!) Cheers,
I tried inference with and without flash attention in the Megatron-DeepSpeed code and found a difference in inference speed of only about 0.2 seconds.
In addition, in Hugging Face's OpenLLaMA model implementation, flash attention is also limited to training.

Can flash attention be used for inference acceleration?
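One thing worth checking is whether the end-to-end number is hiding the kernel-level speedup, since other layers often dominate model latency. A rough micro-benchmark sketch that isolates just the attention call, assuming the `flash_attn` 2 package (shapes are hypothetical, not taken from Megatron-DeepSpeed):

```python
# Micro-benchmark: standard PyTorch softmax attention vs. flash_attn_func.
import math
import time
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 4, 2048, 16, 64
q, k, v = (torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
           for _ in range(3))

def standard_attn(q, k, v):
    qt, kt, vt = (t.transpose(1, 2) for t in (q, k, v))        # (batch, nheads, seqlen, headdim)
    scores = qt @ kt.transpose(-2, -1) / math.sqrt(headdim)
    return (scores.softmax(dim=-1) @ vt).transpose(1, 2)

def bench(fn, iters=50):
    for _ in range(5):                 # warmup
        fn()
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - t0) / iters

print("standard attention:", bench(lambda: standard_attn(q, k, v)))
print("flash attention   :", bench(lambda: flash_attn_func(q, k, v)))
```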