
Is Paged or Flash Attention a default? #753

Closed
matthieu-perso opened this issue Aug 1, 2023 · 6 comments

Comments

@matthieu-perso

matthieu-perso commented Aug 1, 2023

Feature request

Hey all,
The TGI documentation states that PagedAttention and FlashAttention are used. Is there a way to choose which one we use? Different attention mechanisms have different pros and cons, and choosing which one to use would be relevant in production.

Motivation

Selecting different attention mechanisms would be relevant for different types of documents. In our case, attention mechanisms that are suited for long sequences would be useful.

Your contribution

Based on the above, I'm happy to contribute code.

@abhinavkulkarni
Contributor

Nothing is the default. Flash Attention (v1 and v2) and Paged Attention are used if they are available. For a local install, you need to run `cd server && make install-flash-attention`, `cd server && make install-flash-attention-v2`, and `cd server && make install-vllm`.
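For reference, here is a sketch of those commands run sequentially from a local checkout of text-generation-inference (as noted later in this thread, the kernels are slow to compile and only build in supported CUDA environments):

```shell
# From the root of a text-generation-inference checkout
cd server
make install-flash-attention      # Flash Attention v1 kernels
make install-flash-attention-v2   # Flash Attention v2 kernels
make install-vllm                 # vLLM kernels (Paged Attention)
```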

@Narsil
Collaborator

Narsil commented Aug 3, 2023

What other attention mechanisms do you have in mind?

Contributions are great and welcome. Since you are suggesting a fairly core change, would you mind explaining the motivation and the intended changes a bit more?

And yes, Flash Attention and vLLM (Paged Attention) are the defaults. They are in the Docker image, which is our main distribution scheme.
For a local install you do need to build them yourself (they are quite slow to compile and do not work in every environment, so we do not do it for you by default).

Also, even when compiled, they may or may not be used depending on the model.

We always try to use them since they are better, but we will gracefully downgrade if anything is wrong.

@merveenoyan FYI.

@Narsil
Collaborator

Narsil commented Aug 3, 2023

Did you mean this: https://github.com/ModelTC/lightllm/blob/main/docs/LightLLM.md ?

If yes and there's already an implementation, we're definitely eager to take it for a spin.

@matthieu-perso
Author

matthieu-perso commented Aug 16, 2023

Hi both, I didn't have anything else in mind; I was simply asking whether we needed to choose one of these two attention mechanisms (and potentially others) ourselves. If they are implemented and chosen by default, that's good enough. Thank you both.

@Narsil
Collaborator

Narsil commented Aug 16, 2023

The lightllm implementors reached out, but we haven't had a proper discussion yet, nor have I implemented it in TGI.

For adding features, the general mindset is:

  • If feature X is strictly superior to feature Y, we drop feature Y and implement only X.
  • If X and Y are quite complementary, each with very nice pros and different cons that make them useful in different settings, we tend to add a flag.
  • If X is strictly superior to Y, but X cannot run in every environment (specific hardware, CUDA version > 12, etc.), then we gracefully fall back to Y (with a flag for users to deactivate it).
  • If it's doubtful that X provides a real improvement over Y, we just don't implement it.

If it falls into an intermediate case, we discuss and pick a side.

Closing this, since as of now we are in the third setting (Flash Attention v2 + Paged Attention is strictly superior to everything else, and we gracefully downgrade).

I'm closing this issue since I think it answers the question about the default; feel free to open a new one specifically for lightllm support and for discussing the tradeoffs of that specific attention implementation.

@Narsil Narsil closed this as completed Aug 16, 2023
@ghost

ghost commented Sep 25, 2023

@Narsil Thanks for the detailed response! Is it then confirmed that deploying CodeLlama via the TGI v1.0.3 Docker image uses Flash Attention v2 by default, since CodeLlama is based on Llama 2?
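(For context, the deployment in question would look roughly like the standard TGI Docker invocation below. This is an illustrative sketch only; `codellama/CodeLlama-7b-hf` and the volume path are example values, not prescribed by this thread.)

```shell
# Illustrative sketch: TGI 1.0.3 Docker image serving a CodeLlama model.
model=codellama/CodeLlama-7b-hf
volume=$PWD/data   # cache downloaded weights outside the container

docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:1.0.3 \
  --model-id $model
```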
