
Is Paged or Flash Attention a default? #753

Closed
matthieu-perso opened this issue Aug 1, 2023 · 6 comments

Comments

@matthieu-perso

matthieu-perso commented Aug 1, 2023

Feature request

Hey all,
The TGI documentation states that PagedAttention and FlashAttention are used. Is there a way to choose which one we use? Different attention mechanisms have different pros and cons, and choosing which one to use would be relevant in production.

Motivation

Selecting different attention mechanisms would be relevant for different types of documents. In our case, attention mechanisms that are suited for long sequences would be useful.

Your contribution

Based on the above, I'm happy to contribute code.

@abhinavkulkarni
Contributor

Nothing is the default. Flash Attention (v1 and v2) and Paged Attention are used if they are available. For a local install, you need to run `cd server && make install-flash-attention`, `cd server && make install-flash-attention-v2`, and `cd server && make install-vllm`.
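For reference, here is a sketch of those commands run sequentially from a local checkout of text-generation-inference (as noted later in this thread, the kernels are slow to compile and only build in supported CUDA environments):

```shell
# From the root of a text-generation-inference checkout
cd server
make install-flash-attention      # Flash Attention v1 kernels
make install-flash-attention-v2   # Flash Attention v2 kernels
make install-vllm                 # vLLM kernels (Paged Attention)
```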

@Narsil
Collaborator

Narsil commented Aug 3, 2023

What other attention mechanisms do you have in mind?

Contributions are great and welcome. Since you are suggesting a fairly core change, would you mind explaining the motivation and the intended changes a bit more?

And yes, Flash Attention and vLLM (Paged Attention) are the defaults. They are in the Docker image, which is our main distribution scheme.
For a local install you do need to build them yourself (they are quite slow to compile and do not work in every environment, so we do not do it for you by default).

Also, even when compiled, they may or may not be used depending on the model.

We always try to use them since they are better, but we will gracefully downgrade if anything is wrong.

@merveenoyan FYI.

@Narsil
Collaborator

Narsil commented Aug 3, 2023

Did you mean this: https://github.com/ModelTC/lightllm/blob/main/docs/LightLLM.md ?

If yes and there's already an implementation, we're definitely eager to take it for a spin.

@matthieu-perso
Author

matthieu-perso commented Aug 16, 2023

Hi both, I didn't have anything else in mind; I was simply asking whether we needed to choose one of these two attention mechanisms (and potentially others) ourselves. If they are implemented and chosen by default, that's good enough. Thank you both.

@Narsil
Collaborator

Narsil commented Aug 16, 2023

The lightllm implementors reached out, but we haven't had a proper discussion yet, nor have I implemented it in TGI.

For adding features, the general mindset is:

  • If feature X is strictly superior to feature Y, we drop feature Y and implement only X.
  • If X and Y are quite complementary, each with very nice pros and different cons that make them useful in different settings, we tend to add a flag.
  • If X is strictly superior to Y, but X cannot run in every environment (specific hardware, CUDA version > 12, etc.), then we gracefully fall back to Y (with a flag for users to deactivate it).
  • If it's doubtful that X provides a real improvement over Y, we just don't implement it.

If it falls into an intermediate case, we discuss and pick a side.

Closing this, since as of now we are in the third setting (Flash Attention v2 + Paged Attention is strictly superior to everything else, and we gracefully downgrade).

I'm closing this issue since I think it answers the question about the default; feel free to open a new one specifically for lightllm support and for discussing the tradeoffs of that specific attention implementation.

@Narsil Narsil closed this as completed Aug 16, 2023
@ghost

ghost commented Sep 25, 2023

@Narsil Thanks for the detailed response! Is it then confirmed that deploying CodeLlama via the TGI v1.0.3 Docker image uses Flash Attention v2 by default, since CodeLlama is based on Llama 2?
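(For context, the deployment in question would look roughly like the standard TGI Docker invocation below. This is an illustrative sketch only; `codellama/CodeLlama-7b-hf` and the volume path are example values, not prescribed by this thread.)

```shell
# Illustrative sketch: TGI 1.0.3 Docker image serving a CodeLlama model.
model=codellama/CodeLlama-7b-hf
volume=$PWD/data   # cache downloaded weights outside the container

docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:1.0.3 \
  --model-id $model
```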
