Is Paged or Flash Attention a default? #753
Comments
Nothing is the default. Flash (v1 and v2) and Paged Attention are used if they are available. You need to …
What other attention mechanisms do you have in mind? Contributions are great and welcome. Since this is kind of a core change you are suggesting, do you mind explaining the motivation and the intended changes a bit more? And yes, flash attention and vllm (paged attention) are defaults. They are in the docker image, which is our main distribution scheme. Also, even when compiled, they may or may not be used depending on the model. We always try to use them since they are better, but we will gracefully downgrade if anything is wrong. @merveenoyan FYI.
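To illustrate the graceful-downgrade behaviour described above, here is a minimal sketch of the pattern, not TGI's actual source. It assumes the `flash_attn_2_cuda` / `flash_attn_cuda` CUDA extensions that ship with the flash-attn v2 and v1 packages; if neither can be imported, it falls back to plain attention.

```python
# Illustrative sketch of a "use the best attention available" pattern, similar
# in spirit to the behaviour described above. Not TGI's actual code; the module
# names are the CUDA extensions shipped with flash-attn (v2 and v1 respectively).

HAS_FLASH_ATTN_V2 = False
HAS_FLASH_ATTN_V1 = False

try:
    import flash_attn_2_cuda  # noqa: F401  (flash-attn >= 2.0)
    HAS_FLASH_ATTN_V2 = True
except ImportError:
    try:
        import flash_attn_cuda  # noqa: F401  (flash-attn 1.x)
        HAS_FLASH_ATTN_V1 = True
    except ImportError:
        # Neither kernel is installed or usable on this hardware:
        # gracefully downgrade to a plain PyTorch attention implementation.
        pass


def attention_backend() -> str:
    """Return the name of the attention implementation that would be selected."""
    if HAS_FLASH_ATTN_V2:
        return "flash-attention-v2"
    if HAS_FLASH_ATTN_V1:
        return "flash-attention-v1"
    return "eager"  # standard attention, always available


if __name__ == "__main__":
    print(f"Selected attention backend: {attention_backend()}")
```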
Did you mean this: https://github.com/ModelTC/lightllm/blob/main/docs/LightLLM.md? If so, and there's already an implementation, we're definitely eager to take it for a spin.
Hi both - I didn't have anything else in mind; I was simply asking whether we needed to choose one of these two attention mechanisms (and potentially others) ourselves. If they are implemented and chosen by default, that is good enough. Thank you both.
The lightllm implementors reached out, but we haven't had a proper discussion, nor have I implemented it in TGI. For adding features, the general mindset applies:
If it's an intermediary decision, we discuss and pick a side. Closing this, since as of now we are in the 3rd setting (v2 + paged attention is strictly superior to everything else, and we gracefully downgrade). I'm closing this issue since I think it answers the question about the default; feel free to open a new one specifically for lightllm support and for discussing the trade-offs of that particular attention implementation.
@Narsil Thanks for the detailed response! Is it then confirmed that deploying CodeLlama via the TGI v1.0.3 docker image uses flash attention v2 by default, since CodeLlama is based on Llama 2?
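Not an official answer, but one way to sanity-check what a running TGI container actually loaded is to query its `/info` endpoint and read the startup logs. A minimal sketch, assuming a container already serving on localhost:8080; the exact fields returned vary by TGI version.

```python
# Minimal sketch: inspect a running TGI server's reported configuration.
# Assumes a container is already serving on http://localhost:8080; the fields
# returned by /info (model_id, model_dtype, version, ...) depend on the TGI version.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/info") as resp:
    info = json.load(resp)

print(json.dumps(info, indent=2))

# Whether flash attention was actually used is typically also visible in the
# container's startup logs, which mention a fallback when it cannot be loaded.
```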
Feature request
Hey all,
The TGI documentation states that PagedAttention and FlashAttention are used. Is there a way to choose which one we use? Different attention mechanisms have different pros and cons, and choosing which one to use would be relevant in production.
Motivation
Selecting different attention mechanisms would be relevant for different types of documents. In our case, attention mechanisms that are suited for long sequences would be useful.
Your contribution
Based on the above, I would be happy to contribute code.