Flash Attention v2 was recently added to TVM, and I am trying to use it to optimize the GPU performance of mlc-llm.
Currently, in mlc_llm/relax_model/llama.py, the attention computation is built from many basic operations instead of Relax's attention op. I tried to use the Relax attention op (see this code), but it didn't quite work out.
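Roughly, what I tried looks like the following. This is only a minimal sketch with made-up shapes, dtypes, and names, not the exact code from the link above; as far as I can tell, R.nn.attention expects q/k/v in (batch, seq_len, num_heads, head_dim) layout:

```python
from tvm.script import ir as I
from tvm.script import relax as R


@I.ir_module
class AttentionModule:
    @R.function
    def main(
        q: R.Tensor((1, 128, 32, 128), "float16"),
        k: R.Tensor((1, 128, 32, 128), "float16"),
        v: R.Tensor((1, 128, 32, 128), "float16"),
    ) -> R.Tensor((1, 128, 32, 128), "float16"):
        # Scaled-dot-product attention expressed as a single Relax op,
        # instead of the matmul/softmax/matmul sequence in llama.py.
        out: R.Tensor((1, 128, 32, 128), "float16") = R.nn.attention(q, k, v)
        return out
```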
The attention performance is very low, and it seems the op is not even running on the GPU.
There is no proper documentation on Relax to refer to. Could anyone tell me the proper usage of R.nn.attention, and why mlc-llm is not using it to implement the model?
See the rewrite_attention function in #651. Flash v2 is actually not fast for a single-query workload (Dao-AILab/flash-attention#427 (comment)), so for the decoder we use the xFormers kernel.
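For context, the idea is to pattern-match the unfused softmax(Q·Kᵀ/scale)·V subgraph and replace it with the fused attention op, so the offloading passes can map it to a library kernel. Below is a simplified sketch of that kind of rewrite using tvm.relax.dpl; the actual pattern in #651 also matches the layout transposes, masking, and dtype casts, so treat the names and the exact pattern here as illustrative:

```python
from tvm import relax
from tvm.relax.dpl import is_op, wildcard, rewrite_call


def rewrite_attention(f):
    # Match the unfused pattern: matmul(softmax(matmul(Q, K^T) / scale), V).
    Q = wildcard()
    K = wildcard()
    V = wildcard()

    matmul_qk = is_op("relax.matmul")(Q, is_op("relax.permute_dims")(K))
    scaled = is_op("relax.divide")(matmul_qk, wildcard())
    scores = is_op("relax.nn.softmax")(scaled)
    pattern = is_op("relax.matmul")(scores, V)

    def callback(_, matchings):
        # Replace the whole matched subgraph with the fused attention op
        # (the same op as R.nn.attention in TVMScript).
        return relax.op.nn.attention(matchings[Q], matchings[K], matchings[V])

    return rewrite_call(pattern, callback, f)
```

After the rewrite, the usual BYOC partitioning can offload the fused op to the CUTLASS/flash or xFormers kernels instead of running the unfused operators.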