
Enable custom paged attention kernel for Navi 3/4 #446

Open
wants to merge 4 commits into main
Conversation

@hyoon1 commented on Feb 24, 2025

Introduce a custom paged attention kernel for Navi 3/4.

  • Supports cases where head_size == 128 and block_size == 16.
  • Does not support alibi_slopes or kv_cache_dtype == fp8.
  • Supports gqa_ratio up to 16 and shows performance gains over the existing kernel when gqa_ratio is 3 or higher, so it is enabled only for gqa_ratio values between 3 and 16 (see the dispatch sketch after this list).
  • Fixed the paged attention unit test so it passes on Navi.
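
For illustration only, a minimal sketch of the gating logic described above, assuming a hypothetical helper rather than the exact function added by this PR:

```python
# Hypothetical sketch of the enable conditions this PR describes for the
# Navi 3/4 custom paged attention kernel; names are illustrative, not the
# actual vLLM code.
from typing import List, Optional


def use_custom_paged_attention_navi(
    head_size: int,
    block_size: int,
    gqa_ratio: int,
    alibi_slopes: Optional[List[float]],
    kv_cache_dtype: str,
) -> bool:
    """Return True if the Navi 3/4 custom kernel should handle this call."""
    return (
        head_size == 128              # only head_size 128 is supported
        and block_size == 16          # only block_size 16 is supported
        and alibi_slopes is None      # alibi_slopes is not supported
        and kv_cache_dtype != "fp8"   # fp8 KV cache is not supported
        and 3 <= gqa_ratio <= 16      # gains observed only for gqa_ratio >= 3
    )
```

Anything outside these conditions would fall back to the existing kernel.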

Performance Gain
Script: python ./benchmarks/benchmark_throughput.py --model --trust-remote-code --dataset <ShareGPT_V3_unfiltered_cleaned_split.json> --num_prompts 1000 --max-model-len 4096 --gpu-memory-utilization 0.95

Navi 3

| Model | Num Heads | GQA Ratio | Output Token/s (original) | Output Token/s (custom) | Gain |
|---|---|---|---|---|---|
| glm-4-9b-chat | 32 | 16 | 991.43 | 1113.75 | 12.3% |
| chatglm3-6b | 32 | 16 | 1442.07 | 1554.23 | 7.8% |
| Meta-Llama-3.1-8B-Instruct | 32 | 4 | 1143.65 | 1221.75 | 6.8% |
| Llama-3.2-3B-Instruct | 24 | 3 | 2058.97 | 2146.62 | 4.3% |
| Qwen1.5-7B-Chat | 32 | 1 | 904.46 | 882.53 | -2.4% |

Navi 4

| Model | Num Heads | GQA Ratio | Output Token/s (original) | Output Token/s (custom) | Gain |
|---|---|---|---|---|---|
| glm-4-9b-chat | 32 | 16 | 1195.56 | 1433.13 | 19.9% |
| chatglm3-6b | 32 | 16 | 1750.34 | 1962.21 | 12.1% |
| Meta-Llama-3.1-8B-Instruct | 32 | 4 | 1405.42 | 1516.69 | 7.9% |
| Llama-3.2-3B-Instruct | 24 | 3 | 2419.31 | 2561.47 | 5.9% |
| Qwen1.5-7B-Chat | 32 | 1 | 765.6 | 761.3 | -0.6% |
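
As an aside, the "GQA Ratio" column follows from a model's head configuration (query heads divided by KV heads). A small sketch, with example head counts that are assumptions and not taken from this PR:

```python
# Illustrative only: deriving the GQA ratio from a model config.
def gqa_ratio(num_attention_heads: int, num_key_value_heads: int) -> int:
    assert num_attention_heads % num_key_value_heads == 0
    return num_attention_heads // num_key_value_heads

# e.g. a model with 32 query heads and 8 KV heads has a GQA ratio of 4,
# which lands inside the 3..16 range where the custom kernel is enabled.
print(gqa_ratio(32, 8))  # 4
```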

@shajrawi (Collaborator) left a comment:

Thank you for the PR! Can you please create it against upstream instead, per our contribution guidelines? https://github.com/vllm-project/vllm

@hyoon1 force-pushed the custom_kernel_navi branch from 63b41a7 to e689d99 on February 26, 2025
@hyoon1 force-pushed the custom_kernel_navi branch from 047b9ce to c8fe9aa on March 5, 2025
@hyoon1 changed the title from "Enable custom paged attention kernel for Navi3x" to "Enable custom paged attention kernel for Navi 3/4" on March 5, 2025
@hyoon1 force-pushed the custom_kernel_navi branch from c8fe9aa to dfcccb5 on March 5, 2025
@hyoon1 force-pushed the custom_kernel_navi branch from dfcccb5 to 5335b48 on March 6, 2025