
Enable custom paged attention kernel for Navi 3/4 #446

Open
wants to merge 4 commits into main
Conversation

@hyoon1 commented on Feb 24, 2025

Introduce a custom paged attention kernel for Navi 3/4.

  • Supports cases where head_size == 128 and block_size == 16.
  • Does not support alibi_slopes or kv_cache_dtype == fp8.
  • Supports gqa_ratio up to 16 and shows performance gains over the existing kernel when gqa_ratio is 3 or higher, so it is enabled only for gqa_ratio values between 3 and 16 (see the dispatch sketch after this list).
  • Fixed the paged attention unit test so it passes on Navi.
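
For illustration only, a minimal sketch of the gating logic described above, assuming a hypothetical helper rather than the exact function added by this PR:

```python
# Hypothetical sketch of the enable conditions this PR describes for the
# Navi 3/4 custom paged attention kernel; names are illustrative, not the
# actual vLLM code.
from typing import List, Optional


def use_custom_paged_attention_navi(
    head_size: int,
    block_size: int,
    gqa_ratio: int,
    alibi_slopes: Optional[List[float]],
    kv_cache_dtype: str,
) -> bool:
    """Return True if the Navi 3/4 custom kernel should handle this call."""
    return (
        head_size == 128              # only head_size 128 is supported
        and block_size == 16          # only block_size 16 is supported
        and alibi_slopes is None      # alibi_slopes is not supported
        and kv_cache_dtype != "fp8"   # fp8 KV cache is not supported
        and 3 <= gqa_ratio <= 16      # gains observed only for gqa_ratio >= 3
    )
```

Anything outside these conditions would fall back to the existing kernel.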

Performance Gain
Script: python ./benchmarks/benchmark_throughput.py --model --trust-remote-code --dataset <ShareGPT_V3_unfiltered_cleaned_split.json> --num_prompts 1000 --max-model-len 4096 --gpu-memory-utilization 0.95

Navi 3

| Model | Num Heads | GQA Ratio | Output Token/s (original) | Output Token/s (custom) | Gain |
|---|---|---|---|---|---|
| glm-4-9b-chat | 32 | 16 | 991.43 | 1113.75 | 12.3% |
| chatglm3-6b | 32 | 16 | 1442.07 | 1554.23 | 7.8% |
| Meta-Llama-3.1-8B-Instruct | 32 | 4 | 1143.65 | 1221.75 | 6.8% |
| Llama-3.2-3B-Instruct | 24 | 3 | 2058.97 | 2146.62 | 4.3% |
| Qwen1.5-7B-Chat | 32 | 1 | 904.46 | 882.53 | -2.4% |

Navi 4

| Model | Num Heads | GQA Ratio | Output Token/s (original) | Output Token/s (custom) | Gain |
|---|---|---|---|---|---|
| glm-4-9b-chat | 32 | 16 | 1195.56 | 1433.13 | 19.9% |
| chatglm3-6b | 32 | 16 | 1750.34 | 1962.21 | 12.1% |
| Meta-Llama-3.1-8B-Instruct | 32 | 4 | 1405.42 | 1516.69 | 7.9% |
| Llama-3.2-3B-Instruct | 24 | 3 | 2419.31 | 2561.47 | 5.9% |
| Qwen1.5-7B-Chat | 32 | 1 | 765.6 | 761.3 | -0.6% |
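
As an aside, the "GQA Ratio" column follows from a model's head configuration (query heads divided by KV heads). A small sketch, with example head counts that are assumptions and not taken from this PR:

```python
# Illustrative only: deriving the GQA ratio from a model config.
def gqa_ratio(num_attention_heads: int, num_key_value_heads: int) -> int:
    assert num_attention_heads % num_key_value_heads == 0
    return num_attention_heads // num_key_value_heads

# e.g. a model with 32 query heads and 8 KV heads has a GQA ratio of 4,
# which lands inside the 3..16 range where the custom kernel is enabled.
print(gqa_ratio(32, 8))  # 4
```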

@shajrawi (Collaborator) left a comment:

Thank you for the PR! Can you please create it against upstream instead, per our contribution guidelines? https://github.com/vllm-project/vllm

@hyoon1 force-pushed the custom_kernel_navi branch from 63b41a7 to e689d99 on February 26, 2025
@hyoon1 force-pushed the custom_kernel_navi branch from 047b9ce to c8fe9aa on March 5, 2025
@hyoon1 changed the title from "Enable custom paged attention kernel for Navi3x" to "Enable custom paged attention kernel for Navi 3/4" on March 5, 2025
@hyoon1 force-pushed the custom_kernel_navi branch from c8fe9aa to dfcccb5 on March 5, 2025
@hyoon1 force-pushed the custom_kernel_navi branch from dfcccb5 to 5335b48 on March 6, 2025