📚 The doc issue

In the args:
https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L454
it says about the block_size parameter:

"Token block size for contiguous chunks of tokens. This is ignored on neuron devices and set to --max-model-len. On CUDA devices, only block sizes up to 32 are supported. On HPU devices, block size defaults to 128."

Where does this requirement of block sizes up to 32 on CUDA devices come from? I was able to run vLLM successfully with block_size 128 on Hopper and saw a minor performance improvement. Is the requirement up to date?

Related FlashAttention (Hopper) interface:
https://github.com/Dao-AILab/flash-attention/blob/d82bbf26924c492064af8b27ab299ff4808d1bf6/hopper/flash_attn_interface.py#L662
Does vLLM use this interface? How does the FA paged_block_size relate to the vLLM block_size?
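For reference, a minimal sketch of this kind of run (not the exact setup described above; the model name is only a placeholder, and it assumes the LLM constructor forwards the block_size engine argument to EngineArgs, with accepted values and defaults possibly differing across vLLM versions and attention backends):

```python
# Minimal sketch: offline vLLM inference with a non-default KV-cache block size.
# Assumes block_size is forwarded from the LLM constructor to EngineArgs;
# the model name below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    block_size=128,  # KV-cache block size in tokens; the docs claim <= 32 on CUDA
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=16),
)
print(outputs[0].outputs[0].text)
```

With paged attention, a sequence of N tokens occupies roughly ceil(N / block_size) KV-cache blocks, so a larger block_size mainly trades more internal fragmentation for fewer, larger blocks.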
Suggest a potential alternative/fix
No response
Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.