[Doc]: Why is max block_size on CUDA 32? #14319

Open

ptarasiewiczNV opened this issue Mar 5, 2025 · 0 comments
Labels
documentation Improvements or additions to documentation

Comments

ptarasiewiczNV commented Mar 5, 2025

📚 The doc issue

In the engine args at
https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L454
the block_size parameter is documented as follows:

Token block size for contiguous chunks of tokens. This is ignored on neuron devices and set to --max-model-len. On CUDA devices, only block sizes up to 32 are supported. On HPU devices, block size defaults to 128.

  1. Where does this requirement of <= 32 on CUDA devices come from?
  2. I was able to successfully run vLLM with block_size 128 on Hopper (see the sketch after this list) and saw a minor performance improvement. Is the requirement up to date?
  3. In the flash-attention docs I see that the paged attention minimum block size is actually 256:
    https://github.com/Dao-AILab/flash-attention/blob/d82bbf26924c492064af8b27ab299ff4808d1bf6/hopper/flash_attn_interface.py#L662
    Does vLLM use this interface? How does FA's paged_block_size relate to vLLM's block_size?
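
For reference, a minimal sketch of the experiment in question 2: launching vLLM offline inference with a block_size above the documented CUDA limit of 32. The model name is a placeholder; block_size is the engine argument described in arg_utils.py above.

```python
# Minimal sketch, assuming a placeholder model: run vLLM on a CUDA device
# with block_size=128, above the documented CUDA maximum of 32.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    block_size=128,  # exceeds the documented CUDA limit of 32
)

# Generate a short completion to confirm the engine runs end to end.
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```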

Suggest a potential alternative/fix

No response

Before submitting a new issue...

  • Make sure you have already searched for relevant issues and asked the chatbot at the bottom-right corner of the documentation page, which can answer many frequently asked questions.
ptarasiewiczNV added the documentation label Mar 5, 2025