📚 The doc issue

In the args:
https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L454
it says about the block_size parameter:

"Token block size for contiguous chunks of tokens. This is ignored on neuron devices and set to --max-model-len. On CUDA devices, only block sizes up to 32 are supported. On HPU devices, block size defaults to 128."

Where does this requirement of block sizes up to 32 on CUDA devices come from? I was able to run vLLM successfully with block_size 128 on Hopper and saw a minor performance improvement. Is the requirement up to date?

Related FlashAttention (Hopper) interface:
https://github.com/Dao-AILab/flash-attention/blob/d82bbf26924c492064af8b27ab299ff4808d1bf6/hopper/flash_attn_interface.py#L662
Does vLLM use this interface? How does the FA paged_block_size relate to the vLLM block_size?
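For reference, a minimal sketch of this kind of run (not the exact setup described above; the model name is only a placeholder, and it assumes the LLM constructor forwards the block_size engine argument to EngineArgs, with accepted values and defaults possibly differing across vLLM versions and attention backends):

```python
# Minimal sketch: offline vLLM inference with a non-default KV-cache block size.
# Assumes block_size is forwarded from the LLM constructor to EngineArgs;
# the model name below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    block_size=128,  # KV-cache block size in tokens; the docs claim <= 32 on CUDA
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=16),
)
print(outputs[0].outputs[0].text)
```

With paged attention, a sequence of N tokens occupies roughly ceil(N / block_size) KV-cache blocks, so a larger block_size mainly trades more internal fragmentation for fewer, larger blocks.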
Suggest a potential alternative/fix
No response
Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.