[Bugfix] Restore support for larger block sizes #11259
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
This reverts commit 69ba344. Signed-off-by: Konrad Zawora <[email protected]>
This is reasonable, and thanks for the clear error message for CUDA.
Thanks for catching this
Found this PR because I got an error saying the block size can only be up to 32 on CUDA. However, I was using larger block sizes with FlashAttention without any issue. Just want to confirm: do we really have this constraint for all use cases on CUDA?
This PR reverts #10938, expands the description of block size, and adds an assertion for CUDA-supported block sizes in CacheConfig.
While GPU kernels might not support block sizes greater than 32, other accelerators do. On HPU, going below a block size of 128 is very detrimental to performance, and 128 is used there by default.
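The validation the PR describes could be sketched roughly as below. This is a hypothetical illustration, not vLLM's actual CacheConfig code: the function name, the device-type check, and the exact set of supported CUDA block sizes are assumptions made for the example.

```python
# Hypothetical sketch of device-aware block-size validation.
# The supported-size set below is illustrative; consult the actual
# vLLM CacheConfig for the real constraint.
CUDA_SUPPORTED_BLOCK_SIZES = (8, 16, 32)


def validate_block_size(block_size: int, device: str) -> None:
    """Reject block sizes the CUDA paged-attention kernels cannot handle,
    while leaving other accelerators (e.g. HPU, which defaults to a block
    size of 128 for performance) free to use larger blocks."""
    if device == "cuda" and block_size not in CUDA_SUPPORTED_BLOCK_SIZES:
        raise ValueError(
            f"CUDA paged attention supports block sizes "
            f"{CUDA_SUPPORTED_BLOCK_SIZES}, got {block_size}."
        )
```

The key design point from the discussion: the check is scoped to CUDA rather than applied globally, so HPU and other backends keep their larger defaults.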