[ROCm] Add support for Punica kernels on AMD GPUs #3140
Conversation
@hongxiayang @lcskrishna Could you help review this PR?
@hongxiayang @dllehr-amd Could you review this PR? This is an important PR that enables AMD GPUs to support multi-LoRA serving, a key vLLM feature that many users rely on.
This script can help verify that this works end to end: https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py
Will check. Thanks for this effort.
I did some verification and ran the e2e verify script.
At first the Punica C extension was not built by default and the import failed. I then set the env var VLLM_INSTALL_PUNICA_KERNELS and rebuilt the Docker image, and it worked afterwards. This may be worth a mention in the documentation if it isn't there already.
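For reference, a minimal sketch of the build step described above, assuming a source checkout of vLLM (the exact pip invocation is an assumption, not taken from this thread):

# Opt in to building the optional Punica extension, then build from source.
VLLM_INSTALL_PUNICA_KERNELS=1 pip install -e .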
#else

#include <hip/hip_bf16.h>
#include <hip/hip_fp16.h>
If this file is hipified, this should not be needed.
Ideally we should not need this, but we may still need it for the time being: in quite a few environments I've tested, including the rocm/pytorch images that Dockerfile.rocm builds from, the hipify tool fails to convert the bf16 header.
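For context, a minimal sketch of the include guard being discussed (the USE_ROCM symbol follows vLLM's convention elsewhere; its use in this particular file is an assumption):

#ifndef USE_ROCM
  #include <cuda_bf16.h>
  #include <cuda_fp16.h>
#else
  // hipify may fail to rewrite the bf16 header, so include the HIP
  // equivalents explicitly when building for ROCm.
  #include <hip/hip_bf16.h>
  #include <hip/hip_fp16.h>
#endif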
//====== pybind ======

#define DEFINE_pybind(name) m.def(#name, &name, #name);
Is this macro used?
This is not used, but it is shipped with the CUDA implementation, so I kept it for consistency when refactoring.
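To illustrate what the macro does, here is a hedged, self-contained sketch (the function and module names are hypothetical, chosen only for this example):

#include <pybind11/pybind11.h>

// Hypothetical binding used only for illustration.
int dispatch_example(int x) { return x; }

#define DEFINE_pybind(name) m.def(#name, &name, #name);

PYBIND11_MODULE(example_ops, m) {
  // Expands, via the preprocessor's stringizing operator, to:
  //   m.def("dispatch_example", &dispatch_example, "dispatch_example");
  // i.e. it registers a C++ function with pybind11 under its own name.
  DEFINE_pybind(dispatch_example)
}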
@kliuae please ping @simon-mo or @WoosukKwon to further review/approve this PR so that it can be merged.
I was going to try this out soon; is this in a good spot, or is it still being worked on?
It's in a good state for testing, though occasionally I'll be merging in upstream changes to fix conflicts before it gets merged.
@kliuae Sorry for the late review. The PR looks good. Could you please resolve the merge conflict in CMakeLists.txt so that I can merge it? Thanks!
@WoosukKwon Merge conflicts are resolved.
Co-authored-by: miloice <[email protected]>
This PR adds ROCm support for the Punica kernels to enable multi-LoRA serving on AMD GPUs.
Some Punica files are slightly refactored so that the correct C++/hipcc compilers are invoked when building under ROCm.
A custom bgmv shrink kernel is added to account for the difference in warp size between AMD's GPUs and NVIDIA's (see the sketch below).
The port has been tested on an MI210, and the unit tests applying LoRA pass.
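As a hedged illustration of the warp-size point above (the constant and guard names are assumptions, not taken from this PR): AMD CDNA GPUs such as the MI210 execute 64-lane wavefronts, while NVIDIA GPUs use 32-lane warps, so a shrink kernel that reduces across a warp must size its tiles accordingly.

// Compile-time warp/wavefront width, selected per platform.
#ifndef USE_ROCM
constexpr int kWarpSize = 32;  // NVIDIA warps are 32 lanes wide
#else
constexpr int kWarpSize = 64;  // AMD wavefronts (e.g. MI210) are 64 lanes wide
#endif
// A bgmv shrink kernel would partition its per-row reduction over
// kWarpSize lanes and adjust its shuffle-based reduction depth to match.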