[Misc] Use vllm-flash-attn instead of flash-attn #4686
Conversation
LGTM!
Thank you very much. By the way, does vllm-flash-attn support Turing architecture GPUs like the 2080 Ti? I recall that Turing GPUs support flash-attn 1.
I'm confused. What is the difference between vllm-flash-attn and flash-attn, and why use vllm-flash-attn instead of flash-attn?
Currently, vllm-flash-attn only supports CUDA 12.1. Should I recompile it from source for other CUDA or torch versions?
This PR is to use the pre-built vllm-flash-attn wheel instead of the original flash-attn.
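For context, a minimal sketch of the kind of import fallback a project might use to prefer the pre-built vllm-flash-attn wheel while keeping the original flash-attn as a backup. The shared function name used here (flash_attn_varlen_func) is an assumption that both packages expose the same API; vLLM's actual integration may differ.

```python
# Sketch only: try the pre-built vllm-flash-attn wheel first, and fall back
# to the original flash-attn package if it is not installed. Assumes
# flash_attn_varlen_func is exported by both packages under the same name.
try:
    from vllm_flash_attn import flash_attn_varlen_func
    BACKEND = "vllm-flash-attn"
except ImportError:
    from flash_attn import flash_attn_varlen_func  # original upstream package
    BACKEND = "flash-attn"

print(f"Using FlashAttention backend: {BACKEND}")
```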