[Bug]: Qwen2.5-32B-GPTQ-Int4 inference !!!!!
#10656
Comments
I encountered the same issue; the only difference is the vLLM version: 0.6.1.
Also cc @mgoin
As far as I can tell, the gptq kernel hasn't been touched all year; the last change was #2330 by @chu-tianxiang. This may be a fundamental issue with the kernel for this model, and someone would need to dive in and learn about it.
I had the same problem when using the Qwen2-72B-Instruct model. Is there a solution now?
Hi, it appears that #11493 is about the marlin gptq kernel, while this issue is about the previous gptq kernel. I wonder if it's also fixed.
Yes, that's right. I didn't notice that this was run with the original gptq kernel. My PR addresses the issue within gptq_marlin. We might need to reopen this issue.
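For readers following along, here is a minimal sketch (not from the thread) of how the two kernels are selected in vLLM; the model path is a placeholder and the exact auto-selection behaviour depends on the GPU:

```python
# Sketch only: vLLM normally upgrades a GPTQ checkpoint to the gptq_marlin
# kernel on GPUs that support it, so the original gptq kernel (the one this
# issue is about) has to be requested explicitly.
from vllm import LLM

MODEL = "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4"  # placeholder model path

def load_model(force_original_gptq: bool) -> LLM:
    # quantization=None lets vLLM auto-select a kernel (typically gptq_marlin
    # on Ampere and newer GPUs); quantization="gptq" pins the original kernel.
    return LLM(model=MODEL, quantization="gptq" if force_original_gptq else None)
```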
We do not really have the bandwidth to investigate this, so we would welcome a contribution from anyone in the community! Additionally, one could explore extending the W4 Triton kernels to support GPTQ models (currently they run with AWQ only). This could be a good long-term solution if anyone is up for a challenge!
Your current environment
The output of `python collect_env.py`
N/A; happened to multiple users.
Model Input Dumps
No response
🐛 Describe the bug
We have been receiving reports that the 4-bit GPTQ version of Qwen2.5-32B-Instruct cannot be used with `vllm`. The generation only contains `!!!!!`. However, it was also reported that the same model works using `transformers` and `auto_gptq`.

Here are some related issues:
We attempted to reproduce the issue, which appears related to the quantization kernels. A summary of our findings:

- `gptq_marlin` works
- `gptq` fails for requests with `len(prompt_token_ids) <= 50` but works for longer input sequences (see the sketch below)

The results are consistent across:

- `tensor-parallel-size`: 2, 4, 8
- `vllm` versions: v0.6.1.post2, v0.6.2, v0.6.3.post1, v0.6.4.post1

As `gptq_marlin` is not available for Turing and Volta cards, we are unable to find a workaround for those users. It would help a lot if someone could investigate the issue.
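For reference, a minimal reproduction sketch under the assumptions above; the prompts, token counts, and tensor-parallel size are illustrative, not taken from the original reports:

```python
# Sketch of the short-vs-long prompt comparison described above; prompts are
# placeholders, and quantization="gptq" forces the affected kernel.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",
    quantization="gptq",      # bypass the marlin upgrade to hit the affected kernel
    tensor_parallel_size=2,   # 2, 4, and 8 were all reported to behave the same
)
params = SamplingParams(temperature=0.0, max_tokens=32)

short_prompt = "Hello, who are you?"                             # well under ~50 prompt tokens
long_prompt = "Please summarize the following paragraph. " * 20  # well over 50 prompt tokens

for name, prompt in [("short", short_prompt), ("long", long_prompt)]:
    text = llm.generate([prompt], params)[0].outputs[0].text
    # The reports above say the short prompt yields only "!!!!!" while the
    # long prompt produces normal text.
    print(f"{name}: {text!r}")
```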