
[Bug]: Qwen2.5-32B-GPTQ-Int4 inference !!!!! #10656

Closed
1 task done
jklj077 opened this issue Nov 26, 2024 · 8 comments · Fixed by #11493
Labels
bug Something isn't working

Comments

@jklj077

jklj077 commented Nov 26, 2024

Your current environment

The output of `python collect_env.py`

N/A; happened to multiple users.

Model Input Dumps

No response

🐛 Describe the bug

We have been receiving reports that the 4-bit GPTQ version of Qwen2.5-32B-Instruct cannot be used with vLLM: the generation contains only `!!!!!`. However, it was also reported that the same model works with transformers and auto_gptq.

Here are some related issues:

We attempted to reproduce the issue, which appears related to quantization kernels, and the following is a summary:

  • gptq_marlin works
  • gptq fails for requests with `len(prompt_token_ids) <= 50` but works for longer input sequences

The results are consistent for

  • tensor-parallel-size: 2, 4, 8
  • vllm versions: v0.6.1.post2, v0.6.2, v0.6.3.post1, v0.6.4.post1
  • nvidia driver versions: 535.183.06, 560.35.05

As gptq_marlin is not available on Turing and Volta cards, we have not been able to find a workaround for those users. It would help a lot if someone could investigate the issue.
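To make the two code paths in the summary above concrete, here is a sketch of how the kernel can be selected explicitly when serving the model (the model path and parallelism degree are illustrative; the marlin path requires Ampere or newer GPUs):

```shell
# Affected path: forcing the original gptq kernel reproduces the "!!!!!"
# output for short prompts (roughly <= 50 prompt tokens).
vllm serve Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4 \
    --quantization gptq \
    --tensor-parallel-size 2

# Reported-working path: explicitly select the marlin kernel.
# Not available on Turing/Volta cards.
vllm serve Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4 \
    --quantization gptq_marlin \
    --tensor-parallel-size 2
```

Note that on supported hardware vLLM will normally auto-select the marlin kernel for GPTQ checkpoints, so passing `--quantization gptq` is mainly useful for pinning down which kernel is at fault.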

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@jklj077 jklj077 added the bug Something isn't working label Nov 26, 2024
@youkaichao
Member

cc @robertgshaw2-neuralmagic

@youqugit

I encountered the same issue: only the /chat/completions endpoint returns many `!!!!!`, while the /completions endpoint works fine.

vLLM version: 0.6.1
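One way to compare the two endpoints against a running server, consistent with the short-prompt finding above (the chat template adds only a few tokens, so a short chat request stays under the ~50-token threshold); host, port, and model name here are illustrative:

```shell
# Reportedly works fine
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",
       "prompt": "Hello", "max_tokens": 16}'

# Reportedly returns "!!!!!" (short prompt hits the gptq kernel bug)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 16}'
```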

@DarkLight1337
Member

Also cc @mgoin

@mgoin
Member

mgoin commented Nov 26, 2024

As far as I can tell, the gptq kernel hasn't been touched all year; the last change was #2330 by @chu-tianxiang

This may be a fundamental issue with the kernel for this model; someone would need to dive in and learn about it.

@fsh2102

fsh2102 commented Dec 2, 2024

I had the same problem when using the Qwen2-72B-Instruct model. Is there a solution now?

@jklj077
Author

jklj077 commented Jan 3, 2025

Hi, it appears that #11493 is about the marlin gptq kernel, while this issue is about the previous gptq kernel. I wonder if it's also fixed.

@wchen61
Contributor

wchen61 commented Jan 3, 2025

> Hi, it appears that #11493 is about the marlin gptq kernel, while this issue is about the previous gptq kernel. I wonder if it's also fixed.

Yes, that's right. I didn't notice that this issue involved inference with the original gptq kernel. My PR addresses the issue within gptq_marlin. We might need to reopen this issue.

@robertgshaw2-redhat
Collaborator

We do not really have the bandwidth to investigate this so would welcome a contribution from anyone in the community! Additionally, one could explore extending W4 triton kernels to support GPTQ models (currently they run with AWQ only). This could be a good long term solution if anyone is up for a challenge!
