[Bug]: Qwen2.5-32B-GPTQ-Int4 inference !!!!!
#10656
Comments
I encountered the same issue; the only difference is the vLLM version: 0.6.1.
Also cc @mgoin
As far as I can tell, the gptq kernel hasn't been touched all year; the last change was #2330 by @chu-tianxiang. This may be a fundamental issue with the kernel for this model, and someone would need to dive in and learn about it.
I had the same problem when using the Qwen2-72B-Instruct model. Is there a solution now?
Hi, it appears that #11493 is about the marlin gptq kernel, while this issue is about the previous gptq kernel. I wonder if it's also fixed.
Yes, that's right. I didn't notice that this was run with the original gptq kernel. My PR addresses the issue within gptq_marlin. We might need to reopen this issue.
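For readers following along, here is a minimal sketch (not from the thread) of how the two kernels are selected in vLLM; the model path is a placeholder and the exact auto-selection behaviour depends on the GPU:

```python
# Sketch only: vLLM normally upgrades a GPTQ checkpoint to the gptq_marlin
# kernel on GPUs that support it, so the original gptq kernel (the one this
# issue is about) has to be requested explicitly.
from vllm import LLM

MODEL = "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4"  # placeholder model path

def load_model(force_original_gptq: bool) -> LLM:
    # quantization=None lets vLLM auto-select a kernel (typically gptq_marlin
    # on Ampere and newer GPUs); quantization="gptq" pins the original kernel.
    return LLM(model=MODEL, quantization="gptq" if force_original_gptq else None)
```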
We do not really have the bandwidth to investigate this, so we would welcome a contribution from anyone in the community! Additionally, one could explore extending the W4 Triton kernels to support GPTQ models (currently they run with AWQ only). This could be a good long-term solution if anyone is up for a challenge!
Your current environment
The output of `python collect_env.py`
N/A; happened to multiple users.
Model Input Dumps
No response
🐛 Describe the bug
We have been receiving reports that the 4-bit GPTQ version of Qwen2.5-32B-Instruct cannot be used with `vllm`. The generation only contains `!!!!!`. However, it was also reported that the same model works using `transformers` and `auto_gptq`.

Here are some related issues:
We attempted to reproduce the issue, which appears related to the quantization kernels. A summary of our findings:

- `gptq_marlin` works
- `gptq` fails for requests with `len(prompt_token_ids) <= 50` but works for longer input sequences (see the sketch below)

The results are consistent across:

- `tensor-parallel-size`: 2, 4, 8
- `vllm` versions: v0.6.1.post2, v0.6.2, v0.6.3.post1, v0.6.4.post1

As `gptq_marlin` is not available for Turing and Volta cards, we are unable to find a workaround for those users. It would help a lot if someone could investigate the issue.
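For reference, a minimal reproduction sketch under the assumptions above; the prompts, token counts, and tensor-parallel size are illustrative, not taken from the original reports:

```python
# Sketch of the short-vs-long prompt comparison described above; prompts are
# placeholders, and quantization="gptq" forces the affected kernel.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",
    quantization="gptq",      # bypass the marlin upgrade to hit the affected kernel
    tensor_parallel_size=2,   # 2, 4, and 8 were all reported to behave the same
)
params = SamplingParams(temperature=0.0, max_tokens=32)

short_prompt = "Hello, who are you?"                             # well under ~50 prompt tokens
long_prompt = "Please summarize the following paragraph. " * 20  # well over 50 prompt tokens

for name, prompt in [("short", short_prompt), ("long", long_prompt)]:
    text = llm.generate([prompt], params)[0].outputs[0].text
    # The reports above say the short prompt yields only "!!!!!" while the
    # long prompt produces normal text.
    print(f"{name}: {text!r}")
```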