
[Badcase]: Word segmentation errors affect qwen2.5-72b-Instruct results #1159

Closed · 4 tasks done
zhaoyukoon opened this issue Jan 13, 2025 · 5 comments
Labels: enhancement (New feature or request)

Comments
@zhaoyukoon

zhaoyukoon commented Jan 13, 2025

Model Series

Qwen2.5

What are the models used?

qwen2.5-72b-Instruct

What is the scenario where the problem happened?

The model gave incorrect responses.

Is this badcase known and can it be solved using available techniques?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find a solution there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

The errors occur with the following prompts:

  1. 李鹏飞和李鹏飞到南京了。请严格根据上文回答:李鹏在哪里?怎么到的? (roughly: "Li Pengfei and Li Peng flew to Nanjing. Answer strictly based on the text above: where is Li Peng, and how did he get there?")
  2. 我去体育商品店里得知乒乓球拍卖完了。请严格根据上文回答问题:乒乓球拍还有货吗? (roughly: "At the sporting goods store I learned that the table tennis paddles were sold out. Answer strictly based on the text above: are table tennis paddles still in stock?")
  3. 你能构造出包含分词不同而语义不同的语句吗?举个例子,"李鹏飞到南京"有"李鹏/飞到/南京"和"李鹏飞/到/南京"两种语义上都是正确分词结果。 (roughly: "Can you construct sentences whose meaning differs under different word segmentations? For example, "李鹏飞到南京" has two valid segmentations: "李鹏/飞到/南京" (Li Peng flew to Nanjing) and "李鹏飞/到/南京" (Li Pengfei arrived in Nanjing).")

Description

Entering the prompts above on the page https://cloud.siliconflow.cn/models?mfs=Qwen2.5 shows the model giving incorrect responses. Given the inherent randomness of LLMs, the responses to these prompts reproduce fairly consistently (though not 100% of the time).

[screenshots of the incorrect model responses]

For further analysis of the results, see the GitHub project https://github.com/zhaoyukoon/damoxing_fenci_gongji

@jklj077
Collaborator

jklj077 commented Jan 13, 2025

I think the badcases you provided are primarily due to the lack of corresponding finetuning data, similar to "how many r's in strawberry?" combined with "ruozhiba". The finetuning process overfits the pretrained model to a specific pattern, and often a specific interpretation, which is just not the correct one for these badcases.

The general background is that for modern large language models, there is no word segmentation; there is only tokenization, e.g., byte-level byte pair encoding for Qwen models. Qwen never tries to do word segmentation; it only merges the most frequent byte sequences into tokens. In that sense, tokens are not words or phrases and should not be treated as such. Tokens are only compressed byte sequences.

With large-scale data (emphasis on large scale), the model can learn the meaning of those tokens. To what extent the model learns the meaning of the tokens depends on how well the model learns. In addition, you are using QA to probe model capabilities, which means the sampling process and the prompt engineering at inference also matter.

For example, the tokenization result for "我去体育商品店里得知乒乓球拍卖完了" is:
[screenshot of the tokenizer output, in which "乒乓球拍卖" is split as [乒乓球][拍卖]]
(https://dashscope.console.aliyun.com/tokenizer)
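A minimal way to reproduce this tokenization locally, assuming the Hugging Face transformers tokenizer for the Qwen/Qwen2.5-72B-Instruct checkpoint (any Qwen2.5 checkpoint shares the same vocabulary):

```python
from transformers import AutoTokenizer

# Assumed repo name; all Qwen2.5 models share this vocabulary.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-72B-Instruct")

text = "我去体育商品店里得知乒乓球拍卖完了"
ids = tokenizer.encode(text)

# Decode each token id separately to see where the byte-level BPE puts boundaries;
# these boundaries are frequency-driven and need not coincide with word boundaries.
print([tokenizer.decode([i]) for i in ids])
```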
However, the model (Qwen-Plus) clearly understands the sentence:
[screenshot: Qwen-Plus answers the question correctly]
(https://chat.qwenlm.ai/s/d11f330f-b27c-457c-af4e-1d4b77922510)
In conclusion, the problem is not primarily due to word segmentation/tokenization but to the lack of corresponding finetuning data. It is a very common issue after finetuning/SFT/RLHF/... and should be addressed with model iterations.

@jklj077 jklj077 added the enhancement New feature or request label Jan 13, 2025
@zhaoyukoon
Author

zhaoyukoon commented Jan 13, 2025

Thanks for the reply!
I am wondering whether Qwen-Plus, Qwen-Max, Qwen2.5-Plus, and Qwen2.5-72B-Instruct share the same vocabulary. If not, is the vocabulary of the first three models publicly available?

I agree that an LLM is a complex black box with a long cascaded pipeline; given only the input and output, it is hard to pinpoint where things go wrong.

However, still taking "我去体育商品店里得知乒乓球拍卖完了。请严格根据上文回答问题:乒乓球拍还有货吗?" as an example, it has the same tokenization result for both qwen2.5-72b-instruct and qwen-plus, with the pattern [乒乓球][拍卖]...[乒乓球][拍?]:
[screenshot of the tokenizer output]
Then I got replies from chat.qwenlm.ai:
https://chat.qwenlm.ai/c/93ed2d0f-1d7c-4923-9b25-c649e6f8946e
[screenshot of the reply]

https://chat.qwenlm.ai/c/04b8fd5f-42ff-4414-bbfa-ba070eca256f
[screenshot of the reply]

Based on the tokenization results and LLM responses above, we can see that Qwen2.5-72B-Instruct's answer strictly follows the tokenization boundary at '乒乓球', while Qwen-Plus is more capable: it can understand '乒乓球/拍' even when given [乒乓球][拍卖] as input. I am curious what the difference is between Qwen2.5-72B-Instruct and Qwen-Plus.

I acknowledge that Qwen-Plus performs better than Qwen2.5-72B-Instruct. Still, as one of the strongest open-source LLMs, Qwen2.5-72B-Instruct makes mistakes on simple tricky prompts. I am wondering what would happen if the tokenization were correct in the first place, e.g., 乒乓球拍卖完了 => [乒乓球][拍][卖][完了]?
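One way to probe this question, sketched here under the assumption that the Hugging Face Qwen/Qwen2.5-72B-Instruct tokenizer is used, is to compare the actual BPE split with the intended pieces and check whether each piece even exists as a single token:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-72B-Instruct")

# How the byte-level BPE actually splits the sentence.
ids = tokenizer.encode("乒乓球拍卖完了")
print([tokenizer.decode([i]) for i in ids])

# Whether the "linguistically correct" pieces each map to a single token.
for piece in ["乒乓球", "拍", "卖", "完了"]:
    n = len(tokenizer.encode(piece))
    print(piece, "->", n, "token(s)")
```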

Moreover, have you ever systematically evaluated Qwen's tokenization quality? I am really interested in this question.

Besides, I tested this prompt on Qwen2.5-32B-Instruct, Qwen2.5-14B-Instruct, and Qwen2.5-7B-Instruct.

[screenshots of the Qwen2.5-32B/14B/7B-Instruct replies]

We can see that the 32B model is almost correct except for '乒乓球拍拍卖', while the 14B and 7B models get confused.

Finally, comparing the vocabulary of Qwen2.5 with DeepSeek-V3: Qwen2.5 has far fewer long Chinese tokens (maximum length 4, compared with up to 16 for DeepSeek-V3).
[chart comparing Chinese token lengths in the two vocabularies] In other words, Qwen2.5's tokenizer is better than DeepSeek-V3's.
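A rough way to do this kind of vocabulary comparison is sketched below; it assumes both tokenizers can be loaded via Hugging Face transformers, and the repo names ("Qwen/Qwen2.5-72B-Instruct", "deepseek-ai/DeepSeek-V3") are assumptions:

```python
from collections import Counter
from transformers import AutoTokenizer

def chinese_token_lengths(repo):
    """Count vocabulary entries made up purely of CJK characters, grouped by length."""
    tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    lengths = Counter()
    for token_id in range(tok.vocab_size):
        token = tok.convert_ids_to_tokens(token_id)
        # Byte-level BPE stores tokens in a byte-encoded form; map back to text first.
        text = tok.convert_tokens_to_string([token])
        if text and all("\u4e00" <= ch <= "\u9fff" for ch in text):
            lengths[len(text)] += 1
    return lengths

print(chinese_token_lengths("Qwen/Qwen2.5-72B-Instruct"))
print(chinese_token_lengths("deepseek-ai/DeepSeek-V3"))
```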


@zhaoyukoon
Author

I tried a simpler method: adding a space inside "我去体育商品店里得知乒乓球拍卖完了。请严格根据上文回答问题:乒乓球拍还有货吗?".

Even the 7B model gives the correct answer with "乒乓球拍卖完了" => "乒乓球拍 卖完了":
[screenshot of the Qwen2.5-7B-Instruct reply]

However, Qwen-Plus gives a wrong answer with "乒乓球拍卖完了" => "乒乓球 拍卖完了":

[screenshot of the Qwen-Plus reply]
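The effect of the inserted space on token boundaries can be checked directly (a sketch, again assuming a Hugging Face Qwen2.5 tokenizer; the 7B repo name is used only for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Compare the original sentence with the two space-disambiguated variants.
for text in ["乒乓球拍卖完了", "乒乓球拍 卖完了", "乒乓球 拍卖完了"]:
    ids = tokenizer.encode(text)
    print(text, "->", [tokenizer.decode([i]) for i in ids])
```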

@jklj077
Collaborator

jklj077 commented Jan 14, 2025

I am wondering whether Qwen-Plus, Qwen-Max, Qwen2.5-Plus, and Qwen2.5-72B-Instruct share the same vocabulary. If not, is the vocabulary of the first three models publicly available?

They indeed use the same vocabulary. You can check with the tool at https://dashscope.console.aliyun.com/tokenizer .

Have you ever systematically evaluated Qwen's tokenization quality?

For general comparisons on tokenization, you can take a look at the Qwen Technical Report. Not much was published though.

In other words, Qwen2.5's tokenizer is better than DeepSeek-V3's.

Different tokenizers adopt different strategies. DeepSeek-V3 may produce fewer tokens for Chinese text, which can be an advantage in terms of memory footprint and inference speed for Chinese. The same applies to Qwen on code. However, if the training corpora are large and diverse enough (and the meaning of the tokens is indeed important in the sequences), the difference between the resulting models should not be substantial.
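For intuition, the compression difference can be measured directly by counting tokens per character on a sample sentence (a sketch; the repo names are assumptions):

```python
from transformers import AutoTokenizer

sample = "我去体育商品店里得知乒乓球拍卖完了。请严格根据上文回答问题:乒乓球拍还有货吗?"

for repo in ["Qwen/Qwen2.5-72B-Instruct", "deepseek-ai/DeepSeek-V3"]:
    tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    n = len(tok.encode(sample))
    print(f"{repo}: {n} tokens ({n / len(sample):.2f} tokens per character)")
```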

"乒乓球拍 卖完了" vs "乒乓球 拍卖完了"

Trick sentences (or sentences constructed primarily for grammar studies) are generally not that common in the wild or in pretraining data, and the context in those texts may not be enough for the model to correctly learn their meanings (if one believes in distributional semantics). They are mostly addressed in finetuning. The results here could be due to the lack of corresponding data (what "乒乓球拍卖完了" and "乒乓球 拍卖完了" mean) or to Qwen2.5-Plus overfitting to the most probable interpretation.

@jklj077 jklj077 closed this as completed Jan 14, 2025