
[Badcase]: Word segmentation errors affect qwen2.5-72b-Instruct results #1159

Closed · 4 tasks done
zhaoyukoon opened this issue Jan 13, 2025 · 5 comments
Labels: enhancement (New feature or request)

Comments
@zhaoyukoon

zhaoyukoon commented Jan 13, 2025

Model Series

Qwen2.5

What are the models used?

qwen2.5-72b-Instruct

What is the scenario where the problem happened?

The model gave incorrect responses.

Is this badcase known and can it be solved using available techniques?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find a solution there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

The errors occur with the following prompts:

  1. 李鹏飞和李鹏飞到南京了。请严格根据上文回答:李鹏在哪里?怎么到的? (roughly: "Li Pengfei and Li Peng flew to Nanjing. Answer strictly based on the text above: where is Li Peng, and how did he get there?")
  2. 我去体育商品店里得知乒乓球拍卖完了。请严格根据上文回答问题:乒乓球拍还有货吗? (roughly: "At the sporting goods store I learned that the table tennis paddles were sold out. Answer strictly based on the text above: are table tennis paddles still in stock?")
  3. 你能构造出包含分词不同而语义不同的语句吗?举个例子,"李鹏飞到南京"有"李鹏/飞到/南京"和"李鹏飞/到/南京"两种语义上都是正确分词结果。 (roughly: "Can you construct sentences whose meaning differs under different word segmentations? For example, "李鹏飞到南京" has two valid segmentations: "李鹏/飞到/南京" (Li Peng flew to Nanjing) and "李鹏飞/到/南京" (Li Pengfei arrived in Nanjing).")

Description

Entering the prompts above on the page https://cloud.siliconflow.cn/models?mfs=Qwen2.5 shows the model giving incorrect responses. Given the inherent randomness of LLMs, the responses to these prompts reproduce fairly consistently (though not 100% of the time).

[screenshots of the incorrect model responses]

For further analysis of the results, see the GitHub project https://github.com/zhaoyukoon/damoxing_fenci_gongji

@jklj077
Collaborator

jklj077 commented Jan 13, 2025

I think the badcases you provided are primarily due to the lack of corresponding finetuning data, similar to "how many r's in strawberry?" combined with "ruozhiba". The finetuning process overfits the pretrained model to a specific pattern, and often a specific interpretation, which is just not the correct one for these badcases.

The general background is that for modern large language models, there is no word segmentation; there is only tokenization, e.g., byte-level byte pair encoding for Qwen models. Qwen never tries to do word segmentation; it only merges the most frequent byte sequences into tokens. In that sense, tokens are not words or phrases and should not be treated as such. Tokens are only compressed byte sequences.

With large-scale data (emphasis on large scale), the model can learn the meaning of those tokens. To what extent the model learns the meaning of the tokens depends on how well the model learns. In addition, you are using QA to probe model capabilities, which means the sampling process and the prompt engineering at inference also matter.

For example, the tokenization result for "我去体育商品店里得知乒乓球拍卖完了" is:
[screenshot of the tokenizer output, in which "乒乓球拍卖" is split as [乒乓球][拍卖]]
(https://dashscope.console.aliyun.com/tokenizer)
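A minimal way to reproduce this tokenization locally, assuming the Hugging Face transformers tokenizer for the Qwen/Qwen2.5-72B-Instruct checkpoint (any Qwen2.5 checkpoint shares the same vocabulary):

```python
from transformers import AutoTokenizer

# Assumed repo name; all Qwen2.5 models share this vocabulary.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-72B-Instruct")

text = "我去体育商品店里得知乒乓球拍卖完了"
ids = tokenizer.encode(text)

# Decode each token id separately to see where the byte-level BPE puts boundaries;
# these boundaries are frequency-driven and need not coincide with word boundaries.
print([tokenizer.decode([i]) for i in ids])
```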
However, the model (Qwen-Plus) clearly understands the sentence:
[screenshot: Qwen-Plus answers the question correctly]
(https://chat.qwenlm.ai/s/d11f330f-b27c-457c-af4e-1d4b77922510)
In conclusion, the problem is not primarily due to word segmentation/tokenization but to the lack of corresponding finetuning data. It is a very common issue after finetuning/SFT/RLHF/... and should be addressed with model iterations.

@jklj077 jklj077 added the enhancement New feature or request label Jan 13, 2025
@zhaoyukoon
Author

zhaoyukoon commented Jan 13, 2025

Thanks for the reply!
I am wondering whether Qwen-Plus, Qwen-Max, Qwen2.5-Plus, and Qwen2.5-72B-Instruct share the same vocabulary. If not, is the vocabulary of the first three models publicly available?

I agree that an LLM is a complex black box with a long cascaded pipeline; given only the input and output, it is hard to pinpoint where things go wrong.

However, still taking "我去体育商品店里得知乒乓球拍卖完了。请严格根据上文回答问题:乒乓球拍还有货吗?" as an example, it has the same tokenization result for both qwen2.5-72b-instruct and qwen-plus, with the pattern [乒乓球][拍卖]...[乒乓球][拍?]:
[screenshot of the tokenizer output]
Then I got replies from chat.qwenlm.ai:
https://chat.qwenlm.ai/c/93ed2d0f-1d7c-4923-9b25-c649e6f8946e
[screenshot of the reply]

https://chat.qwenlm.ai/c/04b8fd5f-42ff-4414-bbfa-ba070eca256f
[screenshot of the reply]

Based on the tokenization results and LLM responses above, we can see that Qwen2.5-72B-Instruct's answer strictly follows the tokenization boundary at '乒乓球', while Qwen-Plus is more capable: it can understand '乒乓球/拍' even when given [乒乓球][拍卖] as input. I am curious what the difference is between Qwen2.5-72B-Instruct and Qwen-Plus.

I acknowledge that Qwen-Plus performs better than Qwen2.5-72B-Instruct. Still, as one of the strongest open-source LLMs, Qwen2.5-72B-Instruct makes mistakes on simple tricky prompts. I am wondering what would happen if the tokenization were correct in the first place, e.g., 乒乓球拍卖完了 => [乒乓球][拍][卖][完了]?
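One way to probe this question, sketched here under the assumption that the Hugging Face Qwen/Qwen2.5-72B-Instruct tokenizer is used, is to compare the actual BPE split with the intended pieces and check whether each piece even exists as a single token:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-72B-Instruct")

# How the byte-level BPE actually splits the sentence.
ids = tokenizer.encode("乒乓球拍卖完了")
print([tokenizer.decode([i]) for i in ids])

# Whether the "linguistically correct" pieces each map to a single token.
for piece in ["乒乓球", "拍", "卖", "完了"]:
    n = len(tokenizer.encode(piece))
    print(piece, "->", n, "token(s)")
```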

Moreover, have you ever systematically evaluated Qwen's tokenization quality? I am really interested in this question.

Besides, I tested this prompt on Qwen2.5-32B-Instruct, Qwen2.5-14B-Instruct, and Qwen2.5-7B-Instruct.

[screenshots of the Qwen2.5-32B/14B/7B-Instruct replies]

We can see that the 32B model is almost correct except for '乒乓球拍拍卖', while the 14B and 7B models get confused.

Finally, comparing the vocabulary of Qwen2.5 with DeepSeek-V3: Qwen2.5 has far fewer long Chinese tokens (maximum length 4, compared with up to 16 for DeepSeek-V3).
[chart comparing Chinese token lengths in the two vocabularies] In other words, Qwen2.5's tokenizer is better than DeepSeek-V3's.
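A rough way to do this kind of vocabulary comparison is sketched below; it assumes both tokenizers can be loaded via Hugging Face transformers, and the repo names ("Qwen/Qwen2.5-72B-Instruct", "deepseek-ai/DeepSeek-V3") are assumptions:

```python
from collections import Counter
from transformers import AutoTokenizer

def chinese_token_lengths(repo):
    """Count vocabulary entries made up purely of CJK characters, grouped by length."""
    tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    lengths = Counter()
    for token_id in range(tok.vocab_size):
        token = tok.convert_ids_to_tokens(token_id)
        # Byte-level BPE stores tokens in a byte-encoded form; map back to text first.
        text = tok.convert_tokens_to_string([token])
        if text and all("\u4e00" <= ch <= "\u9fff" for ch in text):
            lengths[len(text)] += 1
    return lengths

print(chinese_token_lengths("Qwen/Qwen2.5-72B-Instruct"))
print(chinese_token_lengths("deepseek-ai/DeepSeek-V3"))
```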


@zhaoyukoon
Author

I tried a simpler method: adding a space inside "我去体育商品店里得知乒乓球拍卖完了。请严格根据上文回答问题:乒乓球拍还有货吗?".

Even the 7B model gives the correct answer with "乒乓球拍卖完了" => "乒乓球拍 卖完了":
[screenshot of the Qwen2.5-7B-Instruct reply]

However, Qwen-Plus gives a wrong answer with "乒乓球拍卖完了" => "乒乓球 拍卖完了":

[screenshot of the Qwen-Plus reply]
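The effect of the inserted space on token boundaries can be checked directly (a sketch, again assuming a Hugging Face Qwen2.5 tokenizer; the 7B repo name is used only for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Compare the original sentence with the two space-disambiguated variants.
for text in ["乒乓球拍卖完了", "乒乓球拍 卖完了", "乒乓球 拍卖完了"]:
    ids = tokenizer.encode(text)
    print(text, "->", [tokenizer.decode([i]) for i in ids])
```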

@jklj077
Collaborator

jklj077 commented Jan 14, 2025

I am wondering whether Qwen-Plus, Qwen-Max, Qwen2.5-Plus, and Qwen2.5-72B-Instruct share the same vocabulary. If not, is the vocabulary of the first three models publicly available?

They indeed use the same vocabulary. You can check with the tool at https://dashscope.console.aliyun.com/tokenizer .

Have you ever systematically evaluated Qwen's tokenization quality?

For general comparisons on tokenization, you can take a look at the Qwen Technical Report. Not much was published though.

In other words, Qwen2.5's tokenizer is better than DeepSeek-V3's.

Different tokenizers adopt different strategies. DeepSeek-V3 may produce fewer tokens for Chinese text, which can be an advantage in terms of memory footprint and inference speed for Chinese. The same applies to Qwen on code. However, if the training corpora are large and diverse enough (and the meaning of the tokens is indeed important in the sequences), the difference between the resulting models should not be substantial.
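For intuition, the compression difference can be measured directly by counting tokens per character on a sample sentence (a sketch; the repo names are assumptions):

```python
from transformers import AutoTokenizer

sample = "我去体育商品店里得知乒乓球拍卖完了。请严格根据上文回答问题:乒乓球拍还有货吗?"

for repo in ["Qwen/Qwen2.5-72B-Instruct", "deepseek-ai/DeepSeek-V3"]:
    tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    n = len(tok.encode(sample))
    print(f"{repo}: {n} tokens ({n / len(sample):.2f} tokens per character)")
```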

"乒乓球拍 卖完了" vs "乒乓球 拍卖完了"

Trick sentences (or sentences constructed primarily for grammar studies) are generally not that common in the wild or in pretraining data, and the context in those texts may not be enough for the model to correctly learn their meanings (if one believes in distributional semantics). They are mostly addressed in finetuning. The results here could be due to the lack of corresponding data (what "乒乓球拍卖完了" and "乒乓球 拍卖完了" mean) or to Qwen2.5-Plus overfitting to the most probable interpretation.

@jklj077 jklj077 closed this as completed Jan 14, 2025