[Badcase]: Findings on tokenization errors affecting qwen2.5-72b-Instruct results #1159
Comments
I think the badcases you provided are primarily due to the lack of corresponding finetuning data, similar to "how many r's are in strawberry?" combined with "ruozhiba". The finetuning process overfits the pretrained model to a specific pattern and often a specific interpretation, just not the ones that are correct for these badcases.

The general background is that modern large language models do no word segmentation; there is only tokenization, e.g., byte-level byte pair encoding (BBPE) for Qwen models. Qwen never tries to do word segmentation and only merges the most frequent byte sequences into tokens. In that sense, tokens are not words or phrases and should not be treated as such; they are only compressed byte sequences. With large-scale data (emphasis on large scale), the model can learn the meaning of those tokens. To what extent the model learns the meaning of the tokens depends on how well the model learns.

In addition, you are using QA to probe model capabilities, which means the sampling process and the prompt engineering at inference time also matter. For example, the tokenization result for "我去体育商品店里得知乒乓球拍卖完了" (roughly, "I went to the sporting-goods store and learned that the ping-pong paddles are sold out", where 乒乓球拍卖 can also be segmented as 乒乓球 / 拍卖, i.e. "ping-pong balls / auction") is:
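The split can also be inspected locally with the published tokenizer. A minimal sketch, assuming the Hugging Face transformers package and the Hub repo Qwen/Qwen2.5-72B-Instruct (the Qwen2.5 sizes share a tokenizer):

```python
# Sketch: inspect how the BBPE tokenizer splits the ambiguous sentence.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-72B-Instruct")

text = "我去体育商品店里得知乒乓球拍卖完了"
ids = tokenizer.encode(text, add_special_tokens=False)

# Decode each id on its own so byte-level pieces render as readable text.
pieces = [tokenizer.decode([i]) for i in ids]
print(pieces)  # shows whether 乒乓球 / 拍卖 or 乒乓球拍 / 卖 ends up merged
```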
Thanks for the reply! I agree that an LLM is a complex black-box model with a long cascaded pipeline, so it is hard to pin down the exact failure point given only input and output. However, still taking "我去体育商品店里得知乒乓球拍卖完了。请严格根据上文回答问题:乒乓球拍还有货吗?" as an example (the question asks, strictly based on the preceding sentence, whether ping-pong paddles are still in stock), it has the same tokenization result for both qwen2.5-72b-instruct and qwen-plus: https://chat.qwenlm.ai/c/04b8fd5f-42ff-4414-bbfa-ba070eca256f

Based on the above tokenization result and the LLM responses, we can see that Qwen2.5-72B-Instruct's answer strictly follows the tokenization of '乒乓球', while Qwen-plus is more capable: it can understand '乒乓球/拍' even when given [乒乓球/拍卖] as input. I am curious what the difference is between Qwen2.5-72B-Instruct and Qwen-plus. I acknowledge that Qwen-plus performs better than Qwen2.5-72B-Instruct, but as one of the strongest open-source LLMs, Qwen2.5-72B-Instruct still makes mistakes on simple tricky prompts. I am wondering what would happen if the tokenization were correct in the first step, e.g., 乒乓球拍卖完了 => [乒乓球][拍][卖][完了]. Moreover, have you ever systematically evaluated Qwen's tokenization results? I am really interested in this question.

Besides, I tested this prompt on Qwen2.5-32B-Instruct, Qwen2.5-14B-Instruct, and Qwen2.5-7B-Instruct. We can see that 32B is almost correct except

Finally, comparing the vocabulary of Qwen2.5 with DeepSeek-V3's, Qwen2.5 has far fewer long Chinese tokens, with a maximum length of 4 versus 16 for DeepSeek-V3.
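The vocabulary comparison can be checked directly from the published tokenizers. A minimal sketch, assuming the Hugging Face transformers package and the Hub repos Qwen/Qwen2.5-72B-Instruct and deepseek-ai/DeepSeek-V3; the helper name is mine:

```python
# Sketch: histogram of purely-Chinese tokens in a vocabulary, by character length.
from transformers import AutoTokenizer

def chinese_token_lengths(model_name: str) -> dict[int, int]:
    tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    hist: dict[int, int] = {}
    for token_id in range(tok.vocab_size):
        text = tok.decode([token_id])
        # Keep only tokens made entirely of CJK unified ideographs.
        if text and all("\u4e00" <= ch <= "\u9fff" for ch in text):
            hist[len(text)] = hist.get(len(text), 0) + 1
    return dict(sorted(hist.items()))

for name in ("Qwen/Qwen2.5-72B-Instruct", "deepseek-ai/DeepSeek-V3"):
    print(name, chinese_token_lengths(name))
```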
They indeed use the same vocabulary. You can check with the tools at https://dashscope.console.aliyun.com/tokenizer.
For general comparisons on tokenization, you can take a look at the Qwen Technical Report. Not much was published though.
Different tokenizers adopt different strategies. DeepSeek-V3 can produce fewer tokens for Chinese text, which may bring advantages in memory footprint and inference speed for Chinese; the same applies to Qwen on code. However, if the training corpora are large and diverse enough (and the meaning of the tokens is indeed important in the sequences), the difference in the resulting models should not be substantial.
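As a concrete way to see the compression difference, here is a minimal sketch (same assumptions as above: the Hugging Face transformers package and those two Hub repos) that counts tokens for the sentence discussed in this issue:

```python
# Sketch: compare how many tokens each tokenizer needs for the same Chinese text.
from transformers import AutoTokenizer

sample = "我去体育商品店里得知乒乓球拍卖完了。请严格根据上文回答问题:乒乓球拍还有货吗?"

for name in ("Qwen/Qwen2.5-72B-Instruct", "deepseek-ai/DeepSeek-V3"):
    tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    ids = tok.encode(sample, add_special_tokens=False)
    print(f"{name}: {len(ids)} tokens, {len(sample) / len(ids):.2f} chars per token")
```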
Tricky sentences (or sentences constructed primarily for grammar studies) are generally rare in the real world and in pretraining data, and the context in such texts may not be (important) enough for the model to correctly learn their meanings (if one believes in distributional semantics). They are mostly addressed in finetuning. The results you see could stem from a lack of corresponding data (what "乒乓球拍卖完了" and "乒乓球 拍卖完了" mean) or from Qwen2.5-Plus overfitting to the most probable interpretation.
Model Series
Qwen2.5
What are the models used?
qwen2.5-72b-Instruct
What is the scenario where the problem happened?
The large model gave an incorrect response.
Is this badcase known and can it be solved using available techniques?
Information about environment
The following prompts trigger the error:
Description
Enter the above prompts on the page https://cloud.siliconflow.cn/models?mfs=Qwen2.5 and you can see the model give an incorrect response. Given the randomness of large models, the incorrect responses to the above prompts can be reproduced fairly reliably (though not 100% guaranteed).
For more analysis of the results, see the GitHub project https://github.com/zhaoyukoon/damoxing_fenci_gongji
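For a scripted reproduction against any OpenAI-compatible endpoint serving qwen2.5-72b-Instruct, a minimal sketch follows; the base URL, API key, and model identifier are placeholders to fill in for whichever provider you use, not an official reproduction script:

```python
# Sketch: send the badcase prompt to an OpenAI-compatible chat endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-PROVIDER/v1",  # placeholder: provider's OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",               # placeholder
)

prompt = "我去体育商品店里得知乒乓球拍卖完了。请严格根据上文回答问题:乒乓球拍还有货吗?"

response = client.chat.completions.create(
    model="Qwen2.5-72B-Instruct",  # placeholder: use the provider's exact model id
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,  # sampling randomness: rerun a few times to observe variability
)
print(response.choices[0].message.content)
```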