
fixbug: Llama 3 should use <|end_of_text|> to mark the end of text during continued pretraining #4204

Merged (2 commits) Jun 11, 2024

Conversation

@dignfei (Contributor) commented Jun 11, 2024

What does this PR do?

The tokenizer.eos_token that Llama 3 uses during pretraining is '<|end_of_text|>', but Meta-Llama-3-8B-Instruct and many Chinese Llama 3 models changed it to '<|eot_id|>'. Continued pretraining should use '<|end_of_text|>'.

Fixes # (issue)

After extensive continued-pretraining comparison experiments, I found this bug: the tokenizer.eos_token Llama 3 uses during pretraining is '<|end_of_text|>', so this same token must be appended after each training sample, not '<|eot_id|>'; otherwise continued pretraining can easily cause severe performance degradation.
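The fix described above can be sketched as a small preprocessing step: terminate each pretraining document with '<|end_of_text|>', stripping any chat-style '<|eot_id|>' that an Instruct tokenizer may have left behind. This is a minimal illustration of the idea, not the PR's actual diff; the function name is hypothetical.

```python
# Llama 3's pretraining EOS token vs. the Instruct chat end-of-turn token.
EOS_PRETRAIN = "<|end_of_text|>"
EOT_CHAT = "<|eot_id|>"

def append_pretrain_eos(documents):
    """Terminate each document with <|end_of_text|> for continued pretraining.

    If a document already ends with the chat-style <|eot_id|>, strip it
    first, since mixing the two terminators is exactly the bug reported here.
    """
    out = []
    for doc in documents:
        if doc.endswith(EOT_CHAT):
            doc = doc[: -len(EOT_CHAT)]
        out.append(doc + EOS_PRETRAIN)
    return out
```

In practice the terminator would be read from the base model's tokenizer rather than hard-coded, but the principle is the same: the token that ends each sample must match the one the model saw during its original pretraining.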

Before submitting

…text|>'; this token must also be appended after each training sample, not '<|eot_id|>', otherwise severe performance degradation can easily result
@dignfei dignfei changed the title fixbug: Llama 3 should use the <|end_of_text|> token to mark the end of text during continued pretraining fixbug: Llama 3 should use <|end_of_text|> to mark the end of text during continued pretraining Jun 11, 2024
@hiyouga hiyouga closed this Jun 11, 2024
@hiyouga hiyouga added the wontfix This will not be worked on label Jun 11, 2024
@hiyouga hiyouga reopened this Jun 11, 2024
@hiyouga hiyouga added pending This problem is yet to be addressed and removed wontfix This will not be worked on labels Jun 11, 2024
@hiyouga hiyouga self-requested a review June 11, 2024 09:02
@hiyouga (Owner) left a comment

LGTM

@hiyouga hiyouga merged commit 9049aab into hiyouga:main Jun 11, 2024
1 check passed
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jun 11, 2024
@WeeeicheN commented
According to https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/discussions/105

This problem is likely due to Llama-3's tokenizer_config.json setting: "eos_token": "<|end_of_text|>"

Therefore, later versions (whose tokenizer_config.json sets "eos_token": "<|eot_id|>") should not have this problem.

I suggest adding a new llama3x template for Llama-3.1 and later versions, so that <|eot_id|> can be included when preprocessing pretraining data.
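The suggestion above boils down to: do not hard-code the terminator, but read it from the model's own tokenizer_config.json, since Llama-3 and later versions configure different EOS tokens. A minimal sketch under that assumption (the function name is hypothetical, and eos_token is assumed to be a plain string, though some configs store it as an object with a "content" field):

```python
def pick_eos_token(tokenizer_config):
    """Return the sequence terminator configured in tokenizer_config.json.

    tokenizer_config: the parsed JSON as a dict. Falls back to Llama-3's
    pretraining token when no eos_token is configured.
    """
    eos = tokenizer_config.get("eos_token", "<|end_of_text|>")
    # Some tokenizer configs wrap the token in an AddedToken-style object.
    if isinstance(eos, dict):
        eos = eos.get("content", "<|end_of_text|>")
    return eos
```

This way a Llama-3 checkpoint yields '<|end_of_text|>' and a checkpoint configured with '<|eot_id|>' yields that token instead, matching whatever the model's own pretraining used.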

3 participants