
fixbug: Llama 3 should use <|end_of_text|> to mark the end of text during continued pretraining #4204

Merged (2 commits) Jun 11, 2024

Conversation

@dignfei (Contributor) commented Jun 11, 2024

What does this PR do?

The tokenizer.eos_token that Llama 3 uses during pretraining is '<|end_of_text|>', but Meta-Llama-3-8B-Instruct and many Chinese Llama 3 models changed it to '<|eot_id|>'. Continued pretraining should use '<|end_of_text|>'.

Fixes # (issue)

After extensive continued-pretraining comparison experiments, I found this bug: the tokenizer.eos_token Llama 3 uses during pretraining is '<|end_of_text|>', so this same token must be appended after each training sample, not '<|eot_id|>'; otherwise continued pretraining can easily cause severe performance degradation.
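The fix described above can be sketched as a small preprocessing step: terminate each pretraining document with '<|end_of_text|>', stripping any chat-style '<|eot_id|>' that an Instruct tokenizer may have left behind. This is a minimal illustration of the idea, not the PR's actual diff; the function name is hypothetical.

```python
# Llama 3's pretraining EOS token vs. the Instruct chat end-of-turn token.
EOS_PRETRAIN = "<|end_of_text|>"
EOT_CHAT = "<|eot_id|>"

def append_pretrain_eos(documents):
    """Terminate each document with <|end_of_text|> for continued pretraining.

    If a document already ends with the chat-style <|eot_id|>, strip it
    first, since mixing the two terminators is exactly the bug reported here.
    """
    out = []
    for doc in documents:
        if doc.endswith(EOT_CHAT):
            doc = doc[: -len(EOT_CHAT)]
        out.append(doc + EOS_PRETRAIN)
    return out
```

In practice the terminator would be read from the base model's tokenizer rather than hard-coded, but the principle is the same: the token that ends each sample must match the one the model saw during its original pretraining.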

Before submitting

…text|>'; this token must also be appended after each training sample, not '<|eot_id|>', otherwise severe performance degradation can easily result
@dignfei dignfei changed the title fixbug: Llama 3 should use the <|end_of_text|> token to mark the end of text during continued pretraining fixbug: Llama 3 should use <|end_of_text|> to mark the end of text during continued pretraining Jun 11, 2024
@hiyouga hiyouga closed this Jun 11, 2024
@hiyouga hiyouga added the wontfix This will not be worked on label Jun 11, 2024
@hiyouga hiyouga reopened this Jun 11, 2024
@hiyouga hiyouga added pending This problem is yet to be addressed and removed wontfix This will not be worked on labels Jun 11, 2024
@hiyouga hiyouga self-requested a review June 11, 2024 09:02
@hiyouga (Owner) left a comment

LGTM

@hiyouga hiyouga merged commit 9049aab into hiyouga:main Jun 11, 2024
1 check passed
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jun 11, 2024
@WeeeicheN commented
According to https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/discussions/105

This problem is likely due to Llama-3's tokenizer_config.json setting: "eos_token": "<|end_of_text|>"

Therefore, later versions (whose tokenizer_config.json sets "eos_token": "<|eot_id|>") should not have this problem.

I suggest adding a new llama3x template for Llama-3.1 and later versions, so that <|eot_id|> can be included when preprocessing pretraining data.
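The suggestion above boils down to: do not hard-code the terminator, but read it from the model's own tokenizer_config.json, since Llama-3 and later versions configure different EOS tokens. A minimal sketch under that assumption (the function name is hypothetical, and eos_token is assumed to be a plain string, though some configs store it as an object with a "content" field):

```python
def pick_eos_token(tokenizer_config):
    """Return the sequence terminator configured in tokenizer_config.json.

    tokenizer_config: the parsed JSON as a dict. Falls back to Llama-3's
    pretraining token when no eos_token is configured.
    """
    eos = tokenizer_config.get("eos_token", "<|end_of_text|>")
    # Some tokenizer configs wrap the token in an AddedToken-style object.
    if isinstance(eos, dict):
        eos = eos.get("content", "<|end_of_text|>")
    return eos
```

This way a Llama-3 checkpoint yields '<|end_of_text|>' and a checkpoint configured with '<|eot_id|>' yields that token instead, matching whatever the model's own pretraining used.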

3 participants