The Yi template has a problem (simple test code attached) #4699

rangehow · 2024-07-06T11:06:56Z

Reminder

I have read the README and searched the existing issues.

Reproduction

import os

from transformers import AutoTokenizer

from llamafactory.data import get_template_and_fix_tokenizer


MESSAGES = [
    {"role": "user", "content": "How are you"},
    {"role": "assistant", "content": "I am fine!"},
]

def test_encode_multiturn():
    tokenizer = AutoTokenizer.from_pretrained('01-ai/Yi-1.5-34B-Chat')
    template = get_template_and_fix_tokenizer(tokenizer, name="yi")
    encoded_pairs = template.encode_multiturn(tokenizer, MESSAGES)
    # Print LLaMA-Factory's encoding next to the tokenizer's own chat-template encoding for comparison.
    print(encoded_pairs)
    print(tokenizer.apply_chat_template(MESSAGES))
    print(tokenizer.convert_ids_to_tokens([59597,616]))

The output of the code above is:

[([1581, 59705, 622, 59593, 5858, 46826, 3903, 144, 6546, 678, 641, 7, 59568, 144, 59666, 59705, 622, 59593, 5858, 46826, 765, 13611, 144], [616, 1064, 4064, 99, 7])]
[1581, 59705, 622, 59593, 5858, 46826, 3903, 144, 6546, 678, 641, 7, 59568, 144, 59666, 59705, 622, 59593, 5858, 46826, 765, 13611, 144, 59597, 1064, 4064, 99, 7, 59568, 144]
['I', '▁I']
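To pinpoint where the two encodings diverge, the printed lists can be compared directly. The snippet below is only a sketch: the token IDs are copied from the output above, and `first_mismatch` is a hypothetical helper written for this illustration, not part of LLaMA-Factory or transformers.

```python
# Token IDs copied verbatim from the output above.
prompt_ids = [1581, 59705, 622, 59593, 5858, 46826, 3903, 144, 6546, 678, 641, 7,
              59568, 144, 59666, 59705, 622, 59593, 5858, 46826, 765, 13611, 144]
response_ids = [616, 1064, 4064, 99, 7]
reference_ids = [1581, 59705, 622, 59593, 5858, 46826, 3903, 144, 6546, 678, 641, 7,
                 59568, 144, 59666, 59705, 622, 59593, 5858, 46826, 765, 13611, 144,
                 59597, 1064, 4064, 99, 7, 59568, 144]


def first_mismatch(a, b):
    """Return the first index where a and b differ, or None if one is a prefix of the other."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None


# Join the (prompt, response) pair the way it would be fed to the model.
combined = prompt_ids + response_ids
idx = first_mismatch(combined, reference_ids)
print(idx, combined[idx], reference_ids[idx])  # 23 616 59597
```

The first difference is at index 23, which is exactly the first token of the response: 616 ('▁I') in the template encoding versus 59597 ('I') in the chat-template encoding.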

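It is easy to see that the problem lies in the response part of a dialogue turn. LLaMA-Factory encodes each half of a turn independently, which triggers the Yi tokenizer's behavior of always prepending the "▁" (underscore) marker to the first word of its input, so the resulting tokens do not match those produced by the chat template.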
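The mechanism can be illustrated without downloading the model. The following is a minimal sketch using a toy tokenizer, not the real Yi tokenizer: `toy_encode` is a hypothetical function that only imitates the "mark the first word of every input" behavior described above, and the prompt text is a simplified stand-in for the real template.

```python
import re


def toy_encode(text):
    """Toy stand-in for a SentencePiece-style tokenizer (NOT the real Yi tokenizer).

    Assumption for illustration: the tokenizer behaves as if every input began
    with a space, so the first word of each call gets the "▁" marker.
    """
    text = " " + text  # dummy prefix added to every input
    # A word preceded by a space becomes "▁word"; a newline is its own token.
    pieces = re.findall(r" ?[^ \n]+|\n| ", text)
    return [p.replace(" ", "▁") for p in pieces]


prompt = "assistant\n"
response = "I am fine!"

# Encoding the whole turn at once: "I" follows a newline, so it gets no marker.
joint = toy_encode(prompt + response)
# Encoding each half separately: the response is a fresh input, so "I" gets "▁".
split = toy_encode(prompt) + toy_encode(response)

print(joint)  # ['▁assistant', '\n', 'I', '▁am', '▁fine!']
print(split)  # ['▁assistant', '\n', '▁I', '▁am', '▁fine!']
```

In this toy model the two results differ only in the first token of the response, which mirrors the 'I' versus '▁I' difference seen in the real output.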

hiyouga · 2024-07-14T07:34:31Z

fixed

slow tokenizer for yi models

github-actions bot added the "pending" label ("This problem is yet to be addressed") on Jul 6, 2024

hiyouga closed this as completed in 88a20ba Jul 14, 2024

hiyouga added the "solved" label ("This problem has been already solved") and removed the "pending" label ("This problem is yet to be addressed") on Jul 14, 2024

xtchen96 pushed a commit to xtchen96/LLaMA-Factory that referenced this issue Jul 17, 2024

fix hiyouga#4699

52f3d9b

slow tokenizer for yi models
