Problem with the Yi template (simple test code attached) #4699

Closed
1 task done
rangehow opened this issue Jul 6, 2024 · 1 comment
Labels
solved This problem has been already solved

Comments


rangehow commented Jul 6, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

from transformers import AutoTokenizer

from llamafactory.data import get_template_and_fix_tokenizer


MESSAGES = [
    {"role": "user", "content": "How are you"},
    {"role": "assistant", "content": "I am fine!"},
]

def test_encode_multiturn():
    tokenizer = AutoTokenizer.from_pretrained('01-ai/Yi-1.5-34B-Chat')
    template = get_template_and_fix_tokenizer(tokenizer, name="yi")
    encoded_pairs = template.encode_multiturn(tokenizer, MESSAGES)
    # Print the template's encoding next to the tokenizer's own chat-template encoding for comparison.
    print(encoded_pairs)
    print(tokenizer.apply_chat_template(MESSAGES))
    print(tokenizer.convert_ids_to_tokens([59597,616]))

The output of the code above is:

[([1581, 59705, 622, 59593, 5858, 46826, 3903, 144, 6546, 678, 641, 7, 59568, 144, 59666, 59705, 622, 59593, 5858, 46826, 765, 13611, 144], [616, 1064, 4064, 99, 7])]
[1581, 59705, 622, 59593, 5858, 46826, 3903, 144, 6546, 678, 641, 7, 59568, 144, 59666, 59705, 622, 59593, 5858, 46826, 765, 13611, 144, 59597, 1064, 4064, 99, 7, 59568, 144]
['I', '▁I']

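It is easy to see that the problem is in the response part of a conversation turn. LLaMA Factory encodes each half-turn independently, which triggers the Yi tokenizer's behavior of always prefixing the first word of its input with an underscore ("▁"). As a result, the response begins with token 616 ('▁I') in the template's output but with token 59597 ('I') in the output of apply_chat_template, so the two token sequences do not match.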
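The mechanism can be shown without downloading the model. The following is a minimal sketch, not the Yi tokenizer itself: `toy_encode` is a hypothetical word-level stand-in that assumes SentencePiece-style behavior, where spaces become "▁" and a "▁" is prepended to the start of every input. Encoding the prompt and response together leaves the response's first word unprefixed, while encoding them separately adds the prefix.

```python
import re


def toy_encode(text: str) -> list[str]:
    """Hypothetical stand-in for a SentencePiece-style tokenizer.

    Assumptions (not taken from the Yi tokenizer's source):
    - a space is prepended to every input (a "dummy prefix"),
    - spaces are rewritten as "▁" and attached to the following word,
    - a newline is its own token.
    """
    text = (" " + text).replace(" ", "▁")
    return re.findall(r"▁?[^▁\n]+|\n|▁", text)


prompt = "assistant\n"
response = "I am fine!"

# Whole turn encoded in one call: "I" follows a newline, so it gets no "▁".
joint = toy_encode(prompt + response)

# Each half encoded on its own: the response is now the start of an input,
# so its first word receives the dummy prefix.
separate = toy_encode(prompt) + toy_encode(response)

print(joint)
print(separate)
```

In this toy setup the two lists differ only at the first token of the response ("I" versus "▁I"), which mirrors the 59597 versus 616 mismatch in the output above.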

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jul 6, 2024
Owner

hiyouga commented Jul 14, 2024

fixed

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jul 14, 2024
xtchen96 pushed a commit to xtchen96/LLaMA-Factory that referenced this issue Jul 17, 2024
slow tokenizer for yi models