import os

from transformers import AutoTokenizer

from llamafactory.data import get_template_and_fix_tokenizer

MESSAGES = [
    {"role": "user", "content": "How are you"},
    {"role": "assistant", "content": "I am fine!"},
]


def test_encode_multiturn():
    tokenizer = AutoTokenizer.from_pretrained('01-ai/Yi-1.5-34B-Chat')
    template = get_template_and_fix_tokenizer(tokenizer, name="yi")
    encoded_pairs = template.encode_multiturn(tokenizer, MESSAGES)
    # Print the encoded result and the original (chat-template) encoding to compare them.
    print(encoded_pairs)
    print(tokenizer.apply_chat_template(MESSAGES))
    print(tokenizer.convert_ids_to_tokens([59597, 616]))
The output of the code above is:
[([1581, 59705, 622, 59593, 5858, 46826, 3903, 144, 6546, 678, 641, 7, 59568, 144, 59666, 59705, 622, 59593, 5858, 46826, 765, 13611, 144], [616, 1064, 4064, 99, 7])]
[1581, 59705, 622, 59593, 5858, 46826, 3903, 144, 6546, 678, 641, 7, 59568, 144, 59666, 59705, 622, 59593, 5858, 46826, 765, 13611, 144, 59597, 1064, 4064, 99, 7, 59568, 144]
['I', '▁I']
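It is easy to see that the problem lies in the response part of a turn. LLaMA Factory encodes each half-turn independently, which triggers the Yi tokenizer's behavior of always prefixing the first word of its input with an underscore (the `▁` marker). As a result, the first token of the response is 616 (`▁I`) in the `encode_multiturn` output but 59597 (`I`) in the `apply_chat_template` output, so the final token sequences do not match.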
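The mismatch described above can be reproduced without downloading the model. The sketch below is a toy stand-in, not the real Yi tokenizer or LLaMA Factory code: `toy_tokenize` is a hypothetical helper that only mimics the SentencePiece-style habit of adding a dummy `▁` prefix to the start of whatever string it is given. Under that assumption, encoding the prompt and the response separately yields a different first response token than encoding the whole text at once.

```python
def toy_tokenize(text: str) -> list[str]:
    """Hypothetical toy tokenizer: splits on spaces and newlines.

    Like a SentencePiece model with a dummy prefix, it always treats the
    input as if it started with a space, so the first word gets a "▁".
    """
    text = " " + text  # dummy prefix: the assumption this sketch illustrates
    tokens: list[str] = []
    current = ""
    for ch in text:
        if ch == " ":
            if current:
                tokens.append(current)
            current = "▁"  # a space becomes the word-start marker
        elif ch == "\n":
            if current:
                tokens.append(current)
            tokens.append("\n")  # newline is its own token
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


prompt = "assistant\n"
response = "I am fine!"

# Whole text encoded in one pass: "I" follows a newline, so no "▁" is added.
joint = toy_tokenize(prompt + response)

# Each half encoded independently: the response gets its own dummy prefix.
split = toy_tokenize(prompt) + toy_tokenize(response)

print(joint)
print(split)
print(joint == split)
```

The two lists differ only at the first response token (`I` versus `▁I`), which mirrors the 59597 versus 616 difference in the reported output.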
88a20ba
fixed
fix hiyouga#4699
52f3d9b
slow tokenizer for yi models