Fix eos_token problem in all required models #1806
Conversation
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1806
✅ No failures as of commit 33c6d54 with merge base c5b7386. Note: links to docs will display an error until the docs builds have completed. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Actually, I think the tests with these big lists of token IDs should be refactored (maybe into a fixture?).
@RdoubleA @joecummings Requesting review.
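One possible shape for that refactor, just as a sketch — the asset path and fixture name here are assumptions for illustration, not something this PR adds:

```python
import json

import pytest


@pytest.fixture
def expected_token_ids():
    # Hypothetical asset holding e.g. {"phi3": [...], "qwen2": [...], ...}
    with open("tests/assets/expected_token_ids.json") as f:
        return json.load(f)


def test_tokenize_messages_matches_expected(expected_token_ids):
    # Each model's test would compare tokenizer output against the shared
    # pre-computed ids instead of a large literal list in the test body.
    assert isinstance(expected_token_ids["phi3"], list)
```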
torchtune/models/phi3/_tokenizer.py (outdated)
@@ -101,13 +101,11 @@ def encode(
            trim_leading_whitespace=trim_leading_whitespace,
        )

    def decode(self, ids: List[int], skip_special_tokens: bool = True) -> str:
Why did you remove skip_special_tokens?
Weird, it was maybe accidentally removed. Let me fix it.
Fixed
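For reference, a rough illustration of keeping the flag — attribute names like `_spm_model` and `_special_token_ids` are stand-ins for this sketch, not necessarily torchtune's actual internals:

```python
from typing import List


def decode(self, ids: List[int], skip_special_tokens: bool = True) -> str:
    """Decode token ids back to text, optionally dropping special tokens."""
    if skip_special_tokens:
        # Filter out special token ids before handing off to the underlying model.
        ids = [i for i in ids if i not in self._special_token_ids]
    return self._spm_model.decode(ids)
```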
Will think about it more and maybe open a PR.
Something really weird.
LGTM
Context
What is the purpose of this PR? Is it to add a new feature, fix a bug, or update tests and/or documentation?
Please link to any issues this PR addresses.
Changelog
What are the changes made in this PR?
Closes #1478
Closes #1479
Closes #1480
Closes #1481
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these, just ask and we will happily help. We also have a contributing page with some guidance on contributing.
- run pre-commit hooks and linters (make sure you've first installed them via pre-commit install)
- run unit tests via pytest tests
- run recipe/integration tests via pytest tests -m integration_test
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.
So, in general: for Mistral and Gemma, just add a check that passes None as the EOS token when add_eos=False in tokenize_messages_no_special_tokens. For Qwen and Phi, apply the same check as in #1477 inside truncate (the Phi change in particular should be reviewed more carefully). Plus four unit tests, one per model. A sketch of the shared pattern follows below.
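A minimal sketch of that pattern with simplified stand-in signatures — the real logic lives in the per-model tokenizers and torchtune's tokenizer utilities, so treat these names and bodies as illustrative only:

```python
from typing import List, Optional


def truncate(
    tokens: List[int], max_seq_len: int, eos_id: Optional[int] = None
) -> List[int]:
    # Keep at most max_seq_len tokens; only force a trailing EOS when an
    # eos_id is actually provided.
    tokens = tokens[:max_seq_len]
    if eos_id is not None and tokens and tokens[-1] != eos_id:
        tokens[-1] = eos_id
    return tokens


def tokenize_with_optional_eos(
    token_ids: List[int], max_seq_len: int, eos_id: int, add_eos: bool
) -> List[int]:
    # The core of the fix: pass None instead of the real eos_id when
    # add_eos=False, so truncation cannot sneak an EOS back in.
    return truncate(token_ids, max_seq_len, eos_id if add_eos else None)


# Tiny checks mirroring the "four unit tests, one per model" idea:
assert tokenize_with_optional_eos([1, 2, 3, 4, 5], 3, 99, add_eos=False) == [1, 2, 3]
assert tokenize_with_optional_eos([1, 2, 3, 4, 5], 3, 99, add_eos=True) == [1, 2, 99]
```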