Add fast tokenizer for BARTpho #17254
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Following: #13788
Thanks a lot for your PR! As mentioned by @patil-suraj already, we don't rely on inheritance in Transformers; each object should be fully defined in its own configuration/modeling/tokenizer file (there are some instances of subclasses for older models, but this will be cleaned up in the future).
So you should revert your changes in the slow tokenizer file to not inherit from XLMRobertaTokenizer, and make the fast version independent of XLM-RoBERTa as well.
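For illustration, a minimal structural sketch of what a standalone (non-inheriting) slow tokenizer could look like; the class skeleton and method bodies below are abbreviated assumptions, not the actual BartphoTokenizer implementation:

# Hypothetical skeleton: the slow tokenizer fully defined in its own file,
# with no XLMRobertaTokenizer parent. Bodies are abbreviated for illustration.
import sentencepiece as spm
from transformers import PreTrainedTokenizer

class BartphoTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, **kwargs):
        # load the SentencePiece model directly instead of relying on
        # an inherited XLM-RoBERTa constructor
        self.sp_model = spm.SentencePieceProcessor(model_file=vocab_file)
        super().__init__(**kwargs)

    def _tokenize(self, text):
        # the tokenization logic lives in this class, not in a parent
        return self.sp_model.encode(text, out_type=str)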
Hi @patil-suraj and @sgugger, I revised the slow and fast BartphoTokenizer variants to satisfy your requirements.
Thanks, but it looks like the changes in the slow tokenizer are breaking, which we can't really do.
Thank you very much for your contribution!
I think I personally lack context on what motivated the changes in the Python version of the BartphoTokenizer. In particular, I understand that you changed the spm model uploaded to the hub for vinai/bartpho-syllable (before it had a 250000-sized vocabulary and now it has a 40003-sized vocabulary).
Additionally, those changes are breaking for the slow tokenizer, and we generally try to avoid those in transformers (cc @LysandreJik, @sgugger and @patil-suraj) 😄
Please note that the unsuccessful checks are due to the failed …
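As an aside, a quick way to verify a SentencePiece model's vocabulary size locally; the file name below is an assumption:

# Hypothetical check of the uploaded spm model's vocabulary size;
# "sentencepiece.bpe.model" is an assumed local file name.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")
print(sp.get_piece_size())  # 250000 for the old model vs. 40003 for the new one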
@datquocnguyen We can't merge anything that makes a breaking change to the existing tokenizer, as I said before.
@sgugger Ah, I now see your point. I initially thought the code would be much nicer if I also pushed a new version of the slow tokenizer, but that breaks existing code. The fast tokenizer should work without changing the original slow tokenizer code, since I already developed the fast_tokenizer_file. I would need a bit of time to roll back the slow tokenizer to its original version. (cc @SaulLu, @LysandreJik, @patil-suraj and @patrickvonplaten)
Hi @SaulLu @LysandreJik, I am wondering about the status/progress of the "sharing a custom tokenizer" feature on the hub. Is there anything I can help with? This feature would make BERTweet, PhoBERT, BARTpho and the like easier to use with their custom fast tokenizers. Thank you.
The custom tokenizer should now work correctly! @ArthurZucker, if you have a spare cycle, could you look into supporting the tokenizers added here by @datquocnguyen with code on the hub using the custom tokenizers? A guide showing how to do this is available here. Thanks!
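For reference, a hedged sketch of what the guide's flow might look like: registering a custom fast tokenizer so AutoTokenizer can resolve it with trust_remote_code. The class body and file names are placeholders, not the actual BERTweet implementation:

# Hypothetical sketch of sharing a custom tokenizer on the Hub.
from transformers import PreTrainedTokenizerFast

class BertTweetTokenizerFast(PreTrainedTokenizerFast):
    # the tweet-specific tokenization logic would live here
    pass

# records an auto_map entry in tokenizer_config.json on save/push, so that
# AutoTokenizer.from_pretrained(..., trust_remote_code=True) finds this class
BertTweetTokenizerFast.register_for_auto_class("AutoTokenizer")

tokenizer = BertTweetTokenizerFast(tokenizer_file="tokenizer.json")
tokenizer.push_to_hub("vinai/bertweet-covid19-base-uncased")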
Hi @LysandreJik @ArthurZucker @SaulLu, I followed the guide and can confirm that it works. For example, the following piece of code results in a correct fast tokenizer BertTweetTokenizerFast:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-uncased", trust_remote_code=True, revision="ddfcf0409600519d6f8907531a65151f39be5c01")
print(tokenizer.__class__.__name__)

The current issue is that the examples have not yet included the trust_remote_code option. To handle this issue, I have to modify each of the example scripts. I am wondering whether there is any faster approach that avoids modifying each of the examples? Thanks.
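One possible shape of that modification, sketched against how the example scripts parse arguments; the field names and help strings are assumptions, not the actual examples' code:

# Hypothetical addition of a trust_remote_code flag to an example script's
# ModelArguments dataclass.
from dataclasses import dataclass, field

@dataclass
class ModelArguments:
    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from the Hub"}
    )
    trust_remote_code: bool = field(
        default=False,
        metadata={"help": "Whether to allow custom tokenizer/model code from the Hub"},
    )

# the flag would then be threaded through to the tokenizer load, e.g.:
# tokenizer = AutoTokenizer.from_pretrained(
#     model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
# )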
Oh great question @datquocnguyen, and thanks for taking care of the implementation! Really cool to see it works well. @sgugger, what do you think regarding the examples? Should we add a trust_remote_code argument to them?
It should be one of the …
No, one is enough. Users who want more fine-grained control can just modify the examples to suit their needs.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
This PR adds a "fast" BARTpho tokenizer (backed by HuggingFace's tokenizers library).
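For context, a sketch of the intended end-user experience once merged; the printed class name is an assumption based on the PR title, not confirmed by the source:

# Hypothetical usage after this PR: loading the fast BARTpho tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable", use_fast=True)
print(tokenizer.__class__.__name__)  # expected: BartphoTokenizerFast
print(tokenizer("Chúng tôi là những nghiên cứu viên.")["input_ids"])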
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.