Add fast tokenizer for BARTpho #17254
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Following: #13788
Thanks a lot for your PR! As mentioned by @patil-suraj already, we don't rely on inheritance in Transformers; each object should be fully defined in its own configuration/modeling/tokenizer file (there are some instances of subclasses for older models, but this will be cleaned up in the future).
So you should revert your changes in the slow tokenizer file to not inherit from XLMRobertaTokenizer, and make the fast version independent of XLM-RoBERTa as well.
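For illustration, a minimal structural sketch of what a standalone (non-inheriting) slow tokenizer could look like; the class skeleton and method bodies below are abbreviated assumptions, not the actual BartphoTokenizer implementation:

# Hypothetical skeleton: the slow tokenizer fully defined in its own file,
# with no XLMRobertaTokenizer parent. Bodies are abbreviated for illustration.
import sentencepiece as spm
from transformers import PreTrainedTokenizer

class BartphoTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, **kwargs):
        # load the SentencePiece model directly instead of relying on
        # an inherited XLM-RoBERTa constructor
        self.sp_model = spm.SentencePieceProcessor(model_file=vocab_file)
        super().__init__(**kwargs)

    def _tokenize(self, text):
        # the tokenization logic lives in this class, not in a parent
        return self.sp_model.encode(text, out_type=str)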
Hi @patil-suraj and @sgugger, I revised the slow and fast BartphoTokenizer variants to satisfy your requirements.
Thanks, but it looks like the changes in the slow tokenizer are breaking, which we can't really do.
Thank you very much for your contribution!
I think I personally lack context on what motivated the changes in the Python version of the BartphoTokenizer. In particular, I understand that you changed the spm model uploaded to the hub for vinai/bartpho-syllable (before it had a 250000-sized vocabulary and now it has a 40003-sized vocabulary).
Additionally, those changes are breaking for the slow tokenizer, and we generally try to avoid those in transformers (cc @LysandreJik, @sgugger and @patil-suraj) 😄
Please note that the unsuccessful checks are due to the failed …
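As an aside, a quick way to verify a SentencePiece model's vocabulary size locally; the file name below is an assumption:

# Hypothetical check of the uploaded spm model's vocabulary size;
# "sentencepiece.bpe.model" is an assumed local file name.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")
print(sp.get_piece_size())  # 250000 for the old model vs. 40003 for the new one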
@datquocnguyen We can't merge anything that makes a breaking change to the existing tokenizer, as I said before.
@sgugger Ah, I now see your point. I initially thought the code would be much nicer if I also pushed a new version of the slow tokenizer, but that breaks existing code. The fast tokenizer should work without changing the original slow tokenizer code, since I already developed the fast_tokenizer_file. I would need a bit of time to roll back the slow tokenizer to its original version. (cc @SaulLu, @LysandreJik, @patil-suraj and @patrickvonplaten)
Hi @SaulLu @LysandreJik, I am wondering about the status/progress of the "sharing a custom tokenizer" feature on the hub. Is there anything I can help with? This feature would make BERTweet, PhoBERT, BARTpho and the like easier to use with their custom fast tokenizers. Thank you.
The custom tokenizer should now work correctly! @ArthurZucker, if you have a spare cycle, could you look into supporting the tokenizers added here by @datquocnguyen with code on the hub using the custom tokenizers? A guide showing how to do this is available here. Thanks!
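For reference, a hedged sketch of what the guide's flow might look like: registering a custom fast tokenizer so AutoTokenizer can resolve it with trust_remote_code. The class body and file names are placeholders, not the actual BERTweet implementation:

# Hypothetical sketch of sharing a custom tokenizer on the Hub.
from transformers import PreTrainedTokenizerFast

class BertTweetTokenizerFast(PreTrainedTokenizerFast):
    # the tweet-specific tokenization logic would live here
    pass

# records an auto_map entry in tokenizer_config.json on save/push, so that
# AutoTokenizer.from_pretrained(..., trust_remote_code=True) finds this class
BertTweetTokenizerFast.register_for_auto_class("AutoTokenizer")

tokenizer = BertTweetTokenizerFast(tokenizer_file="tokenizer.json")
tokenizer.push_to_hub("vinai/bertweet-covid19-base-uncased")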
Hi @LysandreJik @ArthurZucker @SaulLu, I followed the guide and can confirm that it works. For example, the following piece of code results in a correct fast tokenizer BertTweetTokenizerFast:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-covid19-base-uncased", trust_remote_code=True, revision="ddfcf0409600519d6f8907531a65151f39be5c01")
print(tokenizer.__class__.__name__)

The current issue is that the examples have not yet included the trust_remote_code option. To handle this issue, I have to modify each of the example scripts. I am wondering whether there is any faster approach that avoids modifying each of the examples? Thanks.
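One possible shape of that modification, sketched against how the example scripts parse arguments; the field names and help strings are assumptions, not the actual examples' code:

# Hypothetical addition of a trust_remote_code flag to an example script's
# ModelArguments dataclass.
from dataclasses import dataclass, field

@dataclass
class ModelArguments:
    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from the Hub"}
    )
    trust_remote_code: bool = field(
        default=False,
        metadata={"help": "Whether to allow custom tokenizer/model code from the Hub"},
    )

# the flag would then be threaded through to the tokenizer load, e.g.:
# tokenizer = AutoTokenizer.from_pretrained(
#     model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
# )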
Oh great question @datquocnguyen, and thanks for taking care of the implementation! Really cool to see it works well. @sgugger, what do you think regarding the examples? Should we add a trust_remote_code argument to them?
It should be one of the …
No, one is enough. Users who want more fine-grained control can just modify the examples to suit their needs.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
This PR adds a "fast" BARTpho tokenizer (backed by HuggingFace's tokenizers library).
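For context, a sketch of the intended end-user experience once merged; the printed class name is an assumption based on the PR title, not confirmed by the source:

# Hypothetical usage after this PR: loading the fast BARTpho tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable", use_fast=True)
print(tokenizer.__class__.__name__)  # expected: BartphoTokenizerFast
print(tokenizer("Chúng tôi là những nghiên cứu viên.")["input_ids"])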
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.