Add fast tokenizer for BARTpho #17254

Closed
wants to merge 70 commits
Changes from 3 commits
70 commits
b889418
Add BartphoTokenizerFast
datquocnguyen May 14, 2022
4c64432
Add BartphoTokenizerFast
datquocnguyen May 14, 2022
3496219
Add test for BartphoTokenizerFast
datquocnguyen May 14, 2022
aa77c99
Revise BARTpho slow and fast tokenizers to be independent
datquocnguyen May 17, 2022
f5931ed
Fix formatting
datquocnguyen May 17, 2022
4655d6d
Fix formatting
datquocnguyen May 17, 2022
3618b77
Fix formatting
datquocnguyen May 17, 2022
a81aa03
Merge branch 'main' into main
datquocnguyen May 17, 2022
8bfea5a
Fix formatting
datquocnguyen May 17, 2022
069612f
Update src/transformers/models/bartpho/tokenization_bartpho_fast.py
datquocnguyen May 17, 2022
7cb6707
Fix formatting
datquocnguyen May 17, 2022
d76fffa
Remove hardcoded value
datquocnguyen May 17, 2022
6c1b82c
Revert the new slow tokenizer to the original slow one
datquocnguyen May 19, 2022
af0aa0e
Fix formatting
datquocnguyen May 19, 2022
1b22570
The fast tokenizer with the same tokenization strategy as the slow one
datquocnguyen May 21, 2022
a915922
Fix formatting
datquocnguyen May 21, 2022
8835d18
Add fast tokenizers for PhoBERT and BERTweet
datquocnguyen May 22, 2022
b3677f5
Fix formatting
datquocnguyen May 22, 2022
7d9d477
Add require_torch
datquocnguyen May 22, 2022
542cfe2
Improved tokenization strategy for BartphoTokenizerFast
datquocnguyen May 27, 2022
cd82d13
Original BERTweet and PhoBERT tokenizers
datquocnguyen May 27, 2022
f59b4af
Fix format
datquocnguyen May 27, 2022
0b25f8c
Improve get_added_vocabulary_hacking
datquocnguyen May 29, 2022
176e323
Merge pull request #1 from datquocnguyen/main
datquocnguyen May 29, 2022
a7aba09
Fix formatting
datquocnguyen May 29, 2022
0047c83
Fast tokenizers for PhoBERT and BERTweet
datquocnguyen May 30, 2022
18e7684
Fix formatting
datquocnguyen May 31, 2022
a41f761
Fix formatting
datquocnguyen May 31, 2022
a65cef8
Merge pull request #2 from huggingface/main
datquocnguyen Jun 1, 2022
7692138
Merge pull request #4 from huggingface/main
datquocnguyen Jun 1, 2022
64a27eb
Merge pull request #5 from huggingface/main
datquocnguyen Jun 2, 2022
d0ad0de
Merge pull request #6 from huggingface/main
datquocnguyen Jun 2, 2022
72651a2
Merge pull request #7 from huggingface/main
datquocnguyen Jun 3, 2022
a14b6a5
Merge pull request #8 from huggingface/main
datquocnguyen Jun 10, 2022
68c3148
Merge pull request #9 from huggingface/main
datquocnguyen Jun 12, 2022
cf9a23c
Merge pull request #10 from huggingface/main
datquocnguyen Jun 17, 2022
321c148
Merge pull request #11 from huggingface/main
datquocnguyen Jun 23, 2022
d592599
Merge pull request #12 from huggingface/main
datquocnguyen Jul 13, 2022
9630bce
Merge pull request #13 from huggingface/main
datquocnguyen Jul 19, 2022
5c0fdac
Merge pull request #14 from huggingface/main
datquocnguyen Jul 26, 2022
3da785c
Merge pull request #15 from huggingface/main
datquocnguyen Jul 28, 2022
2f0940f
Merge pull request #17 from huggingface/main
datquocnguyen Aug 7, 2022
99b1c05
Merge pull request #18 from huggingface/main
datquocnguyen Aug 8, 2022
f29a771
Merge pull request #19 from huggingface/main
datquocnguyen Aug 9, 2022
3f0bdce
Merge pull request #20 from huggingface/main
datquocnguyen Aug 9, 2022
0e79af5
Merge pull request #21 from huggingface/main
datquocnguyen Aug 10, 2022
0db5b71
Merge pull request #22 from huggingface/main
datquocnguyen Aug 13, 2022
c21aadb
Merge pull request #23 from datquocnguyen/fast_tokenizers_BARTpho_Pho…
datquocnguyen Aug 13, 2022
d4a6fbb
Update test_tokenization_bartpho.py
datquocnguyen Aug 13, 2022
af834cf
Merge pull request #24 from datquocnguyen/fast_tokenizers_BARTpho_Pho…
datquocnguyen Aug 13, 2022
1bd229f
Update test_tokenization_bartpho.py
datquocnguyen Aug 13, 2022
57a7f67
Merge pull request #25 from datquocnguyen/fast_tokenizers_BARTpho_Pho…
datquocnguyen Aug 13, 2022
2140b76
Merge pull request #27 from datquocnguyen/tmp_branch
datquocnguyen Aug 18, 2022
30275f1
Merge pull request #28 from datquocnguyen/fast_tokenizers_BARTpho_Pho…
datquocnguyen Aug 18, 2022
e56cb63
Merge pull request #29 from huggingface/main
datquocnguyen Aug 19, 2022
0b61789
Merge pull request #30 from huggingface/main
datquocnguyen Aug 19, 2022
0a4b3c1
Merge pull request #31 from datquocnguyen/fast_tokenizers_BARTpho_Pho…
datquocnguyen Aug 19, 2022
f214fa0
Merge pull request #32 from huggingface/main
datquocnguyen Aug 23, 2022
8303666
Merge pull request #33 from datquocnguyen/fast_tokenizers_BARTpho_Pho…
datquocnguyen Aug 23, 2022
5a7b682
Merge pull request #34 from huggingface/main
datquocnguyen Sep 6, 2022
6833da7
Merge pull request #35 from datquocnguyen/fast_tokenizers_BARTpho_Pho…
datquocnguyen Sep 6, 2022
85ecfbd
Merge pull request #36 from huggingface/main
datquocnguyen Sep 18, 2022
53a577e
Merge pull request #37 from datquocnguyen/fast_tokenizers_BARTpho_Pho…
datquocnguyen Sep 18, 2022
391a440
Merge pull request #38 from huggingface/main
datquocnguyen Sep 20, 2022
2a06fa7
Merge pull request #39 from datquocnguyen/fast_tokenizers_BARTpho_Pho…
datquocnguyen Sep 20, 2022
8048f3a
Merge pull request #40 from huggingface/main
datquocnguyen Oct 19, 2022
0787507
Merge pull request #41 from huggingface/main
datquocnguyen Nov 5, 2022
c0726f1
Merge pull request #42 from datquocnguyen/fast_tokenizers_BARTpho_Pho…
datquocnguyen Nov 5, 2022
0f90212
Merge pull request #43 from huggingface/main
datquocnguyen Nov 23, 2022
809f738
Merge pull request #44 from datquocnguyen/fast_tokenizers_BARTpho_Pho…
datquocnguyen Nov 23, 2022
9 changes: 5 additions & 4 deletions docs/source/en/model_doc/bartpho.mdx
@@ -70,13 +70,14 @@ Tips:
>>> print(tokenizer.decode(predictions).split())
```

- This implementation is only for tokenization: "monolingual_vocab_file" consists of Vietnamese-specialized types
extracted from the pre-trained SentencePiece model "vocab_file" that is available from the multilingual XLM-RoBERTa.
Other languages, if employing this pre-trained multilingual SentencePiece model "vocab_file" for subword
segmentation, can reuse BartphoTokenizer with their own language-specialized "monolingual_vocab_file".
- This implementation is only for tokenization.

This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho).

## BartphoTokenizer

[[autodoc]] BartphoTokenizer

## BartphoTokenizerFast

[[autodoc]] BartphoTokenizerFast
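
For context, a quick usage sketch (not part of this diff; it assumes the released `vinai/bartpho-syllable` checkpoint and an install that includes this PR) showing that the new fast tokenizer is meant to produce the same encodings as the existing slow one:

```python
# Hedged sketch: compare the slow and fast BARTpho tokenizers on one sentence.
from transformers import BartphoTokenizer, BartphoTokenizerFast

slow_tokenizer = BartphoTokenizer.from_pretrained("vinai/bartpho-syllable")
fast_tokenizer = BartphoTokenizerFast.from_pretrained("vinai/bartpho-syllable")

line = "Chúng tôi là những nghiên cứu viên."  # example sentence from the BARTpho docs
print(slow_tokenizer(line)["input_ids"])
print(fast_tokenizer(line)["input_ids"])  # expected to match the slow output
```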
2 changes: 2 additions & 0 deletions src/transformers/__init__.py
@@ -473,6 +473,7 @@
_import_structure["models.albert"].append("AlbertTokenizerFast")
_import_structure["models.bart"].append("BartTokenizerFast")
_import_structure["models.barthez"].append("BarthezTokenizerFast")
_import_structure["models.bartpho"].append("BartphoTokenizerFast")
_import_structure["models.bert"].append("BertTokenizerFast")
_import_structure["models.big_bird"].append("BigBirdTokenizerFast")
_import_structure["models.blenderbot"].append("BlenderbotTokenizerFast")
@@ -2922,6 +2923,7 @@
from .models.albert import AlbertTokenizerFast
from .models.bart import BartTokenizerFast
from .models.barthez import BarthezTokenizerFast
from .models.bartpho import BartphoTokenizerFast
from .models.bert import BertTokenizerFast
from .models.big_bird import BigBirdTokenizerFast
from .models.blenderbot import BlenderbotTokenizerFast
55 changes: 55 additions & 0 deletions src/transformers/convert_slow_tokenizer.py
@@ -551,6 +551,60 @@ def post_processor(self):
)


class BartphoConverter(SpmConverter):
def vocab(self, proto):
vocab = [
("<s>", 0.0),
("<pad>", 0.0),
("</s>", 0.0),
("<unk>", 0.0),
]
vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
vocab += [
("ar_AR", 0.0),
("cs_CZ", 0.0),
("de_DE", 0.0),
("en_XX", 0.0),
("es_XX", 0.0),
("et_EE", 0.0),
("fi_FI", 0.0),
("fr_XX", 0.0),
("gu_IN", 0.0),
("hi_IN", 0.0),
("it_IT", 0.0),
("ja_XX", 0.0),
("kk_KZ", 0.0),
("ko_KR", 0.0),
("lt_LT", 0.0),
("lv_LV", 0.0),
("my_MM", 0.0),
("ne_NP", 0.0),
("nl_XX", 0.0),
("ro_RO", 0.0),
("ru_RU", 0.0),
("si_LK", 0.0),
("tr_TR", 0.0),
("vi_VN", 0.0),
("zh_CN", 0.0),
]
vocab += [("<mask>", 0.0)]
return vocab

def unk_id(self, proto):
unk_id = 3
return unk_id

def post_processor(self):
return processors.TemplateProcessing(
single="<s> $A </s>",
pair="<s> $A </s> </s> $B </s>",
special_tokens=[
("<s>", self.original_tokenizer.convert_tokens_to_ids("<s>")),
("</s>", self.original_tokenizer.convert_tokens_to_ids("</s>")),
],
)


class CamembertConverter(SpmConverter):
def vocab(self, proto):
vocab = [
@@ -1004,6 +1058,7 @@ def post_processor(self):
"AlbertTokenizer": AlbertConverter,
"BartTokenizer": RobertaConverter,
"BarthezTokenizer": BarthezConverter,
"BartphoTokenizer": BartphoConverter,
"BertTokenizer": BertConverter,
"BigBirdTokenizer": BigBirdConverter,
"BlenderbotTokenizer": BlenderbotConverter,
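
A minimal sketch (an assumption about the internal flow, not code from this diff) of how the new `BartphoTokenizer` -> `BartphoConverter` entry is dispatched through the public `convert_slow_tokenizer` helper:

```python
# Hedged sketch: convert_slow_tokenizer looks up the converter class by the slow
# tokenizer's class name in SLOW_TO_FAST_CONVERTERS, so the new mapping entry is
# what makes the conversion below work.
from transformers import BartphoTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow = BartphoTokenizer.from_pretrained("vinai/bartpho-syllable")
backend = convert_slow_tokenizer(slow)  # returns a tokenizers.Tokenizer

# The post-processor added by BartphoConverter wraps sequences mBART-style:
# "<s> $A </s>" for single inputs and "<s> $A </s> </s> $B </s>" for pairs.
print(backend.encode("xin chào").tokens)
```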
8 changes: 7 additions & 1 deletion src/transformers/models/auto/tokenization_auto.py
@@ -197,7 +197,13 @@
),
("herbert", ("HerbertTokenizer", "HerbertTokenizerFast" if is_tokenizers_available() else None)),
("phobert", ("PhobertTokenizer", None)),
("bartpho", ("BartphoTokenizer", None)),
(
"bartpho",
(
"BartphoTokenizer" if is_sentencepiece_available() else None,
"BartphoTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"barthez",
(
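
A sketch of what this mapping change enables, assuming the standard `AutoTokenizer` behaviour of preferring the fast class when the `tokenizers` backend is installed:

```python
# Hedged sketch: AutoTokenizer should now resolve BARTpho checkpoints to the fast
# class when `tokenizers` is installed, and fall back to the SentencePiece-based
# slow class otherwise.
from transformers import AutoTokenizer
from transformers.utils import is_tokenizers_available

tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
expected = "BartphoTokenizerFast" if is_tokenizers_available() else "BartphoTokenizer"
print(type(tokenizer).__name__, "==", expected)
```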
18 changes: 17 additions & 1 deletion src/transformers/models/bartpho/__init__.py
@@ -18,7 +18,7 @@

from typing import TYPE_CHECKING

from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_sentencepiece_available
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_sentencepiece_available, is_tokenizers_available


_import_structure = {}
@@ -31,6 +31,14 @@
else:
_import_structure["tokenization_bartpho"] = ["BartphoTokenizer"]

try:
if not is_tokenizers_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["tokenization_bartpho_fast"] = ["BartphoTokenizerFast"]

if TYPE_CHECKING:
try:
if not is_sentencepiece_available():
@@ -40,6 +48,14 @@
else:
from .tokenization_bartpho import BartphoTokenizer

try:
if not is_tokenizers_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .tokenization_bartpho_fast import BartphoTokenizerFast

else:
import sys
