Add `add_bos=False, add_eos=False` to SentencePieceTokenizer.__init__() #1811

briango28 · 2024-09-05T05:25:51Z

Minor change that exposes the add_bos and add_eos parameters of tf_text.SentencepieceTokenizer.__init__() through keras_nlp.tokenizers.SentencePieceTokenizer.__init__(). Follows from #1710.

Also adds tests for bos/eos token emission.

There is a potential issue with truncating long sequences when emitting the EOS token ('</s>').

The current behavior gives the EOS token no special treatment when truncating to sequence_length, so a truncated sequence will not end with the EOS token.
This is necessary for some tasks (e.g. sequence generation), so that the model does not learn incorrect sequence terminators, but always ending a sequence with the EOS token may be beneficial for others.

I did not believe this warrants an additional implementation of tokenize(), however; I assume that users who need the EOS token preserved would leave sequence_length unset in SentencePieceTokenizer.__init__() and apply their own padding/truncation afterwards.
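To make the trade-off concrete, here is a minimal sketch in plain Python. It is not the keras_nlp implementation: the token ids (`BOS_ID`, `EOS_ID`, `PAD_ID`) and the helper names `add_special_tokens`, `truncate_plain`, and `truncate_keep_eos` are hypothetical and only mimic the behavior described above on ordinary lists.

```python
# Illustrative stand-in only: plain Python lists, not keras_nlp or tf_text.
# The ids below are assumptions; real ids depend on the SentencePiece model.
BOS_ID, EOS_ID, PAD_ID = 1, 2, 0


def add_special_tokens(ids, add_bos=False, add_eos=False):
    """Return a copy of `ids` with optional BOS/EOS ids attached."""
    out = list(ids)
    if add_bos:
        out = [BOS_ID] + out
    if add_eos:
        out = out + [EOS_ID]
    return out


def truncate_plain(ids, sequence_length):
    """Truncate or pad to `sequence_length` with no special handling of EOS."""
    ids = ids[:sequence_length]
    return ids + [PAD_ID] * (sequence_length - len(ids))


def truncate_keep_eos(ids, sequence_length):
    """Truncate or pad, re-attaching EOS when truncation would cut it off."""
    if sequence_length < 1:
        raise ValueError("sequence_length must be at least 1")
    if len(ids) > sequence_length and ids[-1] == EOS_ID:
        # Drop one extra token so EOS fits in the last slot.
        ids = ids[: sequence_length - 1] + [EOS_ID]
    else:
        ids = ids[:sequence_length]
    return ids + [PAD_ID] * (sequence_length - len(ids))


tokens = add_special_tokens([10, 11, 12, 13], add_bos=True, add_eos=True)
print(truncate_plain(tokens, 4))     # EOS is lost: [1, 10, 11, 12]
print(truncate_keep_eos(tokens, 4))  # EOS is kept: [1, 10, 11, 2]
```

A user who wants the second behavior could apply something like `truncate_keep_eos` to the ragged tokenizer output, which is the "own pad/truncator" route assumed above.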

google-cla · 2024-09-05T05:25:55Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

briango28 · 2024-09-05T05:31:30Z

Please feel free to edit the test names if I've overlooked a convention or formatting rule.

mattdangerw

Looks good! Just two minor housekeeping comments.

keras_nlp/src/tokenizers/sentence_piece_tokenizer.py

mattdangerw · 2024-09-09T19:51:01Z

Thanks!

briango28 added 2 commits September 5, 2024 14:29

Add add_bos, add_eos parameters to SentencePieceTokenizer.__init__()

dad52a9

Add test_scalar_bos_eos() & test_string_bos_eos() tests to SentencePi…

cd2994f

…eceTokenizerTest

mattdangerw reviewed Sep 5, 2024

keras_nlp/src/tokenizers/sentence_piece_tokenizer.py (resolved)

keras_nlp/src/tokenizers/sentence_piece_tokenizer.py (resolved)

briango28 added 2 commits September 6, 2024 20:22

Add docstrings for add_bos, add_eos in SentencePieceTokenizer.

32ee45f

Add add_bos, add_eos to SentencePieceTokenizer.get_config()

b57396c

mattdangerw merged commit a806571 into keras-team:master Sep 9, 2024
7 checks passed
