Add add_bos=False, add_eos=False to SentencePieceTokenizer.__init__() #1811

Merged
merged 4 commits into from
Sep 9, 2024

Conversation

briango28
Contributor

Minor change that exposes the add_bos and add_eos parameters of tf_text.SentencepieceTokenizer.__init__() through keras_nlp.tokenizers.SentencePieceTokenizer.__init__(). See #1710.

Also adds tests for bos/eos token emission.
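As a rough usage sketch of the exposed parameters: the helper name make_tokenizer and the proto path below are hypothetical, and the keyword names are taken from the PR title. It assumes keras_nlp and its TensorFlow Text dependency are installed; the import is deferred so the snippet loads without them.

```python
def make_tokenizer(proto, add_bos=False, add_eos=False):
    """Build a SentencePieceTokenizer with BOS/EOS emission toggled.

    `proto` is a path to (or the serialized bytes of) a SentencePiece model.
    The defaults mirror the False/False defaults named in the PR title.
    """
    # Imported lazily: requires keras_nlp and tensorflow_text at call time.
    import keras_nlp

    return keras_nlp.tokenizers.SentencePieceTokenizer(
        proto=proto,
        add_bos=add_bos,
        add_eos=add_eos,
    )


# Hypothetical usage (the model path is a placeholder, not from the PR):
# tokenizer = make_tokenizer("path/to/model.spm", add_bos=True, add_eos=True)
# ids = tokenizer("the quick brown fox")
```

With both flags left at False, the helper should behave like the tokenizer did before this change.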

There is a potential issue with truncating long sequences when the EOS token ('</s>') is emitted.

Truncation based on sequence_length gives the EOS token no special treatment, so a truncated sequence will not end with the EOS token.
This is necessary in some tasks (e.g. sequence generation), so that the model does not learn incorrect sequence terminators, but always ending a sequence with the EOS token may be beneficial in others.
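The truncation behavior described above can be illustrated with plain Python lists. This is only a toy model of the behavior, not the library's implementation; the token ids, BOS_ID, EOS_ID, and both helper functions are made up for the illustration.

```python
BOS_ID = 1  # hypothetical id for '<s>'
EOS_ID = 2  # hypothetical id for '</s>'


def encode_with_specials(token_ids, add_bos=False, add_eos=False):
    """Wrap already-encoded ids with optional BOS/EOS markers."""
    out = list(token_ids)
    if add_bos:
        out = [BOS_ID] + out
    if add_eos:
        out = out + [EOS_ID]
    return out


def truncate(ids, sequence_length):
    """Plain truncation: keep the first `sequence_length` ids."""
    return ids[:sequence_length]


full = encode_with_specials([10, 11, 12, 13], add_bos=True, add_eos=True)
truncated = truncate(full, 4)

print(full)       # [1, 10, 11, 12, 13, 2]
print(truncated)  # [1, 10, 11, 12]  (the EOS id is cut off)
```

The sequence that fits keeps its EOS; the one that is cut loses it, which matches the behavior the description calls out.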

However, I do not believe this warrants an additional implementation of tokenize(); I assume that users who need different behavior would leave sequence_length unset in SentencePieceTokenizer.__init__() and apply their own padding/truncation afterwards.
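A user-side pad/truncate step of the kind assumed here might look like the sketch below. The function name, the pad id of 0, and the choice to overwrite the last kept position with EOS are all assumptions for illustration; nothing like this is part of the PR.

```python
def pad_or_truncate_keep_eos(ids, sequence_length, eos_id, pad_id=0):
    """Fit `ids` to `sequence_length`, keeping EOS as the final real token.

    Assumes `ids` was produced with add_eos=True, i.e. it ends with `eos_id`.
    """
    ids = list(ids)
    if len(ids) > sequence_length:
        ids = ids[:sequence_length]
        if sequence_length > 0:
            # Overwrite the last kept position so the sequence still terminates.
            ids[-1] = eos_id
    # Right-pad short sequences; this adds nothing if the length already fits.
    return ids + [pad_id] * (sequence_length - len(ids))


example = [1, 10, 11, 12, 13, 2]  # hypothetical ids: BOS=1, EOS=2
print(pad_or_truncate_keep_eos(example, 4, eos_id=2))  # [1, 10, 11, 2]
print(pad_or_truncate_keep_eos(example, 8, eos_id=2))  # [1, 10, 11, 12, 13, 2, 0, 0]
```

For generation-style training, where a truncated sequence should not appear to terminate, plain truncation without the EOS overwrite would be the appropriate choice instead.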

google-cla bot commented Sep 5, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@briango28
Contributor Author

Please feel free to edit the test names if I've overlooked some convention or formatting rule.

Member

@mattdangerw mattdangerw left a comment

Looks good! Just two minor housekeeping comments.

@mattdangerw
Member

Thanks!

@mattdangerw mattdangerw merged commit a806571 into keras-team:master Sep 9, 2024
7 checks passed
2 participants