
Remove the use of SentencePieceTrainer from tests #1283

Merged: 15 commits into keras-team:master from fix-gh1272 (Oct 26, 2023)

Conversation

@tirthasheshpatel (Contributor):

Fixes #1272

As we discussed, I factored out the use of sentencepiece into its own file for each model. Let me know whether you prefer to keep the generated files in the repo or generate them each time in CI. Also, let me know if you want me to document this change somewhere, e.g. in the contributor guide.

@mattdangerw (Member) left a comment:

Looks good! Just a few comments.

AlbertTokenizer(proto=bytes_io.getvalue()),
sequence_length=5,
AlbertTokenizer(
proto=str(
Member:

This is a bit of a mouthful. Can we maybe add this to our base class for tests in test_case.py?

proto=os.path.join(self.test_data_dir(), "albert_test_vocab.spm")

Contributor Author:

Done.

pathlib.Path(__file__).parent.parent.parent
    / "tests"
    / "test_data"
    / "albert_sentencepiece.proto"
Member:

maybe in keeping with our preset suffixes and name, let's call this albert_test_vocab.spm. Let's also drop a comment right above this line, # Generated with create_albert_test_proto.py, so people know how to update this.

Contributor Author:

Done.

pathlib.Path(__file__).parent.parent.parent
    / "tests"
    / "test_data"
    / "sentencepiece_bad.proto"
Member:

Maybe let's be more specific than "bad" here.

"no_special_token_vocab.spm"

Contributor Author:

Done.

import sentencepiece


def _train_sentencepiece(data, *args, **kwargs):
Member:

Can we just fold _train_sentencepiece and _save into one, e.g. def train_sentencepiece(filename, data, *args, **kwargs)?

Also, there's no need to mark this with an underscore. We only really do that for class members; instead, we use keras_nlp_export to mark what is public.

Contributor Author:

Done.



def main():
bytes_io = _train_sentencepiece(
Member:

For all of these, let's just consolidate to a single test proto for each model.

So use the one with user_defined_symbols="[MASK]", and update any test output appropriately.

Contributor Author:

Done.

@mattdangerw (Member):

Looks like some legitimate test failures with xlm_roberta

- Use one proto per model; modify tests accordingly
- Add a comment saying where the test proto file was generated from
- Rename the files from `*_sentencepiece.proto` to `*_test_vocab.spm`
- Rename the bad proto file to `no_special_token_vocab.spm`
- Add a method to get the test dir
- Remove the underscores from the sentencepiece util file
- Save the file in `train_sentencepiece` function itself
- Address the XLM Roberta test failure
@tirthasheshpatel (Contributor Author):

> Looks like some legitimate test failures with xlm_roberta

Addressed. Thanks for the review @mattdangerw! Let me know if this looks good to you now!

@mattdangerw (Member):

/gcbrun

@mattdangerw (Member) left a comment:

Looks great! Just a last couple of comments.

@@ -417,3 +418,6 @@ def compare(actual, expected):
self.assertAllClose(actual, expected, atol=0.01, rtol=0.01)

tree.map_structure(compare, output, expected_partial_output)

    def get_test_data_dir(self):
        return pathlib.Path(__file__).parent / "test_data"
@mattdangerw (Member), Oct 25, 2023:

Nit, but I think returning a simple string might be a little less surprising to people here.

Ok to use pathlib, but cast to string before returning, and then use os.path.join(self.get_test_data_dir(), "file") from the tests.

That will match our usage of get_temp_dir.
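A sketch of how this could look on the base class; the class here is an illustrative stand-in, not the actual test_case.py:

```python
import os
import pathlib


class TestCase:
    """Illustrative stand-in for the shared test base class."""

    def get_test_data_dir(self):
        # Build the path with pathlib, but cast to str before returning so
        # callers can combine it with os.path.join, matching get_temp_dir.
        return str(pathlib.Path(__file__).parent / "test_data")


# Usage from a test (the filename is illustrative):
proto = os.path.join(TestCase().get_test_data_dir(), "albert_test_vocab.spm")
```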

Contributor Author:

Ah, got it!

@mattdangerw (Member):

/gcbrun

@mattdangerw (Member):

/gcbrun

@mattdangerw (Member) left a comment:

Nice, looks great! Will pull this in after tests pass.

@mattdangerw mattdangerw merged commit d254b02 into keras-team:master Oct 26, 2023
@tirthasheshpatel tirthasheshpatel deleted the fix-gh1272 branch October 26, 2023 20:38
Linked issue: Remove use of SentencePieceTrainer from modeling tests