Remove the use of SentencePieceTrainer from tests #1283
Conversation
Looks good! Just a few comments.
```python
AlbertTokenizer(proto=bytes_io.getvalue()),
sequence_length=5,
AlbertTokenizer(
    proto=str(
```
This is a bit of a mouthful. Can we maybe add this to our base class for tests in `test_case.py`?
proto=os.path.join(self.test_data_dir(), "albert_test_vocab.spm")
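The suggested base-class helper could look something like the sketch below. The class and method names here follow the snippet above; the real location would be `keras_nlp/tests/test_case.py`, so treat this as illustrative only.

```python
import pathlib


class TestCase:
    """Sketch of a shared test base class with a test-data helper
    (hypothetical; modeled on the suggestion above)."""

    def test_data_dir(self):
        # Resolve the shared test_data directory relative to this file,
        # so tests never need long pathlib.Path(...).parent chains.
        return pathlib.Path(__file__).parent / "test_data"
```

Tests could then build proto paths from `self.test_data_dir()` instead of repeating the directory traversal in every file.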
Done.
```python
pathlib.Path(__file__).parent.parent.parent
/ "tests"
/ "test_data"
/ "albert_sentencepiece.proto"
```
Maybe, in keeping with our preset suffixes and naming, let's call this `albert_test_vocab.spm`. Let's also drop a comment right above this line, `# Generated with create_albert_test_proto.py`, so people know how to update this.
Done.
```python
pathlib.Path(__file__).parent.parent.parent
/ "tests"
/ "test_data"
/ "sentencepiece_bad.proto"
```
Maybe let's be more specific than "bad" here.
"no_special_token_vocab.spm"
Done.
tools/sentencepiece_testing/utils.py
Outdated
```python
import sentencepiece


def _train_sentencepiece(data, *args, **kwargs):
```
Can we just fold `_train_sentencepiece` and `_save` into one? E.g. `def train_sentencepiece(filename, data, *args, **kwargs):`. Also, no need to mark this with an underscore; we only really do that for class members, and instead use our `keras_nlp_export` to mark what is public.
Done.
```python
def main():
    bytes_io = _train_sentencepiece(
```
For all of these, let's just consolidate to a single test proto for each model. So use the one with `user_defined_symbols="[MASK]"`, and update any test output appropriately.
Done.
Looks like some legitimate test failures with xlm_roberta
- Use one proto per model; modify tests accordingly
- Add a comment saying where the test proto file was generated from
- Rename the files from `*_sentencepiece.proto` to `*_test_vocab.spm`
- Rename the bad proto file to `no_special_token_vocab.spm`
- Add a method to get the test dir
- Remove the underscores from the sentencepiece util file
- Save the file in the `train_sentencepiece` function itself
- Address the XLM Roberta test failure
Addressed. Thanks for the review @mattdangerw! Let me know if this looks good to you now!
/gcbrun
Looks great! Just a couple of last comments.
keras_nlp/tests/test_case.py
Outdated
```python
@@ -417,3 +418,6 @@ def compare(actual, expected):
            self.assertAllClose(actual, expected, atol=0.01, rtol=0.01)

        tree.map_structure(compare, output, expected_partial_output)

    def get_test_data_dir(self):
        return pathlib.Path(__file__).parent / "test_data"
```
Nit, but I think returning a simple string might be a little less surprising to people here. It's OK to use pathlib, but cast to string before returning, and then use `os.path.join(self.get_test_data_dir(), "file")` from the tests. That will match our usage of `get_temp_dir`.
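The final shape of the helper, with the string cast applied, might look like this sketch (class name and file placement are illustrative here).

```python
import os
import pathlib


class TestCase:
    """Sketch of the helper after the string-cast suggestion: pathlib
    used internally, but a plain string returned, matching how
    get_temp_dir behaves."""

    def get_test_data_dir(self):
        return str(pathlib.Path(__file__).parent / "test_data")


# Callers then join paths with os.path.join, just as with get_temp_dir:
case = TestCase()
proto_path = os.path.join(case.get_test_data_dir(), "albert_test_vocab.spm")
```

Returning a string keeps the two directory helpers interchangeable from the caller's point of view.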
Ah, got it!
/gcbrun
/gcbrun |
Nice, looks great! Will pull this in after tests pass.
Fixes #1272

As we discussed, I factored out the use of `sentencepiece` into its own file for each model. Let me know if you prefer to keep the generated files in the repo or generate them each time in CI. Also, let me know if you want me to document this change somewhere, e.g. update the contributor guide.