
Remove the use of SentencePieceTrainer from tests #1283

Merged: 15 commits into keras-team:master from fix-gh1272 (Oct 26, 2023)

Conversation

@tirthasheshpatel (Contributor):

Fixes #1272

As we discussed, I factored out the use of sentencepiece into its own file for each model. Let me know whether you prefer to keep the generated files in the repo or generate them each time in CI. Also, let me know if you want me to document this change somewhere, e.g. in the contributor guide.

@mattdangerw (Member) left a comment:

Looks good! Just a few comments.

AlbertTokenizer(proto=bytes_io.getvalue()),
sequence_length=5,
AlbertTokenizer(
proto=str(
Member:

This is a bit of a mouthful. Can we maybe add this to our base class for tests in test_case.py?

proto=os.path.join(self.test_data_dir(), "albert_test_vocab.spm")

Contributor Author:

Done.

pathlib.Path(__file__).parent.parent.parent
    / "tests"
    / "test_data"
    / "albert_sentencepiece.proto"
Member:

maybe in keeping with our preset suffixes and name, let's call this albert_test_vocab.spm. Let's also drop a comment right above this line, # Generated with create_albert_test_proto.py, so people know how to update this.

Contributor Author:

Done.

pathlib.Path(__file__).parent.parent.parent
    / "tests"
    / "test_data"
    / "sentencepiece_bad.proto"
Member:

Maybe let's be more specific than "bad" here.

"no_special_token_vocab.spm"

Contributor Author:

Done.

import sentencepiece


def _train_sentencepiece(data, *args, **kwargs):
Member:

Can we just fold _train_sentencepiece and _save into one, e.g. def train_sentencepiece(filename, data, *args, **kwargs)?

Also, there's no need to mark this with an underscore. We only really do that for class members; instead, we use keras_nlp_export to mark what is public.

Contributor Author:

Done.



def main():
bytes_io = _train_sentencepiece(
Member:

For all of these, let's just consolidate to a single test proto for each model.

So use the one with user_defined_symbols="[MASK]", and update any test output appropriately.

Contributor Author:

Done.

@mattdangerw (Member):

Looks like some legitimate test failures with xlm_roberta

- Use one proto per model; modify tests accordingly
- Add a comment saying where the test proto file was generated from
- Rename the files from `*_sentencepiece.proto` to `*_test_vocab.spm`
- Rename the bad proto file to `no_special_token_vocab.spm`
- Add a method to get the test dir
- Remove the underscores from the sentencepiece util file
- Save the file in `train_sentencepiece` function itself
- Address the XLM Roberta test failure
@tirthasheshpatel (Contributor Author):

> Looks like some legitimate test failures with xlm_roberta

Addressed. Thanks for the review @mattdangerw! Let me know if this looks good to you now!

@mattdangerw (Member):

/gcbrun

@mattdangerw (Member) left a comment:

Looks great! Just a last couple of comments.

@@ -417,3 +418,6 @@ def compare(actual, expected):
self.assertAllClose(actual, expected, atol=0.01, rtol=0.01)

tree.map_structure(compare, output, expected_partial_output)

    def get_test_data_dir(self):
        return pathlib.Path(__file__).parent / "test_data"
@mattdangerw (Member), Oct 25, 2023:

Nit, but I think returning a simple string might be a little less surprising to people here.

Ok to use pathlib, but cast to string before returning, and then use os.path.join(self.get_test_data_dir(), "file") from the tests.

That will match our usage of get_temp_dir.
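A sketch of how this could look on the base class; the class here is an illustrative stand-in, not the actual test_case.py:

```python
import os
import pathlib


class TestCase:
    """Illustrative stand-in for the shared test base class."""

    def get_test_data_dir(self):
        # Build the path with pathlib, but cast to str before returning so
        # callers can combine it with os.path.join, matching get_temp_dir.
        return str(pathlib.Path(__file__).parent / "test_data")


# Usage from a test (the filename is illustrative):
proto = os.path.join(TestCase().get_test_data_dir(), "albert_test_vocab.spm")
```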

Contributor Author:

Ah, got it!

@mattdangerw (Member):

/gcbrun

@mattdangerw (Member):

/gcbrun

@mattdangerw (Member) left a comment:

Nice, looks great! Will pull this in after tests pass.

@mattdangerw mattdangerw merged commit d254b02 into keras-team:master Oct 26, 2023
@tirthasheshpatel tirthasheshpatel deleted the fix-gh1272 branch October 26, 2023 20:38
Linked issue: Remove use of SentencePieceTrainer from modeling tests