Adding an AlbertMaskedLM task + Fix Projection layer dimension in MaskedLMHead #725
Conversation
Ready for review
Thank you! This looks great!
Please go ahead and add docs. Also, it looks like there are some test failures (this model uses different tokenization and padding than RoBERTa, so the preprocessing tests will need some updates).
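For illustration, a quick sketch of the difference (the preset names here are assumptions about the published KerasNLP presets, not part of this PR): ALBERT packs inputs with "[CLS]"/"[SEP]"/"<pad>" from a SentencePiece vocab, while RoBERTa uses "<s>"/"</s>"/"<pad>" from a byte-pair vocab, so the RoBERTa preprocessing tests cannot be reused verbatim.

import keras_nlp

# Illustrative only: compare the special tokens each preprocessor packs around the input.
albert = keras_nlp.models.AlbertPreprocessor.from_preset("albert_base_en_uncased")
roberta = keras_nlp.models.RobertaPreprocessor.from_preset("roberta_base_en")
print(albert(["the quick brown fox"])["token_ids"][0, :6])   # starts with the [CLS] id
print(roberta(["the quick brown fox"])["token_ids"][0, :6])  # starts with the <s> id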
CI is green, ready for review @mattdangerw @abheesht17 @jbischof
Quick review of the docstrings.
Thanks! Just a few quick comments
@mattdangerw CI is green in terms of tests.
@shivance thanks! will take another pass soon. looks like the accelerator test is just a timeout, so probably unrelated to the code changes here.
Good to go @mattdangerw
Just a couple minor comments. I can fix these up as I merge.
@@ -72,6 +72,7 @@ def __init__(self, proto, **kwargs):
         cls_token = "[CLS]"
         sep_token = "[SEP]"
         pad_token = "<pad>"
+        mask_token = "<mask>"
         for token in [cls_token, sep_token, pad_token]:
Oh sorry, missed one thing! We should update our tokenizer to check for the mask token here, which will then mean updating a lot of the tests for albert_tokenizer and albert_preprocessor so they actually include the mask token.
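For illustration, a minimal sketch of what that check could look like (the error message wording is a guess, not the actual KerasNLP code):

mask_token = "<mask>"
# Include the mask token in the special-token check so a vocabulary trained
# without it fails loudly at construction time.
for token in [cls_token, sep_token, pad_token, mask_token]:
    if token not in self.get_vocabulary():
        raise ValueError(
            f"Cannot find token `'{token}'` in the provided `vocabulary`. "
            f"Please provide `'{token}'` in your `vocabulary` or use a "
            "pretrained `vocabulary` name."
        )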
This PR shows how to add the mask token in when training a sentencepiece model: #732
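Roughly, the idea is to register the mask as a user-defined symbol when training the test vocabulary (the sentences, vocab size, and piece names below are illustrative, not the exact code from #732):

import io
import sentencepiece

bytes_io = io.BytesIO()
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=iter(["the quick brown fox", "the earth is round"]),
    model_writer=bytes_io,
    vocab_size=12,
    model_type="WORD",
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    pad_piece="<pad>",
    unk_piece="<unk>",
    bos_piece="[CLS]",
    eos_piece="[SEP]",
    user_defined_symbols="<mask>",  # reserves a fixed id for the mask token
)
proto = bytes_io.getvalue()  # pass this to AlbertTokenizer(proto=proto)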
    )

    def test_preprocess_strings(self):
        input_data = " airplane at airport"
One other follow-up as you are adding the check for the mask token: we should also make sure the data you are using to train the sentencepiece model above matches the data you are passing here.
When all is working, the token id output should look something like [bos, mask, mask, mask, eos, pad, pad, pad]
(replacing those symbols with the proper indices from the vocab).
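As a hypothetical sketch (the preprocessor arguments below are assumptions, and `proto` is the tiny vocab trained in the sentencepiece snippet above; the printed ids are placeholders for whatever indices that vocab assigns):

import keras_nlp

preprocessor = keras_nlp.models.AlbertMaskedLMPreprocessor(
    tokenizer=keras_nlp.models.AlbertTokenizer(proto=proto),
    sequence_length=8,
    mask_selection_rate=1.0,  # mask every input token so the output layout is predictable
    mask_token_rate=1.0,
    random_token_rate=0.0,
)
x, y, sw = preprocessor("the quick brown fox")
# Expect x["token_ids"] to look like [cls, mask, mask, mask, mask, sep, pad, pad],
# with the symbols replaced by their ids in the trained vocab.
print(x["token_ids"])
print(x["mask_positions"])  # which positions were masked
print(y)                    # the original ids at those positions (the labels)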
Still WIP
No rush! Ping when this is ready for 👀 again!
The PR is ready for review; waiting on CI, although the tests pass locally.
CI is green @mattdangerw @jbischof, ready for review!
This looks great! Thank you!
Going to try out a quick training run with it, and if all looks good, will merge this in.
@@ -11,7 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""Tests for BERT classification model."""
+"""Tests for ALBERT classification model."""
Oops good catch :)
@mattdangerw Would you mind sharing the training script once done?
@shivance sure! https://colab.research.google.com/gist/mattdangerw/c73a58e20132fd1117161a0f00b23b4b/albert-mlm.ipynb Things look in the right ballpark to me: 43% guess-the-word accuracy after a single epoch on IMDb, so going to pull this in. Thanks again!
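For anyone reading along, a rough sketch of what such a run looks like (the dataset handling, hyperparameters, and preset name are assumptions, not the actual notebook):

import keras_nlp
import tensorflow_datasets as tfds
from tensorflow import keras

# Fit the new AlbertMaskedLM task on raw IMDb text; the attached preprocessor
# handles tokenization and dynamic masking of the inputs.
imdb = tfds.load("imdb_reviews", split="train", as_supervised=True)
imdb = imdb.map(lambda text, label: text).batch(32)

masked_lm = keras_nlp.models.AlbertMaskedLM.from_preset("albert_base_en_uncased")
masked_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(5e-5),
    weighted_metrics=["sparse_categorical_accuracy"],
)
masked_lm.fit(imdb, epochs=1)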
@mattdangerw @jbischof @chenmoneygithub I went through the GSoC ideas list and found this one especially intriguing. I think it will add a lot of value to KerasNLP and the Keras ecosystem in general. I referred to the Hugging Face tasks; their pipeline syntax is really useful. Apart from supporting this task, I think an accompanying tutorial will add a lot of value to the GSoC project. I'm already working on #754, and I also opened #1253 in keras-io; these are tutorials for Token Classification and Sentence Similarity respectively. Which of these tasks are on the priority list, and how many of them would you suggest I include in my proposal (keeping the GSoC timeline in mind)?
Closes #718
Fixes #733
Docstrings are yet to be added; I will add them once doubts about the implementation are cleared up.