
Solve #721 Deberta masklm model #732

Merged (7 commits) on Mar 3, 2023

Conversation

Plutone11011
Contributor

PR for the DeBERTa masked LM model and preprocessor.

Related to this, I noticed that DeBERTaV3 uses a different technique for masking; however, the backbone explicitly doesn't include it, so I don't know whether it is relevant here.

@Plutone11011 changed the title from "#721 Deberta masklm model" to "Solve #721 Deberta masklm model" on Feb 9, 2023
@mattdangerw
Member

@Plutone11011 thanks! Will take a pass soon.

> I noticed that DeBERTaV3 uses a different technique for masking

Can you elaborate? If it's all preprocessing, like this comment, I think it is fine to cover the fancier token masking schemes down the road.

If this is a whole different setup, or different "head" on the backbone, we might want to consider some changes.

@Plutone11011
Contributor Author

> @Plutone11011 thanks! Will take a pass soon.
>
> > I noticed that DeBERTaV3 uses a different technique for masking
>
> Can you elaborate? If it's all preprocessing, like this comment, I think it is fine to cover the fancier token masking schemes down the road.
>
> If this is a whole different setup, or different "head" on the backbone, we might want to consider some changes.

In https://arxiv.org/pdf/2111.09543.pdf the authors describe a replacement for MLM called RTD (replaced token detection), which is what mainly distinguishes DeBERTaV3 from previous versions. It is still somewhat based on MLM, but it uses a GAN-style approach, jointly training an MLM generator and a discriminator/classifier. See Sections 3.1 and 2.3.2 of the paper, specifically.
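For illustration, here is a minimal, self-contained sketch of that RTD data flow: an MLM generator proposes tokens at the masked positions, the sequence with the replacements becomes the discriminator input, and the discriminator labels mark which tokens were actually replaced. All names and values below are toy placeholders, not the paper's implementation.

```python
import numpy as np

# Toy batch: original token ids and the positions that were masked for the generator.
original_ids = np.array([[5, 12, 7, 9, 3]])  # (batch, seq_len)
mask_positions = np.array([[1, 3]])          # positions the MLM generator predicts

# Pretend the generator (a small MLM) sampled these ids at the masked positions.
generator_samples = np.array([[12, 4]])

# Discriminator input: original sequence with masked positions replaced by samples.
discriminator_ids = original_ids.copy()
batch_idx = np.arange(original_ids.shape[0])[:, None]
discriminator_ids[batch_idx, mask_positions] = generator_samples

# RTD labels: 1 where the token differs from the original ("replaced"), else 0.
# Note that a sampled token identical to the original counts as "original".
rtd_labels = (discriminator_ids != original_ids).astype(np.int32)

print(discriminator_ids)  # [[ 5 12  7  4  3]]
print(rtd_labels)         # [[0 0 0 1 0]]
```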

@mattdangerw
Member

Ah right! DeBERTaV3 uses ELECTRA-style (GAN-like) pre-training.

I think it is still totally valid to ship an MLM task for DeBERTa, but we can probably add a note in the docstring that this is not the task setup used by DeBERTa during pre-training.
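For illustration, a sketch of how that docstring note and basic task usage might read; the `DebertaV3MaskedLM` class name, the `from_preset` call, and fitting directly on raw strings follow the conventions used elsewhere in this PR, but treat the exact API as an assumption rather than the final implementation.

```python
import keras_nlp

# Docstring note (paraphrased): DeBERTaV3 itself was pre-trained with
# ELECTRA-style replaced token detection rather than this masked LM setup;
# the task is still useful for fine-tuning or continued training with MLM.
masked_lm = keras_nlp.models.DebertaV3MaskedLM.from_preset("deberta_v3_base_en")
masked_lm.fit(x=["The quick brown fox jumped."], batch_size=1)
```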


@mattdangerw left a comment


Thanks! This looks great! Left a few comments

Disclaimer: Pre-trained models are provided on an "as is" basis, without
warranties or conditions of any kind. The underlying model is provided by a
third party and subject to a separate license, available
[here](https://github.com/facebookresearch/fairseq).
Member


switch this to the deberta repo

outputs = MaskedLMHead(
    vocabulary_size=backbone.vocabulary_size,
    embedding_weights=backbone.token_embedding.embeddings,
    intermediate_activation="gelu",

Examples:
```python
# Load the preprocessor from a preset.
preprocessor = keras_nlp.models.DebertaV3MaskedLMPreprocessor.from_preset("deberta_v3_base_en")
Member


format this to fit our lint limit

preprocessor = keras_nlp.models.DebertaV3MaskedLMPreprocessor.from_preset(
    "deberta_v3_base_en",
)

Contributor Author


The format script is not detecting it; is that because it's in a docstring?

@@ -72,6 +72,7 @@ def __init__(self, proto, **kwargs):
         cls_token = "[CLS]"
         sep_token = "[SEP]"
         pad_token = "[PAD]"
+        mask_token = "[MASK]"
         for token in [cls_token, pad_token, sep_token]:
Member


I think we should add a check for the mask token here, which might also mean you need to update some unit tests for the preprocessor and tokenizer layers (so they add a mask token to the vocabulary).
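A standalone sketch of the kind of check being suggested; the helper name and error wording here are hypothetical, and in the PR the check would live inside the tokenizer's `__init__` loop shown in the diff above.

```python
def check_special_tokens(vocabulary, tokens=("[CLS]", "[SEP]", "[PAD]", "[MASK]")):
    """Raise if any required special token is missing from the vocabulary."""
    missing = [token for token in tokens if token not in vocabulary]
    if missing:
        raise ValueError(
            f"Cannot find {missing} in the provided vocabulary. "
            "Please add them, or use a pretrained vocabulary that includes them."
        )

# A vocabulary without [MASK] now fails the check.
try:
    check_special_tokens(["[CLS]", "[SEP]", "[PAD]", "the", "quick"])
except ValueError as e:
    print(e)
```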

Contributor Author


So, I added the check, and as you can see it fails the preset tests because apparently there isn't a [MASK] token in the deberta_v3_extra_small_en vocabulary, which seems strange to me (I've tried deberta_v3_base_en too). Any advice?

    bos_piece="[CLS]",
    eos_piece="[SEP]",
    unk_piece="[UNK]",
    user_defined_symbols="[MASK]",
Member


Nice! Should this not be a list? Or is it just a comma-separated string?

Contributor Author


Yes, it seems to be a comma-separated string. From the SentencePiece options doc:

--user_defined_symbols (comma separated list of user defined symbols)  type: std::string  default: ""
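For reference, a minimal, self-contained sketch of training a toy SentencePiece vocabulary with a user-defined [MASK] symbol, roughly mirroring the test setup quoted above; the toy corpus and vocabulary size are illustrative.

```python
import io

import sentencepiece as spm

corpus = ["the quick brown fox", "the earth is round"]
bytes_io = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(corpus),
    model_writer=bytes_io,
    vocab_size=12,
    model_type="WORD",
    pad_id=0,
    bos_id=1,
    eos_id=2,
    unk_id=3,
    pad_piece="[PAD]",
    bos_piece="[CLS]",
    eos_piece="[SEP]",
    unk_piece="[UNK]",
    user_defined_symbols="[MASK]",  # a comma-separated string, not a Python list
)
proto = bytes_io.getvalue()  # serialized model, usable as DebertaV3Tokenizer(proto=proto)
```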

proto = bytes_io.getvalue()
self.preprocessor = DebertaV3MaskedLMPreprocessor(
    tokenizer=DebertaV3Tokenizer(proto=proto),
    # Simplify out testing by masking every available token.
Member


I don't think this comment applies here.

@Plutone11011
Contributor Author

Addressed the comments. I also added the mask check in the tokenizer, and I had to adjust the vocab size and some tests as a result.

@mattdangerw
Member

@Plutone11011 thanks! Overall this is looking good, but I hit a snag while trying to test this.

Here's a gist -> https://colab.research.google.com/gist/mattdangerw/550ca0fc007579353ec7d0f11ebee03b/deberta-masked-lm.ipynb

Essentially, the problem is that the [MASK] token does not appear in our version of the DeBERTaV3 vocabulary. Looking at the upstream code, it looks like they do have support for a mask token when using SentencePiece, but it may be layered on top of SentencePiece itself -> https://github.com/microsoft/DeBERTa/blob/11fa20141d9700ba2272b38f2d5fce33d981438b/DeBERTa/deberta/spm_tokenizer.py#L43

We might need to do a little digging here. What token id is assigned to [MASK] in the upstream implementation? For Hugging Face, it looks like it is appended as a final token, but would that mean we are attempting an embedding lookup that falls outside of our embedding size?

Once we figure out how the original implementation handles this, we can figure out what changes we need to make here.
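To make the concern concrete, a toy sketch of what "appended as a final token" implies for the embedding table; the numbers are placeholders, not the real DeBERTaV3 vocabulary sizes.

```python
import numpy as np

# Toy numbers: an spm vocabulary of 10 pieces (ids 0..9), with [MASK]
# appended afterwards as id 10 (the Hugging Face-style layout).
spm_vocab_size = 10
mask_token_id = spm_vocab_size

# If the embedding table is sized to the spm vocabulary alone, the mask id
# falls outside of it...
embeddings = np.zeros((spm_vocab_size, 4))
try:
    embeddings[mask_token_id]
except IndexError:
    print("mask id is out of range for an spm-sized embedding table")

# ...so the model's vocabulary_size has to cover the appended token(s).
embeddings = np.zeros((mask_token_id + 1, 4))
assert embeddings[mask_token_id].shape == (4,)
```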

@mattdangerw
Member

Everything is looking good here; we can just merge this with #759 when it is ready. We can't merge before that without breaking the main model usages.

@mattdangerw
Member

OK, #759 is merged, so we should be able to rebase this and get things working. You can use the Colab I linked above to test things out (I recommend a GPU runtime).

@mattdangerw
Member

@Plutone11011 is this ready for review again? If so, I can take a pass tomorrow.

@Plutone11011
Contributor Author

> @Plutone11011 is this ready for review again? If so, I can take a pass tomorrow.

Yes. I've checked the notebook; training and the preprocessor work. There is, however, still a problem when calling detokenize that I haven't delved into: it doesn't find the [MASK] id.

@Plutone11011
Contributor Author

Plutone11011 commented Mar 1, 2023

@mattdangerw in the Colab notebook, the call to detokenize yields an OutOfRangeError (invalid id 128000); it is basically unable to find the [MASK] id. I haven't followed #759 closely, but I guess this is because the [MASK] token is handled internally by DebertaV3Tokenizer in self.mask_token_id, whereas detokenize calls the SentencePiece method. Do you think we should implement a detokenize method inside DebertaV3Tokenizer?
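One possible shape for such an override, sketched as a hypothetical subclass so it stays self-contained; `mask_token_id` is the attribute mentioned above, and whether this is the right fix (or where it should live) is exactly the open question here.

```python
import tensorflow as tf

from keras_nlp.models import DebertaV3Tokenizer


class PatchedDebertaV3Tokenizer(DebertaV3Tokenizer):
    """Hypothetical subclass: drop [MASK] ids before SentencePiece detokenization."""

    def detokenize(self, inputs):
        # The [MASK] id lives outside the SentencePiece vocabulary, so filter
        # it out before delegating to the base SentencePiece detokenize.
        inputs = tf.ragged.boolean_mask(
            inputs, tf.not_equal(inputs, self.mask_token_id)
        )
        return super().detokenize(inputs)
```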

@mattdangerw
Member

@Plutone11011 thanks for checking this out! IMO we don't need to block on the detokenize functionality (it is not critical to any MLM workflow), but it could be worth handling in a follow-up.

I'll take a pass over the code again shortly.


@mattdangerw left a comment


Thank you! This was a bit of a journey! All looks good to me.
