Add BART Backbone #661
Conversation
Looking great, some quick comments. What is going on with our TransformerDecoder!?
@@ -165,7 +165,7 @@ def _build(self, input_shape, has_cross_attention):
         self._cross_attention_layer = keras.layers.MultiHeadAttention(
             num_heads=self.num_heads,
             key_dim=head_dim,
-            value_dim=hidden_dim,
+            value_dim=head_dim,
Whoa, did you find a bug here? If so, let's discuss and break this into a different PR.
Yeah, this is possibly a bug. We haven't actually exercised the cross-attention layer in any of our models so far (we use TransformerDecoder for GPT-2, but since that is a decoder-only model, the cross-attention layer is never used), so this bug escaped our attention. I discussed this with @mattdangerw on Friday and will open a separate PR for it; we might want to patch it into 0.4.0 ASAP.
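For reference, here is a minimal sketch of the corrected configuration (the hidden_dim and num_heads values below are illustrative assumptions, not the actual keras-nlp code): with multi-head attention, both key_dim and value_dim should be the per-head size, i.e. hidden_dim // num_heads.

```python
# Illustrative sketch only; dimensions are assumptions, not library defaults.
from tensorflow import keras

hidden_dim = 768
num_heads = 12
head_dim = hidden_dim // num_heads  # 64

cross_attention_layer = keras.layers.MultiHeadAttention(
    num_heads=num_heads,
    key_dim=head_dim,
    # Per-head value size; passing hidden_dim here (the old code) would make
    # every head project values to the full model width.
    value_dim=head_dim,
)
```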
Great catch!
This is merged, so let's rebase
Done!
Just some quick initial comments!
@keras.utils.register_keras_serializable(package="keras_nlp")
class BartBackbone(Backbone):
I wish they had chosen a name that wasn't so visually similar to Bert. This is going to confuse me so much :P
😂
BERT -> Bidirectional Encoder Representations from Transformers
BART -> Bidirectional Auto-encoder Representations from Transformers?
Guessing...:P
Interesting! I couldn't find it in the paper.
)

# Token embedding layer. This layer is shared by encoder and decoder.
token_embedding_layer = keras.layers.Embedding(
When BART is used as a language model, are the embedding weights shared for the output projection? Or is there a separate set of parameters used?
@mattdangerw - they do define separate Embedding layers in both the encoder and decoder: https://github.com/huggingface/transformers/blob/44caf4f6f47120a4aaca561c9cc8041455bef705/src/transformers/models/bart/modeling_bart.py#L723-L726 and https://github.com/huggingface/transformers/blob/44caf4f6f47120a4aaca561c9cc8041455bef705/src/transformers/models/bart/modeling_bart.py#L896-L899.
But they copy over the weights for these layers from this layer: https://github.com/huggingface/transformers/blob/main/src/transformers/models/bart/modeling_bart.py#L1169.
Moreover, when I printed the trainable layers, these two embedding layers did not show up: https://colab.research.google.com/drive/15871_w5LqU9-wzykFbb8dAne6g4zvyat?usp=sharing. So I think the same set of parameters is being used?
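As an illustration of this kind of tying on the Keras side (a hypothetical sketch, not the exact BartBackbone code; the vocabulary and hidden sizes are made up), calling one Embedding instance on both the encoder and decoder inputs means both lookups share the same weight matrix:

```python
# Hypothetical sketch of sharing one token embedding between encoder and decoder.
from tensorflow import keras

vocabulary_size = 50265  # assumed value for illustration
hidden_dim = 768         # assumed value for illustration

token_embedding_layer = keras.layers.Embedding(
    input_dim=vocabulary_size,
    output_dim=hidden_dim,
    name="token_embedding",
)

encoder_token_ids = keras.Input(shape=(None,), dtype="int32")
decoder_token_ids = keras.Input(shape=(None,), dtype="int32")

# Reusing the same layer instance ties the parameters: both calls read the
# same embedding matrix.
encoder_embeddings = token_embedding_layer(encoder_token_ids)
decoder_embeddings = token_embedding_layer(decoder_token_ids)
```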
Wouldn't checkpoint conversion confirm this?
Yep, confirmed in the Colab notebook.
Great! This is the easy case for us, where the backbone is self-sufficient for language modeling tasks. We have prior art for this with BERT pretraining and Chen's upcoming GPT-2 language model.
@mattdangerw, I replied to the comment about the embedding layers.
Just need to rebase on #667, and we should be able to merge.
This class implements a Transformer-based encoder-decoder model as
described in
["BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension"](https://arxiv.org/abs/1910.13461).
It includes the embedding lookups and transformer layers.
Is this line needed?
Hmmm, I just stuck to what's in the other backbone models. I can remove it.
I don't think BART has any "pretraining head". For all pretraining tasks, it just autoregressively generates/recovers the original input from the corrupted (noised) input.
For other models, the whole point of this line was to emphasise that the model class does not have pretraining heads. So having just "It includes the embedding lookups and transformer layers." seems somewhat repetitive for BART, and it can be removed.
Fine to remove!
FWIW, there will still be a pretraining head, even if that head has no parameters. You still need to map from the dense hidden_dim output to LM logits.
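A rough sketch of what such a parameter-free head could look like (purely illustrative, not part of this PR): reuse the shared token embedding matrix as the output projection.

```python
# Illustrative only: map decoder outputs to vocabulary logits by multiplying
# against the transposed token embedding matrix, so the head adds no new weights.
import tensorflow as tf

def lm_logits(decoder_output, token_embedding_layer):
    # decoder_output: (batch, seq_len, hidden_dim)
    # embeddings:     (vocabulary_size, hidden_dim)
    # returns:        (batch, seq_len, vocabulary_size)
    return tf.matmul(
        decoder_output, token_embedding_layer.embeddings, transpose_b=True
    )
```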
Oh, yeah. Brain fart, indeed 🤦🏼♂️
Just some minor comments. I can fix as I merge.
Disclaimer: Pre-trained models are provided on an "as is" basis, without
warranties or conditions of any kind. The underlying model is provided by a
third party and subject to a separate license, available
[here](https://github.com/facebookresearch/fairseq/tree/main/examples/bart).
We should actually link to the location with the LICENSE file here, which is the base of the repo.
),
"decoder_token_ids": tf.ones(shape=(1, 12), dtype=tf.int64),
"decoder_padding_mask": tf.constant(
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], shape=(1, 12)
Might be nice to show these padding masks can in fact be different. (Just switch to 0s at a different point.)
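For instance (hypothetical values, just to make the point), the usage example could pad the two sequences at different positions:

```python
# Hypothetical example inputs with encoder and decoder padded at different points.
import tensorflow as tf

input_data = {
    "encoder_token_ids": tf.ones(shape=(1, 12), dtype=tf.int64),
    "encoder_padding_mask": tf.constant(
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0], shape=(1, 12)
    ),
    "decoder_token_ids": tf.ones(shape=(1, 12), dtype=tf.int64),
    "decoder_padding_mask": tf.constant(
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0], shape=(1, 12)
    ),
}
```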
# Embed tokens and positions.
token_embedding = token_embedding_layer(encoder_token_id_input)
position_embedding = PositionEmbedding(
Let's add a comment explaining that the position embedding is not shared, but the token embedding is.
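Something along these lines (the comment wording, names, and sequence length below are suggestions for illustration, not the PR's actual code):

```python
# Sketch only: the token embedding layer is shared across encoder and decoder,
# but position embeddings are NOT shared, so each stack builds its own.
from keras_nlp.layers import PositionEmbedding

max_sequence_length = 1024  # illustrative value

encoder_position_embedding = PositionEmbedding(
    sequence_length=max_sequence_length,
)
decoder_position_embedding = PositionEmbedding(
    sequence_length=max_sequence_length,
)
```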
Resolves #649
Conversion Notebook: https://colab.research.google.com/drive/1JeWNXHFEdvFpxV_G2PDy5AAk6R6eLdJz?usp=sharing