Adding GPTNeoXBackbone #1056

Conversation
@mattdangerw I opened this PR so that it's easier for you to review the code! For now I have referred to the Hugging Face implementation of GPTNeoX and put the parts together (rotary embedding, attention), converting them into TensorFlow. Please note:
Very cool! Excited for this. General question: are we better off calling this GPTNeoX or Pythia? We can ship checkpoints for either, but we may want to go with whatever is better known as the general name. The first thing I would suggest doing is extending the colab to actually do some weight conversion. It looks like the original project hosts weights on Hugging Face, so that seems like the place to start. Essentially we would want a colab or a script that can download the original weights, convert them to our backbone model, run some dummy inputs through both, and confirm the outputs are equivalent. Here is an example colab with DistilBERT that @abheesht17 made.
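To make that concrete, the verification half of such a colab might boil down to something like the sketch below. This is only a sketch: convert_checkpoint() is a hypothetical helper that would build the new backbone with a matching config and copy the HF weights into it, EleutherAI/pythia-70m is just an example checkpoint, and the dict input format mirrors other KerasNLP backbones.

# Sketch of the numerics check a conversion colab/script should end with.
import numpy as np
import torch
from transformers import GPTNeoXModel

hf_model = GPTNeoXModel.from_pretrained("EleutherAI/pythia-70m")
hf_model.eval()

# Hypothetical helper: builds a GPTNeoXBackbone with the same config and
# copies every HF weight tensor into it.
keras_model = convert_checkpoint(hf_model)

# Run identical dummy inputs through both implementations.
token_ids = np.random.randint(0, hf_model.config.vocab_size, size=(1, 16))
padding_mask = np.ones((1, 16), dtype="int64")

with torch.no_grad():
    hf_output = hf_model(
        input_ids=torch.tensor(token_ids),
        attention_mask=torch.tensor(padding_mask),
    ).last_hidden_state.numpy()

keras_output = keras_model(
    {"token_ids": token_ids, "padding_mask": padding_mask}
).numpy()

# The two forward passes should agree to within float32 tolerance.
print(np.max(np.abs(hf_output - keras_output)))
np.testing.assert_allclose(hf_output, keras_output, atol=1e-4)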
Thanks for the PR! I took a brief scan over the code, and it's quite complex to tell if it is correct or not. So we need two things here:
- As Matt commented above, please share a colab with weights conversion so that we know the code works properly (produces the same result as HF given the same checkpoint).
- For methods like _compute_attention, we need some comments to illustrate what they are doing, otherwise we will lose track of them very soon. We will take a deeper look at these; let's focus on the weights conversion for now.
Thanks!
Hello from EleutherAI! I’m excited to see this PR in the works. The short answer is that I would recommend calling it the GPT-NeoX architecture. The GPT-NeoX architecture has been used by a variety of models including GPT-J, GPT-NeoX-20B, Pythia, PaLM, and more. If you don’t consider changing the PE to be a meaningful architecture change, then StableLM and (I believe but can’t find documentation of this fact right now) MPT also use the architecture. (I can give a longer answer & history if that’s desired.) We also release weights in the format that the GPT-NeoX library keeps them natively. This format is more convenient for distributed training but not for inference. Several such weight formats can be found on the HuggingFace Hub or linked to from our README. The HuggingFace weights are official releases though, and were produced by EleutherAI. Our library actually supports an “export to HF” script, and if you develop a conversion script for your library we can add an “export to keras” script as well.
@StellaAthena hello EleutherAI! Thanks so much, this is very helpful context. Let's go with GPT-NeoX then. PE differences we would probably consider to be a separate architecture. MPT is ALiBi not RoPE (I think?), so that is probably something we would sort into a separate architecture in our library.
@mattdangerw Here is the working model + checkpoints loader without cached attention. I'm looking into matching the outputs.
Left some miscellaneous comments, but think you are overall on the right track already. First step will be to confirm we have an equivalent forward pass with "upstream" versions, then we can refine the code here more.
rotary_emb_base=10000,
kernel_initializer="glorot_uniform",
bias_initializer="zeros",
use_parallel_residual=True,
is this value both true and false for checkpoints we care about? or does one win out?
if the latter, we could consider ditching this
Yes, it is True for all pythia and gpt-neox-20b checkpoints.
1.0 / (self.rotary_emb_base ** (range / self.dim))
)

@staticmethod
why bother marking this a static method and leaving it public? seems like we could just leave this as a private _apply_rotary_pos_emb regular method for now (and let the fact that this does not access anything on self be incidental)
sounds good
this is still marked as a staticmethod, I think there is no need.
A few more comments.
class RotaryEmbedding(keras.layers.Layer):
    def __init__(self, dim, rotary_emb_base=10000):
We should try to name this more descriptively than dim. Is this the "hidden dim" of the model? If so let's call this hidden_dim.
@mattdangerw It's not hidden_dim, it is attn_head_size * rotary_pct. How about we rename it to rotary_ndims itself?
def build(self, input_shape):
    super().build(input_shape)
    self.inverse_freq = self.add_weight(
actually looking at this, this inverse_freq should all be static right? if we don't need this trainable, instead of having this be a weight, let's move this into call somewhere, we can just compute it on the fly
Hey @mattdangerw ! We can definitely do that, but I would like you to take a look at this. https://github.com/huggingface/transformers/blob/17a55534f5e5df10ac4804d4270bf6b8cc24998d/src/transformers/models/esm/modeling_tf_esm.py#L102-L107
I'm not sure I follow the comment totally. The issue looks to be a precision one, but if the goal is to keep these explicitly as float32, why not just compute them on the fly with an explicit float32 dtype? I still don't understand the need for a variable. And the fact that this is a trainable seems incorrect looking at the torch implementation, these are not trainable in torch.
In general, I would be careful attempting to apply what seems like a fairly technical point about esm checkpoints to other models. Ideally we would just check how close our forward pass outputs are for the actual pythia checkpoints under full precision (float32 everywhere) and mixed precision (float32 for variables, float16 for computations), and use that to determine our approach here.
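To illustrate the "compute on the fly" option (a sketch only, reusing the rotary_ndims / max_wavelength names discussed elsewhere in this review, not the PR's final code):

import tensorflow as tf

def compute_inverse_freq(rotary_ndims, max_wavelength=10000):
    # Recomputed on each call, explicitly in float32, so no variable is
    # stored on the layer and a mixed-precision policy cannot downcast it.
    freq_range = tf.range(0, rotary_ndims, 2, dtype="float32")
    return 1.0 / (max_wavelength ** (freq_range / rotary_ndims))

The caller would then take the sin/cos of the position-frequency outer product exactly as before, with no trainable state involved.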
sin_emb = sin_emb[:, : tf.shape(tensor)[1], :, :]
x1, x2 = tf.split(tensor, 2, axis=-1)
half_rot_tensor = tf.concat((-x2, x1), axis=-1)
# Incompatible shapes: [32,256,8,2] vs. [1,256,1,16] [Op:Mul]
remember to clean up little notes like this
done!
max_sequence_length=512,
kernel_initializer="glorot_uniform",
bias_initializer="zeros",
rotary_pct=0.25,
let's consider better names for these two arguments. probably rotary_percentage is more consistent with Keras' style, and rotary_emb_base is a little confusing, are there better names we could consider from the paper or elsewhere?
I guess one option is to document this the same as max_wavelength in our SinePositionEncoding layer. https://keras.io/api/keras_nlp/modeling_layers/sine_position_encoding/
I'm not sure it's the best name, but at least it will be consistent across the library. We could name these arguments rotary_percentage and rotary_max_wavelength here, and just percentage and max_wavelength on the rotary layer itself.
fixed this!
]
value = query_key_value[..., 2 * self.attn_head_size :]

query_rot, query_pass = (
I wonder if we would be better off moving this slice and concat logic into the RotaryEmbedding call. Then our usage here could look a little more like...
query = self.rotary_embedding(query)
key = self.rotary_embedding(key)
And the rotary embedding layer could also hold the percentage argument, which would conceptually be quite clean. Looks like falcon is doing this roughly -> https://huggingface.co/tiiuae/falcon-40b/blob/main/modelling_RW.py
Thanks for this wonderful suggestion!
Putting back old conversion script unless we are done with presets.
Thanks! Will take a look today. What is the deal with deduped vs not deduped by the way? Deduped training data sounds better to me, is there any reason to not just ignore the "non-deduped" checkpoints?
Looking good! Left a few more comments.
restored_output = restored_model(self.input_batch)
self.assertAllClose(model_output, restored_output)

# def test_create_layout_map(self):
just remove this for now, we can add/review the code in a follow up
removed!
It is still here on this diff, maybe you forgot to remove?
rotary_max_wavelength=10000,
kernel_initializer="glorot_uniform",
bias_initializer="zeros",
use_parallel_residual=True,
I think we decided this was always true for now right? let's just remove the related code, add back if we need it
sg!
# Infer the dimension of our hidden feature size from the build shape.
hidden_dim = input_shape[-1]

self._input_layernorm = keras.layers.LayerNormalization(
these layernorms are confusingly named. from what i can tell, _input_layernorm is the layernorm for self attention (applied first), and _self_attention_layernorm is the layernorm for the feedforward block (applied first).
I would rename these: _input_layernorm -> _self_attention_layernorm and _self_attention_layernorm -> _feedforward_layernorm.
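For reference, a stripped-down sketch of the parallel-residual block structure this naming maps onto (illustrative callables only, not the PR's actual sub-layers):

def parallel_residual_block(
    x, self_attention, feedforward, self_attention_layernorm, feedforward_layernorm
):
    # The self-attention layernorm normalizes the input to self-attention, and
    # the feedforward layernorm normalizes the input to the MLP. With the
    # parallel residual, both branches read the same x and their outputs are
    # summed into a single residual update.
    attention_output = self_attention(self_attention_layernorm(x))
    feedforward_output = feedforward(feedforward_layernorm(x))
    return x + attention_output + feedforward_output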
sg!
Also, still unresolved.
output = self.backbone.token_embedding(self.input_batch["token_ids"])
self.assertEqual(output.shape, (2, 5, 64))

# def test_name(self):
The _1 would be from creating two backbones with the default name in the same session. I am able to get this test passing simply by uncommenting it and changing "gpt_neox_backbone" to "gpt_neo_x_backbone".
Which actually brings up a good point. We should actually rename the directory and all files from gpt_neox_... to gpt_neo_x_... to match the way Keras automatically converts CamelCase to snake_case.
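For context, the automatic name generation is easy to check with a throwaway subclass (illustration only; the real backbone is of course not an empty layer):

from tensorflow import keras

class GPTNeoXBackbone(keras.layers.Layer):
    pass

# Keras converts the CamelCase class name to snake_case for the default
# name, and appends a counter for repeated instances in the same session.
print(GPTNeoXBackbone().name)  # "gpt_neo_x_backbone"
print(GPTNeoXBackbone().name)  # "gpt_neo_x_backbone_1"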
Hey! Started a review on this, but I think what happened is your changes might be on #1085 instead of here. Can you move your related changes onto this PR?
output = self.backbone.token_embedding(self.input_batch["token_ids"])
self.assertEqual(output.shape, (2, 5, 64))

# def test_name(self):
I think this is still unresolved. We should rename this directory and all files to gpt_neo_x... and then enable this test. It should pass at that point.
Moved tokenizer from #1085 to here. Resolved comments related to that PR.
/gcbrun
Yeah, I rebased this branch with
Looking good! Mostly minor stuff now!
self.rotary_percentage = rotary_percentage
self.dropout = dropout
self.attn_head_size = hidden_dim // num_heads
self.rotary_ndims = int(self.attn_head_size * rotary_percentage)
It seems to me like you can move this line down into the rotary layer itself, you can get at attn_head_size simply by reading the shape of the passed query and value, right?
I would pass percentage and max_wavelength directly as arguments to RotaryEmbedding, and keep all the logic there, that will keep things more compartmentalized.
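Roughly, the refactored layer could look like the sketch below, with the slicing, rotation, and concat handled entirely inside call. This is only a sketch under the naming assumptions above (percentage, max_wavelength), and it assumes inputs shaped [batch, seq_len, num_heads, attn_head_size] with an even rotary size.

import tensorflow as tf
from tensorflow import keras

class RotaryEmbedding(keras.layers.Layer):
    # Sketch: rotates the first `percentage` of each head's dims, passes the rest through.
    def __init__(self, percentage=0.25, max_wavelength=10000, **kwargs):
        super().__init__(**kwargs)
        self.percentage = percentage
        self.max_wavelength = max_wavelength

    def call(self, tensor):
        # tensor: [batch, seq_len, num_heads, attn_head_size]
        attn_head_size = tensor.shape[-1]
        rotary_ndims = int(attn_head_size * self.percentage)
        seq_len = tf.shape(tensor)[1]

        # Split into the slice that gets rotated and the pass-through slice.
        rotary = tensor[..., :rotary_ndims]
        passthrough = tensor[..., rotary_ndims:]

        # Inverse frequencies, computed on the fly in float32.
        freq_range = tf.range(0, rotary_ndims, 2, dtype="float32")
        inverse_freq = 1.0 / (self.max_wavelength ** (freq_range / rotary_ndims))

        # Outer product of positions and frequencies -> [seq_len, rotary_ndims].
        positions = tf.range(seq_len, dtype="float32")
        freqs = tf.einsum("i,j->ij", positions, inverse_freq)
        embedding = tf.concat([freqs, freqs], axis=-1)[None, :, None, :]
        cos_emb = tf.cast(tf.cos(embedding), tensor.dtype)
        sin_emb = tf.cast(tf.sin(embedding), tensor.dtype)

        # rotate_half trick: (x1, x2) -> (-x2, x1).
        x1, x2 = tf.split(rotary, 2, axis=-1)
        half_rotated = tf.concat([-x2, x1], axis=-1)
        rotated = rotary * cos_emb + half_rotated * sin_emb

        return tf.concat([rotated, passthrough], axis=-1)

The attention layer would then just call query = self.rotary_embedding(query) and key = self.rotary_embedding(key) as suggested above.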
outputs match after this refactor :)
keras_model.get_layer("layer_norm").beta.assign(hf_wts["final_layer_norm.bias"])

hf_tokenizer = AutoTokenizer.from_pretrained(PRESET)
we should update this section to check tokenizer output for some simple input as well, now that we have added the tokenizer here.
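For instance, a parity check along these lines would do (a sketch: keras_tokenizer stands in for the GPTNeoXTokenizer built from the same vocabulary and merges as the HF checkpoint, and the checkpoint name is just an example):

from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

sample = "The quick brown fox jumped over the lazy dog."
hf_ids = hf_tokenizer(sample)["input_ids"]
# keras_tokenizer is the KerasNLP tokenizer under test (hypothetical name here).
keras_ids = keras_tokenizer(sample).numpy().tolist()

print(hf_ids)
print(keras_ids)
assert hf_ids == keras_ids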
Added this part; not using tokenizer input for the model as we still don't have a preprocessor. Outputs of the tokenizer are the same as the HF tokenizer. By the way, they are using the gpt-neox-20b vocabulary for the Pythia suite.
/gcbrun
This looks great! Pushing some minor style fixes as I land. For future PRs:
- Make sure to follow the 80 character line limit for docstrings, besides links.
- Do document argument types, but do not document default values unless they are complex in some way. Simple defaults will already render in the keras.io signatures, e.g. https://keras.io/api/keras_nlp/models/bert/bert_backbone/.
Once tests are green I will pull this in!
/gcbrun
Sounds good @mattdangerw! All tests pass.
Partially completes #1052. Pythia uses the GPT-NeoX architecture.
This PR adds implementations of:
- RotaryEmbedding (paper)
- GPTNeoXAttention
- GPTNeoXDecoder layer
- GPTNeoXBackbone
Colab