
Corrected the epsilon value #665

Merged: 6 commits into keras-team:master, Jan 17, 2023

Conversation

@soma2000-lang (Contributor) commented on Jan 16, 2023

Addresses #642.

The correct epsilon value for BERT is 1e-12. We pass this value to the LayerNorm layer after the embeddings (https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/bert/bert_backbone.py#L149), but we do not pass it to the TransformerEncoder layers (https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/bert/bert_backbone.py#L159-L168), which default layer_norm_epsilon to 1e-5.
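
For reference, a minimal sketch of the intended change (simplified from bert_backbone.py; the argument values below are illustrative BERT-base-style sizes, not the model's real config plumbing):

```python
# Sketch only: pass BERT's epsilon of 1e-12 to each TransformerEncoder layer
# instead of relying on the layer's 1e-5 default. The sizes here are
# BERT-base-style values used purely for illustration.
import keras_nlp

encoder_layer = keras_nlp.layers.TransformerEncoder(
    num_heads=12,
    intermediate_dim=3072,
    dropout=0.1,
    layer_norm_epsilon=1e-12,  # match the LayerNorm applied after embeddings
)
```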

@abheesht17

@jbischof (Contributor) left a comment

@soma2000-lang found a bug. If you run pytest keras_nlp/models/bert you can check your work locally. See our contribution guide for more details (link).

Review thread on keras_nlp/models/bert/bert_backbone.py (outdated, resolved)
@abheesht17 (Collaborator) commented

@soma2000-lang, could you also do it for the other backbone models?

@jbischof (Contributor) commented

@abheesht17, do we know that 1e-12 is a better default value and that it is used in those papers? If so, should we change it in TransformerEncoder and TransformerDecoder so that we have good defaults for everyone?

@abheesht17 (Collaborator) commented

@jbischof, I believe for all our layers we've kept the defaults the same as what they are in the underlying layers. For example, the kernel initialiser is "glorot_uniform" even though most models use truncated normal or random normal. Hence, I think we should probably not change the defaults in our layers.

Most models (BERT, ALBERT) use 1e-12. DeBERTa has 1e-7. GPT-2 uses 1e-5. There is no consensus as such. Anyway, epsilon is added to the variance in the denominator while normalising and is there to prevent divide-by-zero errors... so it probably does not have a huge impact? Dunno.
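
A minimal layer-norm sketch (ignoring the learned scale and offset), just to show where epsilon enters:

```python
# Minimal sketch of layer normalization without the learned gamma/beta.
# Epsilon is added to the variance before the square root, so it only
# matters when the variance is close to zero.
import numpy as np

def layer_norm(x, epsilon=1e-12):
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon)
```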

@jbischof (Contributor) commented

Makes sense! I guess there's a significant amount of research one has to do for this PR then?

@soma2000-lang (Contributor, Author) commented

> @soma2000-lang, could you also do it for the other backbone models?

Sure @abheesht17

@soma2000-lang (Contributor, Author) commented

@abheesht17 @jbischof I ran black on this particular file for proper code formatting. Is it okay, or should I undo the changes black made?

@abheesht17 (Collaborator) commented

> Makes sense! I guess there's a significant amount of research one has to do for this PR then?

Not really. The epsilon value to be passed to the TransformerEncoder layers is the same as the value passed to the LayerNorm layer after the embedding layers. I've already checked the official repos.

For example, DeBERTav3: https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/deberta_v3/deberta_v3_backbone.py#L131.
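
The pattern to follow for each backbone looks roughly like this (a simplified illustration using generic layers rather than each model's actual encoder class; the constant name is made up here): thread a single epsilon value through both the post-embedding LayerNorm and the encoder layers.

```python
# Illustrative pattern only: reuse one per-model epsilon constant for both the
# post-embedding LayerNorm and the encoder layers (e.g. 1e-12 for BERT/ALBERT,
# 1e-7 for DeBERTaV3, 1e-5 for GPT-2).
from tensorflow import keras
import keras_nlp

LAYER_NORM_EPSILON = 1e-12  # hypothetical per-model constant

embedding_norm = keras.layers.LayerNormalization(
    epsilon=LAYER_NORM_EPSILON,
)
encoder_layer = keras_nlp.layers.TransformerEncoder(
    num_heads=12,
    intermediate_dim=3072,
    layer_norm_epsilon=LAYER_NORM_EPSILON,
)
```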

@abheesht17 (Collaborator) commented

> @abheesht17 @jbischof I ran black on this particular file for proper code formatting. Is it okay, or should I undo the changes black made?

@soma2000-lang, it seems like you are using a different black config. This is the command you have to run:

bash shell/format.sh

This will work only if you are using Linux. If you are using Windows, run these commands:

isort --sp "setup.cfg" --sl <name-of-file>
black --line-length 80 <name-of-file>
flake8 --config "setup.cfg" --max-line-length=200 <name-of-file>

@@ -163,6 +163,7 @@ def __init__(
                     x, approximate=True
                 ),
                 dropout=dropout,
+                layer_norm_epilson=1e-12,
@abheesht17 (Collaborator) commented on Jan 16, 2023

Hey, @soma2000-lang. The tests are failing because epsilon has been misspelt. Could you please correct and push again? Thanks!

@soma2000-lang (Contributor, Author) commented

@abheesht17 Done!

@jbischof (Contributor) commented

@abheesht17 are we happy as is or do we want to add other models to this PR?

@mattdangerw (Member) commented

I'm happy to keep this small and keep the PRs flowing, but let's make sure we keep tracking follow-up work.

@abheesht17 maybe you could edit the description on the initial issue to include changes that would need to be made to other models? Or we could open a separate tracking issue. No strong preference.

@mattdangerw merged commit 99edccf into keras-team:master on Jan 17, 2023
@mattdangerw (Member) commented

Thank you!

@soma2000-lang deleted the Epsilon branch on February 7, 2023.
@abheesht17 mentioned this pull request on Feb 9, 2023 (now closed).