Generate embedding for unnatural amino acid #162

Open
Frank-LIU-520 opened this issue Dec 5, 2024 · 1 comment

Comments

@Frank-LIU-520

I have sequences containing unnatural amino acids, for example [β-alanine] and δ-[aminolevulinic acid]. How can I generate embeddings for such sequences? How can I add extra tokens for these amino acids and train them in a self-supervised manner?

@mheinzinger
Collaborator

Here is a description of how to add new tokens to the vocabulary:
https://www.depends-on-the-definition.com/how-to-add-new-tokens-to-huggingface-transformers/
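
As a rough illustration of what that description boils down to, here is a minimal sketch. It assumes a Hugging Face tokenizer/model pair, where `tokenizer.add_tokens` and `model.resize_token_embeddings` are the relevant calls. The token names `[bAla]` and `[dALA]` and both helper names are made up for this example and are not part of any released vocabulary.

```python
import re

# Hypothetical token names for the two residues from the question.
NEW_TOKENS = ["[bAla]", "[dALA]"]


def split_residues(sequence, new_tokens=NEW_TOKENS):
    """Split a sequence into residues, keeping bracketed tokens in one piece."""
    # Try longer tokens first, then fall back to a single character.
    ordered = sorted(new_tokens, key=len, reverse=True)
    pattern = "|".join([re.escape(t) for t in ordered] + ["."])
    return re.findall(pattern, sequence)


def add_residue_tokens(tokenizer, model, new_tokens=NEW_TOKENS):
    """Register the new tokens and grow the embedding matrix to match.

    `tokenizer` and `model` are expected to behave like Hugging Face
    objects; the rows added for the new tokens start out untrained.
    """
    num_added = tokenizer.add_tokens(list(new_tokens))
    model.resize_token_embeddings(len(tokenizer))
    return num_added
```

If the tokenizer expects space-separated residues, the input could then be built with `" ".join(split_residues("MK[bAla]LV"))`. The new embedding rows carry no information until they are trained or initialized from existing ones.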
We are currently working on a script to simplify continuing self-supervised pre-training. I'll post it in the README once everything is set up :)
One thought I have always had about adding those exotic AAs: you most likely have only a few example sequences containing them, which might make the usual self-supervised pre-training insufficient for this use case (it usually requires many samples). Maybe one of these two things helps:

  • Do not mask purely at random; instead, favor the new tokens for masking. If you do not mask them, no loss will be computed for them (this depends a bit on your setup, but usually no loss is computed for non-corrupted input).
  • Instead of randomly initializing the embeddings of the new tokens, try to recycle existing non-contextualized AA embeddings (those in the very first layer; no context has been added at this point, they are just plain single-AA embeddings). Your new AAs will most likely be more similar to some AAs than to others, so you could initialize the new tokens with a weighted average over existing AA embeddings, where the weights are defined by biochemical similarity between the AAs.
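
Both ideas can be sketched in plain Python. This is only an illustration under assumptions: the masking rates, the `<mask>` placeholder, the similarity weights, and all function names are invented for the example, and a real setup would operate on token IDs and the model's embedding matrix rather than on strings and lists.

```python
import random


def choose_mask_positions(tokens, new_tokens, base_rate=0.15,
                          new_rate=0.8, rng=None):
    """Pick positions to mask, favoring the newly added tokens."""
    if rng is None:
        rng = random.Random()
    positions = []
    for i, tok in enumerate(tokens):
        rate = new_rate if tok in new_tokens else base_rate
        if rng.random() < rate:
            positions.append(i)
    return positions


def apply_mask(tokens, positions, mask_token="<mask>"):
    """Return the corrupted input and per-position targets.

    A target of None means the position was not corrupted and would
    contribute no loss in a typical masked-language-model setup.
    """
    masked = set(positions)
    corrupted = [mask_token if i in masked else t for i, t in enumerate(tokens)]
    targets = [t if i in masked else None for i, t in enumerate(tokens)]
    return corrupted, targets


def weighted_average_embedding(embeddings, weights):
    """Average existing single-AA embeddings using similarity weights.

    embeddings: dict mapping residue -> list of floats (first-layer vectors)
    weights:    dict mapping residue -> non-negative similarity score
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("similarity weights must sum to a positive value")
    dim = len(next(iter(embeddings.values())))
    result = [0.0] * dim
    for residue, weight in weights.items():
        for j, value in enumerate(embeddings[residue]):
            result[j] += (weight / total) * value
    return result
```

For instance, a new β-alanine token might be initialized from the alanine and glycine vectors with weights chosen from whatever biochemical similarity measure you trust. With a Hugging Face PyTorch model, the resulting vector would typically be copied into the matching row of `model.get_input_embeddings().weight` inside `torch.no_grad()`.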

Good luck!
