I have sequences with unnatural amino acids, for example [β-alanine] and δ-[aminolevulinic acid]. How can I generate embeddings for such sequences? How can I add extra tokens for these amino acids and train them in a self-supervised manner?
Here is a description of how to add new tokens to the vocabulary: https://www.depends-on-the-definition.com/how-to-add-new-tokens-to-huggingface-transformers/
We are currently working on a script to simplify continuing self-supervised pre-training. I'll post it in the README once we have everything set up :)
One thought I have always had about adding those exotic AAs: you most likely have only a few example sequences containing them, which might make the usual self-supervised pre-training insufficient for this use case (it usually requires many samples). Maybe one of these two things helps:
1. Do not mask purely at random, but favor the new tokens for masking. If they are not masked, no loss is computed for them (this depends a bit on your setup, but usually no loss is computed for non-corrupted input).
2. Instead of randomly initializing the embeddings of the new tokens, try to recycle the existing non-contextualized AA embeddings (those in the very first layer, where no context has been added yet, just plain single-AA embeddings). Your new AAs will most likely be more similar to some AAs than to others, so you could initialize each new token with a weighted average over existing AA embeddings, where the weights are defined by the biochemical similarity between the AAs.
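To make the two suggestions above a bit more concrete, here is a minimal, library-free sketch in plain Python. Everything in it is an assumption for illustration: the token name `[bAla]`, the toy 3-dimensional embedding table, the similarity weights, and the masking probabilities `p_new` and `p_other` are all hypothetical and not taken from any real model. In an actual setup you would read the input-embedding matrix from your model and plug the chosen positions into your own masking step.

```python
import random


def init_new_embedding(embeddings, weights):
    """Return a weighted average of existing per-residue embedding vectors.

    embeddings: dict mapping token -> list of floats (non-contextualized embeddings)
    weights:    dict mapping existing token -> similarity weight (hypothetical values)
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("similarity weights must sum to a positive value")
    dim = len(next(iter(embeddings.values())))
    new_vec = [0.0] * dim
    for token, w in weights.items():
        for i, x in enumerate(embeddings[token]):
            new_vec[i] += (w / total) * x  # normalize so the weights sum to 1
    return new_vec


def choose_mask_positions(tokens, new_tokens, p_new=0.5, p_other=0.15, rng=None):
    """Pick positions to mask, favoring the newly added tokens.

    Each position is masked independently: with probability p_new if it holds
    one of the new tokens, otherwise with probability p_other.
    """
    rng = rng or random.Random()
    positions = []
    for i, tok in enumerate(tokens):
        p = p_new if tok in new_tokens else p_other
        if rng.random() < p:
            positions.append(i)
    return positions


# Toy embedding table (assumption: real embeddings have far more dimensions).
embeddings = {
    "A": [1.0, 0.0, 0.0],
    "G": [0.0, 1.0, 0.0],
    "W": [0.0, 0.0, 1.0],
}

# Hypothetical similarity weights: treat the new residue as closer to A than to G.
embeddings["[bAla]"] = init_new_embedding(embeddings, {"A": 3.0, "G": 1.0})

# A sequence containing the new token, already split into tokens.
sequence = ["A", "[bAla]", "G", "W", "[bAla]"]
masked = choose_mask_positions(sequence, {"[bAla]"}, rng=random.Random(0))
```

Raising `p_new` relative to `p_other` means the rare tokens contribute to the loss more often per sequence, which is the point of the first suggestion. The right ratio is something you would have to tune for your data.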