Generate embedding for unnatural amino acid #162

Frank-LIU-520 · 2024-12-05T06:53:01Z

I have sequences containing unnatural amino acids, for example [β-alanine] and δ-[aminolevulinic acid]. How can I generate embeddings for such sequences? How can I add extra tokens for these amino acids and train them in a self-supervised manner?

mheinzinger · 2024-12-06T13:41:31Z

Here is a description of how to add new tokens to the vocabulary:
https://www.depends-on-the-definition.com/how-to-add-new-tokens-to-huggingface-transformers/
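
As a rough, library-independent illustration of what adding tokens involves, the sketch below extends a toy vocabulary and makes sure a bracketed residue name is tokenized as a single unit. Everything in it is an assumption for illustration only: the vocabulary, the token names `[bAla]` and `[dALA]`, and the helper functions are hypothetical and not part of any model in this repository. In Hugging Face transformers, the corresponding calls are `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
import re

# Toy vocabulary (hypothetical): three special tokens plus the 20 canonical amino acids.
vocab = {tok: i for i, tok in enumerate(["<pad>", "<unk>", "<mask>"] + list("ACDEFGHIKLMNPQRSTVWY"))}

def add_tokens(vocab, new_tokens):
    """Append unseen tokens to the vocabulary and return how many were added."""
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # new tokens get the next free id
            added += 1
    return added

# A bracketed name counts as one token; any other non-space character is one residue.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|\S")

def tokenize(sequence):
    """Split a sequence into residues, keeping bracketed names intact."""
    return TOKEN_PATTERN.findall(sequence)

def encode(sequence, vocab):
    """Map a sequence to token ids, falling back to <unk> for unknown tokens."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokenize(sequence)]

# Hypothetical token names for beta-alanine and delta-aminolevulinic acid.
n_added = add_tokens(vocab, ["[bAla]", "[dALA]"])
ids = encode("MK[bAla]G[dALA]", vocab)
```

With a real model, the embedding matrix has to grow by the same number of rows as the vocabulary, which is what `resize_token_embeddings` takes care of.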
We are currently working on a script for simplifying continuation of self-supervised pre-training. I'll post it in the README once we have everything set up :)
One thought I have always had on adding those exotic AAs: you most likely have only a few example sequences containing them, which might make the usual self-supervised pre-training insufficient for this use case (it usually requires many samples). Maybe one of these two things helps:

- Do not mask purely at random; instead, favor the new tokens when masking. If they are not masked, no loss is computed for them (this depends somewhat on your setup, but usually no loss is computed for non-corrupted input).
- Instead of randomly initializing the embeddings of the new tokens, try to recycle the existing non-contextualized AA embeddings (those in the very first layer, where no context has been added yet and each is a plain single-AA embedding). Your new AAs will most likely be more similar to some AAs than to others, so you could initialize each new token with a weighted average over existing AA embeddings, where the weights reflect the biochemical similarity between the AAs.
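
Both suggestions can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the upcoming pre-training script: the function names, the masking probabilities, the toy 2-dimensional embeddings, and the similarity weights are all made up, and the masking is simple BERT-style single-token replacement, which may differ from the corruption scheme of the model you continue training.

```python
import random

def init_from_similar(embeddings, weights):
    """Return the weighted average of existing token embeddings.

    embeddings: dict mapping token -> list of floats (non-contextualized embeddings)
    weights:    dict mapping token -> non-negative similarity weight (placeholder values)
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(next(iter(embeddings.values())))
    new_vec = [0.0] * dim
    for tok, w in weights.items():
        for i, x in enumerate(embeddings[tok]):
            new_vec[i] += (w / total) * x  # weights are normalized to sum to 1
    return new_vec

def biased_mask(tokens, new_tokens, p_new=0.8, p_other=0.15, mask_token="<mask>", rng=None):
    """Mask new tokens with probability p_new and all other tokens with p_other.

    Returns (corrupted, labels); a label of None marks a position without loss.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        p = p_new if tok in new_tokens else p_other
        if rng.random() < p:
            corrupted.append(mask_token)
            labels.append(tok)   # loss is computed only where the input was corrupted
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

# Toy embeddings and made-up weights: treat the new token as mostly alanine-like.
toy_embeddings = {"A": [1.0, 0.0], "G": [0.0, 1.0], "W": [5.0, 5.0]}
bala_vec = init_from_similar(toy_embeddings, {"A": 3.0, "G": 1.0})  # [0.75, 0.25]

# Seeded generator so the corruption is reproducible across runs.
corrupted, labels = biased_mask(["M", "K", "[bAla]", "G"], {"[bAla]"}, rng=random.Random(0))
```

With a Hugging Face model, the averaged vector would be copied into the new token's row of `model.get_input_embeddings().weight` after resizing the embedding matrix, and the `None` labels would typically become the ignore index of the loss (`-100` for PyTorch's cross-entropy).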

Good luck!
