Formatting for ProtT5 labels #137
Hello, in section 2.4 under "ProtT5" in our paper, we mention the following: we followed BERT-style noising and denoising with a single sentinel. This means that an original sequence such as "E V Q L V E S G A E" is corrupted and reconstructed using that single sentinel token.
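A sketch of what such single-sentinel noising could look like on the input side is shown below; this is not the authors' training code, and the masking probability and helper names are illustrative only. Which decoder label to pair with the corrupted input is exactly the question discussed further down in this thread.

```python
# Illustrative sketch of BERT-style single-token noising with a single
# sentinel token, based on the description above (not the original
# ProtT5 training code).
import random

def noise_sequence(residues, mask_prob=0.15, sentinel="<extra_id_0>"):
    """Replace each randomly selected residue with the same sentinel token."""
    return " ".join(sentinel if random.random() < mask_prob else r
                    for r in residues)

original = "E V Q L V E S G A E"
print(noise_sequence(original.split()))
# e.g. "E V <extra_id_0> L V E S G <extra_id_0> E"
```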
(1) Changed the tokenization of labels from a custom dictionary to the ProtT5 tokenization.
(2) Inputted masks directly into the sequences (used `<extra_id_0>`, as per agemagician/ProtTrans#137).
(3) Spaced out the amino acids (necessary for tokenization) and consequently commented out the spacing-out step in training, i.e. `train_df.apply(lambda row : " ".join(row["sequence"]), axis = 1)` and `valid_df.apply(lambda row : " ".join(row["sequence"]), axis = 1)` (see the sketch after this list).
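A minimal sketch of how steps (1) to (3) might fit together; the Rostlab/prot_t5_xl_uniref50 tokenizer, the DataFrame layout, and the masked position are assumptions for illustration, not taken from the referenced change.

```python
# Minimal sketch of steps (1)-(3); checkpoint name and DataFrame layout
# are assumptions for illustration only.
import pandas as pd
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50",
                                        do_lower_case=False)

train_df = pd.DataFrame({"sequence": ["EVQLVESGAE"]})

# (3) Space out the amino acids so each residue is its own token.
spaced = train_df.apply(lambda row: " ".join(row["sequence"]), axis=1)

# (2) Put the mask directly into the sequence text, using <extra_id_0>.
def mask_position(spaced_seq, idx, sentinel="<extra_id_0>"):
    tokens = spaced_seq.split()
    tokens[idx] = sentinel
    return " ".join(tokens)

masked = spaced.apply(lambda s: mask_position(s, 2))

# (1) Tokenize with the ProtT5 tokenizer instead of a custom dictionary.
encoded = tokenizer(masked.tolist(), padding=True, return_tensors="pt")
print(encoded.input_ids)
```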
Hello,
I have been trying to understand the ProtT5 model and how to compute a loss for the full encoder-decoder.
Looking through GitHub issues in this repository, it is suggested in multiple places that the format for predicting masked residues should be, e.g. for a poly-alanine sequence "AAAAA":
input: "A <extra_id_0> A A <extra_id_1>"
label: "<extra_id_0> A <extra_id_1> A"
which is similar to how HuggingFace describes T5 training: https://huggingface.co/docs/transformers/model_doc/t5#training
However, trying this results in a substantially worse loss than simply using the original sequence as the label.
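The comparison might look roughly like the sketch below; this is not the exact snippet from the issue, and the Rostlab/prot_t5_xl_uniref50 checkpoint and the poly-alanine example above are assumed. The loss values quoted afterwards are the ones reported by the issue author.

```python
# Sketch of the loss comparison: full unmasked sequence as label (case 1)
# versus sentinel-span label (case 2). The checkpoint name is an assumption.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.eval()

masked_input = "A <extra_id_0> A A <extra_id_1>"
labels_to_try = {
    "case 1: full sequence": "A A A A A",
    "case 2: sentinel spans": "<extra_id_0> A <extra_id_1> A",
}

enc = tokenizer(masked_input, return_tensors="pt")
with torch.no_grad():
    for name, label_text in labels_to_try.items():
        labels = tokenizer(label_text, return_tensors="pt").input_ids
        loss = model(input_ids=enc.input_ids,
                     attention_mask=enc.attention_mask,
                     labels=labels).loss
        print(f"{name}: NLL = {loss.item():.2f}")
```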
Such a run shows a negative log-likelihood loss of roughly 1.2 for the first case and 40 for the second, with the first decreasing as expected as the number of masked residues is reduced, while the second stays roughly constant.
This makes me think that the correct way to further pre-train the model would be to pass the full unmasked sequence as the label rather than the masked tokens. Is that correct?