
Formatting for ProtT5 labels #137

Closed
exs-fdreyer opened this issue Nov 15, 2023 · 1 comment

Comments


exs-fdreyer commented Nov 15, 2023

Hello,

I have been trying to understand the ProtT5 model and how to compute a loss for the full encoder-decoder.
Looking through the GitHub issues on this repository, it is suggested in several places that the format for predicting masked residues should be, e.g. for a poly-alanine sequence "AAAAA":
input: "A <extra_id_0> A A <extra_id_1>"
label: "<extra_id_0> A <extra_id_1> A"

This is similar to how HuggingFace describes T5 training: https://huggingface.co/docs/transformers/model_doc/t5#training

However, trying this results in a substantially worse loss than simply using the original sequence as the label. E.g., running the following code:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_uniref50', do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained('Rostlab/prot_t5_xl_uniref50')
# input sequences "EVQLVESGAE" and "AAAAAAAAAA"
label_seq = tokenizer(["E V Q L V E S G A E", "A A A A A A A A A A"], return_tensors="pt").input_ids
# mask some of the residues with sentinel tokens
input_seq = tokenizer(["E V <extra_id_0> L <extra_id_1> E S G <extra_id_2> E", "A A <extra_id_0> A A <extra_id_1> A A <extra_id_2> A"], return_tensors="pt").input_ids
# suggested format for the labelling of masked tokens ("<extra_id_0> Q <extra_id_1> V <extra_id_2> A", etc)
label_seq_alt = tokenizer(["<extra_id_0> Q <extra_id_1> V <extra_id_2> A", "<extra_id_0> A <extra_id_1> A <extra_id_2> A"], return_tensors="pt").input_ids

print(model(input_ids=input_seq, labels=label_seq).loss)
print(model(input_ids=input_seq, labels=label_seq_alt).loss)

shows a negative log-likelihood loss of 1.2 for the first case and 40 for the second, with the first going down as expected as the number of masked residues is reduced, while the second stays roughly constant.
This makes me think that the correct way to further pre-train the model would be to pass the full unmasked sequence as the label rather than the masked tokens. Is that correct?

@agemagician
Owner

Hello,

In section 2.4 under "ProtT5" in our paper, we mention the following:

Contrary to the original T5 model which masks spans of multiple tokens, we adopted BERT’s denoising objective to corrupt and reconstruct single tokens using a masking probability of 15 percent.

So we followed BERT-style noising and denoising with a single sentinel. This means that if the original sequence is "E V Q L V E S G A E", then:

  • Input sequence should be something like "E V <extra_id_0> L <extra_id_0> E S G <extra_id_0> E".
  • Label sequence should be "E V Q L V E S G A E".
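
A minimal sketch of further pre-training with this scheme, reusing the tokenizer and model loaded in the snippet above (the corrupt helper below is hypothetical and not part of ProtTrans; it replaces each residue with the single sentinel at roughly 15% probability):

import random

def corrupt(seq, mask_prob=0.15):
    # BERT-style corruption with a single sentinel: each residue is replaced by
    # <extra_id_0> with probability mask_prob; residues are space-separated as
    # the ProtT5 tokenizer expects.
    return " ".join("<extra_id_0>" if random.random() < mask_prob else aa for aa in seq)

original = "EVQLVESGAE"
# Label is the full, unmasked sequence (spaced out for tokenization).
label_ids = tokenizer([" ".join(original)], return_tensors="pt").input_ids
# Input is the corrupted sequence, masked with the single sentinel token.
input_ids = tokenizer([corrupt(original)], return_tensors="pt").input_ids

# Standard seq2seq cross-entropy loss, usable for further pre-training.
loss = model(input_ids=input_ids, labels=label_ids).loss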

salber710 added a commit to ArnNag/cs182-tcrfinetuning that referenced this issue Nov 27, 2023
(1) Changed Tokenization of Labels from custom dictionary to the ProtT5 Tokenization
(2) Inputted masks directly into sequences (used "<extra_id_0>" as per agemagician/ProtTrans#137) 
(3) Spaced out amino acids (necessary for tokenization) and consequently commented out the spacing out step in the training i.e. train_df.apply(lambda row : " ".join(row["sequence"]), axis = 1) and valid_df.apply(lambda row : " ".join(row["sequence"]), axis = 1)
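
For reference, the spacing step mentioned in (3) can be done on a pandas DataFrame roughly as follows (the train_df and "sequence" column names follow the commit message; the data here is made up for illustration):

import pandas as pd

# Hypothetical DataFrame of raw (unspaced) amino-acid sequences.
train_df = pd.DataFrame({"sequence": ["EVQLVESGAE", "AAAAAAAAAA"]})

# The ProtT5 tokenizer expects residues separated by whitespace.
train_df["sequence"] = train_df.apply(lambda row: " ".join(row["sequence"]), axis=1)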