Replies: 1 comment

It is answered in #86.
Hi,

I am trying to use the ESM3 sequence logits for a downstream analysis.

When I run a forward pass on a single sequence, the shape of out.sequence_logits (see code below) is 1 x input length x 64. The tokenizer implies the last dimension should be 33, but when I look into the code, the sequence output head is indeed set to dimension 64 (https://github.com/evolutionaryscale/esm/blob/0774600af03d724e8244d577c415e10617f018fe/esm/models/esm3.py#L160C9-L160C57).

Is it the case that only the first 33 entries are meaningful (as the tokenizer's vocabulary would suggest)?

Am I missing something, or is there a better way to get the sequence logits?

Thanks!
Code:

import torch
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESM3InferenceClient
from esm.tokenization.sequence_tokenizer import EsmSequenceTokenizer

login()

# Load the open-weights ESM3 model
model: ESM3InferenceClient = ESM3.from_pretrained("esm3_sm_open_v1").to("cpu")  # or "cuda"

tokenizer = EsmSequenceTokenizer()
prompt = "DQATSLRILNNGHAFNVEFDDSQDKAOO"

# encode() adds BOS/EOS tokens, so the 28-residue prompt becomes 30 tokens
enc_prompt = tokenizer.encode(prompt)
input_ids = torch.tensor(enc_prompt, dtype=torch.int64).unsqueeze(0)

out = model(sequence_tokens=input_ids)
out.sequence_logits.shape  # result: torch.Size([1, 30, 64])
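
In case it clarifies what I'm after, here is a minimal sketch of my intended downstream use, under the assumption (which I have not confirmed) that the entries beyond the tokenizer's 33-token vocabulary are just padding:

# Assumption: logits[..., 33:64] are padding and can be dropped.
vocab_size = len(tokenizer)  # 33 for EsmSequenceTokenizer
logits = out.sequence_logits[..., :vocab_size]  # shape: [1, 30, 33]

# Per-position probabilities over the real vocabulary
probs = torch.softmax(logits, dim=-1)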