Add BERT-style masking function #55
Conversation
mask_stop_1 = mask_config.mask_prob * mask_config.mask_token_prob
mask_stop_2 = mask_config.mask_prob * (mask_config.mask_token_prob + mask_config.random_token_prob)

random_draws = torch.rand(tokenized_sequence.shape)
Remind me, is this in [0,1]? Can you add a comment to that effect?
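For reference, `torch.rand` samples uniformly from the half-open interval [0, 1). A minimal sketch of how the thresholds above partition those draws -- the config values here are made up for illustration, not the PR's defaults:

```python
import torch

# Illustrative values only -- not necessarily the PR's defaults.
mask_prob = 0.15          # fraction of tokens selected for the MLM loss
mask_token_prob = 0.8     # of the selected tokens, fraction replaced with [MASK]
random_token_prob = 0.1   # of the selected tokens, fraction replaced with a random token

tokenized_sequence = torch.randint(0, 100, (16,))

# torch.rand draws uniformly from [0, 1), so cumulative thresholds split the
# selected tokens into [MASK] / random-token / keep-identity groups.
mask_stop_1 = mask_prob * mask_token_prob
mask_stop_2 = mask_prob * (mask_token_prob + random_token_prob)
random_draws = torch.rand(tokenized_sequence.shape)

mask_token_mask = random_draws < mask_stop_1
random_token_mask = (mask_stop_1 <= random_draws) & (random_draws < mask_stop_2)
loss_mask = random_draws < mask_prob   # every token selected for the loss
```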
Force-pushed from 849878d to 8011b65.
This is probably because Geneformer used a random token mask rate of 0.02. Can you try with that setting? Also make sure we're sampling the larger vocabulary.
@pstjohn commented on this pull request.
In sub-packages/bionemo-contrib/tests/bionemo/contrib/model/biobert/test_model.py:
@@ -711,7 +714,7 @@ def test_inference_loss_10m_released_checkpoint(geneformer_config: BioBertConfig
# the target is defined as described above for the 10M checkpoint based on our first pass
# of the megatron implementation. Since we manually passed experiment 1 this experiment
# will define our initial "golden value" test target.
- target: float = 2.368649959564209
+ target: float = 2.7
@jstjohn, is this OK? I'm wondering if this is from a change in the mask seeds? Or does this indicate some difference in the masking thresholds?
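If it helps to picture the setting suggested above, here is a hedged sketch of a masking config with the Geneformer-style 0.02 random-token rate. The dataclass name and the other defaults are assumptions; only the field names mirror the hunk earlier in this thread:

```python
from dataclasses import dataclass

@dataclass
class BertMaskConfig:  # hypothetical name; the PR's actual config class may differ
    mask_prob: float = 0.15          # fraction of tokens selected for masking
    mask_token_prob: float = 0.8     # of those, fraction replaced with [MASK]
    random_token_prob: float = 0.02  # Geneformer-style random-token rate suggested above

geneformer_style_config = BertMaskConfig()
```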
Force-pushed from 8011b65 to 9b74c8e.
Force-pushed from 9b74c8e to c1da5ef.
I think we should consider a wrapper around HuggingFace's AutoTokenizer that allows importing tokenizers from HF models, including ESM2 and Geneformer. We already have the HF transformers dependency, but I think the drawbacks would be the overhead of import/init and the network-access requirement.
An example of using it for ESM2 tests: https://github.com/NVIDIA/bionemo-fw-ea/blob/9417f08f19ed2a7affd9a0d66374ec0bca330bcb/sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_model.py#L46
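For context, a minimal sketch of what such a wrapper could look like; the class, the checkpoint name, and the `token_to_id` shim are illustrative, and loading requires network access or a local HF cache:

```python
from transformers import AutoTokenizer

class HFTokenizerWrapper:
    """Hypothetical thin wrapper exposing the token_to_id interface used in this PR."""

    def __init__(self, pretrained_name: str):
        # Downloads the tokenizer (or loads it from the local cache).
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_name)
        self.pad_token = self.tokenizer.pad_token

    def token_to_id(self, token: str) -> int:
        return self.tokenizer.convert_tokens_to_ids(token)

# Example with a public ESM2 checkpoint (name chosen for illustration).
tokenizer = HFTokenizerWrapper("facebook/esm2_t33_650M_UR50D")
pad_id = tokenizer.token_to_id(tokenizer.pad_token)
```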
Force-pushed from c1da5ef to 7f41ff5.
Good suggestion -- I'm realizing we don't need the tokenizer in this PR, so I pulled it out. But I'll use HuggingFace's in the next one.
Force-pushed from 7f41ff5 to c03d718.
Force-pushed from c07a198 to 2dc5d3f.
Force-pushed from 2dc5d3f to 836230b.
Love this!
collate_fn=functools.partial(
    collate.bert_padding_collate_fn,
    padding_value=self.tokenizer.token_to_id(GeneTokenizer.pad_token),
    max_length=self.max_len,
Can you add `min_length=None` here to kind of self-document that we're allowing for fewer than `max_len` tokens if all elements of a batch are shorter?
done!
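As applied, the suggestion amounts to spelling the default out explicitly. A sketch of the resulting call, assuming `bert_padding_collate_fn` accepts a `min_length` keyword, with the surrounding names (`collate`, `self.tokenizer`, `GeneTokenizer`, `self.max_len`) taken from the hunk above:

```python
collate_fn = functools.partial(
    collate.bert_padding_collate_fn,
    padding_value=self.tokenizer.token_to_id(GeneTokenizer.pad_token),
    min_length=None,  # explicit: a batch shorter than max_length is not padded up to it
    max_length=self.max_len,
)
```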
collate_fn=functools.partial(
    collate.bert_padding_collate_fn,
    padding_value=tokenizer.token_to_id(tokenizer.pad_token),
    max_length=2048,
I would call out `min_length=None` explicitly so it's easy to see that we are not padding to a particular length if all items are shorter.
done
Force-pushed from 836230b to 98ed102.
    Used to seed a torch random generator from a numpy random generator.
    """
    return rng.integers(np.iinfo(np.int64).max)
Q: Is the seed int64 so we can get more range?
rng.integers returns an int64 by default; we're just asking for any random int64.
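A small sketch of the pattern this helper enables, i.e. deriving a torch generator seed from a numpy `Generator`; the function name here is illustrative:

```python
import numpy as np
import torch

def seed_from_numpy(rng: np.random.Generator) -> int:
    # rng.integers returns an int64 by default; any random 64-bit value works as a seed.
    return int(rng.integers(np.iinfo(np.int64).max))

np_rng = np.random.default_rng(42)
torch_gen = torch.Generator().manual_seed(seed_from_numpy(np_rng))
draws = torch.rand(4, generator=torch_gen)
```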
gene_data, col_idxs, feature_ids = self.lookup_cell_by_idx(idx)
return process_item(
    gene_data,
    col_idxs,
    feature_ids,
    self.tokenizer,
    gene_median=self.gene_medians,
    rng=rng,
is this seed fixed? Comment?
`rng` is defined on line 197 -- `rng = np.random.default_rng([self._seed, idx])` -- so it's deterministic as a function of the class seed and item index. Megatron's sampler assumes that datasets are deterministic, so that when you call `__getitem__(i)` on each GPU rank, they all get the same data. I don't think we were obeying that constraint previously.
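A sketch of the determinism argument, using a stand-in dataset rather than the PR's class: seeding with `[seed, idx]` means every rank that requests the same index rebuilds the same generator and therefore sees the same draws.

```python
import numpy as np

class DeterministicToyDataset:
    """Stand-in dataset whose __getitem__ is a pure function of (seed, idx)."""

    def __init__(self, seed: int):
        self._seed = seed

    def __getitem__(self, idx: int) -> np.ndarray:
        # Same (seed, idx) -> same generator -> same values on every GPU rank.
        rng = np.random.default_rng([self._seed, idx])
        return rng.random(5)

rank0 = DeterministicToyDataset(seed=123)
rank1 = DeterministicToyDataset(seed=123)
assert np.allclose(rank0[7], rank1[7])
```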
@@ -671,6 +680,12 @@ def _get_loss_from_model(model_config: GeneformerConfig, seed: int) -> float:
    batch_size=8,
    shuffle=False,
    num_workers=0,
    collate_fn=functools.partial(
it's properly masking in collation? nice
oh, no -- it just pads in collation. The masking is happening in dataset.py:293, `apply_bert_pretraining_mask`.
this is correct
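To keep the division of labor straight, a hedged sketch of the split described above: masking happens per item in the dataset (via `apply_bert_pretraining_mask`), while collation only pads the already-masked sequences. The collate helper below is illustrative, not the PR's implementation.

```python
import torch

def padding_only_collate(batch: list[torch.Tensor], padding_value: int) -> torch.Tensor:
    # Masking already happened in Dataset.__getitem__ (apply_bert_pretraining_mask);
    # collation just pads the masked sequences to the longest length in the batch.
    return torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=padding_value)

batch = [torch.tensor([5, 6, 7]), torch.tensor([8, 9])]
padded = padding_only_collate(batch, padding_value=0)  # shape (2, 3)
```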
Great job writing unit tests!!
Also corrects a number of geneformer scripts that relied on a bug in the previous masking function that assigned the wrong number of random tokens. With the new function that assigns the correct number of random tokens, we set the random token mask percentage lower to match previous results. Signed-off-by: Peter St. John <[email protected]>
Force-pushed from 98ed102 to e0af43c.
/build-ci
Splitting this off from #49 to make it easier to talk about some general-purpose BERT masking functions