Add BERT-style masking function #55

Merged 1 commit into v2-main on Aug 9, 2024

Conversation

@pstjohn (Collaborator) commented Jul 31, 2024

Splitting this off from #49 to make it easier to talk about some general-purpose BERT masking functions

@pstjohn pstjohn requested review from jstjohn and jomitchellnv July 31, 2024 18:24
@pstjohn pstjohn self-assigned this Jul 31, 2024
mask_stop_1 = mask_config.mask_prob * mask_config.mask_token_prob
mask_stop_2 = mask_config.mask_prob * (mask_config.mask_token_prob + mask_config.random_token_prob)

random_draws = torch.rand(tokenized_sequence.shape)
Collaborator:

Remind me, is this in [0,1]? Can you add a comment to that effect?
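For context, torch.rand samples uniformly from [0, 1), so the two thresholds carve that interval into the usual BERT buckets. A minimal sketch of how thresholds like these could be applied; the probability values and everything beyond the names quoted above are illustrative, not the PR's actual implementation:

```python
import torch

# Illustrative values only; the real ones come from mask_config.
mask_prob = 0.15          # fraction of tokens selected for the MLM loss
mask_token_prob = 0.8     # of those, fraction replaced by [MASK]
random_token_prob = 0.1   # of those, fraction replaced by a random token

mask_stop_1 = mask_prob * mask_token_prob
mask_stop_2 = mask_prob * (mask_token_prob + random_token_prob)

tokenized_sequence = torch.randint(5, 100, (16,))    # fake token ids
random_draws = torch.rand(tokenized_sequence.shape)  # uniform in [0, 1)

loss_mask = random_draws < mask_prob         # tokens scored by the loss
mask_token_mask = random_draws < mask_stop_1  # replaced with [MASK]
random_token_mask = (random_draws >= mask_stop_1) & (random_draws < mask_stop_2)
# Draws in [mask_stop_2, mask_prob) are left unchanged but still scored.
```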

@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch 2 times, most recently from 849878d to 8011b65 Compare August 2, 2024 00:28
@jstjohn (Collaborator) commented Aug 2, 2024 via email

@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch from 8011b65 to 9b74c8e Compare August 2, 2024 00:50
@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch from 9b74c8e to c1da5ef Compare August 2, 2024 16:25
@farhadrgh (Collaborator) commented Aug 2, 2024

I think we should consider the wrapper of the HuggingFace AutoTokenizer that allows importing tokenizers from HF models, including ESM2 and Geneformer. We already have HF transformers, but I think the drawbacks would be the overhead of import/init and the network access requirement.

nemo.collections.common.tokenizers.huggingface.auto_tokenizer

An example of using it for ESM2 tests: https://github.com/NVIDIA/bionemo-fw-ea/blob/9417f08f19ed2a7affd9a0d66374ec0bca330bcb/sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_model.py#L46
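For reference, a minimal sketch of pulling an ESM2 tokenizer straight from the hub with HuggingFace's own AutoTokenizer; the checkpoint name is just an example, and the NeMo wrapper referenced above is assumed to expose a similar interface:

```python
from transformers import AutoTokenizer

# Example checkpoint; any ESM2 or Geneformer tokenizer on the hub works the
# same way. The first call downloads and caches the tokenizer files, which is
# the import/init overhead and network access mentioned above.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

token_ids = tokenizer("MKTAYIAKQR", return_tensors="pt")["input_ids"]
pad_id = tokenizer.pad_token_id
```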

@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch from c1da5ef to 7f41ff5 Compare August 2, 2024 16:34
@pstjohn pstjohn changed the title Add ESM2 tokenizer and BERT-style masking function Add BERT-style masking function Aug 2, 2024
@pstjohn (Collaborator, Author) commented Aug 2, 2024

> I think we should consider the wrapper of the HuggingFace AutoTokenizer that allows importing tokenizers from HF models, including ESM2 and Geneformer. We already have HF transformers, but I think the drawbacks would be the overhead of import/init and the network access requirement.

Good suggestion -- I'm realizing we don't need the tokenizer in this PR, so I pulled it out. But I'll use huggingface's in the next one

@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch from 7f41ff5 to c03d718 Compare August 2, 2024 20:30
@pstjohn pstjohn requested a review from jstjohn August 2, 2024 20:30
@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch 5 times, most recently from c07a198 to 2dc5d3f Compare August 7, 2024 16:04
@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch from 2dc5d3f to 836230b Compare August 7, 2024 16:04
@jstjohn (Collaborator) left a comment:

Love this!

collate_fn=functools.partial(
    collate.bert_padding_collate_fn,
    padding_value=self.tokenizer.token_to_id(GeneTokenizer.pad_token),
    max_length=self.max_len,
@jstjohn (Collaborator) commented Aug 7, 2024:

Can you add min_length=None, here to kind of self-document that we're allowing for fewer than max_len tokens if all elements of a batch are shorter?

@pstjohn (Author):

done!

collate_fn=functools.partial(
    collate.bert_padding_collate_fn,
    padding_value=tokenizer.token_to_id(tokenizer.pad_token),
    max_length=2048,
Collaborator:

I would call out min_length=None, explicitly so it's easy to see that we are not padding to a particular length if all items are shorter.

@pstjohn (Author):

done
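Both threads above request the same change. Assuming bert_padding_collate_fn accepts a min_length keyword, the resulting call would look roughly like this; the import path and the surrounding `tokenizer` object are assumptions taken from the quoted snippet, not the exact merged code:

```python
import functools

from bionemo.llm.data import collate  # import path is an assumption for this sketch

collate_fn = functools.partial(
    collate.bert_padding_collate_fn,
    padding_value=tokenizer.token_to_id(tokenizer.pad_token),  # `tokenizer` from the surrounding module
    min_length=None,  # explicit: no padding up to max_length when the whole batch is shorter
    max_length=2048,
)
```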

@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch from 836230b to 98ed102 Compare August 7, 2024 17:11

    Used to seed a torch random generator from a numpy random generator.
    """
    return rng.integers(np.iinfo(np.int64).max)
Collaborator:

Q: Is the seed int64 so we can get more range?

@pstjohn (Author):

rng.integers returns an int64 by default; we're just asking for any random int64.
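In other words, the helper just turns a NumPy Generator into something torch can consume as a seed. A self-contained sketch of that pattern; the function name here is hypothetical:

```python
import numpy as np
import torch

def seed_from_numpy_rng(rng: np.random.Generator) -> int:
    # Hypothetical name for the helper under discussion: any value in the
    # non-negative int64 range is a valid torch seed, so we just draw one.
    return int(rng.integers(np.iinfo(np.int64).max))

np_rng = np.random.default_rng(42)
torch_generator = torch.Generator().manual_seed(seed_from_numpy_rng(np_rng))
```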

gene_data, col_idxs, feature_ids = self.lookup_cell_by_idx(idx)
return process_item(
    gene_data,
    col_idxs,
    feature_ids,
    self.tokenizer,
    gene_median=self.gene_medians,
    rng=rng,
Collaborator:

Is this seed fixed? Can you add a comment?

@pstjohn (Author):

rng is defined on line 197 -- rng = np.random.default_rng([self._seed, idx])

It's deterministic as a function of the class seed and item index. Megatron's sampler assumes that datasets are deterministic, so that when you call __getitem__(i) on each GPU rank, they all get the same data. I don't think we were obeying that constraint previously.
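A toy illustration of that pattern, re-deriving the RNG inside __getitem__ from (seed, idx) so every rank draws identical values; the class below is illustrative, not Geneformer's actual dataset:

```python
import numpy as np
from torch.utils.data import Dataset


class DeterministicToyDataset(Dataset):
    """Toy example: randomness is a pure function of (dataset seed, item index)."""

    def __init__(self, data, seed: int = 0):
        self.data = data
        self._seed = seed

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Re-deriving the RNG from [seed, idx] means every GPU rank that asks
        # for item `idx` draws exactly the same values, which is what
        # Megatron's sampler assumes.
        rng = np.random.default_rng([self._seed, idx])
        return self.data[idx], rng.random()
```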

@@ -671,6 +680,12 @@ def _get_loss_from_model(model_config: GeneformerConfig, seed: int) -> float:
batch_size=8,
shuffle=False,
num_workers=0,
collate_fn=functools.partial(
Collaborator:

It's properly masking in collation? Nice.

@pstjohn (Author):

Oh, no -- it just pads in collation. The masking is happening in dataset.py:293, apply_bert_pretraining_mask.

Collaborator:

This is correct.
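To make the division of labor concrete, here is a minimal, self-contained sketch of a pad-only collate function in the spirit of bert_padding_collate_fn; the name and signature are illustrative, the real one lives in bionemo's collate module, and masking stays in the dataset's apply_bert_pretraining_mask:

```python
from typing import List, Optional

import torch
from torch.nn.utils.rnn import pad_sequence

def padding_only_collate(
    batch: List[torch.Tensor],
    padding_value: int,
    max_length: Optional[int] = None,
) -> torch.Tensor:
    # Pads each 1-D tensor of token ids to the longest sequence in the batch;
    # no tokens are masked here, mirroring the point above that masking
    # happens earlier, in the dataset's __getitem__.
    padded = pad_sequence(batch, batch_first=True, padding_value=padding_value)
    if max_length is not None:
        padded = padded[:, :max_length]
    return padded

batch = [torch.tensor([5, 6, 7]), torch.tensor([8, 9])]
padded = padding_only_collate(batch, padding_value=0)  # shape (2, 3)
```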

@jomitchellnv (Collaborator) left a comment:

Great job writing unit tests!!

Commit message:

Also corrects a number of geneformer scripts that relied on a bug in the previous masking function that assigned the wrong number of random tokens. With the new function that assigns the correct number of random tokens, we set the random token mask percentage lower to match previous results.

Signed-off-by: Peter St. John <[email protected]>
@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch from 98ed102 to e0af43c Compare August 9, 2024 00:07
@pstjohn (Collaborator, Author) commented Aug 9, 2024

/build-ci

@pstjohn pstjohn enabled auto-merge (squash) August 9, 2024 00:07
@pstjohn pstjohn merged commit 821fc4f into v2-main Aug 9, 2024
2 checks passed
@pstjohn pstjohn deleted the pstjohn/v2-main/masking-and-tokenizer branch September 16, 2024 12:54