Add BERT-style masking function #55

Merged 1 commit into v2-main on Aug 9, 2024

Conversation

@pstjohn (Collaborator) commented Jul 31, 2024

Splitting this off from #49 to make it easier to talk about some general-purpose BERT masking functions

@pstjohn pstjohn requested review from jstjohn and jomitchellnv July 31, 2024 18:24
@pstjohn pstjohn self-assigned this Jul 31, 2024
mask_stop_1 = mask_config.mask_prob * mask_config.mask_token_prob
mask_stop_2 = mask_config.mask_prob * (mask_config.mask_token_prob + mask_config.random_token_prob)

random_draws = torch.rand(tokenized_sequence.shape)
Collaborator:

Remind me, is this in [0,1]? Can you add a comment to that effect?
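For context, torch.rand samples uniformly from [0, 1), so the two thresholds carve that interval into the usual BERT buckets. A minimal sketch of how thresholds like these could be applied; the probability values and everything beyond the names quoted above are illustrative, not the PR's actual implementation:

```python
import torch

# Illustrative values only; the real ones come from mask_config.
mask_prob = 0.15          # fraction of tokens selected for the MLM loss
mask_token_prob = 0.8     # of those, fraction replaced by [MASK]
random_token_prob = 0.1   # of those, fraction replaced by a random token

mask_stop_1 = mask_prob * mask_token_prob
mask_stop_2 = mask_prob * (mask_token_prob + random_token_prob)

tokenized_sequence = torch.randint(5, 100, (16,))    # fake token ids
random_draws = torch.rand(tokenized_sequence.shape)  # uniform in [0, 1)

loss_mask = random_draws < mask_prob         # tokens scored by the loss
mask_token_mask = random_draws < mask_stop_1  # replaced with [MASK]
random_token_mask = (random_draws >= mask_stop_1) & (random_draws < mask_stop_2)
# Draws in [mask_stop_2, mask_prob) are left unchanged but still scored.
```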

@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch 2 times, most recently from 849878d to 8011b65 Compare August 2, 2024 00:28
@jstjohn (Collaborator) commented Aug 2, 2024 via email

@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch from 8011b65 to 9b74c8e Compare August 2, 2024 00:50
@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch from 9b74c8e to c1da5ef Compare August 2, 2024 16:25
@farhadrgh (Collaborator) commented Aug 2, 2024

I think we should consider the wrapper of the HuggingFace AutoTokenizer that allows importing tokenizers from HF models, including ESM2 and Geneformer. We already have HF transformers, but I think the drawbacks would be the overhead of import/init and the network access requirement.

nemo.collections.common.tokenizers.huggingface.auto_tokenizer

An example of using it for ESM2 tests: https://github.com/NVIDIA/bionemo-fw-ea/blob/9417f08f19ed2a7affd9a0d66374ec0bca330bcb/sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_model.py#L46
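For reference, a minimal sketch of pulling an ESM2 tokenizer straight from the hub with HuggingFace's own AutoTokenizer; the checkpoint name is just an example, and the NeMo wrapper referenced above is assumed to expose a similar interface:

```python
from transformers import AutoTokenizer

# Example checkpoint; any ESM2 or Geneformer tokenizer on the hub works the
# same way. The first call downloads and caches the tokenizer files, which is
# the import/init overhead and network access mentioned above.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

token_ids = tokenizer("MKTAYIAKQR", return_tensors="pt")["input_ids"]
pad_id = tokenizer.pad_token_id
```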

@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch from c1da5ef to 7f41ff5 Compare August 2, 2024 16:34
@pstjohn pstjohn changed the title Add ESM2 tokenizer and BERT-style masking function Add BERT-style masking function Aug 2, 2024
@pstjohn (Collaborator, Author) commented Aug 2, 2024

> I think we should consider the wrapper of the HuggingFace AutoTokenizer that allows importing tokenizers from HF models, including ESM2 and Geneformer. We already have HF transformers, but I think the drawbacks would be the overhead of import/init and the network access requirement.

Good suggestion -- I'm realizing we don't need the tokenizer in this PR, so I pulled it out. But I'll use huggingface's in the next one

@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch from 7f41ff5 to c03d718 Compare August 2, 2024 20:30
@pstjohn pstjohn requested a review from jstjohn August 2, 2024 20:30
@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch 5 times, most recently from c07a198 to 2dc5d3f Compare August 7, 2024 16:04
@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch from 2dc5d3f to 836230b Compare August 7, 2024 16:04
@jstjohn (Collaborator) left a comment:

Love this!

collate_fn=functools.partial(
    collate.bert_padding_collate_fn,
    padding_value=self.tokenizer.token_to_id(GeneTokenizer.pad_token),
    max_length=self.max_len,
@jstjohn (Collaborator) commented Aug 7, 2024:

Can you add min_length=None, here to kind of self-document that we're allowing for fewer than max_len tokens if all elements of a batch are shorter?

@pstjohn (Author):

done!

collate_fn=functools.partial(
    collate.bert_padding_collate_fn,
    padding_value=tokenizer.token_to_id(tokenizer.pad_token),
    max_length=2048,
Collaborator:

I would call out min_length=None, explicitly so it's easy to see that we are not padding to a particular length if all items are shorter.

@pstjohn (Author):

done
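Both threads above request the same change. Assuming bert_padding_collate_fn accepts a min_length keyword, the resulting call would look roughly like this; the import path and the surrounding `tokenizer` object are assumptions taken from the quoted snippet, not the exact merged code:

```python
import functools

from bionemo.llm.data import collate  # import path is an assumption for this sketch

collate_fn = functools.partial(
    collate.bert_padding_collate_fn,
    padding_value=tokenizer.token_to_id(tokenizer.pad_token),  # `tokenizer` from the surrounding module
    min_length=None,  # explicit: no padding up to max_length when the whole batch is shorter
    max_length=2048,
)
```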

@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch from 836230b to 98ed102 Compare August 7, 2024 17:11

    Used to seed a torch random generator from a numpy random generator.
    """
    return rng.integers(np.iinfo(np.int64).max)
Collaborator:

Q: Is the seed int64 so we can get more range?

@pstjohn (Author):

rng.integers returns an int64 by default; we're just asking for any random int64.
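In other words, the helper just turns a NumPy Generator into something torch can consume as a seed. A self-contained sketch of that pattern; the function name here is hypothetical:

```python
import numpy as np
import torch

def seed_from_numpy_rng(rng: np.random.Generator) -> int:
    # Hypothetical name for the helper under discussion: any value in the
    # non-negative int64 range is a valid torch seed, so we just draw one.
    return int(rng.integers(np.iinfo(np.int64).max))

np_rng = np.random.default_rng(42)
torch_generator = torch.Generator().manual_seed(seed_from_numpy_rng(np_rng))
```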

gene_data, col_idxs, feature_ids = self.lookup_cell_by_idx(idx)
return process_item(
    gene_data,
    col_idxs,
    feature_ids,
    self.tokenizer,
    gene_median=self.gene_medians,
    rng=rng,
Collaborator:

Is this seed fixed? Can you add a comment?

@pstjohn (Author):

rng is defined on line 197 -- rng = np.random.default_rng([self._seed, idx])

It's deterministic as a function of the class seed and item index. Megatron's sampler assumes that datasets are deterministic, so that when you call __getitem__(i) on each GPU rank, they all get the same data. I don't think we were obeying that constraint previously.
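A toy illustration of that pattern, re-deriving the RNG inside __getitem__ from (seed, idx) so every rank draws identical values; the class below is illustrative, not Geneformer's actual dataset:

```python
import numpy as np
from torch.utils.data import Dataset


class DeterministicToyDataset(Dataset):
    """Toy example: randomness is a pure function of (dataset seed, item index)."""

    def __init__(self, data, seed: int = 0):
        self.data = data
        self._seed = seed

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Re-deriving the RNG from [seed, idx] means every GPU rank that asks
        # for item `idx` draws exactly the same values, which is what
        # Megatron's sampler assumes.
        rng = np.random.default_rng([self._seed, idx])
        return self.data[idx], rng.random()
```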

@@ -671,6 +680,12 @@ def _get_loss_from_model(model_config: GeneformerConfig, seed: int) -> float:
batch_size=8,
shuffle=False,
num_workers=0,
collate_fn=functools.partial(
Collaborator:

It's properly masking in collation? Nice.

@pstjohn (Author):

Oh, no -- it just pads in collation. The masking is happening in dataset.py:293, apply_bert_pretraining_mask.

Collaborator:

This is correct.
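To make the division of labor concrete, here is a minimal, self-contained sketch of a pad-only collate function in the spirit of bert_padding_collate_fn; the name and signature are illustrative, the real one lives in bionemo's collate module, and masking stays in the dataset's apply_bert_pretraining_mask:

```python
from typing import List, Optional

import torch
from torch.nn.utils.rnn import pad_sequence

def padding_only_collate(
    batch: List[torch.Tensor],
    padding_value: int,
    max_length: Optional[int] = None,
) -> torch.Tensor:
    # Pads each 1-D tensor of token ids to the longest sequence in the batch;
    # no tokens are masked here, mirroring the point above that masking
    # happens earlier, in the dataset's __getitem__.
    padded = pad_sequence(batch, batch_first=True, padding_value=padding_value)
    if max_length is not None:
        padded = padded[:, :max_length]
    return padded

batch = [torch.tensor([5, 6, 7]), torch.tensor([8, 9])]
padded = padding_only_collate(batch, padding_value=0)  # shape (2, 3)
```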

@jomitchellnv (Collaborator) left a comment:

Great job writing unit tests!!

Commit message:

Also corrects a number of geneformer scripts that relied on a bug in the previous masking function that assigned the wrong number of random tokens. With the new function that assigns the correct number of random tokens, we set the random token mask percentage lower to match previous results.

Signed-off-by: Peter St. John <[email protected]>
@pstjohn pstjohn force-pushed the pstjohn/v2-main/masking-and-tokenizer branch from 98ed102 to e0af43c Compare August 9, 2024 00:07
@pstjohn (Collaborator, Author) commented Aug 9, 2024

/build-ci

@pstjohn pstjohn enabled auto-merge (squash) August 9, 2024 00:07
@pstjohn pstjohn merged commit 821fc4f into v2-main Aug 9, 2024
2 checks passed
@pstjohn pstjohn deleted the pstjohn/v2-main/masking-and-tokenizer branch September 16, 2024 12:54