-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LightningDataModule for webdataset #100
Conversation
sub-packages/bionemo-diffdock/src/bionemo/diffdock/utils/torus.py
Outdated
Show resolved
Hide resolved
sub-packages/bionemo-diffdock/src/bionemo/diffdock/utils/diffusion.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DejunL thanks for the contribution, but this PR is too large to review. It needs to be broken up into more readily reviewable chunks. If you want more context, please refer to the contributing and PR review guides in the repository. Please aim for 500-1000 lines at the maximum to make this tractable for reviewers. Thanks!
After a first pass, my suggestion is to do this by:
- having a PR for the webdataset
- having another PR for diffdock utils
Webdataset looks like it is diffdock specific, so please put this into bionemo-diffdock and hold off on putting it into bionemo-core until we have confirmed multiple use cases.
Additionally, the diffdock utils need unit tests before they can be merged in.
sub-packages/bionemo-diffdock/src/bionemo/diffdock/utils/data.py
Outdated
Show resolved
Hide resolved
sub-packages/bionemo-diffdock/src/bionemo/diffdock/utils/so3.py
Outdated
Show resolved
Hide resolved
sub-packages/bionemo-core/tests/bionemo/core/data/test_webdatamodule.py
Outdated
Show resolved
Hide resolved
sub-packages/bionemo-core/tests/bionemo/core/data/test_webdatamodule.py
Outdated
Show resolved
Hide resolved
Thanks for the suggestion. I should clarify the motivation in the OP but let me do it here too. The webdatamodule is not diffdock specific. Nothing in the datamodule.py depends on any specific model. DiffDock-pp, which we are planning to add to BioNeMo soon, is also using webdataset. @kdidiNVIDIA is also interested in trying to use webdataset in the protein foundation model. So putting webdatamodule into bionemo-diffdock will not achieve the desirable modularization. My original thinking of putting webdatamodule into bionemo-core was seeing the In the context of @guoqing-zhou working on modularizing DiffDock specific code, I can remove the the diffdock tests and utils from this PR and only focus on the generic webdatamodule. This would also reduce the number of lines to review. What do you think? @malcolmgreaves |
Following up from the slack discussion - let's double check alignment and summarize next steps: How does this sound @DejunL ? |
a114765
to
082b681
Compare
First of all, we should stop referring this PR to "dataset" or "dataloader" as this is about lightning data module for using webdataset. I have just moved the code under bionemo-webdatamodule and remove the diffdock specific code because the webdatamodule doesn't depend on diffdock at all (the original code you were referring to was only used for testing webdatamodule). The diffdock migration to bionemo2 will be future work. @malcolmgreaves |
22d41d3
to
dd054a8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of our sub-packages currently have a duplicated LICENSE file. Is this not needed?
If we clean this up, will the LICENSE still be logged properly in the package metadata? For example will this proposed test still pass? #82
…ar created ... and add a test for it
... per the requirement of the FW v2 alpha release
c851688
to
a35765f
Compare
/build-ci |
This PR achieves the following:
WebDataModule
. It takes a set of webdataset tar files and various webdataset config settings as input and setups the WebDataset and WebLoader to be used later by the Lightning.Trainer workflowsPickledDataWDS
that inherits from the aforementionedWebDataModule
that allows the user to experiment with different train/val/test splits of the input pickled data to be used in creating the webdataset tar files4. Migrate a subset of BioNeMo1 DiffDock data processing routines for the purpose of testing the data modules in 1 and 2 in the context of DiffDock model5. Tests for 4(these are postponed for the next diffdock PR)