Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add module for loading test data. #120

Merged
merged 5 commits into from
Sep 3, 2024
Merged

Conversation

pstjohn
Copy link
Collaborator

@pstjohn pstjohn commented Aug 23, 2024

Implements a system for specifying remote files in a more flexible fashion than the previous download_artifacts script.

This lets us write yaml files in sub-packages/bionemo-testing/src/bionemo/testing/data/resources/ (for instance, geneformer.yaml below) that specifies different remote data sources:

- tag: 10M_240530
  ngc: null
  pbss: "s3://bionemo-ci/models/geneformer-10M-240530-step-115430-wandb-4ij9ghox.nemo"
  sha256: 6e3670c6b61487035ebb8ce2c00fcbca7fb50128bd4d36ad035fa863791ef792 # pragma: allowlist secret
  description: >
    A pretrained 10.3M parameter geneformer (BERT) on 23M unique single cells and 25429 ENSG
    based tokens, padded to a final shape of 25472 for GPU efficiency. See 
    https://wandb.ai/clara-discovery/scFM_v9/runs/4ij9ghox.
- tag: qa
  pbss: "s3://bionemo-ci/models/geneformer-qa.nemo"
  sha256: ab49ba01e78b4fc940d528f8a4ff03cef3f68e08101de08a3c82a72f5d49be9d # pragma: allowlist secret
  description: A QA model for geneformer with randomly initialized weights.

Then calling load(filename/tag) returns a cached directory with the downloaded file(s). Compressed files are automatically unzipped, tar files are automatically unpacked (you get back the folder root of the tarfile).

>>> from bionemo.testing.data.load import load
>>> load("geneformer/qa")
PosixPath('/home/bionemo/.cache/bionemo/e50448f1b13547a020f5c6d2cd6430e5-geneformer-qa.nemo')

This should make it simpler to add new data for examples, tests, etc., and no longer requires users to run download_artifacts upfront before running the test suite.

The implementation offloads much of the heavy lifting to pooch. It also adds the default .cache folder location to the devcontainer's folder mounts, so users should be able to have these files persist across container runs.

NGC downloads

This now supports downloading from NGC as well! You need to specify both the NGC tag and the registry (i.e., model or resource). See this entry for ESM2:

- tag: nv_650m:1.0
  ngc: "nvidia/clara/esm2nv650m:1.0"
  ngc_registry: model
  pbss: "s3://bionemo-ci/models/esm2nv_650M_converted.nemo"
  sha256: 1e38063cafa808306329428dd17ea6df78c9e5d6b3d2caf04237c555a1f131b7 # pragma: allowlist secret
  owner: Farhad Ramezanghorbani <[email protected]>
  description: >
    A pretrained 650M parameter ESM2 model. See https://ngc.nvidia.com/catalog/models/nvidia:clara:esm2nv650m.

@pstjohn
Copy link
Collaborator Author

pstjohn commented Aug 23, 2024

There are a few tests I'll need to change in bionemo-scdl, but that will be easier to wait until #118 is merged. CC @polinabinder1

@pstjohn pstjohn force-pushed the pstjohn/v2-main/data-module branch from 65370a9 to a37f75e Compare August 23, 2024 21:16
@pstjohn pstjohn requested a review from jstjohn August 23, 2024 21:17
Copy link
Collaborator

@malcolmgreaves malcolmgreaves left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful PR! Well written and great use of libraries. I have a few suggested changes. The most important one being not having a global s3 client in the load module. Details in-line. Looking forward to using this myself soon!

@pstjohn pstjohn marked this pull request as ready for review August 26, 2024 15:53
@pstjohn pstjohn force-pushed the pstjohn/v2-main/data-module branch from a5f51af to c6eab45 Compare August 26, 2024 15:58
@pstjohn
Copy link
Collaborator Author

pstjohn commented Aug 26, 2024

/build-ci

Copy link
Collaborator

@malcolmgreaves malcolmgreaves left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to unblock since we're on different coasts and I don't want to hold up your PR as you work through it :) But I would still like engagement on the review comments.

Overall, this is a well-written and clear PR 🚀

@pstjohn pstjohn force-pushed the pstjohn/v2-main/data-module branch from bbb4441 to 1f5f63a Compare August 27, 2024 19:32
Copy link
Collaborator

@skothenhill-nv skothenhill-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change, seems like it will be really helpful. Could we add a README or documentation somewhere on how we are expected to interact with this?

As a first time user this is my surface expectations from looking at this code:

  • load is the setup method I'll want to call for whatever resource I need - should be toplevel in my test or a fixture
  • Need to add a new yaml entry for whatever resource I want included.

Is that right? I'm guessing it will be less clear when the diff isnt right infront of me.

Copy link
Collaborator

@jstjohn jstjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@pstjohn pstjohn requested a review from ohadmo August 30, 2024 14:31
@pstjohn pstjohn force-pushed the pstjohn/v2-main/data-module branch from 1f5f63a to 853be25 Compare August 30, 2024 16:26
@pstjohn pstjohn requested a review from trvachov as a code owner August 30, 2024 16:26
Offloads much of the heavy lifting to pooch, but implements a system for
specifying and downloading files to a local cache that will persist
between devcontainer builds.

Signed-off-by: Peter St. John <[email protected]>
@pstjohn pstjohn enabled auto-merge (squash) September 3, 2024 19:54
@pstjohn pstjohn merged commit f2d2b44 into v2-main Sep 3, 2024
3 checks passed
@pstjohn pstjohn deleted the pstjohn/v2-main/data-module branch September 16, 2024 12:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants