Releases: NVIDIA/bionemo-framework
NVIDIA BioNeMo Framework v2.2
New Features
- Small Molecule Featurization
  - Implemented elementary and advanced atom, bond, and full molecule featurizers.
- GH200 Support for BioNeMo
  - Added a `Dockerfile.arm` that builds a BioNeMo container that runs on GH200 machines.
  - Published a version of the BioNeMo container that supports multiple architectures to NGC.
Updates & Improvements
- Single-Cell Dataloader (SCDL)
  - Changed metadata storage to `parquet` files, which yields a 30x speedup when iterating over a large dataset.
  - Added functionality to concatenate several `anndata` files without doubling disk usage.
- ESM2
  - Added support for `SIGTERM` preemption checkpoint saving.
  - Moved ESM-2 and Geneformer training scripts to new executables, `train_esm2` and `train_geneformer`, respectively.
  - Moved inference script to a new executable, `infer_esm2`, and deprecated the inference example in the fine-tuning tutorial.
  - Added new Jupyter notebook tutorials for inference and zero-shot protein design. These notebooks can be deployed on cloud resources as a brev.dev launchable.
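The `SIGTERM` preemption checkpoint saving mentioned above can be illustrated with a minimal, framework-free sketch. This is not the BioNeMo or NeMo callback itself: `PreemptionFlag`, `train`, and `save_checkpoint` are hypothetical names used only for illustration, and the sketch assumes the handler is installed from the main thread.

```python
import signal


class PreemptionFlag:
    """Records that SIGTERM arrived so the training loop can checkpoint and exit."""

    def __init__(self):
        self.requested = False
        # Install the handler and keep the previous one so it can be restored.
        self._previous = signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Only set a flag here; the actual save happens in the training loop.
        self.requested = True

    def restore(self):
        """Put back whatever SIGTERM handler was installed before."""
        signal.signal(signal.SIGTERM, self._previous)


def train(num_steps, flag, save_checkpoint):
    """Hypothetical loop: run steps until done or until preemption is requested."""
    for step in range(num_steps):
        # ... one training step would run here ...
        if flag.requested:
            save_checkpoint(step)  # hypothetical callable supplied by the caller
            return step
    return num_steps
```

Keeping the handler down to setting a flag means the checkpoint is written at a step boundary, when model and optimizer state are consistent, rather than from inside the signal handler.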
Known Issues
- Loading a checkpoint for Geneformer inference on H100 has a known regression in accuracy. Work is in progress to resolve this by the next release.
Changes
- Move ESM2 scripts to sub-packages by @farhadrgh in #406
- WAR: sets checkpoint filename to be more unique by @skothenhill-nv in #429
- Update NeMo and Megatron to TOT by @pstjohn in #424
- re-enable merge groups to trigger blossom-ci by @pstjohn in #431
- Revert "re-enable merge groups to trigger blossom-ci" by @pstjohn in #434
- Updated notebook, and nemo2 checkpoint with geneformer by @jstjohn in #430
- add pre-emption callback to esm2 train by @pstjohn in #433
- add rdkit dependency to bionemo-geometric by @sveccham in #432
- eliminate the need for NGC login - bionemo2 by @dorotat-nv in #440
- Add documentation and release info to README by @sirelkhatim in #447
- Bump 3rdparty/Megatron-LM from `aded519` to `5438d15` by @dependabot in #444
- Launchable notebooks in docs! by @jstjohn in #451
- Cache dev build from our nightly public container by @jstjohn in #462
- set num_workers to 1 for esm2 tests by @pstjohn in #461
- ESM2 Tutorial Updates by @farhadrgh in #426
- BugFix: fix bugs on bionemo-size-aware-batching by @guoqing-zhou in #449
- Fix typos in geneformer benchmark description by @jstjohn in #470
- Pillow version bump into main by @polinabinder1 in #465
- Refactor SCDL Row Feature Index for Performance Improvement (Rebased) by @savitha-eng in #466
- pin correct tornado requirement by @polinabinder1 in #474
- Updating Brev.Dev documentation by @polinabinder1 in #483
- Add release notes for v2.1 by @tshimko-nv in #468
- Update VERSION by @polinabinder1 in #488
- Atom and bond features by @sveccham in #453
- Molecule featurizer and molecule graph by @sveccham in #484
- hillst/bionemo noodles by @skothenhill-nv in #458
- update collate mask_value by @pstjohn in #485
- override checkpoint precision by @farhadrgh in #475
- JSON -> YAML for CLI by @skothenhill-nv in #436
- [QA Bug] Remove NGC dependency by @farhadrgh in #494
- Bump 3rdparty/NeMo from `e2b0f0e` to `06e6703` by @dependabot in #486
- Bump 3rdparty/Megatron-LM from `5438d15` to `844119f` by @dependabot in #496
- change source for coverage report by @pstjohn in #495
- Pstjohn/stop and go test non validation by @pstjohn in #476
- Add support on num steps for learning rate scheduler by @sichu2023 in #489
- Initial compatibility testing images by @malcolmgreaves in #438
- Conda-Based Compatibility Test Images by @malcolmgreaves in #507
- Instructions on compatibility image build by @malcolmgreaves in #512
- Formatting by @malcolmgreaves in #513
- Pstjohn/fix ci by @pstjohn in #515
- [FEA][webdatamodule]: support webdataset invocable by @DejunL in #501
- GH200 support by @gagank1 in #369
- Remove quotes for Jupyter command on startup in init guide by @tshimko-nv in #523
- Reduce esm2 and geneformer test burden by @sichu2023 in #499
- [v2.2] Publish release notes for BioNeMo FW v2.2. by @cspades in #522
- Disable validation/test stages in ESM-2 and Geneformer by @sichu2023 in #492
- CI HOTFIX: ignore inrun_pytest.sh a notebook by @dorotat-nv in #526
- added NeMoLogger unit tests by @dorotat-nv in #511
- Bump 3rdparty/Megatron-LM from `844119f` to `99f23d2` by @dependabot in #528
- [cye/wandb-fix] Fix WandB issue. by @cspades in #530
- xFail known bad tests on H100 and fix CVEs by @gagank1 in #547
New Contributors
- @sveccham made their first contribution in #432
- @sirelkhatim made their first contribution in #447
Full Changelog: v2.1...v2.2
NVIDIA BioNeMo Framework 2.1
New Features:
- ESM2 Implementation
  - Updated the ESM-2 Model Card with detailed performance benchmarks comparing BioNeMo2 training against vanilla PyTorch.
  - Added ESM-2 inference endpoint for evaluating pre-trained models.
- Size-Aware Batching
  - Added SizeAwareBatchSampler, a PyTorch data sampler that batches elements of varying sizes while ensuring that the total size of each batch does not exceed a specified maximum.
  - Added BucketBatchSampler, another PyTorch data sampler that groups elements of varying sizes based on predefined bucket ranges and creates batches from the elements in each bucket, so that each batch contains elements of homogeneous size.
- CLI Support
  - Added pydantic interface for pretraining jobs via parsing JSON configuration files that enables passing customized Model and DataModules classes.
  - Implemented pydantic configuration for Geneformer and ESM2 pretraining and finetuning.
  - Added 'recipes' for generating validated JSON files to be used with pydantic interface.
  - Added installable scripts for the configuration-driven training and recipe generation above: `bionemo-esm2-recipe`, `bionemo-esm2-train`, `bionemo-geneformer-recipe`, and `bionemo-geneformer-train`.
- Geneformer support in BioNeMo2:
  - Tested pre-training scripts and fine-tuning example scripts that can be used as a starting point for users to create custom derivative models.
  - Geneformer 10M and 106M checkpoints ported from BioNeMo v1 into BioNeMo v2 available and included in documentation.
  - Added inference scripts
- Documentation
  - Cell type classification example notebook which covers the process of converting anndata into our internal format, and running inference on that data with a Geneformer checkpoint, as well as making use of the inference results.
  - Updated Getting Started guide, ESM-2 tutorials
  - Added Frequently Asked Questions (FAQ) page
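The two batching strategies described under Size-Aware Batching above can be sketched in plain Python. This is only an illustration of the ideas, not the API of SizeAwareBatchSampler or BucketBatchSampler: the function names, the `sizeof` callable, and the choices to raise on oversized elements and to drop elements outside every bucket are all assumptions made for this sketch.

```python
import bisect
from typing import Callable, Iterable, Iterator, List, Sequence


def size_aware_batches(
    indices: Iterable[int],
    sizeof: Callable[[int], float],
    max_total_size: float,
) -> Iterator[List[int]]:
    """Yield batches whose summed element size never exceeds max_total_size."""
    batch: List[int] = []
    total = 0.0
    for idx in indices:
        size = sizeof(idx)
        if size > max_total_size:
            # Assumption of this sketch: an element that cannot fit is an error.
            raise ValueError(f"element {idx} has size {size} > {max_total_size}")
        if batch and total + size > max_total_size:
            # Adding this element would overflow, so close the current batch.
            yield batch
            batch, total = [], 0.0
        batch.append(idx)
        total += size
    if batch:
        yield batch


def bucket_batches(
    indices: Iterable[int],
    sizeof: Callable[[int], float],
    boundaries: Sequence[float],
    batch_size: int,
) -> Iterator[List[int]]:
    """Group elements into [boundaries[i], boundaries[i+1]) buckets, then batch per bucket."""
    buckets: List[List[int]] = [[] for _ in range(len(boundaries) - 1)]
    for idx in indices:
        # boundaries must be sorted ascending for the bisect lookup to be valid.
        b = bisect.bisect_right(boundaries, sizeof(idx)) - 1
        if 0 <= b < len(buckets):
            buckets[b].append(idx)
        # Assumption of this sketch: elements outside every bucket are dropped.
    for bucket in buckets:
        for start in range(0, len(bucket), batch_size):
            yield bucket[start:start + batch_size]
```

For example, with element sizes `[3, 4, 2, 5, 1, 6]` and `max_total_size=7`, `size_aware_batches` groups consecutive elements into batches whose sizes each sum to at most 7, while `bucket_batches` with boundaries `[0, 4, 8]` keeps small and large elements in separate batches.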
Changes
- Final October docs edits by @tshimko-nv in #331
- Update container location and tag for 2.0 release by @tshimko-nv in #337
- Remove broken Release Notes links from v2.0 docs build by @tshimko-nv in #343
- Tell pytest to ignore 3rdparty/{NeMo,MegatronLM} by @malcolmgreaves in #61
- Add back the removed bionemo-core sub-package by @malcolmgreaves in #25
- Fix bionemo-size-aware-batching, standardize pyproject.toml's & dependencies by @malcolmgreaves in #284
- Add check bug fix label workflow by @yzhang123 in #250
- Adds geneformer overview by @skothenhill-nv in #279
- Add ESM2 Dataset and Datamodule by @pstjohn in #78
- Test checkpoint IO loss is close to expected. by @jstjohn in #37
- fix post-create command by @pstjohn in #152
- Drop dependency to internal docs by @farhadrgh in #303
- Add initial configuration for mike (version management for docs) by @tshimko-nv in #330
- Update ESM2 model card with benchmarks by @pstjohn in #341
- Geneformer PEFT by @gwarmstrong in #155
- Update initialization in response to VDR by @tshimko-nv in #334
- Add GitHub workflow by @ohadmo in #9
- Reorganize bionemo-contrib into namespace packages by @malcolmgreaves in #51
- Improve ESM2 pretraining tutorial from VDR feedback by @tshimko-nv in #336
- install geometric dependencies before invalidating caches with source copy by @pstjohn in #224
- ESM2 LoRA by @gwarmstrong in #218
- chown /usr/local's dist-packages to allow editing them in the devcontainer by @pstjohn in #111
- add search highlight + code copy capabilities by @jwilber in #102
- ESM2 implementation by @farhadrgh in #28
- Fix broken docs links on mike build by @tshimko-nv in #344
- Updates to Getting Started docs by @tshimko-nv in #179
- fix post-create command by @pstjohn in #88
- refactor doc structure and look by @jwilber in #143
- Make ruff check pre-commit hook follow what CI does by @malcolmgreaves in #201
- Add bionemo-gemoetric: A component library for PyTorch Geometric Models & Data by @malcolmgreaves in #110
- [FEA] size-aware batching: a package for creating mini-batch in a memory consumption-aware manner by @DejunL in #168
- ESM2 Finetune bug fix and update by @farhadrgh in #197
- add dev tools to devcontainer build by @pstjohn in #210
- places the ptl artifacts ignore lines to the root directory only. by @skothenhill-nv in #21
- Jared/v2 main/nvidia styles by @jwilber in #101
- rename bionemo-fw-ea to bionemo-framework by @yzhang123 in #292
- Add BERT-style masking function by @pstjohn in #55
- Add perplexity logging by @sichu2023 in #144
- support nsys profiling on ESM2, add downstream improvements to hit P0 perf by @sichu2023 in #300
- trivial commit to bionemo2 by @broland-hat in #19
- Add geneformer bionemo1 disclaimer by @jstjohn in #278
- Split out the lightning example tutorial by @jstjohn in #67
- Move v2 commits over. by @jstjohn in #8
- Add documentation covering megatron and code structure rationalle by @jstjohn in #153
- try out gh page url to resolve 404 error by @jwilber in #233
- lowercase file name so mkdocs picks up correctly by @jwilber in #173
- use importlib resources for files by @pstjohn in #178
- add nemo-run as a git submodule by @pstjohn in #186
- Add module for loading test data. by @pstjohn in #120
- LightningDataModule for webdataset by @DejunL in #100
- Update dependency tags to match PR #36, and try to fix test failure by @jstjohn in #39
- Change to gelu default from relu which is what we actually used before by @jstjohn in #20
- Jwilber/load nb from subpackages by @jwilber in #128
- Use github runners to run pre-commit hooks by @pstjohn in #42
- Bump 3rdparty/NeMo from `ff7c614` to `8f0d0c7` by @dependabot in #145
- Add a tested function to see if model parallel is enabled by @jstjohn in #175
- Handle special tokens in the bert masking function by @pstjohn in #99
- Fix all license headers to Apache by @trvachov in #347
- add dependabot file by @pstjohn in #161
- Checkpointing example with Geneformer by @skothenhill-nv in #24
- epoch-level shuffling in ESM2 dataset by @pstjohn in #150
- Bump 3rdparty/Megatron-LM from `0bda578` to `08e80b0` by @dependabot in #183
- move CI scripts to central location by @pstjohn in #131
- setuptools sub-package local vs. publish by @malcolmgreaves in #133
- Nested weight munging fine-tuning/continue training example and test for example model and geneformer. by @jstjohn in #97
- ESM2 Golden Value Testing by @farhadrgh in #85
- Add pretraining documentation by @sichu2023 in #283
- Wandb integration by @olachinkei in #205
- Fix address in docs by @farhadrgh in #297
- update branch name bionemo2 by @dorotat-nv in #160
- Updated README docum...
NVIDIA BioNeMo Framework 2.0
New Features:
- ESM2 implementation
  - State-of-the-art training performance and equivalent accuracy to the reference implementation
  - 650M and 3B scale checkpoints available which mirror the reference model
  - Flexible fine-tuning examples that can be copied and modified to accomplish a wide variety of downstream tasks
- First version of our NeMo v2 based reference implementation, which re-imagines bionemo as a repository of megatron models, dataloaders, and training recipes that make use of NeMo v2 for training loops.
  - Modular design and a permissive Apache 2 OSS license enable the import and use of our framework in proprietary applications.
  - NeMo2 training abstractions allow the user to focus on the model implementation while the training strategy handles distribution and model parallelism.
- Documentation and documentation build system for BioNeMo 2.
Known Issues:
- PEFT support is not yet fully functional.
- A partial implementation of Geneformer is present; use at your own risk. It will be optimized and officially released in the future.
- The command line interface is currently based on one-off training recipes and scripts. We are working on a configuration-based approach that will be released in the future.
- The fine-tuning workflow is implemented for BERT-based architectures and could be adapted for others, but it requires you to inherit from the biobert base model config. In the short term, you can follow similar patterns to partially load weights from an old checkpoint into a new model; in the future we will provide a more direct API that is easier to follow.
- Slow memory leak occurs during ESM-2 pretraining, which can cause OOM during long pretraining runs. Training with a microbatch size of 48 on 40 A100s raised an out-of-memory error after 5,800 training steps.
  - Possible workarounds include calling `gc.collect(); torch.cuda.empty_cache()` at every ~1,000 steps, which appears to reclaim the consumed memory; or training with a lower microbatch size and re-starting training from a saved checkpoint periodically.
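The first workaround for the memory leak above could be wired into a training loop roughly as follows. This is a minimal sketch, not code from the BioNeMo repository: `reclaim_memory` and its `every_n_steps` parameter are hypothetical names, and the PyTorch import is guarded so the helper also runs where PyTorch is not installed.

```python
import gc


def reclaim_memory(step: int, every_n_steps: int = 1000) -> bool:
    """Run the workaround every `every_n_steps` steps; return True when it ran."""
    if step == 0 or step % every_n_steps != 0:
        return False
    # Free unreachable Python objects first so their GPU tensors are released.
    gc.collect()
    try:
        import torch
    except ImportError:
        return True  # No PyTorch available: only Python garbage collection ran.
    if torch.cuda.is_available():
        # Return cached, unused blocks from PyTorch's allocator to the GPU.
        torch.cuda.empty_cache()
    return True
```

Calling `reclaim_memory(step)` at the end of each training step keeps the cost negligible, since the collection only runs on every thousandth step by default.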
External Partner Contributions
We would like to thank the following organizations for their insightful discussions guiding the development of the BioNeMo Framework and their valuable contributions to the codebase. We are grateful for your collaboration.
Changes
- Add GitHub workflow by @ohadmo in #9
- Move v2 commits over. by @jstjohn in #8
- Jstjohn/fix geneformer multinode by @jstjohn in #17
- places the ptl artifacts ignore lines to the root directory only. by @skothenhill-nv in #21
- ESM2 implementation by @farhadrgh in #28
- Update dependency tags to match PR #36, and try to fix test failure by @jstjohn in #39
- Test checkpoint IO loss is close to expected. by @jstjohn in #37
- Change to gelu default from relu which is what we actually used before by @jstjohn in #20
- Make artifact downloads more robust by @pstjohn in #41
- Add devcontainer config for bionemo2 by @pstjohn in #5
- Add license check to pre-commit hook by @ohadmo in #22
- Use github runners to run pre-commit hooks by @pstjohn in #42
- Add back the removed bionemo-core sub-package by @malcolmgreaves in #25
- trivial commit to bionemo2 by @broland-hat in #19
- Add mamba as a dependency in the dockerfile by @pstjohn in #44
- Add future TE support and mixed precision support to biobert test by @jstjohn in #43
- Add trufflehog as a github action check by @pstjohn in #45
- Adds CONTRIBUTING, CODE-REVIEW guides and pull request template by @malcolmgreaves in #10
- Use precision lowest value instead of -torch.inf by @farhadrgh in #35
- Add NeMo and Megatron-LM as git submodules by @pstjohn in #52
- Add a CLI option to restore training from a nemo1 checkpoint by @jstjohn in #54
- Add some additional ruff checks, ignoring existing violations by @pstjohn in #56
- Reorganize bionemo-contrib into namespace packages by @malcolmgreaves in #51
- Update devcontainer for new package structure by @pstjohn in #62
- Tell pytest to ignore 3rdparty/{NeMo,MegatronLM} by @malcolmgreaves in #61
- Clean up src vs test mirroring rule violations. by @jstjohn in #66
- fixing devcontainer target by @pstjohn in #64
- adding merge_group to existing actions by @pstjohn in #71
- Split out the lightning example tutorial by @jstjohn in #67
- Reconfigure the pre-commit workflow by @pstjohn in #63
- convert root_directory to a field with default_factory by @pstjohn in #58
- Checkpointing example with Geneformer by @skothenhill-nv in #24
- Updates to devcontainer by @skothenhill-nv in #77
- Adding license, and contributing guidelines from #72 and #65 by @jstjohn in #74
- adding some additional docstrings by @pstjohn in #81
- Pin ptl to <2.4.0 to fix nemo bug by @pstjohn in #86
- Add documentation build system for BioNeMo v2 by @pstjohn in #40
- Add BERT-style masking function by @pstjohn in #55
- fix post-create command by @pstjohn in #88
- Pbinder/move scdl by @polinabinder1 in #76
- Add ESM2 Dataset and Datamodule by @pstjohn in #78
- Upgrade nemo and megatron, and fix configs to reflect the change by @jstjohn in #92
- Bump 3rdparty/Megatron-LM from `104d864` to `cf0f9b2` by @dependabot in #96
- ESM2 Golden Value Testing by @farhadrgh in #85
- fixing version issue by @polinabinder1 in #90
- adding github action for docs deployment by @pstjohn in #98
- Jared/v2 main/nvidia styles by @jwilber in #101
- Handle special tokens in the bert masking function by @pstjohn in #99
- add search highlight + code copy capabilities by @jwilber in #102
- add internal link for devcontainer cache by @pstjohn in #105
- Fix Geneformer huggingface links by @ohadmo in #106
- Fixing secuirty scan vulnerabilities by @ohadmo in #104
- add jupyter notebook support in documentation by @pstjohn in #109
- Adding Dataloading Test cases and documentation by @polinabinder1 in #107
- Bump 3rdparty/NeMo from `e6c0e72` to `ff7c614` by @dependabot in #103
- Pbinder/readme modify by @polinabinder1 in #115
- Promote nltk version to address GHSA-cgvx-9447 by @ohadmo in #114
- moving test data around by @polinabinder1 in #118
- Bump 3rdparty/Megatron-LM from `cf0f9b2` to `ef85bc9` by @dependabot in #124
- Establish CODEOWNERS for bionemo2 by @malcolmgreaves in #121
- chown /usr/local's dist-packages to allow editing them in the devcontainer by @pstjohn in #111
- Stop and Go harness and tests for geneformer and GPT. by @skothenhill-nv in #116
- Bump NeMo/Mcore by @skothenhill-nv in #127
- Complete ESM2 pretraining by @sichu2023 in #112
- LightningDataModule for webdataset by @DejunL in https://github.com/NVIDIA/bionemo-framework/pull...
NVIDIA BioNeMo Framework 1.10
Changes
- Migrated development from NVIDIA internal to GitHub
- License changed from NVIDIA proprietary to Apache 2.0
- The 1.10 release is functionally equivalent to the 1.9 release; previous release notes can be found in the documentation directory of the GitHub repository