Releases: google/deepvariant
DeepVariant 1.8.0
In this release:
- Small model integration: Speed increased by ~1.7x (about a 40% runtime reduction) for WGS, PacBio, and ONT through the introduction of an additional small model. The small model identifies easy-to-call sites and invokes the standard DeepVariant model for harder sites. We observe similar or improved accuracy and confidence calibration with this combination. Use of the small model can be disabled with the --disable_small_model=true option. For details, please see the small model details doc.
- Pangenome-aware variant calling: Added a new ability to directly use information from a pangenome during variant calling. This improves accuracy both with BAMs mapped with standard BWA and with BAMs mapped to a pangenome with vg-Giraffe. Error reduction is ~30% for vg-Giraffe-mapped WGS, 10% for BWA-mapped WGS, and 5% for BWA-mapped WES. See details in the metrics page.
- Configure a fast pipeline: Optional mode to increase efficiency for high-throughput GPU implementations. This configuration pipelines example generation with GPU-based variant calling to increase utilization of GPU resources. See the case study for details.
- Introduced new Mas-Seq models for variant calling with Kinnex kits/Mas-Seq data. See the case study for details.
- PacBio models are now trained with labels from the Platinum Pedigree, which reduces errors by 34% on this more comprehensive truth set, which includes very difficult parts of the genome.
- Added SPRQ data to PacBio training datasets, improving accuracy for SPRQ chemistry. Updated the PacBio case study data to the 2024 SPRQ release. Reduced error on SPRQ chemistry by 27% relative to DeepVariant v1.6. Updating to DeepVariant v1.8 is recommended for SPRQ.
- Updated how model file metadata is specified, to accommodate more flexible ways of specifying channels. Custom models now require an accompanying example_info.json file containing the image shape details generated during training image generation in the make_examples and call_variants stages. An example of custom model use is the T7 case study, where the example_info.json file is downloaded in this section so that DeepVariant runs successfully.
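The two figures quoted for the small model, a ~1.7x speedup and a roughly 40% runtime reduction, describe the same measurement in two ways. A quick sketch of the conversion follows; the function names are illustrative only and are not part of DeepVariant.

```python
def runtime_reduction(speedup: float) -> float:
    """Fraction of runtime saved for a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def speedup_from_reduction(reduction: float) -> float:
    """Speedup factor implied by a fractional runtime reduction."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


print(f"{runtime_reduction(1.7):.1%}")         # 41.2%
print(f"{speedup_from_reduction(0.40):.2f}x")  # 1.67x
```

Both directions round to the values quoted in the release note.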
We are thankful for the contributions from:
- Mobin Asri (@mobinasri) and Juan Carlos Mier (@jmier2) on pangenome-aware DeepVariant work.
- Ralf W. Grosse-Kunstleve (@rwgk) for helping to migrate from CLIF to pybind.
- Shiyi Yin (@yinshiyi) for Mas-Seq model work.
- Maya Venkatraman (@mv2731) for helping to explore model architectures.
- Ben Soudry (@ben-soudry) for helping to streamline channel inputs.
- Atilla Kiraly (@akiraly1) and Yuchen Zhou (@Yuchen-95) on explainability work.
- Jorge Gonzalez Mendez (@jgonzalezmendez) on improving the C++ code quality.
- Stephanie Steele (@stesteele) for helping migrate python code to C++.
DeepVariant 1.6.1
In this release:
- We fixed a bug in call_variants that caused the step to freeze in cases where there were no examples. This bug was observed and reported in #764, #769, google/deepsomatic#8.
- Updated libssw library from 1.2.4 to 1.2.5.
- The same model files are used for v1.6.0 and v1.6.1 for all technologies.
DeepVariant 1.6.0
- Improved support for haploid regions, chrX and chrY. Users can specify haploid regions with a flag. Updated case studies show usage and metrics.
- Added pangenome workflow (FASTQ-to-VCF mapping with VG and DeepVariant calling). The case study demonstrates improved accuracy.
- Substantial improvements to DeepTrio de novo accuracy by specifically training DeepTrio for this use case (for chr20 at 30x HG002-HG003-HG004, false negatives reduced from 8 to 0 with DeepTrio v1.4, false positives reduced from 5 to 0).
- We have added multi-processing ability in postprocess_variants, which reduces runtime from 48 minutes to 30 minutes for Illumina WGS and from 56 minutes to 33 minutes for PacBio.
- We have added new models trained with Complete Genomics data, and added case studies.
- We have added NovaSeqX to the training data for the WGS model.
- We have migrated our training and inference platform from Slim to Keras.
- Force calling with approximate phasing is now available.
We are sincerely grateful to
- @wkwan and @paulinesho for their contributions to the move to Keras.
- @lucasbrambrink for enabling multiprocessing in postprocess_variants.
- @MSamman, @akiraly1 for their contributions.
- PacBio: William Rowell (@williamrowell), Nathaniel Echols for their feedback and testing.
- UCSC: Benedict Paten (@benedictpaten), Shloka Negi (@shlokanegi), Jimin Park (@jimin001), Mobin Asri (@mobinasri) for the feedback.
DeepVariant 1.5.0
- New model datatype: --model_type ONT_R104 is a new option. Starting from v1.5, DeepVariant natively supports ONT R10.4 simplex and duplex data.
- For older ONT chemistry, please continue to use PEPPER-Margin-DeepVariant.
- Incorporated PacBio Revio training data in DeepVariant PacBio model. In our evaluations this single model performs well on both Sequel II and Revio datatypes. Please use DeepVariant v1.5 and later for Revio data.
- Incorporated Element Biosciences data in WGS models. We found that we could jointly train a short-read WGS model with both Illumina and Element data. Inclusion of Element data improves accuracy on Element without negative effect on Illumina. Please use the WGS model for best results on either Illumina or Element data.
- Added vg/Giraffe-mapped BAMs to DeepVariant WGS training data (alongside existing BWA). We observed that a single model can be trained for strong results with both BWA and vg/Giraffe.
- Improved DeepVariant WES model for 100bps exome sequencing thanks to user-reported issues (including #586 and #592).
- Thanks to Tong Zhu from Nvidia for his suggestion to improve the logic for shuffling reads.
- Thanks to Doron Shem-Tov (@doron-st) and Ilya Soifer (@ilyasoifer) from Ultima Genomics for adding new functionalities enabled by flags --enable_joint_realignment and --p_error.
- Thanks to Dennis Yelizarov for improving Google-internal infrastructure for running make_examples.
- Updated TensorFlow version to 2.11.0. Updated htslib version to 1.13.
DeepVariant 1.4.0
- Simplified DeepVariant PacBio by introducing approximate haplotagging. This means PacBio users who run DeepVariant no longer need to run DeepVariant+WhatsHap+DeepVariant. See PacBio case study for more information.
- For Illumina WGS and WES, we added an additional feature of read insert size (insert_size). This reduces errors by 4-10% for the Illumina WGS and WES models. Thanks @lucasbrambrink for implementing this feature.
- Reduced the runtime of the postprocess_variants step by 10-30%. Thanks @MosheWagner for optimizing the code.
- Included experimental code which explores use of Keras for model architecture. This is not used in production methods, but may be informative to developers seeking examples of Keras applied to similar problems. Thanks @wkwan and @paulinesho for their contributions.
- We did not include OpenVINO by default in the Docker images we released. Users can still build their own Docker images with the option turned on as needed.
- Updated 2022-10-17: We have released an Illumina RNA-seq model and added an RNA-seq case study.
DeepVariant 1.3.0
- Improved the DeepTrio PacBio models on PacBio Sequel II Chemistry v2.2 by including this data in the training dataset.
- Improved call_variants speed for PacBio models (both DeepVariant and DeepTrio) by reducing the default window width from 221 to 199, without tradeoff on accuracy. Thanks to @lucasbrambrink for conducting the experiments to find a better window width for PacBio.
- Introduced a new flag --normalize_reads in make_examples, which normalizes Indel candidates at the read level. This flag is useful to reduce rare cases where an indel variant is not left-normalized. This feature is mainly relevant to joint calling of large cohorts, or cases where read mappings have been surjected from one reference to another. It is currently set to False by default. To enable it, add --normalize_reads=true directly to the make_examples binary. If you’re using the run_deepvariant one-step approach, add --make_examples_extra_args="normalize_reads=true". Currently we don’t recommend turning this flag on for long reads due to potential runtime increase.
- Added an --aux_fields_to_keep flag to the make_examples step, and set the default to only the auxiliary fields that DeepVariant currently uses. This reduces memory use for input BAM files that have large auxiliary fields that aren’t used in variant calling. Thanks to @williamrowell and @rhallPB for reporting this issue.
- Reduced the frequency of logging in make_examples as well as call_variants to address the issue reported in #491.
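When make_examples flags are passed through the one-step run_deepvariant command, they are written as flag_name=flag_value pairs inside a single --make_examples_extra_args value, with multiple pairs separated by commas and booleans written as true or false. A minimal sketch of assembling that string follows; the helper name format_extra_args is hypothetical and not part of DeepVariant.

```python
def format_extra_args(flags: dict) -> str:
    """Render a flag dict as 'name=value,name=value' with lowercase booleans."""
    parts = []
    for name, value in flags.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        # Accept names written with or without leading dashes.
        parts.append(f"{name.lstrip('-')}={value}")
    return ",".join(parts)


extra = format_extra_args({"normalize_reads": True})
print(f'--make_examples_extra_args="{extra}"')
# --make_examples_extra_args="normalize_reads=true"
```

Values that themselves contain commas would collide with the separator, so this sketch is only suitable for simple scalar flags.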
DeepVariant 1.2.0
The DeepVariant v1.2 release contains the following major improvements:
- A major code refactor for make_examples better modularizes common components between DeepVariant, DeepTrio, and potential future applications. This enables DeepTrio to inherit improvements such as --add_hp_channel (introduced to the DeepVariant PacBio model in v1.1; see blog), improving DeepTrio’s PacBio accuracy.
- The DeepVariant PacBio model has substantially improved accuracy for PacBio Sequel II Chemistry v2.2, achieved by including this data in the training dataset.
- We updated several dependencies: Python version to 3.8, TensorFlow version to 2.5.0, and GPU support version to CUDA 11.3 and cuDNN 8.2. The greater computational efficiency of these dependencies results in improvements to speed.
- In the "training" mode for make_examples, we committed a change (4a11046) that fixed an issue introduced in an earlier commit (a4a6547) where make_examples might generate fewer REF (class0) examples than expected.
- Improvements to accuracy for Illumina WGS models for various, shorter read lengths. Thanks to the following contributors and their teams for the idea:
- Dr. Masaru Koido (The University of Tokyo and RIKEN)
- Dr. Yoichiro Kamatani (The University of Tokyo and RIKEN)
- Mr. Kohei Tomizuka (RIKEN)
- Dr. Chikashi Terao (RIKEN)
Additional detail for improvements in DeepVariant v1.2:
Improvements for training:
- We augmented the training data for Illumina WGS model by adding BAMs with trimmed reads (125bps and 100bps) to improve our model’s robustness on different read lengths.
Improvements for make_examples:
For more details on flags, run /opt/deepvariant/bin/make_examples --help.
- Major refactoring to ensure useful features (such as --add_hp_channel) can be shared between DeepVariant and DeepTrio make_examples.
- Add MED_DP (median of DP) in the gVCF output. See this section for more details.
- New --split_skip_reads flag: if True, make_examples will split reads with large SKIP cigar operations into individual reads. Resulting read parts that are less than 15 bp are filtered out.
- We now sort the realigned BAM output mentioned in this section when you use --emit_realigned_reads=true --realigner_diagnostics=/output/realigned_reads for make_examples. You will still need to run samtools index to get the index file, but no longer need to sort the BAM.
- Added an experimental prototype for multi-sample make_examples.
- This is an experimental prototype for working with multiple samples in DeepVariant, a proof of concept enabled by the refactoring to join together DeepVariant and DeepTrio, generalizing the functionality of make_examples to work with multiple samples. Usage information is in multisample_make_examples.py, but note that this is experimental.
- Improved logic for read allele counts calculation for sites with low base quality indels, which resulted in Indel accuracy improvement for PacBio models.
- Improvements to the realigner code to fix certain uncommon edge cases.
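The --split_skip_reads behavior described above can be illustrated on a bare CIGAR string. This is a sketch, not DeepVariant's implementation. The function name, the min_skip threshold for what counts as a "large" SKIP, and the choice to measure part length in read bases are all assumptions. Only the 15 bp minimum part length comes from the notes above.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
READ_CONSUMING = set("MIS=X")  # operations that consume read bases


def split_on_skips(cigar: str, min_skip: int = 1, min_part_len: int = 15):
    """Split a CIGAR at N (SKIP) operations of at least min_skip bases.

    Returns the CIGAR strings of the resulting parts, dropping parts that
    cover fewer than min_part_len read bases. min_skip is a hypothetical
    threshold chosen for illustration.
    """
    parts, current = [], []
    for length, op in CIGAR_OP.findall(cigar):
        if op == "N" and int(length) >= min_skip:
            parts.append(current)  # close the part before the skip
            current = []
        else:
            current.append((int(length), op))
    parts.append(current)

    kept = []
    for part in parts:
        read_len = sum(n for n, op in part if op in READ_CONSUMING)
        if read_len >= min_part_len:
            kept.append("".join(f"{n}{op}" for n, op in part))
    return kept


print(split_on_skips("50M1000N10M2000N40M"))  # ['50M', '40M']
```

In the example, the 10 bp middle segment falls below the 15 bp cutoff and is dropped, which mirrors the filtering described in the flag's note.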
Improvements for the one-step run_deepvariant:
For more details on flags, run /opt/deepvariant/bin/run_deepvariant --help.
- New --runtime_report flag, which enables runtime report output to --logging_dir. This makes it easier for users to get the runtime by region report for make_examples.
- New --dry_run flag for printing out all commands to be executed, without running them. This is mentioned in the Quick Start section.
DeepVariant 1.1.0
The v1.1 release introduces DeepTrio, which uses a model specifically trained to call a mother-father-child trio or parent-child duo. DeepTrio has superior accuracy compared to DeepVariant. Pre-trained models are available for Illumina WGS, Illumina exome, and PacBio HiFi.
In addition, DeepVariant v1.1 contains the following improvements:
- Accuracy improvements on PacBio, reducing Indel errors by ~21% on the case study. This is achieved by adding an input channel which specifically encodes haplotype information, as opposed to only sorting by haplotype in v1.0. The flag is --add_hp_channel, which is enabled by default for PacBio.
- Speed improvements for long read data by more efficient handling of long CIGAR strings.
- New functionality to add detailed logs for runtime of make_examples by genomic region, viewable in an interactive visualization.
- We now fully withhold HG003 from all training, and report all accuracy evaluations on HG003. We continue to withhold chromosome20 from training in all samples.
New optional flags to increase speed:
A team at Intel has adapted DeepVariant to use the OpenVINO toolkit, which further accelerates TensorFlow applications. This speeds up the call_variants stage by ~25% for any model when run in CPU mode on an Intel machine. DeepVariant runs with OpenVINO have the same accuracy and are nearly identical to runs without. Runs with OpenVINO are fully reproducible on OpenVINO.
To use OpenVINO, add the following flag to the DeepVariant command:
--call_variants_extra_args "use_openvino=true"
We thank Intel for their contribution, and acknowledge the extensive work their team put in, captured in #363.
DeepVariant 1.0.0
DeepVariant v1.0 releases new features and accuracy improvements sufficiently substantial to indicate a major version of v1.0. Compared to DeepVariant v0.10, these changes reduce Illumina WGS errors by 24%, exome errors by 19%, and PacBio errors by 52%.
- Added ALT-aligned pileups, which create additional input channels where reads are also aligned to the candidate ALT alleles. This is controlled by the flag --alt_aligned_pileup. --alt_aligned_pileup=diff_channels is now the default for the DeepVariant PacBio model. This substantially improves INDEL accuracy for PacBio data.
- Added new flag --sort_by_haplotypes to optionally allow creating pileup images with reads sorted by haplotype. Haplotype sorting is based on the HP tag that must be present in input BAM, and --parse_sam_aux_fields needs to be set as well. This substantially improves INDEL accuracy for PacBio data.
- The PacBio case study now includes instructions for two-pass calling, which allows users to take advantage of --sort_by_haplotypes by phasing variants and the input reads. Accuracy metrics for both single pass calling and two-pass calling are shown. Users may choose whether to run a second time for higher accuracy.
- Default of --min_mapping_quality in make_examples.py changed from 10 to 5. This improves accuracy of all models (WGS, WES, and PACBIO).
- Included a new hybrid illumina+pacbio model and documentation.
- Added show_examples, a tool for showing examples as pileup image files, with documentation.
- Cleaned up unused experimental flags: --sequencing_type_image and --custom_pileup_image.
- Added --only_keep_pass flag to postprocess_variants.py to optionally only keep PASS calls in output VCF.
- Addressed GitHub issues:
- Fixed the binarize function in modelling.py. (#286 fixed in db87d77)
- Fixed quoting issues for --regions when using run_deepvariant.py. (#305 fixed in fbacd35)
- Added --version to run_deepvariant.py. (#332 fixed in f101492)
- Added --sample_name flag to postprocess_variants.py and applied it in run_deepvariant.py as well. (#334 fixed in a81d629)
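As a rough illustration of what haplotype sorting means for the row order of a pileup, the sketch below orders reads by their HP tag and then by start position. The Read class, the treatment of untagged reads as haplotype 0, and the secondary sort key are assumptions made for illustration. DeepVariant's actual ordering logic may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    name: str
    start: int
    hp: Optional[int] = None  # value of the HP tag, or None if absent


def sort_by_haplotype(reads):
    """Order reads by HP tag (untagged reads first), then by start position."""
    return sorted(reads, key=lambda r: (r.hp or 0, r.start))


reads = [
    Read("r1", 120, hp=2),
    Read("r2", 100, hp=1),
    Read("r3", 110),
    Read("r4", 90, hp=2),
]
print([r.name for r in sort_by_haplotype(reads)])  # ['r3', 'r2', 'r4', 'r1']
```

Grouping reads this way places reads from the same haplotype in adjacent rows, which is the property the flag's description relies on.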
DeepVariant 0.10.0
- Update to Python3 and TensorFlow2: We use Python3.6, and pin to TensorFlow 2.0.0.
- Improved PacBio model for amplified libraries: the PacBio HiFi training data now includes amplified libraries at both standard and high coverages. This provides a substantial accuracy boost to variant detection from amplified HiFi data.
- Turned off ws_use_window_selector_model by default: This flag was turned on by default in v0.7.0. After the discussion in issue #272, we decided to turn this off to improve consistency and accuracy, at the trade-off of a 7% increase in runtime of the make_examples step. Users may add --make_examples_extra_args "ws_use_window_selector_model=true" to save some runtime at the expense of accuracy.