Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HaplotypeCaller: omnibus updates and fixes for 4.0 #3519

Merged
merged 1 commit into from
Aug 26, 2017

Conversation

droazen
Copy link
Contributor

@droazen droazen commented Aug 25, 2017

-Reduce memory usage of AssemblyRegion traversal by an order of magnitude
by loading the reads for each shard more lazily.

-Add a sharding mode that creates one shard per user interval (or per contig,
if there are no explicit intervals), and make it the default for both HaplotypeCaller
and Mutect2

-When determining active regions, only consider loci within the user's intervals (but
still include surrounding reads in the final region). This mimics GATK3.x behavior.

-Serve up empty pileup objects for uncovered loci (this also mimics GATK3.x behavior).
The fact that we weren't doing this before was responsible for much of the remaining
difference vs. the GATK 3.x HaplotypeCaller.

-Ported GATK 3 PR 1389 (use median rather than the second-best likelihood for the
NON_REF allele)

-Ported a change to the ReferenceConfidenceModel from GATK3

-Fixed a bug in ReadLikelihoods that was causing ArrayIndexOutOfBoundsException

-Added special handling of RawMQ to HaplotypeCaller (mirrors the handling of RawMQ
from GenotypeGVCFs)

-Added updated concordance test data generated with HaplotypeCaller 3.8-4-g7b0250253f

Resolves #1950
Resolves #3516
Resolves #3517
Resolves #3518
Resolves #3233
Resolves #2848

@droazen
Copy link
Contributor Author

droazen commented Aug 25, 2017

@jamesemery please review

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments here and there that we have discussed. Merge after addressing comments.

/**
* Trivial wrapper around a GATKRead iterator that saves all reads returned in a cache,
* which can be periodically returned and emptied by the client.
*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a note warning potential users to empty the cache

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

public class AllLocusIteratorUnitTest extends BaseTest {

@Test
public void testIteratorReturnsPileupsForAllLoci() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests for intervals that lie before and after the interval

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also tests for the sanity checks in the constructor (like having the wrong contig)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

locusIterator.forEachRemaining(pileup -> returnedPileups.add(pileup));

Assert.assertEquals(returnedPileups.size(), providedInterval.size(), "Wrong number of pileup objects returned");

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an assertion here that specifies that this is inclusive of the staring position for the provided interval

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// Ignore sites that are not contained within this shard's interval
if ( ! readShard.getInterval().contains(pileup) ) {
continue;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically unnecessary because the AllLocusIterator already asserts this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but it doesn't hurt.

}

// Add the current pileup to the activity profile
if ( readShard.getPaddedInterval().contains(pileup.getLocation()) ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second layer of redundancy which has already been asserted above on line 137

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

}

if ( ! pendingRegions.isEmpty() ) {
nextRegion = pendingRegions.poll();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could overwrite the nextRegion if it was specified by the first loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in a guard

private void fillNextAssemblyRegionWithReads( final AssemblyRegion region ) {
// First we need to check the previous region for reads that also belong in this region
if ( previousRegion != null ) {
for ( final GATKRead previousRegionRead : previousRegion.getReads() ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you entirely sure that previousRegion.getReads() is safe in this case? Is it possible that the reads might have been removed or filtered before the next region reaches into it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it might not be safe. Switching to storing the previous region's reads directly instead.

*
* Loads the reads from the shard as lazily as possible to minimize memory usage.
*
* This iterator represents the core of the {@link AssemblyRegionWalker} traversal.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add to the contract of this method a warning about the needed unmapped reads filters to have been applied to the shard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


Assert.assertTrue(region.getSpan().size() <= maxRegionSize, "region size " + region.getSpan().size() + " exceeds the configured maximum: " + maxRegionSize);

final int regionContigLength = readsDictionary.getSequence(region.getSpan().getContig()).getSequenceLength();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you ever assert that there are no gaps between regions? I see that you assert their ordering and their overhangs, but what if they overlap/have gaps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@droazen droazen force-pushed the dr_hc_fixes_combined branch from 7a1c4f7 to 95ac782 Compare August 25, 2017 21:31
@droazen
Copy link
Contributor Author

droazen commented Aug 25, 2017

All comments addressed -- will merge after tests pass.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
-Reduce memory usage of AssemblyRegion traversal by an order of magnitude
 by loading the reads for each shard more lazily.

-Add a sharding mode that creates one shard per user interval (or per contig,
 if there are no explicit intervals), and make it the default for both HaplotypeCaller
 and Mutect2

-When determining active regions, only consider loci within the user's intervals (but
 still include surrounding reads in the final region). This mimics GATK3.x behavior.

-Serve up empty pileup objects for uncovered loci (this also mimics GATK3.x behavior).
 The fact that we weren't doing this before was responsible for much of the remaining
 difference vs. the GATK 3.x HaplotypeCaller.

-Ported GATK 3 PR 1389 (use median rather than the second-best likelihood for the
 NON_REF allele)

-Ported a change to the ReferenceConfidenceModel from GATK3

-Fixed a bug in ReadLikelihoods that was causing ArrayIndexOutOfBoundsException

-Added special handling of RawMQ to HaplotypeCaller (mirrors the handling of RawMQ
 from GenotypeGVCFs)

-Added updated concordance test data generated with HaplotypeCaller 3.8-4-g7b0250253f

Resolves #1950
Resolves #3516
Resolves #3517
Resolves #3518
Resolves #3233
Resolves #2848
@droazen droazen force-pushed the dr_hc_fixes_combined branch from 95ac782 to 6bed18c Compare August 25, 2017 21:35
@codecov-io
Copy link

Codecov Report

Merging #3519 into master will increase coverage by 0.087%.
The diff coverage is 84.729%.

@@               Coverage Diff               @@
##              master     #3519       +/-   ##
===============================================
+ Coverage     80.098%   80.184%   +0.087%     
- Complexity     17691     17764       +73     
===============================================
  Files           1184      1187        +3     
  Lines          64158     64307      +149     
  Branches        9959      9992       +33     
===============================================
+ Hits           51389     51564      +175     
+ Misses          8813      8769       -44     
- Partials        3956      3974       +18
Impacted Files Coverage Δ Complexity Δ
...oadinstitute/hellbender/engine/AssemblyRegion.java 84.358% <ø> (ø) 59 <0> (ø) ⬇️
...stitute/hellbender/tools/HaplotypeCallerSpark.java 79.048% <ø> (ø) 24 <0> (ø) ⬇️
...dinstitute/hellbender/utils/pileup/ReadPileup.java 92.414% <ø> (ø) 64 <0> (ø) ⬇️
...der/tools/walkers/annotator/RMSMappingQuality.java 98.039% <100%> (+0.08%) 25 <2> (+2) ⬆️
...itute/hellbender/tools/walkers/mutect/Mutect2.java 92% <100%> (-0.593%) 15 <2> (-1)
...kers/haplotypecaller/ReferenceConfidenceModel.java 92.778% <100%> (+0.04%) 61 <0> (ø) ⬇️
...ypecaller/AssemblyBasedCallerGenotypingEngine.java 77.5% <100%> (+0.285%) 59 <0> (+1) ⬆️
...titute/hellbender/tools/walkers/GenotypeGVCFs.java 89.916% <100%> (-0.084%) 46 <0> (-1)
...stitute/hellbender/utils/clipping/ReadClipper.java 71.038% <100%> (ø) 75 <4> (ø) ⬇️
...tools/walkers/haplotypecaller/HaplotypeCaller.java 93.75% <100%> (-0.368%) 17 <1> (-1)
... and 28 more

@droazen droazen merged commit 1ef09b3 into master Aug 26, 2017
@droazen droazen deleted the dr_hc_fixes_combined branch August 26, 2017 00:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants