-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HaplotypeCaller: omnibus updates and fixes for 4.0 #3519
Conversation
@jamesemery please review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of comments here and there that we have discussed. Merge after addressing comments.
/** | ||
* Trivial wrapper around a GATKRead iterator that saves all reads returned in a cache, | ||
* which can be periodically returned and emptied by the client. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a note warning potential users to empty the cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
public class AllLocusIteratorUnitTest extends BaseTest { | ||
|
||
@Test | ||
public void testIteratorReturnsPileupsForAllLoci() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add tests for intervals that lie before and after the interval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also tests for the sanity checks in the constructor (like having the wrong contig)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
locusIterator.forEachRemaining(pileup -> returnedPileups.add(pileup)); | ||
|
||
Assert.assertEquals(returnedPileups.size(), providedInterval.size(), "Wrong number of pileup objects returned"); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an assertion here that specifies that this is inclusive of the staring position for the provided interval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// Ignore sites that are not contained within this shard's interval | ||
if ( ! readShard.getInterval().contains(pileup) ) { | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically unnecessary because the AllLocusIterator already asserts this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, but it doesn't hurt.
} | ||
|
||
// Add the current pileup to the activity profile | ||
if ( readShard.getPaddedInterval().contains(pileup.getLocation()) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second layer of redundancy which has already been asserted above on line 137
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
} | ||
|
||
if ( ! pendingRegions.isEmpty() ) { | ||
nextRegion = pendingRegions.poll(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could overwrite the nextRegion if it was specified by the first loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in a guard
private void fillNextAssemblyRegionWithReads( final AssemblyRegion region ) { | ||
// First we need to check the previous region for reads that also belong in this region | ||
if ( previousRegion != null ) { | ||
for ( final GATKRead previousRegionRead : previousRegion.getReads() ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you entirely sure that previousRegion.getReads() is safe in this case? Is it possible that the reads might have been removed or filtered before the next region reaches into it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, it might not be safe. Switching to storing the previous region's reads directly instead.
* | ||
* Loads the reads from the shard as lazily as possible to minimize memory usage. | ||
* | ||
* This iterator represents the core of the {@link AssemblyRegionWalker} traversal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps add to the contract of this method a warning about the needed unmapped reads filters to have been applied to the shard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
Assert.assertTrue(region.getSpan().size() <= maxRegionSize, "region size " + region.getSpan().size() + " exceeds the configured maximum: " + maxRegionSize); | ||
|
||
final int regionContigLength = readsDictionary.getSequence(region.getSpan().getContig()).getSequenceLength(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you ever assert that there are no gaps between regions? I see that you assert their ordering and their overhangs, but what if they overlap/have gaps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
7a1c4f7
to
95ac782
Compare
All comments addressed -- will merge after tests pass. |
-Reduce memory usage of AssemblyRegion traversal by an order of magnitude by loading the reads for each shard more lazily. -Add a sharding mode that creates one shard per user interval (or per contig, if there are no explicit intervals), and make it the default for both HaplotypeCaller and Mutect2 -When determining active regions, only consider loci within the user's intervals (but still include surrounding reads in the final region). This mimics GATK3.x behavior. -Serve up empty pileup objects for uncovered loci (this also mimics GATK3.x behavior). The fact that we weren't doing this before was responsible for much of the remaining difference vs. the GATK 3.x HaplotypeCaller. -Ported GATK 3 PR 1389 (use median rather than the second-best likelihood for the NON_REF allele) -Ported a change to the ReferenceConfidenceModel from GATK3 -Fixed a bug in ReadLikelihoods that was causing ArrayIndexOutOfBoundsException -Added special handling of RawMQ to HaplotypeCaller (mirrors the handling of RawMQ from GenotypeGVCFs) -Added updated concordance test data generated with HaplotypeCaller 3.8-4-g7b0250253f Resolves #1950 Resolves #3516 Resolves #3517 Resolves #3518 Resolves #3233 Resolves #2848
95ac782
to
6bed18c
Compare
Codecov Report
@@ Coverage Diff @@
## master #3519 +/- ##
===============================================
+ Coverage 80.098% 80.184% +0.087%
- Complexity 17691 17764 +73
===============================================
Files 1184 1187 +3
Lines 64158 64307 +149
Branches 9959 9992 +33
===============================================
+ Hits 51389 51564 +175
+ Misses 8813 8769 -44
- Partials 3956 3974 +18
|
-Reduce memory usage of AssemblyRegion traversal by an order of magnitude
by loading the reads for each shard more lazily.
-Add a sharding mode that creates one shard per user interval (or per contig,
if there are no explicit intervals), and make it the default for both HaplotypeCaller
and Mutect2
-When determining active regions, only consider loci within the user's intervals (but
still include surrounding reads in the final region). This mimics GATK3.x behavior.
-Serve up empty pileup objects for uncovered loci (this also mimics GATK3.x behavior).
The fact that we weren't doing this before was responsible for much of the remaining
difference vs. the GATK 3.x HaplotypeCaller.
-Ported GATK 3 PR 1389 (use median rather than the second-best likelihood for the
NON_REF allele)
-Ported a change to the ReferenceConfidenceModel from GATK3
-Fixed a bug in ReadLikelihoods that was causing ArrayIndexOutOfBoundsException
-Added special handling of RawMQ to HaplotypeCaller (mirrors the handling of RawMQ
from GenotypeGVCFs)
-Added updated concordance test data generated with HaplotypeCaller 3.8-4-g7b0250253f
Resolves #1950
Resolves #3516
Resolves #3517
Resolves #3518
Resolves #3233
Resolves #2848