Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Permutect dataset engine outputs contig and read group indices, not names #8860

Merged
merged 2 commits into from
Jun 4, 2024

Conversation

davidbenjamin
Copy link
Contributor

This is important for Infogain and other potential hybrid data because it will allow Permutect to separately normalize within each read group and otherwise keep track of different read groups downstream.

It also makes the data structures in Permutect much more convenient because numeric data is easier to write to a memory map.

Copy link
Contributor

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One quick question about if sequence_dictionary can be null, but otherwise looks good 👍

@@ -106,7 +128,7 @@ public void addData(final ReferenceContext ref, final VariantContext vc, Optiona
final M2ArgumentCollection.Mutect3DatasetMode mutect3DatasetMode) {
final String refBases = ReferenceBases.annotate(ref, vc);
final String refAllele = vc.getReference().getBaseString();
final String contig = vc.getContig();
final int contigIndex = sequenceDictionary.getSequenceIndex(vc.getContig());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the sequence dictionary is null? Is that checked for somewhere? Or does GATK not allow input BAMs with no sequence dictionary?

Copy link
Contributor Author

@davidbenjamin davidbenjamin Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sequence dictionary ultimately comes from Mutect2::getBestAvailableSequenceDictionary(), so it's not going to be null even if the BAM doesn't have one. Would you like me to put in some Utils.nonNull checks somewhere? (for future-proofing and/or peace of mind)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, it sounds like getBestAvailableSequenceDictionary() is already handling that check, so that should be fine.

@davidbenjamin davidbenjamin force-pushed the db_permutect_tensor_read_groups branch from 416e0d0 to 0f3f51a Compare June 4, 2024 14:04
@davidbenjamin davidbenjamin merged commit 2a420e4 into master Jun 4, 2024
21 checks passed
@davidbenjamin davidbenjamin deleted the db_permutect_tensor_read_groups branch June 4, 2024 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants