Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow missing vcf samples or gvcfs in Module00c #207

Merged
merged 3 commits into from
Jul 26, 2021
Merged

Conversation

mwalker174
Copy link
Collaborator

For large cohorts, it is useful in practice to be able to omit certain samples from BAF generation (in many cases owing to issues with SNP/indel calling on a small subset of samples). This typically does not affect results, as BAF is used only for training in Module03.

This PR adds a ignore_missing_baf_samples option to 00c that is off by default. Setting to true skips cross-checks between the sample list and provided VCF/gVCF samples (an error is thrown if an expected sample is not present in the snp data). When using gVCFs for BAF generation, the input type has been changed from Array[File]? to Array[File?]? to allow for null inputs. For the sharded VCF method of BAF generation, samples can simply be absent.

This branch was tested using the default 00c test_small, and a modified version of the Module00cTest.test_baf_from_vcf.json input where the first sample gVCF was omitted.

Copy link
Collaborator

@epiercehoffman epiercehoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I think your reorganization of GenerateBAF and GatherBAF makes sense. I assume the intention behind passing samples to GenerateBAF for GVCFs is so that it will error out for missing samples if ignore_missing_gvcfs is not set?

I noticed there are some linting errors that should be addressed, but could probably be merged without remaking the dockers for now.

To provide gvcfs for most but not all samples, would the input look like this? It might be worth documenting this as I doubt most people have worked with Array[File?]? inputs and I didn't see anything about it in the WDL spec.

"Module00c.gvcfs": ["gvcf.sample1.vcf.gz", , "gvcf.sample3.vcf.gz"]

Has Module03 been tested with missing BAF for some samples in the past?

@epiercehoffman
Copy link
Collaborator

Another thought - should we provide guidance on a minimum % of samples that need BAF?

@mwalker174
Copy link
Collaborator Author

This looks good to me. I think your reorganization of GenerateBAF and GatherBAF makes sense. I assume the intention behind passing samples to GenerateBAF for GVCFs is so that it will error out for missing samples if ignore_missing_gvcfs is not set?

Yes it is for cross-checking with the SNP calls.

I noticed there are some linting errors that should be addressed, but could probably be merged without remaking the dockers for now.

Fixed!

To provide gvcfs for most but not all samples, would the input look like this? It might be worth documenting this as I doubt most people have worked with Array[File?]? inputs and I didn't see anything about it in the WDL spec.

"Module00c.gvcfs": ["gvcf.sample1.vcf.gz", , "gvcf.sample3.vcf.gz"]

Should look like ["Module00c.gvcfs": ["gvcf.sample1.vcf.gz", null, "gvcf.sample3.vcf.gz"]. Added a note in the documentation.

Has Module03 been tested with missing BAF for some samples in the past?

Yes we ran gnomAD-v3 with some samples missing BAF. Anything around ~1% should be very tolerable, but we haven't tested beyond that. Added a quick note in the documentation.

@mwalker174 mwalker174 merged commit 5cff5ad into master Jul 26, 2021
@mwalker174 mwalker174 deleted the mw_missing_baf branch July 26, 2021 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants