Allow missing vcf samples or gvcfs in Module00c #207
Merged
For large cohorts, it is useful in practice to be able to omit certain samples from BAF generation (in many cases owing to issues with SNP/indel calling on a small subset of samples). This typically does not affect results, as BAF is used only for training in Module03.
This PR adds an `ignore_missing_baf_samples` option to 00c that is off by default. With the option off, the sample list is cross-checked against the provided VCF/gVCF samples, and an error is thrown if an expected sample is not present in the SNP data. Setting it to true skips this cross-check. When using gVCFs for BAF generation, the input type has been changed from `Array[File]?` to `Array[File?]?` to allow for null inputs. For the sharded VCF method of BAF generation, samples can simply be absent.
This branch was tested using the default 00c test_small, and a modified version of the `Module00cTest.test_baf_from_vcf.json` input where the first sample gVCF was omitted.
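The cross-check described above can be illustrated with a minimal Python sketch. This is not the PR's implementation (the workflow itself is WDL); the function names `check_baf_samples` and `pair_samples_with_gvcfs` are hypothetical, and only the `ignore_missing_baf_samples` flag name comes from the PR. It assumes the check amounts to a set-membership test on sample IDs and that a null gVCF entry simply drops that sample from BAF generation.

```python
from typing import Iterable, List, Optional, Tuple


def check_baf_samples(
    expected_samples: Iterable[str],
    snp_samples: Iterable[str],
    ignore_missing_baf_samples: bool = False,
) -> List[str]:
    """Return the expected samples that have SNP data, in input order.

    Hypothetical helper: raises if a sample is missing unless the flag is set.
    """
    available = set(snp_samples)
    expected = list(expected_samples)
    missing = [s for s in expected if s not in available]
    if missing and not ignore_missing_baf_samples:
        raise ValueError("Samples missing from SNP data: " + ", ".join(missing))
    return [s for s in expected if s in available]


def pair_samples_with_gvcfs(
    samples: List[str],
    gvcfs: List[Optional[str]],
) -> List[Tuple[str, str]]:
    """Keep only (sample, gVCF) pairs with a non-null gVCF.

    Mimics an Array[File?] input in which some entries are null.
    """
    if len(samples) != len(gvcfs):
        raise ValueError("samples and gvcfs must have the same length")
    return [(s, g) for s, g in zip(samples, gvcfs) if g is not None]
```

With the flag off, a missing sample raises an error; with it on, the sample is silently skipped, which matches the default-off behavior described in this PR.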