diff --git a/README.md b/README.md
index 15c966238..7f1ba19fc 100644
--- a/README.md
+++ b/README.md
@@ -329,7 +329,10 @@ Generates variant metrics for filtering.
## FilterBatch
*Formerly Module03*
-Filters poor quality variants and filters outlier samples.
+Filters poor-quality variants and outlier samples. This workflow can be run all at once with the WDL at `wdl/FilterBatch.wdl`, or in three steps to enable tuning of outlier filtration cutoffs. The three subworkflows are:
+1. FilterBatchSites: Per-batch variant filtration
+2. PlotSVCountsPerSample: Visualize SV counts per sample per type to help choose an IQR cutoff for outlier filtering, and preview outlier samples for a given cutoff
+3. FilterBatchSamples: Per-batch outlier sample filtration; provide an appropriate `outlier_cutoff_nIQR` based on the SV count plots and outlier previews from step 2.
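
For intuition, the outlier rule behind `outlier_cutoff_nIQR` flags samples whose per-type SV count falls outside the interquartile range by more than N IQRs. A minimal Python sketch of that rule (illustrative only, not the pipeline's script; anchoring the bounds at the quartiles is an assumption):

```python
from statistics import quantiles

def iqr_outliers(counts_by_sample, n_iqr):
    """Flag samples whose SV count lies more than n_iqr IQRs beyond the quartiles."""
    q1, _, q3 = quantiles(counts_by_sample.values(), n=4)
    lo, hi = q1 - n_iqr * (q3 - q1), q3 + n_iqr * (q3 - q1)
    return {s for s, c in counts_by_sample.items() if c < lo or c > hi}

# Toy counts for one SV type; sample_11 is an obvious outlier at n_iqr=6.
counts = {f"sample_{i}": 4000 + 50 * i for i in range(11)}
counts["sample_11"] = 9800
print(iqr_outliers(counts, n_iqr=6))  # {'sample_11'}
```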
#### Prerequisites:
* [GenerateBatchMetrics](#generate-batch-metrics)
@@ -441,7 +444,7 @@ gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2
```
* BatchEffect - remove variants that show significant discrepancies in allele frequencies across batches
-* FilterOutlierSamples - remove outlier samples with unusually high or low number of SVs
+* FilterOutlierSamplesPostMinGQ - remove outlier samples with unusually high or low number of SVs
* FilterCleanupQualRecalibration - sanitize filter columns and recalibrate variant QUAL scores for easier interpretation
## AnnotateVcf (in development)
diff --git a/input_templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl b/input_templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl
index 3146913b8..1698c05da 100644
--- a/input_templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl
+++ b/input_templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl
@@ -55,14 +55,19 @@ The following workflows are included in this workspace, to be executed in this o
4. `04-GatherBatchEvidence`: Per-batch copy number variant calling using cn.MOPS and GATK gCNV; B-allele frequency (BAF) generation; call and evidence aggregation
5. `05-ClusterBatch`: Per-batch variant clustering
6. `06-GenerateBatchMetrics`: Per-batch variant filtering, metric generation
-7. `07-FilterBatch`: Per-batch variant filtering; outlier exclusion
-8. (Skip for a single batch) `08-MergeBatchSites`: Site merging of SVs discovered across batches, run on a cohort-level `sample_set_set`
-9. `09-GenotypeBatch`: Per-batch genotyping of all sites in the cohort. Use `09-GenotypeBatch_SingleBatch` if you only have one batch.
-10. `10-RegenotypeCNVs`: Cohort-level genotype refinement of some depth calls. Use `10-RegenotypeCNVs_SingleBatch` if you only have one batch.
-11. `11-MakeCohortVcf`: Cohort-level cross-batch integration; complex variant resolution and re-genotyping; VCF cleanup. Use `11-MakeCohortVcf_SingleBatch` if you only have one batch.
-12. `12-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets. Use `12-AnnotateVcf_SingleBatch` if you only have one batch.
+7. `07a-FilterBatchSites`: Per-batch variant filtering
+8. `07b-PlotSVCountsPerSample`: Plot SV counts per sample per SV type to inform the choice of an IQR cutoff for outlier filtration in `07c-FilterBatchSamples`
+9. `07c-FilterBatchSamples`: Per-batch outlier sample filtration
+10. (Skip for a single batch) `08-MergeBatchSites`: Site merging of SVs discovered across batches, run on a cohort-level `sample_set_set`
+11. `09-GenotypeBatch`: Per-batch genotyping of all sites in the cohort. Use `09-GenotypeBatch_SingleBatch` if you only have one batch.
+12. `10-RegenotypeCNVs`: Cohort-level genotype refinement of some depth calls. Use `10-RegenotypeCNVs_SingleBatch` if you only have one batch.
+13. `11-MakeCohortVcf`: Cohort-level cross-batch integration; complex variant resolution and re-genotyping; VCF cleanup. Use `11-MakeCohortVcf_SingleBatch` if you only have one batch.
+14. `12-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets. Use `12-AnnotateVcf_SingleBatch` if you only have one batch.
-Additional modules, such as those for filtering and visualization, are under development. They are not included in this workspace at this time, but the source code can be found in the [GATK-SV GitHub repository](https://github.com/broadinstitute/gatk-sv).
+Additional downstream modules, such as those for filtering and visualization, are under development. They are not included in this workspace at this time, but the source code can be found in the [GATK-SV GitHub repository](https://github.com/broadinstitute/gatk-sv). See **Downstream steps** towards the bottom of this page for more information.
+
+Extra workflows (not part of the canonical pipeline, but included for your convenience; may require manual configuration):
+* `FilterOutlierSamples`: Filter outlier samples (in terms of SV counts) from a single VCF. We recommend running `07b-PlotSVCountsPerSample` beforehand (reconfigured with the single VCF you want to filter) to inform your choice of IQR cutoff.
For detailed instructions on running the pipeline in Terra, see **Step-by-step instructions** below.
@@ -178,24 +183,26 @@ Read the full documentation for these modules [here](https://github.com/broadins
* Use the same `sample_set` definitions you used for `03-TrainGCNV` and `04-GatherBatchEvidence`.
-#### 07-FilterBatch
+#### 07a-FilterBatchSites, 07b-PlotSVCountsPerSample, 07c-FilterBatchSamples
-Read the full FilterBatch documentation [here](https://github.com/broadinstitute/gatk-sv#filter-batch).
+These three workflows make up FilterBatch; they are subdivided in this workspace to enable tuning of outlier filtration cutoffs. Read the full FilterBatch documentation [here](https://github.com/broadinstitute/gatk-sv#filter-batch).
* Use the same `sample_set` definitions you used for `03-TrainGCNV` through `06-GenerateBatchMetrics`.
-* The default value for `outlier_cutoff_nIQR`, which is used to filter samples that have an abnormal number of SV calls, is 10000. This essentially means that no samples are filtered. You should adjust this value depending on your scientific needs.
+* `07a-FilterBatchSites` does not require user intervention.
+* `07b-PlotSVCountsPerSample` produces SV count plots and files, as well as a preview of the outlier samples to be filtered, but it does not perform any filtering of the VCFs. The input `N_IQR_cutoff` is used to visualize filtration thresholds on the SV count plots and preview the samples to be filtered; the default value is set to 6. You can adjust this value depending on your needs, and you can re-run the workflow with new `N_IQR_cutoff` values until the plots and outlier sample lists suit the purposes of your study. Once you have chosen an IQR cutoff, provide it to the `N_IQR_cutoff` input in `07c-FilterBatchSamples` to filter the VCFs using the chosen cutoff.
+* `07c-FilterBatchSamples` performs outlier sample filtration, removing samples with an abnormal number of SV calls of at least one SV type. To tune the filtering threshold to your needs, edit the `N_IQR_cutoff` input value based on the plots and outlier sample preview lists from `07b-PlotSVCountsPerSample`. The default value for `N_IQR_cutoff` in this step is 10000, which essentially means that no samples are filtered.
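
To iterate on the cutoff without relaunching anything heavyweight, you can inspect the SV count files from `07b` directly. A sketch that sweeps candidate `N_IQR_cutoff` values (the column layout of sample, SV type, count with no header is an assumption; adjust to the actual files):

```python
import csv
from collections import defaultdict
from statistics import quantiles

def preview_cutoffs(svcounts_tsv, candidates=(2, 4, 6, 8)):
    """Report how many samples each candidate N_IQR cutoff would flag
    as an outlier for at least one SV type."""
    per_type = defaultdict(dict)  # svtype -> {sample: count}
    with open(svcounts_tsv) as fh:
        for sample, svtype, count, *extra in csv.reader(fh, delimiter="\t"):
            per_type[svtype][sample] = int(count)
    for n in candidates:
        flagged = set()
        for counts in per_type.values():
            q1, _, q3 = quantiles(counts.values(), n=4)
            lo, hi = q1 - n * (q3 - q1), q3 + n * (q3 - q1)
            flagged |= {s for s, c in counts.items() if c < lo or c > hi}
        print(f"N_IQR_cutoff={n}: {len(flagged)} samples flagged")
```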
#### 08-MergeBatchSites
Read the full MergeBatchSites documentation [here](https://github.com/broadinstitute/gatk-sv#merge-batch-sites).
* If you only have one batch, skip this workflow.
-* For a multi-batch cohort, `08-MergeBatchSites` is a cohort-level workflow, so it is run on a `sample_set_set` containing all of the batches in the cohort. You can create this `sample_set_set` while you are launching the `08-MergeBatchSites` workflow: click "Select Data", choose "Create new sample_set_set [...]", check all the batches to include (all of the ones used in `03-TrainGCNV` through `07-FilterBatch`), and give it a name that follows the **Sample ID requirements**.
+* For a multi-batch cohort, `08-MergeBatchSites` is a cohort-level workflow, so it is run on a `sample_set_set` containing all of the batches in the cohort. You can create this `sample_set_set` while you are launching the `08-MergeBatchSites` workflow: click "Select Data", choose "Create new sample_set_set [...]", check all the batches to include (all of the ones used in `03-TrainGCNV` through `07c-FilterBatchSamples`), and give it a name that follows the **Sample ID requirements**.
#### 09-GenotypeBatch
Read the full GenotypeBatch documentation [here](https://github.com/broadinstitute/gatk-sv#genotype-batch).
-* Use the same `sample_set` definitions you used for `03-TrainGCNV` through `07-FilterBatch`.
+* Use the same `sample_set` definitions you used for `03-TrainGCNV` through `07c-FilterBatchSamples`.
* If you only have one batch, use the `09-GenotypeBatch_SingleBatch` version of the workflow.
#### 10-RegenotypeCNVs, 11-MakeCohortVcf, and 12-AnnotateVcf
diff --git a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/FilterBatch.json.tmpl b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/FilterBatch.json.tmpl
deleted file mode 100644
index 72c2feab2..000000000
--- a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/FilterBatch.json.tmpl
+++ /dev/null
@@ -1,19 +0,0 @@
-{
- "FilterBatch.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
- "FilterBatch.sv_base_mini_docker": "${workspace.sv_base_mini_docker}",
- "FilterBatch.linux_docker" : "${workspace.linux_docker}",
-
- "FilterBatch.outlier_cutoff_nIQR": "10000",
-
- "FilterBatch.primary_contigs_list": "${workspace.primary_contigs_list}",
- "FilterBatch.sv_pipeline_base_docker": "${workspace.sv_pipeline_base_docker}",
- "FilterBatch.ped_file": "${workspace.cohort_ped_file}",
-
- "FilterBatch.batch": "${this.sample_set_id}",
- "FilterBatch.depth_vcf" : "${this.clustered_depth_vcf}",
- "FilterBatch.manta_vcf" : "${this.clustered_manta_vcf}",
- "FilterBatch.wham_vcf" : "${this.clustered_wham_vcf}",
- "FilterBatch.melt_vcf" : "${this.clustered_melt_vcf}",
- "FilterBatch.evidence_metrics": "${this.metrics}",
- "FilterBatch.evidence_metrics_common": "${this.metrics_common}"
-}
diff --git a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/FilterBatchSamples.json.tmpl b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/FilterBatchSamples.json.tmpl
new file mode 100644
index 000000000..f2922ce18
--- /dev/null
+++ b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/FilterBatchSamples.json.tmpl
@@ -0,0 +1,11 @@
+{
+ "FilterBatchSamples.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
+ "FilterBatchSamples.sv_base_mini_docker": "${workspace.sv_base_mini_docker}",
+ "FilterBatchSamples.linux_docker" : "${workspace.linux_docker}",
+
+ "FilterBatchSamples.N_IQR_cutoff": "10000",
+
+ "FilterBatchSamples.batch": "${this.sample_set_id}",
+ "FilterBatchSamples.vcfs" : "${this.sites_filtered_vcfs}",
+ "FilterBatchSamples.sv_counts": "${this.sv_counts}"
+}
diff --git a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/FilterBatchSites.json.tmpl b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/FilterBatchSites.json.tmpl
new file mode 100644
index 000000000..755dd447d
--- /dev/null
+++ b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/FilterBatchSites.json.tmpl
@@ -0,0 +1,11 @@
+{
+ "FilterBatchSites.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
+
+ "FilterBatchSites.batch": "${this.sample_set_id}",
+ "FilterBatchSites.depth_vcf" : "${this.clustered_depth_vcf}",
+ "FilterBatchSites.manta_vcf" : "${this.clustered_manta_vcf}",
+ "FilterBatchSites.wham_vcf" : "${this.clustered_wham_vcf}",
+ "FilterBatchSites.melt_vcf" : "${this.clustered_melt_vcf}",
+ "FilterBatchSites.evidence_metrics": "${this.metrics}",
+ "FilterBatchSites.evidence_metrics_common": "${this.metrics_common}"
+}
diff --git a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/FilterOutlierSamples.json.tmpl b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/FilterOutlierSamples.json.tmpl
new file mode 100644
index 000000000..e521455cd
--- /dev/null
+++ b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/FilterOutlierSamples.json.tmpl
@@ -0,0 +1,10 @@
+{
+ "FilterOutlierSamples.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
+ "FilterOutlierSamples.sv_base_mini_docker": "${workspace.sv_base_mini_docker}",
+ "FilterOutlierSamples.linux_docker" : "${workspace.linux_docker}",
+
+ "FilterOutlierSamples.N_IQR_cutoff": "6",
+
+ "FilterOutlierSamples.name": "${this.sample_set_set_id}",
+ "FilterOutlierSamples.vcf" : "${this.output_vcf}"
+}
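
In these templates, `${workspace.*}` and `${this.*}` are Terra data-model references, resolved at launch time against workspace attributes and the selected entity. A toy stand-in for that substitution (Terra does this server-side; the attribute value here is invented):

```python
import json
import re

template = '{"FilterOutlierSamples.N_IQR_cutoff": "6", "FilterOutlierSamples.vcf": "${this.output_vcf}"}'
attrs = {"this.output_vcf": "gs://example-bucket/cohort.vcf.gz"}  # invented value

# Replace each ${...} token with the matching entity/workspace attribute.
resolved = re.sub(r"\$\{([^}]+)\}", lambda m: attrs[m.group(1)], template)
print(json.loads(resolved)["FilterOutlierSamples.vcf"])
```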
diff --git a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/GenotypeBatch.SingleBatch.json.tmpl b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/GenotypeBatch.SingleBatch.json.tmpl
index c1f36e952..6154ee1d6 100644
--- a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/GenotypeBatch.SingleBatch.json.tmpl
+++ b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/GenotypeBatch.SingleBatch.json.tmpl
@@ -16,14 +16,14 @@
"GenotypeBatch.batch": "${this.sample_set_id}",
"GenotypeBatch.rf_cutoffs": "${this.cutoffs}",
- "GenotypeBatch.batch_depth_vcf": "${this.filtered_depth_vcf}",
- "GenotypeBatch.batch_pesr_vcf": "${this.filtered_pesr_vcf}",
+ "GenotypeBatch.batch_depth_vcf": "${this.outlier_filtered_depth_vcf}",
+ "GenotypeBatch.batch_pesr_vcf": "${this.outlier_filtered_pesr_vcf}",
"GenotypeBatch.ped_file": "${workspace.cohort_ped_file}",
"GenotypeBatch.bin_exclude": "${workspace.bin_exclude}",
"GenotypeBatch.discfile": "${this.merged_PE}",
"GenotypeBatch.coveragefile": "${this.merged_bincov}",
"GenotypeBatch.splitfile": "${this.merged_SR}",
"GenotypeBatch.medianfile": "${this.median_cov}",
- "GenotypeBatch.cohort_depth_vcf": "${this.filtered_depth_vcf}",
- "GenotypeBatch.cohort_pesr_vcf": "${this.filtered_pesr_vcf}"
+ "GenotypeBatch.cohort_depth_vcf": "${this.outlier_filtered_depth_vcf}",
+ "GenotypeBatch.cohort_pesr_vcf": "${this.outlier_filtered_pesr_vcf}"
}
diff --git a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/GenotypeBatch.json.tmpl b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/GenotypeBatch.json.tmpl
index b617fa964..9f6951399 100644
--- a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/GenotypeBatch.json.tmpl
+++ b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/GenotypeBatch.json.tmpl
@@ -16,8 +16,8 @@
"GenotypeBatch.batch": "${this.sample_set_id}",
"GenotypeBatch.rf_cutoffs": "${this.cutoffs}",
- "GenotypeBatch.batch_depth_vcf": "${this.filtered_depth_vcf}",
- "GenotypeBatch.batch_pesr_vcf": "${this.filtered_pesr_vcf}",
+ "GenotypeBatch.batch_depth_vcf": "${this.outlier_filtered_depth_vcf}",
+ "GenotypeBatch.batch_pesr_vcf": "${this.outlier_filtered_pesr_vcf}",
"GenotypeBatch.ped_file": "${workspace.cohort_ped_file}",
"GenotypeBatch.bin_exclude": "${workspace.bin_exclude}",
"GenotypeBatch.discfile": "${this.merged_PE}",
diff --git a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/MergeBatchSites.json.tmpl b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/MergeBatchSites.json.tmpl
index 95a7daf97..6284f8184 100644
--- a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/MergeBatchSites.json.tmpl
+++ b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/MergeBatchSites.json.tmpl
@@ -1,6 +1,6 @@
{
"MergeBatchSites.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
"MergeBatchSites.cohort": "${this.sample_set_set_id}",
- "MergeBatchSites.pesr_vcfs": "${this.sample_sets.filtered_pesr_vcf}",
- "MergeBatchSites.depth_vcfs": "${this.sample_sets.filtered_depth_vcf}"
+ "MergeBatchSites.pesr_vcfs": "${this.sample_sets.outlier_filtered_pesr_vcf}",
+ "MergeBatchSites.depth_vcfs": "${this.sample_sets.outlier_filtered_depth_vcf}"
}
diff --git a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/PlotSVCountsPerSample.json.tmpl b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/PlotSVCountsPerSample.json.tmpl
new file mode 100644
index 000000000..8e66ff73a
--- /dev/null
+++ b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/PlotSVCountsPerSample.json.tmpl
@@ -0,0 +1,9 @@
+{
+ "PlotSVCountsPerSample.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
+
+ "PlotSVCountsPerSample.N_IQR_cutoff": "6",
+
+ "PlotSVCountsPerSample.prefix": "${this.sample_set_id}",
+ "PlotSVCountsPerSample.vcfs" : "${this.sites_filtered_vcfs}",
+ "PlotSVCountsPerSample.vcf_identifiers" : "${this.algorithms_filtersites}"
+}
diff --git a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/RegenotypeCNVs.SingleBatch.json.tmpl b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/RegenotypeCNVs.SingleBatch.json.tmpl
index 2c8307867..c94ee2139 100644
--- a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/RegenotypeCNVs.SingleBatch.json.tmpl
+++ b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/RegenotypeCNVs.SingleBatch.json.tmpl
@@ -12,10 +12,10 @@
"RegenotypeCNVs.RD_depth_sepcutoffs": "${this.trained_genotype_depth_depth_sepcutoff}",
- "RegenotypeCNVs.cohort_depth_vcf": "${this.filtered_depth_vcf}",
+ "RegenotypeCNVs.cohort_depth_vcf": "${this.outlier_filtered_depth_vcf}",
"RegenotypeCNVs.ped_file": "${workspace.cohort_ped_file}",
- "RegenotypeCNVs.batch_depth_vcfs": "${this.filtered_depth_vcf}",
+ "RegenotypeCNVs.batch_depth_vcfs": "${this.outlier_filtered_depth_vcf}",
"RegenotypeCNVs.depth_vcfs": "${this.genotyped_depth_vcf}",
"RegenotypeCNVs.coveragefiles": "${this.merged_bincov}",
diff --git a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/RegenotypeCNVs.json.tmpl b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/RegenotypeCNVs.json.tmpl
index 82584f798..975f5448f 100644
--- a/input_templates/terra_workspaces/cohort_mode/workflow_configurations/RegenotypeCNVs.json.tmpl
+++ b/input_templates/terra_workspaces/cohort_mode/workflow_configurations/RegenotypeCNVs.json.tmpl
@@ -15,7 +15,7 @@
"RegenotypeCNVs.cohort_depth_vcf": "${workspace.cohort_depth_vcf}",
"RegenotypeCNVs.ped_file": "${workspace.cohort_ped_file}",
- "RegenotypeCNVs.batch_depth_vcfs": "${this.sample_sets.filtered_depth_vcf}",
+ "RegenotypeCNVs.batch_depth_vcfs": "${this.sample_sets.outlier_filtered_depth_vcf}",
"RegenotypeCNVs.depth_vcfs": "${this.sample_sets.genotyped_depth_vcf}",
"RegenotypeCNVs.coveragefiles": "${this.sample_sets.merged_bincov}",
diff --git a/input_values/test_batch_large.json b/input_values/test_batch_large.json
index c7cf4ba86..d9e8ef0bf 100644
--- a/input_values/test_batch_large.json
+++ b/input_values/test_batch_large.json
@@ -1840,6 +1840,14 @@
"TCGA-W9-A837-10A-01D-A706-36"
],
"samples_post_filtering_file" : "gs://gatk-sv-resources/test/module03/large/output/test_large.post03_outliers_excluded.samples.list",
+ "sites_filtered_svcounts_depth": "gs://gatk-sv-resources/test/module03/large/SVCounts/test_large.depth.svcounts.txt",
+ "sites_filtered_svcounts_manta": "gs://gatk-sv-resources/test/module03/large/SVCounts/test_large.manta.svcounts.txt",
+ "sites_filtered_svcounts_melt": "gs://gatk-sv-resources/test/module03/large/SVCounts/test_large.melt.svcounts.txt",
+ "sites_filtered_svcounts_wham": "gs://gatk-sv-resources/test/module03/large/SVCounts/test_large.wham.svcounts.txt",
+ "sites_filtered_depth_vcf": "gs://gatk-sv-resources/test/module03/large/FilterBatchSites/test_large.depth.with_evidence.vcf.gz",
+ "sites_filtered_manta_vcf": "gs://gatk-sv-resources/test/module03/large/FilterBatchSites/test_large.manta.with_evidence.vcf.gz",
+ "sites_filtered_melt_vcf": "gs://gatk-sv-resources/test/module03/large/FilterBatchSites/test_large.melt.with_evidence.vcf.gz",
+ "sites_filtered_wham_vcf": "gs://gatk-sv-resources/test/module03/large/FilterBatchSites/test_large.wham.with_evidence.vcf.gz",
"snp_vcfs" : [
"gs://gatk-sv-resources/test/module00a/large/inputs/vcf/test_large.0.vcf.gz",
"gs://gatk-sv-resources/test/module00a/large/inputs/vcf/test_large.1.vcf.gz",
diff --git a/scripts/test/terra_validation.py b/scripts/test/terra_validation.py
index 14814a631..34bacc672 100644
--- a/scripts/test/terra_validation.py
+++ b/scripts/test/terra_validation.py
@@ -103,7 +103,7 @@ def main():
parser.add_argument("-j", "--womtool-jar", help="Path to womtool jar", required=True)
parser.add_argument("-n", "--num-input-jsons",
help="Number of Terra input JSONs expected",
- required=False, default=16, type=int)
+ required=False, default=19, type=int)
parser.add_argument("--log-level",
help="Specify level of logging information, ie. info, warning, error (not case-sensitive)",
required=False, default="INFO")
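
The default rises from 16 to 19 because this change removes one workflow configuration template (`FilterBatch.json.tmpl`) and adds four (`FilterBatchSites`, `PlotSVCountsPerSample`, `FilterBatchSamples`, `FilterOutlierSamples`). If the count drifts again, a quick recount (path taken from this diff):

```python
from pathlib import Path

tmpl_dir = Path("input_templates/terra_workspaces/cohort_mode/workflow_configurations")
# Should match the --num-input-jsons default in terra_validation.py.
print(len(list(tmpl_dir.glob("*.json.tmpl"))))
```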
diff --git a/test_input_templates/FilterBatch/FilterBatchSamples.json.tmpl b/test_input_templates/FilterBatch/FilterBatchSamples.json.tmpl
new file mode 100644
index 000000000..fb3e5763e
--- /dev/null
+++ b/test_input_templates/FilterBatch/FilterBatchSamples.json.tmpl
@@ -0,0 +1,24 @@
+{
+ "FilterBatchSamples.sv_pipeline_docker": {{ dockers.sv_pipeline_docker | tojson }},
+ "FilterBatchSamples.sv_base_mini_docker":{{ dockers.sv_base_mini_docker | tojson }},
+ "FilterBatchSamples.linux_docker" : {{ dockers.linux_docker | tojson }},
+
+ "FilterBatchSamples.N_IQR_cutoff": "10000",
+ "FilterBatchSamples.outlier_cutoff_table" : {{ test_batch.outlier_cutoff_table | tojson }},
+
+ "FilterBatchSamples.batch": {{ test_batch.batch_name | tojson }},
+ "FilterBatchSamples.vcfs" : [
+ {{ test_batch.sites_filtered_manta_vcf | tojson }},
+ null,
+ {{ test_batch.sites_filtered_wham_vcf | tojson }},
+ {{ test_batch.sites_filtered_melt_vcf | tojson }},
+ {{ test_batch.sites_filtered_depth_vcf | tojson }}
+ ],
+ "FilterBatchSamples.sv_counts": [
+ {{ test_batch.sites_filtered_svcounts_manta | tojson }},
+ null,
+ {{ test_batch.sites_filtered_svcounts_wham | tojson }},
+ {{ test_batch.sites_filtered_svcounts_melt | tojson }},
+ {{ test_batch.sites_filtered_svcounts_depth | tojson }}
+ ]
+}
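
The `vcfs` and `sv_counts` arrays are positional, parallel to the fixed algorithm list, with `null` marking a skipped caller (delly in this test template). The workflow's scatter guards each index with `defined(...)`; in Python terms (illustrative):

```python
algorithms = ["manta", "delly", "wham", "melt", "depth"]
vcfs = ["manta.vcf.gz", None, "wham.vcf.gz", "melt.vcf.gz", "depth.vcf.gz"]

for alg, vcf in zip(algorithms, vcfs):
    if vcf is None:  # mirrors WDL's `if (defined(vcfs[i]))` guard
        continue
    print(f"IdentifyOutlierSamples would run for {alg} on {vcf}")
```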
diff --git a/test_input_templates/FilterBatch/FilterBatchSites.json.tmpl b/test_input_templates/FilterBatch/FilterBatchSites.json.tmpl
new file mode 100644
index 000000000..55c5f8b50
--- /dev/null
+++ b/test_input_templates/FilterBatch/FilterBatchSites.json.tmpl
@@ -0,0 +1,11 @@
+{
+ "FilterBatchSites.sv_pipeline_docker": {{ dockers.sv_pipeline_docker | tojson }},
+
+ "FilterBatchSites.batch": {{ test_batch.batch_name | tojson }},
+ "FilterBatchSites.depth_vcf" : {{ test_batch.merged_depth_vcf | tojson }},
+ "FilterBatchSites.manta_vcf" : {{ test_batch.merged_manta_vcf | tojson }},
+ "FilterBatchSites.wham_vcf" : {{ test_batch.merged_wham_vcf | tojson }},
+ "FilterBatchSites.melt_vcf" : {{ test_batch.merged_melt_vcf | tojson }},
+ "FilterBatchSites.evidence_metrics": {{ test_batch.evidence_metrics | tojson }},
+ "FilterBatchSites.evidence_metrics_common": {{ test_batch.evidence_metrics_common | tojson }}
+}
diff --git a/test_input_templates/FilterBatch/PlotSVCountsPerSample.json.tmpl b/test_input_templates/FilterBatch/PlotSVCountsPerSample.json.tmpl
new file mode 100644
index 000000000..78382c466
--- /dev/null
+++ b/test_input_templates/FilterBatch/PlotSVCountsPerSample.json.tmpl
@@ -0,0 +1,14 @@
+{
+ "PlotSVCountsPerSample.sv_pipeline_docker": {{ dockers.sv_pipeline_docker | tojson }},
+
+ "PlotSVCountsPerSample.N_IQR_cutoff": "6",
+
+ "PlotSVCountsPerSample.prefix": {{ test_batch.batch_name | tojson }},
+ "PlotSVCountsPerSample.vcfs" : [
+ {{ test_batch.sites_filtered_manta_vcf | tojson }},
+ {{ test_batch.sites_filtered_wham_vcf | tojson }},
+ {{ test_batch.sites_filtered_melt_vcf | tojson }},
+ {{ test_batch.sites_filtered_depth_vcf | tojson }}
+ ],
+ "PlotSVCountsPerSample.vcf_identifiers" : ["manta", "wham", "melt", "depth"]
+}
diff --git a/test_input_templates/FilterOutlierSamples/FilterOutlierSamples.json.tmpl b/test_input_templates/FilterOutlierSamples/FilterOutlierSamples.json.tmpl
new file mode 100644
index 000000000..35c47cccc
--- /dev/null
+++ b/test_input_templates/FilterOutlierSamples/FilterOutlierSamples.json.tmpl
@@ -0,0 +1,11 @@
+{
+ "FilterOutlierSamples.sv_pipeline_docker": {{ dockers.sv_pipeline_docker | tojson }},
+ "FilterOutlierSamples.sv_base_mini_docker": {{ dockers.sv_base_mini_docker | tojson }},
+ "FilterOutlierSamples.linux_docker" : {{ dockers.linux_docker | tojson }},
+
+ "FilterOutlierSamples.N_IQR_cutoff": "6",
+
+ "FilterOutlierSamples.name": "test_large",
+ "FilterOutlierSamples.vcf_identifier": "cohort_outlier_filtered",
+ "FilterOutlierSamples.vcf" : {{ test_batch.baseline_final_vcf | tojson }}
+}
diff --git a/test_input_templates/FilterOutlierSamples/PlotSVCountsPerSample.json.tmpl b/test_input_templates/FilterOutlierSamples/PlotSVCountsPerSample.json.tmpl
new file mode 100644
index 000000000..bf1a6c127
--- /dev/null
+++ b/test_input_templates/FilterOutlierSamples/PlotSVCountsPerSample.json.tmpl
@@ -0,0 +1,11 @@
+{
+ "PlotSVCountsPerSample.sv_pipeline_docker": {{ dockers.sv_pipeline_docker | tojson }},
+
+ "PlotSVCountsPerSample.N_IQR_cutoff": "6",
+
+ "PlotSVCountsPerSample.prefix": {{ test_batch.batch_name | tojson }},
+ "PlotSVCountsPerSample.vcfs" : [
+ {{ test_batch.baseline_final_vcf | tojson }}
+ ],
+ "PlotSVCountsPerSample.vcf_identifiers" : ["cohort_outlier_filtered"]
+}
diff --git a/wdl/FilterBatch.wdl b/wdl/FilterBatch.wdl
index 5bee48f76..6dd9ed5b8 100644
--- a/wdl/FilterBatch.wdl
+++ b/wdl/FilterBatch.wdl
@@ -1,6 +1,8 @@
version 1.0
-import "FilterOutliers.wdl" as filter_outliers
+import "FilterBatchSites.wdl" as filter_sites
+import "PlotSVCountsPerSample.wdl" as sv_counts
+import "FilterBatchSamples.wdl" as filter_outliers
import "Utils.wdl" as util
import "FilterBatchMetrics.wdl" as metrics
@@ -37,96 +39,81 @@ workflow FilterBatch {
RuntimeAttr? runtime_attr_ids_from_vcf
RuntimeAttr? runtime_attr_merge_pesr_vcfs
+ RuntimeAttr? runtime_attr_count_svs
+ RuntimeAttr? runtime_attr_plot_svcounts
+ RuntimeAttr? runtime_attr_cat_outliers_preview
+
RuntimeAttr? runtime_attr_identify_outliers
RuntimeAttr? runtime_attr_exclude_outliers
RuntimeAttr? runtime_attr_cat_outliers
RuntimeAttr? runtime_attr_filter_samples
}
- Array[String] algorithms = ["manta", "delly", "wham", "melt", "depth"]
- Array[File?] vcfs_array = [manta_vcf, delly_vcf, wham_vcf, melt_vcf, depth_vcf]
- Int num_algorithms = length(algorithms)
-
- call AdjudicateSV {
+ call filter_sites.FilterBatchSites {
input:
- metrics = evidence_metrics,
batch = batch,
+ manta_vcf = manta_vcf,
+ melt_vcf = melt_vcf,
+ delly_vcf = delly_vcf,
+ depth_vcf = depth_vcf,
+ wham_vcf = wham_vcf,
+ evidence_metrics = evidence_metrics,
+ evidence_metrics_common = evidence_metrics_common,
sv_pipeline_docker = sv_pipeline_docker,
- runtime_attr_override = runtime_attr_adjudicate
+ runtime_attr_adjudicate = runtime_attr_adjudicate,
+ runtime_attr_rewrite_scores = runtime_attr_rewrite_scores,
+ runtime_attr_filter_annotate_vcf = runtime_attr_filter_annotate_vcf
}
- call RewriteScores {
+ call sv_counts.PlotSVCountsPerSample {
input:
- metrics = evidence_metrics_common,
- cutoffs = AdjudicateSV.cutoffs,
- scores = AdjudicateSV.scores,
- batch = batch,
+ prefix = batch,
+ vcfs = FilterBatchSites.sites_filtered_vcfs,
+ vcf_identifiers = FilterBatchSites.algorithms_filtersites,
+ N_IQR_cutoff = outlier_cutoff_nIQR,
sv_pipeline_docker = sv_pipeline_docker,
- runtime_attr_override = runtime_attr_rewrite_scores
- }
-
- scatter (i in range(num_algorithms)) {
- if (defined(vcfs_array[i])) {
- call FilterAnnotateVcf {
- input:
- vcf = select_first([vcfs_array[i]]),
- metrics = evidence_metrics,
- prefix = "${batch}.${algorithms[i]}",
- scores = RewriteScores.updated_scores,
- cutoffs = AdjudicateSV.cutoffs,
- sv_pipeline_docker = sv_pipeline_docker,
- runtime_attr_override = runtime_attr_filter_annotate_vcf
- }
- }
- }
-
- call util.GetSampleIdsFromVcf {
- input:
- vcf = select_first(vcfs_array),
- sv_base_mini_docker = sv_base_mini_docker,
- runtime_attr_override = runtime_attr_ids_from_vcf
+ runtime_attr_count_svs = runtime_attr_count_svs,
+ runtime_attr_plot_svcounts = runtime_attr_plot_svcounts,
+ runtime_attr_cat_outliers_preview = runtime_attr_cat_outliers_preview
}
- call filter_outliers.FilterOutlierSamples as FilterOutlierSamples {
+ call filter_outliers.FilterBatchSamples {
input:
batch = batch,
- algorithms = algorithms,
outlier_cutoff_table = outlier_cutoff_table,
N_IQR_cutoff = outlier_cutoff_nIQR,
- vcfs = FilterAnnotateVcf.annotated_vcf,
- samples = GetSampleIdsFromVcf.out_array,
+ vcfs = FilterBatchSites.sites_filtered_vcfs,
+ sv_counts = PlotSVCountsPerSample.sv_counts,
linux_docker = linux_docker,
sv_pipeline_docker = sv_pipeline_docker,
sv_base_mini_docker = sv_base_mini_docker,
runtime_attr_identify_outliers = runtime_attr_identify_outliers,
runtime_attr_exclude_outliers= runtime_attr_exclude_outliers,
runtime_attr_cat_outliers = runtime_attr_cat_outliers,
- runtime_attr_filter_samples = runtime_attr_filter_samples
- }
-
- call MergePesrVcfs {
- input:
- manta_vcf = FilterOutlierSamples.vcfs_noOutliers[0],
- delly_vcf = FilterOutlierSamples.vcfs_noOutliers[1],
- wham_vcf = FilterOutlierSamples.vcfs_noOutliers[2],
- melt_vcf = FilterOutlierSamples.vcfs_noOutliers[3],
- batch = batch,
- sv_base_mini_docker = sv_base_mini_docker,
- runtime_attr_override = runtime_attr_merge_pesr_vcfs
+ runtime_attr_filter_samples = runtime_attr_filter_samples,
+ runtime_attr_ids_from_vcf = runtime_attr_ids_from_vcf,
+ runtime_attr_merge_pesr_vcfs = runtime_attr_merge_pesr_vcfs,
+ runtime_attr_count_svs = runtime_attr_count_svs
}
Boolean run_module_metrics_ = if defined(run_module_metrics) then select_first([run_module_metrics]) else true
if (run_module_metrics_) {
+ call util.GetSampleIdsFromVcf {
+ input:
+ vcf = select_first([depth_vcf, wham_vcf, manta_vcf, melt_vcf, delly_vcf]),
+ sv_base_mini_docker = sv_base_mini_docker,
+ runtime_attr_override = runtime_attr_ids_from_vcf
+ }
call metrics.FilterBatchMetrics {
input:
name = batch,
samples = GetSampleIdsFromVcf.out_array,
- filtered_pesr_vcf = MergePesrVcfs.merged_pesr_vcf,
- filtered_depth_vcf = select_first([FilterOutlierSamples.vcfs_noOutliers[4]]),
- cutoffs = AdjudicateSV.cutoffs,
- outlier_list = FilterOutlierSamples.outlier_samples_excluded_file,
+ filtered_pesr_vcf = select_first([FilterBatchSamples.outlier_filtered_pesr_vcf]),
+ filtered_depth_vcf = select_first([FilterBatchSamples.outlier_filtered_depth_vcf]),
+ cutoffs = FilterBatchSites.cutoffs,
+ outlier_list = FilterBatchSamples.outlier_samples_excluded_file,
ped_file = select_first([ped_file]),
- samples_post_filtering_file = FilterOutlierSamples.filtered_batch_samples_file,
+ samples_post_filtering_file = FilterBatchSamples.filtered_batch_samples_file,
baseline_filtered_pesr_vcf = baseline_filtered_pesr_vcf,
baseline_filtered_depth_vcf = baseline_filtered_depth_vcf,
contig_list = select_first([primary_contigs_list]),
@@ -137,251 +124,23 @@ workflow FilterBatch {
}
output {
- File? filtered_manta_vcf = FilterOutlierSamples.vcfs_noOutliers[0]
- File? filtered_delly_vcf = FilterOutlierSamples.vcfs_noOutliers[1]
- File? filtered_wham_vcf = FilterOutlierSamples.vcfs_noOutliers[2]
- File? filtered_melt_vcf = FilterOutlierSamples.vcfs_noOutliers[3]
- File? filtered_depth_vcf = FilterOutlierSamples.vcfs_noOutliers[4]
- File filtered_pesr_vcf = MergePesrVcfs.merged_pesr_vcf
- File cutoffs = AdjudicateSV.cutoffs
- File scores = RewriteScores.updated_scores
- File RF_intermediate_files = AdjudicateSV.RF_intermediate_files
- Array[String] outlier_samples_excluded = FilterOutlierSamples.outlier_samples_excluded
- Array[String] batch_samples_postOutlierExclusion = FilterOutlierSamples.filtered_batch_samples_list
- File outlier_samples_excluded_file = FilterOutlierSamples.outlier_samples_excluded_file
- File batch_samples_postOutlierExclusion_file = FilterOutlierSamples.filtered_batch_samples_file
+ File? filtered_manta_vcf = FilterBatchSamples.outlier_filtered_manta_vcf
+ File? filtered_delly_vcf = FilterBatchSamples.outlier_filtered_delly_vcf
+ File? filtered_wham_vcf = FilterBatchSamples.outlier_filtered_wham_vcf
+ File? filtered_melt_vcf = FilterBatchSamples.outlier_filtered_melt_vcf
+ File? filtered_depth_vcf = FilterBatchSamples.outlier_filtered_depth_vcf
+ File? filtered_pesr_vcf = FilterBatchSamples.outlier_filtered_pesr_vcf
+ File cutoffs = FilterBatchSites.cutoffs
+ File scores = FilterBatchSites.scores
+ File RF_intermediate_files = FilterBatchSites.RF_intermediate_files
+ Array[File?] sv_counts = PlotSVCountsPerSample.sv_counts
+ Array[File?] sv_count_plots = PlotSVCountsPerSample.sv_count_plots
+ Array[String] outlier_samples_excluded = FilterBatchSamples.outlier_samples_excluded
+ Array[String] batch_samples_postOutlierExclusion = FilterBatchSamples.filtered_batch_samples_list
+ File outlier_samples_excluded_file = FilterBatchSamples.outlier_samples_excluded_file
+ File batch_samples_postOutlierExclusion_file = FilterBatchSamples.filtered_batch_samples_file
File? metrics_file_filterbatch = FilterBatchMetrics.metrics_file
}
}
-task AdjudicateSV {
- input {
- File metrics
- String batch
- String sv_pipeline_docker
- RuntimeAttr? runtime_attr_override
- }
-
- RuntimeAttr default_attr = object {
- cpu_cores: 1,
- mem_gb: 3.75,
- disk_gb: 10,
- boot_disk_gb: 10,
- preemptible_tries: 3,
- max_retries: 1
- }
- RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
-
- output {
- File scores = "${batch}.scores"
- File cutoffs = "${batch}.cutoffs"
- File RF_intermediate_files = "${batch}.RF_intermediate_files.tar.gz"
- }
- command <<<
-
- set -euo pipefail
- svtk adjudicate ~{metrics} ~{batch}.scores ~{batch}.cutoffs
- mkdir ~{batch}.RF_intermediate_files
- mv *_trainable.txt ~{batch}.RF_intermediate_files/
- mv *_testable.txt ~{batch}.RF_intermediate_files/
- tar -czvf ~{batch}.RF_intermediate_files.tar.gz ~{batch}.RF_intermediate_files
-
- >>>
- runtime {
- cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
- memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
- disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
- bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: sv_pipeline_docker
- preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
- maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
- }
-}
-
-task RewriteScores {
- input {
- File metrics
- File cutoffs
- File scores
- String batch
- String sv_pipeline_docker
- RuntimeAttr? runtime_attr_override
- }
-
- RuntimeAttr default_attr = object {
- cpu_cores: 1,
- mem_gb: 3.75,
- disk_gb: 10,
- boot_disk_gb: 10,
- preemptible_tries: 3,
- max_retries: 1
- }
- RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
-
- output {
- File updated_scores = "${batch}.updated_scores"
- }
- command <<<
-
- set -euo pipefail
- Rscript /opt/sv-pipeline/03_variant_filtering/scripts/modify_cutoffs.R \
- -c ~{cutoffs} \
- -m ~{metrics} \
- -s ~{scores} \
- -o ~{batch}.updated_scores
-
- >>>
- runtime {
- cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
- memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
- disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
- bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: sv_pipeline_docker
- preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
- maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
- }
-}
-
-task UpdatePedFile {
- input {
- File ped_file
- Array[String] excluded_samples
- String batch
- String linux_docker
- RuntimeAttr? runtime_attr_override
- }
-
- RuntimeAttr default_attr = object {
- cpu_cores: 1,
- mem_gb: 3.75,
- disk_gb: 10,
- boot_disk_gb: 10,
- preemptible_tries: 3,
- max_retries: 1
- }
- RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
-
- output {
- File filtered_ped_file = "${batch}.outlier_samples_removed.fam"
- }
- command <<<
-
- fgrep -wvf ~{write_lines(excluded_samples)} ~{ped_file} > ~{batch}.outlier_samples_removed.fam
-
- >>>
- runtime {
- cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
- memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
- disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
- bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: linux_docker
- preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
- maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
- }
-
-}
-
-task MergePesrVcfs {
- input {
- File? manta_vcf
- File? delly_vcf
- File? wham_vcf
- File? melt_vcf
- String batch
- String sv_base_mini_docker
- RuntimeAttr? runtime_attr_override
- }
-
- RuntimeAttr default_attr = object {
- cpu_cores: 1,
- mem_gb: 3.75,
- disk_gb: 10,
- boot_disk_gb: 10,
- preemptible_tries: 3,
- max_retries: 1
- }
- RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
-
- Array[File] vcfs_array = select_all([manta_vcf, delly_vcf, wham_vcf, melt_vcf])
-
- output {
- File merged_pesr_vcf = "${batch}.filtered_pesr_merged.vcf.gz"
- }
- command <<<
-
- set -euo pipefail
- vcf-concat ~{sep=" " vcfs_array} \
- | vcf-sort -c \
- | bgzip -c \
- > ~{batch}.filtered_pesr_merged.vcf.gz
-
- >>>
- runtime {
- cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
- memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
- disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
- bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: sv_base_mini_docker
- preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
- maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
- }
-
-}
-
-task FilterAnnotateVcf {
- input {
- File vcf
- File metrics
- File scores
- File cutoffs
- String prefix
- String sv_pipeline_docker
- RuntimeAttr? runtime_attr_override
- }
-
- RuntimeAttr default_attr = object {
- cpu_cores: 1,
- mem_gb: 3.75,
- disk_gb: 10,
- boot_disk_gb: 10,
- preemptible_tries: 3,
- max_retries: 1
- }
- RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
-
- output {
- File annotated_vcf = "${prefix}.with_evidence.vcf.gz"
- }
- command <<<
-
- set -euo pipefail
- cat \
- <(sed -e '1d' ~{scores} | fgrep -e DEL -e DUP | awk '($3>=0.5)' | cut -f1 | fgrep -w -f - <(zcat ~{vcf})) \
- <(sed -e '1d' ~{scores} | fgrep -e INV -e BND -e INS | awk '($3>=0.5)' | cut -f1 | fgrep -w -f - <(zcat ~{vcf}) | sed -e 's/SVTYPE=DEL/SVTYPE=BND/' -e 's/SVTYPE=DUP/SVTYPE=BND/' -e 's/<DEL>/<BND>/' -e 's/<DUP>/<BND>/') \
- | cat <(sed -n -e '/^#/p' <(zcat ~{vcf})) - \
- | vcf-sort -c \
- | bgzip -c \
- > filtered.vcf.gz
-
- /opt/sv-pipeline/03_variant_filtering/scripts/rewrite_SR_coords.py filtered.vcf.gz ~{metrics} ~{cutoffs} stdout \
- | vcf-sort -c \
- | bgzip -c \
- > filtered.corrected_coords.vcf.gz
-
- /opt/sv-pipeline/03_variant_filtering/scripts/annotate_RF_evidence.py filtered.corrected_coords.vcf.gz ~{scores} ~{prefix}.with_evidence.vcf
- bgzip ~{prefix}.with_evidence.vcf
-
- >>>
- runtime {
- cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
- memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
- disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
- bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: sv_pipeline_docker
- preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
- maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
- }
-
-}
-
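
Net effect of this refactor: `FilterBatch.wdl` keeps its interface but becomes a thin wrapper that chains the three new subworkflows, so Terra users can run them separately while cohort-mode users run the wrapper unchanged. The data flow, as a runnable Python schematic with hypothetical stand-in functions:

```python
from types import SimpleNamespace as NS

# Hypothetical stand-ins that model only each subworkflow's inputs/outputs.
def filter_batch_sites(batch, algs):
    return NS(sites_filtered_vcfs=[f"{batch}.{a}.with_evidence.vcf.gz" for a in algs])

def plot_sv_counts_per_sample(prefix, vcfs, n_iqr):
    return NS(sv_counts=[v + ".svcounts.txt" for v in vcfs])

def filter_batch_samples(batch, vcfs, sv_counts, n_iqr):
    return NS(vcfs=[v.replace(".vcf.gz", ".outliers_removed.vcf.gz") for v in vcfs])

sites = filter_batch_sites("batch1", ["manta", "wham", "depth"])
counts = plot_sv_counts_per_sample("batch1", sites.sites_filtered_vcfs, n_iqr=6)
final = filter_batch_samples("batch1", sites.sites_filtered_vcfs, counts.sv_counts, n_iqr=6)
print(final.vcfs)
```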
diff --git a/wdl/FilterBatchSamples.wdl b/wdl/FilterBatchSamples.wdl
new file mode 100644
index 000000000..95f656d70
--- /dev/null
+++ b/wdl/FilterBatchSamples.wdl
@@ -0,0 +1,160 @@
+version 1.0
+
+import "FilterOutlierSamples.wdl" as filter_outliers
+import "Structs.wdl"
+import "Utils.wdl" as util
+import "IdentifyOutlierSamples.wdl" as identify_outliers
+
+# Workflow to identify & filter outliers from VCFs as part of FilterBatch after FilterBatchSites & PlotSVCountsPerSample
+workflow FilterBatchSamples {
+ input {
+ String batch
+ Array[File?] vcfs # in order of algorithms array: ["manta", "delly", "wham", "melt", "depth"]. To skip an algorithm, use the null keyword, e.g. ["manta.vcf.gz", null, "wham.vcf.gz", null, "depth.vcf.gz"]
+ Array[File?] sv_counts # one SV counts file from PlotSVCountsPerSample per VCF in the same order
+ Int N_IQR_cutoff
+ File? outlier_cutoff_table
+ String sv_pipeline_docker
+ String sv_base_mini_docker
+ String linux_docker
+ RuntimeAttr? runtime_attr_identify_outliers
+ RuntimeAttr? runtime_attr_exclude_outliers
+ RuntimeAttr? runtime_attr_cat_outliers
+ RuntimeAttr? runtime_attr_filter_samples
+ RuntimeAttr? runtime_attr_ids_from_vcf
+ RuntimeAttr? runtime_attr_count_svs
+ RuntimeAttr? runtime_attr_merge_pesr_vcfs
+ }
+
+ Array[String] algorithms = ["manta", "delly", "wham", "melt", "depth"] # fixed algorithms to enable File outputs to be determined
+ Int num_algorithms = length(algorithms)
+
+ scatter (i in range(num_algorithms)) {
+ if (defined(vcfs[i])) {
+ call identify_outliers.IdentifyOutlierSamples {
+ input:
+ vcf = select_first([vcfs[i]]),
+ name = batch,
+ sv_counts = sv_counts[i],
+ N_IQR_cutoff = N_IQR_cutoff,
+ outlier_cutoff_table = outlier_cutoff_table,
+ vcf_identifier = algorithms[i],
+ sv_pipeline_docker = sv_pipeline_docker,
+ linux_docker = linux_docker,
+ runtime_attr_identify_outliers = runtime_attr_identify_outliers,
+ runtime_attr_cat_outliers = runtime_attr_cat_outliers,
+ runtime_attr_count_svs = runtime_attr_count_svs
+ }
+ }
+ }
+
+ # Merge list of outliers from all algorithms
+ call identify_outliers.CatOutliers {
+ input:
+ outliers = select_all(IdentifyOutlierSamples.outlier_samples_file),
+ batch = batch,
+ linux_docker = linux_docker,
+ runtime_attr_override = runtime_attr_cat_outliers
+ }
+
+ scatter (i in range(num_algorithms)) {
+ if (defined(vcfs[i])) {
+ call filter_outliers.ExcludeOutliers {
+ input:
+ vcf = select_first([vcfs[i]]),
+ outliers_list = CatOutliers.outliers_list,
+ outfile = "${batch}.${algorithms[i]}.outliers_removed.vcf.gz",
+ sv_base_mini_docker = sv_base_mini_docker,
+ runtime_attr_override = runtime_attr_exclude_outliers
+ }
+ }
+ }
+
+ call util.GetSampleIdsFromVcf {
+ input:
+ vcf = select_first(vcfs),
+ sv_base_mini_docker = sv_base_mini_docker,
+ runtime_attr_override = runtime_attr_ids_from_vcf
+ }
+
+ # Write new list of samples without outliers
+ call filter_outliers.FilterSampleList {
+ input:
+ original_samples = GetSampleIdsFromVcf.out_array,
+ outlier_samples = CatOutliers.outliers_list,
+ batch = batch,
+ linux_docker = linux_docker,
+ runtime_attr_override = runtime_attr_filter_samples
+ }
+
+ call MergePesrVcfs {
+ input:
+ manta_vcf = ExcludeOutliers.vcf_no_outliers[0],
+ delly_vcf = ExcludeOutliers.vcf_no_outliers[1],
+ wham_vcf = ExcludeOutliers.vcf_no_outliers[2],
+ melt_vcf = ExcludeOutliers.vcf_no_outliers[3],
+ batch = batch,
+ sv_base_mini_docker = sv_base_mini_docker,
+ runtime_attr_override = runtime_attr_merge_pesr_vcfs
+ }
+
+ output {
+ File? outlier_filtered_manta_vcf = ExcludeOutliers.vcf_no_outliers[0]
+ File? outlier_filtered_delly_vcf = ExcludeOutliers.vcf_no_outliers[1]
+ File? outlier_filtered_wham_vcf = ExcludeOutliers.vcf_no_outliers[2]
+ File? outlier_filtered_melt_vcf = ExcludeOutliers.vcf_no_outliers[3]
+ File? outlier_filtered_depth_vcf = ExcludeOutliers.vcf_no_outliers[4]
+ File? outlier_filtered_pesr_vcf = MergePesrVcfs.merged_pesr_vcf
+ Array[String] filtered_batch_samples_list = FilterSampleList.filtered_samples_list
+ File filtered_batch_samples_file = FilterSampleList.filtered_samples_file
+ Array[String] outlier_samples_excluded = CatOutliers.outliers_list
+ File outlier_samples_excluded_file = CatOutliers.outliers_file
+ }
+}
+
+task MergePesrVcfs {
+ input {
+ File? manta_vcf
+ File? delly_vcf
+ File? wham_vcf
+ File? melt_vcf
+ String batch
+ String sv_base_mini_docker
+ RuntimeAttr? runtime_attr_override
+ }
+
+ RuntimeAttr default_attr = object {
+ cpu_cores: 1,
+ mem_gb: 3.75,
+ disk_gb: 10,
+ boot_disk_gb: 10,
+ preemptible_tries: 3,
+ max_retries: 1
+ }
+ RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
+
+ Array[File] vcfs_array = select_all([manta_vcf, delly_vcf, wham_vcf, melt_vcf])
+
+ output {
+ File merged_pesr_vcf = "${batch}.filtered_pesr_merged.vcf.gz"
+ }
+ command <<<
+
+ set -euo pipefail
+ vcf-concat ~{sep=" " vcfs_array} \
+ | vcf-sort -c \
+ | bgzip -c \
+ > ~{batch}.filtered_pesr_merged.vcf.gz
+
+ >>>
+ runtime {
+ cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
+ memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
+ disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
+ bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
+ docker: sv_base_mini_docker
+ preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
+ maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
+ }
+
+}
+
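
Outside WDL, the `MergePesrVcfs` command is just a concat/sort/compress pipeline over whichever PESR VCFs are present (`select_all` drops the missing ones). A hedged subprocess sketch, assuming vcftools' `vcf-concat`/`vcf-sort` and `bgzip` are on PATH as in the sv_base_mini docker:

```python
import shlex
import subprocess

def merge_pesr_vcfs(vcfs, batch):
    """Concatenate, sort, and bgzip PESR VCFs, mirroring the MergePesrVcfs task."""
    present = [v for v in vcfs if v is not None]  # WDL: select_all(...)
    out = f"{batch}.filtered_pesr_merged.vcf.gz"
    pipeline = (f"vcf-concat {' '.join(shlex.quote(v) for v in present)}"
                f" | vcf-sort -c | bgzip -c > {shlex.quote(out)}")
    subprocess.run(["bash", "-euo", "pipefail", "-c", pipeline], check=True)
    return out
```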
diff --git a/wdl/FilterBatchSites.wdl b/wdl/FilterBatchSites.wdl
new file mode 100644
index 000000000..b39962674
--- /dev/null
+++ b/wdl/FilterBatchSites.wdl
@@ -0,0 +1,213 @@
+version 1.0
+
+import "Structs.wdl"
+
+workflow FilterBatchSites {
+ input {
+ String batch
+ File? manta_vcf
+ File? delly_vcf
+ File? wham_vcf
+ File? melt_vcf
+ File? depth_vcf
+ File evidence_metrics
+ File evidence_metrics_common
+
+ String sv_pipeline_docker
+ RuntimeAttr? runtime_attr_adjudicate
+ RuntimeAttr? runtime_attr_rewrite_scores
+ RuntimeAttr? runtime_attr_filter_annotate_vcf
+ RuntimeAttr? runtime_attr_merge_pesr_vcfs
+
+ }
+
+ Array[String] algorithms = ["manta", "delly", "wham", "melt", "depth"]
+ Array[File?] vcfs_array = [manta_vcf, delly_vcf, wham_vcf, melt_vcf, depth_vcf]
+ Int num_algorithms = length(algorithms)
+
+ call AdjudicateSV {
+ input:
+ metrics = evidence_metrics,
+ batch = batch,
+ sv_pipeline_docker = sv_pipeline_docker,
+ runtime_attr_override = runtime_attr_adjudicate
+ }
+
+ call RewriteScores {
+ input:
+ metrics = evidence_metrics_common,
+ cutoffs = AdjudicateSV.cutoffs,
+ scores = AdjudicateSV.scores,
+ batch = batch,
+ sv_pipeline_docker = sv_pipeline_docker,
+ runtime_attr_override = runtime_attr_rewrite_scores
+ }
+
+ scatter (i in range(num_algorithms)) {
+ if (defined(vcfs_array[i])) {
+ call FilterAnnotateVcf {
+ input:
+ vcf = select_first([vcfs_array[i]]),
+ metrics = evidence_metrics,
+ prefix = "${batch}.${algorithms[i]}",
+ scores = RewriteScores.updated_scores,
+ cutoffs = AdjudicateSV.cutoffs,
+ sv_pipeline_docker = sv_pipeline_docker,
+ runtime_attr_override = runtime_attr_filter_annotate_vcf
+ }
+ }
+ }
+
+ output {
+ Array[File?] sites_filtered_vcfs = FilterAnnotateVcf.annotated_vcf
+ File cutoffs = AdjudicateSV.cutoffs
+ File scores = RewriteScores.updated_scores
+ File RF_intermediate_files = AdjudicateSV.RF_intermediate_files
+ Array[String] algorithms_filtersites = algorithms
+ }
+}
+
+task AdjudicateSV {
+ input {
+ File metrics
+ String batch
+ String sv_pipeline_docker
+ RuntimeAttr? runtime_attr_override
+ }
+
+ RuntimeAttr default_attr = object {
+ cpu_cores: 1,
+ mem_gb: 3.75,
+ disk_gb: 10,
+ boot_disk_gb: 10,
+ preemptible_tries: 3,
+ max_retries: 1
+ }
+ RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
+
+ output {
+ File scores = "${batch}.scores"
+ File cutoffs = "${batch}.cutoffs"
+ File RF_intermediate_files = "${batch}.RF_intermediate_files.tar.gz"
+ }
+ command <<<
+
+ set -euo pipefail
+ svtk adjudicate ~{metrics} ~{batch}.scores ~{batch}.cutoffs
+ mkdir ~{batch}.RF_intermediate_files
+ mv *_trainable.txt ~{batch}.RF_intermediate_files/
+ mv *_testable.txt ~{batch}.RF_intermediate_files/
+ tar -czvf ~{batch}.RF_intermediate_files.tar.gz ~{batch}.RF_intermediate_files
+
+ >>>
+ runtime {
+ cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
+ memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
+ disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
+ bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
+ docker: sv_pipeline_docker
+ preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
+ maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
+ }
+}
+
+task RewriteScores {
+ input {
+ File metrics
+ File cutoffs
+ File scores
+ String batch
+ String sv_pipeline_docker
+ RuntimeAttr? runtime_attr_override
+ }
+
+ RuntimeAttr default_attr = object {
+ cpu_cores: 1,
+ mem_gb: 3.75,
+ disk_gb: 10,
+ boot_disk_gb: 10,
+ preemptible_tries: 3,
+ max_retries: 1
+ }
+ RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
+
+ output {
+ File updated_scores = "${batch}.updated_scores"
+ }
+ command <<<
+
+ set -euo pipefail
+ Rscript /opt/sv-pipeline/03_variant_filtering/scripts/modify_cutoffs.R \
+ -c ~{cutoffs} \
+ -m ~{metrics} \
+ -s ~{scores} \
+ -o ~{batch}.updated_scores
+
+ >>>
+ runtime {
+ cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
+ memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
+ disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
+ bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
+ docker: sv_pipeline_docker
+ preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
+ maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
+ }
+}
+
+task FilterAnnotateVcf {
+ input {
+ File vcf
+ File metrics
+ File scores
+ File cutoffs
+ String prefix
+ String sv_pipeline_docker
+ RuntimeAttr? runtime_attr_override
+ }
+
+ RuntimeAttr default_attr = object {
+ cpu_cores: 1,
+ mem_gb: 3.75,
+ disk_gb: 10,
+ boot_disk_gb: 10,
+ preemptible_tries: 3,
+ max_retries: 1
+ }
+ RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
+
+ output {
+ File annotated_vcf = "${prefix}.with_evidence.vcf.gz"
+ }
+ command <<<
+
+ set -euo pipefail
+ cat \
+ <(sed -e '1d' ~{scores} | fgrep -e DEL -e DUP | awk '($3>=0.5)' | cut -f1 | fgrep -w -f - <(zcat ~{vcf})) \
+ <(sed -e '1d' ~{scores} | fgrep -e INV -e BND -e INS | awk '($3>=0.5)' | cut -f1 | fgrep -w -f - <(zcat ~{vcf}) | sed -e 's/SVTYPE=DEL/SVTYPE=BND/' -e 's/SVTYPE=DUP/SVTYPE=BND/' -e 's/<DEL>/<BND>/' -e 's/<DUP>/<BND>/') \
+ | cat <(sed -n -e '/^#/p' <(zcat ~{vcf})) - \
+ | vcf-sort -c \
+ | bgzip -c \
+ > filtered.vcf.gz
+
+ /opt/sv-pipeline/03_variant_filtering/scripts/rewrite_SR_coords.py filtered.vcf.gz ~{metrics} ~{cutoffs} stdout \
+ | vcf-sort -c \
+ | bgzip -c \
+ > filtered.corrected_coords.vcf.gz
+
+ /opt/sv-pipeline/03_variant_filtering/scripts/annotate_RF_evidence.py filtered.corrected_coords.vcf.gz ~{scores} ~{prefix}.with_evidence.vcf
+ bgzip ~{prefix}.with_evidence.vcf
+
+ >>>
+ runtime {
+ cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
+ memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
+ disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
+ bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
+ docker: sv_pipeline_docker
+ preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
+ maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
+ }
+
+}
+
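
The `FilterAnnotateVcf` shell pipeline is dense; in outline, it keeps variants whose random-forest score is at least 0.5 and rewrites passing INV/BND/INS records to `SVTYPE=BND` (including the `<DEL>`/`<DUP>` ALT alleles). A simplified Python rendering of the selection step (assumes the scores file is a header-plus-TSV with name, SV type, and score in the first three columns, which is how the shell greps it; not the actual script):

```python
import csv

def passing_ids(scores_tsv, min_score=0.5):
    """Split passing variant IDs into DEL/DUP (kept as-is) and INV/BND/INS
    (kept but relabeled SVTYPE=BND), mirroring the two fgrep branches."""
    keep_cnv, keep_bnd = set(), set()
    with open(scores_tsv) as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader)                      # sed -e '1d': drop the header
        for name, svtype, score, *rest in reader:
            if float(score) < min_score:  # awk '($3>=0.5)'
                continue
            if svtype in ("DEL", "DUP"):
                keep_cnv.add(name)
            elif svtype in ("INV", "BND", "INS"):
                keep_bnd.add(name)        # rewritten to SVTYPE=BND downstream
    return keep_cnv, keep_bnd
```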
diff --git a/wdl/FilterOutlierSamples.wdl b/wdl/FilterOutlierSamples.wdl
index b3e6a9eab..852f57881 100644
--- a/wdl/FilterOutlierSamples.wdl
+++ b/wdl/FilterOutlierSamples.wdl
@@ -1,378 +1,144 @@
-version 1.0
+version 1.0
import "Structs.wdl"
+import "Utils.wdl" as util
+import "IdentifyOutlierSamples.wdl" as identify_outliers
-# This is an analysis WDL to identify & filter outliers from VCFs
-# after minGQ filtering at the end of the Talkowski SV pipeline
-
-# Treats PCR+ and PCR- samples separately
-
-workflow FilterOutlierSamplesPostMinGQ {
- input{
+# Filter outlier samples by IQR or cutoff table for a single VCF. Recommended to run PlotSVCountsPerSample first to choose cutoff
+workflow FilterOutlierSamples {
+ input {
+ String name # batch or cohort
File vcf
- File vcf_idx
- File? pcrplus_samples_list
- Int? n_iqr_cutoff_pcrplus
- Int n_iqr_cutoff_pcrminus
- String prefix
- File autosomes_list
+ File? sv_counts # SV counts file from PlotSVCountsPerSample; if not provided, it will be generated
+ Int N_IQR_cutoff
+ File? outlier_cutoff_table
+ String? vcf_identifier # required (enter algorithm here) if providing outlier_cutoff_table, otherwise used in some file prefixes
String sv_pipeline_docker
- }
- Array[Array[String]] contigs=read_tsv(autosomes_list)
- Boolean PCRPLUS = defined(pcrplus_samples_list)
-
- # Write original list of unfiltered samples and split by PCR status
- call WriteSamplesList {
- input:
- vcf=vcf,
- vcf_idx=vcf_idx,
- pcrplus_samples_list=pcrplus_samples_list,
- prefix=prefix,
- sv_pipeline_docker=sv_pipeline_docker
+ String sv_base_mini_docker
+ String linux_docker
+ RuntimeAttr? runtime_attr_identify_outliers
+ RuntimeAttr? runtime_attr_exclude_outliers
+ RuntimeAttr? runtime_attr_cat_outliers
+ RuntimeAttr? runtime_attr_filter_samples
+ RuntimeAttr? runtime_attr_ids_from_vcf
+ RuntimeAttr? runtime_attr_count_svs
}
- # Get count of biallelic autosomal variants per sample
- scatter ( contig in contigs ) {
- call CountSvtypes {
- input:
- vcf=vcf,
- vcf_idx=vcf_idx,
- prefix=prefix,
- contig=contig[0],
- sv_pipeline_docker=sv_pipeline_docker
- }
- }
- call CombineCounts {
- input:
- svcounts=CountSvtypes.sv_counts,
- prefix=prefix,
- sv_pipeline_docker=sv_pipeline_docker
- }
-
- # Get outliers
- if (PCRPLUS) {
- call IdentifyOutliers as identify_PCRPLUS_outliers {
- input:
- svcounts=CombineCounts.summed_svcounts,
- n_iqr_cutoff=select_first([n_iqr_cutoff_pcrplus]),
- samples_list=WriteSamplesList.plus_samples_list,
- prefix="~{prefix}.PCRPLUS",
- sv_pipeline_docker=sv_pipeline_docker
- }
- }
- call IdentifyOutliers as identify_PCRMINUS_outliers {
+ call identify_outliers.IdentifyOutlierSamples {
input:
- svcounts=CombineCounts.summed_svcounts,
- n_iqr_cutoff=n_iqr_cutoff_pcrminus,
- samples_list=WriteSamplesList.minus_samples_list,
- prefix="~{prefix}.PCRMINUS",
- sv_pipeline_docker=sv_pipeline_docker
+ vcf = vcf,
+ name = name,
+ sv_counts = sv_counts,
+ N_IQR_cutoff = N_IQR_cutoff,
+ outlier_cutoff_table = outlier_cutoff_table,
+ vcf_identifier = vcf_identifier,
+ sv_pipeline_docker = sv_pipeline_docker,
+ linux_docker = linux_docker,
+ runtime_attr_identify_outliers = runtime_attr_identify_outliers,
+ runtime_attr_cat_outliers = runtime_attr_cat_outliers,
+ runtime_attr_count_svs = runtime_attr_count_svs
}
- # Exclude outliers from vcf
call ExcludeOutliers {
input:
- vcf=vcf,
- vcf_idx=vcf_idx,
- plus_outliers_list=identify_PCRPLUS_outliers.outliers_list,
- minus_outliers_list=identify_PCRMINUS_outliers.outliers_list,
- outfile="~{prefix}.outliers_removed.vcf.gz",
- prefix=prefix,
- sv_pipeline_docker=sv_pipeline_docker
+ vcf = vcf,
+ outliers_list = IdentifyOutlierSamples.outlier_samples_list,
+ outfile = "${name}.outliers_removed.vcf.gz",
+ sv_base_mini_docker = sv_base_mini_docker,
+ runtime_attr_override = runtime_attr_exclude_outliers
+ }
+
+ call util.GetSampleIdsFromVcf {
+ input:
+ vcf = vcf,
+ sv_base_mini_docker = sv_base_mini_docker,
+ runtime_attr_override = runtime_attr_ids_from_vcf
}
# Write new list of samples without outliers
call FilterSampleList {
input:
- original_samples_list=WriteSamplesList.samples_list,
- outlier_samples=ExcludeOutliers.merged_outliers_list,
- prefix=prefix,
- sv_pipeline_docker=sv_pipeline_docker
+ original_samples = GetSampleIdsFromVcf.out_array,
+ outlier_samples = IdentifyOutlierSamples.outlier_samples_list,
+ batch = name,
+ linux_docker = linux_docker,
+ runtime_attr_override = runtime_attr_filter_samples
}
- # Final outputs
output {
- File vcf_noOutliers = ExcludeOutliers.vcf_no_outliers
- File vcf_noOutliers_idx = ExcludeOutliers.vcf_no_outliers_idx
- File nooutliers_samples_list = FilterSampleList.filtered_samples_list
- File excluded_samples_list = ExcludeOutliers.merged_outliers_list
- File svcounts_per_sample_data = CombineCounts.summed_svcounts
- File? svcounts_per_sample_plots_PCRPLUS = identify_PCRPLUS_outliers.svcount_distrib_plots
- File svcounts_per_sample_plots_PCRMINUS = identify_PCRMINUS_outliers.svcount_distrib_plots
+ File outlier_filtered_vcf = ExcludeOutliers.vcf_no_outliers
+ Array[String] filtered_samples_list = FilterSampleList.filtered_samples_list
+ File filtered_samples_file = FilterSampleList.filtered_samples_file
+ Array[String] outlier_samples_excluded = IdentifyOutlierSamples.outlier_samples_list
+ File outlier_samples_excluded_file = IdentifyOutlierSamples.outlier_samples_file
+ File sv_counts_file = IdentifyOutlierSamples.sv_counts_file
}
}
-# Write original list of samples
-task WriteSamplesList {
- input{
- File vcf
- File vcf_idx
- File? pcrplus_samples_list
- String prefix
- String sv_pipeline_docker
- RuntimeAttr? runtime_attr_override
- }
- RuntimeAttr default_attr = object {
- cpu_cores: 1,
- mem_gb: 3.75,
- disk_gb: 50,
- boot_disk_gb: 10,
- preemptible_tries: 3,
- max_retries: 1
- }
- RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
- command <<<
- set -euo pipefail
- tabix -H ~{vcf} | fgrep -v "##" \
- | cut -f10- | sed 's/\t/\n/g' > "~{prefix}.samples.list"
- if [ ! -z "~{pcrplus_samples_list}" ];then
- fgrep -wf ~{pcrplus_samples_list} "~{prefix}.samples.list" \
- > "~{prefix}.PCRPLUS.samples.list" || true
- fgrep -wvf ~{pcrplus_samples_list} "~{prefix}.samples.list" \
- > "~{prefix}.PCRMINUS.samples.list" || true
- else
- cp ~{prefix}.samples.list "~{prefix}.PCRMINUS.samples.list"
- touch "~{prefix}.PCRPLUS.samples.list"
- fi
- >>>
-
- output {
- File samples_list = "~{prefix}.samples.list"
- File plus_samples_list = "~{prefix}.PCRPLUS.samples.list"
- File minus_samples_list = "~{prefix}.PCRMINUS.samples.list"
- }
-
- runtime {
- cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
- memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
- disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
- bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: sv_pipeline_docker
- preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
- maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
- }
-}
-
-
-# Count biallelic SV per sample for a single chromosome
-task CountSvtypes {
- input{
+# Exclude outliers from VCF
+task ExcludeOutliers {
+ input {
File vcf
- File vcf_idx
- String prefix
- String contig
- String sv_pipeline_docker
+ Array[String] outliers_list
+ String outfile
+ String sv_base_mini_docker
RuntimeAttr? runtime_attr_override
}
- RuntimeAttr default_attr = object {
- cpu_cores: 1,
- mem_gb: 3.75,
- disk_gb: 50,
- boot_disk_gb: 10,
- preemptible_tries: 3,
- max_retries: 1
- }
- RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
- command <<<
- set -euo pipefail
- tabix --print-header "~{vcf}" "~{contig}" \
- | fgrep -v "MULTIALLELIC" \
- | fgrep -v "PESR_GT_OVERDISPERSION" \
- | svtk count-svtypes --no-header stdin \
- | awk -v OFS="\t" -v chr="~{contig}" '{ print $0, chr }' \
- > "~{prefix}.~{contig}.svcounts.txt"
- >>>
- output {
- File sv_counts = "~{prefix}.~{contig}.svcounts.txt"
- }
-
- runtime {
- cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
- memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
- disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
- bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: sv_pipeline_docker
- preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
- maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
- }
-}
-
-
-# Combine SV count files across chromosomes
-task CombineCounts {
- input{
- Array[File] svcounts
- String prefix
- String sv_pipeline_docker
- RuntimeAttr? runtime_attr_override
- }
RuntimeAttr default_attr = object {
cpu_cores: 1,
mem_gb: 3.75,
- disk_gb: 30,
+ disk_gb: 10,
boot_disk_gb: 10,
preemptible_tries: 3,
max_retries: 1
}
RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
- command <<<
- set -euo pipefail
- while read file; do
- cat "$file"
- done < ~{write_lines(svcounts)} \
- > merged_svcounts.txt
- /opt/sv-pipeline/scripts/downstream_analysis_and_filtering/sum_svcounts_perSample.R \
- merged_svcounts.txt \
- "~{prefix}.summed_svcounts_per_sample.txt"
- >>>
-
-
output {
- File summed_svcounts = "~{prefix}.summed_svcounts_per_sample.txt"
- }
-
- runtime {
- cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
- memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
- disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
- bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: sv_pipeline_docker
- preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
- maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
- }
-}
-
-
-# Identify the list of outlier samples & generate distribution plots
-task IdentifyOutliers {
- input{
- File svcounts
- Int n_iqr_cutoff
- File samples_list
- String prefix
- String sv_pipeline_docker
- RuntimeAttr? runtime_attr_override
- }
- RuntimeAttr default_attr = object {
- cpu_cores: 1,
- mem_gb: 3.75,
- disk_gb: 20,
- boot_disk_gb: 10,
- preemptible_tries: 3,
- max_retries: 1
+ File vcf_no_outliers = "${outfile}"
}
- RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
-
command <<<
- # Subset input data to specified samples
- sed -n '1p' ~{svcounts} > filtered_counts.input.txt
- sed '1d' ~{svcounts} | fgrep -wf ~{samples_list} >> filtered_counts.input.txt
- # Return list of samples exceeding cutoff for at least one sv class
- /opt/sv-pipeline/scripts/downstream_analysis_and_filtering/determine_svcount_outliers.R \
- -p "~{prefix}" \
- -I "~{n_iqr_cutoff}" \
- filtered_counts.input.txt \
- "~{prefix}_svcount_outlier_plots/"
- cat "~{prefix}_svcount_outlier_plots/~{prefix}.SV_count_outlier_samples.txt" \
- | fgrep -v "#" | cut -f1 | sort | uniq \
- > "~{prefix}.SV_count_outliers.samples.list"
- tar -cvzf "~{prefix}_svcount_outlier_plots.tar.gz" "~{prefix}_svcount_outlier_plots/"
- >>>
- output {
- File outliers_list = "~{prefix}.SV_count_outliers.samples.list"
- File svcount_distrib_plots = "~{prefix}_svcount_outlier_plots.tar.gz"
- }
-
- runtime {
- cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
- memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
- disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
- bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: sv_pipeline_docker
- preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
- maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
- }
-}
-
-
-# Exclude outliers from VCF
-task ExcludeOutliers {
- input{
- File vcf
- File vcf_idx
- File? plus_outliers_list
- File minus_outliers_list
- String outfile
- String prefix
- String sv_pipeline_docker
- RuntimeAttr? runtime_attr_override
- }
- RuntimeAttr default_attr = object {
- cpu_cores: 1,
- mem_gb: 3.75,
- disk_gb: 100,
- boot_disk_gb: 10,
- preemptible_tries: 3,
- max_retries: 1
- }
- RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
-
-
- command <<<
- set -euo pipefail
- cat ~{plus_outliers_list} ~{minus_outliers_list} \
- | sort -Vk1,1 | uniq \
- > "~{prefix}.SV_count_outliers.samples.list" || true
- tabix -H ~{vcf} | fgrep -v "##" | \
- sed 's/\t/\n/g' | awk -v OFS="\t" '{ print $1, NR }' | \
- fgrep -wf "~{prefix}.SV_count_outliers.samples.list" | cut -f2 > \
- indexes_to_exclude.txt || true
- if [ $( cat indexes_to_exclude.txt | wc -l ) -gt 0 ]; then
+ set -eu
+ OUTLIERS=~{write_lines(outliers_list)}
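+    # write_lines() always creates a file; more than one byte means at least one outlier sample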
+ if [ $( wc -c < $OUTLIERS ) -gt 1 ]; then
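+      # Map each outlier sample name to its column index in the VCF header line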
+ zcat ~{vcf} | fgrep "#" | fgrep -v "##" \
+ | sed 's/\t/\n/g' | awk -v OFS="\t" '{ print $1, NR }' \
+ | fgrep -wf $OUTLIERS | cut -f2 \
+ > indexes_to_exclude.txt
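+      # Cut those sample columns out, then drop records left with no non-reference genotypes (vcftools --mac 1)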
zcat ~{vcf} | \
- cut --complement -f$( cat indexes_to_exclude.txt | paste -s -d, ) | \
- bgzip -c \
- > "~{prefix}.subsetted_preEmptyRemoval.vcf.gz" || true
- /opt/sv-pipeline/scripts/drop_empty_records.py \
- "~{prefix}.subsetted_preEmptyRemoval.vcf.gz" \
- stdout | \
- bgzip -c > ~{outfile} || true
+ cut --complement -f$( cat indexes_to_exclude.txt | paste -s -d, ) \
+ | vcftools --mac 1 --vcf - --recode --recode-INFO-all --stdout \
+ | bgzip -c > ~{outfile}
else
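+      # No outlier samples flagged: pass the input VCF through unchanged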
cp ~{vcf} ~{outfile}
fi
- tabix -p vcf -f "~{outfile}"
+
>>>
-
- output {
- File merged_outliers_list = "~{prefix}.SV_count_outliers.samples.list"
- File vcf_no_outliers = "~{outfile}"
- File vcf_no_outliers_idx = "~{outfile}.tbi"
- }
-
runtime {
cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: sv_pipeline_docker
+ docker: sv_base_mini_docker
preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
}
-}
+}
-# Write new list of samples per prefix after outlier filtering
+# Write new list of samples per batch after outlier filtering
task FilterSampleList {
- input{
- File original_samples_list
- File outlier_samples
- String prefix
- String sv_pipeline_docker
+ input {
+ Array[String] original_samples
+ Array[String] outlier_samples
+ String batch
+ String linux_docker
RuntimeAttr? runtime_attr_override
}
+
RuntimeAttr default_attr = object {
cpu_cores: 1,
mem_gb: 3.75,
@@ -383,22 +149,24 @@ task FilterSampleList {
}
RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
- command <<<
- set -euo pipefail
- fgrep -wvf ~{outlier_samples} ~{original_samples_list} > \
- ~{prefix}.outliers_excluded.samples.list || true
- >>>
-
output {
- File filtered_samples_list = "~{prefix}.outliers_excluded.samples.list"
+ Array[String] filtered_samples_list = read_lines("${batch}.outliers_excluded.samples.list")
+ File filtered_samples_file = "${batch}.outliers_excluded.samples.list"
}
+ command <<<
+
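+    # Keep only samples absent from the outlier list (whole-word, inverted match)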
+ fgrep -wvf ~{write_lines(outlier_samples)} ~{write_lines(original_samples)} > ~{batch}.outliers_excluded.samples.list
+
+ >>>
runtime {
cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: sv_pipeline_docker
+ docker: linux_docker
preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
}
+
}
+
diff --git a/wdl/Module07FilterOutlierSamples.wdl b/wdl/FilterOutlierSamplesPostMinGQ.wdl
similarity index 96%
rename from wdl/Module07FilterOutlierSamples.wdl
rename to wdl/FilterOutlierSamplesPostMinGQ.wdl
index 947de0aca..b3e6a9eab 100644
--- a/wdl/Module07FilterOutlierSamples.wdl
+++ b/wdl/FilterOutlierSamplesPostMinGQ.wdl
@@ -1,7 +1,3 @@
-##########################
-## EXPERIMENTAL WORKFLOW
-##########################
-
version 1.0
import "Structs.wdl"
@@ -55,7 +51,7 @@ workflow FilterOutlierSamplesPostMinGQ {
# Get outliers
if (PCRPLUS) {
- call IdentifyOutliers as IdentifyPcrPlusOutliers {
+ call IdentifyOutliers as identify_PCRPLUS_outliers {
input:
svcounts=CombineCounts.summed_svcounts,
n_iqr_cutoff=select_first([n_iqr_cutoff_pcrplus]),
@@ -64,7 +60,7 @@ workflow FilterOutlierSamplesPostMinGQ {
sv_pipeline_docker=sv_pipeline_docker
}
}
- call IdentifyOutliers as IdentifyPcrMinusOutliers {
+ call IdentifyOutliers as identify_PCRMINUS_outliers {
input:
svcounts=CombineCounts.summed_svcounts,
n_iqr_cutoff=n_iqr_cutoff_pcrminus,
@@ -78,8 +74,8 @@ workflow FilterOutlierSamplesPostMinGQ {
input:
vcf=vcf,
vcf_idx=vcf_idx,
- plus_outliers_list=IdentifyPcrPlusOutliers.outliers_list,
- minus_outliers_list=IdentifyPcrMinusOutliers.outliers_list,
+ plus_outliers_list=identify_PCRPLUS_outliers.outliers_list,
+ minus_outliers_list=identify_PCRMINUS_outliers.outliers_list,
outfile="~{prefix}.outliers_removed.vcf.gz",
prefix=prefix,
sv_pipeline_docker=sv_pipeline_docker
@@ -101,8 +97,8 @@ workflow FilterOutlierSamplesPostMinGQ {
File nooutliers_samples_list = FilterSampleList.filtered_samples_list
File excluded_samples_list = ExcludeOutliers.merged_outliers_list
File svcounts_per_sample_data = CombineCounts.summed_svcounts
- File? svcounts_per_sample_plots_PCRPLUS = IdentifyPcrPlusOutliers.svcount_distrib_plots
- File svcounts_per_sample_plots_PCRMINUS = IdentifyPcrMinusOutliers.svcount_distrib_plots
+ File? svcounts_per_sample_plots_PCRPLUS = identify_PCRPLUS_outliers.svcount_distrib_plots
+ File svcounts_per_sample_plots_PCRMINUS = identify_PCRMINUS_outliers.svcount_distrib_plots
}
}
diff --git a/wdl/FilterOutliers.wdl b/wdl/FilterOutliers.wdl
deleted file mode 100644
index 3f72b9d3e..000000000
--- a/wdl/FilterOutliers.wdl
+++ /dev/null
@@ -1,324 +0,0 @@
-version 1.0
-
-# Author: Ryan Collins
-
-import "Structs.wdl"
-
-# Workflow to identify & filter outliers from VCFs after module 03 (random forest)
-workflow FilterOutlierSamples {
- input {
- String batch
- Array[File?] vcfs
- Array[String] samples
- Array[String] algorithms
- Int N_IQR_cutoff
- File? outlier_cutoff_table
- String sv_pipeline_docker
- String sv_base_mini_docker
- String linux_docker
- RuntimeAttr? runtime_attr_identify_outliers
- RuntimeAttr? runtime_attr_exclude_outliers
- RuntimeAttr? runtime_attr_cat_outliers
- RuntimeAttr? runtime_attr_filter_samples
- }
-
- Int num_algorithms = length(algorithms)
-
- scatter (i in range(num_algorithms)) {
- if (defined(vcfs[i])) {
- if (defined(outlier_cutoff_table)) {
- call IdentifyOutliersByCutoffTable {
- input:
- vcf = select_first([vcfs[i]]),
- outlier_cutoff_table = select_first([outlier_cutoff_table]),
- outfile = "${algorithms[i]}_outliers.txt",
- algorithm = algorithms[i],
- sv_pipeline_docker = sv_pipeline_docker,
- runtime_attr_override = runtime_attr_identify_outliers
- }
- }
- call IdentifyOutliersByIQR {
- input:
- vcf = select_first([vcfs[i]]),
- N_IQR_cutoff = N_IQR_cutoff,
- outfile = "${algorithms[i]}_outliers.txt",
- sv_pipeline_docker = sv_pipeline_docker,
- runtime_attr_override = runtime_attr_identify_outliers
- }
- }
- }
-
- # Merge list of outliers
- call CatOutliers {
- input:
- outliers = flatten([select_all(IdentifyOutliersByIQR.outliers_list),select_all(IdentifyOutliersByCutoffTable.outliers_list)]),
- batch = batch,
- linux_docker = linux_docker,
- runtime_attr_override = runtime_attr_cat_outliers
- }
-
- scatter (i in range(num_algorithms)) {
- if (defined(vcfs[i])) {
- call ExcludeOutliers {
- input:
- vcf = select_first([vcfs[i]]),
- outliers_list = CatOutliers.outliers_list,
- outfile = "${batch}.${algorithms[i]}.outliers_removed.vcf.gz",
- sv_base_mini_docker = sv_base_mini_docker,
- runtime_attr_override = runtime_attr_exclude_outliers
- }
- }
- }
-
- # Write new list of samples without outliers
- call FilterSampleList {
- input:
- original_samples = samples,
- outlier_samples = CatOutliers.outliers_list,
- batch = batch,
- linux_docker = linux_docker,
- runtime_attr_override = runtime_attr_filter_samples
- }
-
- output {
- Array[File?] vcfs_noOutliers = ExcludeOutliers.vcf_no_outliers
- Array[String] filtered_batch_samples_list = FilterSampleList.filtered_samples_list
- File filtered_batch_samples_file = FilterSampleList.filtered_samples_file
- Array[String] outlier_samples_excluded = CatOutliers.outliers_list
- File outlier_samples_excluded_file = CatOutliers.outliers_file
- }
-}
-
-task IdentifyOutliersByIQR {
- input {
- File vcf
- Int N_IQR_cutoff
- String outfile
- String sv_pipeline_docker
- RuntimeAttr? runtime_attr_override
- }
-
- RuntimeAttr default_attr = object {
- cpu_cores: 1,
- mem_gb: 3.75,
- disk_gb: 10,
- boot_disk_gb: 10,
- preemptible_tries: 3,
- max_retries: 1
- }
- RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
-
- output {
- File outliers_list = "${outfile}"
- }
- command <<<
-
- set -euo pipefail
- # Count sv per class per sample
- svtk count-svtypes ~{vcf} svcounts.txt
-
- # Return list of samples exceeding cutoff for at least one sv class
- /opt/sv-pipeline/03_variant_filtering/scripts/get_outliers_from_svcounts.helper.R \
- svcounts.txt \
- ~{N_IQR_cutoff} \
- ~{outfile}
-
- >>>
- runtime {
- cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
- memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
- disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
- bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: sv_pipeline_docker
- preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
- maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
- }
-
-}
-
-task IdentifyOutliersByCutoffTable {
- input {
- File vcf
- File outlier_cutoff_table
- String outfile
- String algorithm
- String sv_pipeline_docker
- RuntimeAttr? runtime_attr_override
- }
-
- RuntimeAttr default_attr = object {
- cpu_cores: 1,
- mem_gb: 3.75,
- disk_gb: 10,
- boot_disk_gb: 10,
- preemptible_tries: 3,
- max_retries: 1
- }
- RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
-
- output {
- File outliers_list = "${outfile}"
- }
- command <<<
-
- set -euo pipefail
- # Count sv per class per sample
- svtk count-svtypes ~{vcf} svcounts.txt
-
- # Return list of samples exceeding cutoff for at least one sv class
- /opt/sv-pipeline/03_variant_filtering/scripts/get_outliers_from_svcounts.helper_V2.R \
- svcounts.txt \
- ~{outlier_cutoff_table} \
- ~{outfile} \
- ~{algorithm}
-
- >>>
- runtime {
- cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
- memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
- disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
- bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: sv_pipeline_docker
- preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
- maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
- }
-
-}
-
-# Exclude outliers from VCF
-task ExcludeOutliers {
- input {
- File vcf
- Array[String] outliers_list
- String outfile
- String sv_base_mini_docker
- RuntimeAttr? runtime_attr_override
- }
-
- RuntimeAttr default_attr = object {
- cpu_cores: 1,
- mem_gb: 3.75,
- disk_gb: 10,
- boot_disk_gb: 10,
- preemptible_tries: 3,
- max_retries: 1
- }
- RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
-
- output {
- File vcf_no_outliers = "${outfile}"
- }
- command <<<
-
- set -eu
- OUTLIERS=~{write_lines(outliers_list)}
- if [ $( wc -c < $OUTLIERS ) -gt 1 ]; then
- zcat ~{vcf} | fgrep "#" | fgrep -v "##" \
- | sed 's/\t/\n/g' | awk -v OFS="\t" '{ print $1, NR }' \
- | fgrep -wf $OUTLIERS | cut -f2 \
- > indexes_to_exclude.txt
- zcat ~{vcf} | \
- cut --complement -f$( cat indexes_to_exclude.txt | paste -s -d, ) \
- | vcftools --mac 1 --vcf - --recode --recode-INFO-all --stdout \
- | bgzip -c > ~{outfile}
- else
- cp ~{vcf} ~{outfile}
- fi
-
- >>>
- runtime {
- cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
- memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
- disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
- bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: sv_base_mini_docker
- preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
- maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
- }
-
-}
-
-# Write new list of samples per batch after outlier filtering
-task FilterSampleList {
- input {
- Array[String] original_samples
- Array[String] outlier_samples
- String batch
- String linux_docker
- RuntimeAttr? runtime_attr_override
- }
-
- RuntimeAttr default_attr = object {
- cpu_cores: 1,
- mem_gb: 3.75,
- disk_gb: 10,
- boot_disk_gb: 10,
- preemptible_tries: 3,
- max_retries: 1
- }
- RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
-
- output {
- Array[String] filtered_samples_list = read_lines("${batch}.post03_outliers_excluded.samples.list")
- File filtered_samples_file = "${batch}.post03_outliers_excluded.samples.list"
- }
- command <<<
-
- fgrep -wvf ~{write_lines(outlier_samples)} ~{write_lines(original_samples)} > ~{batch}.post03_outliers_excluded.samples.list
-
- >>>
- runtime {
- cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
- memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
- disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
- bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: linux_docker
- preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
- maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
- }
-
-}
-
-# Merge outlier sample lists across algorithms
-task CatOutliers {
- input {
- Array[File] outliers
- String batch
- String linux_docker
- RuntimeAttr? runtime_attr_override
- }
-
- RuntimeAttr default_attr = object {
- cpu_cores: 1,
- mem_gb: 3.75,
- disk_gb: 10,
- boot_disk_gb: 10,
- preemptible_tries: 3,
- max_retries: 1
- }
- RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
-
- output {
- Array[String] outliers_list = read_lines("${batch}.post03_outliers.samples.list")
- File outliers_file = "${batch}.post03_outliers.samples.list"
- }
- command <<<
-
- set -euo pipefail
- while read file; do
- [ -e "$file" ] || continue
- cat $file
- done < ~{write_lines(outliers)} | sort | uniq > ~{batch}.post03_outliers.samples.list
-
- >>>
- runtime {
- cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
- memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
- disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
- bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
- docker: linux_docker
- preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
- maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
- }
-
-}
diff --git a/wdl/GATKSVPipelineBatch.wdl b/wdl/GATKSVPipelineBatch.wdl
index cab4db17d..49c3693b2 100644
--- a/wdl/GATKSVPipelineBatch.wdl
+++ b/wdl/GATKSVPipelineBatch.wdl
@@ -226,9 +226,9 @@ workflow GATKSVPipelineBatch {
call genotypebatch.GenotypeBatch as GenotypeBatch {
input:
- batch_pesr_vcf=GATKSVPipelinePhase1.filtered_pesr_vcf,
+ batch_pesr_vcf=select_first([GATKSVPipelinePhase1.filtered_pesr_vcf]),
batch_depth_vcf=select_first([GATKSVPipelinePhase1.filtered_depth_vcf]),
- cohort_pesr_vcf=GATKSVPipelinePhase1.filtered_pesr_vcf,
+ cohort_pesr_vcf=select_first([GATKSVPipelinePhase1.filtered_pesr_vcf]),
cohort_depth_vcf=select_first([GATKSVPipelinePhase1.filtered_depth_vcf]),
batch=batch,
rf_cutoffs=GATKSVPipelinePhase1.cutoffs,
diff --git a/wdl/GATKSVPipelinePhase1.wdl b/wdl/GATKSVPipelinePhase1.wdl
index 6eb5fa181..bbdd354e8 100644
--- a/wdl/GATKSVPipelinePhase1.wdl
+++ b/wdl/GATKSVPipelinePhase1.wdl
@@ -504,7 +504,7 @@ workflow GATKSVPipelinePhase1 {
File? filtered_wham_vcf = FilterBatch.filtered_wham_vcf
File? filtered_melt_vcf = FilterBatch.filtered_melt_vcf
File? filtered_depth_vcf = FilterBatch.filtered_depth_vcf
- File filtered_pesr_vcf = FilterBatch.filtered_pesr_vcf
+ File? filtered_pesr_vcf = FilterBatch.filtered_pesr_vcf
File cutoffs = FilterBatch.cutoffs
File scores = FilterBatch.scores
File RF_intermediate_files = FilterBatch.RF_intermediate_files
diff --git a/wdl/GATKSVPipelineSingleSample.wdl b/wdl/GATKSVPipelineSingleSample.wdl
index 573703009..4bb127ece 100644
--- a/wdl/GATKSVPipelineSingleSample.wdl
+++ b/wdl/GATKSVPipelineSingleSample.wdl
@@ -8,7 +8,7 @@ import "DepthPreprocessing.wdl" as dpn
import "ClusterBatch.wdl" as clusterbatch
import "GenerateBatchMetrics.wdl" as batchmetrics
import "SRTest.wdl" as SRTest
-import "FilterBatch.wdl" as filterbatch
+import "FilterBatchSamples.wdl" as filterbatch
import "GenotypeBatch.wdl" as genotypebatch
import "MakeCohortVcf.wdl" as makecohortvcf
import "AnnotateVcf.wdl" as annotate
diff --git a/wdl/IdentifyOutlierSamples.wdl b/wdl/IdentifyOutlierSamples.wdl
new file mode 100644
index 000000000..462c79746
--- /dev/null
+++ b/wdl/IdentifyOutlierSamples.wdl
@@ -0,0 +1,208 @@
+version 1.0
+
+import "Structs.wdl"
+import "PlotSVCountsPerSample.wdl" as plot_svcounts
+
+workflow IdentifyOutlierSamples {
+ input {
+ String name # batch or cohort
+ File vcf
+    File? sv_counts # SV counts file from PlotSVCountsPerSample; if not provided, one will be generated from the VCF
+ Int N_IQR_cutoff
+ File? outlier_cutoff_table
+    String? vcf_identifier # required (provide the algorithm name) if outlier_cutoff_table is given; otherwise optional, appended to file prefixes (i.e. as a VCF identifier)
+ String sv_pipeline_docker
+ String linux_docker
+ RuntimeAttr? runtime_attr_identify_outliers
+ RuntimeAttr? runtime_attr_cat_outliers
+ RuntimeAttr? runtime_attr_count_svs
+ }
+
+ String prefix = if (defined(vcf_identifier)) then "~{name}_~{vcf_identifier}" else name
+
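+  # Generate per-sample SV counts first if a counts file was not provided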
+ if (!defined(sv_counts)) {
+ call plot_svcounts.CountSVsPerSamplePerType {
+ input:
+ vcf = vcf,
+ prefix = prefix,
+ sv_pipeline_docker = sv_pipeline_docker,
+ runtime_attr_override = runtime_attr_count_svs
+ }
+ }
+
+ if (defined(outlier_cutoff_table)) {
+ call IdentifyOutliersByCutoffTable {
+ input:
+ vcf = vcf,
+ svcounts = select_first([sv_counts, CountSVsPerSamplePerType.sv_counts]),
+ outlier_cutoff_table = select_first([outlier_cutoff_table]),
+ outfile = "${prefix}_outliers.txt",
+ algorithm = select_first([vcf_identifier]),
+ sv_pipeline_docker = sv_pipeline_docker,
+ runtime_attr_override = runtime_attr_identify_outliers
+ }
+ }
+ call IdentifyOutliersByIQR {
+ input:
+ vcf = vcf,
+ svcounts = select_first([sv_counts, CountSVsPerSamplePerType.sv_counts]),
+ N_IQR_cutoff = N_IQR_cutoff,
+ outfile = "${prefix}_IQR_outliers.txt",
+ sv_pipeline_docker = sv_pipeline_docker,
+ runtime_attr_override = runtime_attr_identify_outliers
+ }
+
+ # Merge list of outliers
+ call CatOutliers {
+ input:
+      outliers = select_all([IdentifyOutliersByIQR.outliers_list, IdentifyOutliersByCutoffTable.outliers_list]),
+ batch = prefix,
+ linux_docker = linux_docker,
+ runtime_attr_override = runtime_attr_cat_outliers
+ }
+
+ output {
+ File outlier_samples_file = CatOutliers.outliers_file
+ Array[String] outlier_samples_list = CatOutliers.outliers_list
+ File sv_counts_file = select_first([sv_counts, CountSVsPerSamplePerType.sv_counts])
+ }
+}
+
+task IdentifyOutliersByIQR {
+ input {
+ File vcf
+ File svcounts
+ Int N_IQR_cutoff
+ String outfile
+ String sv_pipeline_docker
+ RuntimeAttr? runtime_attr_override
+ }
+
+ RuntimeAttr default_attr = object {
+ cpu_cores: 1,
+ mem_gb: 3.75,
+ disk_gb: 10,
+ boot_disk_gb: 10,
+ preemptible_tries: 3,
+ max_retries: 1
+ }
+ RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
+
+ output {
+ File outliers_list = "${outfile}"
+ }
+ command <<<
+
+ set -euo pipefail
+
+ # Return list of samples exceeding cutoff for at least one sv class
+ /opt/sv-pipeline/03_variant_filtering/scripts/get_outliers_from_svcounts.helper.R \
+ ~{svcounts} \
+ ~{N_IQR_cutoff} \
+ ~{outfile}
+
+ >>>
+ runtime {
+ cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
+ memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
+ disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
+ bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
+ docker: sv_pipeline_docker
+ preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
+ maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
+ }
+
+}
+
+task IdentifyOutliersByCutoffTable {
+ input {
+ File vcf
+ File svcounts
+ File outlier_cutoff_table
+ String outfile
+ String algorithm
+ String sv_pipeline_docker
+ RuntimeAttr? runtime_attr_override
+ }
+
+ RuntimeAttr default_attr = object {
+ cpu_cores: 1,
+ mem_gb: 3.75,
+ disk_gb: 10,
+ boot_disk_gb: 10,
+ preemptible_tries: 3,
+ max_retries: 1
+ }
+ RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
+
+ output {
+ File outliers_list = "${outfile}"
+ }
+ command <<<
+
+ set -euo pipefail
+
+ # Return list of samples exceeding cutoff for at least one sv class
+ /opt/sv-pipeline/03_variant_filtering/scripts/get_outliers_from_svcounts.helper_V2.R \
+ ~{svcounts} \
+ ~{outlier_cutoff_table} \
+ ~{outfile} \
+ ~{algorithm}
+
+ >>>
+ runtime {
+ cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
+ memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
+ disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
+ bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
+ docker: sv_pipeline_docker
+ preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
+ maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
+ }
+
+}
+
+
+# Merge outlier sample lists across algorithms
+task CatOutliers {
+ input {
+ Array[File] outliers
+ String batch
+ String linux_docker
+ RuntimeAttr? runtime_attr_override
+ }
+
+ RuntimeAttr default_attr = object {
+ cpu_cores: 1,
+ mem_gb: 3.75,
+ disk_gb: 10,
+ boot_disk_gb: 10,
+ preemptible_tries: 3,
+ max_retries: 1
+ }
+ RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
+
+ output {
+ Array[String] outliers_list = read_lines("${batch}.outliers.samples.list")
+ File outliers_file = "${batch}.outliers.samples.list"
+ }
+ command <<<
+
+ set -euo pipefail
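+    # Concatenate every outlier list that exists, then sort and deduplicate sample IDs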
+ while read file; do
+ [ -e "$file" ] || continue
+      cat "$file"
+ done < ~{write_lines(outliers)} | sort | uniq > ~{batch}.outliers.samples.list
+
+ >>>
+ runtime {
+ cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
+ memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
+ disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
+ bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
+ docker: linux_docker
+ preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
+ maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
+ }
+
+}
diff --git a/wdl/PlotSVCountsPerSample.wdl b/wdl/PlotSVCountsPerSample.wdl
new file mode 100644
index 000000000..5c60689fa
--- /dev/null
+++ b/wdl/PlotSVCountsPerSample.wdl
@@ -0,0 +1,198 @@
+version 1.0
+
+import "Structs.wdl"
+
+# Workflow to count SVs per sample per type and plot the count distributions with an IQR-based outlier cutoff
+workflow PlotSVCountsPerSample {
+ input {
+ String prefix
+    Array[File?] vcfs # in the same order as the vcf_identifiers array; use the null keyword to skip an entry
+    Array[String] vcf_identifiers # one label per VCF: an algorithm (manta, wham, etc.), an evidence class (pesr, depth), or a cohort VCF label
+ Int N_IQR_cutoff
+ String sv_pipeline_docker
+ RuntimeAttr? runtime_attr_count_svs
+ RuntimeAttr? runtime_attr_plot_svcounts
+ RuntimeAttr? runtime_attr_cat_outliers_preview
+ }
+
+ Int num_identifiers = length(vcf_identifiers)
+
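+  # Count and plot SVs for each VCF that was provided; null entries are skipped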
+ scatter (i in range(num_identifiers)) {
+ if (defined(vcfs[i])) {
+ call CountSVsPerSamplePerType {
+ input:
+ vcf = select_first([vcfs[i]]),
+ prefix = "~{prefix}.~{vcf_identifiers[i]}",
+ sv_pipeline_docker = sv_pipeline_docker,
+ runtime_attr_override = runtime_attr_count_svs
+ }
+
+ call PlotSVCountsWithCutoff {
+ input:
+ svcounts = CountSVsPerSamplePerType.sv_counts,
+ n_iqr_cutoff = N_IQR_cutoff,
+ prefix = "~{prefix}.~{vcf_identifiers[i]}",
+ sv_pipeline_docker = sv_pipeline_docker,
+ runtime_attr_override = runtime_attr_plot_svcounts
+ }
+ }
+ }
+
+ # Merge list of outliers for previewing
+ call CatOutliersPreview {
+ input:
+ outliers = select_all(PlotSVCountsWithCutoff.outliers_list),
+ prefix = prefix,
+ sv_pipeline_docker = sv_pipeline_docker,
+ runtime_attr_override = runtime_attr_cat_outliers_preview
+ }
+
+
+ output {
+ Array[File?] sv_counts = CountSVsPerSamplePerType.sv_counts
+ Array[File?] sv_count_plots = PlotSVCountsWithCutoff.svcount_distrib_plots
+ File outlier_samples_preview = CatOutliersPreview.outliers_preview_file
+ }
+}
+
+task CountSVsPerSamplePerType {
+ input {
+ File vcf
+ String prefix
+ String sv_pipeline_docker
+ RuntimeAttr? runtime_attr_override
+ }
+
+ RuntimeAttr default_attr = object {
+ cpu_cores: 1,
+ mem_gb: 3.75,
+ disk_gb: 10,
+ boot_disk_gb: 10,
+ preemptible_tries: 3,
+ max_retries: 1
+ }
+ RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
+
+ output {
+ File sv_counts = "~{prefix}.svcounts.txt"
+ }
+ command <<<
+
+ set -euo pipefail
+ # Count sv per class per sample
+ svtk count-svtypes ~{vcf} ~{prefix}.svcounts.txt
+
+ >>>
+ runtime {
+ cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
+ memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
+ disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
+ bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
+ docker: sv_pipeline_docker
+ preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
+ maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
+ }
+
+}
+
+# Generate SV count distribution plots with IQR cutoff plotted
+task PlotSVCountsWithCutoff {
+ input{
+ File svcounts
+ Int n_iqr_cutoff
+ String prefix
+ String sv_pipeline_docker
+ RuntimeAttr? runtime_attr_override
+ }
+ RuntimeAttr default_attr = object {
+ cpu_cores: 1,
+ mem_gb: 3.75,
+ disk_gb: 10,
+ boot_disk_gb: 10,
+ preemptible_tries: 3,
+ max_retries: 1
+ }
+ RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
+
+ command <<<
+ set -euo pipefail
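+    # Plot per-sample SV count distributions at the chosen IQR cutoff and write the flagged outlier samples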
+ /opt/sv-pipeline/scripts/downstream_analysis_and_filtering/determine_svcount_outliers.R \
+ -p "~{prefix}" \
+ -I "~{n_iqr_cutoff}" \
+ ~{svcounts} \
+ "./"
+ >>>
+
+ output {
+ File svcount_distrib_plots = "~{prefix}.all_SVTYPEs.counts_per_sample.png"
+ File outliers_list = "~{prefix}.SV_count_outlier_samples.txt"
+ }
+
+ runtime {
+ cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
+ memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
+ disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
+ bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
+ docker: sv_pipeline_docker
+ preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
+ maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
+ }
+}
+
+# Merge outlier sample lists detected across multiple VCFs
+task CatOutliersPreview {
+ input {
+ Array[File] outliers
+ String prefix
+ String sv_pipeline_docker
+ RuntimeAttr? runtime_attr_override
+ }
+
+ RuntimeAttr default_attr = object {
+ cpu_cores: 1,
+ mem_gb: 3.75,
+ disk_gb: 10,
+ boot_disk_gb: 10,
+ preemptible_tries: 3,
+ max_retries: 1
+ }
+ RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])
+
+ output {
+ File outliers_preview_file = "${prefix}.outliers_preview.samples.txt"
+ }
+ command <<<
+
+ set -euo pipefail
+    # NOTE: the heredoc body was truncated in this patch; the sketch below is a
+    # reconstruction of the assumed behavior: take the union of sample IDs
+    # across all outlier lists, skipping "#" header lines.
+    python3 <<CODE
+    samples = set()
+    # write_lines() materializes the input array as a file of paths
+    with open("~{write_lines(outliers)}") as file_list:
+        for fname in file_list:
+            with open(fname.strip()) as f:
+                for line in f:
+                    if line.strip() and not line.startswith("#"):
+                        samples.add(line.split("\t")[0].strip())
+    with open("~{prefix}.outliers_preview.samples.txt", "w") as out:
+        for sample in sorted(samples):
+            out.write(sample + "\n")
+    CODE
+
+  >>>
+ runtime {
+ cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
+ memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
+ disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
+ bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
+ docker: sv_pipeline_docker
+ preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
+ maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
+ }
+
+}
+
+