Update GatherSampleEvidence & TrainGCNV docs (#681)

* Extend docs.
* Add Scramble to GSE.
* Update TrainGCNV docs.
* Document the annotated_intervals output.
* Add an option to highlight text.
* Extend docs on inputs and outputs of workflows.
* Fix typo & add diagram for gather sample evidence.
* Update header level to match inputs section.
* Update website/docs/modules/gather_sample_evidence.md
  Co-authored-by: Mark Walker <[email protected]>
* Update website/docs/modules/gather_sample_evidence.md
  Co-authored-by: Mark Walker <[email protected]>
* Update website/docs/modules/train_gcnv.md
  Co-authored-by: Mark Walker <[email protected]>
* Replace direct link with a reference to the resources file.
* Update website/docs/modules/train_gcnv.md
  Co-authored-by: Mark Walker <[email protected]>
* Replace direct links with references to the resources file.
* Separate gatk-sv input, & add additional external docs link.
* Remove links.
* Add a single-line descript to avoid empty section. Needs to be extended in follow-up PRs.
* update diagrams to display recommended invocation order.
* add a common inputs section & remove some values.
* make plural
* update.
* update link
* clarify homogeneous
* Fix a broken link.

---------

Co-authored-by: Mark Walker <[email protected]>
broadinstitute · Sep 25, 2024 · 83e9464
1 parent 59e62c8
commit 83e9464
Showing 8 changed files with 409 additions and 59 deletions.
diff --git a/website/docs/gs/runtime_env.md b/website/docs/gs/runtime_env.md
@@ -48,4 +48,4 @@ and share code on forked repositories. Here are some considerations:
 - The GATK-SV pipeline takes advantage of the massive parallelization possible in the cloud. 
   Local backends may not have the resources to execute all of the workflows. 
   Workflows that use fewer resources or that are less parallelized may be more successful. 
-  For instance, some users have been able to run [GatherSampleEvidence](#gather-sample-evidence) on a SLURM cluster.
+  For instance, some users have been able to run [GatherSampleEvidence](../modules/gse) on a SLURM cluster.
diff --git a/website/docs/modules/evidence_qc.md b/website/docs/modules/evidence_qc.md
@@ -5,6 +5,8 @@ sidebar_position: 2
 slug: eqc
 ---
 
+import { Highlight, HighlightOptionalArg } from "../../src/components/highlight.js"
+
 Runs ploidy estimation, dosage scoring, and optionally VCF QC. 
 The results from this module can be used for QC and batching.
 
@@ -17,9 +19,36 @@ for further guidance on creating batches.
 We also recommend using sex assignments generated from the ploidy 
 estimates and incorporating them into the PED file, with sex = 0 for sex aneuploidies.
 
-### Prerequisites
+The following diagram illustrates the upstream and downstream workflows of the `EvidenceQC` workflow 
+in the recommended invocation order. You may refer to 
+[this diagram](https://github.com/broadinstitute/gatk-sv/blob/main/terra_pipeline_diagram.jpg) 
+for the overall recommended invocation order.
+
+<br/>
+
+```mermaid
+
+stateDiagram
+  direction LR
+  
+  classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8
+  classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white
+  classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d
+
+  gse: GatherSampleEvidence
+  eqc: EvidenceQC
+  batching: Batching, sample QC, and sex assignment
+  
+  gse --> eqc
+  eqc --> batching
+  
+  class eqc thisModule
+  class gse inModules
+  class batching outModules
+```
+
+<br/>
 
-- [Gather Sample Evidence](./gse)
 
 ### Inputs
 

diff --git a/website/docs/modules/gather_batch_evidence.md b/website/docs/modules/gather_batch_evidence.md
@@ -5,25 +5,174 @@ sidebar_position: 4
 slug: gbe
 ---
 
-Runs CNV callers (cnMOPs, GATK gCNV) and combines single-sample 
-raw evidence into a batch. See above for more information on batching.
+Runs CNV callers ([cn.MOPS](https://academic.oup.com/nar/article/40/9/e69/1136601), GATK-gCNV) 
+and combines single-sample raw evidence into a batch.
 
-### Prerequisites
+The following diagram illustrates the downstream workflows of the `GatherBatchEvidence` workflow 
+in the recommended invocation order. You may refer to 
+[this diagram](https://github.com/broadinstitute/gatk-sv/blob/main/terra_pipeline_diagram.jpg) 
+for the overall recommended invocation order.
 
-- GatherSampleEvidence
-- (Recommended) EvidenceQC
-- gCNV training. 
+```mermaid
 
-### Inputs
-- PED file (updated with EvidenceQC sex assignments, including sex = 0 
-  for sex aneuploidies. Calls will not be made on sex chromosomes 
-  when sex = 0 in order to avoid generating many confusing calls 
-  or upsetting normalized copy numbers for the batch.)
-- Read count, BAF, PE, SD, and SR files (GatherSampleEvidence)
-- Caller VCFs (GatherSampleEvidence)
-- Contig ploidy model and gCNV model files (gCNV training)
+stateDiagram
+  direction LR
+  
+  classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8
+  classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white
+  classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d
 
-### Outputs
+  gbe: GatherBatchEvidence
+  t: TrainGCNV
+  cb: ClusterBatch
+  t --> gbe
+  gbe --> cb
+  
+  class gbe thisModule
+  class t inModules
+  class cb outModules
+```
+
+## Inputs
+This workflow takes as input the read counts, BAF, PE, SD, SR, and per-caller VCF files 
+produced in the GatherSampleEvidence workflow, and contig ploidy and gCNV models from 
+the TrainGCNV workflow.
+The GatherBatchEvidence workflow takes the following inputs.
+
+
+#### `batch`
+An identifier for the batch.
+
+
+#### `samples`
+Sets the list of sample IDs. 
+
+
+#### `counts`
+Set to the [`GatherSampleEvidence.coverage_counts`](./gse#coverage-counts) output.
+
+
+#### Raw calls
+
+The following inputs set the per-caller raw SV calls, and should be set 
+if the caller was run in the [`GatherSampleEvidence`](./gse) workflow.
+You may set each of the following inputs to the linked output from 
+the GatherSampleEvidence workflow.
+
+
+- `manta_vcfs`: [`GatherSampleEvidence.manta_vcf`](./gse#manta-vcf);
+- `melt_vcfs`: [`GatherSampleEvidence.melt_vcf`](./gse#melt-vcf);
+- `scramble_vcfs`: [`GatherSampleEvidence.scramble_vcf`](./gse#scramble-vcf);
+- `wham_vcfs`: [`GatherSampleEvidence.wham_vcf`](./gse#wham-vcf).
+
+#### `PE_files`
+Set to the [`GatherSampleEvidence.pesr_disc`](./gse#pesr-disc) output.
+
+#### `SR_files`
+Set to the [`GatherSampleEvidence.pesr_split`](./gse#pesr-split) output.
+
+
+#### `SD_files`
+Set to the [`GatherSampleEvidence.pesr_sd`](./gse#pesr-sd) output.
+
+
+#### `matrix_qc_distance`
+You may refer to [this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/GatherBatchEvidence.json.tmpl)
+for an example value. 
+
+
+#### `min_svsize`
+Sets the minimum size of SVs to include.
+
+
+#### `ped_file`
+A pedigree file describing the familial relationships between the samples in the cohort.
+Please refer to [this section](./#ped_file) for details. 
+
+
+#### `run_matrix_qc`
+Enables or disables running optional QC tasks. 
+
+
+#### `gcnv_qs_cutoff`
+You may refer to [this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/GatherBatchEvidence.json.tmpl)
+for an example value. 
+
+#### cn.MOPS files
+The workflow needs the following cn.MOPS files.
+
+- `cnmops_chrom_file` and `cnmops_allo_file`: FASTA index files (`.fai`) for the 
+  autosomes (non-sex chromosomes) and the allosomes (chromosomes X and Y), respectively. 
+  The file format is explained [on this page](https://www.htslib.org/doc/faidx.html).
+
+  You may use the following files for these fields:
+
+  ```json
+  "cnmops_chrom_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/autosome.fai",
+  "cnmops_allo_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/allosome.fai"
+  ```
+
+- `cnmops_exclude_list`: 
+  You may use [this file](https://github.com/broadinstitute/gatk-sv/blob/d66f760865a89f30dbce456a3f720dec8b70705c/inputs/values/resources_hg38.json#L10)
+  for this field.
+
+#### GATK-gCNV inputs
+
+The following inputs are configured based on the outputs generated in the [`TrainGCNV`](./gcnv) workflow.
+
+- `contig_ploidy_model_tar`: [`TrainGCNV.cohort_contig_ploidy_model_tar`](./gcnv#contig-ploidy-model-tarball)
+- `gcnv_model_tars`: [`TrainGCNV.cohort_gcnv_model_tars`](./gcnv#model-tarballs)
+
+
+The workflow also exposes a few optional gCNV arguments.
+These arguments and their default values are listed 
+[here](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/GatherBatchEvidence.json.tmpl), 
+and each argument is documented on 
+[this page](https://gatk.broadinstitute.org/hc/en-us/articles/360037593411-PostprocessGermlineCNVCalls)
+and
+[this page](https://gatk.broadinstitute.org/hc/en-us/articles/360047217671-GermlineCNVCaller).
+
+
+#### Docker images
+
+The workflow needs the following Docker images, the latest versions of which are in 
+[this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/dockers.json).
+
+  - `cnmops_docker`;
+  - `condense_counts_docker`;
+  - `linux_docker`;
+  - `sv_base_docker`;
+  - `sv_base_mini_docker`;
+  - `sv_pipeline_docker`;
+  - `sv_pipeline_qc_docker`;
+  - `gcnv_gatk_docker`;
+  - `gatk_docker`.
+
+#### Static inputs
+
+You may refer to [this reference file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/resources_hg38.json)
+for values of the following inputs.
+
+ - `primary_contigs_fai`;
+ - `cytoband`;
+ - `ref_dict`;
+ - `mei_bed`;
+ - `genome_file`;
+ - `sd_locs_vcf`.
+
+
+#### Optional Inputs
+The following are a few of the workflow's optional inputs, 
+each shown with an example value. 
+
+- `"allosomal_contigs": [["chrX", "chrY"]]`
+- `"ploidy_sample_psi_scale": 0.001`
+
+
+
+
+
+## Outputs
 
 - Combined read count matrix, SR, PE, and BAF files
 - Standardized call VCFs
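
The input-to-output mapping documented in this file's diff can be sketched in Python. This is a minimal illustration rather than part of the commit: the `build_gbe_inputs` helper, the sample IDs, and the `gs://example-bucket/...` paths are hypothetical, the `GatherBatchEvidence.` key prefix assumes the usual WDL inputs-JSON convention, and the sketch assumes a caller's VCF input is set only when that caller was run for every sample in the batch.

```python
import json

# GatherBatchEvidence input name -> GatherSampleEvidence output name,
# following the mapping listed in the documentation above.
EVIDENCE_INPUTS = {
    "counts": "coverage_counts",
    "PE_files": "pesr_disc",
    "SR_files": "pesr_split",
    "SD_files": "pesr_sd",
}

# Per-caller raw calls; each is set only if the caller was run.
CALLER_INPUTS = {
    "manta_vcfs": "manta_vcf",
    "melt_vcfs": "melt_vcf",
    "scramble_vcfs": "scramble_vcf",
    "wham_vcfs": "wham_vcf",
}


def build_gbe_inputs(batch, gse_outputs):
    """Assemble a GatherBatchEvidence inputs dict (hypothetical helper).

    gse_outputs maps each sample ID to a dict of that sample's
    GatherSampleEvidence output names and file paths.
    """
    samples = sorted(gse_outputs)
    inputs = {
        "GatherBatchEvidence.batch": batch,
        "GatherBatchEvidence.samples": samples,
    }
    for gbe_name, gse_name in EVIDENCE_INPUTS.items():
        inputs[f"GatherBatchEvidence.{gbe_name}"] = [
            gse_outputs[s][gse_name] for s in samples
        ]
    for gbe_name, gse_name in CALLER_INPUTS.items():
        # Assumption: include a caller only if every sample has its VCF.
        if all(gse_name in gse_outputs[s] for s in samples):
            inputs[f"GatherBatchEvidence.{gbe_name}"] = [
                gse_outputs[s][gse_name] for s in samples
            ]
    return inputs


# Hypothetical per-sample outputs; only Manta and Wham were run here.
example_outputs = {
    sample: {
        name: f"gs://example-bucket/{sample}/{name}"
        for name in ("coverage_counts", "pesr_disc", "pesr_split",
                     "pesr_sd", "manta_vcf", "wham_vcf")
    }
    for sample in ("sampleB", "sampleA")
}

gbe_inputs = build_gbe_inputs("batch1", example_outputs)
print(json.dumps(gbe_inputs, indent=2))
```

Sorting the sample IDs once and deriving every list from that order keeps the per-sample arrays aligned with `samples`, which is the property a batch-level workflow input of this shape would rely on.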

diff --git a/website/docs/modules/gather_sample_evidence.md b/website/docs/modules/gather_sample_evidence.md
@@ -6,20 +6,77 @@ slug: gse
 ---
 
 Runs raw evidence collection on each sample with the following SV callers: 
-Manta, Wham, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence, 
+Manta, Wham, Scramble, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence, 
 refer to the Sample Exclusion section.
 
-Note: a list of sample IDs must be provided. Refer to the sample ID 
-requirements for specifications of allowable sample IDs. 
+The following diagram illustrates the downstream workflows of the `GatherSampleEvidence` workflow 
+in the recommended invocation order. You may refer to 
+[this diagram](https://github.com/broadinstitute/gatk-sv/blob/main/terra_pipeline_diagram.jpg) 
+for the overall recommended invocation order.
+
+
+```mermaid
+
+stateDiagram
+  direction LR
+  
+  classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8
+  classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white
+  classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d
+
+  gse: GatherSampleEvidence
+  eqc: EvidenceQC
+  gse --> eqc
+  
+  class gse thisModule
+  class eqc outModules
+```
+
+
+## Inputs
+
+#### `bam_or_cram_file`
+A BAM or CRAM file aligned to hg38. An index file (`.bai`) must be provided when using a BAM file.
+
+#### `sample_id`
+Refer to the [sample ID requirements](/docs/gs/inputs#sampleids) for specifications of allowable sample IDs. 
 IDs that do not meet these requirements may cause errors.
 
-### Inputs
+#### `preprocessed_intervals`
+Picard interval list.
+
+#### `sd_locs_vcf`
+(`sd`: site depth) 
+A VCF file containing allele counts at common SNP loci of the genome, which is used for calculating BAF.  
+For the human genome, you may use [`dbSNP`](https://www.ncbi.nlm.nih.gov/snp/), 
+which contains a complete list of common and clinical human single nucleotide variations, 
+microsatellites, and small-scale insertions and deletions. 
+You may find a link to the file in 
+[this reference](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/resources_hg38.json).
 
-- Per-sample BAM or CRAM files aligned to hg38. Index files (.bai) must be provided if using BAMs.
 
-### Outputs
+## Outputs
 
-- Caller VCFs (Manta, MELT, and/or Wham)
 - Binned read counts file
 - Split reads (SR) file
 - Discordant read pairs (PE) file
+
+#### `manta_vcf` {#manta-vcf}
+A VCF file containing variants called by Manta. 
+
+#### `melt_vcf` {#melt-vcf}
+A VCF file containing variants called by MELT. 
+
+#### `scramble_vcf` {#scramble-vcf}
+A VCF file containing variants called by Scramble. 
+
+#### `wham_vcf` {#wham-vcf}
+A VCF file containing variants called by Wham. 
+
+#### `coverage_counts` {#coverage-counts}
+
+#### `pesr_disc` {#pesr-disc}
+
+#### `pesr_split` {#pesr-split}
+
+#### `pesr_sd` {#pesr-sd}
diff --git a/website/docs/modules/index.md b/website/docs/modules/index.md
@@ -36,3 +36,18 @@ consisting of multiple modules to be executed in the following order.
 - **Module 09 (in development)** Visualization, including scripts that generate IGV screenshots and rd plots.
 
 - Additional modules to be added: de novo and mosaic scripts
+
+
+## Pipeline Parameters
+
+Several inputs are shared across different modules of the pipeline, which are explained in this section.
+
+#### `ped_file`
+
+A pedigree file describing the familial relationships between the samples in the cohort.
+The file needs to be in the 
+[PED format](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format).
+The file should be updated with [EvidenceQC](./eqc) sex assignments, including 
+`sex = 0` for sex aneuploidies. 
+Genotypes on chrX and chrY for samples with `sex = 0` in the PED file will be set to 
+`./.`, and these samples will be excluded from sex-specific training steps.
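
As a rough illustration of the `sex = 0` convention described for `ped_file`, the Python sketch below lists the samples whose chrX and chrY genotypes would be set to `./.`. It is not part of the commit: the `samples_with_sex_zero` helper and the example records are hypothetical, and it assumes the standard six-column, whitespace-separated PED layout (family ID, individual ID, paternal ID, maternal ID, sex, phenotype).

```python
def samples_with_sex_zero(ped_text):
    """Return the individual IDs whose sex column is 0 (hypothetical helper)."""
    flagged = []
    for line_number, line in enumerate(ped_text.splitlines(), start=1):
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 6:
            raise ValueError(
                f"line {line_number}: expected 6 columns, found {len(fields)}"
            )
        family_id, sample_id, paternal_id, maternal_id, sex, phenotype = fields[:6]
        if sex == "0":
            flagged.append(sample_id)
    return flagged


# Hypothetical PED records: a trio in which the child has sex = 0.
example_ped = (
    "fam1\tfather1\t0\t0\t1\t0\n"
    "fam1\tmother1\t0\t0\t2\t0\n"
    "fam1\tchild1\tfather1\tmother1\t0\t0\n"
)

flagged_samples = samples_with_sex_zero(example_ped)
print(flagged_samples)
```

Running a check like this before launching downstream workflows makes it easy to confirm that the samples flagged by EvidenceQC are the ones expected.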