Update GatherSampleEvidence & TrainGCNV docs (#681)
* Extend docs.

* Add Scramble to GSE.

* Update TrainGCNV docs.

* Document the annotated_intervals output.

* Add an option to highlight text.

* Extend docs on inputs and outputs of workflows.

* Fix typo & add diagram for gather sample evidence.

* Update header level to match inputs section.

* Update website/docs/modules/gather_sample_evidence.md

Co-authored-by: Mark Walker <[email protected]>

* Update website/docs/modules/gather_sample_evidence.md

Co-authored-by: Mark Walker <[email protected]>

* Update website/docs/modules/train_gcnv.md

Co-authored-by: Mark Walker <[email protected]>

* Replace direct link with a reference to the resources file.

* Update website/docs/modules/train_gcnv.md

Co-authored-by: Mark Walker <[email protected]>

* Replace direct links with references to the resources file.

* Separate gatk-sv input, & add additional external docs link.

* Remove links.

* Add a single-line descript to avoid empty section. Needs to be extended in follow-up PRs.

* update diagrams to display recommended invocation order.

* add a common inputs section & remove some values.

* make plural

* update.

* update link

* clarify homogeneous

* Fix a broken link.

---------

Co-authored-by: Mark Walker <[email protected]>
VJalili and mwalker174 authored Sep 25, 2024
1 parent 59e62c8 commit 83e9464
Showing 8 changed files with 409 additions and 59 deletions.
2 changes: 1 addition & 1 deletion website/docs/gs/runtime_env.md
@@ -48,4 +48,4 @@ and share code on forked repositories. Here are a some considerations:
- The GATK-SV pipeline takes advantage of the massive parallelization possible in the cloud.
Local backends may not have the resources to execute all of the workflows.
Workflows that use fewer resources or that are less parallelized may be more successful.
For instance, some users have been able to run [GatherSampleEvidence](#gather-sample-evidence) on a SLURM cluster.
For instance, some users have been able to run [GatherSampleEvidence](../modules/gse) on a SLURM cluster.
33 changes: 31 additions & 2 deletions website/docs/modules/evidence_qc.md
@@ -5,6 +5,8 @@ sidebar_position: 2
slug: eqc
---

import { Highlight, HighlightOptionalArg } from "../../src/components/highlight.js"

Runs ploidy estimation, dosage scoring, and optionally VCF QC.
The results from this module can be used for QC and batching.

@@ -17,9 +19,36 @@ for further guidance on creating batches.
We also recommend using sex assignments generated from the ploidy
estimates and incorporating them into the PED file, with sex = 0 for sex aneuploidies.

### Prerequisites
The following diagram illustrates the upstream and downstream workflows of the `EvidenceQC` workflow
in the recommended invocation order. You may refer to
[this diagram](https://github.com/broadinstitute/gatk-sv/blob/main/terra_pipeline_diagram.jpg)
for the overall recommended invocation order.

<br/>

```mermaid
stateDiagram
direction LR
classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8
classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white
classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d
gse: GatherSampleEvidence
eqc: EvidenceQC
batching: Batching, sample QC, and sex assignment
gse --> eqc
eqc --> batching
class eqc thisModule
class gse inModules
class batching outModules
```

<br/>

- [Gather Sample Evidence](./gse)

### Inputs

179 changes: 164 additions & 15 deletions website/docs/modules/gather_batch_evidence.md
@@ -5,25 +5,174 @@ sidebar_position: 4
slug: gbe
---

Runs CNV callers (cnMOPs, GATK gCNV) and combines single-sample
raw evidence into a batch. See above for more information on batching.
Runs CNV callers ([cn.MOPS](https://academic.oup.com/nar/article/40/9/e69/1136601), GATK-gCNV)
and combines single-sample raw evidence into a batch.

### Prerequisites
The following diagram illustrates the downstream workflows of the `GatherBatchEvidence` workflow
in the recommended invocation order. You may refer to
[this diagram](https://github.com/broadinstitute/gatk-sv/blob/main/terra_pipeline_diagram.jpg)
for the overall recommended invocation order.

- GatherSampleEvidence
- (Recommended) EvidenceQC
- gCNV training.
```mermaid
### Inputs
- PED file (updated with EvidenceQC sex assignments, including sex = 0
for sex aneuploidies. Calls will not be made on sex chromosomes
when sex = 0 in order to avoid generating many confusing calls
or upsetting normalized copy numbers for the batch.)
- Read count, BAF, PE, SD, and SR files (GatherSampleEvidence)
- Caller VCFs (GatherSampleEvidence)
- Contig ploidy model and gCNV model files (gCNV training)
stateDiagram
direction LR
classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8
classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white
classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d
### Outputs
gbe: GatherBatchEvidence
t: TrainGCNV
cb: ClusterBatch
t --> gbe
gbe --> cb
class gbe thisModule
class t inModules
class cb outModules
```

## Inputs
This workflow takes as input the read counts, BAF, PE, SD, SR, and per-caller VCF files
produced in the GatherSampleEvidence workflow, and contig ploidy and gCNV models from
the TrainGCNV workflow.
The inputs of the GatherBatchEvidence workflow are listed below.


#### `batch`
An identifier for the batch.


#### `samples`
Sets the list of sample IDs.


#### `counts`
Set to the [`GatherSampleEvidence.coverage_counts`](./gse#coverage-counts) output.


#### Raw calls

The following inputs set the per-caller raw SV calls, and should be set
if the caller was run in the [`GatherSampleEvidence`](./gse) workflow.
You may set each of the following inputs to the linked output from
the GatherSampleEvidence workflow.


- `manta_vcfs`: [`GatherSampleEvidence.manta_vcf`](./gse#manta-vcf);
- `melt_vcfs`: [`GatherSampleEvidence.melt_vcf`](./gse#melt-vcf);
- `scramble_vcfs`: [`GatherSampleEvidence.scramble_vcf`](./gse#scramble-vcf);
- `wham_vcfs`: [`GatherSampleEvidence.wham_vcf`](./gse#wham-vcf).

#### `PE_files`
Set to the [`GatherSampleEvidence.pesr_disc`](./gse#pesr-disc) output.

#### `SR_files`
Set to the [`GatherSampleEvidence.pesr_split`](./gse#pesr-split) output.


#### `SD_files`
Set to the [`GatherSampleEvidence.pesr_sd`](./gse#pesr-sd) output.
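
The mapping from `GatherSampleEvidence` outputs to `GatherBatchEvidence` inputs described above can be sketched as a small Python helper. This is an illustration under assumptions, not part of GATK-SV: the `build_gbe_inputs` helper, the `per_sample` dictionary layout, and the `gs://my-bucket/...` paths are hypothetical. Only the input and output names come from this page, and the `Workflow.input` key form follows the usual convention for WDL inputs JSON files.

```python
import json

# GatherBatchEvidence array input -> GatherSampleEvidence output that feeds it
# (names taken from the documentation above).
GBE_INPUT_TO_GSE_OUTPUT = {
    "counts": "coverage_counts",
    "manta_vcfs": "manta_vcf",
    "melt_vcfs": "melt_vcf",
    "scramble_vcfs": "scramble_vcf",
    "wham_vcfs": "wham_vcf",
    "PE_files": "pesr_disc",
    "SR_files": "pesr_split",
    "SD_files": "pesr_sd",
}


def build_gbe_inputs(batch, per_sample):
    """Collect per-sample outputs into batch-level array inputs.

    per_sample maps sample ID -> {GSE output name: file path}
    (a hypothetical layout chosen for this sketch).
    """
    samples = sorted(per_sample)
    inputs = {
        "GatherBatchEvidence.batch": batch,
        "GatherBatchEvidence.samples": samples,
    }
    for gbe_input, gse_output in GBE_INPUT_TO_GSE_OUTPUT.items():
        paths = [per_sample[s].get(gse_output) for s in samples]
        if all(p is None for p in paths):
            continue  # output absent for every sample, e.g. the caller was not run
        if any(p is None for p in paths):
            raise ValueError(f"{gse_output} is missing for some samples")
        # Paths are kept in the same order as the sample IDs.
        inputs[f"GatherBatchEvidence.{gbe_input}"] = paths
    return inputs


if __name__ == "__main__":
    # Hypothetical per-sample outputs; Scramble, MELT, and Wham were not run here.
    example = {
        "sample_a": {
            "coverage_counts": "gs://my-bucket/sample_a.counts.tsv.gz",
            "manta_vcf": "gs://my-bucket/sample_a.manta.vcf.gz",
        },
        "sample_b": {
            "coverage_counts": "gs://my-bucket/sample_b.counts.tsv.gz",
            "manta_vcf": "gs://my-bucket/sample_b.manta.vcf.gz",
        },
    }
    print(json.dumps(build_gbe_inputs("batch_1", example), indent=2))
```

The helper omits an input entirely when no sample has the corresponding output, which mirrors the guidance that a caller input should be set only if that caller was run.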


#### `matrix_qc_distance`
You may refer to [this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/GatherBatchEvidence.json.tmpl)
for an example value.


#### `min_svsize`
Sets the minimum size of SVs to include.


#### `ped_file`
A pedigree file describing the familial relationships between the samples in the cohort.
Please refer to [this section](./#ped_file) for details.


#### `run_matrix_qc`
Enables or disables running optional QC tasks.


#### `gcnv_qs_cutoff`
You may refer to [this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/GatherBatchEvidence.json.tmpl)
for an example value.

#### cn.MOPS files
The workflow needs the following cn.MOPS files.

- `cnmops_chrom_file` and `cnmops_allo_file`: FASTA index files (`.fai`) for the
non-sex chromosomes (autosomes) and for chromosomes X and Y (allosomes), respectively.
The file format is explained [on this page](https://www.htslib.org/doc/faidx.html).

You may use the following files for these fields:

```json
"cnmops_chrom_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/autosome.fai"
"cnmops_allo_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/allosome.fai"
```

- `cnmops_exclude_list`:
You may use [this file](https://github.com/broadinstitute/gatk-sv/blob/d66f760865a89f30dbce456a3f720dec8b70705c/inputs/values/resources_hg38.json#L10)
for this field.
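
To illustrate the relationship between the two cn.MOPS index files, the sketch below splits the text of a FASTA index into autosome and allosome parts. It relies on the `.fai` layout from the htslib page linked above (tab-separated `NAME`, `LENGTH`, `OFFSET`, `LINEBASES`, `LINEWIDTH`). The `split_fai` helper is hypothetical, the contig names `chrX` and `chrY` are assumed to follow hg38 naming, and the numbers in the example are toy values. This is not necessarily how the published resource files were produced.

```python
def split_fai(fai_text, allosomes=("chrX", "chrY")):
    """Split FASTA index text into (autosome_text, allosome_text).

    Each non-empty line is kept verbatim; only the first column (the
    contig name) is inspected to decide which part the line belongs to.
    """
    autosome_lines, allosome_lines = [], []
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name = line.split("\t", 1)[0]
        if name in allosomes:
            allosome_lines.append(line)
        else:
            autosome_lines.append(line)
    autosome_text = "".join(l + "\n" for l in autosome_lines)
    allosome_text = "".join(l + "\n" for l in allosome_lines)
    return autosome_text, allosome_text


if __name__ == "__main__":
    # Toy index with made-up lengths and offsets, for illustration only.
    toy_fai = (
        "chr1\t1000\t6\t60\t61\n"
        "chrX\t500\t1030\t60\t61\n"
        "chrY\t200\t1545\t60\t61\n"
    )
    autosome_text, allosome_text = split_fai(toy_fai)
    print(autosome_text, end="")
    print(allosome_text, end="")
```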

#### GATK-gCNV inputs

The following inputs are configured based on the outputs generated in the [`TrainGCNV`](./gcnv) workflow.

- `contig_ploidy_model_tar`: [`TrainGCNV.cohort_contig_ploidy_model_tar`](./gcnv#contig-ploidy-model-tarball)
- `gcnv_model_tars`: [`TrainGCNV.cohort_gcnv_model_tars`](./gcnv#model-tarballs)


The workflow also allows setting a few optional gCNV arguments.
The arguments and their default values are listed in
[this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/GatherBatchEvidence.json.tmpl),
and each argument is documented on
[this page](https://gatk.broadinstitute.org/hc/en-us/articles/360037593411-PostprocessGermlineCNVCalls)
and
[this page](https://gatk.broadinstitute.org/hc/en-us/articles/360047217671-GermlineCNVCaller).


#### Docker images

The workflow needs the following Docker images, the latest versions of which are in
[this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/dockers.json).

- `cnmops_docker`;
- `condense_counts_docker`;
- `linux_docker`;
- `sv_base_docker`;
- `sv_base_mini_docker`;
- `sv_pipeline_docker`;
- `sv_pipeline_qc_docker`;
- `gcnv_gatk_docker`;
- `gatk_docker`.
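
As a rough sketch of how the Docker inputs listed above could be filled in, the helper below picks the required images out of a dictionary of image names and URIs. It assumes that the dockers file can be loaded as a flat JSON object whose keys match the input names; that layout, the `select_dockers` helper, and the image URIs in the example are assumptions made for illustration.

```python
# Docker inputs required by GatherBatchEvidence, as listed above.
REQUIRED_DOCKERS = [
    "cnmops_docker",
    "condense_counts_docker",
    "linux_docker",
    "sv_base_docker",
    "sv_base_mini_docker",
    "sv_pipeline_docker",
    "sv_pipeline_qc_docker",
    "gcnv_gatk_docker",
    "gatk_docker",
]


def select_dockers(dockers, workflow="GatherBatchEvidence"):
    """Return workflow-prefixed Docker inputs taken from a name -> URI mapping.

    Raises KeyError listing every required image that is absent.
    """
    missing = [name for name in REQUIRED_DOCKERS if name not in dockers]
    if missing:
        raise KeyError("missing docker images: " + ", ".join(missing))
    return {f"{workflow}.{name}": dockers[name] for name in REQUIRED_DOCKERS}


if __name__ == "__main__":
    # Hypothetical URIs; real values would come from the dockers file.
    example = {name: f"example.registry/{name}:tag" for name in REQUIRED_DOCKERS}
    example["unrelated_docker"] = "example.registry/unrelated:tag"
    for key, value in select_dockers(example).items():
        print(key, value)
```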

#### Static inputs

You may refer to [this reference file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/resources_hg38.json)
for values of the following inputs.

- `primary_contigs_fai`;
- `cytoband`;
- `ref_dict`;
- `mei_bed`;
- `genome_file`;
- `sd_locs_vcf`.


#### Optional Inputs
The following are a few of the workflow's optional inputs,
each with an example value.

- `"allosomal_contigs": [["chrX", "chrY"]]`
- `"ploidy_sample_psi_scale": 0.001`





## Outputs

- Combined read count matrix, SR, PE, and BAF files
- Standardized call VCFs
71 changes: 64 additions & 7 deletions website/docs/modules/gather_sample_evidence.md
@@ -6,20 +6,77 @@ slug: gse
---

Runs raw evidence collection on each sample with the following SV callers:
Manta, Wham, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence,
Manta, Wham, Scramble, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence,
refer to the Sample Exclusion section.

Note: a list of sample IDs must be provided. Refer to the sample ID
requirements for specifications of allowable sample IDs.
The following diagram illustrates the downstream workflows of the `GatherSampleEvidence` workflow
in the recommended invocation order. You may refer to
[this diagram](https://github.com/broadinstitute/gatk-sv/blob/main/terra_pipeline_diagram.jpg)
for the overall recommended invocation order.


```mermaid
stateDiagram
direction LR
classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8
classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white
classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d
gse: GatherSampleEvidence
eqc: EvidenceQC
gse --> eqc
class gse thisModule
class eqc outModules
```


## Inputs

#### `bam_or_cram_file`
A BAM or CRAM file aligned to hg38. An index file (`.bai`) must be provided if using BAM.
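
The index requirement can be checked before submitting a sample. The sketch below is a hypothetical pre-flight check, not part of the workflow: the `check_alignment_input` helper and its argument names are made up for illustration, and it inspects file extensions only.

```python
def check_alignment_input(alignment_path, index_path=None):
    """Return "BAM" or "CRAM" after checking the extension and BAM index rule."""
    lower = alignment_path.lower()
    if lower.endswith(".bam"):
        # A BAM input must come with a .bai index.
        if index_path is None:
            raise ValueError("a .bai index must be provided for BAM input")
        if not index_path.lower().endswith(".bai"):
            raise ValueError(f"expected a .bai index, got: {index_path}")
        return "BAM"
    if lower.endswith(".cram"):
        return "CRAM"
    raise ValueError(f"expected a .bam or .cram file, got: {alignment_path}")


if __name__ == "__main__":
    # Hypothetical file names.
    print(check_alignment_input("sample_a.bam", "sample_a.bam.bai"))
    print(check_alignment_input("sample_b.cram"))
```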

#### `sample_id`
Refer to the [sample ID requirements](/docs/gs/inputs#sampleids) for specifications of allowable sample IDs.
IDs that do not meet these requirements may cause errors.

### Inputs
#### `preprocessed_intervals`
Picard interval list.

#### `sd_locs_vcf`
(`sd`: site depth)
A VCF file containing allele counts at common SNP loci of the genome, which is used for calculating BAF.
For the human genome, you may use [`dbSNP`](https://www.ncbi.nlm.nih.gov/snp/),
which contains a complete list of common and clinical human single nucleotide variations,
microsatellites, and small-scale insertions and deletions.
You may find a link to the file in
[this reference](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/resources_hg38.json).

- Per-sample BAM or CRAM files aligned to hg38. Index files (.bai) must be provided if using BAMs.

### Outputs
## Outputs

- Caller VCFs (Manta, MELT, and/or Wham)
- Binned read counts file
- Split reads (SR) file
- Discordant read pairs (PE) file

#### `manta_vcf` {#manta-vcf}
A VCF file containing variants called by Manta.

#### `melt_vcf` {#melt-vcf}
A VCF file containing variants called by MELT.

#### `scramble_vcf` {#scramble-vcf}
A VCF file containing variants called by Scramble.

#### `wham_vcf` {#wham-vcf}
A VCF file containing variants called by Wham.

#### `coverage_counts` {#coverage-counts}

#### `pesr_disc` {#pesr-disc}

#### `pesr_split` {#pesr-split}

#### `pesr_sd` {#pesr-sd}
15 changes: 15 additions & 0 deletions website/docs/modules/index.md
@@ -36,3 +36,18 @@ consisting of multiple modules to be executed in the following order.
- **Module 09 (in development)** Visualization, including scripts that generate IGV screenshots and rd plots.

- Additional modules to be added: de novo and mosaic scripts


## Pipeline Parameters

Several inputs are shared across different modules of the pipeline, which are explained in this section.

#### `ped_file`

A pedigree file describing the familial relationships between the samples in the cohort.
The file needs to be in the
[PED format](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format).
The file should be updated with [EvidenceQC](./eqc) sex assignments, including
`sex = 0` for sex aneuploidies.
Genotypes on chrX and chrY for samples with `sex = 0` in the PED file will be set to
`./.`, and these samples will be excluded from sex-specific training steps.
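
To see which samples are affected by the `sex = 0` rule, the PED file can be scanned before running the pipeline. The sketch below assumes the standard six-column PED layout (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) separated by whitespace; the `samples_with_sex_zero` helper and the sample names are hypothetical.

```python
def samples_with_sex_zero(ped_text):
    """Return the individual IDs whose sex column is 0 in PED-formatted text."""
    ids = []
    for line_number, line in enumerate(ped_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        if len(fields) < 6:
            raise ValueError(
                f"line {line_number}: expected 6 columns, found {len(fields)}"
            )
        # Column 5 holds the sex code; column 2 holds the individual ID.
        if fields[4] == "0":
            ids.append(fields[1])
    return ids


if __name__ == "__main__":
    # Hypothetical trio in which the child has a sex aneuploidy (sex = 0).
    example_ped = (
        "fam1\tfather\t0\t0\t1\t0\n"
        "fam1\tmother\t0\t0\t2\t0\n"
        "fam1\tchild\tfather\tmother\t0\t0\n"
    )
    print(samples_with_sex_zero(example_ped))
```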