Add documentation for the de-novo pipeline #675

Draft
wants to merge 3 commits into
base: main
7 changes: 7 additions & 0 deletions website/docs/concepts/_category_.json
@@ -0,0 +1,7 @@
{
"label": "Concepts",
"position": 5,
"link": {
"type": "generated-index"
}
}
44 changes: 44 additions & 0 deletions website/docs/concepts/denovo.md
@@ -0,0 +1,44 @@
---
Review comment (Collaborator):
This file is a lot of detail. IMO, it would be better to reference the scripts themselves and have those be sufficiently organized/commented that the methods are clear.

title: de-novo SVs
description: Downstream filtering (work in progress)
sidebar_position: 1
slug: denovo
---


The de-novo workflow identifies a variant as de-novo based on criteria defined
on site-specific and sample-specific metrics of the variant.
These criteria are discussed in detail in the following sections.


## Site-specific filters

- Exclude multi-allelic copy number variants (mCNV) and breakends (BND).
- Exclude variants with gnomAD `AF > 0.01`, cohort `AF > 0.05` or parents `AF > 0.05` (except for variants in genomic disorder (GD) regions).
- Remove Wham-only calls with `GQ = 1`.
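
The site-specific filters above can be sketched as a single predicate. This is a minimal illustration, not the pipeline's code: the record keys (`svtype`, `gnomad_af`, `cohort_af`, `parents_af`, `in_gd_region`, `algorithms`, `gq`) and the `"mCNV"` label are hypothetical, and it assumes the GD exception applies to all three allele-frequency cutoffs.

```python
# Hypothetical record layout: these keys are illustrative, not the pipeline's field names.
def passes_site_filters(variant, gnomad_af_max=0.01, cohort_af_max=0.05, parents_af_max=0.05):
    """Return True if a variant survives the site-specific filters."""
    # Multi-allelic CNVs and breakends are always excluded.
    if variant["svtype"] in {"mCNV", "BND"}:
        return False
    # Assumption: the allele-frequency cutoffs are skipped inside genomic disorder regions.
    if not variant.get("in_gd_region", False):
        if variant.get("gnomad_af", 0.0) > gnomad_af_max:
            return False
        if variant.get("cohort_af", 0.0) > cohort_af_max:
            return False
        if variant.get("parents_af", 0.0) > parents_af_max:
            return False
    # Calls supported only by Wham with GQ = 1 are removed.
    if set(variant.get("algorithms", ())) == {"wham"} and variant.get("gq") == 1:
        return False
    return True
```

The default thresholds mirror the values quoted in the list; in practice they would come from the workflow configuration.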

## Sample-specific filters

- De novo SV (call absent in parents)
- CNV filters:
  - Large CNVs (>5,000bp) in autosomal chromosomes:
    - Remove if the read depth copy number (RD_CN) of the parents is equal to the RD_CN of the proband.
    - Remove if the variant overlaps with a CNV in the parents, with minimum overlap of 0.5 (using bedtools coverage).
    - Check that the variant has raw depth support (uses bedtools coverage for >5000bp, and intersect for 1000-5000bp).
    - Remove if the variant overlaps with parental raw depth CNV (uses bedtools coverage for >5000bp, and intersect for 1000-5000bp).
  - Small CNVs (<=5000bp):
    - Ignore RD evidence if the evidence is “RD,SR” (use only SR)
    - Check that the variant has raw evidence (bedtools intersect)
    - Remove if the variant overlaps with parental raw evidence (bedtools intersect)
    - Remove if the variant is SR only but doesn’t have “bothsides support”
- Remove depth-only calls that are <10000bp
- Remove deletions that are >500bp, have RD_CN=2 and have PE-only evidence
- Flag if the parents' GQ is smaller than min_gq (currently has no effect, since we have been using 0)
- Flag INS that are either manta or melt, are SR only and have high SR background (filter removed)
- Flag if median coverage in parents is <= 10
- Blacklist: remove variants that overlap >0.5 with “blacklist” (repetitive and bad regions)
- INS filters:
  - Requires raw evidence
  - Remove the variant if it’s in the parental raw evidence file
- No filtering is applied on complex SVs, translocations and inversions
- Reformatting is applied at the end to remove duplicated CPX SVs.
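
Several of the filters above compare a variant against parental intervals using a minimum overlap fraction of 0.5. The sketch below is a pure-Python stand-in for the kind of "fraction of the variant covered" value that `bedtools coverage` reports; it is not the pipeline's implementation. The function names are hypothetical, coordinates are assumed to be half-open as in BED, and treating "minimum overlap" as `>=` is an assumption.

```python
def covered_fraction(start, end, intervals):
    """Fraction of the half-open interval [start, end) covered by the union of `intervals`."""
    length = end - start
    if length <= 0:
        return 0.0
    # Clip each interval to the variant, dropping those that do not overlap it.
    clipped = sorted(
        (max(s, start), min(e, end)) for s, e in intervals if s < end and e > start
    )
    covered = 0
    cursor = start  # right edge of the region already counted
    for s, e in clipped:
        # Count only the part not already covered by an earlier interval.
        if e > cursor:
            covered += e - max(s, cursor)
            cursor = e
    return covered / length


def overlaps_parental_cnv(start, end, parent_intervals, min_overlap=0.5):
    """Assumption: a variant is removed when the covered fraction reaches min_overlap."""
    return covered_fraction(start, end, parent_intervals) >= min_overlap
```

Overlapping parental intervals are counted once, so two partially overlapping parental calls cannot push the fraction above what they jointly cover.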
90 changes: 90 additions & 0 deletions website/docs/modules/denovo.md
@@ -0,0 +1,90 @@
---
title: DeNovoSV
description: Downstream filtering (work in progress)
sidebar_position: 15
slug: denovo
---

The de-novo workflow operates on the annotated multi-sample VCF file created by
Review comment (Collaborator):

Suggested change
The de-novo workflow operates on the annotated multi-sample VCF file created by
The de novo SV workflow operates on the annotated multi-sample VCF file created by

the [AnnotateVcf](./av) workflow. The method that the de-novo workflow implements is described on
[this page](/docs/concepts/denovo).


### Inputs

- `vcf_file`: output of [AnnotateVcf](./av) called output_vcf.
Note thatAll families in the vcf file must be included in the pedigree file
Review comment (Collaborator):

Suggested change
Note thatAll families in the vcf file must be included in the pedigree file
Note that all families in the vcf file must be included in the pedigree file



- `ped_input`: Must have a header as follows:

| FamID | IndividualID | FatherID | MotherID | Gender | Affected |
|-|-|-|-|-|-|
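
As a rough illustration of the two pedigree requirements above (the header, and every VCF sample being present in the pedigree), the following sketch checks both before a run. It assumes a tab-delimited pedigree file, which the text does not state, and both function names are hypothetical.

```python
EXPECTED_PED_HEADER = ["FamID", "IndividualID", "FatherID", "MotherID", "Gender", "Affected"]


def check_ped_header(first_line, delimiter="\t"):
    """Return True if the first line of the pedigree file matches the expected header."""
    columns = first_line.rstrip("\r\n").split(delimiter)
    return columns == EXPECTED_PED_HEADER


def samples_missing_from_ped(vcf_samples, ped_lines, delimiter="\t"):
    """Return VCF sample IDs that have no row in the pedigree (header line is skipped)."""
    ped_ids = {
        line.rstrip("\r\n").split(delimiter)[1]  # IndividualID is the second column
        for line in ped_lines[1:]
        if line.strip()
    }
    return sorted(set(vcf_samples) - ped_ids)
```

An empty list from `samples_missing_from_ped` means every sample in the VCF has a pedigree row.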

- `genomic_disorder_input`: a file in BED format that contains regions of genomic disorder;
Review comment (Collaborator):

Suggested change
- `genomic_disorder_input`: a file in BED format that contains regions of genomic disorder;
- `genomic_disorder_input`: a file in BED format that contains genomic disorder regions;

variants that overlap these regions will not be removed from the input VCF file.

- `contigs`: Should be set to the following list.
Review comment (Collaborator):

Suggested change
- `contigs`: Should be set to the following list.
- `contigs`: List of reference contig names, e.g. for `hg38`:


```
[ "chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrX" ]
```
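
If it helps to avoid typing the list by hand, the same 23 names can be generated and serialized as JSON. This is only a convenience sketch; it reproduces the list above exactly (autosomes plus `chrX`) and nothing more.

```python
import json

# Autosomes chr1..chr22 followed by chrX, matching the list shown above.
contigs = [f"chr{i}" for i in range(1, 23)] + ["chrX"]

# JSON array suitable for pasting into a workflow inputs file.
print(json.dumps(contigs))
```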

- `python_config`: a text file as the following.
Review comment (Collaborator):

Suggested change
- `python_config`: a text file as the following.
- `python_config`: a text file defining the following parameters:


```text
large_cnv_size: '1000'
gnomad_col: 'gnomAD_V2_AF'
alt_gnomad_col: 'gnomad_v2.1_sv_AF'
gnomad_AF: '0.01'
parents_AF: '0.05'
large_raw_overlap: '0.5'
small_raw_overlap: '0.5'
cohort_AF: '0.05'
coverage_cutoff: '10'
depth_only_size: '10000'
parents_overlap: '0.5'
gq_min: '0'
```
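
All values in the file above are quoted, so they read as strings and need converting before being used as thresholds. The sketch below shows one way to read such a file with the standard library only; how the workflow itself parses the file is not stated here, and `parse_config` is a hypothetical helper.

```python
def parse_config(text):
    """Parse `key: 'value'` lines into a dict of strings (quotes stripped)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip().strip("'\"")
    return config


example = """
large_cnv_size: '1000'
gnomad_col: 'gnomAD_V2_AF'
gnomad_AF: '0.01'
cohort_AF: '0.05'
"""

config = parse_config(example)
# Values are strings; convert numeric ones explicitly where they are used.
large_cnv_size = int(config["large_cnv_size"])
gnomad_af = float(config["gnomad_AF"])
```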

Note that you may increase the value of `cohort_AF` if the cohort is small.
Review comment (Collaborator):

How small?


- `batch_raw_file`:
a txt file with first column as batch and second column raw file generated from
module05-ClusterBatch for all callers except depth (clustered_manta_vcf, clustered_melt_vcf, clustered_wham_vcf).
Review comment (Collaborator) on lines +53 to +54:

Suggested change
a txt file with first column as batch and second column raw file generated from
module05-ClusterBatch for all callers except depth (clustered_manta_vcf, clustered_melt_vcf, clustered_wham_vcf).
a txt file where the first column is the batch name and second column is the raw file generated from
the ClusterBatch workflow for all callers except depth (clustered_manta_vcf, clustered_melt_vcf, clustered_wham_vcf).


- Must match batch names in `sample_batches` (see below)
- These batches and the samples contained in them are relevant with regard to the bincov matrices and raw files
- If you are using the `fam_ids` input to run the de novo script on certain samples only, you can still include all raw files for this cohort; the script will only localize the files containing your samples of interest

- `batch_depth_raw_file`:
a text file where the first column is the batch name and the second column is the raw file generated by module05-ClusterBatch for the depth caller (clustered_depth_vcf)

- Must match batch names in `sample_batches` (see below)
- These batches and the samples contained in them are relevant with regard to the bincov matrices and raw files
- If you are using the `fam_ids` input to run the de novo script on certain samples only, you can still include all raw files for this cohort; the script will only localize the files containing your samples of interest

- `batch_bincov_index`:
a text file where, for each batch, the first column is the batch name, the second column is merged_bincov and the third column is merged_bincov_index (both outputs of module04-GatherBatchEvidence)

- Must match batch names in `sample_batches` (see below)
- These batches and the samples contained in them are relevant with regard to the bincov matrices and raw files
- If you are using the `fam_ids` input to run the de novo script on certain samples only, you can still include all bincov matrices for this cohort; the script will only localize the files containing your samples of interest

- `sample_batches`:
a text file with sample IDs in the first column and batch names in the second column

- Must match batch names in `batch_bincov_index`, `batch_raw_file`, and `batch_depth_raw_file`
- These batches and the samples contained in them are relevant with regard to the bincov matrices and raw files
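
Because the batch names must agree across `sample_batches` and the per-batch files, a quick consistency check before launching can save a failed run. The sketch below assumes tab-delimited files without a header line, which the text does not state; `read_columns` and `batch_name_mismatches` are hypothetical helpers.

```python
def read_columns(lines, delimiter="\t"):
    """Split non-empty lines into columns."""
    return [line.rstrip("\r\n").split(delimiter) for line in lines if line.strip()]


def batch_name_mismatches(sample_batches_lines, batch_file_lines):
    """Compare batch names in sample_batches (column 2) with a per-batch file (column 1).

    Returns (names only in sample_batches, names only in the per-batch file).
    """
    sample_batch_names = {row[1] for row in read_columns(sample_batches_lines)}
    file_batch_names = {row[0] for row in read_columns(batch_file_lines)}
    return (
        sorted(sample_batch_names - file_batch_names),
        sorted(file_batch_names - sample_batch_names),
    )
```

Running the check once per batch file (`batch_raw_file`, `batch_depth_raw_file`, `batch_bincov_index`) and expecting two empty lists each time covers the requirement in the bullets above.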

- `prefix`: choose any prefix which will become the prefix of output files
Review comment (Collaborator):

Suggested change
- `prefix`: choose any prefix which will become the prefix of output files
- `prefix`: a prefix for output filenames


- `records_per_shard`: number of variants per sharded vcf file; 4000 variants/shard runs in ~1 hr

- `fam_ids`: optional input file of family IDs that you want to run the script on.
If you are running large cohorts (with one large VCF file), this option can be useful for breaking the VCF up into x samples per VCF file; de novo calling is then run only on the VCF with those samples. Try to keep batches together so that you are not localizing the same inputs in multiple runs.
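
One way to follow the advice about keeping batches together is to group families by the set of batches their members belong to, then write one `fam_ids` file per group. This is a hedged sketch rather than part of the workflow: `families_by_batch` is a hypothetical helper, and its inputs are assumed to have been read from the pedigree file and from `sample_batches`.

```python
from collections import defaultdict


def families_by_batch(ped_rows, sample_to_batch):
    """Group family IDs by the set of batches their members belong to.

    ped_rows: iterable of (FamID, IndividualID) pairs from the pedigree file.
    sample_to_batch: dict mapping sample ID to batch name (from sample_batches).
    """
    family_batches = defaultdict(set)
    for fam_id, individual_id in ped_rows:
        family_batches[fam_id].add(sample_to_batch[individual_id])

    groups = defaultdict(list)
    for fam_id, batches in family_batches.items():
        groups[frozenset(batches)].append(fam_id)
    return {batches: sorted(fam_ids) for batches, fam_ids in groups.items()}
```

Families in the same group need the same batch-level inputs, so running them together avoids localizing those inputs more than once.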





2 changes: 1 addition & 1 deletion website/docs/troubleshooting/_category_.json
@@ -1,6 +1,6 @@
{
"label": "Troubleshooting",
- "position": 5,
+ "position": 7,
"link": {
"type": "generated-index"
}