broadinstitute · VJalili · May 13, 2024 · May 20, 2024 · May 20, 2024 · mwalker174
diff --git a/website/docs/concepts/_category_.json b/website/docs/concepts/_category_.json
@@ -0,0 +1,7 @@
+{
+  "label": "Concepts",
+  "position": 5,
+  "link": {
+    "type": "generated-index"
+  }
+}
diff --git a/website/docs/concepts/denovo.md b/website/docs/concepts/denovo.md
@@ -0,0 +1,44 @@
+---
+title: de-novo SVs
+description: Downstream filtering (work in progress)
+sidebar_position: 1
+slug: denovo
+---
+
+
+The de novo workflow identifies a variant as de novo based on criteria defined on 
+site-specific and sample-specific metrics of the variant.
+These criteria are discussed in detail in the following sections. 
+
+
+## Site-specific filters
+
+- Exclude multi-allelic copy number variants (mCNV) and breakends (BND).
+- Exclude variants with gnomAD `AF > 0.01`, cohort `AF > 0.05` or parents `AF > 0.05` (except for GDs).
+- Remove Wham-only calls with `GQ = 1`.
+
+## Sample-specific filters
+
+- De novo SV (call absent in parents)
+- CNV filters:
+  - Large CNVs (>5,000bp) in autosomal chromosomes: 
+    - Remove if read depth copy number (RD_CN) of parents is equal to RD_CN of proband.
+    - Remove if the variant overlaps with a CNV in the parents, with minimum overlap of 0.5 (using bedtools coverage).
+    - Check that variant has raw depth support (uses bedtools coverage for >5000bp, and intersect for 1000-5000bp).
+    - Remove if variant overlaps with parental raw depth CNV (uses bedtools coverage for >5000bp, and intersect for 1000-5000bp).
+  - Small CNVs (<=5000bp):
+    - Ignore RD evidence if the evidence is “RD,SR” (use only SR)
+    - Check that variant has raw evidence (bedtools intersect)
+    - Remove if variant overlaps with parental raw evidence (bedtools intersect)
+    - Remove if variant is SR only but doesn’t have “bothsides support”
+  - Remove depth only calls that are <10000bp
+  - Remove deletions that are >500bp and RD_CN=2 and PE only evidence
+  - Flag if parents GQ is smaller than min_gq (does not apply, we have been using 0)
+  - Flag INS that are either manta or melt and SR only and have high SR background (filter removed)
+  - Flag if median coverage in parents is <= 10
+- Blacklist: remove variants that overlap >0.5 with “blacklist” (repetitive and bad regions)
+- INS filters:
+  - Requires raw evidence
+  - Remove variant if it’s in parental raw evidence file
+- No filtering is applied on complex SVs, translocations and inversions
+- Reformatting is applied at the end to remove duplicated CPX SVs.
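
The site-specific filters listed in `concepts/denovo.md` could be sketched as a single predicate. This is a minimal illustration under stated assumptions, not the workflow's actual code: the record layout (a plain dict) and the field names `svtype`, `gnomad_af`, `cohort_af`, `parents_af`, `in_gd_region`, `algorithms`, and `gq` are all hypothetical. It also assumes that mCNVs carry the type label `CNV` and that the genomic disorder (GD) exemption covers all three AF thresholds, which the text does not state explicitly.

```python
# Thresholds come from the filter list; the record layout is hypothetical.
GNOMAD_AF_MAX = 0.01
COHORT_AF_MAX = 0.05
PARENTS_AF_MAX = 0.05
EXCLUDED_SVTYPES = {"CNV", "BND"}  # assumption: mCNVs are labeled "CNV"


def passes_site_filters(record):
    """Return True if a variant survives the site-specific filters."""
    # Exclude multi-allelic CNVs and breakends.
    if record["svtype"] in EXCLUDED_SVTYPES:
        return False

    too_common = (
        record["gnomad_af"] > GNOMAD_AF_MAX
        or record["cohort_af"] > COHORT_AF_MAX
        or record["parents_af"] > PARENTS_AF_MAX
    )
    # Assumption: the GD exemption applies to all three AF checks.
    if too_common and not record["in_gd_region"]:
        return False

    # Remove Wham-only calls with GQ equal to 1.
    if set(record["algorithms"]) == {"wham"} and record["gq"] == 1:
        return False

    return True


example = {
    "svtype": "DEL",
    "gnomad_af": 0.001,
    "cohort_af": 0.01,
    "parents_af": 0.0,
    "in_gd_region": False,
    "algorithms": ["manta", "wham"],
    "gq": 30,
}
print(passes_site_filters(example))
```

Keeping the thresholds as named constants mirrors the `gnomad_AF`, `cohort_AF`, and `parents_AF` entries of `python_config`, so a real implementation would presumably read them from that file instead.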
diff --git a/website/docs/modules/denovo.md b/website/docs/modules/denovo.md
@@ -0,0 +1,90 @@
+---
+title: DeNovoSV
+description: Downstream filtering (work in progress)
+sidebar_position: 15
+slug: denovo
+---
+
+The de novo SV workflow operates on the annotated multi-sample VCF file created by 
+the [AnnotateVcf](./av) workflow. The method implemented by this workflow is described on 
+[this page](/docs/concepts/denovo).
+
+
+### Inputs
+
+- `vcf_file`: the `output_vcf` output of [AnnotateVcf](./av).
+  Note that all families in the VCF file must be included in the pedigree file.
+
+
+- `ped_input`: Must have a header as follows:
+
+  | FamID | IndividualID | FatherID | MotherID | Gender    | Affected |
+  |-|-|-|-|-|-|
+
+- `genomic_disorder_input`: a file in BED format that contains genomic disorder regions; 
+   variants that overlap these regions will not be removed from the input VCF file. 
+
+- `contigs`: List of reference contig names, e.g. for `hg38`:
+
+  ```
+  [ "chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrX" ]
+  ```
+
+- `python_config`: a text file defining the following parameters:
+
+  ```text
+  large_cnv_size: '1000'
+  gnomad_col: 'gnomAD_V2_AF'
+  alt_gnomad_col: 'gnomad_v2.1_sv_AF'
+  gnomad_AF: '0.01'
+  parents_AF: '0.05'
+  large_raw_overlap: '0.5'
+  small_raw_overlap: '0.5'
+  cohort_AF: '0.05'
+  coverage_cutoff: '10'
+  depth_only_size: '10000'
+  parents_overlap: '0.5'
+  gq_min: '0'
+  ```
+
+  Note that you may increase the value of `cohort_AF` if the cohort is small.
+
+- `batch_raw_file`: 
+  a txt file where the first column is the batch name and second column is the raw file generated from 
+  the ClusterBatch workflow for all callers except depth (clustered_manta_vcf, clustered_melt_vcf, clustered_wham_vcf).
+
+  - Batch names must match those in `sample_batches` (see below).
+  - These batches and the samples contained in them are relevant with regard to the bincov matrices and raw files.
+  - If you use the `fam_ids` input to run the de novo script on only certain samples, you can still include all raw files for this cohort; the script will localize only the files containing your samples of interest.
+
+- `batch_depth_raw_file`:
+  a txt file where the first column is the batch name and the second column is the raw file generated from module05-ClusterBatch for the depth caller (clustered_depth_vcf).
+
+  - Batch names must match those in `sample_batches` (see below).
+  - These batches and the samples contained in them are relevant with regard to the bincov matrices and raw files.
+  - If you use the `fam_ids` input to run the de novo script on only certain samples, you can still include all raw files for this cohort; the script will localize only the files containing your samples of interest.
+
+- `batch_bincov_index`: 
+  a txt file where, for each batch, the first column is the batch name, the second column is merged_bincov, and the third column is merged_bincov_index (both outputs of module04-GatherBatchEvidence).
+
+  - Batch names must match those in `sample_batches` (see below).
+  - These batches and the samples contained in them are relevant with regard to the bincov matrices and raw files.
+  - If you use the `fam_ids` input to run the de novo script on only certain samples, you can still include all bincov matrices for this cohort; the script will localize only the files containing your samples of interest.
+
+- `sample_batches`:
+  a txt file with sample IDs in the first column and batch names in the second column.
+
+  - Batch names must match those in `batch_bincov_index`, `batch_raw_file`, and `batch_depth_raw_file`.
+  - These batches and the samples contained in them are relevant with regard to the bincov matrices and raw files.
+
+- `prefix`: a prefix for output filenames
+
+- `records_per_shard`: number of variants per sharded vcf file; 4000 variants/shard runs in ~1 hr
+
+- `fam_ids`: an optional file listing the family IDs to run the script on.
+  For large cohorts (with one large VCF file), this option can be used to split the VCF into smaller files of x samples each, so that de novo calling is run only on the VCF containing those samples. Try to keep batches together so that the same inputs are not localized in multiple runs.
+
+
+
+
+
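
Several inputs in `modules/denovo.md` require that batch names agree across `sample_batches`, `batch_raw_file`, `batch_depth_raw_file`, and `batch_bincov_index`. A pre-flight check along these lines might catch mismatches before launching the workflow. This is only a sketch: the helper names `read_batches` and `missing_batches` are hypothetical, and it assumes the files are tab-separated with no header row, which the documentation does not specify.

```python
import csv
import io


def read_batches(text, batch_column):
    """Collect batch names from one column of tab-separated text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return {row[batch_column] for row in reader if row}


def missing_batches(sample_batches_text, batch_file_text):
    """Return batches named in sample_batches but absent from a batch file."""
    # sample_batches: sample in column 0, batch in column 1.
    wanted = read_batches(sample_batches_text, batch_column=1)
    # batch_raw_file and similar inputs: batch in column 0.
    available = read_batches(batch_file_text, batch_column=0)
    return sorted(wanted - available)


# Hypothetical file contents for illustration only.
sample_batches = "s1\tbatchA\ns2\tbatchB\n"
batch_raw_file = "batchA\traw_a.vcf.gz\n"
print(missing_batches(sample_batches, batch_raw_file))  # prints ['batchB']
```

The same call could be repeated with the contents of `batch_depth_raw_file` and `batch_bincov_index`, since all three place the batch name in the first column.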
diff --git a/website/docs/troubleshooting/_category_.json b/website/docs/troubleshooting/_category_.json
@@ -1,6 +1,6 @@
 {
   "label": "Troubleshooting",
-  "position": 5,
+  "position": 7,
   "link": {
     "type": "generated-index"
   }