Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Algorithms docs formatting #1645

Closed
wants to merge 4 commits into from
Closed
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 20 additions & 16 deletions docs/source/70_algorithms.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Core Algorithms {#algorithms}
# Core Algorithms

## Read Preprocessing Algorithms

Expand Down Expand Up @@ -43,7 +43,7 @@ as well as the total number of bases within the covariate that do not match the
reference genome. From this data, we apply a correction by estimating the error
probability for each set of covariates under a beta-binomial model with uniform
prior. We have validated the concordance of our BQSR implementation against the
GATK. Across both tools, only 5000 of the 180B bases ($<0.0001\%$) in the
GATK. Across both tools, only 5000 of the 180B bases (`$<0.0001\%$`) in the
high-coverage NA12878 genome dataset differ. After investigating this
discrepancy, we have determined that this is due to an error in the GATK, where
paired-end reads are mishandled if the two reads in the pair overlap.
Expand All @@ -70,7 +70,7 @@ variants; see DePristo et al [@depristo11].
In the rest of this section, we discuss the high level implementations of these
algorithms.

### BQSR Implementation {#bqsr}
### BQSR Implementation

Base quality score recalibration seeks to identify and correct correlated
errors in base quality score estimates. At a high level, this is done by
Expand Down Expand Up @@ -141,15 +141,15 @@ estimate the observed base quality using the below equation. This represents a
Bayesian model of the mismatch probability with Binomial likelihood and a
Beta(1, 1) prior.

$$
```math
\mathbf{E}(P_{err}|{cov}) = \frac{\text{\#errors}(cov) + 1}{\text{\#observations}(cov) + 2}
$$
```

After these probabilities are estimated, we go back across the input read
dataset and reconstruct the quality scores of the read by using the covariate
assigned to the read to look into the covariate table.

### Indel Realignment Implementation {#realignment}
### Indel Realignment Implementation

Although global alignment will frequently succeed at aligning reads to the
proper region of the genome, the local alignment of the read may be incorrect.
Expand All @@ -172,15 +172,17 @@ region covered by that read.
#### Convex-Hull Finding

Once we have identified the target realignment regions, we must then find the
maximal convex hulls across the set of regions. For a set $R$ of regions, we
define a maximal convex hull as the largest region $\hat{r}$ that satisfies the
maximal convex hulls across the set of regions. For a set `$R$` of regions, we
define a maximal convex hull as the largest region `$\hat{r}$` that satisfies the
following properties:

```math
\begin{align}
\hat{r} &= \cup_{r_i \in \hat{R}} r_i \\
\hat{r} \cap r_i &\ne \emptyset, \forall r_i \in \hat{R} \\
\hat{R} &\subset R
\end{align}
```

In our problem, we seek to find all of the maximal convex hulls, given a set of
regions. For genomics, the convexity constraint described by equation
Expand Down Expand Up @@ -241,7 +243,7 @@ tail-call recursive `mergeTargetSets` function that is described in Algorithm
The set returned by this function is used as an index for mapping reads
directly to realignment targets.

#### Candidate Generation and Realignment {#consensus-model}
#### Candidate Generation and Realignment

Once we have generated the target set, we map across all the reads and check to
see if the read overlaps a realignment target. We then group together all reads
Expand Down Expand Up @@ -269,26 +271,28 @@ take each read and compute the quality score weighted Hamming edit distance of
the read placed at each site in the consensus sequence. We then take the
minimum quality score weighted edit versus the consensus sequence and the
reference genome. We aggregate these scores together for all reads against this
consensus sequence. Given a consensus sequence $c$, a reference sequence $R$,
and a set of reads $\mathbf{r}$, we calculate this score using the equation
consensus sequence. Given a consensus sequence `$c$`, a reference sequence `$R$`,
and a set of reads `$\mathbf{r}$`, we calculate this score using the equation
below.

```math
\begin{align}
q_{i, j} &= \sum_{k = 0}^{l_{r_i}} Q_k I[r_I(k) = c(j + k)] \forall r_i \in \mathbf{R}, j \in \{0, \dots, l_c - l_{r_i}\} \\
q_{i, R} &= \sum_{k = 0}^{l_{r_i}} Q_k I[r_I(k) = c(j + k)] \forall r_i \in \mathbf{R}, j = \text{pos}(r_i | R) \\
q_i &= \min(q_{i, R}, \min_{j \in \{0, \dots, l_c - l_{r_i}\}} q_{i, j}) \\
q_c &= \sum_{r_i \in \mathbf{r}} q_i
\end{align}
```

In the above equation, $s(i)$ denotes the base at position $i$ of sequence $s$,
and $l_s$ denotes the length of sequence $s$. We pick the consensus sequence
that minimizes the $q_c$ value. If the chosen consensus has a log-odds ratio
(LOD) that is greater than $5.0$ with respect to the reference, we realign the
In the above equation, `$s(i)$` denotes the base at position `$i$` of sequence `$s$`,
and `$l_s$` denotes the length of sequence `$s$`. We pick the consensus sequence
that minimizes the `$q_c$` value. If the chosen consensus has a log-odds ratio
(LOD) that is greater than `$5.0$` with respect to the reference, we realign the
reads. This is done by recomputing the CIGAR and MDTag for each new alignment.
Realigned reads have their mapping quality score increased by 10 in the Phred
scale.

### Duplicate Marking Implementation {#duplicate-marking}
### Duplicate Marking Implementation

Reads may be duplicated during sequencing, either due to clonal duplication
via PCR before sequencing, or due to optical duplication while on the
Expand Down