Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Algorithms docs formatting #1645

Closed
wants to merge 4 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 15 additions & 11 deletions docs/source/70_algorithms.md
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,7 @@ as well as the total number of bases within the covariate that do not match the
reference genome. From this data, we apply a correction by estimating the error
probability for each set of covariates under a beta-binomial model with uniform
prior. We have validated the concordance of our BQSR implementation against the
GATK. Across both tools, only 5000 of the 180B bases ($<0.0001\%$) in the
GATK. Across both tools, only 5000 of the 180B bases (`$<0.0001\%$`) in the
high-coverage NA12878 genome dataset differ. After investigating this
discrepancy, we have determined that this is due to an error in the GATK, where
paired-end reads are mishandled if the two reads in the pair overlap.
Expand Down Expand Up @@ -141,9 +141,9 @@ estimate the observed base quality using the below equation. This represents a
Bayesian model of the mismatch probability with Binomial likelihood and a
Beta(1, 1) prior.

$$
```math
\mathbf{E}(P_{err}|{cov}) = \frac{\text{\#errors}(cov) + 1}{\text{\#observations}(cov) + 2}
$$
```

After these probabilities are estimated, we go back across the input read
dataset and reconstruct the quality scores of the read by using the covariate
Expand Down Expand Up @@ -172,15 +172,17 @@ region covered by that read.
#### Convex-Hull Finding

Once we have identified the target realignment regions, we must then find the
maximal convex hulls across the set of regions. For a set $R$ of regions, we
define a maximal convex hull as the largest region $\hat{r}$ that satisfies the
maximal convex hulls across the set of regions. For a set `$R$` of regions, we
define a maximal convex hull as the largest region `$\hat{r}$` that satisfies the
following properties:

```math
\begin{align}
\hat{r} &= \cup_{r_i \in \hat{R}} r_i \\
\hat{r} \cap r_i &\ne \emptyset, \forall r_i \in \hat{R} \\
\hat{R} &\subset R
\end{align}
```

In our problem, we seek to find all of the maximal convex hulls, given a set of
regions. For genomics, the convexity constraint described by equation
Expand Down Expand Up @@ -269,21 +271,23 @@ take each read and compute the quality score weighted Hamming edit distance of
the read placed at each site in the consensus sequence. We then take the
minimum quality score weighted edit versus the consensus sequence and the
reference genome. We aggregate these scores together for all reads against this
consensus sequence. Given a consensus sequence $c$, a reference sequence $R$,
and a set of reads $\mathbf{r}$, we calculate this score using the equation
consensus sequence. Given a consensus sequence `$c$`, a reference sequence `$R$`,
and a set of reads `$\mathbf{r}$`, we calculate this score using the equation
below.

```math
\begin{align}
q_{i, j} &= \sum_{k = 0}^{l_{r_i}} Q_k I[r_I(k) = c(j + k)] \forall r_i \in \mathbf{R}, j \in \{0, \dots, l_c - l_{r_i}\} \\
q_{i, R} &= \sum_{k = 0}^{l_{r_i}} Q_k I[r_I(k) = c(j + k)] \forall r_i \in \mathbf{R}, j = \text{pos}(r_i | R) \\
q_i &= \min(q_{i, R}, \min_{j \in \{0, \dots, l_c - l_{r_i}\}} q_{i, j}) \\
q_c &= \sum_{r_i \in \mathbf{r}} q_i
\end{align}
```

In the above equation, $s(i)$ denotes the base at position $i$ of sequence $s$,
and $l_s$ denotes the length of sequence $s$. We pick the consensus sequence
that minimizes the $q_c$ value. If the chosen consensus has a log-odds ratio
(LOD) that is greater than $5.0$ with respect to the reference, we realign the
In the above equation, `$s(i)$` denotes the base at position `$i$` of sequence `$s$`,
and `$l_s$` denotes the length of sequence `$s$`. We pick the consensus sequence
that minimizes the `$q_c$` value. If the chosen consensus has a log-odds ratio
(LOD) that is greater than `$5.0$` with respect to the reference, we realign the
reads. This is done by recomputing the CIGAR and MDTag for each new alignment.
Realigned reads have their mapping quality score increased by 10 in the Phred
scale.
Expand Down