---
title: "Exercise sheet 3: structure probability"
---

```{r, include=FALSE}
source("assets/custom_functions.R")
library(flextable)
library(officer)
```

---------------------------------

# Exercise 1 - entropy

<!--- --------------------------------- -->

This exercise takes a brief look at the concept of entropy.
Let $X$ be a discrete random variable with probabilities $p(i), (i=1,2,\ldots,n)$.
The amount of information to be associated with an outcome of probability $p(i)$
is defined as $I(p(i)) = \log_2(\frac{1}{p(i)})$.

In information theory, the expected value of the information $I(X)$ for the random variable $X$ is called entropy.
The entropy is given by

$$H(X) = E[I(X)] = \sum_{i=1}^{n}p(i)\log_2(\frac{1}{p(i)}).$$

Note that all logarithms in this exercise are base 2 (standard in information theory).

<!--- --------------------------------- -->

### 1.1

::: {.question data-latex=""}

Compute the entropy of the probability distribution $p(1) = \frac{1}{2},\, p(2) = \frac{1}{4},\, p(3) = \frac{1}{8},\, p(4) = \frac{1}{8}$.

:::

#### {.tabset}

##### Hide

##### Hint

::: {.answer data-latex=""}

Compute the expected information $E[I(X)]$ of the random variable $X$, i.e. the entropy $H(X)$, as defined above.

:::

##### Solution

::: {.answer data-latex=""}

$$
\begin{align*}
H(X) & = p(1)\log_2\left(\frac{1}{p(1)}\right) + p(2)\log_2\left(\frac{1}{p(2)}\right) + p(3)\log_2\left(\frac{1}{p(3)}\right) + p(4)\log_2\left(\frac{1}{p(4)}\right)\\
& = \frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \frac{3}{8}\\
& = \frac{14}{8} = 1.75
\end{align*}
$$
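
The calculation can also be checked numerically. The following is a minimal Python sketch, not part of the original sheet; the function name `entropy` is chosen here for illustration only.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits of a discrete probability distribution."""
    # Outcomes with probability 0 are skipped: p * log2(1/p) tends to 0 there.
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

# Distribution from exercise 1.1
distribution = [1/2, 1/4, 1/8, 1/8]
print(entropy(distribution))  # 1.75
```

All probabilities here are powers of two, so the floating-point result is exact.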

:::

#### {-}

<!--- --------------------------------- -->

### 1.2

::: {.question data-latex=""}

Compute the entropy of the distribution $p(i) = \frac{1}{4},\: (i= 1,2,3,4)$.

:::

#### {.tabset}

##### Hide


##### Solution

::: {.answer data-latex=""}

$$
\begin{align*}
H(X) & = 4 \cdot p(i) \log_2\left(\frac{1}{p(i)}\right)\\
& = 4 \cdot \frac{1}{4} \cdot \log_2(4)\\
& = 2
\end{align*}
$$
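
More generally, a uniform distribution over $n$ outcomes has entropy $\log_2(n)$. A short derivation using the definition of $H(X)$ from the introduction:

$$
\begin{align*}
H(X) & = \sum_{i=1}^{n} \frac{1}{n} \log_2\left(\frac{1}{1/n}\right)\\
& = n \cdot \frac{1}{n} \cdot \log_2(n)\\
& = \log_2(n)
\end{align*}
$$

For $n = 4$ this gives $\log_2(4) = 2$.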

:::

#### {-}

<!--- --------------------------------- -->

### 1.3

::: {.question data-latex=""}

How can you explain the difference in entropy between the two probability distributions?

:::

#### {.tabset}

##### Hide

##### Hint

::: {.answer data-latex=""}

For which of the two probability distributions do you already know more about the outcome before the event happens?

:::

##### Solution

::: {.answer data-latex=""}

The distribution in 1.1 tells us more in advance than the one in 1.2: we already know that $i=1$ is the most likely outcome.
For the uniform distribution in 1.2 every outcome is equally likely, so we learn more from observing the event itself (see the coin example from the lecture).

This results in a higher entropy for the probability distribution in 1.2 ($2$ versus $\frac{14}{8} = 1.75$).

:::

#### {-}

<!--- --------------------------------- -->

# Exercise 2 - partition function

<!--- --------------------------------- -->

Consider the RNA sequence `AUCACCGC` with a minimal loop length of $1$.
Compute the partition function of the molecule using an energy function similar to the Nussinov algorithm:
$$
	E(P) = |P|
$$
That is, the energy of a structure $P$ is its number of base pairs.
Structures are non-crossing, and only GC and AU pairs are considered.

Table of Boltzmann weights for $RT = 1$:

```{r, results="asis", include=knitr::is_html_output(), echo=FALSE}
sij <- read.csv("assets/tables/exercise-sheet-3/boltzmann-weights.csv", check.names=FALSE, sep=";")
sij_ft <- flextable(sij)
sij_ft <- custom_theme(sij_ft)
index_replace(sij_ft)
```
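
For reference, the partition function sums the Boltzmann weights of all structures $P$ that the sequence can form. With $RT = 1$ and the energy function above, this reads:

$$
Z = \sum_{P} e^{-\frac{E(P)}{RT}} = \sum_{P} e^{-|P|}
$$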

<!--- --------------------------------- -->

### 2.1

::: {.question data-latex=""}

Compute $Z$ via exhaustive enumeration of the structure space, using the provided Boltzmann weights.

:::

#### {.tabset}

##### Hide

##### Hint

::: {.answer data-latex=""}

To compute $Z$ you need to know all possible structures.

:::

##### Solution

::: {.answer data-latex=""}

```{r, results="asis", include=knitr::is_html_output(), echo=FALSE}
sij <- read.csv("assets/tables/exercise-sheet-3/e3-1.csv", check.names=FALSE, sep=";")
sij_ft <- flextable(sij)
sij_ft <- custom_theme(sij_ft)
index_replace(sij_ft)
```
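
The enumeration can be reproduced programmatically. The following Python sketch is an illustration only and not part of the original sheet: it assumes that a minimal loop length of $1$ means at least one unpaired position enclosed by each base pair, and the names `enumerate_structures`, `MIN_LOOP` and `ALLOWED` are chosen here.

```python
import math

SEQUENCE = "AUCACCGC"
MIN_LOOP = 1  # assumed: minimal number of unpaired positions enclosed by a pair
ALLOWED = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def enumerate_structures(i, j):
    """Return all non-crossing structures on positions i..j (1-based, inclusive).

    Each structure is a tuple of base pairs (k, l) with k < l.
    """
    if i > j:
        return [()]
    # Case 1: position i stays unpaired.
    structures = list(enumerate_structures(i + 1, j))
    # Case 2: position i pairs with k. The pair splits the interval into an
    # enclosed part (i+1..k-1) and a remainder (k+1..j), so pairs never cross.
    for k in range(i + MIN_LOOP + 1, j + 1):
        if (SEQUENCE[i - 1], SEQUENCE[k - 1]) in ALLOWED:
            for inner in enumerate_structures(i + 1, k - 1):
                for outer in enumerate_structures(k + 1, j):
                    structures.append(((i, k),) + inner + outer)
    return structures

structures = enumerate_structures(1, len(SEQUENCE))

# Energy E(P) = |P| and RT = 1, so the Boltzmann weight of P is exp(-|P|).
Z = sum(math.exp(-len(P)) for P in structures)

for P in structures:
    print(P, round(math.exp(-len(P)), 3))
print("Z =", round(Z, 3))
```

Under these assumptions the sketch finds five structures (the open chain, three structures with a single base pair and one with two base pairs) and $Z = 1 + 3e^{-1} + e^{-2} \approx 2.239$.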

:::

#### {-}

<!--- --------------------------------- -->

### 2.2

::: {.question data-latex=""}

Compute the structure probability for each structure.

:::

#### {.tabset}

##### Hide

##### Formula

::: {.answer data-latex=""}

$$ Pr[P|S] = \frac{e^{-\frac{E(P)}{RT}}}{\sum_{P'} e^{-\frac{E(P')}{RT}}} $$

:::

##### Solution

::: {.answer data-latex=""}

```{r, results="asis", include=knitr::is_html_output(), echo=FALSE}
sij <- read.csv("assets/tables/exercise-sheet-3/e3-2.csv", check.names=FALSE, sep=";")
sij_ft <- flextable(sij)
sij_ft <- custom_theme(sij_ft)
index_replace(sij_ft)
```
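
As a worked check of the formula, assuming the five structures that enumeration yields for this sequence (the open chain, three structures with one base pair each, and one structure with two base pairs) and $RT = 1$, with values rounded to three decimals:

$$
\begin{align*}
Z &= e^{0} + 3e^{-1} + e^{-2} \approx 2.239\\
Pr[P|S] &= \frac{e^{0}}{Z} \approx 0.447 \quad \text{for } |P| = 0\\
Pr[P|S] &= \frac{e^{-1}}{Z} \approx 0.164 \quad \text{for } |P| = 1\\
Pr[P|S] &= \frac{e^{-2}}{Z} \approx 0.060 \quad \text{for } |P| = 2
\end{align*}
$$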

:::

#### {-}

<!--- --------------------------------- -->

### 2.3

::: {.question data-latex=""}

Compare the structure probabilities with each other. What is the structure probability of the open chain? What can you observe?

:::

#### {.tabset}

##### Hide

##### Solution

::: {.answer data-latex=""}

The open chain is more likely than any other structure, including the optimal structure (here $P=\{(2,4),(5,7)\}$).

:::

#### {-}

<!--- --------------------------------- -->

### 2.4

::: {.question data-latex=""}

Is this result expected?

:::

#### {.tabset}

##### Hide

##### Hint

::: {.answer data-latex=""}

Base pairs usually stabilize a structure.

:::

##### Solution

::: {.answer data-latex=""}

This is not expected: base pairs stabilize a structure, so the open chain should not have a higher probability than the optimal structure.

:::

#### {-}

<!--- --------------------------------- -->

### 2.5

::: {.question data-latex=""}

If not, how can you fix it? Test your idea by recalculating the structure probabilities!

:::

#### {.tabset}

##### Hide

##### Hint

::: {.answer data-latex=""}

How can you ensure that the open chain gets the lowest Boltzmann weight without changing the idea of scoring the number of base pairs for computing the energy of a structure?

:::


##### Solution

::: {.answer data-latex=""}

The open chain has the highest probability because the Boltzmann weight $e^{-\frac{E(P)}{RT}}$ decreases with an increasing number of base pairs.
The solution is to negate the energy function, i.e. $E(P) = -|P|$.

```{r, results="asis", include=knitr::is_html_output(), echo=FALSE}
sij <- read.csv("assets/tables/exercise-sheet-3/e3-3_1.csv", check.names=FALSE, sep=";")
sij_ft <- flextable(sij)
sij_ft <- custom_theme(sij_ft)
index_replace(sij_ft)
```

```{r, results="asis", include=knitr::is_html_output(), echo=FALSE}
sij <- read.csv("assets/tables/exercise-sheet-3/e3-3_2.csv", check.names=FALSE, sep=";")
sij_ft <- flextable(sij)
sij_ft <- custom_theme(sij_ft)
index_replace(sij_ft)
```

We can now search for the optimal structure by energy minimization, and the minimum free energy (MFE) structure is the most probable structure, as expected.
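
The effect of the sign change can be checked with a short Python sketch. It is an illustration only: the five structures are the ones obtained by hand enumeration under the rules of this exercise, and the names `pair_counts` and `structure_probabilities` are chosen here.

```python
import math

# Number of base pairs |P| for each of the five structures of AUCACCGC
# (hand-enumerated under the exercise's rules; labels are illustrative).
pair_counts = {
    "open chain": 0,
    "(2,4)": 1,
    "(3,7)": 1,
    "(5,7)": 1,
    "(2,4),(5,7)": 2,
}

def structure_probabilities(energy_per_pair, RT=1.0):
    """Boltzmann probabilities for E(P) = energy_per_pair * |P|."""
    weights = {name: math.exp(-energy_per_pair * n / RT)
               for name, n in pair_counts.items()}
    Z = sum(weights.values())
    return {name: w / Z for name, w in weights.items()}

original = structure_probabilities(+1.0)   # E(P) = |P|
corrected = structure_probabilities(-1.0)  # E(P) = -|P|

# Columns: structure, probability with E(P) = |P|, probability with E(P) = -|P|
for name in pair_counts:
    print(f"{name:12s} {original[name]:.2f} {corrected[name]:.2f}")
```

With $E(P) = |P|$ the open chain is the most probable structure; with $E(P) = -|P|$ the structure with two base pairs is, and the open chain drops to about $0.06$.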

:::

#### {-}

<!--- --------------------------------- -->

# Exercise 3 - Base pair and unpaired probabilities

<!--- --------------------------------- -->

Use the corrected structure probabilities from the previous exercise for the sequence `AUCACCGC`.

<!--- --------------------------------- -->

### 3.1

::: {.question data-latex=""}

Compute the base pair probabilities using your corrected energy function and visualize them in a dot plot. Use the probability values instead of dots.

:::

#### {.tabset}

##### Hide

##### Formula

::: {.answer data-latex=""}

Compute the base pair probabilities:
$$ Pr[(i,j)|S] = \sum_{P \ni (i,j)} Pr[P|S] $$

:::

##### Solution

::: {.answer data-latex=""}

```{r, include=knitr::is_html_output(), echo=FALSE,  fig.align='center', out.width='50%'}
knitr::include_graphics("assets/figures/exercise-sheet-3/e4-1.png")
```
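
The values in the dot plot can be recomputed with the following Python sketch. It is an illustration only and assumes the five hand-enumerated structures of `AUCACCGC` together with the corrected energy function $E(P) = -|P|$ and $RT = 1$; the variable names are chosen here.

```python
import math
from collections import defaultdict

# The five structures of AUCACCGC as lists of base pairs (hand-enumerated).
structures = [
    [],
    [(2, 4)],
    [(3, 7)],
    [(5, 7)],
    [(2, 4), (5, 7)],
]

# Corrected energy E(P) = -|P| with RT = 1 gives the weight exp(|P|).
weights = [math.exp(len(P)) for P in structures]
Z = sum(weights)

# Pr[(i,j)|S] is the summed probability of all structures containing (i,j).
pair_probability = defaultdict(float)
for P, w in zip(structures, weights):
    for pair in P:
        pair_probability[pair] += w / Z

for pair, prob in sorted(pair_probability.items()):
    print(pair, round(prob, 2))
```

The pairs $(2,4)$ and $(5,7)$ each occur in two structures and come out at about $0.61$, whereas $(3,7)$ occurs only on its own and stays at about $0.16$.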

:::

#### {-}

<!--- --------------------------------- -->

### 3.2

::: {.question data-latex=""}

Compute the probabilities $P^u(4)$ and $P^u(5)$ that sequence positions 4 and 5, respectively, are unpaired, and the probability $P^u(4,5)$ that the subsequence $4..5$ is not involved in any base pair.

:::

#### {.tabset}

##### Hide

##### Formula

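::: {.answer data-latex=""}

$$ P^u(i,j) = \frac{Z_{i,j}^u}{Z} $$

Here $Z_{i,j}^u$ is the partition function restricted to the structures in which all positions $i..j$ are unpaired. A single position corresponds to $P^u(i) = P^u(i,i)$.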

:::

##### Solution

::: {.answer data-latex=""}

$$
\begin{align*}
P^u(4) &= 0.06 + 0.16 + 0.16 = 0.38\\
P^u(5) &= 0.06 + 0.16 + 0.16 = 0.38\\
P^u(4,5) &= 0.06 + 0.16 = 0.22
\end{align*}
$$
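
The sums above use structure probabilities rounded to two decimals. The following Python sketch recomputes them without intermediate rounding. It is an illustration only and assumes the five hand-enumerated structures, $E(P) = -|P|$ and $RT = 1$; the function name `unpaired_probability` is chosen here.

```python
import math

# The five structures of AUCACCGC as lists of base pairs (hand-enumerated).
structures = [
    [],
    [(2, 4)],
    [(3, 7)],
    [(5, 7)],
    [(2, 4), (5, 7)],
]

# Corrected energy E(P) = -|P| with RT = 1 gives the weight exp(|P|).
weights = [math.exp(len(P)) for P in structures]
Z = sum(weights)
probabilities = [w / Z for w in weights]

def unpaired_probability(i, j):
    """Probability that no position in i..j (inclusive) is base-paired."""
    region = set(range(i, j + 1))
    return sum(
        p for P, p in zip(structures, probabilities)
        if not any(k in region for pair in P for k in pair)
    )

print(round(unpaired_probability(4, 4), 3))  # P^u(4)
print(round(unpaired_probability(5, 5), 3))  # P^u(5)
print(round(unpaired_probability(4, 5), 3))  # P^u(4,5)
```

Without intermediate rounding the values are about $0.389$, $0.389$ and $0.225$, so the small differences from the sums above come only from rounding the individual structure probabilities.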

:::

#### {-}

<!--- --------------------------------- -->

### 3.3

::: {.question data-latex=""}

Why can't we compute $P^u(4,5)$ from $P^u(4)$ and $P^u(5)$ using their product, i.e. why is $P^u(4,5) \neq P^u(4)\cdot P^u(5)$?

:::

#### {.tabset}

##### Hide

##### Hint

::: {.answer data-latex=""}

Are the events "position 4 is unpaired" and "position 5 is unpaired" independent?

:::

##### Solution

::: {.answer data-latex=""}

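The structures counted for $P^u(4,5)$ are exactly those counted for both $P^u(4)$ and $P^u(5)$, i.e. the intersection of the two structure sets.
The two events are therefore not independent, and their probabilities cannot simply be multiplied: here $P^u(4)\cdot P^u(5) = 0.38 \cdot 0.38 \approx 0.14 \neq 0.22 = P^u(4,5)$.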

:::

#### {-}

<!--- --------------------------------- -->