hmm.Rmd

# Hidden Markov Models

A hidden Markov model (HMM) defines a density function for a
sequence of observations $y_1, \ldots, y_N$, where

* the observations are conditionally independent draws from a mixture
distribution with $K$ components, and
* the unobserved mixture components $z_1, \ldots, z_N \in 1:K$ form a
Markov process.

The Markov process for the mixture components is governed by

* an initial probability simplex $\phi \in \mathbb{R}^K$,
* stochastic matrix $\Theta \in \mathbb{R}^{K \times K}$, and

with
$$
p(z \mid \phi, \Theta)
= \phi_{z[1]} \cdot \prod_{n = 2}^N \theta_{z[n - 1], z[n]}.
$$
The sequence of observations $y$ is conditionally independent given
$z$,
$$
p(y \mid z) = \prod_{n=1}^N p(y \mid z_n = k).
$$
The $N \times K$ emission matrix is defined by taking
$$
\Lambda_{n, k} = p(y \mid z_n = k).
$$

The complete data density for HMMs is
$$
p(y, z \mid \phi, \Theta, \Lambda)
= \phi_{z[1]}
\cdot \prod_{n = 2}^N \Theta_{z[n - 1], z[n]}
\cdot \prod_{n = 1}^N \Lambda_{n, z[n]}.
$$
The density is defined by marginalizing out the unobserved latent
states $z$,
$$
p(y \mid \phi, \Theta, \Lambda)
= \sum_{z \in (1:K)^N} p(y, z \mid \phi, \Theta, \Lambda).
$$
The goal is to compute the derivatives of this function for a fixed
observation sequence $y$ with respect to the parameters
$\phi,$ $\Theta,$ and $\Lambda.$

The direct summation is intractable because there are $K^N$ possible
values for the sequence $z.$  The forward algorithm uses dynamic
programming to compute the marginal likelihood in $\mathcal{O}(K^2
\cdot N)$.  The forward algorithm is neatly derived from the matrix
expression for the density,
$$
p(y \mid \phi, \theta, \Lambda)
= \phi^{\top} \cdot \textrm{diag}(\Lambda_1)
\cdot \Theta \cdot \textrm{diag}(\Lambda_2)
\cdots \Theta \cdot \textrm{diag}(\Lambda_N)
\cdot \textrm{1}
$$
where $\textrm{1} = \begin{bmatrix}1 & \cdots & 1\end{bmatrix}^{\top}$ is a
vector of ones of size $K$, and
$$
\textrm{diag}(\Lambda_n)
=
\begin{bmatrix}
\Lambda_{n, 1} & 0 & \cdots & 0
\\
0 & \Lambda_{n, 2} & \cdots & 0
\\
\vdots & \vdots & \ddots & \vdots
\\
0 & 0 & \cdots & \Lambda_{n, K}
\end{bmatrix}.
$$
The forward algorithm is traditionally defined in terms of
the forward vectors,
$$
\alpha_n
=
\begin{bmatrix}
\phi^{\top} \cdot \textrm{diag}(\Lambda_1)
& \cdots &
\Theta \cdot \textrm{diag}(\Lambda_n)
\end{bmatrix}^{\top},
$$
which are column vectors formed from the prefixes of the likelihood
function.  A final multiplication by $\textrm{1}$ yields a means to
compute the likelihood function.  This forward algorithm may be
automatically differentiated and the resulting derivative calculation
also takes $\mathcal{O}(K^2 \cdot N).$ But the constant factor and
memory usage is high, so it is more efficient to work out derivatives
analytically.

The backward algorithm defines the backward row vectors,
$$
\beta_n
=
\begin{bmatrix}
\Theta \cdot \textrm{diag}(\Lambda_n)
& \cdots &
\Theta \cdot \textrm{diag}(\Lambda_N)
\cdot \textrm{1}
\end{bmatrix}.
$$
The recursive form of the backward algorithm begins with $\beta_N,$
then defines $\beta_{n - 1}$ in terms of $\beta_N.$

The derivative of the HMM density can be rendered as a sum of terms
involving forward and backward variables by repeatedly applying the
chain rule to peel pairs of terms off of the product, resulting in a
sum
$$
\begin{array}{rcl}
\displaystyle
\frac{\partial}{\partial x} p(y \mid \phi, \Theta, \Lambda)
& = &
\displaystyle
\frac{\partial}{\partial x}
\phi^{\top} \cdot \textrm{diag}(\Lambda_1)
\cdot \Theta \cdot \textrm{diag}(\Lambda_2)
\cdots \Theta \cdot \textrm{diag}(\Lambda_N)
\cdot \textrm{1}
\\[8pt]
& = &
\displaystyle
\begin{array}[t]{l}
\displaystyle
\left( \frac{\partial}{\partial x}
\phi^{\top} \cdot \textrm{diag}(\Lambda_1) \right)
\cdot \Theta \cdot \textrm{diag}(\Lambda_2)
\cdots \Theta \cdot \textrm{diag}(\Lambda_N)
\cdot \textrm{1}
\\
{ } + \displaystyle \
\phi^{\top} \cdot \textrm{diag}(\Lambda_1)
\cdot
\left( \frac{\partial}{\partial x}
    \Theta \cdot \textrm{diag}(\Lambda_2)
    \cdots \Theta \cdot \textrm{diag}(\Lambda_N) \right)
\end{array}
\\[8pt]
& = & \hfill \vdots \hfill
\\[8pt]
& = &
\displaystyle
  \left( \frac{\partial}
       {\partial x}
  \phi^{\top} \cdot \textrm{diag}(\Lambda_1) \right)
  \cdot \beta_1
+ \sum_{n = 2}^{N}
    \alpha_{n - 1}^{\top}
    \cdot \left( \frac{\partial}
                      {\partial x}
                        \Theta \cdot \textrm{diag}(\Lambda_{n})
          \right)
    \cdot \beta_{n}
\end{array}
$$
involving the forward terms $\alpha,$ backward terms $\beta,$ and the
derivatives of the parameters $\phi, \Theta, \Lambda.$

To simplify the notation, let
$$
\mathcal{L} = p(y \mid \phi, \Theta, \Lambda).
$$
The derivative with
respect to the initial distribution $\phi$ is
$$
\frac{\partial}{\partial \phi} \mathcal{L}
=
\textrm{diag}(\Lambda_1) \cdot \beta_1.
$$

The derivative with respect to the initial emission density
$\Lambda_1$ is
$$
\frac{\partial}{\partial \Lambda_1} \mathcal{L}
=
\textrm{diag}(\phi) \cdot \beta_1.
$$
The derivative with respect to the emission density $\Lambda_n$ for
$n > 1$ is
$$
\frac{\partial}{\partial \Lambda_n} \mathcal{L}
=
\alpha_{n - 1}^{\top} \cdot \Theta \cdot \beta_{n}.
$$

The deriative with respect to the stochastic transition matrix
$\Theta$ is
$$
\frac{\partial}{\partial \Theta} \mathcal{L}
=
\sum_{n = 2}^N
\alpha_{n - 1}^{\top}
\cdot \textrm{diag}(\Lambda_{n})
\cdot \beta_{n}.
$$

## References {-}

The matrix form of the likelihood and forward-backward algorithm, as
well as the matrix derivatives are based on the presentaiton in [@qin:2000].