Skip to content

Commit

Permalink
Notes on algorithm
Browse files Browse the repository at this point in the history
  • Loading branch information
dillondaudert committed Sep 13, 2018
1 parent d32f9be commit 5172531
Showing 1 changed file with 71 additions and 0 deletions.
71 changes: 71 additions & 0 deletions docs/algorithm.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
# Overview
"UMAP uses local manifold approximations and patches together their fuzzy
simplicial set representations. This constructs a topological representation
of the data.

Given a low dimensional representation of the data, a similar process can be
used to construct an equivalent topological representation.

UMAP then optimizes the layout of the data representation in the low dimensional
space, minimizing the cross-entropy between the two topological
representations."

1. Approximate a manifold on which the data is assumed to lie
2. Construct a fuzzy simplicial set representation of the approximated manifold
3. Construct another fuzzy simplicial set representation of a low dimensional
manifold, `R^d`
4. Optimize the representation

## Estimating the manifold
Assuming the data points **x** in *X* are uniformly distributed on a manifold
*M* with respect to the Riemannian metric *g*, then a ball centered at **x_i**
that contains exactly the *k*-nearest neighbors of **x_i** should have a fixed
volume regardless of the choice of **x_i**.

From Lemma 1 of the paper, the geodesic distance of **x_i** to its neighbors
can be approximated by normalizing distances with respect to the distance to the
*k*-th nearest neighbor of **x_i**.

This results in a family of discrete metric spaces (one for each **x_i**) that
can be merged into a consistent global structure by converting the metric spaces
into fuzzy simplicial sets.

## Constructing fuzzy simplicial sets
"Each metric space in the family can be translated into a fuzzy simplicial set
via the fuzzy singular set functor, distilling the topological information
while still retaining metric information in the fuzzy structure. Ironing out
the incompatibilities of the resulting family of fuzzy simplicial sets can be
done by simply taking a (fuzzy) union across the entire family. The result is a
single fuzzy simplicial set which captures the relevant topological and
underlying metric structure of the manifold *M*."

Lemma 1 only defines distances between the chosen point **x_i** and its *k*
nearest neighbors; distances between other points **x_j**, **x_k** where i!=k
and i != j are not well-defined. Therefore, the distances between such points
in the space local to **x_i** are defined to be infinite.

The manifold is also assumed to be locally connected.

The above results in the fuzzy topological representation of a dataset:

**Definition** Let *X* = {**x_1**, ..., **x_N**} be a dataset in `R^d`. Let
{(*X*, d_i) | i = 1, ..., N} be a family of extended-pseudo-metric spaces with
common carrier set X such that

d_i(**x_j**, **x_k**) =
- d_*M*(**x_j**, **x_k**) - p if i = j or i = k,
- infinity otherwise

where p is the distance to the nearest neighbor of **x_i** and d_*M* is the
geodesic distance on the manifold *M*, either known *a priori*, or as
approximated per Lemma 1.

The **fuzzy topological representation of *X*** is

- UNION i=1:N FinSing((*X*, d_i)).

Dimensionality reduction can be performed by finding low dimensional
representations that closely match the topological structure of the data.

## Optimizing a low dimensional representation
...

0 comments on commit 5172531

Please sign in to comment.