-
Notifications
You must be signed in to change notification settings - Fork 17
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
d32f9be
commit 5172531
Showing
1 changed file
with
71 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
# Overview | ||
"UMAP uses local manifold approximations and patches together their fuzzy | ||
simplicial set representations. This constructs a topological representation | ||
of the data. | ||
|
||
Given a low dimensional representation of the data, a similar process can be | ||
used to construct an equivalent topological representation. | ||
|
||
UMAP then optimizes the layout of the data representation in the low dimensional | ||
space, minimizing the cross-entropy between the two topological | ||
representations." | ||
|
||
1. Approximate a manifold on which the data is assumed to lie | ||
2. Construct a fuzzy simplicial set representation of the approximated manifold | ||
3. Construct another fuzzy simplicial set representation of a low dimensional | ||
manifold, `R^d` | ||
4. Optimize the representation | ||
|
||
## Estimating the manifold | ||
Assuming the data points **x** in *X* are uniformly distributed on a manifold | ||
*M* with respect to the Riemannian metric *g*, then a ball centered at **x_i** | ||
that contains exactly the *k*-nearest neighbors of **x_i** should have a fixed | ||
volume regardless of the choice of **x_i**. | ||
|
||
From Lemma 1 of the paper, the geodesic distance of **x_i** to its neighbors | ||
can be approximated by normalizing distances with respect to the distance to the | ||
*k*-th nearest neighbor of **x_i**. | ||
|
||
This results in a family of discrete metric spaces (one for each **x_i**) that | ||
can be merged into a consistent global structure by converting the metric spaces | ||
into fuzzy simplicial sets. | ||
|
||
## Constructing fuzzy simplicial sets | ||
"Each metric space in the family can be translated into a fuzzy simplicial set | ||
via the fuzzy singular set functor, distilling the topological information | ||
while still retaining metric information in the fuzzy structure. Ironing out | ||
the incompatibilities of the resulting family of fuzzy simplicial sets can be | ||
done by simply taking a (fuzzy) union across the entire family. The result is a | ||
single fuzzy simplicial set which captures the relevant topological and | ||
underlying metric structure of the manifold *M*." | ||
|
||
Lemma 1 only defines distances between the chosen point **x_i** and its *k* | ||
nearest neighbors; distances between other points **x_j**, **x_k** where i!=k | ||
and i != j are not well-defined. Therefore, the distances between such points | ||
in the space local to **x_i** are defined to be infinite. | ||
|
||
The manifold is also assumed to be locally connected. | ||
|
||
The above results in the fuzzy topological representation of a dataset: | ||
|
||
**Definition** Let *X* = {**x_1**, ..., **x_N**} be a dataset in `R^d`. Let | ||
{(*X*, d_i) | i = 1, ..., N} be a family of extended-pseudo-metric spaces with | ||
common carrier set X such that | ||
|
||
d_i(**x_j**, **x_k**) = | ||
- d_*M*(**x_j**, **x_k**) - p if i = j or i = k, | ||
- infinity otherwise | ||
|
||
where p is the distance to the nearest neighbor of **x_i** and d_*M* is the | ||
geodesic distance on the manifold *M*, either known *a priori*, or as | ||
approximated per Lemma 1. | ||
|
||
The **fuzzy topological representation of *X*** is | ||
|
||
- UNION i=1:N FinSing((*X*, d_i)). | ||
|
||
Dimensionality reduction can be performed by finding low dimensional | ||
representations that closely match the topological structure of the data. | ||
|
||
## Optimizing a low dimensional representation | ||
... |