
add transform data functionality #26

Merged
11 commits merged into dillondaudert:master on May 14, 2020

Conversation

sanjmohan
Contributor

This adds functionality to transform data with UMAP with respect to previously fit data.

See the README additions for usage overview.

First, a skeleton is built for the manifold containing both the old fit data and the new data to transform. This yields a graph whose first R rows/columns correspond to the old fit data (for which an embedding has already been obtained) and whose remaining rows/columns correspond to the new data to be transformed. This uses the preexisting fuzzy_simplicial_set function as-is.

Next, an embedding for the data to transform is initialized as a weighted mean of points from the reference embedding. This is the new method overload for the initialize_embedding function. For comparison, here is the Python implementation (the weights are obtained from the graph entries, which you can see here).
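For illustration, here is a minimal sketch of that weighted-mean initialization (names and shapes are illustrative; this is not the exact code in this PR):

```julia
# Each query point starts at the weighted mean of the reference embedding's points,
# with weights taken from that query's column of the membership graph.
# Assumed shapes: `graph` is n_ref x n_query, `ref_embedding` is n_dims x n_ref.
function init_query_embedding(graph::AbstractMatrix{T},
                              ref_embedding::AbstractMatrix{T}) where {T<:AbstractFloat}
    return [ref_embedding * col ./ (sum(col) + eps(T)) for col in eachcol(graph)]
end
```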

Finally, the optimize_embedding function operates as before, but only computes attractions/repulsions between the new embedding and the reference embedding. For the Python version, see here. They use head to refer to the new embedding, which I designated with query_inds, and tail to refer to the reference embedding, which I designated with ref_inds.

When using this via a high level umap call, the reference (fit) data is assumed to be the first R samples of the input data. However, the optimize_embedding function can refer to arbitrary indices of the input graph / input embedding as the reference data, and arbitrary indices as the query data (data to be optimized/transformed). These indices don't necessarily have to cover all indices of the graph/embedding (the remaining entries will be ignored entirely). They can also overlap (which means that data may be optimized by computing attractions/repulsions to other data that is also intended to be optimized).
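To make the query/reference split concrete, here is a heavily simplified sketch of such a restricted optimization loop. The update rules are placeholders rather than UMAP's actual gradient expressions, and all names are illustrative:

```julia
# Only points indexed by query_inds are moved. Attractions come from graph entries
# linking a query point to a reference point; repulsions come from randomly sampled
# reference points. Placeholder update rules, not UMAP's real gradients.
function optimize_queries!(embedding::AbstractVector{<:AbstractVector{<:AbstractFloat}},
                           graph::AbstractMatrix,
                           query_inds, ref_inds, n_epochs;
                           lr=1.0, neg_sample_rate=5)
    for _ in 1:n_epochs
        for qi in query_inds, ri in ref_inds
            w = graph[ri, qi]
            w > 0 || continue                                       # not neighbors
            # attraction: pull the query point toward its reference neighbor
            embedding[qi] .+= lr * w .* (embedding[ri] .- embedding[qi])
            # repulsion: push the query point away from a few random reference points
            for _ in 1:neg_sample_rate
                rk = rand(ref_inds)
                diff = embedding[qi] .- embedding[rk]
                embedding[qi] .+= (lr * 0.01 / (sum(abs2, diff) + 1e-6)) .* diff
            end
        end
    end
    return embedding
end
```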

I also added a few basic tests to cover these new features.

Let me know what you think about this addition, or if there are other parts I can clarify (either in this PR or documentation)!

Sanjay Mohan added 3 commits April 24, 2020 18:02
…ct to a reference embedding

- add new initialize_embedding method
- add new optimize_embedding method and refactor old method to use new one
- add corresponding tests
- exposes initialization and optimization w.r.t. reference embedding
- add corresponding unit test
@sanjmohan
Contributor Author

I also considered a different approach to the high-level API: letting the UMAP_ struct hold all the parameters from its fit call and reuse those exact parameters to transform new data. The struct would have to hold the values of (or a reference to) the original data along with the original embedding, since a new graph encompassing both old and new data must be built, plus the other hyperparameters. This would be closer to the Python OOP approach that the paper's author has written. It would require more refactoring of the code and might be more restrictive in terms of the functionality it exposes, but it might also simplify the API. I went with the approach here since it has a smaller footprint on the existing code, but I am willing to reconsider :)

@codecov-io

codecov-io commented Apr 25, 2020

Codecov Report

Merging #26 into master will increase coverage by 0.76%.
The diff coverage is 96.15%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #26      +/-   ##
==========================================
+ Coverage   89.36%   90.12%   +0.76%     
==========================================
  Files           4        4              
  Lines         141      162      +21     
==========================================
+ Hits          126      146      +20     
- Misses         15       16       +1     
Impacted Files      Coverage Δ
src/umap_.jl        94.59% <91.66%> (-0.72%) ⬇️
src/embeddings.jl   81.66% <100.00%> (+4.11%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6bd0821...270626e.

@dillondaudert
Owner

Thanks for the PR, this looks great! I will give it a review in the next couple days.

@sanjmohan
Contributor Author

fixed docstring typos

@dillondaudert
Owner

dillondaudert commented Apr 27, 2020

I took an initial look through and this seems mostly on the right track. I need to spend more time with this, so I'll have a more thorough review forthcoming.

There is one potential issue here when optimizing a new embedding with respect to the reference. Since the knn graph is constructed from the union of the reference and query data, it leaves open the possibility that some query point has no neighbors (either incoming or outgoing) in the reference data. During optimization, this means that its embedding will never be updated (the condition at line 155 in embeddings.jl will always be false.)

I wonder if you've thought about this and how the Python implementation handles it, if it does.

@sanjmohan
Contributor Author

sanjmohan commented Apr 27, 2020

Ah, you're right, I didn't consider this. It looks like the Python implementation may actually build the graph by only searching for nearest neighbors in the reference data (searching for neighbors here, which uses this, which computes distances between the original (reference) data and the query data here).

I'll look into this more and update the PR within a few days.

@sanjmohan sanjmohan changed the title add transform data functionality [WIP] add transform data functionality Apr 28, 2020
@sanjmohan
Contributor Author

sanjmohan commented Apr 28, 2020

As I look through the Python implementation and the codebase here and in NearestNeighborDescent.jl, I'm having a bit of a change of heart about the API. I'm now thinking that what may make the most sense is this usage:

# easy option: quick data --> embedding (the current api)
embedding = umap(X, n_components, kwargs...)

# and more exposed option, allowing transforms of new data into preexisting embedded space:
model = UMAP_(X, n_components, kwargs...)
embedding = model.embedding
test_embedding = umap_transform(Xtest, model, fewer_kwargs...)

The upsides are:

  • We can then reuse the ApproximateKNNGraph from the original embedding, using the search function from NearestNeighborDescent.jl to find neighbors of the query data in the original dataset. The Python implementation does some partial approximate-knn-graph-construction sort of computation here and here using the original data, but it seems to me that one of the only differences is that the number of neighbors isn't restricted to a precomputed neighbor graph, and they enforce using the same number of neighbors anyways.
  • We can then restrict the options/kwargs in the transform API and fix the rest to those used for the original manifold construction and embedding (tracked by the UMAP_ struct).

The only drawback of this from the user's POV is that you can't transform new data w.r.t. a dataset and embedding without its neighbor graph (e.g. if they were obtained externally), but I feel like this won't be a use case for most people, and even if it is, recomputing the neighbor graph of the data won't be prohibitive in most cases.

The main changes, then, are exposing the neighbor graph made by the knn_search functions to the UMAP_ struct, and then supporting use of the UMAP_ struct more than before.

I will update the PR with this version soon but let me know if you have any objections to this approach! :)

@dillondaudert
Owner

I agree that we'll need to store the state from the initial fit procedure in order to use it for the transform.

This will require some changes to the UMAP_ struct, in particular I think we'll need to add indices and distances from the knn_search function as struct fields (so this call will need to be pulled out of fuzzy_simplicial_set). I don't think we should store the KNNGraph directly, since sometimes we calculate neighbors using an exact method, and we can reconstruct the knn graph using these fields later.

We'll also need to add the initial data as a field so we can calculate the distances during the transform.

Then there will need to be a new method of knn_search that reconstructs the KNNGraph, something like HeapKNNGraph(data, metric, indices, distances), and calls the search function to get the inds/dists for the new data to be transformed: query_inds, query_dists = search(knngraph, queries, n_neighbors).

So I'm alright with a bit of refactoring as needed, especially in the UMAP_ struct since that was never user-facing anyway. I have an idea for how to support the case where the user passes precomputed distances, but you can ignore that for now. Let me know if you have any questions as you're working through this.
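For concreteness, here is a minimal sketch of such a knn_search method, assuming the HeapKNNGraph constructor and search signatures described in the comment above (they are taken from the text and not verified against NearestNeighborDescent.jl):

```julia
using NearestNeighborDescent: HeapKNNGraph, search

# Rebuild the approximate knn graph from the indices/distances stored on the
# UMAP_ struct, then query it for each new point's nearest neighbors.
function knn_search(data::AbstractMatrix,
                    queries::AbstractMatrix,
                    n_neighbors::Integer,
                    metric,
                    knns::AbstractMatrix{<:Integer},
                    dists::AbstractMatrix)
    knngraph = HeapKNNGraph(data, metric, knns, dists)
    return search(knngraph, queries, n_neighbors)   # -> (query_inds, query_dists)
end
```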

Sanjay Mohan added 4 commits April 28, 2020 23:38
- add umap_transform function
- refactor fuzzy_simplicial_set
- to search for nearest neighbors for query data
- add knn_search method to reconstruct ApproximateKNNGraph
@sanjmohan sanjmohan changed the title [WIP] add transform data functionality add transform data functionality Apr 30, 2020
@sanjmohan
Contributor Author

sanjmohan commented Apr 30, 2020

I've gone through and implemented the things you suggested, and changed the transform function (now called umap_transform) to work with a new UMAP_ struct. The new (way simpler) usage info is in the new README. Let me know what you think!

One thing I discovered, that may need to factor into this PR or maybe a next PR:

It seems like the Python implementation has been optimized a lot recently. I ran the notebook in the repo, and according to the README, Python umap used to be a bit slower than this package. Now, this package has similar timing to the README (~150 sec) while the Python umap runs in only ~50 seconds on my laptop. It seems like one of the major areas of optimization was their improvement of the nearest neighbor search, which can be looked at later.

The side effect, though, is that I compared the Python functionality for transforming new data to this PR, and got the following results on MNIST test data after fitting to the training data:

Julia:
[embedding plot of the transformed MNIST test data]

Python:
[embedding plot of the transformed MNIST test data]

So using the hyperparameters etc. in this PR, by an eye check it seems like the Python transform is more accurate. It also runs in about 10 seconds while the Julia function takes ~25 seconds, so I was not sure how to compare results while keeping a similar level of nearest neighbor approximation complexity, since the Python version has a faster baseline as of now. Increasing the number of neighbors used in the ApproximateKNNGraph made the Julia transform plot more closely resemble the Python plot, while taking a non-negligible amount of time longer.

@dillondaudert
Owner

Hi - thanks for the update, I'll take another look over the weekend!

You're right about pynndescent being faster now than when I created the README; my impression was that it was mostly due to parallelization. I looked into adding the same to NNDescent.jl, but I remember I wasn't satisfied with the approach they took.

Owner

@dillondaudert dillondaudert left a comment

At a high level, this looks pretty good. I've made comments for things I think should be changed, and I'm willing to discuss any of them in more detail.

With regards to the type instability I pointed out for transform, if you've got an idea for how to fix it, go ahead and implement it (I recommend adding a test using Test.@inferred). Otherwise we can leave it for when I do a more thorough refactoring of UMAP_ and the type annotations more generally.

I would like one thing: could you do some benchmarking comparing this PR to master for just the umap procedure? Just to make sure there are no performance regressions.

src/umap_.jl Outdated
function UMAP_{S, M, N}(graph, embedding) where {S<:Real,
                                                 M<:AbstractMatrix{S},
                                                 N<:AbstractMatrix{S}}
    issymmetric(graph) || isapprox(graph, graph') || error("UMAP_ constructor expected graph to be a symmetric matrix")
    new(graph, embedding)
    UMAP_{S, M, N}(graph, embedding, Matrix(undef, 0, 0), Matrix{Int}(undef, 0, 0), Matrix{S}(undef, 0, 0))
Owner

I don't think it makes sense to allow constructing a UMAP_ struct without all of the required fields, so we don't need these constructors. In other words, I think there should be two general cases: inner constructor(s) that require all the fields needed to represent a "complete" struct, and the outer constructor(s) which build the struct by running the algorithm.
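A minimal sketch of that split, with field names taken from the discussion in this thread (the merged code's exact names, type parameters, and field types may differ; fields are left abstractly typed here for brevity):

```julia
using LinearAlgebra: issymmetric

struct UMAP_{S<:Real}
    graph::AbstractMatrix{S}         # fuzzy simplicial set over the fitted data
    embedding::AbstractMatrix{S}     # embedding of the fitted data
    data::AbstractMatrix             # original data, kept so transforms can compute distances
    knns::AbstractMatrix{<:Integer}  # knn indices from knn_search
    dists::AbstractMatrix            # knn distances from knn_search

    # Inner constructor: only accepts a fully populated, validated struct.
    function UMAP_(graph::AbstractMatrix{S}, embedding, data, knns, dists) where {S<:Real}
        issymmetric(graph) || isapprox(graph, graph') ||
            error("UMAP_ constructor expected graph to be a symmetric matrix")
        return new{S}(graph, embedding, data, knns, dists)
    end
end

# Outer constructors would build the struct by running the fitting algorithm,
# e.g. (signature assumed) UMAP_(X, n_components; kwargs...).
```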

Contributor Author

One thing I was trying to do was leave the old constructors, since UMAP_ had been exported before, so these changes wouldn't break old code. But since the struct didn't have much of a use before, this may not be a big concern.

The other thing I was considering: I have had use cases where I have data and its corresponding embedding, but the embedding was not computed by this package. I then wanted some transform-style functionality, and recomputing the graph of the old data would be redundant. Constructing a partially defined struct would let a user wrap data and its embedding manually, and use these for transforming new data.
(As is, this wouldn't save a whole lot of computation since the current knn_search would recompute the ApproximateKNNGraph for large matrices anyways, but that is an internal detail that could change later.) Let me know your thoughts on this!

Owner

Maybe you could give me an example of the workflow when you'd need a partial struct? I'm not sure I understand the use case entirely.

I wouldn't worry too much about "breaking" API changes - since this package is pre-1.0, any minor version bump technically permits breaking changes. There's still a ways to go before a 1.0 release is ready and the API could be made stable.

Contributor Author

@sanjmohan sanjmohan May 4, 2020

Ok, I won't worry about the API changes 👍

The use case was that a third party published data and its embedding, and I had some related data of interest that I wanted to embed along with the published data/embedding. Reconsidering this (and since we are changing the struct API anyways), I think a separate transform method would be much better than the partial struct, e.g. one with a signature like transform(X, embedding, Q; <kwargs>).

But this use case is much more unusual than the typical workflow, so it's definitely not an important addition; if you don't think it's important then I'll leave it out.

src/utils.jl Outdated
                    metric,
                    ::Val{:pairwise}) where {S <: Real}
    num_points = size(X, 2)
    dist_mat = Array{S}(undef, num_points, num_points)
    if S <: AbstractFloat
Owner

What use-case is this additional logic needed for?

Initialize an embedding of points corresponding to the columns of the `graph`, by taking weighted means of
the columns of `ref_embedding`, where weights are values from the rows of the `graph`.

The resulting embedding will have shape `(size(ref_embedding, 1), length(query_inds))`, where `size(ref_embedding, 1)`
Owner

Should replace length(query_inds) with size(graph, 2), since query_inds isn't an argument that's passed to this. Also note that the returned value of this function isn't a matrix, but a vector of vectors (intentionally).

Contributor Author

Ah, my bad, I forgot to update this documentation.

samples in the resulting embedding. Its elements will have type T.
"""
function initialize_embedding(graph::AbstractMatrix{<:Real}, ref_embedding::AbstractMatrix{T}) where {T<:AbstractFloat}
embed = [zeros(T, size(ref_embedding, 1)) for _ in 1:size(graph, 2)]
Owner

You never use these zero vectors, so this is a bit unnecessary. To create an array with the correct element type, you could do something like embed = Vector{Vector{T}}()

"""
function initialize_embedding(graph::AbstractMatrix{<:Real}, ref_embedding::AbstractMatrix{T}) where {T<:AbstractFloat}
embed = [zeros(T, size(ref_embedding, 1)) for _ in 1:size(graph, 2)]
for (i, col) in enumerate(eachcol(graph))
Owner

You can simplify this expression, since this is just matrix multiplication:

embed = collect(eachcol((ref_embedding * graph) ./ (sum(graph, dims=1) .+ eps(T))))

Double check it, but it should be equivalent. Also, you'll want eps(T) over eps(zero(T)).
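A quick way to sanity-check that equivalence on random data (the loop form below is a reconstruction of the idea, not the exact code under review):

```julia
T = Float64
ref_embedding = rand(T, 2, 5)   # 2-d embedding of 5 reference points
graph = rand(T, 5, 3)           # membership weights: 5 reference rows x 3 query columns

# per-column weighted means (reconstructed loop form)
loop_form = [ref_embedding * col ./ (sum(col) + eps(T)) for col in eachcol(graph)]
# suggested vectorized form
matmul_form = collect(eachcol((ref_embedding * graph) ./ (sum(graph, dims=1) .+ eps(T))))

@assert all(l ≈ m for (l, m) in zip(loop_form, matmul_form))
```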

Contributor Author

good catch

- `initial_alpha`: the initial learning rate
- `gamma`: the repulsive strength of negative samples
- `neg_sample_rate::Integer`: the number of negative samples per positive sample
"""
function optimize_embedding(graph,
embedding,
n_epochs,
embedding::AbstractVector{<:AbstractVector{<:AbstractFloat}},
Owner

I'm not sure we need these type constraints. I actually think this package is overly type constrained (that's mostly my doing), when it could be much more generic.

Contributor Author

So one issue was that, with optimize_embedding methods taking both one embedding argument and two (query and ref), calls would not be dispatched properly because of the default arguments for _a and _b, and I was resolving this with type constraints. I can remove the default arguments or remove the overloaded methods instead, if you have a preference for either of those?

Contributor Author

If you have no preference, I will just remove the original optimize_embedding method so all calls will have to pass two embeddings (or the same one twice); I'm starting to think the overloading is more unnecessary/messy than good.

@sanjmohan sanjmohan changed the title add transform data functionality [WIP] add transform data functionality May 4, 2020
- add parameterization to UMAP_ struct
- remove partial UMAP_ struct constructors
- clean knn_search code
- add type safety to initialize_embedding
- switch model and Q in transform API
@sanjmohan
Contributor Author

sanjmohan commented May 4, 2020

I believe I addressed all of your comments and made the changes. The major ones were:

  • adding type parameterization to the UMAP_ struct
  • getting rid of the old optimize_embedding method, so the only one is the one with both reference and query embedding arguments
  • cleaning up the initialize_embedding function with your suggestions
  • cleaning up the knn_search methods

The transform function should be type stable now from beginning to end, returning an embedding of the same type as the embedding in the model argument.

Here are the relative benchmarks:

MASTER
[benchmark screenshot]

NEW BRANCH
[benchmark screenshot]

(the data is the typical mnist data, (28*28, 60k) for mnist_x and (28*28, 10k) for mnist_xtest.)

Please let me know if I missed anything 😃

@sanjmohan sanjmohan changed the title [WIP] add transform data functionality add transform data functionality May 4, 2020
@dillondaudert
Owner

Thanks for the quick work! I'll aim to get a final review by the weekend, and hopefully we can get this merged soon after that.

Owner

@dillondaudert dillondaudert left a comment

With the exception of the small change to the README, this looks good to me!

It'll be good to create a new release with this functionality... There are some improvements and QoL changes this package really needs that I'd like to implement beforehand, though.

@sanjmohan
Contributor Author

Cool, if you'd like to open some issues I can help out!

@dillondaudert dillondaudert merged commit d4d854d into dillondaudert:master May 14, 2020
kragol pushed a commit to kragol/UMAP.jl that referenced this pull request Sep 15, 2020