Supervised UMAP using already projected points as regression #712

theahura · 2021-06-29T19:58:22Z

Hey Leland,

Thanks for this great library, and for being so responsive with issues.

Question: Is it possible to train a (semi-)supervised UMAP model where some of the projections are already known/provided, as a regression task? This is in contrast to using labels for supervision, which are categorical.

To provide an example, imagine I had a set of 100 embeddings. 50 of those embeddings have 2d coordinates associated with them; the other 50 do not. I want to be able to train a UMAP model on the 50 embeddings with 2d coordinates, and then run inference on the other 50 (or do a semi-supervised training and run on all of them at the same time).

If this isn't feasible with UMAP, do you know of any other algos/models that might be a good fit for what I'm suggesting (besides deep learning of course)?

Thanks! Amol

lmcinnes · 2021-07-05T17:16:42Z

There are two ways you can view this. The 2d coordinates could be viewed merely as target values, like labels but numeric instead of categorical. This can actually be handled just fine by setting target_metric="euclidean" and passing the coordinates in as a y value to fit. I don't think this is what you mean however.
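A minimal sketch of that first option, assuming the `umap-learn` package is installed. The toy data, the helper names `make_toy_split` and `fit_with_coordinate_targets`, and the `random_state` value are illustrative assumptions, not part of the discussion. The known 2d coordinates act only as numeric targets, so the fitted embedding is not expected to reproduce them.

```python
import numpy as np


def make_toy_split(n_known=50, n_unknown=50, n_features=16, seed=0):
    """Synthetic stand-in for the 100 embeddings described in the question."""
    rng = np.random.default_rng(seed)
    X_known = rng.normal(size=(n_known, n_features))
    X_unknown = rng.normal(size=(n_unknown, n_features))
    # Hypothetical known 2d coordinates for the first group only.
    coords_known = rng.uniform(-1.0, 1.0, size=(n_known, 2))
    return X_known, coords_known, X_unknown


def fit_with_coordinate_targets(X_known, coords_known, X_unknown):
    """Fit UMAP with numeric targets, then embed the points that have none."""
    import umap  # third-party: pip install umap-learn

    mapper = umap.UMAP(
        n_components=2,
        target_metric="euclidean",  # treat y as numeric rather than categorical
        random_state=42,
    ).fit(X_known, y=coords_known)
    # The targets only influence the neighbor graph; mapper.embedding_ is
    # not guaranteed to match coords_known.
    return mapper.embedding_, mapper.transform(X_unknown)
```

Calling `fit_with_coordinate_targets(*make_toy_split())` would return one embedding for the 50 points with targets and one for the 50 points without.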

Instead it sounds more like you want to fix some of the points to given locations in the embedding and then fit the rest of the points around that. This is actually being worked on right now. See #606 and #620 for discussion and work on that. There are some catches in exactly how to do this so I think your input into this would be most welcome.

theahura · 2021-07-13T15:42:32Z

Thanks for the response, and sorry for my delayed post.

You're right that fixing points is closer to what I need. But I can play around with the first suggestion, using a euclidean target_metric. Does that support semi-supervised cases? In the documentation (https://umap-learn.readthedocs.io/en/latest/supervised.html) you describe using -1 as a 'masked' value. Makes sense for categorical data, but how does that work for numerical data?

lmcinnes · 2021-07-19T15:43:37Z

The semi-supervised case is going to be an issue for other target_metrics since they won't specially handle "masked" values. In principle you can write your own custom metric that has special handling for masked values. I suspect in the long run this is going to really be doing what you want however.
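As a rough illustration of what "special handling for masked values" could look like for numeric targets, here is a plain-Python sketch. The sentinel `MASK`, the `unknown_dist` default, and the function name are all assumptions made for this example. It is not wired into UMAP here: UMAP's documentation asks for Numba-compiled custom metrics, and input validation may reject some sentinel choices (NaN in particular), so a finite sentinel is used.

```python
import math
from typing import Sequence

MASK = -1.0e6  # hypothetical sentinel marking an unknown target coordinate


def masked_euclidean(
    a: Sequence[float],
    b: Sequence[float],
    mask: float = MASK,
    unknown_dist: float = 1.0,
) -> float:
    """Euclidean distance that returns a fixed value when a target is masked."""
    # If either target row carries the sentinel, the true distance is unknown,
    # so fall back to a constant "don't know" distance (an assumed choice).
    if any(v == mask for v in a) or any(v == mask for v in b):
        return unknown_dist
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

The right value for `unknown_dist` depends on the scale of the target coordinates, so it would need tuning per dataset.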

theahura · 2021-07-19T19:25:31Z

If checks pass on #620 is that sufficient to merge into the library? Or do you feel there is more to be done there? I had a look over the code in that PR, though admittedly a lot of the math flew over my head.

lmcinnes · 2021-07-19T21:55:56Z

There are some slightly more philosophical and API issues that need to get worked out before it gets merged, but you can certainly just check out the branch from the PR and use it -- it works; it just requires a little care from the user or unexpected results may occur.

kruus · 2021-08-11T17:19:23Z

To expand on @lmcinnes's API/philosophy comment: I started using #620, but am now favoring a more flexible API style, inspired by the PyMDE constraints objects.

Passing UMAP constraint objects also can avoid some jit-related code duplication in #620. So one cleaner API might just add an optional constraint parameter, where constraint objects have a few standard project_foo functions. (The constraint parameter might end up being a dict, indicating how/when the project_foo functions of several constraints get called: do they operate on gradients, or individual embedding mods, or post-epoch point-cloud, ...?) Like PyMDE, we'd supply a couple of handy constraints to get folks up and running.

In a first example, I've used a pin_mask (uggh, will change!) to anchor "my selected anchors" at left/right best/worst x limits, leaving y free. (This is not exactly the "spring force" that @theahura seems to want). A pre-alpha, simple working constraint ensures that all non-anchors remain between my best/worst x limits.
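To make the constraint-object idea concrete, here is a toy sketch in plain Python. Every name in it (`PinX`, `ClampX`, `project_gradient`, `project_point`, `constrained_step`) is hypothetical and chosen for illustration; it is not the code from #620, not PyMDE's API, and not UMAP's layout loop. It only shows one way constraints could filter a gradient and then the moved point, with anchors fixed in x and free in y.

```python
from dataclasses import dataclass
from typing import List, Set

Point = List[float]


@dataclass
class PinX:
    """Hypothetical constraint: anchors keep their x coordinate; y stays free."""
    anchors: Set[int]

    def project_gradient(self, index: int, grad: Point) -> Point:
        if index in self.anchors:
            return [0.0] + list(grad[1:])  # zero only the x component
        return list(grad)

    def project_point(self, index: int, point: Point) -> Point:
        return list(point)


@dataclass
class ClampX:
    """Hypothetical constraint: non-anchors stay within [x_min, x_max]."""
    x_min: float
    x_max: float
    anchors: Set[int]

    def project_gradient(self, index: int, grad: Point) -> Point:
        return list(grad)

    def project_point(self, index: int, point: Point) -> Point:
        if index in self.anchors:
            return list(point)
        x = min(max(point[0], self.x_min), self.x_max)
        return [x] + list(point[1:])


def constrained_step(embedding, grads, lr, constraints):
    """One toy update: constraints filter the gradient, then the moved point."""
    updated = []
    for i, (point, grad) in enumerate(zip(embedding, grads)):
        for c in constraints:
            grad = c.project_gradient(i, grad)
        moved = [p - lr * g for p, g in zip(point, grad)]
        for c in constraints:
            moved = c.project_point(i, moved)
        updated.append(moved)
    return updated
```

With anchors `{0, 1}` at x = 0 and x = 10, a step that would push point 2 to x = 14 leaves it at x = 10, while the anchors move only in y.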

BTW, another issue (even in #620) is that the UMAP init phase applies dimension-wise scale factors even when init is an ndarray.

If I init with a previous embedding ndarray, for example, with carefully set "euclidean" distance, I don't like UMAP rescaling as if I suddenly wanted a weighted-euclidean embedding for the init!
If I want to anchor some points in my init ndarray, and use gradient masking in a constraint, they do end up anchored ... but "somewhere else". (Is this historical cruft? Didn't spectral init just do its own rescaling?)
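The effect of a dimension-wise rescale on a supplied init can be seen in a small standalone sketch. It assumes the rescaling is a per-dimension min-max onto a fixed range; the exact form and the `target` value of 10.0 are assumptions for illustration, not a quote of UMAP's code.

```python
from typing import List

Point = List[float]


def rescale_per_dimension(points: List[Point], target: float = 10.0) -> List[Point]:
    """Min-max rescale each dimension independently onto [0, target]."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    return [
        [target * (p[d] - lo[d]) / (hi[d] - lo[d]) for d in dims]
        for p in points
    ]


# An init that is 4 units wide and 1 unit tall.
init = [[0.0, 0.0], [4.0, 0.0], [0.0, 1.0]]
rescaled = rescale_per_dimension(init)
# The 4:1 aspect ratio becomes 1:1, and the point meant to sit at x = 4
# ends up at x = 10.
```

Because each axis gets its own scale factor, relative euclidean distances in the init are not preserved, which matches the "anchored, but somewhere else" behavior described above.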

Now, when constraints come into play, init becomes a little more complex. I'm thinking some types of constraint objects may add a function that can approximately satisfy a constraint. Some constraints that put more load on layouts.py iterations might reasonably provide some rotate/(scale?)/translate of init data to ease the load.

theahura changed the title ~~Supervised UMAP using already projected points~~ Supervised UMAP using already projected points as regression Jun 29, 2021
