Landmarked Re-Trainable Parametric UMAP #1153
Conversation
Hello @jacobgolding! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: …
Comment last updated at 2024-10-19 13:22:48 UTC
This looks super-exciting. I'm not going to have time to look through it until later next week, but please don't mistake my slowness for a lack of enthusiasm.
This looks awesome! I am on my phone at the moment and can't read through it, but here is some quick feedback. I wonder if it's possible to merge with this PR (which I should have integrated months ago, sorry). That PR was based on fchollet's update to support the new keras; the update didn't fully support PyTorch, so the PR was written to add support. I agree about the awkward method of handling epochs and would be interested in any ideas about how to do it better. (We also talked about this in the fchollet PR.) The reason epochs are weird is that each edge is sampled a different number of times (up to 200, I think) per epoch based on the edge weight. One thought is that we could create an iterator that samples from edges with a probability, rather than a fixed number of times per epoch.
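A rough sketch of that iterator idea (illustrative names and shapes, not the library's actual data pipeline): draw each batch of edges with probability proportional to edge weight, instead of repeating each edge a weight-dependent number of times per epoch.

```python
import numpy as np

def edge_batch_iterator(heads, tails, weights, batch_size, seed=0):
    """Yield (head, tail) index batches, sampling each edge with
    probability proportional to its weight rather than repeating it
    a fixed number of times per epoch."""
    rng = np.random.default_rng(seed)
    probs = weights / weights.sum()
    while True:
        idx = rng.choice(len(weights), size=batch_size, p=probs)
        yield heads[idx], tails[idx]
```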
The black changes are all fine. It's probably just a case of accumulated instances of black not getting run on files that got committed. I reviewed them all, and all is well, so we may as well leave them in here and get that all cleaned up.
I left some minor comments in docs. The rest looks good to me, especially since we had discussed the landmark loss setup already.
I do think it makes sense to have arbitrary keywords pass through on `_fit_embed_data`, especially if it cleans things up from your perspective. It is something that would also allow future extensions more easily, and costs very little in terms of complexity/changes now.
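A minimal sketch of that pass-through, with hypothetical, simplified signatures (the real `UMAP.fit()` has many more parameters than shown here):

```python
# Hypothetical sketch: keyword arguments flow from fit() down to
# _fit_embed_data(), so subclasses can add options without the base
# UMAP class having to know about them.
class UMAP:
    def fit(self, X, y=None, **fit_kwargs):
        # ... neighbor graph construction elided ...
        self.embedding_ = self._fit_embed_data(X, **fit_kwargs)
        return self

    def _fit_embed_data(self, X, **fit_kwargs):
        ...  # base embedding optimization, ignores extras


class ParametricUMAP(UMAP):
    def _fit_embed_data(self, X, landmark_positions=None, **fit_kwargs):
        ...  # consumes landmark_positions without changes to UMAP.fit()
```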
As for tests ... in an ideal world we would have some. At least a test ensuring that (re-)training with landmark loss actually runs. Ensuring that it "does what we expect" is harder. Maybe a very simple case with, say, pendigits and a leave-one-out class, simply verifying that it largely forms a cluster by itself? I don't think that kind of test is necessary for this PR right now.
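A smoke test along those lines might look like the following sketch. The `landmark_positions` format (an array aligned with the data, with `np.nan` rows for points that are free to move), the `n_training_epochs` argument, and the 2-D default output are all assumptions about this PR's API:

```python
import numpy as np

def test_landmark_retraining_runs():
    # Smoke test: re-fitting with landmark positions should run end to end.
    from umap.parametric_umap import ParametricUMAP

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4)).astype("float32")

    model = ParametricUMAP(n_training_epochs=1)
    first = model.fit_transform(X)

    # Pin the first 10 points to their old coordinates; leave the rest free.
    landmarks = np.full((100, 2), np.nan)
    landmarks[:10] = first[:10]

    second = model.fit_transform(X, landmark_positions=landmarks)
    assert second.shape == (100, 2)
```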
As for getting Tim's pending changes and yours synced up... is there any chance we can merge this now, and get your changes merged over that later, Tim? I think there is also some other work on dataset handling that other people are working on.
typo in doc string. Co-authored-by: Leland McInnes <[email protected]>
… by .fit() and .fit_transform()
Thanks Leland and Tim. Glad to hear it looks OK, and good to confirm that the black changes are good to go. I'm not fussed in terms of implementing this first or the keras backend one; happy either way. There are some changes that will have to be made in … I'm happy to write up a small test at some point in the future; I've got some pretty lightweight ones looking at classification accuracy on the pre/post re-trained embeddings. I guess it depends on whether it's cheaper to do a classifier or clustering in the testing pipeline. I had a go at the arbitrary kwarg passing to … RE training epochs, I think the only concrete suggestion I have is to look at changing the names of some of the attributes? When I see …
LGTM. Do you have a strong need to get your changes merged in first, Tim, or can we merge this now and deal with the dataset changes later once you have the time to work through all the details?
Jacob, this looks great! Very minor comments on the tutorial: …
Looks good to me, I can merge later!
This pull request adds major functionality and documentation for re-training Parametric UMAP models, and for using landmark re-training to keep the embedding space consistent. This feature allows the embedding space to be updated smoothly, so new data can be transformed without relying on good generalization. In addition to changes to the library, I've added a new documentation page and notebook motivating this update and explaining how to use it.
Changes include:
- Allowing `ParametricUMAP` models to be re-trained, by checking in `_fit_embed_data()` if an encoder is already present before making a new one.
- Adding a `_landmark_loss` function to the `UMAPModel` class, along with `landmark_loss_fn` and `landmark_loss_weight` to both the `UMAPModel` and `ParametricUMAP` classes.
- Changing `construct_edge_dataset()` to pass on landmark_positions as provided to `.fit()` or `.fit_transform()`.
- Saving the keras training history to `UMAPModel._history`, in line with other keras model behavior.
- Adding the `ParametricUMAP` class to the UMAP API Guide page of the documentation, since it now has enough additional documentation that should be readily available.
- A new documentation page covering re-training with landmarks through the `.fit()` method.
- A new notebook, `MNIST_Landmarks.ipynb`, which mirrors the documentation page.
- Adding the `load_ParametricUMAP` function to `__init__.py` so that it can be imported from the library.

Questions for review:

- Are there particular tests you would like to see for the features of `ParametricUMAP` that involve re-training?
- Currently, the landmark positions are stored on the `ParametricUMAP` object itself so that they can be passed to `construct_edge_dataset()`. This was done in an effort to avoid making changes to the `UMAP` class code. There might be a cleaner way of doing this where the `.fit()` and `.fit_transform()` methods on `UMAP` can take and pass on arbitrary kwargs to `self._fit_embed_data()`. This would also reduce the awkwardness of checking if the landmarks have already been set in `.fit_transform()` when you get to `.fit()`. If that would be preferable, I can have a crack at making a cleaner version.
- The handling of training epochs is confusing in `ParametricUMAP` at this stage. This might be a separate issue, but IMO it's magnified by the ability to re-train. Is there a change in nomenclature or notes in the documentation that could improve this?
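For reference, a minimal usage sketch of the re-training workflow this PR enables. The `landmark_positions` format (an array aligned with the training data, with `np.nan` rows for points that are free to move) and the 2-D default embedding are assumptions here:

```python
import numpy as np
from umap.parametric_umap import ParametricUMAP

rng = np.random.default_rng(0)
X_old = rng.normal(size=(500, 10)).astype("float32")
X_extra = rng.normal(size=(200, 10)).astype("float32")

embedder = ParametricUMAP()
embedding_old = embedder.fit_transform(X_old)

# Re-train on old + new data, pinning the old points to their previous
# coordinates via the landmark loss so the embedding space stays consistent.
X_new = np.concatenate([X_old, X_extra])
landmark_positions = np.full((len(X_new), 2), np.nan)  # nan = free to move
landmark_positions[: len(X_old)] = embedding_old       # pin re-used points

embedding_new = embedder.fit_transform(
    X_new, landmark_positions=landmark_positions
)
```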