-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spectral initialization seems to get stuck sometimes #16
Comments
Note that I also had some luck using different algorithms (LOBPCG) for some cases. I've managed to have some more detailed advice as well (on building suitable preconditioning matrices etc.) but mostly this has not been a big enough problem/bottleneck to go down that particular rabbit hole. On a different front I am also looking into a bootstrapping style initialisation, or something similar to PAGA. That's an issue for further down the road however. |
I had a look around for LOBPCG routines, but could not find an R implementation (although it seem to have been implemented everywhere else). PAGA does indeed look interesting, but I suspect I will be lucky if there's an R-accessible version of that too. |
I was expecting to have to implement my own version of PAGA style routines. That is among the reasons it is low on my priority list -- it's a lot of work. |
@lmcinnes, I haven't had time to dig into the meta-embedding routines fully, but it looks like one of the code paths involves creating the Laplacian Eigenmap based on applying a Gaussian kernel to the distance matrix between the centroids of the connected components. The centroids seem to be taken from the original data itself: what happens with non-continuous features (e.g. 0/1 binary data)? Also, how do you see this playing into mixed data types: I assume you would use just the subset of X that applies to that type when calculating the centroid. Do you consider this a key part of UMAP initialization and would feel strongly that UMAP was not being properly represented if I didn't implement this in uwot; or could I just bail out to random initialization when confronted with disconnected components? I could try to justify my lazy approach by arguing that if the graph isn't fully connected, then a global view of the data can't be provided, and the user should increase |
I've made a change so that disconnected components are initialized separately and the results merged, but with no attempt to stop them from overlapping. I squinted at the PAGA code in scanpy and it seems that it uses ForceAtlas2 or igraph to deal with positioning disconnected components. In turn, there doesn't seem to be anything particularly magic about their way of dealing with disconnections, so I am not going to purse that further for now. For some problem cases I identified, the recursive initialization does help, but there are still some results where the spectral initialization takes way too long (several hours), so I need to take a closer look at RSpectra's parameters. |
Although the shift-and-invert mode ( However, under the circumstances where it doesn't misbehave, it can converge where using That'll do for now. |
Sorry for being late getting back to you. The meta-embedding approach is not really viable for mixed data and other games (such as non-continuous data). A medoid rather than centroid based approach would be the best option there, although it certainly gets tricky with mixed data types. In an ideal world I would like an approach something like this: Use graph clusters (could be connected components, could be Louvain, or something similar) an generate representatives for each cluster (centroids, medoids, something else again) and then embed the representatives using UMAP (which requires an exact UMAP designed for small data like you have in smallvis). Generate an initialisation from there based on an approximate UMAP or spetral for each component, scaled appropriately to avoid overlaps, and then run the optimization from there. Details to be worked out of course, and implementation-wise this can get non-trivial. |
A fill UMAP on a reduced dataset of medioids is interesting. The only difficulty I see with it is deciding on a value for |
Hi! I´m using this wonderfull package to do dimension reduction on flow cytometry data. Using unsupervised uwot::umap to do dimension reduction on this data with init="spectral" I get the best separation of the populations. But, using the "label" column as y to do a supervised embedding the R get stuck on initialization. I can do supervised embedding with the other init options except spectral and normlaplacian (laplacian do works). What do you think about this issue? Thanks, |
I will take another look, although the options are limited. I can add a new I'll update the issue when/if this feature is added. |
@OuNao, It will almost certainly run a lot slower than RSpectra does under normal circumstances, but it might complete under conditions where RSpectra hangs, which might be an acceptable trade-off. On my machine, for MNIST and FMNIST (which have dimensions of N = 70,000 and p = 768), the current settings for If you are able to, can you let me know if installing the latest version of uwot from master and using
I've set it to try up to 1000 iterations. For some environments you might not see the output until it's entirely finished, but for example in RStudio, you should see each line appear as it's calculated, which should at least let you distinguish between whether it's hanging or just still grinding away. |
Thanks for the prompt response. Yes, with the ispectral init option the initialization occurs with no crash, but slower. On my old i3 notebook I get: I will try some different training sets to know if the better population separation with the ispectral option is worth the loss of performance. Thank you again. |
Thank you for checking on that. Possibly I can find a set of parameters with irlba that would make it acceptably fast, but as RSpectra takes only a few seconds (at least when it's working) and irlba is taking several minutes that could be a tough proposition. Some ways forward would be to implement LOBPCG in R, or to do some digging in what might be causing RSpectra to be so slow. Both are long-term projects. |
@OuNao, if you are still having problems with excessively long spectral initializations, I have pushed a new initialization option to master, This approach is the same as The advantages are:
Results in many cases seemed to be quite similar to the
|
The new "agspectral" seems as fast and good for me as "spectral" (but with no crash in supervised embedding!!!). Thank you again! |
Sometimes the spectral initialization takes a long time; on some occasions, it's got so stuck I've had to terminate the calculation (which can sometimes require terminating the R session).
Probably the input matrix is very poorly conditioned, so that finding the smallest eigenvalues is an exercise in numerical futility.
Recent versions of UMAP detect connected components and initialize them separately, see e.g.
https://github.com/lmcinnes/umap/blob/43cf1a820cea8d5b3218627d047dd78e4a152dd4/umap/spectral.py
This might solve the problem.
The text was updated successfully, but these errors were encountered: