Reproducibility issue with the same data and OS #110
That sounds bad. Unfortunately, my ability to diagnose the problem is going to be limited without access to data that reproduces the issue, but I understand if you are unable to share it. So here are some things to try that might narrow it down a bit:
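If it helps to narrow things down, a minimal determinism check (a sketch, assuming only that uwot is installed) is to reset the seed before each of two runs and compare the coordinates:

```r
library(uwot)

# Reset the RNG seed before each run.
set.seed(123)
emb1 <- umap(iris)

set.seed(123)
emb2 <- umap(iris)

# If these are not identical, something other than R's RNG
# (e.g. a multi-threaded BLAS) is introducing non-determinism.
identical(emb1, emb2)
all.equal(emb1, emb2)
```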
If you are able to try any of this, let me know of any findings.
Hi jlmelville, thank you for quickly reviewing this issue! Sorry, I should have tried that first. Here is the execution with default parameters; below are Runs 1.1 and 1.2 respectively.

This issue goes away for iris with the suggested change. Unfortunately, it still exists for my data.

This time I noticed rotation about the central axis of the plot, and local cluster consistency. Let me know if you need more information from me.
Given that it seems like an initialization-related problem, I wonder whether this is related to some issues with IRLBA in combination with certain non-default BLAS/LAPACK libraries, see bwlewis/irlba#14. Smells like some non-determinism related to parallelization, though I'm not too familiar with the Accelerate library.
Thank you.
Thank you @SuhasSrinivasan for the details. Some observations.

First, I'm not sure which of two scenarios you are running. Scenario 1:

```r
set.seed(123); plot(umap(iris))
set.seed(123); plot(umap(iris))
```

These should give identical results. If they don't, then there is something very strange with your installation of R, and I am not sure it's a problem with uwot. Scenario 2:

```r
set.seed(123)
plot(umap(iris))
plot(umap(iris))
```

These will not give identical results, by design: UMAP uses random numbers as part of the stochastic gradient descent, so it will give different results for each run. For reproducibility you will need to fix the seed and reset it (via `set.seed`) before every run. With that out of the way, if you are seeing this problem even with the seed reset before each run, then something else is going on.

As an aside, it's not impossible for small numerical differences to cause spectral or PCA initialization to undergo sign flips. I'm afraid I don't do anything to standardize that (and in fact the latest R release has changed some SVD routines, which can cause this change), so the initialized output should only be expected to be the same up to a reflection. I am slightly surprised that it's triggered for the same input, but that could be a random number issue (it's been a while since I looked at IRLBA, so I don't actually know if it uses a randomized algorithm). Otherwise, that is suggestive of some oddities in the BLAS library.

Finally, although this is separate from the initialization problem, I notice you only have 30 observations in your dataset. The default `n_neighbors` may be too large for a dataset that small.
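To illustrate the sign-flip point, here is a hedged sketch (assuming the irlba package is installed) that compares two truncated SVD runs up to a per-column sign, which is the most you can expect when singular vectors are only defined up to reflection:

```r
library(irlba)

set.seed(42)
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)

# Truncated SVD twice on the same matrix.
s1 <- irlba(X, nv = 2)
s2 <- irlba(X, nv = 2)

# align_signs is a helper defined here for illustration: it flips the
# sign of each column of b to match the corresponding column of a.
align_signs <- function(a, b) {
  flips <- sign(colSums(a * b))
  sweep(b, 2, flips, `*`)
}

# Compare the right singular vectors only after aligning signs.
all.equal(s1$v, align_signs(s1$v, s2$v))
```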
If there is some parallelization happening in the BLAS library, you cannot control that through uwot's interface, because none of that information is passed to IRLBA. As far as I know, IRLBA doesn't expose any ability to control the parallelization. If your BLAS library is multi-threaded, then you will need to consult its documentation to see if there is a way to control that. For example, for Intel's MKL version of BLAS, you can set an environment variable before you start R.
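As a sketch, forcing single-threaded BLAS before launching R might look like the following; which variable actually applies depends on the BLAS backend in use (`MKL_NUM_THREADS` for Intel MKL, `OPENBLAS_NUM_THREADS` for OpenBLAS, `VECLIB_MAXIMUM_THREADS` for Apple's Accelerate):

```shell
# Force single-threaded BLAS before launching R.
# Only the variable matching your BLAS backend has any effect.
export MKL_NUM_THREADS=1          # Intel MKL
export OPENBLAS_NUM_THREADS=1     # OpenBLAS
export VECLIB_MAXIMUM_THREADS=1   # Apple Accelerate
export OMP_NUM_THREADS=1          # generic OpenMP fallback
# ...then start R as usual, e.g.: R --vanilla
```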
I will close this issue, but may I submit a feature request for a parameter to set the seed, similar to what umap-learn offers? https://umap-learn.readthedocs.io/en/latest/reproducibility.html
Thank you for digging deeper @SuhasSrinivasan. I'm not against the idea of a seed parameter, but consider these two snippets:

```r
set.seed(123)
rnorm(10)
rnorm(10)
```

and

```r
set.seed(123)
head(umap(iris))
head(umap(iris))
```

I don't think most people would expect a pseudo-random number generator like `rnorm` to return identical values on consecutive calls after a single `set.seed`, and `umap` behaves the same way. That said, I think there may be a real problem here. Let's leave this issue open for now.
@jlmelville thank you so much for investigating this further, and for understanding my limited experience with this. I noticed another behavior which may be relevant to this; it could be a combination of parameters.

(Iris plots attached.)
Ok, I've dug into this a bit more and I don't think there is a bug:

```r
> set.seed(123)
> head(.Random.seed)
[1] 10403 624 -983674937 643431772 1162448557 -959247990
> res <- rnorm(10000)
> head(.Random.seed)
[1] 10403 32 1064722786 -45720197 139516934 1211617164
```

As the output shows, R's RNG state advances as numbers are drawn, which is the expected behavior. As long as you run in single-threaded mode, calling `set.seed` before each run should give reproducible results. I appreciate that needing to call an external function like `set.seed` is not the most convenient interface.
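Until something built in exists, the pattern above can be wrapped in a small convenience function. `umap_seeded` here is a hypothetical helper for illustration, not part of uwot's API:

```r
library(uwot)

# Hypothetical wrapper: reset the RNG before every run so that
# repeated calls with the same seed produce the same embedding.
umap_seeded <- function(X, seed = 42, ...) {
  set.seed(seed)
  umap(X, ...)
}

# Two calls with the same seed; compare the results.
identical(umap_seeded(iris), umap_seeded(iris))
```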
Thank you for the clarification! And I apologize for any inconvenience. Any insight as to why, in some scenarios, the result is stable for consecutive runs even when not using `set.seed`?
Not really. Can you give an example of what combination of parameters causes this and what you mean by "stable"? Exactly the same output coordinates? Similar interpoint distances?
Exactly the same output coordinates.
Ah yes, sorry I didn't spot that sooner. The absolute values of the output coordinates are very large, presumably because of the scaling of the input data. If you do PCA on the input, the initial coordinates inherit that large scale, so the gradients are tiny and very little optimization can occur; the output stays essentially at its (deterministic) initialization, which is why consecutive runs look identical. The way to remedy this is to either standardize the input data or use a scaled initialization so that the initial coordinates are small.

FWIW, this problem isn't specific to uwot: in fact, none of the t-SNE/LargeVis/UMAP-like dimensionality reduction methods are scale-invariant. Most of them will scale the input data and the output coordinates in a way that guarantees sufficiently large gradients for some kind of optimization to occur. uwot's default settings also account for this, but if you stray off the beaten path then, under the "it's your funeral" principle, all bets are off.

Perhaps I should add a check that looks at the scale of the initial coordinates and warns if very little optimization can occur, to avoid this potential confusion.
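A hedged sketch of the remedy, assuming a uwot version that supports the `init_sdev` parameter (which rescales whatever initialization you choose to a small standard deviation):

```r
library(uwot)

X <- as.matrix(iris[, 1:4])

# Raw PCA initialization inherits the scale of the input data; if the
# coordinates start out large, gradients are tiny and little
# optimization happens.
emb_raw <- umap(X, init = "pca")

# Rescaling the initial coordinates to a small spread restores
# movement during optimization.
emb_scaled <- umap(X, init = "pca", init_sdev = 1e-4)

# Compare the spread of the output coordinates for each run.
apply(emb_raw, 2, sd)
apply(emb_scaled, 2, sd)
```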
Thank you for the very helpful explanation on the optimization step!
Leave it open for now please @SuhasSrinivasan, thanks.
In the master branch there is now a parameter to set the seed directly.

A corner case for reproducibility: if you have a small dataset (fewer than 4,096 observations), results will not be reproducible if the nearest neighbor method differs between runs.

I am going to close this issue, but feel free to re-open.
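A sketch of how the new parameter might be used, assuming it is named `seed` (as in recent uwot releases); with it, no explicit `set.seed` call should be needed:

```r
library(uwot)

# Passing the same seed to both runs should yield identical
# coordinates without touching R's global RNG state by hand.
emb1 <- umap(iris, seed = 42)
emb2 <- umap(iris, seed = 42)
identical(emb1, emb2)
```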
Hi jlmelville, thank you for developing and maintaining this package.

I am currently noticing a high-severity issue where the global structure of the 2D embedding changes drastically on consecutive runs, even though the data, the `umap` parameters, and the executing machine are all held constant. The local relationships among the points in each cluster are similar, but difficult to gauge immediately due to the large change in global structure.
Additionally, this issue is not resolved after trying many combinations of `umap` parameters, as outlined in issues #46 and #55. Parameters checked include the following:
Below is the initial part of the R session information.
To determine whether it is something specific to my environment, I tried to replicate the issue seen with `uwot::umap` by using `umap::umap`, but was not able to reproduce it: the `umap::umap` embedding maintains its local and global structure on consecutive runs.

Please let me know if there are any workarounds or troubleshooting steps, or if you need additional information.
Thank you!