support for multicore tsne #16
Any plan to add support for parallel t-SNE? https://github.com/DmitryUlyanov/Multicore-TSNE
Thank you for the suggestion. I was not planning on it, but it could be a nice addition. The project you linked to seems to require OpenMP, which I think an R package cannot currently assume is present on all platforms, particularly macOS. Do you have any suggestions on a different way to get this to work? |
It can be conditionally compiled. |
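For illustration, a minimal sketch of how OpenMP code is typically guarded so it still compiles without OpenMP - the `#pragma omp` directives are simply ignored by compilers that don't support them, but calls into the OpenMP runtime need an explicit guard (the helper name here is invented):

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Report how many threads are usable; fall back to a single thread when
// the compiler has no OpenMP support (e.g. the default macOS toolchain).
static int usable_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```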
I've been trying to implement the changes from https://github.com/DmitryUlyanov/Multicore-TSNE in a separate branch, but so far, I have not been able to get large improvements (only a few per cent improvement), so this is going to take some more testing to figure out whether the performance improvement is worth it. |
I implemented my own based on your R wrapper https://github.com/RGLab/Rtsne.multicore |
I would expect most of those improvements to come from the use of quadtree.cpp; how big are the improvements from using OpenMP? |
I am only testing at the R wrapper level. But you might be right, since changing ... |
@DmitryUlyanov may be able to weigh in on this |
@gfinak, According to its documentation, it only speeds up the scenario where the data has ...

```
$ ./tsne_cpp 20 0.9
Using no_dims = 2, perplexity = 30.000000, and theta = 0.5
Computing input similarities...
Building tree...
Done in 0.2214 seconds (sparsity = 0.707175)!
Learning embedding...
Iteration 50: error is 45.849951 (50 iterations in 0.217287 seconds)
Iteration 100: error is 44.965177 (50 iterations in 0.160154 seconds)
...
Iteration 999: error is 0.123850 (50 iterations in 0.234315 seconds)
Fitting performed in 3.220189 seconds.
done
```
|
I was finally able to have a look at this. After some experiments, I decided to integrate the suggested changes from https://github.com/rappdw/tsne, to keep most of the functionality of Rtsne intact while still providing a speed-up. The main functionality that is lost is the ability for the user to specify an arbitrary output dimensionality at run time, beyond those defined at compile time (1, 2 and 3). While this could be added back later, I hope those three cases cover most of the use cases of t-SNE. The changes are in this branch: https://github.com/jkrijthe/Rtsne/tree/openmp.

**Benchmarks**

**Iris (149x4)**

On iris I get about a 3x speedup on a single core, and a little extra from using multiple threads (I only have 2 physical cores in this machine, so results for more than 2 threads may not be representative, as I am not sure what effect Hyper-Threading has).

New version:

Old version (always only effectively uses 1 core):

**MNIST digits (10000x784)**

Before integrating this into the main branch, I still need to check to what extent results are reproducible when different numbers of threads are used, and to make sure I understand all the requirements for including OpenMP code in a CRAN package. |
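As a point of reference on that CRAN requirement (a sketch of the standard mechanism from "Writing R Extensions", not necessarily this package's actual Makevars): the portable route is the `SHLIB_OPENMP_CXXFLAGS` make macro, which expands to nothing on toolchains without OpenMP, so the package still builds there:

```make
# src/Makevars - portable OpenMP flags; empty on platforms without OpenMP
PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)
```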
So this feature is now merged? Is there a flag I should use to utilize it? |
Unfortunately, no, it is still in the openmp branch, as I got caught up in other things during testing. Because of this, it has not been thoroughly tested yet, so I have not merged the changes. Another minor issue is that the change no longer allows embeddings of dimensions other than 1, 2 or 3, breaking previously possible (but never recommended) behaviour. So to get the speed-up, you currently have to install the version in the openmp branch. I'm currently not sure when I will have time to test/merge into the main branch and send it to CRAN, but hopefully I can get to it soon. In the meantime, any testing/improvements are most welcome. |
It seems that the changes on the openmp branch give results that are not reproducible across runs, even with a fixed seed:

```r
library(Rtsne)
iris_unique <- unique(iris) # Remove duplicates
iris_matrix <- as.matrix(iris_unique[,1:4])
set.seed(42) # Setting a seed doesn't help here, if I execute this again.
tsne_out <- Rtsne(iris_matrix, num_threads=2, verbose=TRUE) # Run TSNE
plot(tsne_out$Y, col=iris_unique$Species)
```

The culprit seems to be the multi-threaded summation: the order in which the threads' floating-point contributions are added varies between runs, so the rounding errors differ even with a fixed seed. I don't know if there is any OpenMP setting that can solve this. |
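For context, a sketch of the kind of parallel summation that produces this behaviour (an illustration, not necessarily the exact code in the branch) - an OpenMP reduction combines per-thread partial sums in whatever order the threads happen to finish, so the floating-point rounding differs from run to run, and t-SNE's iterations amplify those tiny differences:

```cpp
// Nondeterministic: the order in which per-thread partial sums are combined
// depends on scheduling, so sum_Q can vary slightly between runs even with
// a fixed seed.
double sum_Q = 0.0;
#pragma omp parallel for reduction(+:sum_Q)
for(int n = 0; n < N; n++) {
    sum_Q += tree->computeNonEdgeForces(n, theta, neg_f + n * D);
}
```

The fix proposed below makes the final additions happen in a fixed, serial order.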
Do you get consistency across different numbers of threads with the solution you propose? And what would be the difference in terms of performance? |
Some testing suggests that the proposed solution restores consistency across runs and across numbers of threads. I've posted the modified function below (I can make a PR explicitly if you want):

```cpp
// Compute gradient of the t-SNE cost function (using Barnes-Hut algorithm)
template <int NDims>
void TSNE<NDims>::computeGradient(double* P, unsigned int* inp_row_P, unsigned int* inp_col_P, double* inp_val_P, double* Y, int N, int D, double* dC, double theta)
{
    // Construct space-partitioning tree on current map
    SPTree<NDims>* tree = new SPTree<NDims>(Y, N);

    // Compute all terms required for t-SNE gradient
    double* pos_f = (double*) calloc(N * D, sizeof(double));
    double* neg_f = (double*) calloc(N * D, sizeof(double));
    if(pos_f == NULL || neg_f == NULL) { Rcpp::stop("Memory allocation failed!\n"); }
    tree->computeEdgeForces(inp_row_P, inp_col_P, inp_val_P, N, pos_f);

    // Store each point's normalization term, then sum in single-threaded mode,
    // to avoid randomness in the rounding errors.
    std::vector<double> output(N);
    #pragma omp parallel for schedule(guided)
    for(int n = 0; n < N; n++) {
        output[n] = tree->computeNonEdgeForces(n, theta, neg_f + n * D);
    }
    double sum_Q = .0;
    for(int n = 0; n < N; ++n) {
        sum_Q += output[n];
    }

    // Compute final t-SNE gradient
    for(int i = 0; i < N * D; i++) {
        dC[i] = pos_f[i] - (neg_f[i] / sum_Q);
    }
    free(pos_f);
    free(neg_f);
    delete tree;
}
```

The performance should not be much different; there's an extra memory allocation per call of `computeGradient`, but that should be negligible next to the tree computations. |
Thanks. PR would be nice, but I can also commit it if you prefer that. |
Hi, I have tried using a value greater than 1 for the `num_threads` argument, but it does not seem to have any effect when running the exact t-SNE (theta = 0.0). Is multithreading only used by the approximate version? |
Yes, that is correct: currently only the approximate implementation can use multiple threads; the exact implementation does not. |
I've found that Rtsne.multicore is faster than using multiple threads within Rtsne. Do you (@jkrijthe @mikejiang) know why this is happening? Please excuse the newbie questions, and thank you for all your great work! |
Without looking at the Rtsne.multicore code in detail, I couldn't say; I'll see if I can figure out where the difference comes from. |
An update on my previous post: it seems like what we have now is probably the best we're going to get. Rtsne.multicore uses reduction for the summation, which is fast but leads to irreproducibility issues across multiple runs even when the seed is set - see my Oct 4 comment above.

On a separate note, I tried to speed it up by moving some of the memory allocations out of the loops, but to my surprise, this had no effect at all (compiling on R-devel with Clang 6.0). This result was quite unexpected; I'd always thought that performing repeated allocations within a loop was a Bad Thing. But... there you go. That was going to be my major speed-up strategy; so much for that.

In any case, the side product of this little misadventure was a more C++-idiomatic version of tsne.cpp (the original bhtsne code was effectively written as C with classes), which also avoids an unnecessary copy of the input data set during the NN search. It uses OpenMP more widely, but this doesn't make much of a difference, as most time seems to be spent in the serial SPTree construction per training iteration. As such, the modified version runs slightly faster or slower than the current Rtsne, depending on the mood of my laptop. Check out the fork at https://github.com/LTLA/Rtsne if you want to play around with it. |
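For readers wondering what "moving allocations out of the loops" looks like, a hypothetical sketch of the pattern that was tried (invented names; as noted above, it made no measurable difference here):

```cpp
#include <algorithm>
#include <vector>

// Hoist a per-iteration buffer out of the training loop: allocate it once,
// then clear and reuse it, instead of reallocating on every iteration.
void train(int max_iter, int N, int D) {
    std::vector<double> grad(N * D);                // allocated once
    for(int iter = 0; iter < max_iter; iter++) {
        std::fill(grad.begin(), grad.end(), 0.0);   // reuse the same storage
        // ... compute the gradient into grad and update the embedding ...
    }
}
```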
Cool, thanks for testing this out!
|
Thank you @LTLA! However, I'm running into other problems, and I'm not sure if I just have to keep scaling up computing power or if there is some "remove memory allocations out of loops" kind of fix. The original data set is 300+ million rows (5 dimensions), but I've been sub-sampling. The largest sample I've successfully run is ~50K rows; any higher and I get C stack usage errors. The latest attempt tested a ~350K-row sub-sample (ulimit is set to "unlimited", and I'm running it on a 32-core, 244 GB RAM, 240 GB HD instance).
Any idea how I can make this work? |
It's probably a stack overflow caused by recursion, either in the VP tree nearest-neighbour search or in the SP tree force calculation. I'm a bit surprised, though; the nodes of the tree are created via `new`, so they live on the heap, and only the recursive calls themselves should be eating into the stack. |
Good! Since the main changes are in master now, I'm closing this issue. Specific suggestions for further improvements are, of course, welcome. Thanks all! |
Hi folks, I still get the error "C stack usage is too close to the limit" when doing t-SNE on a large dataset of 210,614x30 (the PCA matrix of a single-cell dataset; file attached: https://drive.google.com/open?id=180iI8W49rgLPvgt6-GbhvmrwnyOWgk0t). My machine has 90GB of RAM (Architecture: x86_64). I've already set ulimit -s to unlimited. Any suggestion is appreciated. @LTLA @jkrijthe |
I'm afraid I don't have much more guidance to give here. If setting ulimit -s to unlimited didn't avoid the error, I'm not sure what else will. To give some background: the stack errors are most likely caused by the recursions during the VP tree nearest-neighbour search and the SP tree calculation of the forces. For huge datasets, the tree is deep enough that it blows the stack when it tries to recurse. Refactoring both of these to iterate rather than recurse would be... possible, but a real chore, as recursion is a really natural way to work with trees. I don't know whether anyone would have the appetite for this. |
I'm afraid I do not have any suggestions here. |
Having thought about it a bit more, here's a long shot at solving the problem. If the stack error is occurring during the VP tree search, you should be able to avoid it by supplying your own NN results; have a look at the `Rtsne_neighbors` interface, which takes precomputed nearest-neighbour indices and distances. You're out of luck if the stack error occurs in the SP tree. That said, I have just looked at the SP tree code, and the recursion is not as bad as I thought - it only occurs in one function, and it may be a relatively straightforward weekend job to refactor it into an iterative algorithm. |
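For what it's worth, a generic sketch of that refactor (hypothetical node layout, not the package's actual SPTree): replace the recursive calls with an explicit stack, which lives on the heap, so tree depth can no longer overflow the C stack:

```cpp
#include <stack>
#include <vector>

struct Node {
    std::vector<Node*> children;  // hypothetical tree-node layout
};

// Depth-first traversal without recursion: pending nodes are kept on an
// explicit std::stack on the heap instead of on the C call stack.
template <typename Visit>
void traverse(Node* root, Visit visit) {
    std::stack<Node*> pending;
    if (root != nullptr) pending.push(root);
    while (!pending.empty()) {
        Node* node = pending.top();
        pending.pop();
        visit(node);
        for (Node* child : node->children) {
            if (child != nullptr) pending.push(child);
        }
    }
}
```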