
support for multicore tsne #16

Closed · mikejiang opened this issue Mar 27, 2017 · 29 comments

@mikejiang

Any plan to add support for parallel tsne? https://github.com/DmitryUlyanov/Multicore-TSNE

@jkrijthe (Owner)

Thank you for the suggestion. I was not planning on it, but it could be a nice addition. The project you linked to seems to require OpenMP, which I think an R package cannot currently assume is present on all platforms, particularly macOS. Do you have any suggestions for a different way to get this to work?

@jkrijthe (Owner)

I've been trying to implement the changes from https://github.com/DmitryUlyanov/Multicore-TSNE in a separate branch, but so far, I have not been able to get large improvements (only a few per cent improvement), so this is going to take some more testing to figure out whether the performance improvement is worth it.

@mikejiang (Author)

I implemented my own version based on your R wrapper: https://github.com/RGLab/Rtsne.multicore
It already achieves a 50% time saving on the iris data.

@jkrijthe (Owner)

I would expect most of those improvements to come from the use of quadtree.cpp. How big are the improvements from using OpenMP?

@mikejiang (Author)

I am only testing at the R wrapper level. But you might be right, since changing num_threads doesn't seem to make much difference.

@mikejiang (Author)

@DmitryUlyanov may be able to weigh in on this.

@mikejiang (Author) commented Mar 29, 2017

@gfinak, according to its documentation, it only speeds up the scenario where the data has D >> N, yet the majority of our use cases have rows >> columns, so the major bottleneck at step 2 (i.e., learning the embedding) does not gain much from this multicore version:

$ ./tsne_cpp 20 0.9
Using no_dims = 2, perplexity = 30.000000, and theta = 0.5
Computing input similarities...
Building tree...
Done in 0.2214 seconds (sparsity = 0.707175)!
Learning embedding...
Iteration 50: error is 45.849951 (50 iterations in 0.217287 seconds)
Iteration 100: error is 44.965177 (50 iterations in 0.160154 seconds)
...
Iteration 999: error is 0.123850 (50 iterations in 0.234315 seconds)
Fitting performed in 3.220189 seconds.
done

@jkrijthe (Owner) commented Jul 7, 2017

I was finally able to have a look at this. After some experiments I decided to integrate the suggested changes from https://github.com/rappdw/tsne, to be able to keep most of the functionality of Rtsne intact while providing a speed-up. The main functionality that I lost is the ability for the user to specify an arbitrary output dimensionality at run time, other than those defined at compile time (1, 2 and 3). While this could be added back later, I hope those three cases cover most of the use cases of t-SNE.
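
To make the constraint concrete, a minimal sketch (the commented-out call is hypothetical; the exact behaviour depends on the branch):

library(Rtsne)

mat <- as.matrix(unique(iris)[, 1:4])
tsne_out <- Rtsne(mat, dims = 3)   # 1, 2 and 3 are compiled in, so this works
# Rtsne(mat, dims = 4)             # no longer possible in this version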

The changes are in this branch: https://github.com/jkrijthe/Rtsne/tree/openmp.

Benchmarks

Iris (149x4)

On iris I get about a 3x speed-up on a single core, and a further small improvement from using multiple threads (I have only 2 physical cores in this machine, so results for more than 2 threads may not be representative, as I am not sure what effect HyperThreading has):

New version:

Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
tsne_out <- Rtsne(mat, num_threads = 4) 268.3007 272.7952 288.3637 293.0806 297.0574 305.2486    10
tsne_out <- Rtsne(mat, num_threads = 3) 261.4605 263.5789 276.8929 272.7442 286.3434 309.8314    10
tsne_out <- Rtsne(mat, num_threads = 2) 242.5431 252.5582 284.0321 290.3844 300.3660 337.2284    10
tsne_out <- Rtsne(mat, num_threads = 1) 308.2431 314.8977 335.4389 323.9315 357.9258 374.1955    10

Old version (always effectively uses only 1 core):

  Unit: milliseconds
expr       min       lq     mean   median       uq      max neval
tsne_out <- Rtsne(mat, num_threads = 4)  999.5290 1032.207 1048.617 1045.712 1077.219 1100.675    10
tsne_out <- Rtsne(mat, num_threads = 3) 1018.1044 1035.353 1062.052 1050.393 1084.681 1130.307    10
tsne_out <- Rtsne(mat, num_threads = 2)  988.6694 1005.776 1043.140 1022.328 1072.930 1135.645    10
tsne_out <- Rtsne(mat, num_threads = 1) 1020.6595 1031.474 1036.125 1037.058 1043.347 1051.902    10

MNIST digits (10000x784)

Old 1 thread : 135.835 s
New 1 thread :  75.803 s
New 2 threads:  58.209 s
New 4 threads:  57.580 s

Before integrating this into the main branch I still need to check to what extent results are reproducible when different numbers of threads are used, and to make sure I understand all the requirements for including OpenMP code in a CRAN package.

@rgranit commented Aug 9, 2018

So is this feature now merged? Is there a flag I should use to enable it?

@jkrijthe (Owner)

Unfortunately, no, it is still in the openmp branch, as I got caught up in other things during testing. Because of this, it has not been thoroughly tested yet, so I have not merged the changes. Another minor issue is that the change no longer allows for embeddings of dimension other than 1, 2 or 3, breaking previously possible (but never recommended) behaviour.

So to get the speed-up you currently have to install the version in the openmp branch.
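
For reference, a minimal sketch of installing that branch, assuming the remotes package is available:

# install.packages("remotes")  # if not already installed
remotes::install_github("jkrijthe/Rtsne", ref = "openmp")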

I'm currently not sure when I will have time to test/merge into the main branch and send it to CRAN, but hopefully I can get to it soon. In the meantime, any testing/improvements are most welcome.

@LTLA (Contributor) commented Oct 4, 2018

It seems that the changes on master (currently at 2fbc1ca) are not consistent when you change the number of threads, or even across runs. When I execute this code multiple times, I get different results:

library(Rtsne)
iris_unique <- unique(iris) # Remove duplicates
iris_matrix <- as.matrix(iris_unique[,1:4])

set.seed(42) # Setting a seed doesn't help here, if I execute this again.
tsne_out <- Rtsne(iris_matrix, num_threads=2, verbose=TRUE) # Run TSNE
plot(tsne_out$Y,col=iris_unique$Species)

The culprit seems to be TSNE<NDims>::computeGradient; commenting out the #pragma gives consistent results across runs. I suspect that the reduction in the OpenMP'd for loop does not guarantee addition to sum_Q in the same order in a multi-threaded application compared to a single-threaded application, which results in differences in rounding-off errors that are amplified across iterations.

I don't know if there is any OpenMP setting that can solve this - static scheduling will provide consistency across runs but not necessarily the same results as the single-threaded application. The simplest solution would just be to save everything into an array and sum across the array in single-threaded mode.

@jkrijthe (Owner) commented Oct 4, 2018

Do you get consistency across different numbers of threads with the solution you propose? And what would be the difference in terms of performance?

@LTLA (Contributor) commented Oct 4, 2018

Some testing suggests that the proposed solution restores consistency to the application across runs and across numbers of threads. I've posted the modified function below (can make a PR explicitly if you want):

// Compute gradient of the t-SNE cost function (using Barnes-Hut algorithm)
template <int NDims>
void TSNE<NDims>::computeGradient(double* P, unsigned int* inp_row_P, unsigned int* inp_col_P, double* inp_val_P, double* Y, int N, int D, double* dC, double theta)
{
    // Construct space-partitioning tree on current map
    SPTree<NDims>* tree = new SPTree<NDims>(Y, N);

    // Compute all terms required for t-SNE gradient
    double* pos_f = (double*) calloc(N * D, sizeof(double));
    double* neg_f = (double*) calloc(N * D, sizeof(double));
    if(pos_f == NULL || neg_f == NULL) { Rcpp::stop("Memory allocation failed!\n"); }
    tree->computeEdgeForces(inp_row_P, inp_col_P, inp_val_P, N, pos_f);

    // Storing the output to sum in single-threaded mode; avoid randomness in rounding errors.
    std::vector<double> output(N);
    #pragma omp parallel for schedule(guided)
    for (int n = 0; n < N; n++) {
      output[n]=tree->computeNonEdgeForces(n, theta, neg_f + n * D);
    }

    double sum_Q = .0;
    for (int n=0; n<N; ++n) {
        sum_Q += output[n];
    }

    // Compute final t-SNE gradient
    for(int i = 0; i < N * D; i++) {
        dC[i] = pos_f[i] - (neg_f[i] / sum_Q);
    }

    free(pos_f);
    free(neg_f);
    delete tree;
}

The performance should not be much different; there's an extra memory allocation per call of computeGradient, which can be avoided by having output as a TSNE member variable. I guess there's also the possibility for cache misses when the computed value in each iteration needs to be stored in output, but this might be offset by the reduced complexity of the loop itself when reduction is no longer required. I'm not sure how these two factors would play off against each other - probably negligible in both cases.

@jkrijthe (Owner) commented Oct 4, 2018

Thanks. PR would be nice, but I can also commit it if you prefer that.

@kbarylyuk

Hi, I have tried using a value greater than 1 for the num_threads parameter, but I don't seem to see any speed-up when theta = 0. I do see a decrease in running time when num_threads is changed from 1 to 2 (though no further speed-up when using 3 or 4 threads on my 4-core computer) if theta = 0.5. Also, I see only one thread in my Activity Monitor if I use, for example, theta = 0, num_threads = 2, but there are multiple threads when theta = 0.5.
Am I correct in thinking that multiple threads are only used at the Barnes-Hut approximation step?

@jkrijthe (Owner) commented Dec 3, 2018

Yes, that is correct: currently only the approximate (Barnes-Hut) implementation can use multiple threads, while the exact implementation cannot.
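
A minimal illustration of the distinction, using iris as stand-in data:

library(Rtsne)

X <- as.matrix(unique(iris)[, 1:4])

# Barnes-Hut approximation (theta > 0): multiple threads can be used.
bh_out <- Rtsne(X, theta = 0.5, num_threads = 4)

# Exact t-SNE (theta = 0): runs single-threaded; num_threads has no effect.
exact_out <- Rtsne(X, theta = 0.0, num_threads = 4)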

@cpenaloza

> I implemented my own version based on your R wrapper: https://github.com/RGLab/Rtsne.multicore
> It already achieves a 50% time saving on the iris data.

I've found that Rtsne.multicore is faster than using multiple threads within Rtsne:

library(microbenchmark)
library(Rtsne)
library(Rtsne.multicore)

microbenchmark(
   tout <- Rtsne.multicore(iris, perplexity = 10, num_threads = 16, check_duplicates = F), 
   tout <- Rtsne(iris, perplexity = 10, num_threads = 16, check_duplicates = F), 
   times = 10)

#>                      min        lq     mean    median      uq    max  neval cld
#> Rtsne.multicore   181.7699 190.2749 193.4733 191.6511 198.0421 206.6607    10  a 
#> Rtsne             213.6623 219.0838 234.7819 229.6211 238.5277 305.4631    10   b

Do you (@jkrijthe @mikejiang) know why this is happening?
Why is Rtsne.multicore deprecated if it is faster?
Which should I be using?

Please excuse the newbie questions and thank you for all your great work!

@LTLA (Contributor) commented Dec 5, 2018

Without looking at the Rtsne.multicore code, I can't say too much. But we did have to modify some of the OpenMP code in Rtsne to avoid stochasticity from race conditions; this may explain the slight slow-down, possibly due to increased memory management in each thread. I am planning to overhaul the C++ back-end to improve upon some of the hacky workarounds that I put in - with @jkrijthe's blessing, of course.

@LTLA (Contributor) commented Dec 8, 2018

An update on my previous post: it seems like what we have now is probably the best we're going to get. Rtsne.multicore uses reduction for the summation, which is fast but leads to irreproducibility issues across multiple runs even when the seed is set - see my Oct 4 comment above.
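
As a quick way to check this on any given build, a minimal sketch: run twice with the same seed and compare the embeddings.

library(Rtsne)

X <- as.matrix(unique(iris)[, 1:4])

run_once <- function() {
  set.seed(42)
  Rtsne(X, num_threads = 2)$Y
}

# TRUE when the summation is deterministic; FALSE when a parallel
# reduction lets the order of floating-point additions vary.
identical(run_once(), run_once())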

On a separate note, I tried to speed it up by moving some of the memory allocations out of the loops, but to my surprise, this had no effect at all (compiling on R-devel with Clang 6.0). This result was quite unexpected; I'd always thought that performing repeated allocations within a loop was a Bad Thing. But... there you go. That was going to be my major speed-up strategy; so much for that.

In any case, the side product of this little misadventure was a more C++-idiomatic version of tsne.cpp (the original bhtsne code was effectively written as C with classes), which also avoids an unnecessary copy of the input data set during the NN search. It uses OpenMP more widely but this doesn't make much of a difference as most time seems to be spent in the serial SPTree construction per training iteration. As such, the modified version runs slightly faster/slower than the current Rtsne, depending on the mood of my laptop. Check out the fork if you want to play around with it.

@cpenaloza

Thank you @LTLA!
I've switched to Rtsne, since I need reproducibility.

However, I'm running into other problems and I'm not sure if I just have to keep scaling up computing power or if there is some "move memory allocations out of loops" kind of fix.

The original data set is 300+ million rows (5 dimensions), but I've been sub-sampling. The largest sample I've successfully run is ~50 K rows. Any higher and I get C stack usage errors. The error below is the latest one, from testing a ~350 K row sub-sample (ulimit is set to "unlimited" and I'm running it on a 32-core, 244 GB RAM, 240 GB HD instance).

Error: C stack usage 761581163188 is too close to the limit

Any idea how I can make this work?

@LTLA (Contributor) commented Dec 19, 2018

It's probably a stack overflow caused by recursion, either in vptree.h or sptree.h. Is this occurring during the nearest-neighbour search or after it? Try seeing where the error occurs with verbose=TRUE.

I'm a bit surprised, though; the nodes of the tree are created via new so they shouldn't store their data on the stack. Which suggests that the stack frames are the problem... what does ulimit -s say?

@cpenaloza

$ ulimit -s
8192
DOH! My bad @LTLA... I should have set ulimit -s unlimited in every call to screen. I'm running larger samples now... successfully so far (360 K+ rows), now running 3 M+ rows.
THANK YOU!

@jkrijthe (Owner)

Good! Since the main changes are in master now, I'm closing this issue. Specific suggestions for further improvements are, of course, welcome. Thanks all!

@bioturing

Hi folks,

I still get the error "C stack usage is too close to the limit" when running t-SNE on a large dataset of 210,614 x 30 (the PCA matrix of a single-cell dataset; file attached: https://drive.google.com/open?id=180iI8W49rgLPvgt6-GbhvmrwnyOWgk0t). My machine has 90 GB of RAM. Below are the specs of the machine:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz

I've already set ulimit -s to unlimited. Any suggestion is appreciated. @LTLA @jkrijthe

@LTLA (Contributor) commented Aug 6, 2019

I'm afraid I don't have much more guidance to give here. If ulimit -s is truly unlimited (and you restarted R in this unlimited state), then there shouldn't have been issues. Unless you are exhausting the "unlimited" stack in terms of its actual physical size, but that seems unlikely.

To give some background, the stack errors are most likely caused by the recursions during the VP tree nearest neighbors search and the SP tree calculation of the forces. For huge datasets, the tree is deep enough so that it blows the stack when it tries to recurse. Refactoring both of these to iterate rather than recurse would be... possible, but a real chore, as recursion is a really natural way to work with trees. I don't know whether anyone would have the appetite for this.

@jkrijthe (Owner) commented Aug 7, 2019

I'm afraid I do not have any suggestions here.

@LTLA (Contributor) commented Aug 7, 2019

Having thought about it a bit more, here's a long shot at solving the problem.

If the stack error is occurring during the VP tree search, you should be able to avoid it by supplying your own NN results. Have a look at scater::runTSNE to see how it's done. For example, you can compute NNs using an approximate method (e.g., Annoy, which still uses trees but these should be much shallower), then pass the NN indices and distances to Rtsne_neighbors.
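
A rough sketch of that workaround, assuming the Bioconductor package BiocNeighbors for the Annoy-based search; dataset_pca is a hypothetical name standing in for your input matrix:

library(Rtsne)
library(BiocNeighbors)  # assumed: Bioconductor package with an Annoy backend

perplexity <- 30

# dataset_pca: your numeric matrix of PCA coordinates (hypothetical name).
# Rtsne_neighbors() needs the 3 * perplexity nearest neighbours of each point.
nn <- findKNN(dataset_pca, k = 3 * perplexity, BNPARAM = AnnoyParam())

# Supplying the neighbours directly skips the recursive VP tree search.
tsne_out <- Rtsne_neighbors(nn$index, nn$distance,
                            perplexity = perplexity, num_threads = 4)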

You're out of luck if the stack error occurs in the SP tree. That said, I have just looked at the SP tree code and the recursion is not as bad as I thought - it just occurs at one function, and it may be a relatively straightforward weekend job to refactor this into an iterative algorithm.
