This repository has been archived by the owner on Dec 30, 2017. It is now read-only.

Results differ based on number of threads #7

Open
claczny opened this issue Sep 1, 2017 · 3 comments

claczny commented Sep 1, 2017

I observed that the results differ depending on the number of threads specified.

In my application, which uses BH-SNE to create a 2D embedding followed by automated clustering with DBSCAN, I replaced the single-threaded Rtsne call with a call to your multi-threaded Rtsne.multicore. This was nice and easy thanks to the similarity of the two interfaces.

However, when I run the application, the results differ slightly, as shown below (just the first ten points each time):

Using 1 thread

-4.3473001944841 -9.88816236259427
-0.264536173449281 2.26121958696939
-11.8037471711157 -1.23420653192463
18.5043209507443 -13.4638139443446
1.51823629529208 -27.2209786228982
8.44296382274354 11.5004388863181
17.0385503073606 -19.5842234534257
-1.80122124653633 -35.1542911986375
-14.9339466535662 11.4724805072396
-16.7179891732902 10.300907221322

Using 2 threads

-4.33102494052646 -9.94346771160292
-0.300330796745644 2.47627128482164
-14.4865548712467 3.83169546954971
18.0266761572745 -13.3481838170748
1.55009711170931 -27.3536683521347
8.57133969496983 11.704078885386
16.8146752705904 -19.4804761345993
-1.67702875389705 -35.6116919363096
-16.328562693303 10.9834569354747
-17.9212513482976 10.1738069116024

Using 3 threads

-4.15202535615338 -9.91628914440292
-0.266922842312901 2.30165398545058
-12.0458514750223 -1.26327092092668
18.3116039523395 -13.4472311793933
1.8728867702686 -27.0478452540983
8.21259960134093 11.338018514761
16.938103908809 -19.4664656504238
-1.51129210868152 -35.5926372619633
-15.7107052664802 10.622091607029
-16.9275577907434 10.5760540704756

Using 4 threads

-4.40493207317474 -10.2542865145978
-0.240311071414228 2.34386945654285
-11.613066543124 -1.22167721092907
17.978213066292 -13.6367838896947
1.68103298346623 -27.3950001130062
8.48320430773571 11.5841961868582
16.5975194709815 -19.6467988772466
-1.21063128661383 -35.6738754692542
-16.2962040171112 11.6000609166704
-16.4988660902924 10.7927849813962

The results using the same number of threads seem to be consistent between different runs, though, which is good at least :)

Using 1 thread - a second run

-4.3473001944841 -9.88816236259427
-0.264536173449281 2.26121958696939
-11.8037471711157 -1.23420653192463
18.5043209507443 -13.4638139443446
1.51823629529208 -27.2209786228982
8.44296382274354 11.5004388863181
17.0385503073606 -19.5842234534257
-1.80122124653633 -35.1542911986375
-14.9339466535662 11.4724805072396
-16.7179891732902 10.300907221322

And for all the points, computing the MD5SUM:

cat ./one_threads/one.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
2410c2539be68ffe1f52d1be0f04bfac  -
cat ./one_threads_old/one.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
2410c2539be68ffe1f52d1be0f04bfac  -
cat ./two_threads/two.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
1f7dd4212d74b162420c79e619b3b91b  -
cat ./three_threads/three.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
f659b3527318c9545766fed14fc72daa  -
cat ./four_threads/four.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
0e7425b7acf3438d047fb1550bbd069f  -

While the differences are hard to spot by eye in a 2D scatterplot, the automatic clustering is affected by them.
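To put a number on "hard to spot by eye", the first ten points listed above for 1 and 2 threads can be compared point by point. This is only a sketch in base R: the variable names are made up for illustration, and the values are copied from the listings above.

```r
# First ten embedding points, copied from the 1-thread and 2-thread listings.
one_thread <- matrix(c(
  -4.3473001944841,   -9.88816236259427,
  -0.264536173449281,  2.26121958696939,
  -11.8037471711157,  -1.23420653192463,
  18.5043209507443,  -13.4638139443446,
  1.51823629529208,  -27.2209786228982,
  8.44296382274354,   11.5004388863181,
  17.0385503073606,  -19.5842234534257,
  -1.80122124653633, -35.1542911986375,
  -14.9339466535662,  11.4724805072396,
  -16.7179891732902,  10.300907221322
), ncol = 2, byrow = TRUE)

two_threads <- matrix(c(
  -4.33102494052646,  -9.94346771160292,
  -0.300330796745644,  2.47627128482164,
  -14.4865548712467,   3.83169546954971,
  18.0266761572745,  -13.3481838170748,
  1.55009711170931,  -27.3536683521347,
  8.57133969496983,   11.704078885386,
  16.8146752705904,  -19.4804761345993,
  -1.67702875389705, -35.6116919363096,
  -16.328562693303,   10.9834569354747,
  -17.9212513482976,  10.1738069116024
), ncol = 2, byrow = TRUE)

# Euclidean distance between corresponding points of the two runs.
shift <- sqrt(rowSums((one_thread - two_threads)^2))
print(round(shift, 3))

# Index of the point that moved the most between the two runs.
print(which.max(shift))
```

Among these ten points, the third one moves noticeably further than the others, which may be the kind of shift that changes a density-based clustering even when the overall picture looks the same.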

Your input is greatly appreciated!

Best,

Cedric

claczny (Author) commented Sep 1, 2017

I explored this further; here is a minimal working example:

library(Rtsne.multicore) # Load package
library(digest)
iris_unique <- unique(iris) # Remove duplicates
mat <- as.matrix(iris_unique[,1:4])
set.seed(42) # Sets seed for reproducibility
tsne_out1 <- Rtsne.multicore(mat, num_threads = 1) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out1_2 <- Rtsne.multicore(mat, num_threads = 1) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out2 <- Rtsne.multicore(mat, num_threads = 2) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out2_2 <- Rtsne.multicore(mat, num_threads = 2) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out3 <- Rtsne.multicore(mat, num_threads = 3) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out3_2 <- Rtsne.multicore(mat, num_threads = 3) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out4 <- Rtsne.multicore(mat, num_threads = 4) # Run TSNE
print(digest(tsne_out1))
print(digest(tsne_out1_2))
print(digest(tsne_out2))
print(digest(tsne_out2_2))
print(digest(tsne_out3))
print(digest(tsne_out3_2))
print(digest(tsne_out4))

and some demo output from Rstudio:

> source('~/.active-rstudio-document')
[1] "6adbcd6eb0106f49c7ac0a99eae369fc"
[1] "6adbcd6eb0106f49c7ac0a99eae369fc"
[1] "6269caaf71aca51ca57e2ead7425a14f"
[1] "6269caaf71aca51ca57e2ead7425a14f"
[1] "82974082989bc301349e03f3d9ee5c5b"
[1] "a8c779d9a4f54f2c14d84b624ffe9da9"
[1] "ccc0b4af068a4c2005504c0b1493e256"
> source('~/.active-rstudio-document')
[1] "6adbcd6eb0106f49c7ac0a99eae369fc"
[1] "6adbcd6eb0106f49c7ac0a99eae369fc"
[1] "6269caaf71aca51ca57e2ead7425a14f"
[1] "6269caaf71aca51ca57e2ead7425a14f"
[1] "b3479248cefc9b979521e13b25418223"
[1] "07dd9ce0d52e0cb0d1332f8d4849675c"
[1] "8b3a73318d64dd07f96ecdc2e06251d5"

As you can see, the results are consistent between runs that use the same number of threads (here, 1 or 2 threads), yet they differ between different numbers of threads.
Moreover, I am confused as to why the results for 3 and 4 threads differ between the two runs, i.e., behave differently from those for 1 or 2 threads.

This is quite puzzling to me.
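Since digest() only reports whether two results are bit-for-bit identical, a numeric comparison can show how far apart two runs actually are. The sketch below uses small stand-in matrices rather than real Rtsne.multicore output; the helper name compare_embeddings and the tolerance value are assumptions made for illustration. With real results one would pass the embedding matrices themselves (for Rtsne-style output, presumably the Y element of the returned list).

```r
# Hypothetical helper: summarise how two embeddings of the same shape differ.
compare_embeddings <- function(a, b, tol = 1e-8) {
  stopifnot(identical(dim(a), dim(b)))
  abs_diff <- abs(a - b)
  list(
    identical  = identical(a, b),      # bit-for-bit equality, like comparing digests
    max_diff   = max(abs_diff),        # largest difference in any coordinate
    within_tol = all(abs_diff <= tol)  # equal up to the chosen tolerance
  )
}

# Stand-in data, not real t-SNE output.
a <- matrix(c(1, 2, 3, 4), ncol = 2)
b <- a
b[1, 1] <- b[1, 1] + 1e-12  # tiny perturbation in a single coordinate

res <- compare_embeddings(a, b)
print(res)
```

A comparison like this separates "differs only in the last few bits" from "points have visibly moved", which a hash cannot do.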

gfinak (Member) commented Sep 1, 2017

Not sure, as we didn't implement the multicore support ourselves; we just wrapped an existing implementation.
See jkrijthe/Rtsne#16: the Rtsne package has integrated the same multicore support. I'd suggest checking whether that package shows the same issue and discussing it with its author.
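The cause is not established in this thread. One common reason for thread-count-dependent output in parallel numerical code in general is that floating-point addition is not associative, so splitting a sum across a different number of workers can change the last bits of the result. Whether that is what happens in the wrapped implementation is an assumption. The base R sketch below only illustrates the effect; chunked_sum is a hypothetical helper and does not use any threads.

```r
# Hypothetical helper: sum x by splitting it into n_chunks contiguous pieces,
# summing each piece left to right, then adding the partial sums in order.
# This mimics the order of operations of a chunked reduction, without threads.
chunked_sum <- function(x, n_chunks) {
  groups <- ceiling(seq_along(x) * n_chunks / length(x))
  partials <- vapply(split(x, groups),
                     function(p) Reduce(`+`, p),
                     numeric(1))
  Reduce(`+`, unname(partials))
}

x <- c(0.1, 0.2, 0.3)
s1 <- chunked_sum(x, 1)  # (0.1 + 0.2) + 0.3
s2 <- chunked_sum(x, 2)  # 0.1 + (0.2 + 0.3)

print(sprintf("%.17g", c(s1, s2)))
print(s1 == s2)
```

In an iterative optimisation such as t-SNE, differences of this size could plausibly be amplified over many iterations. Variation between runs at a fixed thread count, as reported above for 3 and 4 threads, would need some additional source of nondeterminism that this sketch does not model.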

claczny (Author) commented Sep 4, 2017

Thanks for the answer.

Maybe you can tell me what I have missed, but the Rtsne package does not (yet) seem to contain parallelisation support. Instead, there seems to be a "derivative" of it (https://github.com/rappdw/tsne), which appears to be Python-based and currently has no wrapper for convenient use in R.
