Uwot/umap crashes when run twice .... #39

Closed
aldojongejan opened this issue Dec 2, 2019 · 40 comments

@aldojongejan

aldojongejan commented Dec 2, 2019

Dear all,

I can't get umap to run twice in an R session without crashing. I first observed this using RunUMAP from Seurat 3.1.1, but the very basic code below also crashes.
I have tried to resolve this using the hints in cole-trapnell-lab/monocle3#186 and satijalab/seurat#2256, but without any success.

Any suggestions?

Session and info:

library(uwot)
iris_umap <- umap(iris, pca = 50)

# And a second time
iris_umap2 <- umap(iris, pca = 50)

# Crash ....
#
# Bioconductor version [1] ‘3.10’
#
# R Under development (unstable) (2019-11-05 r77375)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
#
# Matrix products: default
#
# locale:
# [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C LC_TIME=English_United States.1252
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] uwot_0.1.5 Matrix_1.2-17
#
# loaded via a namespace (and not attached):
# [1] compiler_4.0.0 tools_4.0.0 yaml_2.2.0 Rcpp_1.0.3 grid_4.0.0 FNN_1.1.3 RcppParallel_4.4.4
# [8] lattice_0.20-38

thanks in advance and with kind regards,
Aldo

@jlmelville
Owner

Sorry you are having trouble. I am unable to reproduce the crash you describe, even with R-devel installed. If you get a stack trace, do you get the same error as in the other issues, i.e. memory not mapped when RcppParallel::setThreadOptions() is called?

If so, what does RcppParallel::defaultNumThreads() say? My guess is that a non-integer value of n_threads < 1 is being passed in, which I have just discovered does seem to cause RcppParallel some grief.

@jlmelville
Owner

The current master might solve the issue. Please give it a try if possible and let me know.

@LTLA
Contributor

LTLA commented Dec 2, 2019

Yeah, looks nasty. I can confirm that 0 < n_threads < 1 blows up on my Mac. Probably RcppParallel::defaultNumThreads() is giving 1 on affected machines, so you get n_threads=0.5. Interesting that n_threads=0 works; I would have thought that there would be a cast to integer at some point such that a fractional value would get truncated to zero pretty quickly...
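
To illustrate the failure mode discussed here, a minimal sketch in R of guarding against a fractional thread count before it reaches any threading backend. The helper name `sanitize_n_threads` is hypothetical and not part of uwot, and the halved core count is only an assumed way of arriving at the n_threads=0.5 case mentioned above.

```r
# Hypothetical helper (not part of uwot): coerce a requested thread count
# to a non-negative whole number before handing it to a threading backend.
sanitize_n_threads <- function(n_threads) {
  if (!is.numeric(n_threads) || length(n_threads) != 1 || is.na(n_threads)) {
    stop("n_threads must be a single non-missing number")
  }
  if (n_threads < 0) {
    stop("n_threads must not be negative")
  }
  # floor() maps any value in (0, 1) to 0, i.e. "no threading",
  # instead of passing a fraction through to compiled code.
  as.integer(floor(n_threads))
}

# Assumed scenario from the discussion: one reported core, halved.
one_core <- sanitize_n_threads(1 / 2)    # 0L
eight_cores <- sanitize_n_threads(8 / 2) # 4L
```

With a guard like this, the single-core case falls back to the n_threads=0 path that is reported to work, rather than sending 0.5 onwards.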

@jlmelville
Owner

jlmelville commented Dec 3, 2019

Thanks for confirming the issue @LTLA. That's one problem solved, but is it the problem? I suppose with uninitialized memory anything is possible, although it's odd that it works once then fails, and that more than one person reports in satijalab/seurat#2256 that installing RcppParallel from conda-forge solves the issue. Some compiler difference that initializes the memory differently?

@LTLA
Contributor

LTLA commented Dec 3, 2019

Hmm. The fact that it fails on the second go does suggest it's a memory leak of some sort rather than the n_threads problem (which always fails immediately for me). Valgrind gives me a whole stack of warnings if run on the OP's code, but they all relate to base::eval rather than anything in uwot.

The question is whether this is a memory leak in uwot or RcppParallel, given that the problem was "fixed" by reinstalling the latter from conda. Though given how much conda messes with the libraries, it feels like a house of cards to rely on that to solve this kind of problem.

It would be nice to see what happens if someone can run Valgrind on a machine where the above code crashes. Might be pretty painful to do on Windows, though.

@aldojongejan
Author

Dear @LTLA and @jlmelville ,

I just installed the latest uwot code, reinstalled RcppParallel and ran the code again... it now works!
I should not have reinstalled RcppParallel at the same time, because now I can't confirm whether the changes in the uwot code or the reinstallation of RcppParallel did the trick. I am sorry for that ;-)

RcppParallel::defaultNumThreads() gives 8, by the way (as it did before).
Just to let you know, I had been running the code setting different values for n_threads, but to no avail.

Thanks for all your help!!

@jlmelville
Owner

jlmelville commented Dec 3, 2019

I'm glad it's working now, but I am mystified as to why. Did you reinstall RcppParallel from CRAN or from conda forge?

Edited to add: if n_threads was being set manually, then I am even more baffled. Seems like I will have to do another check of the parallel code to make sure it's not calling an R API at any point before I submit a new version to CRAN.

@aldojongejan
Author

I reinstalled from CRAN (I should have said so in my previous comment). I am also mystified, but I am not well versed enough in programming/tracing/debugging to find the source of what went wrong...
And I don't know how RcppParallel in R works together with RcppParallel in conda.

@LTLA
Contributor

LTLA commented Dec 3, 2019

I will note that in the following chunk of code:

out <- prcomp(as.matrix(iris[,-5]))

library(irlba)
out <- irlba(as.matrix(iris[,-5]), nu=1, nv=1)

library(Rtsne)
out <- Rtsne(as.matrix(iris[,-5]), check_duplicates=FALSE)

library(uwot)
iris_umap <- umap(as.matrix(iris[,-5]), pca = 50, n_threads=1)

# And a second time
iris_umap2 <- umap(iris[,-5], pca = 50, n_threads=1)

Only umap triggers Valgrind warnings. So it actually doesn't seem like a pure eval problem; there seems to be some interaction between something happening in uwot and eval.

The first message looks something like this:

> iris_umap <- umap(as.matrix(iris[,-5]), pca = 50, n_threads=1)
==20371== Invalid read of size 32
==20371==    at 0x7154C91: __wcsnlen_avx2 (strlen-avx2.S:62)
==20371==    by 0x7082EC1: wcsrtombs (wcsrtombs.c:104)
==20371==    by 0x7008B20: wcstombs (wcstombs.c:34)
==20371==    by 0x1BE142: wcstombs (stdlib.h:154)
==20371==    by 0x1BE142: do_makenames (character.c:938)
==20371==    by 0x238822: bcEval (eval.c:7041)
==20371==    by 0x24519F: Rf_eval (eval.c:688)
==20371==    by 0x246F4E: R_execClosure (eval.c:1852)
==20371==    by 0x247C44: Rf_applyClosure (eval.c:1778)
==20371==    by 0x23C1C4: bcEval (eval.c:7009)
==20371==    by 0x24519F: Rf_eval (eval.c:688)
==20371==    by 0x246F4E: R_execClosure (eval.c:1852)
==20371==    by 0x247C44: Rf_applyClosure (eval.c:1778)
==20371==  Address 0x1136db90 is 0 bytes inside a block of size 12 alloc'd
==20371==    at 0x4C31B25: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==20371==    by 0x27A750: R_chk_calloc (memory.c:3422)
==20371==    by 0x1BE0D0: do_makenames (character.c:931)
==20371==    by 0x238822: bcEval (eval.c:7041)
==20371==    by 0x24519F: Rf_eval (eval.c:688)
==20371==    by 0x246F4E: R_execClosure (eval.c:1852)
==20371==    by 0x247C44: Rf_applyClosure (eval.c:1778)
==20371==    by 0x23C1C4: bcEval (eval.c:7009)
==20371==    by 0x24519F: Rf_eval (eval.c:688)
==20371==    by 0x246F4E: R_execClosure (eval.c:1852)
==20371==    by 0x247C44: Rf_applyClosure (eval.c:1778)
==20371==    by 0x23C1C4: bcEval (eval.c:7009)

Definitely cryptic enough to be a parallelization issue. Rtsne also parallelizes but via OpenMP, which is generally more restrictive so it's harder to accidentally put in R API calls.

@jlmelville
Owner

Oops. Possibly maybe someone who shall remain nameless (spoiler alert: it's me) is calling the R random number generator from inside a thread? Fixing this isn't conceptually difficult, but requires a fair bit of typing (because it's C++) so might not get finished until later today.

@jlmelville
Owner

I forgot to tag the commit with this issue, but what's currently on master should hopefully behave. @LTLA, if you ever install from master, re-running valgrind would be an interesting exercise.

@LTLA
Contributor

LTLA commented Dec 4, 2019

master doesn't get rid of the valgrind warnings, but I did manage to track them down to find_ab_params(), most likely the stats::nls() call therein. Running umap() with specified a and b arguments avoids the warnings. This may well be a false positive, it's hard to believe that a base function would be compromised like that; I call nls all over the place in my own functions.
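
As a sketch of the workaround mentioned above, the a and b curve parameters can be fitted once and then passed to umap() explicitly. The function `fit_ab` below is a hypothetical stand-in written with base R only. It assumes the usual UMAP curve 1 / (1 + a * x^(2b)) and the spread and min_dist values shown, and it is not claimed to match find_ab_params() exactly.

```r
# Hypothetical stand-in for the curve fit; assumes the standard UMAP
# output-weight curve 1 / (1 + a * x^(2 * b)).
fit_ab <- function(spread = 1, min_dist = 0.01) {
  xv <- seq(from = 0, to = spread * 3, length.out = 300)
  # Target curve: 1 below min_dist, exponential decay beyond it.
  yv <- ifelse(xv < min_dist, 1, exp(-(xv - min_dist) / spread))
  fit <- stats::nls(yv ~ 1 / (1 + a * xv^(2 * b)),
                    start = list(a = 1, b = 1))
  stats::coef(fit)
}

ab <- fit_ab()

# Pass the fitted values explicitly (only runs if uwot is installed).
if (requireNamespace("uwot", quietly = TRUE)) {
  iris_umap <- uwot::umap(as.matrix(iris[, -5]),
                          a = ab[["a"]], b = ab[["b"]], n_threads = 1)
}
```

If the Valgrind warnings disappear with explicit a and b but return without them, that would point at the fitting step rather than at the optimization code.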

@jlmelville
Owner

FWIW, I directly ran valgrind and RDcsan in the container provided by https://github.com/wch/r-debug (RDsan seems to not work well with building RcppEigen) and did not see anything flagged that wasn't already something that shows up in the CRAN checks for RcppAnnoy and RcppParallel.

A new version of uwot is now on CRAN with the two fixes unearthed in this issue.

@theboocock

I am still having issues with this even with the new version on CRAN. I reinstalled everything and tried again but, as before, I get a segfault on the second run of the example.

@LTLA
Contributor

LTLA commented Dec 10, 2019

Operating system?

@theboocock

It is a Linux cluster, which unfortunately is running the 2.6 kernel. However, there don't seem to be any major issues with any other R package.

Linux n6426 2.6.32-754.14.2.el6.x86_64 #1 SMP Tue May 14 19:35:42 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

@LTLA
Contributor

LTLA commented Dec 10, 2019

Do you have valgrind installed? 🤞

If so, could you try copying the OP's code into some file (e.g., test.R) and running:

R CMD BATCH --no-save -d valgrind test.R

and seeing what test.Rout gives? If you don't have valgrind installed, the top should just say that valgrind isn't available. If you do have it installed, it should have some blurb at the top with memcheck blah blah blah and then hopefully give some diagnostics before the crash.

Those diagnostics would be extremely helpful.

@jlmelville
Owner

Also could you run

iris_umap <- umap(iris, pca = 50, verbose = TRUE, n_threads = 0)

twice in a row, as well as repeating twice with:

iris_umap <- umap(iris, pca = 50, verbose = TRUE, n_threads = 1)

and see if either makes a difference, providing the output for the second crashing run. Getting a clue to where the second crash occurs would be helpful (although I suspect the damage is already done at some point in the first run).

@jlmelville
Owner

jlmelville commented Dec 10, 2019

Edit: there is an explicit check that the number of components does not exceed the number of columns in the input, so for iris, the pca = 50 argument should be able to be omitted without affecting the crash.
Also does omitting pca = 50 make a difference? For iris, that step should be skipped anyway (I think — I’m away from my computer at the moment) because the input data doesn’t have sufficient rank to extract 50 components. It would be good to get a minimal reproducible example.
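
The check described in this comment can be pictured with a short sketch. The helper `pca_is_applied` is hypothetical and only illustrates the rule as stated here: PCA is skipped when the requested number of components is not smaller than the number of input columns. The exact boundary condition inside uwot is an assumption.

```r
# Hypothetical helper illustrating the rule described above; the exact
# boundary condition used inside uwot is an assumption here.
pca_is_applied <- function(X, pca = NULL) {
  !is.null(pca) && pca < ncol(X)
}

X <- as.matrix(iris[, -5])   # 150 rows, 4 numeric columns
pca_is_applied(X, pca = 50)  # FALSE: 50 components cannot come from 4 columns
pca_is_applied(X, pca = 2)   # TRUE
pca_is_applied(X)            # FALSE: no PCA requested
```

Under that rule, dropping pca = 50 from the iris example should leave the computation unchanged, which is why it makes a cleaner minimal reproducible example.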

@aldojongejan
Author

Have you tried reinstalling RcppParallel from CRAN after installing uwot etc.? It seemed to help for me...

@LTLA
Contributor

LTLA commented Dec 10, 2019

If @theboocock has valgrind installed, please try running our suggested commands above before attempting reinstallation. This would be a rare opportunity to identify the problem on a known failing machine and to fix it once and for all - such chances are hard to come by.

@aldojongejan
Author

@LTLA , you're completely right!! Apologies for suggesting reinstallation BEFORE checking....

@theboocock

theboocock commented Dec 10, 2019

Hey all,

Reinstalling never does anything for me anyways. Here is the valgrind error. Seems like it is coming from libtbb.

Is this post relevant https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/641654?


> 
> library(irlba)
Loading required package: Matrix
> out <- irlba(as.matrix(iris[,-5]), nu=1, nv=1)
> 
> library(Rtsne)
> out <- Rtsne(as.matrix(iris[,-5]), check_duplicates=FALSE)
> 
> library(uwot)
> iris_umap <- umap(as.matrix(iris[,-5]), pca = 50, n_threads=1)
==264614== Invalid read of size 8
==264614==    at 0x1BB2A5E8: ??? (in /u/project/kruglyak/smilefre/anaconda3/lib/R/library/RcppParallel/lib/libtbb.so.2)
==264614==    by 0x1BDAB1FF: ???
==264614==    by 0x1BDC757F: ???
==264614==  Address 0xfffffffffffffff7 is not stack'd, malloc'd or (recently) free'd
==264614== 

 *** caught segfault ***
address 0xfffffffffffffff7, cause 'memory not mapped'

Traceback:
 1: RcppParallel::setThreadOptions(numThreads = n_threads)
 2: uwot(X = X, n_neighbors = n_neighbors, n_components = n_components,     metric = metric, n_epochs = n_epochs, alpha = learning_rate,     scale = scale, init = init, init_sdev = init_sdev, spread = spread,     min_dist = min_dist, set_op_mix_ratio = set_op_mix_ratio,     local_connectivity = local_connectivity, bandwidth = bandwidth,     gamma = repulsion_strength, negative_sample_rate = negative_sample_rate,     a = a, b = b, nn_method = nn_method, n_trees = n_trees, search_k = search_k,     method = "umap", approx_pow = approx_pow, n_threads = n_threads,     n_sgd_threads = n_sgd_threads, grain_size = grain_size, y = y,     target_n_neighbors = target_n_neighbors, target_weight = target_weight,     target_metric = target_metric, pca = pca, pca_center = pca_center,     pcg_rand = pcg_rand, fast_sgd = fast_sgd, ret_model = ret_model,     ret_nn = ret_nn, tmpdir = tempdir(), verbose = verbose)
 3: umap(as.matrix(iris[, -5]), pca = 50, n_threads = 1)
An irrecoverable exception occurred. R is aborting now ...
--264614-- VALGRIND INTERNAL ERROR: Valgrind received a signal 11 (SIGSEGV) - exiting
--264614-- si_code=1;  Faulting address: 0x20000038;  sp: 0x402efbf50

valgrind: the 'impossible' happened:
   Killed by fatal signal
==264614==    at 0x38047487: vgPlain_get_StackTrace_wrk (m_stacktrace.c:334)
==264614==    by 0x3804756B: vgPlain_get_StackTrace (m_stacktrace.c:1086)
==264614==    by 0x3802F82E: record_ExeContext_wrk (m_execontext.c:314)
==264614==    by 0x38002A84: die_and_free_mem (mc_malloc_wrappers.c:361)
==264614==    by 0x3807A59A: vgPlain_scheduler (scheduler.c:1665)
==264614==    by 0x3803B63E: final_tidyup (m_main.c:2656)
==264614==    by 0x3803B767: shutdown_actions_NORETURN (m_main.c:2457)
==264614==    by 0x380A656B: run_a_thread_NORETURN (syswrap-linux.c:199)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable

@jlmelville
Owner

@theboocock, thank you for running valgrind. Do you know if you are running any other packages that use RcppParallel? There are definitely some similar issues with memset and gcc6 but I'm loath to prematurely put the blame on TBB.

@LTLA
Contributor

LTLA commented Dec 11, 2019

Well, it looks like it isn't even hitting uwot's C++ code, so it's hard to blame anything else... An even simpler test would be whether running RcppParallel::setThreadOptions() on its own triggers the error, i.e.,

# valgrind me:
library(RcppParallel)
setThreadOptions(numThreads = 1)

If so, that seems like a slam dunk, though the use of conda does complicate matters.

@jlmelville
Owner

@LTLA, seeing as we get through one run without a crash, is it possible that uwot just stomps all over some memory that RcppParallel or tbb is using? Seems like it's hard to completely rule out uwot being the villain. I'll have a look at finding a container with gcc6 in it and see if it can be reproduced.

@LTLA
Contributor

LTLA commented Dec 11, 2019

I was looking at @theboocock's valgrind output above, where umap fails the first time it runs. (The difference from a non-valgrind context is expected.) Either that, or I've had one too many G&T's.

It's also possible that irlba() or Rtsne() are doing something Bad... which would be even more concerning. The minimal example would be clarifying. So, either just:

# Put into test1.R with nothing else, and run under valgrind:
library(RcppParallel)
setThreadOptions(numThreads = 1)

Or, if the above doesn't trigger the error, then:

# Put into test2.R with nothing else, and run under valgrind:
library(uwot)
iris_umap <- umap(as.matrix(iris[,-5]), pca = 50, n_threads=1)

@jlmelville
Owner

I wonder if benjjneb/dada2#684 is a related problem? There are some suspicious similarities.

@theboocock

library(RcppParallel)
setThreadOptions(numThreads = 1)

Triggers the error for me. I am going to try the dada2 solution now.

@theboocock

@jlmelville Yes! That fixed it. The patch added to rcpp-parallel on conda works for me.

conda install -c conda-forge r-rcppparallel

@theboocock

This seems like the key piece, in build.sh:


if [[ $target_platform =~ linux.* ]]; then
  # The vendored TBB library adds compile-time flags based on a probe of gcc,
  # this little "hack" ensures that the `gcc` executable is available when
  # TBB is built.
  mkdir $PWD/hack
  export PATH="$PWD/hack:$PATH"
  ln -s $CC $PWD/hack/gcc
  chmod +x $PWD/hack/gcc
fi

@LTLA
Contributor

LTLA commented Dec 12, 2019

So the takeaway is that if you're running R under conda, you should be installing RcppParallel via conda as well? Not the most intuitive outcome, but tolerable. Possibly another thing to throw into the README; maybe it's worth having an entire section on "Known problems" along with the .Rprofile issue.
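
A rough way to check for the setup described above from inside R is sketched below. `path_is_under` is a hypothetical helper. Relying on the CONDA_PREFIX environment variable to detect a conda-managed R, and on a record under conda-meta to detect a conda-installed r-rcppparallel, are both assumptions that may not hold on every installation. Note that a package installed from CRAN into a conda R still lands inside the conda library directory, so the install path alone does not tell the two apart.

```r
# Hypothetical helper: TRUE if `path` lies inside directory `prefix`.
path_is_under <- function(path, prefix) {
  if (!nzchar(path) || !nzchar(prefix)) return(FALSE)
  path <- normalizePath(path, winslash = "/", mustWork = FALSE)
  prefix <- normalizePath(prefix, winslash = "/", mustWork = FALSE)
  startsWith(paste0(path, "/"), paste0(sub("/+$", "", prefix), "/"))
}

conda_prefix <- Sys.getenv("CONDA_PREFIX")  # empty if no conda env is active
r_in_conda <- path_is_under(R.home(), conda_prefix)

# Assumption: conda keeps one JSON record per installed package here.
conda_records <- list.files(file.path(conda_prefix, "conda-meta"),
                            pattern = "^r-rcppparallel-.*\\.json$")

if (r_in_conda && length(conda_records) == 0) {
  message("R appears to run from conda, but no conda record for ",
          "r-rcppparallel was found; consider installing it from conda-forge.")
}
```

A leftover conda record does not prove that the conda build is the one currently loaded, for example if the package was later reinstalled from CRAN on top of it, so treat the message as a hint only.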

@jlmelville
Owner

Yes, I was hoping to work out if there is a lesson learned here. I don't want to mislead anyone. I don't have any experience using conda for R packages, just Python, and I have no knowledge of bioconda. Is it safe in general to mix CRAN and conda packages or is that always ill-advised?

@jlmelville
Owner

@aldojongejan, you mentioned that you reinstalled RcppParallel from CRAN to fix the issue. Do you know if you had previously installed from conda? Or if you had a mix of conda-installed and CRAN-installed packages?

@aldojongejan
Author

I worked with the development version of R to get the latest versions of Seurat and SingleR working. I guess that also installed RcppParallel. Then, as a possible fix for my problem, I installed it via conda as suggested here (cole-trapnell-lab/monocle3#186). That didn't solve it for me; it only worked when I later reinstalled RcppParallel from CRAN again (removing the RcppParallel directory from the 'library' folder etc.)... I should have paid more attention and documented the exact steps...

@jlmelville
Owner

@aldojongejan, no problem, thank you for opening the issue here in the first place.

@aldojongejan
Author

Ha ha, opening the issue was not a real problem ;-)
I am sorry I couldn't be of more help, and I really appreciate all the work you guys put into helping me out and solving the problem!

@jlmelville
Owner

uwot 0.1.8 removes the dependency on RcppParallel so hopefully these problems are gone.

@aldojongejan
Author

aldojongejan commented Mar 16, 2020 via email

@jlmelville
Owner

Hopefully this is solved. Closing.

yuhanH pushed a commit to yuhanH/uwot that referenced this issue Jul 20, 2020