
CRAN submission #2

Closed
ttriche opened this issue Jul 31, 2018 · 29 comments · Fixed by #3

Comments

@ttriche
Contributor

ttriche commented Jul 31, 2018

see imminent pull request for details

ttriche added a commit to ttriche/uwot that referenced this issue Jul 31, 2018
eliminates the ERROR that is thrown by R CMD check and, in principle, will allow the package to head to CRAN

in this respect, it fixes jlmelville#2
@LTLA
Contributor

LTLA commented Dec 8, 2018

This package looks great. I can't believe it's not on CRAN yet. Is there a hold-up, and can I help?

@jlmelville
Owner

Thanks for the kind words.

There's no hold up on CRAN submission, I just don't have any plans to do so at the moment. Here's the roadmap/list of objections:

  1. There's already a UMAP package on CRAN: https://cran.r-project.org/package=umap.
  2. My previous experience with submitting to CRAN suggests that I will have to do a bit of justifying the package's existence (especially due to point 1).
  3. Packages with C++ code in them get extra attention (rightfully so). In good conscience, I don't want to submit without having read and acted upon the Writing R Extensions guide first, which -- spoiler alert -- I have not done.
  4. Relatedly, the largeVis package was removed from CRAN for memory violation issues. I don't know exactly what the problem was, but as UMAP uses a very similar SGD strategy, it would benefit from extra pairs of eyes from those with Rcpp experience to confirm that uwot isn't doing anything egregious, especially with how I pass around references to matrices. I may just have got lucky so far, in that I have enough RAM for the datasets I look at and/or no garbage collection has kicked in at an inopportune moment.

Basically, I've been very lazy, but I also want to be respectful of the CRAN maintainers' time if the submission runs into issues. However, this sort of reminder is a useful kick in the pants. If having to install via devtools is inconveniencing people, I will think more about CRAN submission. I'll leave this issue open.

@jlmelville jlmelville reopened this Dec 8, 2018
@LTLA
Contributor

LTLA commented Dec 8, 2018

1. and 2. I'm glad you brought this up. I don't think you'll have a problem there:

vals <- readRDS("pbmc68k.rds") # first 50 PCs of the 10X PBMC 68K data set
test <- vals[1:10000,]

library(umap)
system.time(X <- umap::umap(test, n_neighbors=15))
##    user  system elapsed 
##  74.370   3.949  79.067 

library(uwot)
system.time(Y <- uwot::umap(test, n_neighbors=15))
##    user  system elapsed 
##  22.079   0.560  12.822 

umap::umap does most of its heavy lifting in R, which explains the difference in speed.

3. and 4. I can probably help with that. I looked at your source code a while ago, and it seemed alright. Does it go through the various platforms on rhub okay?

My motivation is to depend on uwot in my own packages, which requires uwot to be on CRAN (or Bioconductor); I can't pull it in via devtools.

@jlmelville
Owner

I didn't know about rhub, thanks for the pointer! I have a lot of platform checking in my future, it seems.

I'm sure I've committed several other crimes against good coding hygiene, but the two major bits of dirty laundry I need to air (and ask for opinions on) are:

  1. In uwot::uwot (in uwot.R), the optimization C++ code is entered with a call a bit like:
result <- optimize_layout_umap(
      head_embedding = embedding,
      tail_embedding = embedding,
      positive_head = positive_head,
      positive_tail = positive_tail,
      epochs_per_sample = epochs_per_sample, ...)

head_embedding and tail_embedding are passed by reference. So when optimize_layout_umap returns, embedding is modified, which is completely against the pass-by-value semantics of R function calls. I've done this because I want to avoid storing a copy of embedding, especially as we don't need the initial coordinates any more: I assign the result of optimize_layout_umap to embedding anyway.

I've got away with it so far (in the sense that it hasn't generated a segfault), but it seems like a complete no-no to me. Do I need to bite the bullet and ensure a copy is passed?

  2. The other thing I do that seems dodgy is that positive_head, positive_tail and epochs_per_sample end up on the C++ side (in optimize_layout_umap in optimize.cpp) as arma::uvec and arma::vec, passed by copy. These are then passed by reference into the SGDWorker, where they are read from (but not written to) by multiple threads, without being wrapped in the thread-safe RcppParallel classes.

As far as I recall from some initial experiments comparing how RcppArmadillo vectors and Rcpp NumericVectors behave, this seemed safe, in that the data appeared to be genuinely copied and hence moved out of R's clutches, so it couldn't get garbage collected or moved under me. But I could be wrong. The alternatives I tried were real performance killers, unfortunately.

It feels good to have confessed to these sins.

@LTLA
Contributor

LTLA commented Dec 9, 2018

Yeah, 1 is definitely bad. You should allocate new memory to store the result; but that should be cheap, I wouldn't worry about it given that the optimize_layout_umap function is only called once anyway.

2 is harder for me to tell; I usually use OpenMP to do my parallelization. My guess is that you should be fine for read-only access; you should only need thread safety for writes. It shouldn't even compile if the objects were being modified... in theory, at least, who knows what's going on in there.

More generally, I wonder if you really need Armadillo for what you're doing; it seems that raw vectors might be sufficient for access. (I don't see a whole lot of matrix operations, though I'm not really familiar with Armadillo so I might have missed them.)

I also participated in the parallelization of Rtsne, from which we learnt some interesting lessons about the order of elements during reductions (see jkrijthe/Rtsne#16). I don't see any parallelReduce so you should be okay... but I wonder whether or not it would be simpler to use OpenMP instead.

Both of these last two points are mostly a matter of taste, so ignore them as you please; that's just where my experience tends to be concentrated.

@jlmelville
Owner

Yeah, 1 is definitely bad. You should allocate new memory to store the result; but that should be cheap, I wouldn't worry about it given that the optimize_layout_umap function is only called once anyway.

1 is now fixed. It wasn't the speed of allocation that bothered me, but making extra copies of potentially large matrices. This is at best unnecessary, and at worst for standard UMAP, if the code ends up working on two separate copies of the input coordinates, the results come out wrong. But despite the best efforts of the compiler to conspire against me, I have emerged victorious. Probably.

More generally, I wonder if you really need Armadillo for what you're doing, it seems that raw vectors might be sufficient for access. (I don't see a whole lot of matrix operations, though I'm not really familiar with Armadillo so I might have missed them.)

I don't use Armadillo for linear algebra, just for its convenient row and column subsetting, vectorized functions and so on. I assume you are suggesting that I can just do what Rtsne does and work with double * and for loops directly? This certainly has an appeal to me (and makes me nostalgic for my Fortran 77 days), but I find the multidimensional pointer arithmetic error-prone, hard to debug and hard to understand, so it might take me a while to muster the enthusiasm.

I also participated in the parallelization of Rtsne, from which we learnt some interesting lessons about the order of elements during reductions (see jkrijthe/Rtsne#16).

That is an interesting read. Looks like the reduce occurs during the normalization to form the Q matrix? Fortunately UMAP is un-normalized so that never has to happen.

@jlmelville
Owner

Item 3 on my TODO list: for the spectral initialization, with what I assume are very poorly conditioned input matrices, RSpectra sometimes seems to hang (or at least take an unacceptably long time; I've never stuck around long enough to find out how long that is). I need to see if there's a way to get it to bail out earlier, or whether there's a better way to find the small eigenvalues needed for the normalized Laplacian.

@ttriche
Contributor Author

ttriche commented Dec 9, 2018 via email

@LTLA
Contributor

LTLA commented Dec 12, 2018

Just to motivate this discussion: https://github.com/davismcc/scater/tree/uwot. So once uwot gets on CRAN, it'll have at least one reverse dependency straight away.

@ttriche
Contributor Author

ttriche commented Dec 13, 2018 via email

@LTLA
Contributor

LTLA commented Dec 13, 2018

Though I guess CRAN doesn't track Bioc revdeps, so you wouldn't see them on the landing page. But @jlmelville can use that as an argument for getting CRAN to accept your package.

@jlmelville
Owner

This is probably blocked until I at least understand what's causing the Mac issues with #1.

@jlmelville
Owner

jlmelville commented Jan 16, 2019

Other things to fix:

- [ ] Avoid writing Annoy index to disk during multi-threaded NN search.
      (R policy allows writing to the R session temp directory, so this is ok: https://github.com/eddelbuettel/crp/blob/4dec483eadf5b642726a54099322a6e161392e49/texi/CRAN_policies.texi#L255)
- [ ] Can't use C++11 RNG.

@LTLA
Contributor

LTLA commented Jan 21, 2019

I was going to say that if you can't use <random>, you could use boost/random.hpp or something from sitmo or dqrng, which would save you from having to roll your own PRNG in with the package. But I guess it doesn't really matter if you're happy with the current PRNG.

I should also add that:

some_rng() % tail_nvert

... is technically not correct (in terms of producing a uniform range in [0, tail_nvert)) unless the largest unsigned integer that some_rng() can produce is N*tail_nvert - 1 for some integer N. A more correct approach would be to use std::uniform_int_distribution (or its boost equivalent). This may or may not matter depending on the range of some_rng() and the size of tail_nvert, but why take the chance?

@jlmelville
Owner

For now, I have tried to stick to copying how the Python UMAP implementation works. There is a slight complication in that the Python % operator works differently from the C++ version, but it doesn't seem to make a big difference in practice.

I may take a look at pulling in Boost (although does anyone know why Boost's RNG is OK to use when the C++11 RNGs aren't?) to generate integers from a uniform distribution the correct way, and see what effect it has on the speed of UMAP's optimization. It's not very high on my priority list at the moment, unless something demonstrably horrible is occurring with the current approach.

@khughitt
Contributor

khughitt commented Feb 5, 2019

Package may also be used in dimRed once it's available on CRAN.

@LTLA
Contributor

LTLA commented Feb 5, 2019

I was just thinking about this - we should probably get the ball rolling again.

@jlmelville, regarding Boost's random headers: the major difference (for me) is that the C++ <random> library does not use the same algorithm in its distribution functions across platforms. The standard only mandates the output of the PRNGs themselves, not how the random number stream is converted into the variates that we actually use. For example, std::normal_distribution has to produce normally distributed values, but it doesn't have to produce the same series of normally distributed values between clang and GCC.

This is a real problem, because I ran into it! I don't understand why the C++ standard was written like that; it basically makes <random> unusable for scientific computing, where a result must be reproducible across systems. That's why I only use boost/random.hpp rather than <random>.

If we were to follow the spirit of the law, I would guess that we should not use Boost's random headers either. But there are some real limitations of R's C API for random number generation (see, for example, my comments in daqana/dqrng#11), and the best solution in such cases would be Boost.

@ttriche
Contributor Author

ttriche commented Feb 6, 2019 via email

@jlmelville
Owner

Thanks for explaining the difference between C++11 and Boost's random headers, @LTLA, that makes sense. I am open to considering either Boost or https://github.com/daqana/dqrng.

I have not abandoned preparing this for CRAN, I have just got distracted with other things. Mainly, I am still a bit dissatisfied with the speed of the nearest neighbor calculations. I have created https://github.com/jlmelville/rnndescent to implement the nearest neighbor descent method used by the Python UMAP (admittedly it also does a different initialization) and by LargeVis, but that is still a work in progress.

Also, #16 (spectral initialization basically taking forever) is still not resolved to my liking. Will I have to implement a LOBPCG routine for R? I haven't checked how painful that would be.

Finally, #19 required adding custom save/load functions. This feels like the sort of thing that shouldn't be necessary, but I don't know whether you can write custom serialization hooks for C++ objects that are wrapping external data. This might just be the way it is, and it's a non-issue.

@rstub

rstub commented Feb 7, 2019

Note that PCG offers a nice convenience function as fast replacement of uniform_int_distribution:
http://www.pcg-random.org/using-pcg-cpp.html#result-type-pcg32-operator-result-type-upper-bound

In principle I also plan to add something along these lines to dqrng itself (c.f.
http://www.pcg-random.org/posts/bounded-rands.html and https://github.com/daqana/dqrng/tree/feature/sample), but I haven't had the time to finish that.

Concerning serialization you might look into https://cran.r-project.org/package=Rcereal and this example using an XPtr: https://stackoverflow.com/a/53157233/8416610.

@jlmelville
Owner

Thanks to the pointer from @rstub, I have adopted PCG (linking via dqrng) as the default PRNG. Although it is faster than Boost, it is noticeably slower than what we had before (optimization is 10-15% slower with default settings on MNIST digits), so I have added a parameter to get back the old behavior (pcg_rand = FALSE).

I have also added fast_sgd, which if set to TRUE will bring back the combination of approximations and thread settings that gives rise to a substantially faster optimization, which works perfectly well for visualizations.

Next: my experiments with nearest neighbor descent have been disappointing so far: you get a better improvement in accuracy by using more accurate Annoy settings than by spending the same time refining nearest neighbor descent. Probably I am doing something wrong. Probably I should remove this code and save it for a triumphant return at a later date.

Finally: I found and fixed a bug in the spectral initialization while generating the example images. Having now churned through these datasets successfully, #16 is probably all that prevents me finally getting round to a CRAN submission. But I haven't tested the cosine and Hamming distances as thoroughly. Example datasets (especially if they can be used in the examples page and/or will uncover horrible bugs) would be welcome.

@LTLA
Contributor

LTLA commented Mar 28, 2019

Cool - I'll check out the compatibility with the uwot branch at https://github.com/davismcc/scater.

@jlmelville
Owner

CRAN submission was just accepted: https://cran.r-project.org/package=uwot

Thanks to all in this thread for their contributions to getting this done.

@LTLA
Contributor

LTLA commented Apr 7, 2019

You'll get a revdep as soon as scater re-builds on the BioC machines (probably in a day or two).

@jlmelville
Owner

Does anyone else think the 'hooray' emoji looks like a taco shell with some lettuce sticking out the end?

Anyway, don't get too carried away with those celebratory salad tacos, because of course I managed to crash the R session mere moments after I received the email from CRAN. Fundamentally, the issue is because of spotify/annoy#378 (can't read Annoy indexes > 2GB), but the real blame lies with me for failing to check that Annoy actually returns k neighbors when you ask for them.

There will therefore be a patch release (hopefully 0.1.3) coming ASAP, assuming the maintainers don't ban me or throw uwot off CRAN. It also attempts to fix an ASAN/UBSAN/valgrind issue, although I think most of that is due to internal code used by RcppParallel.

@LTLA
Contributor

LTLA commented Apr 7, 2019

Well, if it's any consolation, I also have an inexplicable failure on one of the UMAP-related tests in scater on the Windows 32 build machines. Probably something to do with differences in numerical precision between FNN and BiocNeighbors during the nearest neighbor calculations ... whatever, I'm skipping it.

@jlmelville
Owner

Hmm, this may not be related, but the AppVeyor Windows CI builds have started sporadically failing on a umap_transform test. I can't reproduce it on my machine or rhub, and it works again whenever I restart the build, so I don't know what's going on there.

@jlmelville
Owner

Patch release 0.1.3 has hit CRAN, so the immediate panic is over.

The valgrind and related checks haven't been updated yet. If they aren't clean, I have been told they will need fixing "quickly", so I have that to look forward to also.

@jlmelville
Owner

The CRAN checks have all updated to 0.1.3. There is an UBSAN issue, but it seems to be due to RcppParallel. The same issue occurs in the RcppParallel checks, so I don't think I can do anything about it.

I consider uwot successfully submitted to CRAN at last. Closing.

yuhanH pushed a commit to yuhanH/uwot that referenced this issue Jul 20, 2020
eliminates the ERROR that is thrown by R CMD check and, in principle, will allow the package to head to CRAN

in this respect, it fixes jlmelville#2