CRAN submission #2
This package looks great. I can't believe it's not on CRAN yet. Is there a hold-up, and can I help?
Thanks for the kind words. There's no hold-up on CRAN submission; I just don't have any plans to do so at the moment. Here's the roadmap/list of objections:
Basically, I've been very lazy, but I also want to be respectful of the CRAN maintainers' time if the submission runs into issues. However, this sort of reminder is a useful kick in the pants. If having to install from GitHub is getting in the way of people using the package, that's a good extra motivator.
1. and 2. I'm glad you brought this up. I don't think you'll have a problem there:

vals <- readRDS("pbmc68k.rds") # first 50 PCs of the 10X PBMC 68K data set
test <- vals[1:10000,]

library(umap)
system.time(X <- umap::umap(test, n_neighbors=15))
##    user  system elapsed
##  74.370   3.949  79.067

library(uwot)
system.time(Y <- uwot::umap(test, n_neighbors=15))
##    user  system elapsed
##  22.079   0.560  12.822

umap::umap does most of its heavy lifting in R, which explains the difference in speed.

3. and 4. I can probably help with that. I looked at your source code a while ago; it seemed alright. Does it go through the various platforms on rhub okay?

My motivation is to depend on uwot in my own packages, which requires uwot to be on CRAN (or Bioconductor); I can't pull it in via devtools.
I didn't know about rhub, thanks for the pointer! I have a lot of platform checking in my future, it seems. I'm sure I've committed several other crimes against good coding hygiene, but the two major bits of dirty laundry I need to air (and ask for opinions on) are below. First, for standard UMAP the embedding matrix is passed, without a copy, as both the head and tail coordinates:
result <- optimize_layout_umap(
  head_embedding = embedding,   # the same matrix is passed as both the
  tail_embedding = embedding,   # head and the tail coordinates, so they alias
  positive_head = positive_head,
  positive_tail = positive_tail,
  epochs_per_sample = epochs_per_sample, ...)
I've got away with it so far (in the sense that it hasn't generated a segfault), but it seems like a complete no-no to me. Do I need to bite the bullet and ensure a copy is passed?

Second, the parallel code reads from R-allocated vectors and matrices across multiple threads. As far as I recall from some initial experiments, there were differences between how RcppArmadillo vectors and Rcpp vectors behaved here, but I don't remember the details.

It feels good to have confessed to these sins.
Yeah, 1 is definitely bad. You should allocate new memory to store the result; but that should be cheap, so I wouldn't worry about it.

2 is harder for me to tell; I usually use OpenMP to do my parallelization. My guess is that you should be fine for read-only access: you should only need thread safety for writes. It shouldn't even compile if the objects were being modified... in theory, at least; who knows what's going on in there.

More generally, I wonder if you really need Armadillo for what you're doing; it seems that raw vectors might be sufficient for access. (I don't see a whole lot of matrix operations, though I'm not really familiar with Armadillo, so I might have missed them.)

I also participated in the parallelization of Rtsne, from which we learnt some interesting lessons about the order of elements during reductions (see jkrijthe/Rtsne#16). I don't see any reductions in your code, though.

Both of these last two points are mostly a matter of taste, so ignore them as you please; that's just where my experience tends to be concentrated.
1 is now fixed. It wasn't the speed of allocation that bothered me, but making extra copies of potentially large matrices. This is at best unnecessary, and at worst, for standard UMAP, if the code ends up working on two separate copies of the input coordinates, the results come out wrong. But despite the best efforts of the compiler to conspire against me, I have emerged victorious. Probably.

I don't use Armadillo for linear algebra, just for its convenient row and column subsetting, vectorized functions and so on. I assume you are suggesting that I can just do what Rtsne does and work with raw vectors directly.

That is an interesting read.
3 on my TODO list: for the spectral initialization, with what I assume are very poorly conditioned input matrices, RSpectra sometimes seems to hang (or at least take an unacceptably long time; I've never stuck around long enough to find out how long that is). I need to see if there's a way to get it to bail out earlier, or if there's a better way to find the small eigenvalues needed for the normalized Laplacian.
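For concreteness, here is a minimal sketch of one way to look for the small eigenvalues with RSpectra using shift-invert mode and an iteration cap, so that a difficult matrix errors out quickly instead of hanging. This is not uwot's actual code; spectral_init and the affinity matrix A are made-up names for illustration.

library(Matrix)
library(RSpectra)

# A: symmetric sparse affinity matrix (dgCMatrix); ndim: embedding dimensions
spectral_init <- function(A, ndim = 2, maxitr = 1000) {
  d <- Matrix::rowSums(A)
  Dinv <- Matrix::Diagonal(x = 1 / sqrt(d))
  L <- Matrix::Diagonal(nrow(A)) - Dinv %*% A %*% Dinv  # normalized Laplacian

  # Shift-invert around a point just below zero targets the eigenvalues
  # closest to zero without factorizing the exactly singular L; maxitr caps
  # the Lanczos iterations so we can bail out early on hard problems.
  res <- RSpectra::eigs_sym(L, k = ndim + 1, which = "LM", sigma = -1e-3,
                            opts = list(maxitr = maxitr, tol = 1e-4))
  if (res$nconv < ndim + 1) {
    stop("spectral initialization did not converge; fall back to something else")
  }
  ord <- order(res$values)  # ascending: the trivial ~0 eigenvalue comes first
  res$vectors[, ord[2:(ndim + 1)], drop = FALSE]
}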
+1
Anything we can do to help (i.e., me and @biobenkj) will be enlightened self-interest. uwot does things easily that either take a long time or are nearly impossible in the R umap package.
LMK
--t
Just to motivate this discussion: https://github.com/davismcc/scater/tree/uwot. So once uwot gets on CRAN, it'll have at least one reverse dependency straight away.
Two, actually: we would have enabled UMAP support in compartmap if uwot had been on CRAN by the submission deadline, and we'll do so in devel if it lands any time before March.
--t
Though I guess CRAN doesn't track Bioc revdeps, so you wouldn't see them on the landing page. But @jlmelville can use that as an argument for getting CRAN to accept the package.
This is probably blocked until I at least understand what's causing the Mac issues with #1.
Other things to fix:
I was going to say that if you can't use <random>, there are other options; but I should also add that something like some_rng() % tail_nvert is technically not correct (in terms of producing a uniform range in [0, tail_nvert)): unless tail_nvert divides the generator's range exactly, the smaller remainders turn up slightly more often than they should (modulo bias).
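To make the bias concrete, here is a small illustration plus the usual rejection-sampling fix, sketched in R with a hypothetical 8-bit generator; none of this is code from uwot.

# Hypothetical 8-bit generator producing 0..255 uniformly.
# Reducing with %% 100 maps 256 values onto 100 buckets, so buckets
# 0..55 receive 3 source values each while 56..99 receive only 2:
table(0:255 %% 100)

# The usual fix is rejection sampling: discard draws that fall in the
# "overhang" at the top of the range, then take the remainder.
draw_unif_int <- function(next_draw, n, range = 256) {
  limit <- range - (range %% n)      # largest multiple of n within range
  repeat {
    x <- next_draw()
    if (x < limit) return(x %% n)    # every residue is now equally likely
  }
}

# usage with a stand-in generator
rng <- function() sample(0:255, 1)
draw_unif_int(rng, 100)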
For now, I have tried to stick to copying how the Python UMAP implementation works. There is a slight complication in that the Python implementation rolls its own random number generation, so I can't reproduce its behavior exactly with a stock generator.

I may take a look at pulling in Boost at some point (although does anyone know why the Boost RNG is OK to use if the C++11 RNGs aren't?) to generate integers from a uniform distribution the correct way, and see what effect it has on the speed of UMAP's optimization. It's not very high on my priority list at the moment, unless something demonstrably horrible is occurring with the current way of doing things.
The package may also be used in dimRed once it's available on CRAN.
I was just thinking about this - we should probably get the ball rolling again.

@jlmelville, regarding Boost's random headers: the major difference (for me) is that the C++ <random> does not use the same algorithm in its distribution functions across platforms. The standard only mandates the output of the PRNGs themselves, not how the random number stream is converted into the variates that we actually use. For example, std::normal_distribution has to produce normally distributed values, but it doesn't have to produce the same series of normally distributed values between clang and GCC.

This is a real problem, because I ran into it! I don't understand why the C++ standard was written like that; it basically makes <random> unusable for scientific computing where a result must be reproducible across systems. That's why I only use boost/random.hpp rather than <random>.

If we were to follow the spirit of the law, I would guess that we should not use Boost's random headers either. But there are some real limitations of R's C API for random number generation (see, for example, my comments in daqana/dqrng#11), and the best solution in such cases would be Boost.
Is it not possible to use package ‘BH’ on CRAN to circumvent this?
--t
Thanks for explaining the difference between C++11 and Boost's random headers, @LTLA, that makes sense. I am open to considering either Boost or https://github.com/daqana/dqrng.

I have not abandoned preparing this for CRAN, I have just got distracted with other things. Mainly, I am still a bit dissatisfied with the speed of the nearest neighbor calculations. I have created https://github.com/jlmelville/rnndescent to implement the nearest neighbor descent method used by the Python UMAP (admittedly it also does a different initialization) and by LargeVis, but that is still a work in progress.

Also, #16 (spectral initialization basically taking forever) is still not resolved to my liking. Will I have to implement a LOBPCG routine for R? I haven't checked how painful that will be.

Finally, #19 required adding custom save/load functions. This feels like the sort of thing that shouldn't be necessary, but I don't know whether you can write custom serialization hooks for C++ objects that wrap external data. This might just be the way it is, and it's a non-issue.
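For reference, the kind of workflow those custom save/load functions enable looks roughly like the sketch below. The function names (save_uwot, load_uwot, umap_transform) and arguments are assumptions based on this discussion of #19, not a definitive description of the API.

library(uwot)

# Fit a model whose nearest neighbor index we want to reuse later with
# umap_transform(); ret_model = TRUE keeps the Annoy index, which lives
# behind an external pointer, so plain saveRDS() on the result isn't enough.
model <- umap(iris[, 1:4], n_neighbors = 15, ret_model = TRUE)

# Custom save/load round-trips the external data alongside the R object
model_file <- tempfile(fileext = ".uwot")
save_uwot(model, file = model_file)
restored <- load_uwot(model_file)

# Embed new points with the restored model
new_coords <- umap_transform(iris[1:5, 1:4], restored)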
Note that PCG offers a nice convenience function for drawing bounded integers, which is a fast replacement for the modulo trick. In principle I also plan to add something along these lines to dqrng itself.

Concerning serialization, you might look into https://cran.r-project.org/package=Rcereal and this example using an XPtr: https://stackoverflow.com/a/53157233/8416610.
Thanks to the pointer by @rstub, I have adopted PCG.

Next: my experiments with nearest neighbor descent have been disappointing so far: you get a better improvement in accuracy by using more accurate Annoy settings than by spending the same time refining nearest neighbor descent. Probably I am doing something wrong. Probably I should remove this code and save it for a triumphant return at a later date.

Finally: I found and fixed a bug in the spectral initialization as I was generating the example images. Having now churned through these datasets successfully, #16 is probably all that prevents me from finally getting round to a CRAN submission. But I haven't tested cosine and hamming distances as thoroughly. Example datasets (especially if they can be used in the examples page and/or will uncover horrible bugs) would be welcome.
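For anyone wanting to try the "more accurate Annoy settings" route mentioned above, the knobs in question are the n_trees and search_k parameters of uwot::umap; a quick sketch, with parameter values that are illustrative rather than recommendations:

library(uwot)

# Default Annoy settings: fast, approximate nearest neighbors
fast <- umap(iris[, 1:4], n_neighbors = 15)

# More accurate (and slower) neighbor search: build more trees and
# inspect more candidates per query
accurate <- umap(iris[, 1:4], n_neighbors = 15,
                 n_trees = 100, search_k = 100 * 15 * 2)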
Cool - I'll check out the compatibility with the scater wrapper.
The CRAN submission was just accepted: https://cran.r-project.org/package=uwot

Thanks to all in this thread for their contributions to getting this done.
You'll get a revdep as soon as scater re-builds on the BioC machines (probably in a day or two).
Does anyone else think the 'hooray' emoji looks like a taco shell with some lettuce sticking out the end? Anyway, don't get too carried away with those celebratory salad tacos, because of course I managed to crash the R session mere moments after I received the email from CRAN.

Fundamentally, the issue is spotify/annoy#378 (Annoy indexes > 2GB can't be read), but the real blame lies with me for failing to check that Annoy actually returns k neighbors when you ask for them. There will therefore be a patch release (hopefully 0.1.3) coming ASAP, assuming the maintainers don't ban me or throw uwot off CRAN. It also attempts to fix an ASAN/UBSAN/valgrind issue, although I think most of it is due to internal code used by RcppParallel.
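A rough sketch of the kind of sanity check described above; this is not the actual fix, and check_knn plus the assumed index-matrix layout (one row per observation, NA or 0 for a missing neighbor) are made up for illustration.

# Hypothetical helper: verify that an approximate nearest neighbor search
# returned k neighbors for every observation before trusting the results.
check_knn <- function(idx, k) {
  if (ncol(idx) < k) stop("neighbor matrix has fewer than ", k, " columns")
  bad <- apply(idx, 1, function(r) anyNA(r) || any(r == 0))
  if (any(bad)) {
    stop(sum(bad), " observation(s) received fewer than ", k, " neighbors; ",
         "increase search_k / n_trees or check the Annoy index")
  }
  invisible(TRUE)
}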
Well, if it's any consolation, I also have an inexplicable failure on one of the UMAP-related tests in scater on the Windows 32 build machines. Probably something to do with differences in numerical precision between FNN and BiocNeighbors during the nearest neighbor calculations ... whatever, I'm skipping it.
Hmm, this may not be related, but the AppVeyor Windows CI builds have started sporadically failing as well.
Patch release 0.1.3 has hit CRAN, so the immediate panic is over. The valgrind and related checks haven't been updated yet. If they aren't clean, I have been told they will need fixing "quickly", so I have that to look forward to also.
The CRAN checks have all updated to 0.1.3. There is a UBSAN issue, but it seems to be due to RcppParallel. The same issue occurs in the RcppParallel checks, so I don't think I can do anything about it. I consider uwot successfully submitted to CRAN at last. Closing.
Eliminates the ERROR that is thrown by R CMD check and, in principle, will allow the package to head to CRAN. In this respect, it fixes jlmelville#2.
see imminent pull request for details