CRAN submission #2
This package looks great. I can't believe it's not on CRAN yet. Is there a hold-up, and can I help?
Thanks for the kind words. There's no hold-up on CRAN submission; I just don't have any plans to do so at the moment. Here's the roadmap/list of objections:
Basically, I've been very lazy, but I also want to be respectful of the CRAN maintainers' time if the submission runs into issues. However, this sort of reminder is a useful kick in the pants. If having to install from GitHub is getting in the way of people using the package, that's a good extra motivator.
1. and 2. I'm glad you brought this up. I don't think you'll have a problem there:

vals <- readRDS("pbmc68k.rds") # first 50 PCs of the 10X PBMC 68K data set
test <- vals[1:10000,]

library(umap)
system.time(X <- umap::umap(test, n_neighbors=15))
##    user  system elapsed
##  74.370   3.949  79.067

library(uwot)
system.time(Y <- uwot::umap(test, n_neighbors=15))
##    user  system elapsed
##  22.079   0.560  12.822

umap::umap does most of its heavy lifting in R, which explains the difference in speed.

3. and 4. I can probably help with that. I looked at your source code a while ago; it seemed alright. Does it go through the various platforms on rhub okay?

My motivation is to depend on uwot in my own packages, which requires uwot to be on CRAN (or Bioconductor); I can't pull it in via devtools.
I didn't know about rhub, thanks for the pointer! I have a lot of platform checking in my future, it seems. I'm sure I've committed several other crimes against good coding hygiene, but the two major bits of dirty laundry I need to air (and ask for opinions on) are below. First, for standard UMAP the embedding matrix is passed, without a copy, as both the head and tail coordinates:
result <- optimize_layout_umap(
  head_embedding = embedding,   # the same matrix is passed as both the
  tail_embedding = embedding,   # head and the tail coordinates, so they alias
  positive_head = positive_head,
  positive_tail = positive_tail,
  epochs_per_sample = epochs_per_sample, ...)
I've got away with it so far (in the sense that it hasn't generated a segfault), but it seems like a complete no-no to me. Do I need to bite the bullet and ensure a copy is passed?

Second, the parallel code reads from R-allocated vectors and matrices across multiple threads. As far as I recall from some initial experiments, there were differences between how RcppArmadillo vectors and Rcpp vectors behaved here, but I don't remember the details.

It feels good to have confessed to these sins.
Yeah, 1 is definitely bad. You should allocate new memory to store the result; but that should be cheap, so I wouldn't worry about it.

2 is harder for me to tell; I usually use OpenMP to do my parallelization. My guess is that you should be fine for read-only access: you should only need thread safety for writes. It shouldn't even compile if the objects were being modified... in theory, at least; who knows what's going on in there.

More generally, I wonder if you really need Armadillo for what you're doing; it seems that raw vectors might be sufficient for access. (I don't see a whole lot of matrix operations, though I'm not really familiar with Armadillo, so I might have missed them.)

I also participated in the parallelization of Rtsne, from which we learnt some interesting lessons about the order of elements during reductions (see jkrijthe/Rtsne#16). I don't see any reductions in your code, though.

Both of these last two points are mostly a matter of taste, so ignore them as you please; that's just where my experience tends to be concentrated.
1 is now fixed. It wasn't the speed of allocation that bothered me, but making extra copies of potentially large matrices. This is at best unnecessary, and at worst, for standard UMAP, if the code ends up working on two separate copies of the input coordinates, the results come out wrong. But despite the best efforts of the compiler to conspire against me, I have emerged victorious. Probably.

I don't use Armadillo for linear algebra, just for its convenient row and column subsetting, vectorized functions and so on. I assume you are suggesting that I can just do what Rtsne does and work with raw vectors directly.

That is an interesting read.
3 on my TODO list: for the spectral initialization, with what I assume are very poorly conditioned input matrices, RSpectra sometimes seems to hang (or at least take an unacceptably long time; I've never stuck around long enough to find out how long that is). I need to see if there's a way to get it to bail out earlier, or if there's a better way to find the small eigenvalues needed for the normalized Laplacian.
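For concreteness, here is a minimal sketch of one way to look for the small eigenvalues with RSpectra using shift-invert mode and an iteration cap, so that a difficult matrix errors out quickly instead of hanging. This is not uwot's actual code; spectral_init and the affinity matrix A are made-up names for illustration.

library(Matrix)
library(RSpectra)

# A: symmetric sparse affinity matrix (dgCMatrix); ndim: embedding dimensions
spectral_init <- function(A, ndim = 2, maxitr = 1000) {
  d <- Matrix::rowSums(A)
  Dinv <- Matrix::Diagonal(x = 1 / sqrt(d))
  L <- Matrix::Diagonal(nrow(A)) - Dinv %*% A %*% Dinv  # normalized Laplacian

  # Shift-invert around a point just below zero targets the eigenvalues
  # closest to zero without factorizing the exactly singular L; maxitr caps
  # the Lanczos iterations so we can bail out early on hard problems.
  res <- RSpectra::eigs_sym(L, k = ndim + 1, which = "LM", sigma = -1e-3,
                            opts = list(maxitr = maxitr, tol = 1e-4))
  if (res$nconv < ndim + 1) {
    stop("spectral initialization did not converge; fall back to something else")
  }
  ord <- order(res$values)  # ascending: the trivial ~0 eigenvalue comes first
  res$vectors[, ord[2:(ndim + 1)], drop = FALSE]
}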
+1
Anything we can do to help (i.e., me and @biobenkj) will be enlightened self-interest. uwot does things easily that either take a long time or are nearly impossible in the R umap package.
LMK
--t
Just to motivate this discussion: https://github.com/davismcc/scater/tree/uwot. So once uwot gets on CRAN, it'll have at least one reverse dependency straight away.
Two, actually: we would have enabled UMAP support in compartmap if uwot had been on CRAN by the submission deadline, and we'll do so in devel if it lands any time before March.
--t
Though I guess CRAN doesn't track Bioc revdeps, so you wouldn't see them on the landing page. But @jlmelville can use that as an argument for getting CRAN to accept the package.
This is probably blocked until I at least understand what's causing the Mac issues with #1.
Other things to fix:
I was going to say that if you can't use <random>, there are other options; but I should also add that something like some_rng() % tail_nvert is technically not correct (in terms of producing a uniform range in [0, tail_nvert)): unless tail_nvert divides the generator's range exactly, the smaller remainders turn up slightly more often than they should (modulo bias).
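To make the bias concrete, here is a small illustration plus the usual rejection-sampling fix, sketched in R with a hypothetical 8-bit generator; none of this is code from uwot.

# Hypothetical 8-bit generator producing 0..255 uniformly.
# Reducing with %% 100 maps 256 values onto 100 buckets, so buckets
# 0..55 receive 3 source values each while 56..99 receive only 2:
table(0:255 %% 100)

# The usual fix is rejection sampling: discard draws that fall in the
# "overhang" at the top of the range, then take the remainder.
draw_unif_int <- function(next_draw, n, range = 256) {
  limit <- range - (range %% n)      # largest multiple of n within range
  repeat {
    x <- next_draw()
    if (x < limit) return(x %% n)    # every residue is now equally likely
  }
}

# usage with a stand-in generator
rng <- function() sample(0:255, 1)
draw_unif_int(rng, 100)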
For now, I have tried to stick to copying how the Python UMAP implementation works. There is a slight complication in that the Python implementation rolls its own random number generation, so I can't reproduce its behavior exactly with a stock generator.

I may take a look at pulling in Boost at some point (although does anyone know why the Boost RNG is OK to use if the C++11 RNGs aren't?) to generate integers from a uniform distribution the correct way, and see what effect it has on the speed of UMAP's optimization. It's not very high on my priority list at the moment, unless something demonstrably horrible is occurring with the current way of doing things.
The package may also be used in dimRed once it's available on CRAN.
I was just thinking about this - we should probably get the ball rolling again.

@jlmelville, regarding Boost's random headers: the major difference (for me) is that the C++ <random> does not use the same algorithm in its distribution functions across platforms. The standard only mandates the output of the PRNGs themselves, not how the random number stream is converted into the variates that we actually use. For example, std::normal_distribution has to produce normally distributed values, but it doesn't have to produce the same series of normally distributed values between clang and GCC.

This is a real problem, because I ran into it! I don't understand why the C++ standard was written like that; it basically makes <random> unusable for scientific computing where a result must be reproducible across systems. That's why I only use boost/random.hpp rather than <random>.

If we were to follow the spirit of the law, I would guess that we should not use Boost's random headers either. But there are some real limitations of R's C API for random number generation (see, for example, my comments in daqana/dqrng#11), and the best solution in such cases would be Boost.
Is it not possible to use package ‘BH’ on CRAN to circumvent this?
--t
Thanks for explaining the difference between C++11 and Boost's random headers, @LTLA, that makes sense. I am open to considering either Boost or https://github.com/daqana/dqrng.

I have not abandoned preparing this for CRAN, I have just got distracted with other things. Mainly, I am still a bit dissatisfied with the speed of the nearest neighbor calculations. I have created https://github.com/jlmelville/rnndescent to implement the nearest neighbor descent method used by the Python UMAP (admittedly it also does a different initialization) and by LargeVis, but that is still a work in progress.

Also, #16 (spectral initialization basically taking forever) is still not resolved to my liking. Will I have to implement a LOBPCG routine for R? I haven't checked how painful that will be.

Finally, #19 required adding custom save/load functions. This feels like the sort of thing that shouldn't be necessary, but I don't know whether you can write custom serialization hooks for C++ objects that wrap external data. This might just be the way it is, and it's a non-issue.
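For reference, the kind of workflow those custom save/load functions enable looks roughly like the sketch below. The function names (save_uwot, load_uwot, umap_transform) and arguments are assumptions based on this discussion of #19, not a definitive description of the API.

library(uwot)

# Fit a model whose nearest neighbor index we want to reuse later with
# umap_transform(); ret_model = TRUE keeps the Annoy index, which lives
# behind an external pointer, so plain saveRDS() on the result isn't enough.
model <- umap(iris[, 1:4], n_neighbors = 15, ret_model = TRUE)

# Custom save/load round-trips the external data alongside the R object
model_file <- tempfile(fileext = ".uwot")
save_uwot(model, file = model_file)
restored <- load_uwot(model_file)

# Embed new points with the restored model
new_coords <- umap_transform(iris[1:5, 1:4], restored)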
Note that PCG offers a nice convenience function for drawing bounded integers, which is a fast replacement for the modulo trick. In principle I also plan to add something along these lines to dqrng itself.

Concerning serialization, you might look into https://cran.r-project.org/package=Rcereal and this example using an XPtr: https://stackoverflow.com/a/53157233/8416610.
Thanks to the pointer by @rstub, I have adopted PCG.

Next: my experiments with nearest neighbor descent have been disappointing so far: you get a better improvement in accuracy by using more accurate Annoy settings than by spending the same time refining nearest neighbor descent. Probably I am doing something wrong. Probably I should remove this code and save it for a triumphant return at a later date.

Finally: I found and fixed a bug in the spectral initialization as I was generating the example images. Having now churned through these datasets successfully, #16 is probably all that prevents me from finally getting round to a CRAN submission. But I haven't tested cosine and hamming distances as thoroughly. Example datasets (especially if they can be used in the examples page and/or will uncover horrible bugs) would be welcome.
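For anyone wanting to try the "more accurate Annoy settings" route mentioned above, the knobs in question are the n_trees and search_k parameters of uwot::umap; a quick sketch, with parameter values that are illustrative rather than recommendations:

library(uwot)

# Default Annoy settings: fast, approximate nearest neighbors
fast <- umap(iris[, 1:4], n_neighbors = 15)

# More accurate (and slower) neighbor search: build more trees and
# inspect more candidates per query
accurate <- umap(iris[, 1:4], n_neighbors = 15,
                 n_trees = 100, search_k = 100 * 15 * 2)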
Cool - I'll check out the compatibility with the scater wrapper.
The CRAN submission was just accepted: https://cran.r-project.org/package=uwot

Thanks to all in this thread for their contributions to getting this done.
You'll get a revdep as soon as scater re-builds on the BioC machines (probably in a day or two).
Does anyone else think the 'hooray' emoji looks like a taco shell with some lettuce sticking out the end? Anyway, don't get too carried away with those celebratory salad tacos, because of course I managed to crash the R session mere moments after I received the email from CRAN.

Fundamentally, the issue is spotify/annoy#378 (Annoy indexes > 2GB can't be read), but the real blame lies with me for failing to check that Annoy actually returns k neighbors when you ask for them. There will therefore be a patch release (hopefully 0.1.3) coming ASAP, assuming the maintainers don't ban me or throw uwot off CRAN. It also attempts to fix an ASAN/UBSAN/valgrind issue, although I think most of it is due to internal code used by RcppParallel.
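A rough sketch of the kind of sanity check described above; this is not the actual fix, and check_knn plus the assumed index-matrix layout (one row per observation, NA or 0 for a missing neighbor) are made up for illustration.

# Hypothetical helper: verify that an approximate nearest neighbor search
# returned k neighbors for every observation before trusting the results.
check_knn <- function(idx, k) {
  if (ncol(idx) < k) stop("neighbor matrix has fewer than ", k, " columns")
  bad <- apply(idx, 1, function(r) anyNA(r) || any(r == 0))
  if (any(bad)) {
    stop(sum(bad), " observation(s) received fewer than ", k, " neighbors; ",
         "increase search_k / n_trees or check the Annoy index")
  }
  invisible(TRUE)
}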
Well, if it's any consolation, I also have an inexplicable failure on one of the UMAP-related tests in scater on the Windows 32 build machines. Probably something to do with differences in numerical precision between FNN and BiocNeighbors during the nearest neighbor calculations ... whatever, I'm skipping it.
Hmm, this may not be related, but the AppVeyor Windows CI builds have started sporadically failing as well.
Patch release 0.1.3 has hit CRAN, so the immediate panic is over. The valgrind and related checks haven't been updated yet. If they aren't clean, I have been told they will need fixing "quickly", so I have that to look forward to also.
The CRAN checks have all updated to 0.1.3. There is a UBSAN issue, but it seems to be due to RcppParallel. The same issue occurs in the RcppParallel checks, so I don't think I can do anything about it. I consider uwot successfully submitted to CRAN at last. Closing.
Eliminates the ERROR that is thrown by R CMD check and, in principle, will allow the package to head to CRAN. In this respect, it fixes jlmelville#2.
see imminent pull request for details