An R package for finding approximate nearest neighbors, translated from the Python package PyNNDescent written by the great Leland McInnes. As the name suggests, it uses the Nearest Neighbor Descent method (Dong et al., 2011), but also makes use of Random Partition Trees (Dasgupta and Freund, 2008) as well as ideas from FANNG and NGT.
You can use rnndescent for:
- optimizing an initial set of nearest neighbors, e.g. those generated by RcppAnnoy or RcppHNSW.
- using this package for nearest neighbor search all on its own...
- ... including finding nearest neighbors on sparse data, which most other packages in the R ecosystem cannot do.
- and a much larger number of metrics than most other packages.
See the Get Started article for the basics. The other vignettes go into more detail.
14 May 2024: Version 0.1.6 has been released to CRAN. The previous release
didn't quite get compatibility with dqrng
right so here is another attempt.
Also a couple of other bug fixes have been included.
install.packages("rnndescent")
# install.packages("pak")
pak::pkg_install("jlmelville/rnndescent")
This packages makes use of C++ code which must be compiled. You may have to carry out a few extra steps before being able to build:
Windows: install
Rtools and ensure
C:\Rtools\bin
is on your path.
Mac OS X: using a custom ~/.R/Makevars
may cause linking errors.
This sort of thing is a potential problem on all platforms but seems to bite
Mac owners more.
The R for Mac OS X FAQ
may be helpful here to work out what you can get away with. To be on the safe
side, I would advise building without a custom Makevars
.
rnndescent
uses C++17. This shouldn't be too big a problem but not all R
platforms support it (sorry if this affects you).
library(rnndescent)
# Find 15-knn
iris_knn <- rnnd_knn(iris, k = 15)
# Build an index
iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
# Specify the number of neighbors you are likely to want to query for
iris_index <- rnnd_build(iris_index, k = 15)
# Query then index
iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
iris_query_nn <- rnnd_query(index = iris_index, query = iris_odd, k = 15)
For more details, please see the documentation.
Many. See the metrics article for a list.
Compared to PyNNDescent, rnndescent
is currently lacking, in decreasing order
of likelihood of implementation:
- Only parallel batch queries are currently supported. This means that if you are trying to stream queries, where you are only querying one item at a time, you will get no parallelism.
- The index is always passed between the C++ and R layers when building an index
and querying. This is useful for portability as its easy to serialize the index
(you can use
saveRDS
like any R data for example), but it's not very efficient. Keeping the index as an R-wrapped C++ class has its own downsides but would fix that. - The index can also get very large for large (and high-dimensional) datasets.
- Some of the distance metrics. A large number are currently supported though.
See
Missing Metrics
below for those that are currently not available. - Custom metrics. This just isn't feasible with a C++ implementation.
The issues around index serialization and parallel behavior make rnndescent
currently unsuitable for streaming applications where you are querying one item
at a time. If you are doing batch queries, where you are querying multiple items
at once, then rnndescent
should be fine: for example, generating nearest
neighbors for UMAP (maybe for use with
uwot). Dimensionality reduction is my
personal use case for nearest neighbors calculation and I would like to get
rnndescent
onto CRAN in a useful-for-something state. As a result I am not
targeting an initial release to support the streaming case. I would like to fix
this for a subsequent release.
Also there is no specialized distance code to take advantage of AVX etc., so
rnndescent
is going to run slower than other packages. This wouldn't be
allowed on CRAN anyway, but might be a nice-to-have for installing from github.
Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web (pp. 577-586). ACM. doi.org/10.1145/1963405.1963487.
The R package as a whole is licensed under GPLv3 or later. The following files are licensed differently:
inst/include/dqsample.h
is a modification of some sampling code from dqrng and is AGPLv3 or later.inst/include/RcppPerpendicular.h
is a modification of some code from from RcppParallel and is GPLv2 or later- The underlying nearest neighbor descent C++ library, which can be found under
inst/include/tdoann
, is licensed under the BSD 2-Clause.
As far as I know, these licenses are all compatible with re-licensing under GPLv3 or later.
The following metrics are in PyNNDescent but are not supported in rnndescent:
- Circular Kantorovich
- Haversine
- Kantorovich
- Mahalanobis
- Minkowski
- Sinkhorn
- Standardised Euclidean
- Wasserstein 1d
- Weighted Minkowski
These require passing extra information as part of the metric definition, which is not currently supported.
- PyNNDescent, the Python implementation.
- nndescent, a C++ implementation.
- NearestNeighborDescent.jl, a Julia implementation.
- NNDescent.cpp, another C++ implementation.
- nndescent, another C++ implementation, with Python bindings.