Skip to content

R package implementing the Nearest Neighbor Descent method for approximate nearest neighbors

License

Notifications You must be signed in to change notification settings

jlmelville/rnndescent

Repository files navigation

rnndescent

R-CMD-check AppVeyor Build Status Coverage Status CRAN Status Badge Dependencies Downloads (monthly) Downloads (total) Last Commit

An R package for finding approximate nearest neighbors, translated from the Python package PyNNDescent written by the great Leland McInnes. As the name suggests, it uses the Nearest Neighbor Descent method (Dong et al., 2011), but also makes use of Random Partition Trees (Dasgupta and Freund, 2008) as well as ideas from FANNG and NGT.

You can use rnndescent for:

  • optimizing an initial set of nearest neighbors, e.g. those generated by RcppAnnoy or RcppHNSW.
  • using this package for nearest neighbor search all on its own...
  • ... including finding nearest neighbors on sparse data, which most other packages in the R ecosystem cannot do.
  • and a much larger number of metrics than most other packages.

Documentation

See the Get Started article for the basics. The other vignettes go into more detail.

Current Status

14 May 2024: Version 0.1.6 has been released to CRAN. The previous release didn't quite get compatibility with dqrng right so here is another attempt. Also a couple of other bug fixes have been included.

Installation

CRAN

install.packages("rnndescent")

Development Version

# install.packages("pak")
pak::pkg_install("jlmelville/rnndescent")

This packages makes use of C++ code which must be compiled. You may have to carry out a few extra steps before being able to build:

Windows: install Rtools and ensure C:\Rtools\bin is on your path.

Mac OS X: using a custom ~/.R/Makevars may cause linking errors. This sort of thing is a potential problem on all platforms but seems to bite Mac owners more. The R for Mac OS X FAQ may be helpful here to work out what you can get away with. To be on the safe side, I would advise building without a custom Makevars.

rnndescent uses C++17. This shouldn't be too big a problem but not all R platforms support it (sorry if this affects you).

Example

library(rnndescent)

# Find 15-knn
iris_knn <- rnnd_knn(iris, k = 15)

# Build an index
iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
# Specify the number of neighbors you are likely to want to query for
iris_index <- rnnd_build(iris_index, k = 15)

# Query then index
iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
iris_query_nn <- rnnd_query(index = iris_index, query = iris_odd, k = 15)

For more details, please see the documentation.

Supported Metrics

Many. See the metrics article for a list.

Missing Features

Compared to PyNNDescent, rnndescent is currently lacking, in decreasing order of likelihood of implementation:

  • Only parallel batch queries are currently supported. This means that if you are trying to stream queries, where you are only querying one item at a time, you will get no parallelism.
  • The index is always passed between the C++ and R layers when building an index and querying. This is useful for portability as its easy to serialize the index (you can use saveRDS like any R data for example), but it's not very efficient. Keeping the index as an R-wrapped C++ class has its own downsides but would fix that.
  • The index can also get very large for large (and high-dimensional) datasets.
  • Some of the distance metrics. A large number are currently supported though. See Missing Metrics below for those that are currently not available.
  • Custom metrics. This just isn't feasible with a C++ implementation.

The issues around index serialization and parallel behavior make rnndescent currently unsuitable for streaming applications where you are querying one item at a time. If you are doing batch queries, where you are querying multiple items at once, then rnndescent should be fine: for example, generating nearest neighbors for UMAP (maybe for use with uwot). Dimensionality reduction is my personal use case for nearest neighbors calculation and I would like to get rnndescent onto CRAN in a useful-for-something state. As a result I am not targeting an initial release to support the streaming case. I would like to fix this for a subsequent release.

Also there is no specialized distance code to take advantage of AVX etc., so rnndescent is going to run slower than other packages. This wouldn't be allowed on CRAN anyway, but might be a nice-to-have for installing from github.

Citation

Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web (pp. 577-586). ACM. doi.org/10.1145/1963405.1963487.

License

The R package as a whole is licensed under GPLv3 or later. The following files are licensed differently:

  • inst/include/dqsample.h is a modification of some sampling code from dqrng and is AGPLv3 or later.
  • inst/include/RcppPerpendicular.h is a modification of some code from from RcppParallel and is GPLv2 or later
  • The underlying nearest neighbor descent C++ library, which can be found under inst/include/tdoann, is licensed under the BSD 2-Clause.

As far as I know, these licenses are all compatible with re-licensing under GPLv3 or later.

Missing Metrics

The following metrics are in PyNNDescent but are not supported in rnndescent:

  • Circular Kantorovich
  • Haversine
  • Kantorovich
  • Mahalanobis
  • Minkowski
  • Sinkhorn
  • Standardised Euclidean
  • Wasserstein 1d
  • Weighted Minkowski

These require passing extra information as part of the metric definition, which is not currently supported.

See Also