#75: nmslib is now optional

gagolews · Sep 15, 2022 · 6258ca7
1 parent 2a95b7e
commit 6258ca7

Showing 84 changed files with 266 additions and 4,276 deletions.
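
For readers wondering what "optional" can look like in practice, here is a minimal sketch of one common way to guard a feature behind an optional package. It is not taken from this commit: the helper names `has_optional_dependency` and `require_optional_dependency` are hypothetical, and how *genieclust* itself detects *nmslib* is not shown in the hunks below.

```python
import importlib.util


def has_optional_dependency(module_name):
    """Return True if the top-level module `module_name` can be imported."""
    # find_spec() returns None when a top-level module cannot be located.
    return importlib.util.find_spec(module_name) is not None


def require_optional_dependency(module_name, feature):
    """Raise an informative ImportError if an optional package is missing.

    Hypothetical helper, for illustration only; not part of genieclust.
    """
    if not has_optional_dependency(module_name):
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'; "
            f"install it with: pip install {module_name}"
        )


# Hypothetical usage: check availability before running the sparse or
# string examples, which the documentation says rely on nmslib.
NMSLIB_AVAILABLE = has_optional_dependency("nmslib")
```

Calling `require_optional_dependency("nmslib", "Sparse input")` at the start of the sparse or string tutorials would fail early with an actionable message instead of a bare import error further down.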
diff --git a/.devel/sphinx/bibliography.bib b/.devel/sphinx/bibliography.bib
@@ -19,6 +19,14 @@ @article{genieclust
     pages = {100722}
 }
 
+@misc{clustering_benchmarks_v1,
+    author = {M. Gagolewski and others},
+    title = {Benchmark Suite for Clustering Algorithms -- Version 1},
+    year = {2020},
+    url = {https://github.com/gagolews/clustering-benchmarks},
+    doi = {10.5281/zenodo.3815066}
+}
+
 @misc{aaa,
     author = {M. Gagolewski},
     title = {Adjusted asymmetric accuracy: {A} well-behaving external cluster validity measure},
@@ -251,14 +259,6 @@ @incollection{dbscan
     pages = {226--231}
 }
 
-@misc{clustering_benchmarks_v1,
-    author = {M. Gagolewski and others},
-    title = {Benchmark Suite for Clustering Algorithms -- Version 1},
-    year = {2020},
-    url = {https://github.com/gagolews/clustering-benchmarks},
-    doi = {10.5281/zenodo.3815066}
-}
-
 @inproceedings{sklearn_api,
     author    = {L. Buitinck and others},
     title     = {{API} design for machine learning software: {E}xperiences from the scikit-learn project},

diff --git a/.devel/sphinx/index.md b/.devel/sphinx/index.md
@@ -35,7 +35,7 @@ for datasets of millions of points, can be completed within
 Therefore, it is capable of solving **extreme clustering tasks**
 (large datasets with any number of clusters to detect)
 on data that fit into memory.
-Thanks to the use of *nmslib* {cite}`nmslib`,
+Thanks to the use of *nmslib* {cite}`nmslib` (if available),
 sparse or string inputs are also supported.
 
 Genie also allows clustering with respect to mutual reachability distances

diff --git a/.devel/sphinx/news.md b/.devel/sphinx/news.md
@@ -1,6 +1,11 @@
 # What Is New in *genieclust*
 
 
+## 1.1.1 (2022-09-15)
+
+*  [Python] #75: `nmslib` is now optional.
+
+
 ## 1.1.0 (2022-09-05)
 
 *  [GENERAL] The below-mentioned cluster validity measures are discussed

diff --git a/.devel/sphinx/weave/sparse.rst b/.devel/sphinx/weave/sparse.rst
@@ -6,6 +6,12 @@ To illustrate how *genieclust* handles
 let's perform a simple exercise in movie recommendation based on
 `MovieLens <https://grouplens.org/datasets/movielens/latest/>`_ data.
 
+.. important::
+
+    Make sure that the *nmslib* package (an optional dependency) is installed.
+
+
+
 
 .. code-block:: python
 
@@ -45,7 +51,7 @@ and map the movie IDs to consecutive integers.
 
 
 
-Then we read the movie meta data and transform the movie IDs
+Then we read the movie metadata and transform the movie IDs
 in the same way:
 
 
@@ -110,11 +116,11 @@ First few observations:
 
 
 
-Let's extract 200 clusters with Genie with respect to  cosine similarity between films' ratings
+Let's extract 200 clusters with Genie with respect to the cosine similarity between films' ratings
 as given by users (two movies considered similar if they get similar reviews).
 Sparse inputs are supported by the approximate version of the algorithm
 which  relies on the
-near-neighbour search routines implemented in the `nmslib` package.
+near-neighbour search routines implemented in the *nmslib* package.
 
 
 
@@ -152,6 +158,7 @@ Here are the members of an example cluster:
     ## 2084                                     Bowfinger (1999)
     ## 2190                                Boys Don't Cry (1999)
     ## 2888                                     Cell, The (2000)
+    ## 832                                     Doors, The (1991)
     ## 955                                      Duck Soup (1933)
     ## 836                     E.T. the Extra-Terrestrial (1982)
     ## 1960                                      Election (1999)
@@ -180,6 +187,7 @@ Here are the members of an example cluster:
     ## 898     Star Wars: Episode V - The Empire Strikes Back...
     ## 911     Star Wars: Episode VI - Return of the Jedi (1983)
     ## 934                                     Sting, The (1973)
+    ## 2030                                 Summer of Sam (1999)
     ## 987                             This Is Spinal Tap (1984)
     ## 2174                                   Three Kings (1999)
     ## 839                                        Top Gun (1986)
@@ -195,10 +203,10 @@ Here are the members of an example cluster:
 
 The above was performed on an abridged version of the MovieLens dataset.
 The project's `website <https://grouplens.org/datasets/movielens/latest/>`_
-also features a full database that yields a 53889x283228 ratings table
-(with 27753444  non-zero elements) -- such a matrix would definitely
+also features a full database that yields a 53,889x283,228 ratings table
+(with 27,753,444  non-zero elements) -- such a matrix would definitely
 not fit into our RAM if it was in the dense form.
-Determining the whole cluster hierarchy takes only 144 secs.
+Determining the whole cluster hierarchy takes only 144 seconds.
 Here is one of 500 clusters extracted:
 
 .. code::

diff --git a/.devel/sphinx/weave/sparse.rstw b/.devel/sphinx/weave/sparse.rstw
@@ -6,6 +6,12 @@ To illustrate how *genieclust* handles
 let's perform a simple exercise in movie recommendation based on
 `MovieLens <https://grouplens.org/datasets/movielens/latest/>`_ data.
 
+.. important::
+
+    Make sure that the *nmslib* package (an optional dependency) is installed.
+
+
+
 <<sparse-example-imports>>=
 import numpy as np
 import scipy.sparse
@@ -39,7 +45,7 @@ ratings["movieId"] = np.searchsorted(old_movieId_map, ratings["movieId"])
 ratings.head()
 @
 
-Then we read the movie meta data and transform the movie IDs
+Then we read the movie metadata and transform the movie IDs
 in the same way:
 
 <<sparse-example-movies>>=
@@ -69,11 +75,11 @@ First few observations:
 X[:5, :10].todense()
 @
 
-Let's extract 200 clusters with Genie with respect to  cosine similarity between films' ratings
+Let's extract 200 clusters with Genie with respect to the cosine similarity between films' ratings
 as given by users (two movies considered similar if they get similar reviews).
 Sparse inputs are supported by the approximate version of the algorithm
 which  relies on the
-near-neighbour search routines implemented in the `nmslib` package.
+near-neighbour search routines implemented in the *nmslib* package.
 
 
 <<sparse-example-cluster>>=
@@ -95,10 +101,10 @@ movies.loc[movies.cluster == int(which_cluster)].title.sort_values()
 
 The above was performed on an abridged version of the MovieLens dataset.
 The project's `website <https://grouplens.org/datasets/movielens/latest/>`_
-also features a full database that yields a 53889x283228 ratings table
-(with 27753444  non-zero elements) -- such a matrix would definitely
+also features a full database that yields a 53,889x283,228 ratings table
+(with 27,753,444  non-zero elements) -- such a matrix would definitely
 not fit into our RAM if it was in the dense form.
-Determining the whole cluster hierarchy takes only 144 secs.
+Determining the whole cluster hierarchy takes only 144 seconds.
 Here is one of 500 clusters extracted:
 
 .. code::

diff --git a/.devel/sphinx/weave/string.rst b/.devel/sphinx/weave/string.rst
@@ -6,7 +6,12 @@ data. Let's perform an example grouping based
 on `Levenshtein's <https://en.wikipedia.org/wiki/Levenshtein_distance>`_ edit
 distance.
 
-We'll use one of the benchmark datasets mentioned in :cite:`genieins`
+.. important::
+
+    Make sure that the *nmslib* package (an optional dependency) is installed.
+
+
+We will use one of the benchmark datasets mentioned in :cite:`genieins`
 as an example:
 
 
@@ -27,7 +32,7 @@ as an example:
 
 ::
 
-    ## /tmp/ipykernel_56999/1616393685.py:3: DeprecationWarning: `np.str` is
+    ## /tmp/ipykernel_15571/1616393685.py:3: DeprecationWarning: `np.str` is
     ## a deprecated alias for the builtin `str`. To silence this warning, use
     ## `str` by itself. Doing this will not modify any behavior and is safe.
     ## If you specifically wanted the numpy scalar type, use `np.str_` here.
@@ -64,7 +69,7 @@ by an expert:
 
 
 Clustering in the string domain relies on the
-near-neighbour search routines implemented in the `nmslib` package.
+near-neighbour search routines implemented in the *nmslib* package.
 
 
 .. code-block:: python

diff --git a/.devel/sphinx/weave/string.rstw b/.devel/sphinx/weave/string.rstw
@@ -6,7 +6,12 @@ data. Let's perform an example grouping based
 on `Levenshtein's <https://en.wikipedia.org/wiki/Levenshtein_distance>`_ edit
 distance.
 
-We'll use one of the benchmark datasets mentioned in :cite:`genieins`
+.. important::
+
+    Make sure that the *nmslib* package (an optional dependency) is installed.
+
+
+We will use one of the benchmark datasets mentioned in :cite:`genieins`
 as an example:
 
 
@@ -46,7 +51,7 @@ print(n_clusters)
 
 
 Clustering in the string domain relies on the
-near-neighbour search routines implemented in the `nmslib` package.
+near-neighbour search routines implemented in the *nmslib* package.
 
 <<string-example-cluster>>=
 import genieclust

diff --git a/.github/workflows/cibuildwheel.yml b/.github/workflows/cibuildwheel.yml
@@ -6,7 +6,7 @@ env:
     # nmslib does not build on 32bit Windows
     # https://cibuildwheel.readthedocs.io/en/stable/options/
     # cp39-win_amd64
-    CIBW_SKIP: cp2* pp* cp35* cp36* cp37-win32 cp38-win32 cp39-win* cp310-win* cp311-win* cp310-manylinux_i686 cp311-manylinux_i686 *-musllinux*
+    CIBW_SKIP: cp2* pp* cp35* cp36* cp37-win32 cp38-win32 cp39-win32 cp310-win32 cp311-win32 cp310-manylinux_i686 cp311-manylinux_i686 *-musllinux*
     CIBW_BEFORE_BUILD: pip install -r requirements.txt --upgrade
 
 #[ubuntu-latest, windows-latest, macos-latest]
@@ -18,7 +18,7 @@ jobs:
     runs-on: ${{ matrix.os }}
     strategy:
       matrix:
-        os: [ubuntu-20.04, windows-2019, macOS-11]
+        os: [windows-2019, macOS-11, ubuntu-20.04]
 
     steps:
       - uses: actions/checkout@v3

diff --git a/.github/workflows/py.yml b/.github/workflows/py.yml
@@ -32,7 +32,11 @@ jobs:
         pip install flake8 pytest --upgrade
         pip install sphinx numpydoc sphinx_rtd_theme sphinx-bootstrap-theme sphinxcontrib-jsmath sphinxcontrib-bibtex myst_parser --upgrade
         pip install rpy2 pweave ipython jupyter tabulate --upgrade
-    - name: Install optional dependencies
+    - name: Install optional dependencies (nmslib)
+      continue-on-error: true
+      run: |
+        pip install nmslib --upgrade
+    - name: Install optional dependencies (mlpack)
       continue-on-error: true
       run: |
         pip install mlpack --upgrade

diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,8 +1,8 @@
 Package: genieclust
 Type: Package
 Title: Fast and Robust Hierarchical Clustering with Noise Points Detection
-Version: 1.1.0
-Date: 2022-09-05
+Version: 1.1.1
+Date: 2022-09-15
 Authors@R: c(
     person("Marek", "Gagolewski",
         role = c("aut", "cre", "cph"),

diff --git a/Makefile b/Makefile
@@ -75,11 +75,12 @@ news:
 html: python r news weave rd2myst weave-examples
 	rm -rf .devel/sphinx/_build/
 	cd .devel/sphinx && make html
+	rm -rf .devel/sphinx/_build/html/_sources
 	@echo "*** Browse the generated documentation at"\
 	    "file://`pwd`/.devel/sphinx/_build/html/index.html"
 
 docs: html
-	@echo "*** Making 'docs' is only recommended when publishing an"\
+	@echo "*** Making 'docs' is only recommended when publishing the"\
 	    "official release, because it updates the package homepage."
 	@echo "*** Therefore, we check if the package version is like 1.2.3"\
 	    "and not 1.2.2.9007."

diff --git a/NEWS b/NEWS
@@ -1,6 +1,11 @@
 # What Is New in *genieclust*
 
 
+## 1.1.1 (2022-09-15)
+
+*  [Python] #75: `nmslib` is now optional.
+
+
 ## 1.1.0 (2022-09-05)
 
 *  [GENERAL] The below-mentioned cluster validity measures are discussed

diff --git a/README.md b/README.md
@@ -44,8 +44,8 @@ graphs, Genie is also **very fast** – determining the whole cluster hierarchy
 for datasets of millions of points can be completed within minutes. Therefore,
 it is nicely suited for solving of **extreme clustering tasks** (large datasets
 with any number of clusters to detect) for data (also sparse) that fit into
-memory. Thanks to the use of [**nmslib**](https://github.com/nmslib/nmslib),
-sparse or string inputs are also supported.
+memory. Thanks to the use of [**nmslib**](https://github.com/nmslib/nmslib)
+(if available), sparse or string inputs are also supported.
 
 It also allows clustering with respect to mutual reachability distances
 so that it can act as a **noise point detector** or a
@@ -114,8 +114,8 @@ pip3 install genieclust
 ```
 
 The package requires Python 3.7+ together with **cython** as well as
-**numpy**, **scipy**, **matplotlib**, **nmslib**, and **scikit-learn**.
-Optional dependency: **mlpack**.
+**numpy**, **scipy**, **matplotlib**, and **scikit-learn**.
+Optional dependencies: **nmslib** and **mlpack**.
 
 
 
@@ -193,7 +193,8 @@ Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?,
 [DOI: 10.1016/j.ins.2021.10.004](https://doi.org/10.1016/j.ins.2021.10.004).
 
 Gagolewski M., *Adjusted asymmetric accuracy: A well-behaving external
-cluster validity measure*, 2022, submitted for publication.
+cluster validity measure*, under review (preprint),
+[DOI: 10.48550/arXiv.2209.02935](https://doi.org/10.48550/arXiv.2209.02935).
 
 Gagolewski M., *A Framework for Benchmarking Clustering Algorithms*,
 2022, <https://clustering-benchmarks.gagolewski.com>.

diff --git a/docs/_sources/genieclust.rst.txt b/docs/_sources/genieclust.rst.txt
diff --git a/docs/_sources/genieclust_cluster_validity.rst.txt b/docs/_sources/genieclust_cluster_validity.rst.txt
diff --git a/docs/_sources/genieclust_compare_partitions.rst.txt b/docs/_sources/genieclust_compare_partitions.rst.txt
diff --git a/docs/_sources/genieclust_genie.rst.txt b/docs/_sources/genieclust_genie.rst.txt
diff --git a/docs/_sources/genieclust_inequity.rst.txt b/docs/_sources/genieclust_inequity.rst.txt
diff --git a/docs/_sources/genieclust_internal.rst.txt b/docs/_sources/genieclust_internal.rst.txt
diff --git a/docs/_sources/genieclust_plots.rst.txt b/docs/_sources/genieclust_plots.rst.txt
diff --git a/docs/_sources/genieclust_tools.rst.txt b/docs/_sources/genieclust_tools.rst.txt