Commit

#75: nmslib is now optional
gagolews committed Sep 15, 2022
1 parent 2a95b7e commit 6258ca7
Showing 84 changed files with 266 additions and 4,276 deletions.
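The commit title says that *nmslib* becomes an optional dependency. A common way to support this in Python is to probe for the module and fail only when a feature that needs it is requested. The following is a minimal sketch of that pattern, not the code from this commit; `module_available` and `require_optional` are hypothetical helper names.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be located."""
    # find_spec returns None (no exception) for a missing top-level module.
    return importlib.util.find_spec(name) is not None


def require_optional(name: str, feature: str) -> None:
    """Raise a descriptive ImportError if an optional dependency is missing."""
    if not module_available(name):
        raise ImportError(
            f"{feature} requires the optional package '{name}'; "
            f"install it with: pip install {name}"
        )


if __name__ == "__main__":
    # Reports whether the optional package is present in this environment.
    print("nmslib available:", module_available("nmslib"))
```

With this approach, importing the main package never fails because of the missing extra; only the code path that needs it (here, presumably sparse or string inputs) would call the check.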
16 changes: 8 additions & 8 deletions .devel/sphinx/bibliography.bib
@@ -19,6 +19,14 @@ @article{genieclust
pages = {100722}
}

@misc{clustering_benchmarks_v1,
author = {M. Gagolewski and others},
title = {Benchmark Suite for Clustering Algorithms -- Version 1},
year = {2020},
url = {https://github.com/gagolews/clustering-benchmarks},
doi = {10.5281/zenodo.3815066}
}

@misc{aaa,
author = {M. Gagolewski},
title = {Adjusted asymmetric accuracy: {A} well-behaving external cluster validity measure},
@@ -251,14 +259,6 @@ @incollection{dbscan
pages = {226--231}
}

@misc{clustering_benchmarks_v1,
author = {M. Gagolewski and others},
title = {Benchmark Suite for Clustering Algorithms -- Version 1},
year = {2020},
url = {https://github.com/gagolews/clustering-benchmarks},
doi = {10.5281/zenodo.3815066}
}

@inproceedings{sklearn_api,
author = {L. Buitinck and others},
title = {{API} design for machine learning software: {E}xperiences from the scikit-learn project},
2 changes: 1 addition & 1 deletion .devel/sphinx/index.md
@@ -35,7 +35,7 @@ for datasets of millions of points, can be completed within
Therefore, it is capable of solving **extreme clustering tasks**
(large datasets with any number of clusters to detect)
on data that fit into memory.
Thanks to the use of *nmslib* {cite}`nmslib`,
Thanks to the use of *nmslib* {cite}`nmslib` (if available),
sparse or string inputs are also supported.

Genie also allows clustering with respect to mutual reachability distances
5 changes: 5 additions & 0 deletions .devel/sphinx/news.md
@@ -1,6 +1,11 @@
# What Is New in *genieclust*


## 1.1.1 (2022-09-15)

* [Python] #75: `nmslib` is now optional.


## 1.1.0 (2022-09-05)

* [GENERAL] The below-mentioned cluster validity measures are discussed
20 changes: 14 additions & 6 deletions .devel/sphinx/weave/sparse.rst
@@ -6,6 +6,12 @@ To illustrate how *genieclust* handles
let's perform a simple exercise in movie recommendation based on
`MovieLens <https://grouplens.org/datasets/movielens/latest/>`_ data.

.. important::

Make sure that the *nmslib* package (an optional dependency) is installed.




.. code-block:: python
@@ -45,7 +51,7 @@ and map the movie IDs to consecutive integers.



Then we read the movie meta data and transform the movie IDs
Then we read the movie metadata and transform the movie IDs
in the same way:


@@ -110,11 +116,11 @@ First few observations:



Let's extract 200 clusters with Genie with respect to cosine similarity between films' ratings
Let's extract 200 clusters with Genie with respect to the cosine similarity between films' ratings
as given by users (two movies considered similar if they get similar reviews).
Sparse inputs are supported by the approximate version of the algorithm
which relies on the
near-neighbour search routines implemented in the `nmslib` package.
near-neighbour search routines implemented in the *nmslib* package.



@@ -152,6 +158,7 @@ Here are the members of an example cluster:
## 2084 Bowfinger (1999)
## 2190 Boys Don't Cry (1999)
## 2888 Cell, The (2000)
## 832 Doors, The (1991)
## 955 Duck Soup (1933)
## 836 E.T. the Extra-Terrestrial (1982)
## 1960 Election (1999)
@@ -180,6 +187,7 @@ Here are the members of an example cluster:
## 898 Star Wars: Episode V - The Empire Strikes Back...
## 911 Star Wars: Episode VI - Return of the Jedi (1983)
## 934 Sting, The (1973)
## 2030 Summer of Sam (1999)
## 987 This Is Spinal Tap (1984)
## 2174 Three Kings (1999)
## 839 Top Gun (1986)
@@ -195,10 +203,10 @@ Here are the members of an example cluster:

The above was performed on an abridged version of the MovieLens dataset.
The project's `website <https://grouplens.org/datasets/movielens/latest/>`_
also features a full database that yields a 53889x283228 ratings table
(with 27753444 non-zero elements) -- such a matrix would definitely
also features a full database that yields a 53,889x283,228 ratings table
(with 27,753,444 non-zero elements) -- such a matrix would definitely
not fit into our RAM if it was in the dense form.
Determining the whole cluster hierarchy takes only 144 secs.
Determining the whole cluster hierarchy takes only 144 seconds.
Here is one of 500 clusters extracted:

.. code::
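The sparse example above clusters movies with respect to the cosine similarity between their rating vectors. As a rough illustration of what that measure computes on sparse data, here is a minimal sketch using only the standard library. It is not how *genieclust* or *nmslib* evaluate it internally; `cosine_similarity` is a hypothetical helper, and the ratings are made-up toy values.

```python
import math


def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the shorter vector; only shared indices contribute
    # to the dot product, so missing ratings cost nothing.
    if len(u) > len(v):
        u, v = v, u
    dot = sum(value * v[index] for index, value in u.items() if index in v)
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Toy data (made up): user index -> rating, one dict per movie.
movie_a = {0: 5.0, 1: 4.0, 7: 1.0}
movie_b = {0: 4.0, 1: 5.0, 9: 2.0}
movie_c = {3: 3.0, 4: 5.0}

print(cosine_similarity(movie_a, movie_b))  # rated similarly by shared users
print(cosine_similarity(movie_a, movie_c))  # no users in common
```

Two movies rated by disjoint sets of users get a similarity of zero, which is why the measure pairs naturally with a sparse ratings table.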
18 changes: 12 additions & 6 deletions .devel/sphinx/weave/sparse.rstw
@@ -6,6 +6,12 @@ To illustrate how *genieclust* handles
let's perform a simple exercise in movie recommendation based on
`MovieLens <https://grouplens.org/datasets/movielens/latest/>`_ data.

.. important::

Make sure that the *nmslib* package (an optional dependency) is installed.



<<sparse-example-imports>>=
import numpy as np
import scipy.sparse
@@ -39,7 +45,7 @@ ratings["movieId"] = np.searchsorted(old_movieId_map, ratings["movieId"])
ratings.head()
@

Then we read the movie meta data and transform the movie IDs
Then we read the movie metadata and transform the movie IDs
in the same way:

<<sparse-example-movies>>=
@@ -69,11 +75,11 @@ First few observations:
X[:5, :10].todense()
@

Let's extract 200 clusters with Genie with respect to cosine similarity between films' ratings
Let's extract 200 clusters with Genie with respect to the cosine similarity between films' ratings
as given by users (two movies considered similar if they get similar reviews).
Sparse inputs are supported by the approximate version of the algorithm
which relies on the
near-neighbour search routines implemented in the `nmslib` package.
near-neighbour search routines implemented in the *nmslib* package.


<<sparse-example-cluster>>=
@@ -95,10 +101,10 @@ movies.loc[movies.cluster == int(which_cluster)].title.sort_values()

The above was performed on an abridged version of the MovieLens dataset.
The project's `website <https://grouplens.org/datasets/movielens/latest/>`_
also features a full database that yields a 53889x283228 ratings table
(with 27753444 non-zero elements) -- such a matrix would definitely
also features a full database that yields a 53,889x283,228 ratings table
(with 27,753,444 non-zero elements) -- such a matrix would definitely
not fit into our RAM if it was in the dense form.
Determining the whole cluster hierarchy takes only 144 secs.
Determining the whole cluster hierarchy takes only 144 seconds.
Here is one of 500 clusters extracted:

.. code::
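The text above states that the full 53,889x283,228 ratings table, with 27,753,444 non-zero elements, would not fit into RAM in dense form. A back-of-the-envelope check is sketched below. The element sizes are assumptions (8-byte floating-point values and 4-byte integer indices in a CSR-like layout), not figures taken from the documentation.

```python
# Dimensions quoted in the documentation text.
n_rows, n_cols = 53_889, 283_228
nnz = 27_753_444

# Assumptions: 8-byte values, 4-byte indices (CSR-like layout).
VALUE_BYTES = 8
INDEX_BYTES = 4

# A dense array stores every cell, zero or not.
dense_bytes = n_rows * n_cols * VALUE_BYTES

# CSR stores one value and one column index per non-zero element,
# plus n_rows + 1 row pointers.
csr_bytes = nnz * (VALUE_BYTES + INDEX_BYTES) + (n_rows + 1) * INDEX_BYTES

print(f"dense:   {dense_bytes / 1e9:.1f} GB")
print(f"CSR:     {csr_bytes / 1e9:.2f} GB")
print(f"density: {nnz / (n_rows * n_cols):.4%}")
```

Under these assumptions the dense table needs roughly 122 GB, whereas the sparse representation takes about a third of a gigabyte, since fewer than 0.2% of the cells are non-zero.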
11 changes: 8 additions & 3 deletions .devel/sphinx/weave/string.rst
@@ -6,7 +6,12 @@ data. Let's perform an example grouping based
on `Levenshtein's <https://en.wikipedia.org/wiki/Levenshtein_distance>`_ edit
distance.

We'll use one of the benchmark datasets mentioned in :cite:`genieins`
.. important::

Make sure that the *nmslib* package (an optional dependency) is installed.


We will use one of the benchmark datasets mentioned in :cite:`genieins`
as an example:


@@ -27,7 +32,7 @@ as an example:
::

## /tmp/ipykernel_56999/1616393685.py:3: DeprecationWarning: `np.str` is
## /tmp/ipykernel_15571/1616393685.py:3: DeprecationWarning: `np.str` is
## a deprecated alias for the builtin `str`. To silence this warning, use
## `str` by itself. Doing this will not modify any behavior and is safe.
## If you specifically wanted the numpy scalar type, use `np.str_` here.
@@ -64,7 +69,7 @@ by an expert:


Clustering in the string domain relies on the
near-neighbour search routines implemented in the `nmslib` package.
near-neighbour search routines implemented in the *nmslib* package.


.. code-block:: python
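The string example above groups sequences by Levenshtein's edit distance, which *nmslib* evaluates natively. For readers unfamiliar with the metric, the textbook dynamic-programming formulation can be sketched as follows. This is an illustrative reference implementation, not the routine used by *nmslib* or *genieclust*; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to bound memory use.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

The run time is proportional to the product of the two string lengths, which is one reason near-neighbour search libraries are helpful when many strings must be compared.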
9 changes: 7 additions & 2 deletions .devel/sphinx/weave/string.rstw
@@ -6,7 +6,12 @@ data. Let's perform an example grouping based
on `Levenshtein's <https://en.wikipedia.org/wiki/Levenshtein_distance>`_ edit
distance.

We'll use one of the benchmark datasets mentioned in :cite:`genieins`
.. important::

Make sure that the *nmslib* package (an optional dependency) is installed.


We will use one of the benchmark datasets mentioned in :cite:`genieins`
as an example:


@@ -46,7 +51,7 @@ print(n_clusters)


Clustering in the string domain relies on the
near-neighbour search routines implemented in the `nmslib` package.
near-neighbour search routines implemented in the *nmslib* package.

<<string-example-cluster>>=
import genieclust
4 changes: 2 additions & 2 deletions .github/workflows/cibuildwheel.yml
@@ -6,7 +6,7 @@ env:
# nmslib does not build on 32bit Windows
# https://cibuildwheel.readthedocs.io/en/stable/options/
# cp39-win_amd64
CIBW_SKIP: cp2* pp* cp35* cp36* cp37-win32 cp38-win32 cp39-win* cp310-win* cp311-win* cp310-manylinux_i686 cp311-manylinux_i686 *-musllinux*
CIBW_SKIP: cp2* pp* cp35* cp36* cp37-win32 cp38-win32 cp39-win32 cp310-win32 cp311-win32 cp310-manylinux_i686 cp311-manylinux_i686 *-musllinux*
CIBW_BEFORE_BUILD: pip install -r requirements.txt --upgrade

#[ubuntu-latest, windows-latest, macos-latest]
@@ -18,7 +18,7 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-20.04, windows-2019, macOS-11]
os: [windows-2019, macOS-11, ubuntu-20.04]

steps:
- uses: actions/checkout@v3
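The `CIBW_SKIP` change above replaces the `cp39-win*`, `cp310-win*`, and `cp311-win*` patterns with their `-win32` counterparts. To see which build identifiers this affects, the sketch below applies shell-style glob matching from Python's `fnmatch` module to a few identifiers. It assumes that the patterns behave like plain globs; `is_skipped` is a hypothetical helper and not part of cibuildwheel.

```python
from fnmatch import fnmatchcase

# Pattern lists copied from the old and the new workflow file.
OLD_SKIP = (
    "cp2* pp* cp35* cp36* cp37-win32 cp38-win32 cp39-win* cp310-win* "
    "cp311-win* cp310-manylinux_i686 cp311-manylinux_i686 *-musllinux*"
)
NEW_SKIP = (
    "cp2* pp* cp35* cp36* cp37-win32 cp38-win32 cp39-win32 cp310-win32 "
    "cp311-win32 cp310-manylinux_i686 cp311-manylinux_i686 *-musllinux*"
)


def is_skipped(build_id: str, skip: str) -> bool:
    """Return True if any space-separated glob in `skip` matches `build_id`."""
    return any(fnmatchcase(build_id, pattern) for pattern in skip.split())


for build_id in ("cp39-win_amd64", "cp310-win_amd64", "cp311-win_amd64", "cp39-win32"):
    print(build_id, is_skipped(build_id, OLD_SKIP), is_skipped(build_id, NEW_SKIP))
```

Under that assumption, the 64-bit Windows identifiers for CPython 3.9 to 3.11 are no longer matched by the skip list, while the 32-bit ones still are.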
6 changes: 5 additions & 1 deletion .github/workflows/py.yml
@@ -32,7 +32,11 @@ jobs:
pip install flake8 pytest --upgrade
pip install sphinx numpydoc sphinx_rtd_theme sphinx-bootstrap-theme sphinxcontrib-jsmath sphinxcontrib-bibtex myst_parser --upgrade
pip install rpy2 pweave ipython jupyter tabulate --upgrade
- name: Install optional dependencies
- name: Install optional dependencies (nmslib)
continue-on-error: true
run: |
pip install nmslib --upgrade
- name: Install optional dependencies (mlpack)
continue-on-error: true
run: |
pip install mlpack --upgrade
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,8 +1,8 @@
Package: genieclust
Type: Package
Title: Fast and Robust Hierarchical Clustering with Noise Points Detection
Version: 1.1.0
Date: 2022-09-05
Version: 1.1.1
Date: 2022-09-15
Authors@R: c(
person("Marek", "Gagolewski",
role = c("aut", "cre", "cph"),
3 changes: 2 additions & 1 deletion Makefile
@@ -75,11 +75,12 @@ news:
html: python r news weave rd2myst weave-examples
rm -rf .devel/sphinx/_build/
cd .devel/sphinx && make html
rm -rf .devel/sphinx/_build/html/_sources
@echo "*** Browse the generated documentation at"\
"file://`pwd`/.devel/sphinx/_build/html/index.html"

docs: html
@echo "*** Making 'docs' is only recommended when publishing an"\
@echo "*** Making 'docs' is only recommended when publishing the"\
"official release, because it updates the package homepage."
@echo "*** Therefore, we check if the package version is like 1.2.3"\
"and not 1.2.2.9007."
5 changes: 5 additions & 0 deletions NEWS
@@ -1,6 +1,11 @@
# What Is New in *genieclust*


## 1.1.1 (2022-09-15)

* [Python] #75: `nmslib` is now optional.


## 1.1.0 (2022-09-05)

* [GENERAL] The below-mentioned cluster validity measures are discussed
11 changes: 6 additions & 5 deletions README.md
@@ -44,8 +44,8 @@ graphs, Genie is also **very fast** – determining the whole cluster hierarchy
for datasets of millions of points can be completed within minutes. Therefore,
it is nicely suited for solving of **extreme clustering tasks** (large datasets
with any number of clusters to detect) for data (also sparse) that fit into
memory. Thanks to the use of [**nmslib**](https://github.com/nmslib/nmslib),
sparse or string inputs are also supported.
memory. Thanks to the use of [**nmslib**](https://github.com/nmslib/nmslib)
(if available), sparse or string inputs are also supported.

It also allows clustering with respect to mutual reachability distances
so that it can act as a **noise point detector** or a
@@ -114,8 +114,8 @@ pip3 install genieclust
```

The package requires Python 3.7+ together with **cython** as well as
**numpy**, **scipy**, **matplotlib**, **nmslib**, and **scikit-learn**.
Optional dependency: **mlpack**.
**numpy**, **scipy**, **matplotlib**, and **scikit-learn**.
Optional dependencies: **nmslib** and **mlpack**.



@@ -193,7 +193,8 @@ Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?,
[DOI: 10.1016/j.ins.2021.10.004](https://doi.org/10.1016/j.ins.2021.10.004).

Gagolewski M., *Adjusted asymmetric accuracy: A well-behaving external
cluster validity measure*, 2022, submitted for publication.
cluster validity measure*, under review (preprint),
[DOI: 10.48550/arXiv.2209.02935](https://doi.org/10.48550/arXiv.2209.02935).

Gagolewski M., *A Framework for Benchmarking Clustering Algorithms*,
2022, <https://clustering-benchmarks.gagolewski.com>.
27 changes: 0 additions & 27 deletions docs/_sources/genieclust.rst.txt

This file was deleted.

5 changes: 0 additions & 5 deletions docs/_sources/genieclust_cluster_validity.rst.txt

This file was deleted.

5 changes: 0 additions & 5 deletions docs/_sources/genieclust_compare_partitions.rst.txt

This file was deleted.

5 changes: 0 additions & 5 deletions docs/_sources/genieclust_genie.rst.txt

This file was deleted.

5 changes: 0 additions & 5 deletions docs/_sources/genieclust_inequity.rst.txt

This file was deleted.

5 changes: 0 additions & 5 deletions docs/_sources/genieclust_internal.rst.txt

This file was deleted.

5 changes: 0 additions & 5 deletions docs/_sources/genieclust_plots.rst.txt

This file was deleted.

5 changes: 0 additions & 5 deletions docs/_sources/genieclust_tools.rst.txt

This file was deleted.
