Commit v1.1.5
gagolews committed Oct 18, 2023
1 parent 1345503 commit b0a6f92
Showing 115 changed files with 2,882 additions and 3,743 deletions.
12 changes: 2 additions & 10 deletions .devel/sphinx/bibliography.bib
@@ -18,17 +18,9 @@ @misc{nca
note = {under review (preprint)}
}

@misc{clustering_benchmarks_v1,
author = {M. Gagolewski and others},
title = {Benchmark Suite for Clustering Algorithms -- Version 1},
year = {2020},
url = {https://github.com/gagolews/clustering-benchmarks},
doi = {10.5281/zenodo.3815066}
}

@misc{Gagolewski2022:clustering-data-v1.1.0,
author = {M. Gagolewski and others},
title = {A benchmark suite for clustering algorithms: Version 1.1.0},
title = {A benchmark suite for clustering algorithms: {V}ersion 1.1.0},
year = {2022},
url = {https://github.com/gagolews/clustering-data-v1/releases/tag/v1.1.0},
doi = {10.5281/zenodo.7088171}
@@ -47,7 +39,7 @@ @article{clustering-benchmarks

@book{datawranglingpy,
author = {M. Gagolewski},
title = {Minimalist Data Wrangling with Python},
title = {Minimalist Data Wrangling with {P}ython},
doi = {10.5281/zenodo.6451068},
isbn = {978-0-6455719-1-2},
publisher = {Zenodo},
9 changes: 4 additions & 5 deletions .devel/sphinx/news.md
@@ -1,7 +1,6 @@
# Changelog


## 1.1.4.9xxx
## 1.1.5 (2023-10-18)

* [BACKWARD INCOMPATIBILITY] [Python and R] Inequality measures
are no longer referred to as inequity measures.
@@ -66,9 +65,6 @@

## 1.1.0 (2022-09-05)

* [GENERAL] The below-mentioned cluster validity measures are discussed
in more detail at <https://clustering-benchmarks.gagolewski.com>.

* [Python and R] New function: `adjusted_asymmetric_accuracy`.

* [Python and R] Implementations of the so-called internal cluster
@@ -89,6 +85,9 @@
`silhouette_w_index`,
`wcnn_index`.

These cluster validity measures are discussed
in more detail at <https://clustering-benchmarks.gagolewski.com>.

* [BACKWARD INCOMPATIBILITY] `normalized_confusion_matrix`
now solves the maximal assignment problem instead of applying
the somewhat primitive partial pivoting.
35 changes: 10 additions & 25 deletions .devel/sphinx/weave/Makefile
@@ -3,40 +3,25 @@
FILES_RMD = \
basics.Rmd \
sklearn_toy_example.Rmd \
r.Rmd
noise.Rmd \
r.Rmd \
benchmarks_approx.Rmd \
benchmarks_ar.Rmd \
benchmarks_details.Rmd \
timings.Rmd


FILES_RSTW = \
benchmarks_ar.rstw \
benchmarks_details.rstw \
benchmarks_approx.rstw \
noise.rstw \
timings.rstw

# string.rstw \
# sparse.rstw \
# sparse.Rmd \
# string.Rmd \

RMD_MD_OUTPUTS=$(patsubst %.Rmd,%.md,$(FILES_RMD))
#RMD_RST_OUTPUTS=$(patsubst %.Rmd,%.rst,$(FILES_RMD))

RSTW_RST_OUTPUTS=$(patsubst %.rstw,%.rst,$(FILES_RSTW))

%.md: %.Rmd
./Rmd2md.sh "$<"

#%.rst: %.md
# pandoc -f markdown+grid_tables --wrap=none "$<" -o "$@"

%.rst: %.rstw
./pweave_custom.py "$<" "$@"


all : rmd rstw
all : rmd

rmd : $(RMD_MD_OUTPUTS)

rstw : $(RSTW_RST_OUTPUTS)

clean:
rm -f $(RSTW_RST_OUTPUTS) $(RMD_MD_OUTPUTS)
rm -f $(RMD_MD_OUTPUTS)
@@ -1,29 +1,25 @@
Benchmarks — Approximate Method
===============================
# Benchmarks — Approximate Method

In one of the :any:`previous sections <timings>` we have demonstrated that the approximate version
of the Genie algorithm (:class:`genieclust.Genie(exact=False, ...) <genieclust.Genie>`), i.e.,
one which relies on `nmslib <https://github.com/nmslib/nmslib/tree/master/python_bindings>`_\ 's
approximate nearest neighbour search, is much faster than the exact one
on large, high-dimensional datasets. In particular, we have noted that
clustering of 1 million points in a 100d Euclidean space
takes less than 5 minutes on a laptop.
In one of the [previous sections](timings), we have demonstrated that the approximate version
of the Genie algorithm ([`genieclust.Genie(exact=False, ...)`](genieclust.Genie)), i.e.,
one which relies on `nmslib`'s {cite}`nmslib` approximate nearest neighbour search,
is much faster than the exact one on large, high-dimensional datasets.
In particular, we have noted that clustering of 1 million points
in a 100d Euclidean space takes less than 5 minutes on a laptop.

As *fast* does not necessarily mean *meaningful* (tl;dr spoiler alert: in our case, it does),
let's again consider all the datasets
from the `Benchmark Suite for Clustering Algorithms Version 1 <https://github.com/gagolews/clustering-benchmarks>`_
:cite:`clustering_benchmarks_v1`
(except the ``h2mg`` and ``g2mg`` batteries). Features with variance of 0 were
from the [Benchmark Suite for Clustering Algorithms (Version 1.0)](https://clustering-benchmarks.gagolewski.com)
{cite}`clustering-benchmarks`
(except the `h2mg` and `g2mg` batteries). Features with variance of 0 were
removed, datasets were centred at **0** and scaled so that they have total
variance of 1. Tiny bit of Gaussian noise was added to each observation.
Clustering is performed with respect to the Euclidean distance.






<<bench-approx-imports,results="hidden",echo=False>>=
```{python bench-approx-imports,results="hide",echo=FALSE}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
@@ -50,11 +46,11 @@ res = pd.read_csv("v1-timings.csv") # see timings.py
dims = pd.read_csv("v1-dims.csv")
dims["dataset"] = dims["battery"]+"/"+dims["dataset"]
dims = dims.loc[:,"dataset":]
@
```



<<approx-diffs-load,results="hidden",echo=False>>=
```{python approx-diffs-load,results="hide",echo=FALSE}
# Load results file:
res = pd.read_csv("v1-scores-approx.csv")
# ari, afm can be negative --> replace negative indexes with 0.0
@@ -80,20 +76,20 @@ params.columns = ["method", "gini_threshold", "run"]
res_max = pd.concat((res_max.drop("method", axis=1), params), axis=1)
res_max["dataset"] = res_max["battery"] + "/" + res_max["dataset"]
res_max = res_max.iloc[:, 1:]
@
```



On each benchmark dataset ("small" and "large" altogether)
we have fired 10 runs of the approximate Genie method (``exact=False``)
we have fired 10 runs of the approximate Genie method (`exact=False`)
and computed the adjusted Rand (AR) indices to quantify the similarity between the predicted
outputs and the reference ones.

We've computed the differences between each of the 10 AR indices
and the AR index for the exact method. Here is the complete list of datasets
and `gini_threshold`\ s where this discrepancy is seen at least 2 digits of precision:
and `gini_threshold`s where this discrepancy is seen at least 2 digits of precision:

<<approx-diffs,results="rst",echo=False>>=
```{python approx-diffs,results="asis",echo=FALSE}
# which similarity measure to report below:
similarity_measure = "ar"
@@ -106,35 +102,35 @@ _dat = diffs_stats.loc[(np.abs(diffs_stats["min"])>=0.0095)|(np.abs(diffs_stats[
#_dat = _dat.drop("count", axis=1)
which_repeated = (_dat.dataset.shift(1) == _dat.dataset)
_dat.loc[which_repeated, "dataset"] = ""
print(tabulate(_dat, _dat.columns, tablefmt="rst", showindex=False), "\n\n")
@
print(tabulate(_dat, _dat.columns, tablefmt="github", showindex=False), "\n\n")
```


The only noteworthy difference is for the ``sipu/birch2`` dataset
The only noteworthy difference is for the `sipu/birch2` dataset
where we observe that the approximate method generates worse results
(although recall that `gini_threshold` of 1 corresponds to the single linkage method).
Interestingly, for ``sipu/worms_64``, the in-exact algorithm with `gini_threshold`
Interestingly, for `sipu/worms_64`, the in-exact algorithm with `gini_threshold`
of 0.5 yields a much better outcome than the original one.


Here are the descriptive statistics for the AR indices across all the datasets
(for the approximate method we chose the median AR in each of the 10 runs):

<<approx-ar,results="rst",echo=False>>=
```{python approx-ar,results="asis",echo=FALSE}
_dat = res_max.groupby(["dataset", "method"])[similarity_measure].\
median().reset_index().groupby(["method"]).describe().\
round(3).reset_index()
_dat.columns = [l0 if not l1 else l1 for l0, l1 in _dat.columns]
_dat.method
#_dat.method
#which_repeated = (_dat.gini_threshold.shift(1) == _dat.gini_threshold)
#_dat.loc[which_repeated, "gini_threshold"] = ""
#_dat = _dat.drop("count", axis=1)
print(tabulate(_dat, _dat.columns, tablefmt="rst", showindex=False), "\n\n")
@
print(tabulate(_dat, _dat.columns, tablefmt="github", showindex=False), "\n\n")
```


For the recommended ranges of the `gini_threshold` parameter,
i.e., between 0.1 and 0.5, we see that the approximate version of Genie
behaves as good as the original one.
behaves similarly to the original one.
83 changes: 83 additions & 0 deletions .devel/sphinx/weave/benchmarks_approx.md
@@ -0,0 +1,83 @@




# Benchmarks — Approximate Method

In one of the [previous sections](timings), we have demonstrated that the approximate version
of the Genie algorithm ([`genieclust.Genie(exact=False, ...)`](genieclust.Genie)), i.e.,
one which relies on `nmslib`'s {cite}`nmslib` approximate nearest neighbour search,
is much faster than the exact one on large, high-dimensional datasets.
In particular, we have noted that clustering of 1 million points
in a 100d Euclidean space takes less than 5 minutes on a laptop.

As *fast* does not necessarily mean *meaningful* (tl;dr spoiler alert: in our case, it does),
let's again consider all the datasets
from the [Benchmark Suite for Clustering Algorithms (Version 1.0)](https://clustering-benchmarks.gagolewski.com)
{cite}`clustering-benchmarks`
(except the `h2mg` and `g2mg` batteries). Features with zero variance were
removed, each dataset was centred at **0** and scaled so that its total
variance equals 1. A tiny bit of Gaussian noise was added to each observation.
Clustering is performed with respect to the Euclidean distance.
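The preprocessing steps described above can be sketched as follows (a hypothetical illustration on random data, not the actual benchmark code):

```python
import numpy as np

rng = np.random.default_rng(123)
X = rng.normal(size=(500, 8))
X[:, 3] = 7.0  # a constant, i.e., zero-variance, feature

# Drop features with zero variance
X = X[:, X.var(axis=0) > 0]

# Centre at 0 and scale so that the *total* variance equals 1
X = X - X.mean(axis=0)
X = X / np.sqrt(X.var(axis=0).sum())

# Add a tiny bit of Gaussian noise to each observation
X = X + rng.normal(scale=1e-6, size=X.shape)
```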












On each benchmark dataset (the "small" and "large" batteries alike),
we performed 10 runs of the approximate Genie method (`exact=False`)
and computed the adjusted Rand (AR) indices to quantify the similarity between the predicted
outputs and the reference ones.
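The AR index itself can be computed from the contingency (confusion) matrix of the two label vectors; here is a self-contained sketch (equivalent ready-made functions exist, e.g., `sklearn.metrics.adjusted_rand_score`):

```python
import numpy as np

def adjusted_rand_index(u, v):
    """Chance-adjusted agreement between two label vectors."""
    u, v = np.asarray(u), np.asarray(v)
    # Contingency table: c[i, j] = number of points with labels (i, j)
    _, ui = np.unique(u, return_inverse=True)
    _, vi = np.unique(v, return_inverse=True)
    c = np.zeros((ui.max() + 1, vi.max() + 1), dtype=np.int64)
    np.add.at(c, (ui, vi), 1)

    comb2 = lambda x: x * (x - 1) // 2  # number of unordered pairs
    sum_ij = comb2(c).sum()             # pairs together in both partitions
    sum_a = comb2(c.sum(axis=1)).sum()  # pairs together in the first
    sum_b = comb2(c.sum(axis=0)).sum()  # pairs together in the second
    expected = sum_a * sum_b / comb2(c.sum())
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# A perfect match up to a permutation of labels gives 1.0
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```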

We've computed the differences between each of the 10 AR indices
and the AR index for the exact method. Here is the complete list of datasets
and `gini_threshold`s where this discrepancy is apparent at two decimal digits of precision:

| dataset | gini_threshold | count | mean | std | min | 25% | 50% | 75% | max |
|------------------|------------------|---------|--------|-------|-------|-------|-------|-------|-------|
| sipu/birch2 | 0.7 | 10 | -0.01 | 0.01 | -0.02 | -0.02 | -0.01 | -0.01 | 0 |
| | 1 | 10 | -0.35 | 0.18 | -0.44 | -0.44 | -0.43 | -0.43 | 0 |
| sipu/worms_64 | 0.1 | 10 | -0.03 | 0.01 | -0.06 | -0.03 | -0.02 | -0.02 | -0.02 |
| | 0.3 | 10 | 0.02 | 0.01 | -0.01 | 0.02 | 0.03 | 0.03 | 0.03 |
| | 0.5 | 10 | 0.23 | 0.08 | 0.11 | 0.16 | 0.25 | 0.29 | 0.34 |
| wut/trajectories | 0.1 | 10 | -0 | 0.02 | -0.05 | 0 | 0 | 0 | 0 |
| | 0.3 | 10 | -0 | 0.02 | -0.05 | 0 | 0 | 0 | 0 |
| | 0.5 | 10 | -0 | 0.02 | -0.05 | 0 | 0 | 0 | 0 |
| | 0.7 | 10 | -0 | 0.02 | -0.05 | 0 | 0 | 0 | 0 |
| | 1 | 10 | -0.1 | 0.32 | -1 | 0 | 0 | 0 | 0 |


The only noteworthy difference is for the `sipu/birch2` dataset
where we observe that the approximate method generates worse results
(although recall that `gini_threshold` of 1 corresponds to the single linkage method).
Interestingly, for `sipu/worms_64`, the inexact algorithm with `gini_threshold`
of 0.5 yields a much better outcome than the original one.


Here are the descriptive statistics for the AR indices across all the datasets
(for the approximate method we chose the median AR in each of the 10 runs):

| method | count | mean | std | min | 25% | 50% | 75% | max |
|------------------|---------|--------|-------|-------|-------|-------|-------|-------|
| Genie_0.1 | 79 | 0.728 | 0.307 | 0 | 0.516 | 0.844 | 1 | 1 |
| Genie_0.1_approx | 79 | 0.728 | 0.307 | 0 | 0.516 | 0.844 | 1 | 1 |
| Genie_0.3 | 79 | 0.755 | 0.292 | 0 | 0.555 | 0.9 | 1 | 1 |
| Genie_0.3_approx | 79 | 0.755 | 0.292 | 0 | 0.568 | 0.9 | 1 | 1 |
| Genie_0.5 | 79 | 0.731 | 0.332 | 0 | 0.531 | 0.844 | 1 | 1 |
| Genie_0.5_approx | 79 | 0.734 | 0.326 | 0 | 0.531 | 0.844 | 1 | 1 |
| Genie_0.7 | 79 | 0.624 | 0.376 | 0 | 0.264 | 0.719 | 1 | 1 |
| Genie_0.7_approx | 79 | 0.624 | 0.376 | 0 | 0.264 | 0.719 | 1 | 1 |
| Genie_1.0 | 79 | 0.415 | 0.447 | 0 | 0 | 0.174 | 1 | 1 |
| Genie_1.0_approx | 79 | 0.409 | 0.45 | 0 | 0 | 0.148 | 1 | 1 |


For the recommended ranges of the `gini_threshold` parameter,
i.e., between 0.1 and 0.5, we see that the approximate version of Genie
behaves similarly to the original one.
