v1.1.0
gagolews committed Sep 5, 2022
1 parent 3331a63 commit 38056a1
Showing 195 changed files with 6,075 additions and 1,536 deletions.
10 changes: 5 additions & 5 deletions .devel/sphinx/index.md
@@ -118,6 +118,10 @@ which we use for computing the normalised accuracy and pair sets index).

About <self>
Author <https://www.gagolewski.com/>
Source Code (GitHub) <https://github.com/gagolews/genieclust>
Bug Tracker and Feature Suggestions <https://github.com/gagolews/genieclust/issues>
PyPI Entry <https://pypi.org/project/genieclust/>
CRAN Entry <https://CRAN.R-project.org/package=genieclust>
::::


@@ -149,11 +153,7 @@ rapi
:maxdepth: 1
:caption: See Also

Source Code (GitHub) <https://github.com/gagolews/genieclust>
Bug Tracker and Feature Suggestions <https://github.com/gagolews/genieclust/issues>
PyPI Entry <https://pypi.org/project/genieclust/>
CRAN Entry <https://CRAN.R-project.org/package=genieclust>
Clustering Benchmarks <https://github.com/gagolews/clustering-benchmarks>
Clustering Benchmarks <https://clustering-benchmarks.gagolewski.com>
Data Wrangling in Python <https://datawranglingpy.gagolewski.com/>
::::

10 changes: 5 additions & 5 deletions .devel/sphinx/news.md
@@ -1,18 +1,18 @@
# What Is New in *genieclust*


## 1.1.x (under development)
## 1.1.0 (2022-09-05)

- [GENERAL] ..TO DO.. We now mention that the partition similarity scores
are discussed in more detail at

- [GENERAL] The cluster validity measures are discussed in more detail at
<https://clustering-benchmarks.gagolewski.com>.

- [Python and R] New function:
`compare_partitions.adjusted_asymmetric_accuracy`.

- [Python and R] Implementations of the so-called internal cluster
validity measures discussed in
DOI:[10.1016/j.ins.2021.10.004](https://doi.org/10.1016/j.ins.2021.10.004);
DOI: [10.1016/j.ins.2021.10.004](https://doi.org/10.1016/j.ins.2021.10.004);
see our (GitHub-only) [CVI](https://github.com/gagolews/optim_cvi) package
for R. In particular, the generalised Dunn indices are based on the code
originally authored by Maciej Bartoszuk. Thanks.
@@ -47,7 +47,7 @@

- [GENERAL] A paper on the `genieclust` package is now available:
M. Gagolewski, genieclust: Fast and robust hierarchical clustering,
*SoftwareX* **15**, 100722, 2021, DOI:
SoftwareX 15, 100722, 2021, DOI:
[10.1016/j.softx.2021.100722](https://doi.org/10.1016/j.softx.2021.100722).

- [Python] `plots.plot_scatter` now uses a more accessible default palette
12 changes: 6 additions & 6 deletions .devel/sphinx/rapi/comparing_partitions.md
@@ -36,11 +36,11 @@ normalizing_permutation(x, y = NULL)

## Arguments

| | |
|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `x` | an integer vector of length n (or an object coercible to) representing a K-partition of an n-set (e.g., a reference partition), or a confusion matrix with K rows and L columns (see [`table(x, y)`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html)) |
| `y` | an integer vector of length n (or an object coercible to) representing an L-partition of the same set (e.g., the output of a clustering algorithm we wish to compare with `x`), or NULL (if x is an K\*L confusion matrix) |
| `whether` | to assume E=1 in the definition of the pair sets index index, i.e., use Eq. (20) instead of (18); see (Rezaei, Franti, 2016). |
| | |
|--------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `x` | an integer vector of length n (or an object coercible to) representing a K-partition of an n-set (e.g., a reference partition), or a confusion matrix with K rows and L columns (see [`table(x, y)`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html)) |
| `y` | an integer vector of length n (or an object coercible to) representing an L-partition of the same set (e.g., the output of a clustering algorithm we wish to compare with `x`), or NULL (if x is an K\*L confusion matrix) |
| `simplified` | whether to assume E=1 in the definition of the pair sets index, i.e., use Eq. (20) instead of (18); see (Rezaei, Franti, 2016). |
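
For readers unfamiliar with the confusion-matrix form of `x`, the following Python sketch shows what such a K-by-L matrix contains. It is only an illustration: in R, the documented route is `table(x, y)`, and the helper name `confusion_matrix` is hypothetical, not part of *genieclust*.

```python
from collections import Counter


def confusion_matrix(x, y):
    """Count how often each label pair (x[i], y[i]) occurs.

    Hypothetical helper for illustration only. Rows correspond to the
    distinct labels in x (K of them), columns to those in y (L of them).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rows = sorted(set(x))
    cols = sorted(set(y))
    counts = Counter(zip(x, y))  # missing pairs count as 0
    return [[counts[(r, c)] for c in cols] for r in rows]


reference = [1, 1, 2, 2, 2]  # a 2-partition of a 5-element set
predicted = [1, 2, 2, 2, 1]  # another 2-partition of the same set
print(confusion_matrix(reference, predicted))
```

Each entry counts the points placed in the given reference cluster and the given predicted cluster, so the entries always sum to n.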

## Details

@@ -80,7 +80,7 @@ Each cluster validity measure is a single numeric value.

Gagolewski M., *A Framework for Benchmarking Clustering Algorithms*, 2022, <https://clustering-benchmarks.gagolewski.com>.

Gagolewski M., Adjusted asymmetric accuracy: An interpretable external cluster validity measure, 2022, submitted for publication.
Gagolewski M., Adjusted asymmetric accuracy: A well-behaving external cluster validity measure, 2022, submitted for publication.

Hubert L., Arabie P., Comparing partitions, *Journal of Classification* 2(1), 1985, 193-218, esp. Eqs. (2) and (4).

4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,8 +1,8 @@
Package: genieclust
Type: Package
Title: Fast and Robust Hierarchical Clustering with Noise Points Detection
Version: 1.0.0.9001
Date: 2022-08-29
Version: 1.1.0
Date: 2022-09-05
Authors@R: c(
person("Marek", "Gagolewski",
role = c("aut", "cre", "cph"),
4 changes: 3 additions & 1 deletion MANIFEST
@@ -2,20 +2,21 @@
LICENSE
MANIFEST.in
NEWS
README.rst
requirements.txt
setup.cfg
setup.py
genieclust/__init__.py
genieclust/c_argfuns.pxd
genieclust/c_compare_partitions.pxd
genieclust/c_cvi.pxd
genieclust/c_disjoint_sets.pxd
genieclust/c_genie.pxd
genieclust/c_gini_disjoint_sets.pxd
genieclust/c_inequity.pxd
genieclust/c_mst.pxd
genieclust/c_postprocess.pxd
genieclust/c_preprocess.pxd
genieclust/cluster_validity.pyx
genieclust/compare_partitions.pyx
genieclust/genie.py
genieclust/inequity.pyx
@@ -25,6 +26,7 @@ genieclust/tools.pyx
src/c_argfuns.h
src/c_common.h
src/c_compare_partitions.h
src/c_cvi.h
src/c_disjoint_sets.h
src/c_distance.h
src/c_genie.h
10 changes: 5 additions & 5 deletions NEWS
@@ -1,18 +1,18 @@
# What Is New in *genieclust*


## 1.1.x (under development)
## 1.1.0 (2022-09-05)

- [GENERAL] ..TO DO.. We now mention that the partition similarity scores
are discussed in more detail at

- [GENERAL] The cluster validity measures are discussed in more detail at
<https://clustering-benchmarks.gagolewski.com>.

- [Python and R] New function:
`compare_partitions.adjusted_asymmetric_accuracy`.

- [Python and R] Implementations of the so-called internal cluster
validity measures discussed in
DOI:[10.1016/j.ins.2021.10.004](https://doi.org/10.1016/j.ins.2021.10.004);
DOI: [10.1016/j.ins.2021.10.004](https://doi.org/10.1016/j.ins.2021.10.004);
see our (GitHub-only) [CVI](https://github.com/gagolews/optim_cvi) package
for R. In particular, the generalised Dunn indices are based on the code
originally authored by Maciej Bartoszuk. Thanks.
@@ -47,7 +47,7 @@

- [GENERAL] A paper on the `genieclust` package is now available:
M. Gagolewski, genieclust: Fast and robust hierarchical clustering,
*SoftwareX* **15**, 100722, 2021, DOI:
SoftwareX 15, 100722, 2021, DOI:
[10.1016/j.softx.2021.100722](https://doi.org/10.1016/j.softx.2021.100722).

- [Python] `plots.plot_scatter` now uses a more accessible default palette
4 changes: 2 additions & 2 deletions R/RcppExports.R
@@ -91,7 +91,7 @@
#' Gagolewski M., \emph{A Framework for Benchmarking Clustering Algorithms},
#' 2022, \url{https://clustering-benchmarks.gagolewski.com}.
#'
#' Gagolewski M., Adjusted asymmetric accuracy: An interpretable external
#' Gagolewski M., Adjusted asymmetric accuracy: A well-behaving external
#' cluster validity measure, 2022, submitted for publication.
#'
#' Hubert L., Arabie P., Comparing partitions,
@@ -126,7 +126,7 @@
#' clustering algorithm we wish to compare with \code{x}),
#' or NULL (if x is an K*L confusion matrix)
#'
#' @param whether to assume E=1 in the definition of the pair sets index index,
#' @param simplified whether to assume E=1 in the definition of the pair sets index,
#' i.e., use Eq. (20) instead of (18); see (Rezaei, Franti, 2016).
#'
#'
2 changes: 1 addition & 1 deletion README.md
@@ -231,7 +231,7 @@ Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?,
*Information Sciences* **581**, 2021, 620–636.
[DOI: 10.1016/j.ins.2021.10.004](https://doi.org/10.1016/j.ins.2021.10.004).

Gagolewski M., *Adjusted asymmetric accuracy: An interpretable external
Gagolewski M., *Adjusted asymmetric accuracy: A well-behaving external
cluster validity measure*, 2022, submitted for publication.

Gagolewski M., *A Framework for Benchmarking Clustering Algorithms*.
Binary file modified docs/_images/benchmarks_ar_plot_large_1.png
Binary file modified docs/_images/benchmarks_ar_plot_small_1.png
Binary file modified docs/_images/benchmarks_details_indices_large_1.png
Binary file modified docs/_images/benchmarks_details_indices_small_1.png
Binary file modified docs/_images/noise_noise-Genie1_1.png
Binary file modified docs/_images/noise_noise-Genie2_1.png
Binary file modified docs/_images/noise_noise-Genie3_1.png
Binary file modified docs/_images/noise_noise-HDBSCAN1_1.png
Binary file modified docs/_images/noise_noise-HDBSCAN2_1.png
Binary file modified docs/_images/noise_noise-scatter_1.png
Binary file modified docs/_images/r_ssi-map-1.png
Binary file modified docs/_images/r_ssi-oecd-dendrogram-1.png
Binary file modified docs/_images/sklearn_toy_example_clustering_1.png
Binary file modified docs/_images/timings_g2mg-plot_1.png
2 changes: 2 additions & 0 deletions docs/_sources/genieclust.rst.txt
@@ -5,6 +5,7 @@ Python Package `genieclust` Reference
.. autosummary::

genieclust.Genie
genieclust.cluster_validity
genieclust.compare_partitions
genieclust.inequity
genieclust.internal
@@ -18,6 +19,7 @@ Python Package `genieclust` Reference
:caption: Modules and Classes:

genieclust_genie
genieclust_cluster_validity
genieclust_compare_partitions
genieclust_inequity
genieclust_internal
5 changes: 5 additions & 0 deletions docs/_sources/genieclust_cluster_validity.rst.txt
@@ -0,0 +1,5 @@
genieclust.cluster_validity
=============================

.. automodule:: genieclust.cluster_validity
:members:
179 changes: 179 additions & 0 deletions docs/_sources/index.md.txt
@@ -0,0 +1,179 @@
# *genieclust*: Fast and Robust Hierarchical Clustering with Noise Point Detection

::::{epigraph}
**Genie finds meaningful clusters and is fast even on large data sets.**
::::

::::{image} _static/img/genie_toy_example.png
:class: img-right-align-always
:alt: Genie
:width: 128px
::::


The *genieclust* package {cite}`genieclust` equips Python and R users with
a faster and more powerful version of *Genie* {cite}`genieins`, a robust
and outlier-resistant clustering algorithm originally published as the R package
[*genie*](https://cran.r-project.org/web/packages/genie).

The idea behind *Genie* is beautifully simple. First, make each individual
point the sole member of its own cluster. Then, keep merging pairs
of the closest clusters, one after another. However, to **prevent
the formation of clusters of highly imbalanced sizes**,
a point group of the smallest size will sometimes be merged with its nearest
neighbour instead.
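
The merging rule described above can be sketched in a few lines of Python. This is a naive, slow illustration, not the implementation shipped in *genieclust*: the helper names `gini` and `genie_sketch`, the choice of single linkage, and the `gini_threshold=0.3` default are assumptions made for the example.

```python
from itertools import combinations
from math import dist


def gini(sizes):
    """Normalised Gini index of cluster sizes (0 means perfectly balanced)."""
    n = len(sizes)
    if n < 2:
        return 0.0
    diffs = sum(abs(a - b) for a, b in combinations(sizes, 2))
    return diffs / ((n - 1) * sum(sizes))


def genie_sketch(points, n_clusters, gini_threshold=0.3):
    """Illustrative agglomeration loop; hypothetical helper, assumed defaults."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone

    def linkage(a, b):
        # Single linkage: distance between the two closest members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        sizes = [len(c) for c in clusters]
        pairs = list(combinations(range(len(clusters)), 2))
        if gini(sizes) > gini_threshold:
            # Sizes are too imbalanced: only merges that involve
            # a cluster of the smallest size are allowed.
            smallest = min(sizes)
            pairs = [(a, b) for a, b in pairs
                     if sizes[a] == smallest or sizes[b] == smallest]
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]

    labels = [0] * len(points)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(genie_sketch(X, n_clusters=2))
```

The sketch recomputes all pairwise linkages at every step, so it only suits toy inputs; it is meant to make the role of the inequality check visible, nothing more.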

Genie's appealing simplicity goes hand in hand with its usability;
it **often outperforms other clustering approaches**
such as K-means, BIRCH, or average, Ward, and complete linkage
on {any}`benchmark data <weave/benchmarks_ar>`.

Genie is also **very fast**: determining the whole cluster hierarchy
for datasets of millions of points can be completed within
{any}`minutes <weave/timings>`.
Therefore, it is capable of solving **extreme clustering tasks**
(large datasets with any number of clusters to detect)
on data that fit into memory.
Thanks to the use of *nmslib* {cite}`nmslib`,
sparse or string inputs are also supported.

Genie also allows clustering with respect to mutual reachability distances
so that it can act as a **noise point detector** or a robustified version
of *HDBSCAN\** {cite}`hdbscan` that is able to detect a predefined
number of clusters and therefore does not depend on *DBSCAN*'s somewhat
difficult-to-set `eps` parameter.
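
To make the notion of a mutual reachability distance concrete, here is a small pure-Python sketch. It assumes the convention in which a point's core distance is the distance to its (M-1)-th nearest neighbour, so that `M=1` gives back the ordinary distances; other texts index the neighbours differently. The function name `mutual_reachability` is hypothetical and not part of the package.

```python
from math import dist


def mutual_reachability(points, M):
    """Pairwise mutual reachability distances (illustrative helper).

    Assumed convention: core[i] is the distance from point i to its
    (M-1)-th nearest neighbour, with the point itself counted as the 0-th.
    """
    n = len(points)
    if not 1 <= M <= n:
        raise ValueError("M must be between 1 and the number of points")
    d = [[dist(p, q) for q in points] for p in points]
    core = [sorted(row)[M - 1] for row in d]  # sorted(row)[0] is always 0.0
    return [
        [0.0 if i == j else max(d[i][j], core[i], core[j]) for j in range(n)]
        for i in range(n)
    ]


X = [(0, 0), (1, 0), (2, 0), (10, 0)]  # the last point is isolated
for row in mutual_reachability(X, M=2):
    print(row)
```

Distances that involve an isolated point can never fall below its large core distance, which pushes such points away from every cluster and makes them easier to flag as noise.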



The **Python version** of *genieclust* is available via
[PyPI](https://pypi.org/project/genieclust/), e.g.,
via a call to

```bash
pip3 install genieclust
```

from the command line or through your favourite package manager.
Note a familiar *scikit-learn*-like {cite}`sklearn_api` look-and-feel:

```python
import genieclust
X = ... # some data
g = genieclust.Genie(n_clusters=2)
labels = g.fit_predict(X)
```

::::{epigraph}
*To learn more about Python, check out Marek's recent open-access (free!) textbook*
[Minimalist Data Wrangling in Python](https://datawranglingpy.gagolewski.com/)
{cite}`datawranglingpy`.
::::



The **R version** of *genieclust* can be downloaded from
[CRAN](https://cran.r-project.org/web/packages/genieclust/)
by calling:

```r
install.packages("genieclust")
```

Its interface is compatible with the classic `stats::hclust()`, but there is more.

```r
X <- ... # some data
h <- gclust(X)
plot(h) # plot cluster dendrogram
cutree(h, k=2)
# or simply: genie(X, k=2)
```



*genieclust* is distributed
under the open source GNU AGPL v3 license and can be downloaded from
[GitHub](https://github.com/gagolews/genieclust).
The core functionality is implemented in the form of a header-only C++
library, so it may be adapted to new environments relatively easily —
any contributions are welcome (Julia, Matlab, etc.).

**Author and Maintainer**: [Marek Gagolewski](https://www.gagolewski.com)

**Contributors**:
[Maciej Bartoszuk](http://bartoszuk.rexamine.com), [Anna Cena](https://cena.rexamine.com) (R packages
[*genie*](https://cran.r-project.org/web/packages/genie) /*genieclust*'s predecessor {cite}`genieins`/
and [*CVI*](https://github.com/gagolews/optim_cvi) /some internal cluster validity measures {cite}`cvi`/),
[Peter M. Larsen](https://github.com/pmla/)
(an [implementation](https://github.com/scipy/scipy/blob/main/scipy/optimize/rectangular_lsap/rectangular_lsap.cpp)
of the shortest augmenting path algorithm for the rectangular assignment problem
which we use for computing the normalised accuracy and pair sets index).



::::{toctree}
:maxdepth: 2
:caption: genieclust
:hidden:

About <self>
Author <https://www.gagolewski.com/>
Source Code (GitHub) <https://github.com/gagolews/genieclust>
Bug Tracker and Feature Suggestions <https://github.com/gagolews/genieclust/issues>
PyPI Entry <https://pypi.org/project/genieclust/>
CRAN Entry <https://CRAN.R-project.org/package=genieclust>
::::


::::{toctree}
:maxdepth: 2
:caption: Examples and Tutorials

weave/basics
weave/sklearn_toy_example
weave/benchmarks_ar
weave/timings
weave/noise
weave/sparse
weave/string
weave/r
::::


::::{toctree}
:maxdepth: 1
:caption: API Documentation

genieclust
rapi
::::


::::{toctree}
:maxdepth: 1
:caption: See Also

Clustering Benchmarks <https://clustering-benchmarks.gagolewski.com>
Data Wrangling in Python <https://datawranglingpy.gagolewski.com/>
::::


::::{toctree}
:maxdepth: 1
:caption: Appendix

news
weave/benchmarks_details
weave/benchmarks_approx
z_bibliography
::::


<!--
Indices and Tables
------------------

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
-->