Release for JSS paper.
mhahsler committed Oct 23, 2019
1 parent 70adc7b commit 50cacb7
Showing 9 changed files with 966 additions and 919 deletions.
7 changes: 4 additions & 3 deletions DESCRIPTION
@@ -1,6 +1,6 @@
Package: dbscan
Version: 1.1-4.1
Date: 2019-xx-xx
Version: 1.1-5
Date: 2019-10-22
Title: Density Based Clustering of Applications with Noise (DBSCAN) and Related
Algorithms
Authors@R: c(person("Michael", "Hahsler", role = c("aut", "cre", "cph"),
@@ -14,7 +14,8 @@ Description: A fast reimplementation of several density-based algorithms of
the clustering structure) clustering algorithms HDBSCAN (hierarchical DBSCAN) and the LOF (local outlier
factor) algorithm. The implementations use the kd-tree data structure (from
library ANN) for faster k-nearest neighbor search. An R interface to fast kNN
and fixed-radius NN search is also provided.
and fixed-radius NN search is also provided.
See Hahsler M, Piekenbrock M and Doran D (2019) <doi:10.18637/jss.v091.i01>.
Imports:
Rcpp (>= 1.0.0),
graphics,
2 changes: 1 addition & 1 deletion NEWS.md
@@ -1,4 +1,4 @@
# dbscan 1.1-4.1 (2019-xx-xx)
# dbscan 1.1-5 (2019-10-22)

## New Features
* kNN and frNN gained parameter query to query neighbors for points not in the data.
19 changes: 19 additions & 0 deletions inst/CITATION
@@ -0,0 +1,19 @@
bibentry(bibtype = "Article",
title = "{dbscan}: Fast Density-Based Clustering with {R}",
author = c(person(given = "Michael",
family = "Hahsler",
email = "[email protected]"),
person(given = "Matthew",
family = "Piekenbrock"),
person(given = "Derek",
family = "Doran",
email = "[email protected]")),
journal = "Journal of Statistical Software",
year = "2019",
volume = "91",
number = "1",
pages = "1--30",
doi = "10.18637/jss.v091.i01",
header = "To cite dbscan in publications use:"
)

7 changes: 5 additions & 2 deletions man/dbscan.Rd
@@ -42,7 +42,7 @@ dbscan(x, eps, minPts = 5, weights = NULL, borderPoints = TRUE, ...)
\emph{Note:} use \code{dbscan::dbscan} to call this
implementation when you also use package \pkg{fpc}.

This implementation of DBSCAN implements the original algorithm as described by
This implementation of DBSCAN (Hahsler et al, 2019) implements the original algorithm as described by
Ester et al (1996). DBSCAN estimates the density around each data point by counting the number of points in a user-specified eps-neighborhood and applies a user-specified minPts threshold to identify core, border and noise points. In a second step, core points are joined into a cluster if they are density-reachable (i.e., there is a chain of core points where one falls inside the eps-neighborhood of the next). Finally, border points are assigned to clusters. The algorithm only needs
parameters \code{eps} and \code{minPts}.
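
The first step described above (labeling points as core, border, or noise) can be illustrated with a minimal base-R sketch. This is not the package's implementation, which uses a kd-tree; \code{classify_points} is a hypothetical helper written only for illustration, and it assumes that a point counts as a member of its own eps-neighborhood.

```r
# Hypothetical helper (not part of the dbscan package): label each point as
# core, border, or noise using a brute-force distance matrix.
# Assumption: a point is counted as a member of its own eps-neighborhood.
classify_points <- function(x, eps, minPts = 5) {
  d <- as.matrix(dist(x))             # pairwise Euclidean distances
  in_eps <- d <= eps                  # eps-neighborhood membership
  core <- rowSums(in_eps) >= minPts   # dense enough to be a core point
  # Border points are not core but lie within eps of at least one core point.
  border <- !core & (rowSums(in_eps[, core, drop = FALSE]) > 0)
  type <- ifelse(core, "core", ifelse(border, "border", "noise"))
  factor(type, levels = c("core", "border", "noise"))
}

# Toy data: a tight group of five points, one point at its edge,
# and one far-away point.
x <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(0.1, 0.1), c(0.05, 0.05),
           c(0.3, 0.05), c(5, 5))
types <- classify_points(x, eps = 0.3, minPts = 5)
print(table(types))
```

On this toy data the five tightly grouped points are labeled core, the sixth point is labeled border (it has only four points within \code{eps}, but three of them are core points), and the distant point is labeled noise. The sketch computes the full distance matrix, so it is only suitable for small data sets.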

@@ -72,7 +72,10 @@ cluster will be reported as members of the noise cluster 0.
\item{cluster }{An integer vector with cluster assignments. Zero indicates noise points.}
}
\references{
Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. \emph{Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).}
Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast Density-Based Clustering with R.
\emph{Journal of Statistical Software}, 91(1), 1-30. \doi{10.18637/jss.v091.i01}

Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. \emph{Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96)}, 226-231.

Campello, R. J. G. B.; Moulavi, D.; Sander, J. (2013). Density-Based Clustering
Based on Hierarchical Density Estimates. \emph{Proceedings of the 17th
11 changes: 6 additions & 5 deletions man/hdbscan.Rd
@@ -35,10 +35,10 @@ hdbscan(x, minPts, xdist = NULL,

}
\details{
Computes the hierarchical cluster tree representing density estimates along with the stability-based flat cluster extraction
This fast implementation of HDBSCAN (Hahsler et al, 2019) computes the hierarchical cluster tree representing density estimates along with the stability-based flat cluster extraction
proposed by Campello et al. (2013). HDBSCAN essentially computes the hierarchy of all DBSCAN* clusterings, and then uses a stability-based extraction method to find optimal cuts in the hierarchy, thus producing a flat solution.

Additional, related algorithms including the "Global-Local Outlier Score from Hierarchies" (GLOSH) (see section 6 of Campello et al. 2015) outlier scores and ability to cluster based on instance-level constraints (see section 5.3 of Campello et al. 2015) are supported. The algorithms only need the parameter \code{minPts}.
Additional, related algorithms including the "Global-Local Outlier Score from Hierarchies" (GLOSH) (see section 6 of Campello et al., 2015) outlier scores and ability to cluster based on instance-level constraints (see section 5.3 of Campello et al. 2015) are supported. The algorithms only need the parameter \code{minPts}.

Note that \code{minPts} not only acts as a minimum cluster size to detect, but also as a "smoothing" factor of the density estimates implicitly computed from HDBSCAN.
}
@@ -54,11 +54,12 @@ Note that \code{minPts} not only acts as a minimum cluster size to detect, but a
%% ...
}
\references{
Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. \emph{Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).}
Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast Density-Based Clustering with R.
\emph{Journal of Statistical Software}, 91(1), 1-30. \doi{10.18637/jss.v091.i01}

Campello, R. J. G. B.; Moulavi, D.; Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. \emph{Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD 2013,} Lecture Notes in Computer Science 7819, p. 160.
Campello RJGB, Moulavi D, Sander J (2013). Density-Based Clustering Based on Hierarchical Density Estimates. \emph{Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD 2013,} Lecture Notes in Computer Science 7819, p. 160.

Campello, Ricardo JGB, et al. "Hierarchical density estimates for data clustering, visualization, and outlier detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 10.1 (2015): 5.
Campello RJGB, Moulavi D, Zimek A, Sander J (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. \emph{ACM Transactions on Knowledge Discovery from Data (TKDD),} 10(5):1-51.
}

\seealso{
9 changes: 6 additions & 3 deletions man/optics.Rd
@@ -45,7 +45,7 @@ extractXi(object, xi, minimum = FALSE, correctPredecessors = TRUE)
details on how to control the search strategy.}
}
\details{
This implementation of OPTICS implements the original algorithm as described by
This implementation of OPTICS (Hahsler et al, 2019) implements the original algorithm as described by
Ankerst et al (1999). OPTICS is an ordering algorithm using similar concepts
to DBSCAN. However, for OPTICS
\code{eps} is only an upper limit for the neighborhood size used to reduce
@@ -101,9 +101,12 @@ See \code{\link{frNN}} for more information on the parameters related to nearest
\item{clusters_xi }{ data.frame containing the start and end of each cluster found in the OPTICS ordering. }
}
\references{
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Joerg Sander (1999). OPTICS: Ordering Points To Identify the Clustering Structure. ACM SIGMOD international conference on Management of data. ACM Press. pp. 49--60.
Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast Density-Based Clustering with R.
\emph{Journal of Statistical Software}, 91(1), 1-30. \doi{10.18637/jss.v091.i01}

Erich Schubert, Michael Gertz (2018). Improving the Cluster Structure Extracted from OPTICS Plots. Lernen, Wissen, Daten, Analysen (LWDA 2018). pp. 318--329.
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Joerg Sander (1999). OPTICS: Ordering Points To Identify the Clustering Structure. ACM SIGMOD international conference on Management of data. ACM Press. pp. 49-60.

Erich Schubert, Michael Gertz (2018). Improving the Cluster Structure Extracted from OPTICS Plots. Lernen, Wissen, Daten, Analysen (LWDA 2018). pp. 318-329.
}

\author{
7 changes: 7 additions & 0 deletions vignettes/dbscan.Rnw
@@ -127,6 +127,13 @@ This article presents an overview of the \proglang{R} package~\pkg{dbscan}
focusing on DBSCAN and OPTICS, outlines its operation, and experimentally
compares its performance with other open-source implementations. We first review the concept of density-based clustering and present the DBSCAN and OPTICS algorithms in Section~\ref{sec:dbc}. This section concludes with a short review of existing software packages that implement these algorithms. Details about \pkg{dbscan}, with examples of its use, are presented in Section~\ref{sec:dbscan}. A performance evaluation is presented in Section~\ref{sec:eval}. Concluding remarks are offered in Section~\ref{sec:conc}.

A version of this article describing the package \pkg{dbscan} was published as \cite{hahsler2019dbscan} and should be cited.

<<echo=FALSE>>=
options(useFancyQuotes = FALSE)
citation("dbscan")
@

\section{Density-based clustering}\label{sec:dbc}
Density-based clustering is now a well-studied field. Conceptually, the idea behind density-based clustering is simple: given a set of data points, define a structure that accurately reflects the underlying density~\citep{sander2011density}. An important distinction between density-based clustering and alternative approaches to cluster analysis, such as the use of \emph{(Gaussian) mixture models}~\citep[see][]{jain1999review}, is that the latter represents a \emph{parametric} approach in which the observed data are assumed to have been produced by a mixture of either Gaussian or other parametric families of distributions.
While certainly useful in many applications, parametric approaches naturally assume clusters will exhibit some type of convex (generally hyper-spherical or hyper-elliptical) shape. Other approaches, such as $k$-means clustering (where the $k$ parameter signifies the user-specified number of clusters to find), share this common theme of `minimum variance', where the underlying assumption is that ideal clusters are found by minimizing some measure of intra-cluster variance (often referred to as cluster cohesion) and maximizing the inter-cluster variance (cluster separation)~\citep{arbelaitz2013extensive}. Conversely, the label density-based clustering is used for methods which do not assume parametric distributions, are capable of finding arbitrarily-shaped clusters, handle varying amounts of noise, and require no prior knowledge regarding how to set the number of clusters $k$. This methodology is best expressed in the DBSCAN algorithm, which we discuss next.