genieclust: Fast and Robust Hierarchical Clustering with Noise Point Detection
Genie finds meaningful clusters quickly – even on large data sets.
The genieclust package for Python and R implements a robust, outlier-resistant clustering algorithm called Genie.
The idea behind Genie is beautifully simple. First, make each individual point the sole member of its own cluster. Then, keep merging pairs of the closest clusters, one after another. However, to prevent the formation of clusters of highly imbalanced sizes, a cluster of the smallest size is sometimes forced to merge with its nearest neighbour instead.
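The merge rule above can be sketched in a few lines of Python. This is only an illustrative toy, not the package's algorithm: it runs in quadratic time on a full distance matrix instead of the minimum spanning tree the package actually uses, and the function names here are made up for the example. The threshold of 0.3 on the Gini index of cluster sizes mirrors the package's default.

```python
import numpy as np

def gini(sizes):
    """Normalised Gini index of cluster sizes: 0 = balanced, ~1 = one giant cluster."""
    s = np.sort(np.asarray(sizes, dtype=float))
    n = len(s)
    if n <= 1:
        return 0.0
    i = np.arange(1, n + 1)
    return float(((2 * i - n - 1) * s).sum() / ((n - 1) * s.sum()))

def genie_like(D, n_clusters, gini_threshold=0.3):
    """Toy Genie-style agglomeration on a dense distance matrix D."""
    n = D.shape[0]
    parent = list(range(n))  # union-find forest: each point starts as its own cluster
    size = [1] * n

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # all pairwise edges sorted by distance (a stand-in for the MST)
    edges = sorted((D[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    k = n
    while k > n_clusters:
        sizes = [size[r] for r in {find(i) for i in range(n)}]
        usable = [(d, i, j) for d, i, j in edges if find(i) != find(j)]
        if gini(sizes) > gini_threshold:
            # sizes too imbalanced: force the cheapest merge that
            # involves one of the currently smallest clusters
            smin = min(sizes)
            d, i, j = min(e for e in usable
                          if size[find(e[1])] == smin or size[find(e[2])] == smin)
        else:
            # ordinary single-linkage step: globally cheapest edge
            d, i, j = usable[0]
        ri, rj = find(i), find(j)
        parent[ri] = rj
        size[rj] += size[ri]
        k -= 1
    return [find(i) for i in range(n)]
```

On two well-separated blobs this recovers the natural two-group split; the Gini guard is what keeps one cluster from swallowing all points one by one.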
Genie’s appealing simplicity goes hand in hand with its usability; it often outperforms other clustering approaches such as K-means, BIRCH, or average, Ward, and complete linkage on various kinds of benchmark datasets. Of course, there is no, nor will there ever be, a single best universal clustering approach for every kind of problem, but Genie is definitely worth a try!
Thanks to its being based on minimal spanning trees of the pairwise distance graphs, Genie is also very fast: determining the whole cluster hierarchy for datasets of millions of points can be completed within minutes. Therefore, it is capable of solving extreme clustering tasks (large datasets with any number of clusters to detect) on data that fit into memory. Thanks to the use of nmslib (if available), sparse or string inputs are also supported.
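To give some intuition for why the minimal spanning tree is such a convenient backbone, here is a minimal Prim's-algorithm sketch in pure NumPy. The package itself uses a much faster parallel C++ implementation (possibly with approximate nearest neighbours); this toy runs in O(n²) on a dense distance matrix and exists only for illustration.

```python
import numpy as np

def prim_mst(D):
    """Edges (parent, child, weight) of a minimum spanning tree of the
    complete graph given by a dense distance matrix D (Prim, O(n^2))."""
    n = D.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True              # grow the tree from vertex 0
    dist = D[0].copy()             # cheapest known connection to the tree
    src = np.zeros(n, dtype=int)   # which tree vertex offers that connection
    edges = []
    for _ in range(n - 1):
        # pick the non-tree vertex closest to the tree
        j = int(np.argmin(np.where(in_tree, np.inf, dist)))
        edges.append((int(src[j]), j, float(dist[j])))
        in_tree[j] = True
        # relax: vertex j may now offer cheaper connections
        closer = D[j] < dist
        src[closer] = j
        dist = np.minimum(dist, D[j])
    return edges
```

The n − 1 tree edges, cut at the right places, already encode a single-linkage-style hierarchy; Genie's size-balancing rule then decides the order in which they are consumed.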
Genie also allows clustering with respect to mutual reachability distances, so that it can act as a noise point detector or a robustified version of HDBSCAN* that is able to detect a predefined number of clusters. Hence, it does not depend on DBSCAN's somewhat magical eps parameter or HDBSCAN's min_cluster_size one.
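The mutual reachability distance is easy to state: d_M(i, j) = max{core_M(i), core_M(j), d(i, j)}, where core_M(i) is the distance from point i to its M-th nearest neighbour (counting i itself), as in HDBSCAN*. A small NumPy sketch, assuming a dense symmetric distance matrix with zeros on the diagonal (the package computes this internally; the function below is an independent illustration):

```python
import numpy as np

def mutual_reachability(D, M=5):
    """Mutual reachability distances for a dense, symmetric distance matrix D
    with zeros on the diagonal.  Sparse points get inflated distances,
    which is what makes them easy to set aside as noise."""
    # core distance: the M-th smallest entry in each row; the smallest entry
    # is the zero self-distance, so M effectively counts the point itself
    core = np.sort(D, axis=1)[:, M - 1]
    # note: diagonal entries become the core distances under this formula;
    # only the off-diagonal entries are meaningful
    return np.maximum(D, np.maximum(core[:, None], core[None, :]))
```

Points in dense regions keep (roughly) their original distances, while isolated points are pushed away from everything, so tree-based clustering on the transformed distances becomes far less sensitive to outliers.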
The Python version of genieclust is available via PyPI, e.g., via a call to
pip3 install genieclust
from the command line or through your favourite package manager. Note the familiar scikit-learn-like look-and-feel:
import genieclust
X = ...  # some data
g = genieclust.Genie(n_clusters=2)
labels = g.fit_predict(X)
To learn more about Python, check out Marek’s recent open-access (free!) textbook Minimalist Data Wrangling in Python.
The R version of genieclust can be downloaded from CRAN by calling:
install.packages("genieclust")
Its interface is compatible with the classic
stats::hclust(), but there is more.
X <- ...  # some data
h <- gclust(X)
plot(h)  # plot the cluster dendrogram
cutree(h, k=2)
# or simply:
genie(X, k=2)
To learn more about R, check out Marek’s recent open-access (free!) textbook Deep R Programming.
The implemented algorithms include:
Genie++ – a reimplementation of the original Genie algorithm from the R package genie: much faster than the original one; supports approximate disconnected MSTs;
Genie+HDBSCAN* – a robustified (Geniefied) retake on the HDBSCAN* method that detects noise points in data and outputs clusters of predefined sizes;
GIc (Genie+Information Criterion) – a heuristic agglomerative algorithm to minimise an information-theoretic criterion (Python only).
Other features:
inequality measures: the normalised Gini, Bonferroni, and De Vergottini indices;
external cluster validity measures (see [12, 13] for discussion): adjusted asymmetric accuracy and partition similarity scores such as normalised accuracy, pair sets index (PSI), adjusted/unadjusted Rand, adjusted/unadjusted Fowlkes–Mallows (FM), and adjusted/normalised/unadjusted mutual information (MI) indices;
internal cluster validity measures: the Caliński–Harabasz, Silhouette, Ball–Hall, Davies–Bouldin, and generalised Dunn indices, among others;
(Python only) union-find (disjoint sets) data structures (with extensions);
(Python only) some R-like plotting functions.
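As an example of the external validity measures listed above, the adjusted Rand index can be computed directly from the confusion matrix of two label vectors. The following is an independent, self-contained sketch for illustration only; the package ships its own (optimised) implementations of these scores.

```python
import numpy as np

def adjusted_rand(x, y):
    """Adjusted Rand index of two partitions given as vectors of
    non-negative integer labels: 1 = identical partitions,
    ~0 = agreement expected by chance."""
    x = np.asarray(x)
    y = np.asarray(y)
    # contingency (confusion) table of the two partitions
    C = np.zeros((x.max() + 1, y.max() + 1), dtype=np.int64)
    np.add.at(C, (x, y), 1)
    comb2 = lambda v: (v * (v - 1) // 2).sum()  # sum of "choose 2" terms
    sum_ij = comb2(C)                 # agreeing pairs inside cells
    sum_a = comb2(C.sum(axis=1))      # pairs within clusters of x
    sum_b = comb2(C.sum(axis=0))      # pairs within clusters of y
    n_pairs = comb2(np.array([x.size]))
    expected = sum_a * sum_b / n_pairs
    max_index = (sum_a + sum_b) / 2
    return float((sum_ij - expected) / (max_index - expected))
```

Note that the index is invariant to relabelling: comparing a partition against any permutation of its own labels yields exactly 1.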
genieclust is distributed under the open source GNU AGPL v3 license and can be downloaded from GitHub. The core functionality is implemented in the form of a header-only C++ library, so it may be adapted to new environments relatively easily — any valuable contributions are welcome (Julia, Matlab, etc.).
Author and Maintainer: Marek Gagolewski
Contributors: Maciej Bartoszuk and Anna Cena (genieclust’s predecessor genie and some internal cluster validity measures); Peter M. Larsen (an implementation of the shortest augmenting path algorithm for the rectangular assignment problem, which we use for computing some external cluster validity measures [16, 33]).
- Comparing Algorithms on Toy Datasets
- Benchmarks (How Good Is It?)
- Timings (How Fast Is It?)
- Clustering with Noise Points Detection
- Example: Sparse Data and Movie Recommendation
- Example: String Data and Grouping of DNA
- R Interface Examples