genieclust: Fast and Robust Hierarchical Clustering with Noise Point Detection¶
Genie finds meaningful clusters and is fast even on large data sets.
—by Marek Gagolewski
The idea behind Genie is very simple. First, make each individual point the only member of its own cluster. Then, keep merging pairs of the closest clusters, one after another. However, to prevent the formation of clusters of highly imbalanced sizes a point group of the smallest size is sometimes matched with its nearest neighbours.
Genie’s appealing simplicity goes hand in hand with its usability; it often outperforms other clustering approaches such as K-means, BIRCH, or average, Ward, and complete linkage on benchmark data.
Genie is also very fast — determining the whole cluster hierarchy for datasets of millions of points can be completed within a coffee break. Therefore, it is perfectly suited for solving of extreme clustering tasks (large datasets with any number of clusters to detect) for data that fit into memory. Thanks to the use of nmslib [NBMN19], sparse or string inputs are also supported.
Genie also allows clustering with respect to mutual reachability distances so that it can act as a noise point detector or a robustified version of HDBSCAN* [CMZS15] that is able to detect a predefined number of clusters and hence it doesn’t dependent on the DBSCAN’s somehow difficult-to-set eps parameter.
The Python language version of genieclust has a familiar scikit-learn-like [BLB+13] look-and-feel:
import genieclust X = ... # some data g = genieclust.Genie(n_clusters=2) labels = g.fit_predict(X)
R’s interface is compatible with
hclust(), but there is more.
X <- ... # some data h <- gclust(X) plot(h) # plot cluster dendrogram cutree(h, k=2) # or genie(X, k=2)
The genieclust package is available for Python (via PyPI) and R (on CRAN). Its source code is distributed under the open source GNU AGPL v3 license and can be downloaded from GitHub. Note that the core functionality is implemented in the form of a header-only C++ library, hence it might be relatively easily adapted to new environments — any contributions are welcome.
- Comparing Algorithms on Toy Datasets
- Benchmarks (How Good Is It?)
- Timings (How Fast Is It?)
- Clustering with Noise Points Detection
- Example: Sparse Data and Movie Recommendation
- Example: String Data and Grouping of DNA
- R Interface Examples
- Python Package genieclust Reference
- R Package genieclust Reference
- comparing_partitions: Pairwise Partition Similarity Scores
- emst_mlpack: Euclidean Minimum Spanning Tree
- gclust: The Genie++ Hierarchical Clustering Algorithm
- genieclust-package: The Genie++ Hierarchical Clustering Algorithm (with Extras)
- inequity: Inequity (Inequality) Measures
- mst: Minimum Spanning Tree of the Pairwise Distance Graph