# Python and R Package genieclust: Fast and Robust Hierarchical Clustering with Noise Point Detection

Genie finds meaningful clusters and is fast even on large data sets.

genieclust [Gag21] brings
a faster and more powerful version of **Genie** [GBC16] — a robust
and outlier-resistant clustering algorithm, originally published as an R package
genie.

The idea behind Genie is beautifully simple. First, make each individual
point the sole member of its own cluster. Then, keep merging pairs
of the closest clusters, one after another. However, to **prevent
the formation of clusters of highly imbalanced sizes**,
a point group of the smallest size will sometimes be matched with its nearest
neighbours.
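
The merging rule described above can be sketched in a few lines of pure Python. This is only a naive illustration, not the package's implementation, which is far more efficient. It assumes that imbalance is measured by the normalised Gini index of the cluster sizes, that clusters are compared by their single-linkage (closest-pair) distance, and that the threshold is 0.3. The names `gini` and `genie_sketch` are hypothetical and exist only for this example.

```python
from itertools import combinations
from math import dist


def gini(sizes):
    """Normalised Gini index of cluster sizes (0 means perfectly balanced)."""
    k = len(sizes)
    if k < 2:
        return 0.0
    total_diff = sum(abs(a - b) for a, b in combinations(sizes, 2))
    return total_diff / ((k - 1) * sum(sizes))


def genie_sketch(points, n_clusters, gini_threshold=0.3):
    """Naive illustration of the merging rule; returns one label per point."""
    # Start with every point in its own cluster (lists of point indices).
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        sizes = [len(c) for c in clusters]
        pairs = list(combinations(range(len(clusters)), 2))
        if gini(sizes) > gini_threshold:
            # Sizes are too imbalanced: only allow merges that involve
            # a cluster of the smallest size.
            smallest = min(sizes)
            pairs = [(a, b) for a, b in pairs
                     if smallest in (sizes[a], sizes[b])]
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]  # merge cluster b into cluster a
        del clusters[b]             # b > a, so index a stays valid

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Two well-separated groups of three points each (made-up toy data).
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels = genie_sketch(toy_points, n_clusters=2)
print(toy_labels)
```

Lowering `gini_threshold` forces more balanced cluster sizes, while a threshold of 1.0 never triggers the correction and so reduces the sketch to plain single linkage.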

Genie’s appealing simplicity goes hand in hand with its usability;
it **often outperforms other clustering approaches**
such as K-means, BIRCH, or average, Ward, and complete linkage
on benchmark data.

Genie is also **very fast** — determining the whole cluster hierarchy
for datasets of millions of points can be completed within
a coffee break.
Therefore, it is perfectly suited for solving **extreme clustering tasks**
(large datasets with any number of clusters to detect) for data
that fit into memory.
Thanks to the use of nmslib [NBMN19],
sparse or string inputs are also supported.

Genie also allows clustering with respect to mutual reachability distances
so that it can act as a **noise point detector** or a robustified version
of HDBSCAN* [CMZS15] that is able to detect a predefined
number of clusters, so it does not depend on DBSCAN’s somewhat
difficult-to-set eps parameter.
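
To make the notion of a mutual reachability distance more concrete, here is a minimal pure-Python sketch following the usual HDBSCAN*-style definition: the distance between two points is the largest of their ordinary distance and their two core distances. It assumes that the core distance of a point is the distance to its `k`-th nearest neighbour, not counting the point itself. Conventions differ on this detail, and the helper names `core_distances` and `mutual_reachability` are hypothetical, not part of genieclust.

```python
from math import dist


def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbour (itself excluded)."""
    cores = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(others[k - 1])
    return cores


def mutual_reachability(points, k):
    """Full matrix of mutual reachability distances (zero on the diagonal)."""
    cores = core_distances(points, k)
    n = len(points)
    return [
        [
            0.0 if i == j
            else max(cores[i], cores[j], dist(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three nearby points and one isolated point (made-up toy data).
mr_points = [(0, 0), (0, 1), (1, 0), (10, 10)]
mr = mutual_reachability(mr_points, k=2)
for row in mr:
    print([round(v, 2) for v in row])
```

Because the isolated point has a large core distance, it is pushed even further away from everything else, which is what lets such points be flagged as noise.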

The Python language version of genieclust has a familiar scikit-learn-like [B+13] look-and-feel:

```
import genieclust
X = ... # some data
g = genieclust.Genie(n_clusters=2)
labels = g.fit_predict(X)
```

To learn more about Python, check out Marek’s recent open-access (free!) textbook *Minimalist Data Wrangling in Python* [Gag22].

The R language interface is compatible with `hclust()`, but there is more.

```
X <- ... # some data
h <- gclust(X)
plot(h) # plot cluster dendrogram
cutree(h, k=2)
# or genie(X, k=2)
```

The genieclust package is available for Python (via PyPI) and R (on CRAN). Its source code is distributed under the open-source GNU AGPL v3 license and can be downloaded from GitHub. The core functionality is implemented in the form of a header-only C++ library, so it may be adapted to new environments relatively easily. Contributions (e.g., bindings for Julia, Matlab, etc.) are welcome.