# Basics

*Genie* [16] is an agglomerative hierarchical clustering
algorithm that links clusters while making sure that
the Gini index (a popular measure of inequality used, amongst others,
in economics) of the cluster sizes does not go too far beyond a given threshold.
If it does, then instead of merging the two closest clusters, the smallest cluster
is joined with its nearest neighbour.
In the following sections we'll show that Genie frequently outperforms many other methods
in terms of clustering quality and speed.
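For intuition, the Gini index of a vector of cluster sizes can be computed
as follows (a minimal illustrative sketch of one common normalised
formulation; this is not the package's internal implementation):

```
import numpy as np

def gini_index(counts):
    # Gini index of a vector of cluster sizes: 0.0 for perfectly
    # balanced sizes, approaching 1.0 when one cluster dominates.
    c = np.sort(np.asarray(counts, dtype=float))[::-1]  # non-increasing
    n = len(c)
    i = np.arange(1, n + 1)
    return np.sum((n - 2*i + 1) * c) / ((n - 1) * np.sum(c))

gini_index([100, 100, 100])  # 0.0  -- perfectly balanced
gini_index([290, 5, 5])      # 0.95 -- one cluster dominates
```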

Here are a few examples of basic interactions with the Python version
of the genieclust [11] package, which can be installed from
PyPI, e.g., via a call to `pip3 install genieclust` from the command line.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import genieclust
```

## Breaking the Ice

Let's load an example benchmark set, `jain` [22], which comes
with the corresponding true partition (as assigned by experts).

```
# see https://github.com/gagolews/genieclust/tree/master/devel/sphinx/weave
dataset = "jain"
# Load an example 2D dataset:
X = np.loadtxt("%s.data.gz" % dataset, ndmin=2)
# Load the corresponding reference labels. The original labels are in {1,2,..,k}.
# We'll make them more Python-ish by subtracting 1.
labels_true = np.loadtxt("%s.labels0.gz" % dataset, dtype=np.intp)-1
# The number of unique labels gives the true cluster count:
n_clusters = len(np.unique(labels_true))
```
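A quick look at what we have just loaded (an optional sanity check):

```
print(X.shape)                   # the number of points and features
print(np.bincount(labels_true))  # the reference cluster sizes
```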

A scatter plot of the dataset together with the reference labels:

```
genieclust.plots.plot_scatter(X, labels=labels_true)
plt.title("%s (n=%d, true n_clusters=%d)" % (dataset, X.shape[0], n_clusters))
plt.axis("equal")
plt.show()
```

Let's apply the Genie algorithm (with the default/recommended
`gini_threshold` parameter value). The genieclust package's interface
is compatible with the one from the popular
scikit-learn library [32].
In particular, an object of the `Genie` class is equipped with the
`fit` and `fit_predict` methods [1].

```
g = genieclust.Genie(n_clusters=n_clusters)
labels_genie = g.fit_predict(X)
```

See the documentation of the `genieclust.Genie` class for more details.
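Following the scikit-learn convention, the fitted labels should also be
accessible via the `labels_` attribute after a call to `fit` (a quick
sanity check; Genie is deterministic, so both ways give the same partition):

```
g2 = genieclust.Genie(n_clusters=n_clusters)
g2.fit(X)
np.all(g2.labels_ == labels_genie)  # expected: True
```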

Let's plot the discovered partition:

```
genieclust.plots.plot_scatter(X, labels=labels_genie)
plt.title("Genie (gini_threshold=%g)" % g.gini_threshold)
plt.axis("equal")
plt.show()
```

Nice.

A picture is worth a thousand words, but numbers are worth millions of pictures. We can compare the resulting clustering with the reference one by computing, for example, the confusion matrix.

```
# Compute the confusion matrix (with pivoting)
genieclust.compare_partitions.normalized_confusion_matrix(labels_true, labels_genie)
```

```
## array([[276,   0],
##        [  0,  97]])
```

The above confusion matrix can be summarised by means of partition
similarity measures, such as the adjusted Rand index (`ar`).

```
# See also: sklearn.metrics.adjusted_rand_score()
genieclust.compare_partitions.adjusted_rand_score(labels_true, labels_genie)
```

```
## 1.0
```

A score of 1.0 of course denotes a perfect match between the two partitions.
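If one score is not enough, the same module can compute a whole collection
of partition similarity indices in a single call (assuming a recent version
of the package, which provides the `compare_partitions` function):

```
# Many similarity scores at once, returned as a dict:
genieclust.compare_partitions.compare_partitions(labels_true, labels_genie)
```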

## A Comparison with k-means

For the sake of comparison, let's apply the k-means algorithm to the same dataset.

```
import sklearn.cluster
km = sklearn.cluster.KMeans(n_clusters=n_clusters)
labels_kmeans = km.fit_predict(X)
genieclust.plots.plot_scatter(X, labels=labels_kmeans)
plt.title("k-means")
plt.axis("equal")
plt.show()
```

It is well known that the k-means algorithm can only split the input space into convex regions (compare the notion of Voronoi diagrams). Hence, we shouldn't be too surprised by this result.
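To see the convexity constraint at work, here is a rough sketch (not part of
the original example) that colours a dense grid of points by their nearest
k-means centroid, revealing the Voronoi cells underlying the partition:

```
import scipy.spatial.distance
# Colour a dense grid by the nearest centroid -- the resulting regions
# are the (convex) Voronoi cells that k-means is restricted to:
gx, gy = np.meshgrid(
    np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
    np.linspace(X[:, 1].min(), X[:, 1].max(), 200))
grid = np.column_stack([gx.ravel(), gy.ravel()])
nearest = scipy.spatial.distance.cdist(grid, km.cluster_centers_).argmin(axis=1)
plt.scatter(grid[:, 0], grid[:, 1], c=nearest, s=1, alpha=0.1)
genieclust.plots.plot_scatter(X, labels=labels_kmeans)
plt.axis("equal")
plt.show()
```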

```
# Compute the confusion matrix for the k-means output:
genieclust.compare_partitions.normalized_confusion_matrix(labels_true, labels_kmeans)
```

```
## array([[197,  79],
##        [  1,  96]])
```

```
# A cluster similarity measure for k-means:
genieclust.compare_partitions.adjusted_rand_score(labels_true, labels_kmeans)
```

```
## 0.3241080446115835
```

The adjusted Rand score of \(\sim 0.3\) indicates a far-from-perfect fit.

## A Comparison with HDBSCAN*

Let's also make a comparison against a version of the DBSCAN
[7, 25] algorithm. The original DBSCAN relies on a somewhat
magical `eps` parameter, which might be hard to tune in practice. However,
the hdbscan package [27] implements its robustified variant
[2], which makes the algorithm much more user-friendly.

Here are the clustering results with the `min_cluster_size` parameter
of 3, 5, 10, and 15:

```
import hdbscan
mcs = [3, 5, 10, 15]
for i in range(len(mcs)):
    h = hdbscan.HDBSCAN(min_cluster_size=mcs[i])
    labels_hdbscan = h.fit_predict(X)
    plt.subplot(2, 2, i+1)
    genieclust.plots.plot_scatter(X, labels=labels_hdbscan)
    plt.title("HDBSCAN (min_cluster_size=%d)" % h.min_cluster_size)
    plt.axis("equal")
plt.show()
```

- *Side note.* Gray plotting symbols denote "noise" points; we'll get back to them in another section. It turns out that the Genie algorithm is also equipped with such a feature (on demand).

In HDBSCAN*, `min_cluster_size` affects the "granularity"
of the obtained clusters. Its default value is:

```
hdbscan.HDBSCAN().min_cluster_size
```

```
## 5
```

Unfortunately, we cannot easily guess how many clusters this method
will generate. At first glance, it would seem that `min_cluster_size`
should lie somewhere between 10 and 15, but…

```
mcs = range(10, 16)
for i in range(len(mcs)):
    h = hdbscan.HDBSCAN(min_cluster_size=mcs[i])
    labels_hdbscan = h.fit_predict(X)
    plt.subplot(3, 2, i+1)
    genieclust.plots.plot_scatter(X, labels=labels_hdbscan)
    plt.title("HDBSCAN (min_cluster_size=%d)" % h.min_cluster_size)
    plt.axis("equal")
plt.show()
```

Strangely enough, a `min_cluster_size` of \(11\) generates 4 clusters,
whereas \(11\pm 1\) yields only 3 of them.
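We can make this instability explicit by simply counting the clusters
(excluding the noise points, which HDBSCAN* labels with -1) for each setting:

```
# An illustrative check of how the cluster count varies:
for m in range(10, 16):
    labs = hdbscan.HDBSCAN(min_cluster_size=m).fit_predict(X)
    print(m, len(np.unique(labs[labs >= 0])))
```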

On the other hand, the Genie algorithm belongs
to the group of *hierarchical agglomerative methods*; by definition,
it is able to generate a sequence of *nested* partitions. This means
that by increasing `n_clusters`, we split one and only one cluster
into two subgroups.
This makes the resulting partitions more stable.

```
ncl = range(2, 8)
for i in range(len(ncl)):
    g = genieclust.Genie(n_clusters=ncl[i])
    labels_genie = g.fit_predict(X)
    plt.subplot(3, 2, i+1)
    genieclust.plots.plot_scatter(X, labels=labels_genie)
    plt.title("Genie (n_clusters=%d)" % g.n_clusters)
    plt.axis("equal")
plt.show()
```
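The nestedness property is easy to verify numerically. In the
cross-tabulation below (an illustrative check using pandas), every column,
i.e., every cluster in the finer partition, should have exactly one
non-zero entry: the coarser cluster it was split out of.

```
labels3 = genieclust.Genie(n_clusters=3).fit_predict(X)
labels4 = genieclust.Genie(n_clusters=4).fit_predict(X)
# Each finer cluster should intersect exactly one coarser cluster:
pd.crosstab(labels3, labels4)
```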

## Dendrograms

Dendrogram plotting is possible with `scipy.cluster.hierarchy`:

```
import scipy.cluster.hierarchy
g = genieclust.Genie(compute_full_tree=True)
g.fit(X)
# Assemble a SciPy-compatible linkage matrix from the fitted tree:
linkage_matrix = np.column_stack([g.children_, g.distances_, g.counts_])
scipy.cluster.hierarchy.dendrogram(linkage_matrix,
    show_leaf_counts=False, no_labels=True)
plt.show()
```

For a list of graphical parameters, refer to the function’s manual:

```
scipy.cluster.hierarchy.dendrogram(linkage_matrix,
    truncate_mode="lastp", p=15, orientation="left")
plt.show()
```
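As a side note, the same linkage matrix can also be cut into a flat
partition with `scipy.cluster.hierarchy.cut_tree`; for a given number of
clusters, this should reproduce Genie's own output (a hedged sanity check):

```
labels_cut = scipy.cluster.hierarchy.cut_tree(linkage_matrix, n_clusters=2).ravel()
genieclust.compare_partitions.adjusted_rand_score(labels_true, labels_cut)
```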

## Further Reading

For more details, refer to the package's API reference
manual: `genieclust.Genie`.

To learn more about Python, check out Marek's open-access (free!) textbook *Minimalist Data Wrangling in Python* [14].