Clustering with Noise Points Detection¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import genieclust
Let’s load an example dataset that can be found the at hdbscan [26] package’s project site:
dataset = "hdbscan"
X = np.loadtxt("%s.data.gz" % dataset, ndmin=2)
labels_true = np.loadtxt("%s.labels0.gz" % dataset, dtype=np.intp) - 1
n_clusters = len(np.unique(labels_true[labels_true>=0]))
Here are the “reference” labels as identified by an expert (of course,
each dataset might reveal many different clusterings that a user might
find useful for whatever their goal is).
The -1
labels denote noise points (light grey markers).
genieclust.plots.plot_scatter(X, labels=labels_true, alpha=0.5)
plt.title("(n=%d, true n_clusters=%d)" % (X.shape[0], n_clusters))
plt.axis("equal")
plt.show()
Smoothing Factor¶
The genieclust
package allows for clustering with respect
to a mutual reachability distance, \(d_M\),
known from the HDBSCAN* algorithm [2].
It is parameterised by a smoothing factor, M
, which
controls how eagerly we tend to classify points as noise.
Here are the effects of playing with the M
parameter
(we keep the default gini_threshold
):
Ms = [2, 5, 10, 25]
for i in range(len(Ms)):
g = genieclust.Genie(n_clusters=n_clusters, M=Ms[i])
labels_genie = g.fit_predict(X)
plt.subplot(2, 2, i+1)
genieclust.plots.plot_scatter(X, labels=labels_genie, alpha=0.5)
plt.title("(gini_threshold=%g, M=%d)"%(g.gini_threshold, g.M))
plt.axis("equal")
plt.show()
For a more natural look-and-feel, it can be a good idea to first identify the noise points with Genie, remove them from the data set (at least temporarily), and then apply the clustering procedure once again (did we mention that our algorithm is fast?) but now with respect to the original distance (here: Euclidean):
# Step 1: Noise point identification
g1 = genieclust.Genie(n_clusters=n_clusters, M=50)
labels_noise = g1.fit_predict(X)
non_noise = (labels_noise >= 0) # True == non-noise point
# Step 2: Clustering of non-noise points:
g2 = genieclust.Genie(n_clusters=n_clusters)
labels_genie = g2.fit_predict(X[non_noise, :])
# Replace old labels with the new ones:
labels_noise[non_noise] = labels_genie
# Scatter plot:
genieclust.plots.plot_scatter(X, labels=labels_noise, alpha=0.5)
plt.title("(gini_threshold=%g, noise points removed first; M=%d)"%(g2.gini_threshold, g1.M))
plt.axis("equal")
plt.show()
Contrary to the excellent implementation of HDBSCAN* that is featured in the hdbscan package [26] and which also relies on a minimum spanning tree with respect to \(d_M\), here, we still have the hierarchical Genie [16] algorithm under the hood. It means that we can request a specific number of clusters. Moreover, we can easily switch between partitions of finer or coarser granularity.
ncs = [5, 6, 7, 8, 10, 15]
for i in range(len(ncs)):
g = genieclust.Genie(n_clusters=ncs[i])
labels_genie = g.fit_predict(X[non_noise, :])
plt.subplot(3, 2, i+1)
labels_noise[non_noise] = labels_genie
genieclust.plots.plot_scatter(X, labels=labels_noise, alpha=0.5)
plt.title("(n_clusters=%d)"%(g.n_clusters))
plt.axis("equal")
plt.show()
A Comparision with HDBSCAN*¶
Here are the results returned by hdbscan
with default parameters:
import hdbscan
h = hdbscan.HDBSCAN()
labels_hdbscan = h.fit_predict(X)
genieclust.plots.plot_scatter(X, labels=labels_hdbscan, alpha=0.5)
plt.title("(min_cluster_size=%d, min_samples=%d)" % (
h.min_cluster_size, h.min_samples or h.min_cluster_size))
plt.axis("equal")
plt.show()
By tuning up min_cluster_size
and/or min_samples
(which corresponds to our M
parameter;
by the way, min_samples
defaults to min_cluster_size
if not provided explicitly),
we can obtain a partition that is even closer to the reference one:
mcss = [5, 10, 25]
mss = [5, 10]
for i in range(len(mcss)):
for j in range(len(mss)):
h = hdbscan.HDBSCAN(min_cluster_size=mcss[i], min_samples=mss[j])
labels_hdbscan = h.fit_predict(X)
plt.subplot(3, 2, i*len(mss)+j+1)
genieclust.plots.plot_scatter(X, labels=labels_hdbscan, alpha=0.5)
plt.title("(min_cluster_size=%d, min_samples=%d)" % (
h.min_cluster_size, h.min_samples or h.min_cluster_size))
plt.axis("equal")
plt.show()
Neat.