Outlier Detection with Deadwood

The Deadwood outlier detection algorithm [1] can be run on a clustered dataset to identify anomalous points within each cluster. This can be particularly beneficial when the clusters are of heterogeneous densities.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import genieclust
import deadwood

def plot_scatter(X, labels=None):
    deadwood.plot_scatter(X, labels=labels, asp=1, alpha=0.5, markers='o', s=10)

Let’s load an example dataset that can be found at the hdbscan [14] package’s project site:

dataset = "hdbscan"
X = np.loadtxt("%s.data.gz" % dataset, ndmin=2)
labels_true = np.loadtxt("%s.labels0.gz" % dataset, dtype=np.intp) - 1
n_clusters = len(np.unique(labels_true[labels_true>=0]))

Here are the reference labels as identified by a human annotator. The light gray markers, corresponding to the label -1, designate outliers.

plot_scatter(X, labels_true)
plt.title("(n=%d, true n_clusters=%d)" % (X.shape[0], n_clusters))
plt.axis("equal")
plt.show()
../_images/noise-scatter-1.png

Figure 10 Reference labels

Smoothing Factor

The genieclust package allows for clustering with respect to the mutual reachability distance, \(d_M\), known from the DBSCAN* algorithm [2]. This metric is parameterised by a smoothing factor, \(M\ge 1\).

Namely, instead of the ordinary (usually Euclidean) distance between two points \(i\) and \(j\), \(d(i,j)\), we take \(d_M(i,j)=\max\{ c_M(i), c_M(j), d(i, j) \}\), where the so-called \(M\)-core distance \(c_M(i)\) is the distance between \(i\) and its \(M\)-th nearest neighbour (here, not including \(i\) itself, unlike in [2]).
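To make this definition concrete, here is a minimal sketch (not part of genieclust's API) that computes the mutual reachability distance matrix directly from the formula above, assuming a small dataset and the Euclidean metric:

```python
import numpy as np

def mutual_reachability(X, M=5):
    """Compute the M-mutual reachability distance matrix d_M.

    The M-core distance c_M(i) is the distance between point i and its
    M-th nearest neighbour (excluding i itself); then
    d_M(i, j) = max(c_M(i), c_M(j), d(i, j)).
    """
    # ordinary pairwise Euclidean distances
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # each sorted row starts with the zero self-distance, so column M
    # holds the distance to the M-th nearest neighbour (self excluded)
    core = np.sort(D, axis=1)[:, M]
    return np.maximum(D, np.maximum(core[:, None], core[None, :]))

rng = np.random.default_rng(123)
X = rng.normal(size=(100, 2))
D_M = mutual_reachability(X, M=5)
```

Note that \(d_M\) is symmetric and never smaller than the underlying distance; with \(M\) increasing, points in sparse regions become "farther" from everything, which is precisely what makes the metric useful for noise detection.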

DBSCAN* and its hierarchical version, HDBSCAN*, introduce the notions of noise and core points. Furthermore, their predecessor, DBSCAN, also marks certain non-core points as border points. All of them rely on a specific threshold \(\varepsilon\) that is applied to the points' core distances.
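To illustrate the role of \(\varepsilon\), the following sketch (an illustration of the idea, not the packages' actual code) labels points in the DBSCAN* spirit: a point is a core point if its \(M\)-core distance is at most \(\varepsilon\), and noise otherwise:

```python
import numpy as np

def core_noise_split(X, M=5, eps=0.3):
    """Return a boolean mask: True for core points (M-core distance
    at most eps, DBSCAN*-style), False for noise points."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    core_dist = np.sort(D, axis=1)[:, M]  # distance to M-th NN, self excluded
    return core_dist <= eps

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(size=(50, 2)) * 0.1,        # a dense blob
    rng.uniform(-3, 3, size=(5, 2)),       # a few sparse background points
])
is_core = core_noise_split(X, M=5, eps=0.3)
```

Points in the dense blob have small core distances and pass the threshold, whereas isolated background points tend to fail it. Choosing \(\varepsilon\) well is the hard part, which is one motivation for hierarchical and threshold-free approaches.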

In our case, we can perform anomaly detection separately in each cluster identified by Genie, by running the Deadwood algorithm on the corresponding subsets of the mutual reachability minimum spanning tree.

Here are the effects of playing with the \(M\) parameter (we keep the default gini_threshold):

Ms = [10, 25]
for i, M in enumerate(Ms):
    g = genieclust.Genie(n_clusters=n_clusters, M=M).fit(X)
    labels = g.labels_.copy()
    d = deadwood.Deadwood().fit(g)
    labels[d.labels_ < 0] = -1  # mark points deemed outliers by Deadwood
    plt.subplot(1, 2, i+1)
    plot_scatter(X, labels)
    plt.title("(gini_threshold=%g, M=%d)" % (g.gini_threshold, g.M))
    plt.axis("equal")
plt.show()
../_images/noise-Genie1-3.png

Figure 11 Labels predicted by Genie with outliers detected by Deadwood

Cluster Hierarchy

Contrary to the HDBSCAN* method featured in the hdbscan package [14], in our case, we can request a specific number of clusters. Moreover, we can easily switch between partitions of finer or coarser granularity. Genie is a hierarchical algorithm, so the partitions are properly nested.

ncs = [3, 6, 8, 10]
for i, nc in enumerate(ncs):
    g = genieclust.Genie(n_clusters=nc, M=10).fit(X)
    labels = g.labels_.copy()
    d = deadwood.Deadwood().fit(g)
    labels[d.labels_ < 0] = -1  # mark points deemed outliers by Deadwood
    plt.subplot(2, 2, i+1)
    plot_scatter(X, labels)
    plt.title("(n_clusters=%d)" % (g.n_clusters))
    plt.axis("equal")
plt.show()
../_images/noise-Genie3-5.png

Figure 12 Labels predicted by Genie with outliers removed by Deadwood: Different number of clusters requested
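The nestedness claim above can be verified mechanically: a finer partition is nested in a coarser one if every fine cluster lies entirely within a single coarse cluster. Here is a small helper to that effect, demonstrated on toy label vectors (not on the clusterings above); the function name is ours, not genieclust's:

```python
def is_refinement(finer, coarser):
    """True iff every cluster in `finer` lies entirely within a single
    cluster of `coarser`, i.e., the finer partition is nested in the
    coarser one. The noise label -1 is ignored."""
    mapping = {}
    for f, c in zip(finer, coarser):
        if f == -1 or c == -1:
            continue
        if f in mapping and mapping[f] != c:
            return False  # cluster f straddles two coarse clusters
        mapping[f] = c
    return True

# Toy example: splitting coarse cluster 0 into fine clusters {0, 1}
coarse = [0, 0, 0, 1, 1, -1]
fine   = [0, 0, 1, 2, 2, -1]
print(is_refinement(fine, coarse))   # True
print(is_refinement(coarse, fine))   # False
```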

A Comparison with HDBSCAN*

Here are the results returned by hdbscan with default parameters:

import hdbscan
h = hdbscan.HDBSCAN()
labels_hdbscan = h.fit_predict(X)
plot_scatter(X, labels_hdbscan)
plt.title("(min_cluster_size=%d, min_samples=%d)" % (
    h.min_cluster_size, h.min_samples or h.min_cluster_size))
plt.axis("equal")
plt.show()
../_images/noise-HDBSCAN1-7.png

Figure 13 Labels predicted by HDBSCAN*

By tuning min_cluster_size and/or min_samples (the latter corresponds to our M and defaults to min_cluster_size if not provided explicitly), we can obtain a partition that is even closer to the reference one:

mcss = [10, 25]
mss = [10, 25]
for i, mcs in enumerate(mcss):
    for j, ms in enumerate(mss):
        h = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=ms)
        labels_hdbscan = h.fit_predict(X)
        plt.subplot(len(mcss), len(mss), i*len(mss)+j+1)
        plot_scatter(X, labels_hdbscan)
        plt.title("(min_cluster_size=%d, min_samples=%d)" % (
            h.min_cluster_size, h.min_samples or h.min_cluster_size))
        plt.axis("equal")
plt.show()
../_images/noise-HDBSCAN2-9.png

Figure 14 Labels predicted by HDBSCAN* – different settings