Benchmarks (How Good Is It?)
In this section, we evaluate the usefulness of different clustering algorithms. We use our framework for benchmarking clustering algorithms (benchmark suite version 1.0.1) [12], which aggregates datasets from various sources, including, but not limited to, [6, 9, 10, 20, 23, 34]. Ground-truth/reference label vectors are provided alongside each dataset; they define the desired number of clusters. Hence, we only study algorithms that allow for setting n_clusters explicitly.
We apply a few agglomerative hierarchical methods (average, centroid, complete, single, and Ward linkage; implemented in the fastcluster package [29]), k-means, expectation-maximisation (EM) for Gaussian mixtures, Birch, and spectral clustering (implemented in scikit-learn [32]), as well as ITM [28] and Genie [16].
The adjusted Rand index (AR; see [21]) is used to quantify the agreement between a reference and a predicted clustering, with a score of 1.0 denoting perfect agreement. However, as there might be multiple equally valid/plausible/useful partitions (see [5] and [19] for discussion), the output of each algorithm is evaluated against all the available reference labellings, and the maximal similarity score is reported.
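The "best match over all reference labellings" rule can be sketched as follows; this is a minimal illustration using scikit-learn's `adjusted_rand_score`, with a hypothetical helper name and toy label vectors (not part of the benchmark suite itself):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def best_ar_score(pred, references):
    """Return the maximal adjusted Rand index over all reference labellings."""
    return max(adjusted_rand_score(ref, pred) for ref in references)


pred = np.array([0, 0, 1, 1, 2, 2])
references = [
    np.array([0, 0, 1, 1, 2, 2]),  # identical partition -> AR = 1.0
    np.array([0, 0, 0, 1, 1, 1]),  # a coarser, 2-group reference labelling
]
print(best_ar_score(pred, references))  # 1.0 (perfect match with the first reference)
```

Taking the maximum means an algorithm is not penalised for recovering any one of several equally plausible partitions.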
For more detailed results based on other partition similarity scores, see the Appendix.
Small Datasets
As some of the algorithms tested here failed to generate a solution within reasonable time limits (e.g., spectral clustering), in this part we restrict ourselves to datasets with up to 10,000 observations. As suggested in the benchmark suite’s description, we omit the overpopulous “parametric” Gaussian-distributed batteries h2mg and g2mg.
Below are boxplots of the empirical distributions of the adjusted Rand index. We report the results for Birch and spectral clustering with the parameters that led to the highest average AR score (the former was tested on a parameter grid of branching_factor in [10, 50, 100] and threshold in [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]; the latter on affinity in [“rbf”, “laplacian”, “poly”, “sigmoid”] and gamma in [0.25, 0.5, 1.0, 2.5, 5.0]). Moreover, the Gaussian mixtures were fitted with n_init=100.
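A parameter sweep of this kind can be sketched as below. This is a simplified, single-dataset illustration (the toy two-blob data and the scoring loop are ours, not the benchmark code): fit Birch on each (branching_factor, threshold) pair from the grid above and keep the setting with the highest AR score.

```python
import itertools

import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import adjusted_rand_score

# Toy data: two well-separated Gaussian blobs (illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
y_true = np.repeat([0, 1], 50)

# Sweep the same grid as in the text; keep the best-scoring setting.
best = None
for bf, thr in itertools.product(
    [10, 50, 100],
    [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
):
    pred = Birch(n_clusters=2, branching_factor=bf, threshold=thr).fit_predict(X)
    score = adjusted_rand_score(y_true, pred)
    if best is None or score > best[0]:
        best = (score, bf, thr)

print(best)  # best AR score and the (branching_factor, threshold) that achieved it
```

In the benchmark itself, the score being maximised is of course the average AR over all the small datasets, not over a single toy example.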
The Genie algorithm with a gini_threshold of 0.3 gives the highest average and median AR index and, at the same time, is subject to the least variability. The (parametric!) EM algorithm fitting mixtures of Gaussians and the (perhaps lesser-known) information-theoretic ITM [28] method (which is also based on a minimum spanning tree; compare [18]) tend to output good-quality results as well.
Descriptive statistics for the ranks (for each dataset, every algorithm achieving the highest AR index, rounded to 2 decimal digits, is assigned a rank of 1); lower ranks are better:
|                   | count | mean | std | min | 25% | 50% | 75% | max |
|-------------------|------:|-----:|----:|----:|----:|----:|----:|----:|
| Average linkage   | 72    | 6.8  | 3.6 | 1   | 5   | 7   | 10  | 13  |
| Birch_0.01        | 72    | 6.1  | 3.1 | 1   | 4   | 6   | 8   | 13  |
| Centroid linkage  | 72    | 7    | 3.7 | 1   | 4.8 | 7.5 | 10  | 13  |
| Complete linkage  | 72    | 8.3  | 3.4 | 1   | 6   | 9   | 11  | 13  |
| Gaussian mixtures | 72    | 4.3  | 3.8 | 1   | 1   | 2   | 6.2 | 13  |
| Genie_0.1         | 72    | 3.8  | 3.6 | 1   | 1   | 3   | 5.2 | 13  |
| Genie_0.3         | 72    | 3.5  | 3.4 | 1   | 1   | 1.5 | 5   | 12  |
| Genie_0.5         | 72    | 4.4  | 4.2 | 1   | 1   | 2   | 8.2 | 12  |
| ITM               | 72    | 5.6  | 4.3 | 1   | 1   | 5   | 9.2 | 13  |
| K-means           | 72    | 6    | 4.1 | 1   | 1   | 6   | 9.2 | 13  |
| Single linkage    | 72    | 7.9  | 5.5 | 1   | 1   | 11  | 13  | 13  |
| Spectral_RBF_5    | 72    | 5.3  | 3.6 | 1   | 1   | 5.5 | 8   | 12  |
| Ward linkage      | 72    | 6.4  | 3.2 | 1   | 4   | 7   | 8.2 | 13  |
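The ranking rule can be illustrated with a short pandas snippet; the AR scores below are made up for the sake of the example (rows are datasets, columns are algorithms), but the rounding-then-ranking logic matches the description above: ties at the top all receive rank 1.

```python
import pandas as pd

# Hypothetical AR scores: rows = datasets, columns = algorithms.
scores = pd.DataFrame({
    "Genie_0.3":      [0.994, 0.87, 0.52],
    "K-means":        [0.991, 0.87, 0.48],
    "Single linkage": [0.300, 0.10, 0.52],
})

# Round to 2 decimals, then rank per dataset; method="min" gives all
# tied top scorers a rank of 1 (lower ranks are better).
ranks = scores.round(2).rank(axis=1, method="min", ascending=False)
print(ranks)
print(ranks.describe())  # count/mean/std/min/quartiles/max, as in the tables above
```

For instance, in the first row 0.994 and 0.991 both round to 0.99, so Genie_0.3 and K-means share rank 1.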
Large Datasets
Below we provide the results for the larger datasets (70,000–105,600 points). This time, the ITM method and Genie with a gini_threshold of 0.1 give the highest typical scores.
Descriptive statistics for the ranks (based on the AR index):

|                  | count | mean | std | min | 25% | 50% | 75% | max |
|------------------|------:|-----:|----:|----:|----:|----:|----:|----:|
| Centroid linkage | 6     | 5.3  | 2.4 | 1   | 4.5 | 6.5 | 7   | 7   |
| Genie_0.1        | 6     | 1.8  | 1.2 | 1   | 1   | 1.5 | 2   | 4   |
| Genie_0.3        | 6     | 3.2  | 1.7 | 1   | 2   | 3   | 4.8 | 5   |
| Genie_0.5        | 6     | 5    | 2   | 1   | 5.2 | 6   | 6   | 6   |
| ITM              | 6     | 3.7  | 2.7 | 1   | 1.5 | 3   | 6   | 7   |
| K-means          | 6     | 3.5  | 1.9 | 1   | 2.2 | 3.5 | 4.8 | 6   |
| Single linkage   | 6     | 7.3  | 0.8 | 6   | 7   | 7.5 | 8   | 8   |
| Ward linkage     | 6     | 3.3  | 1.6 | 1   | 2.2 | 3.5 | 4.8 | 5   |
Summary
Overall, the Genie algorithm tends to outperform the other algorithms considered in this study, at least on this rich benchmark battery. In [16], based on a much smaller sample of reference datasets, we recommended a gini_threshold of 0.3, which is also the default in the genieclust package. However, inspecting thresholds of 0.1 and 0.5 is sometimes worth a try. Interestingly, the algorithm is quite stable in the sense that small changes to this parameter should not affect the generated clusterings in a significant way.