genieclust

class genieclust.MSTClusterMixin(*, n_clusters, M, metric, postprocess, quitefastmst_params, verbose)

A base class for genieclust.Genie, genieclust.GIc, and other MST-based clustering algorithms [2].

Parameters:
n_clusters : int

The number of clusters to detect.

If M > 1 and postprocess is not "all", setting n_clusters = 1 makes the algorithm behave like a noise point/outlier detector.

M : int

Smoothing factor for the mutual reachability distance [1]. M = 1 and M = 2 indicate the original distance as given by the metric parameter.

metric : str

The metric used to compute the linkage.

One of: "l2" (synonym: "euclidean"; the default), "l1" (synonyms: "manhattan", "cityblock"), "cosinesimil" (synonym: "cosine"), or "precomputed".

For "precomputed", the X argument to the fit method must be a distance vector or a square-form distance matrix; see scipy.spatial.distance.pdist.

Determining minimum spanning trees with respect to the Euclidean distance (the default) is much faster than with other metrics thanks to the quitefastmst package.

postprocess : {"boundary", "none", "all"}

Controls the treatment of noise/boundary points after the clusters are identified.

In effect only if M > 1. Each leaf in the minimum spanning tree is treated as a noise point. We call it a boundary point if it is amongst its adjacent vertex’s M - 1 nearest neighbours. By default, only boundary points are merged with their nearest core points.

To force a classical n_clusters-partition of a data set (with no notion of noise), choose "all". On the other hand, "none" keeps all noise points marked as such (labelled -1).

quitefastmst_params : dict

Additional parameters to be passed to quitefastmst.mst_euclid if metric is "l2".

verbose : bool

Whether to print diagnostic messages and progress information onto stderr.

Attributes:
labels_ : ndarray

Detected cluster labels.

Normally an integer vector such that labels_[i] gives the cluster ID (between 0 and n_clusters_ - 1) of the i-th object. If M > 1, noise/boundary points are labelled -1 (unless taken care of at the postprocessing stage).

n_clusters_ : int

The actual number of clusters detected by the algorithm.

It can be different from the requested one if there are too many noise points in the dataset.

n_samples_ : int

The number of points in the dataset.

n_features_ : int

The number of features in the dataset.

If the information is not available, it will be set to -1.

Methods

fit_predict(X[, y])

Perform cluster analysis of a dataset and return the predicted labels.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

Notes

If the Euclidean distance is selected, then quitefastmst.mst_euclid is used to compute the MST; it is quite fast in low-dimensional spaces. Otherwise, an implementation of the Jarník (Prim/Dijkstra)-like \(O(n^2)\)-time algorithm is called.

If M > 1, then the minimum spanning tree is computed with respect to the mutual reachability distance (based, e.g., on the Euclidean metric). Formally, the distance \(d_M(i,j)\) is used instead of the chosen “raw” distance, \(d(i,j)\). It holds \(d_M(i,j)=\max\{d(i,j), c_M(i), c_M(j)\}\), where the “core” distance \(c_M(i)\) is given by \(d(i,k)\) with \(k\) being the \((M-1)\)-th nearest neighbour of \(i\). This causes “noise” and “boundary” points to be “pulled away” from each other.

Note that M = 2 corresponds to the original distance.
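For illustration, the mutual reachability distance can be computed directly from a pairwise distance matrix. The sketch below (plain NumPy/SciPy; not part of the package's API) follows the formula above:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mutual_reachability(X, M):
    """Mutual reachability distance matrix d_M for a data matrix X."""
    D = squareform(pdist(X))       # raw Euclidean distances, shape (n, n)
    # c_M(i): distance from i to its (M-1)-th nearest neighbour
    # (column 0 of each sorted row is the zero self-distance)
    core = np.sort(D, axis=1)[:, M - 1]
    DM = np.maximum(D, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(DM, 0.0)
    return DM
```

In particular, mutual_reachability(X, 1) and mutual_reachability(X, 2) coincide with the raw distance matrix, in line with the remark above.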

The underlying clustering algorithm might leave out all MST leaves from the clustering process. They may be merged with the nearest clusters at the postprocessing stage, or left marked as “noise” observations.

Environment variables:
OMP_NUM_THREADS

Controls the number of threads used for computing the minimum spanning tree.

References

[1]

Campello, R.J.G.B., Moulavi, D., Sander, J., Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160-172, doi:10.1007/978-3-642-37456-2_14.

[2]

Gagolewski, M., Cena, A., Bartoszuk, M., Brzozowski, L., Clustering with minimum spanning trees: How good can it be?, Journal of Classification 42, 2025, 90-112, doi:10.1007/s00357-024-09483-1.

fit_predict(X, y=None)

Perform cluster analysis of a dataset and return the predicted labels.

Parameters:
X : object

Typically a matrix with n_samples rows and n_features columns; see below for more details and options.

y : None

Ignored.

Returns:
labels_ : ndarray

The self.labels_ attribute.

Notes

Acceptable X types are as follows.

For metric of "precomputed", X should either be a distance vector of length n_samples*(n_samples-1)/2 (see scipy.spatial.distance.pdist) or a square distance matrix of shape (n_samples, n_samples) (see scipy.spatial.distance.squareform).

Otherwise, X should be a real-valued matrix (a dense numpy.ndarray or an object coercible to one) with n_samples rows and n_features columns.

In the latter case, it might be a good idea to standardise or otherwise preprocess the coordinates of the input data points, for instance by calling X = (X-X.mean(axis=0))/X.std(axis=None, ddof=1), so that the dataset is centred at 0 and has total variance of 1. This way, the method becomes translation and scale invariant. Moreover, if the data are recorded with small precision (say, up to a few decimal digits), adding a tiny bit of Gaussian noise will ensure that the solution is unique (note that this generally applies to other distance-based clustering algorithms as well).

For the clustering result, refer to the labels_ and n_clusters_ attributes.

class genieclust.Genie(n_clusters=2, *, gini_threshold=0.3, M=1, metric='l2', compute_full_tree=False, compute_all_cuts=False, postprocess='boundary', quitefastmst_params={}, verbose=False)

Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection

Parameters:
n_clusters : int

The number of clusters to detect.

If M > 1 and postprocess is not "all", setting n_clusters = 1 makes the algorithm behave like a noise point/outlier detector.

gini_threshold : float

The threshold for the Genie correction in [0,1].

The Gini index is used to quantify the inequality of the cluster size distribution. Low thresholds highly penalise the formation of small clusters. Threshold of 1.0 disables the correction: in such a case, if M = 1, then the method is equivalent to the single linkage algorithm.

The algorithm tends to be stable with respect to small changes to the threshold, as they usually do not affect the output clustering. Usually, thresholds of 0.1, 0.3, 0.5, and 0.7 are worth giving a try.

M : int

The smoothing factor for the mutual reachability distance [2]. M = 1 and M = 2 indicate the original distance as given by the metric parameter; see genieclust.MSTClusterMixin for more details.

M = 1 gives the original Genie algorithm [1] (with no noise/boundary point detection) with respect to the chosen distance.

metric : str

The metric used to compute the linkage; see genieclust.MSTClusterMixin for more details. Defaults to "l2", i.e., the Euclidean distance.

compute_full_tree : bool

Whether to determine the entire cluster hierarchy and the linkage matrix.

Only available if M = 1. Enables plotting dendrograms or cutting the cluster hierarchy at an arbitrary level; see the children_, distances_, counts_ attributes.

compute_all_cuts : bool

Whether to compute the requested n_clusters-partition and all the coarser-grained ones.

If True, then the labels_ attribute will be a matrix; see below.

postprocess : {"boundary", "none", "all"}

Controls the treatment of noise/boundary points after the clusters are identified; see genieclust.MSTClusterMixin for more details.

quitefastmst_params : dict

Additional parameters to be passed to quitefastmst.mst_euclid if metric is "l2".

verbose : bool

Whether to print diagnostic messages and progress information onto stderr.

Attributes:
labels_ : ndarray

Detected cluster labels.

If compute_all_cuts is False (the default), it is an integer vector such that labels_[i] gives the cluster ID (between 0 and n_clusters_ - 1) of the i-th object. If M > 1, noise/boundary points are labelled -1 (unless taken care of at the postprocessing stage).

Otherwise, i.e., if compute_all_cuts is True, all partitions of cardinality down to n_clusters are determined; labels_[j,i] denotes the cluster ID of the i-th point in the j-partition. Both the 0- and the 1-partition distinguish only between noise and non-noise points; however, no postprocessing is conducted on the 0-partition (there might be points labelled -1 even if postprocess is "all").

n_clusters_ : int

The actual number of clusters detected by the algorithm. It can be different from the requested one if there are too many noise points in the dataset.

n_samples_ : int

The number of points in the dataset.

n_features_ : int

The number of features in the dataset.

If the information is not available, it will be set to -1.

children_ : None or ndarray

If compute_full_tree is True, this is a matrix whose i-th row provides the information on the clusters merged in the i-th iteration. See the description of Z[:,0] and Z[:,1] in scipy.cluster.hierarchy.linkage. Together with distances_ and counts_, this constitutes the linkage matrix that can be used for plotting the dendrogram.

distances_ : None or ndarray

If compute_full_tree is True, this is a vector that gives the distance between two clusters merged in each iteration, see the description of Z[:,2] in scipy.cluster.hierarchy.linkage.

As the original Genie algorithm does not guarantee that distances are ordered increasingly (there are other hierarchical clustering linkages that violate the ultrametricity property too), Genie automatically applies the following correction:

distances_ = genieclust.tools.cummin(distances_[::-1])[::-1].

counts_ : None or ndarray

If compute_full_tree is True, this is a vector giving the number of elements in a cluster created in each iteration. See the description of Z[:,3] in scipy.cluster.hierarchy.linkage.
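The distances_ correction applied via genieclust.tools.cummin amounts to a reversed cumulative minimum; an equivalent plain-NumPy sketch:

```python
import numpy as np

def monotonise_distances(d):
    """Reversed cumulative minimum, reversed back: the pointwise-largest
    non-decreasing sequence bounded above by d (as used for distances_)."""
    return np.minimum.accumulate(np.asarray(d)[::-1])[::-1]
```

For instance, monotonise_distances([1., 3., 2., 4.]) yields [1., 2., 2., 4.], a valid non-decreasing sequence of merge heights.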

Methods

fit(X[, y])

Perform cluster analysis of a dataset.

fit_predict(X[, y])

Perform cluster analysis of a dataset and return the predicted labels.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

Notes

Genie is a robust and outlier-resistant hierarchical clustering algorithm [1], originally published in the R package genie. This new implementation is, amongst others, much faster and now features optional noise point detection (if M > 1).

Genie is based on the minimum spanning tree (MST) of the pairwise distance graph of a given point set (refer to genieclust.MSTClusterMixin and [3] for more details). Just like the single linkage, it consumes the edges of the MST in increasing order of weights. However, it prevents the formation of clusters of highly imbalanced sizes; once the Gini index of the cluster size distribution rises above an assumed threshold, a forced merge of a point group of the smallest size is undertaken. Its appealing simplicity goes hand in hand with its usability; Genie often outperforms other clustering approaches on benchmark data.
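The Gini index of a cluster size distribution mentioned above can be computed with a standard formula (a pure-NumPy sketch for illustration, not the package's internal routine):

```python
import numpy as np

def gini_index(counts):
    """Gini index of a cluster size distribution:
    0 for perfectly balanced sizes, close to 1 for extreme imbalance."""
    c = np.sort(np.asarray(counts, dtype=float))[::-1]   # non-increasing order
    n = len(c)
    i = np.arange(1, n + 1)
    return np.sum((n + 1 - 2 * i) * c) / ((n - 1) * np.sum(c))
```

For example, gini_index([10, 10, 10]) gives 0.0 (perfect balance), whereas gini_index([30, 0, 0]) gives 1.0.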

The Genie algorithm itself has \(O(n \sqrt{n})\) time and \(O(n)\) memory complexity provided that a minimum spanning tree of the pairwise distance graph is given. If the Euclidean distance is selected, then quitefastmst.mst_euclid is used to compute the MST; it is quite fast in low-dimensional spaces. Otherwise, an implementation of the Jarník (Prim/Dijkstra)-like \(O(n^2)\)-time algorithm is called.

The Genie correction together with the smoothing factor M > 1 controlling the mutual reachability distance gives a version of the HDBSCAN* [2] algorithm that, contrary to its predecessor, is able to detect a predefined number of clusters. Hence, it is independent of DBSCAN’s eps parameter or HDBSCAN’s min_cluster_size one.

Note that M = 2 corresponds to the original distance. If M > 1, all MST leaves are left out from the clustering process. They may be merged with the nearest clusters at the postprocessing stage, or left marked as “noise” observations.

Environment variables:
OMP_NUM_THREADS

Controls the number of threads used when computing the minimum spanning tree.

References

[1]

Gagolewski, M., Bartoszuk, M., Cena, A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8-23. doi:10.1016/j.ins.2016.05.003.

[2]

Campello, R.J.G.B., Moulavi, D., Sander, J., Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160-172, doi:10.1007/978-3-642-37456-2_14.

[3]

Gagolewski, M., Cena, A., Bartoszuk, M., Brzozowski, L., Clustering with minimum spanning trees: How good can it be?, Journal of Classification 42, 2025, 90-112, doi:10.1007/s00357-024-09483-1.

fit(X, y=None)

Perform cluster analysis of a dataset.

Parameters:
X : object

Typically a matrix with n_samples rows and n_features columns; see genieclust.MSTClusterMixin.fit_predict for more details.

y : None

Ignored.

Returns:
self : genieclust.Genie

The object that the method was called on.

Notes

Refer to the labels_ and n_clusters_ attributes for the result.

class genieclust.GIc(n_clusters=2, *, gini_thresholds=[0.1, 0.3, 0.5, 0.7], M=1, metric='l2', compute_full_tree=False, compute_all_cuts=False, postprocess='boundary', add_clusters=0, n_features=None, quitefastmst_params={}, verbose=False)

GIc (Genie+Information Criterion) clustering algorithm

Parameters:
n_clusters : int

The number of clusters to detect; see genieclust.Genie for more details.

gini_thresholds : array_like

A list of Gini’s index thresholds between 0 and 1.

The GIc algorithm optimises the information criterion in an agglomerative way, starting from the intersection of the clusterings returned by Genie(n_clusters=n_clusters+add_clusters, gini_threshold=gini_thresholds[i]), for all i from 0 to len(gini_thresholds)-1.

M : int

The smoothing factor for the mutual reachability distance; see genieclust.Genie for more details.

metric : str

The metric used to compute the linkage; see genieclust.Genie for more details.

compute_full_tree : bool

Whether to determine the entire cluster hierarchy and the linkage matrix; see genieclust.Genie for more details.

compute_all_cuts : bool

Whether to compute the requested n_clusters-partition and all the coarser-grained ones; see genieclust.Genie for more details.

Note that if compute_all_cuts is True, then the i-th cut in the hierarchy behaves as if add_clusters was equal to n_clusters-i. In other words, the returned cuts might be different from those obtained by multiple calls to GIc, each time with different n_clusters and constant add_clusters requested.

add_clusters : int

Number of additional clusters to work with internally.

n_features : float or None

The dataset’s (intrinsic) dimensionality.

If None, it will be set based on the shape of the input matrix. However, if metric is "precomputed", it must be set manually.

postprocess : {"boundary", "none", "all"}

Controls the treatment of noise/boundary points after the clusters are identified; see genieclust.MSTClusterMixin for more details.

quitefastmst_params : dict

Additional parameters to be passed to quitefastmst.mst_euclid if metric is "l2".

verbose : bool

Whether to print diagnostic messages and progress information onto stderr.

Attributes:
See genieclust.Genie.

Methods

fit(X[, y])

Perform cluster analysis of a dataset.

fit_predict(X[, y])

Perform cluster analysis of a dataset and return the predicted labels.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

See also

genieclust.Genie
quitefastmst.mst_euclid

Notes

GIc (Genie+Information Criterion) is an information-theoretic clustering algorithm proposed by Anna Cena in [1]. It was inspired by the ITM method of Mueller et al. [2] and by Genie [3].

GIc computes an n_clusters-partition based on a pre-computed minimum spanning tree (MST) of the pairwise distance graph of a given point set (refer to genieclust.MSTClusterMixin and [4] for more details). Clusters are merged so as to maximise (heuristically) the information criterion discussed in [2].

GIc uses a bottom-up, agglomerative approach (as opposed to ITM, which follows a divisive scheme). It greedily selects for merging a pair of clusters that maximises the information criterion. By default, the initial partition is determined by considering the intersection of the partitions found by multiple runs of the Genie method with thresholds [0.1, 0.3, 0.5, 0.7], which we observe to be a sensible choice for most clustering activities. Hence, contrary to the Genie method, GIc is virtually parameter-free. However, when run with different values of n_clusters, it does not yield a hierarchy of nested partitions (unless some more laborious parameter tuning is applied).
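The intersection (meet) of several partitions, from which GIc starts, can be sketched as follows: two points end up in the same initial cluster only if every input labelling puts them together (the function name below is illustrative, not part of the package's API):

```python
import numpy as np

def partition_intersection(labelings):
    """Meet of several label vectors of equal length."""
    combined = np.stack(labelings, axis=1)   # one row of labels per point
    # points sharing an identical row belong to the same intersected cluster
    _, meet = np.unique(combined, axis=0, return_inverse=True)
    return meet.reshape(-1)
```

For instance, partition_intersection([[0, 0, 1, 1], [0, 1, 1, 1]]) yields [0, 1, 2, 2]: the second point is split off because the two labelings disagree about it.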

Environment variables:
OMP_NUM_THREADS

Controls the number of threads used when computing the minimum spanning tree.

References

[1]

Cena, A., Adaptive hierarchical clustering algorithms based on data aggregation methods, PhD Thesis, Systems Research Institute, Polish Academy of Sciences 2018.

[2]

Mueller, A., Nowozin, S., Lampert, C.H., Information Theoretic Clustering using Minimum Spanning Trees, DAGM-OAGM, 2012.

[3]

Gagolewski, M., Bartoszuk, M., Cena, A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8-23, doi:10.1016/j.ins.2016.05.003.

[4]

Gagolewski, M., Cena, A., Bartoszuk, M., Brzozowski, L., Clustering with minimum spanning trees: How good can it be?, Journal of Classification 42, 2025, 90-112, doi:10.1007/s00357-024-09483-1.

fit(X, y=None)

Perform cluster analysis of a dataset.

Parameters:
X : object

Typically a matrix with n_samples rows and n_features columns; see genieclust.MSTClusterMixin.fit_predict for more details.

y : None

Ignored.

Returns:
self : genieclust.GIc

The object that the method was called on.

Notes

Refer to the labels_ and n_clusters_ attributes for the result.

Note that for metric of "precomputed", the n_features parameter must be set explicitly.