genieclust¶
- class genieclust.MSTClusterMixin(*, n_clusters, M, metric, postprocess, quitefastmst_params, verbose)¶
A base class for genieclust.Genie, genieclust.GIc, and other MST-based clustering algorithms [2].
- Parameters:
- n_clusters : int
The number of clusters to detect.
If M > 1 and postprocess is not "all", setting n_clusters = 1 makes the algorithm behave like a noise point/outlier detector.
- M : int
Smoothing factor for the mutual reachability distance [1]. Both M = 1 and M = 2 correspond to the original distance as given by the metric parameter.
- metric : str
The metric used to compute the linkage.
One of: "l2" (synonym: "euclidean"; the default), "l1" (synonyms: "manhattan", "cityblock"), "cosinesimil" (synonym: "cosine"), or "precomputed".
For "precomputed", the X argument to the fit method must be a distance vector or a square-form distance matrix; see scipy.spatial.distance.pdist.
Determining minimum spanning trees with respect to the Euclidean distance (the default) is much faster than with other metrics thanks to the quitefastmst package.
- postprocess : {"boundary", "none", "all"}
Controls the treatment of noise/boundary points after the clusters are identified.
In effect only if M > 1. Each leaf in the minimum spanning tree is treated as a noise point. We call it a boundary point if it is amongst its adjacent vertex’s M - 1 nearest neighbours. By default, only boundary points are merged with their nearest core points.
To force a classical n_clusters-partition of a data set (with no notion of noise), choose "all". Furthermore, "none" leaves all the MST leaves marked as noise.
- quitefastmst_params : dict
Additional parameters to be passed to quitefastmst.mst_euclid if metric is "l2".
- verbose : bool
Whether to print diagnostic messages and progress information onto stderr.
- Attributes:
- labels_ : ndarray
Detected cluster labels.
Normally an integer vector such that labels_[i] gives the cluster ID (between 0 and n_clusters_ - 1) of the i-th object. If M > 1, noise/boundary points are labelled -1 (unless taken care of at the postprocessing stage).
- n_clusters_ : int
The actual number of clusters detected by the algorithm.
It can be different from the requested one if there are too many noise points in the dataset.
- n_samples_ : int
The number of points in the dataset.
- n_features_ : int
The number of features in the dataset.
If the information is not available, it will be set to -1.
Methods
- fit_predict(X[, y]): Perform cluster analysis of a dataset and return the predicted labels.
- get_metadata_routing(): Get metadata routing of this object.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
Notes
If the Euclidean distance is selected, then quitefastmst.mst_euclid is used to compute the MST; it is quite fast in low-dimensional spaces. Otherwise, an implementation of the Jarník (Prim/Dijkstra)-like \(O(n^2)\)-time algorithm is called.
If M > 1, then the minimum spanning tree is computed with respect to the mutual reachability distance (based, e.g., on the Euclidean metric). Formally, the distance \(d_M(i,j)\) is used instead of the chosen “raw” distance, \(d(i,j)\). It holds \(d_M(i,j)=\max\{d(i,j), c_M(i), c_M(j)\}\), where the “core” distance \(c_M(i)\) is given by \(d(i,k)\) with \(k\) being the \((M-1)\)-th nearest neighbour of \(i\). This causes “noise” and “boundary” points to be “pulled away” from each other.
Note that M = 2 corresponds to the original distance.
The underlying clustering algorithm might leave out all MST leaves from the clustering process. They may be merged with the nearest clusters at the postprocessing stage, or left marked as “noise” observations.
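For illustration, here is a minimal brute-force sketch of the mutual reachability distance defined above (the helper name and the \(O(n^2)\) approach are ours, for exposition only; the package itself computes MSTs with respect to this distance far more efficiently):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mutual_reachability(X, M):
    """Brute-force d_M for a data matrix X under the Euclidean metric."""
    D = squareform(pdist(X))              # raw pairwise distances d(i, j)
    # c_M(i): distance to the (M-1)-th nearest neighbour of i; column 0 of
    # each sorted row is d(i, i) = 0, so column M-1 is the right one.
    c = np.sort(D, axis=1)[:, M - 1]
    # d_M(i, j) = max{d(i, j), c_M(i), c_M(j)}
    D_M = np.maximum(D, np.maximum(c[:, None], c[None, :]))
    np.fill_diagonal(D_M, 0.0)            # keep zeros on the diagonal
    return D_M
```

Note that for M = 1, the core distances are all zero, so d_M reduces to the raw distance, in line with the remark above.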
- Environment variables:
- OMP_NUM_THREADS
Controls the number of threads used for computing the minimum spanning tree.
References
[1] Campello, R.J.G.B., Moulavi, D., Sander, J., Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160-172, doi:10.1007/978-3-642-37456-2_14.
[2] Gagolewski, M., Cena, A., Bartoszuk, M., Brzozowski, L., Clustering with minimum spanning trees: How good can it be?, Journal of Classification 42, 2025, 90-112, doi:10.1007/s00357-024-09483-1.
- fit_predict(X, y=None)¶
Perform cluster analysis of a dataset and return the predicted labels.
- Parameters:
- X : object
Typically a matrix with n_samples rows and n_features columns; see below for more details and options.
- y : None
Ignored.
- Returns:
- labels_ : ndarray
The self.labels_ attribute.
Notes
Acceptable X types are as follows.
For metric of "precomputed", X should either be a distance vector of length n_samples*(n_samples-1)/2 (see scipy.spatial.distance.pdist) or a square distance matrix of shape (n_samples, n_samples) (see scipy.spatial.distance.squareform).
Otherwise, X should be a real-valued matrix (a dense numpy.ndarray, or an object coercible to one) with n_samples rows and n_features columns.
In the latter case, it might be a good idea to standardise or at least somehow preprocess the coordinates of the input data points by calling, for instance, X = (X-X.mean(axis=0))/X.std(axis=None, ddof=1), so that the dataset is centred at 0 and has a total variance of 1. This way, the method becomes translation and scale invariant. What’s more, if data are recorded with small precision (say, up to a few decimal digits), adding a tiny bit of Gaussian noise will ensure the solution is unique (note that this generally applies to other distance-based clustering algorithms as well).
For the clustering result, refer to the labels_ and n_clusters_ attributes.
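As a hedged usage sketch combining the preprocessing advice above with fit_predict (shown here via genieclust.Genie, a subclass of this mixin; the blob generator merely stands in for real data):

```python
import numpy as np
import genieclust
from sklearn.datasets import make_blobs   # toy data only

X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)
X = (X - X.mean(axis=0)) / X.std(axis=None, ddof=1)  # centre at 0, total variance 1
rng = np.random.default_rng(42)
X += 1e-9 * rng.standard_normal(X.shape)             # tiny jitter -> unique solution

g = genieclust.Genie(n_clusters=3)
labels = g.fit_predict(X)   # labels[i] is in {0, ..., g.n_clusters_ - 1}
```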
- class genieclust.Genie(n_clusters=2, *, gini_threshold=0.3, M=1, metric='l2', compute_full_tree=False, compute_all_cuts=False, postprocess='boundary', quitefastmst_params={}, verbose=False)¶
Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection
- Parameters:
- n_clusters : int
The number of clusters to detect.
If M > 1 and postprocess is not "all", setting n_clusters = 1 makes the algorithm behave like a noise point/outlier detector.
- gini_threshold : float
The threshold for the Genie correction in [0, 1].
The Gini index is used to quantify the inequality of the cluster size distribution. Low thresholds highly penalise the formation of small clusters. A threshold of 1.0 disables the correction; in such a case, if M = 1, the method is equivalent to the single linkage algorithm.
The algorithm tends to be stable with respect to small changes to the threshold, as they usually do not affect the output clustering. Thresholds of 0.1, 0.3, 0.5, and 0.7 are usually worth a try.
- M : int
The smoothing factor for the mutual reachability distance [2]. Both M = 1 and M = 2 correspond to the original distance as given by the metric parameter; see genieclust.MSTClusterMixin for more details.
M = 1 gives the original Genie algorithm [1] (with no noise/boundary point detection) with respect to the chosen distance.
- metric : str
The metric used to compute the linkage; see genieclust.MSTClusterMixin for more details. Defaults to "l2", i.e., the Euclidean distance.
- compute_full_tree : bool
Whether to determine the entire cluster hierarchy and the linkage matrix.
Only available if M = 1. Enables plotting dendrograms or cutting the cluster hierarchy at an arbitrary level; see the children_, distances_, and counts_ attributes.
- compute_all_cuts : bool
Whether to compute the requested n_clusters-partition and all the coarser-grained ones.
If True, the labels_ attribute will be a matrix; see below.
- postprocess : {"boundary", "none", "all"}
Controls the treatment of noise/boundary points after the clusters are identified; see genieclust.MSTClusterMixin for more details.
- quitefastmst_params : dict
Additional parameters to be passed to quitefastmst.mst_euclid if metric is "l2".
- verbose : bool
Whether to print diagnostic messages and progress information onto stderr.
- Attributes:
- labels_ : ndarray
Detected cluster labels.
If compute_all_cuts is False (the default), it is an integer vector such that labels_[i] gives the cluster ID (between 0 and n_clusters_ - 1) of the i-th object. If M > 1, noise/boundary points are labelled -1 (unless taken care of at the postprocessing stage).
Otherwise, i.e., if compute_all_cuts is True, all partitions of cardinality down to n_clusters are determined; labels_[j,i] denotes the cluster ID of the i-th point in the j-partition. We assume that both the 0- and the 1-partition distinguish only between noise and non-noise points; however, no postprocessing is conducted on the 0-partition (there might be points with labels of -1 even if postprocess is "all").
- n_clusters_ : int
The actual number of clusters detected by the algorithm. It can be different from the requested one if there are too many noise points in the dataset.
- n_samples_ : int
The number of points in the dataset.
- n_features_ : int
The number of features in the dataset.
If the information is not available, it will be set to -1.
- children_ : None or ndarray
If compute_full_tree is True, this is a matrix whose i-th row provides the information on the clusters merged in the i-th iteration. See the description of Z[:,0] and Z[:,1] in scipy.cluster.hierarchy.linkage. Together with distances_ and counts_, this constitutes the linkage matrix that can be used for plotting the dendrogram (see the sketch after this list).
- distances_ : None or ndarray
If compute_full_tree is True, this is a vector that gives the distance between the two clusters merged in each iteration; see the description of Z[:,2] in scipy.cluster.hierarchy.linkage.
As the original Genie algorithm does not guarantee that the distances are ordered increasingly (there are other hierarchical clustering linkages that violate the ultrametricity property too), Genie automatically applies the following correction: distances_ = genieclust.tools.cummin(distances_[::-1])[::-1].
- counts_ : None or ndarray
If compute_full_tree is True, this is a vector giving the number of elements in the cluster created in each iteration. See the description of Z[:,3] in scipy.cluster.hierarchy.linkage.
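For illustration, a sketch of assembling children_, distances_, and counts_ into a SciPy-style linkage matrix and plotting the dendrogram (this assumes compute_full_tree=True and M = 1, as required above; the toy data are ours):

```python
import numpy as np
import genieclust
import scipy.cluster.hierarchy
import matplotlib.pyplot as plt

X = np.random.default_rng(123).standard_normal((100, 2))  # toy data

g = genieclust.Genie(n_clusters=2, compute_full_tree=True)
g.fit(X)

# children_, distances_, counts_ form the four columns of a linkage matrix Z
Z = np.column_stack((g.children_, g.distances_, g.counts_))
scipy.cluster.hierarchy.dendrogram(Z)
plt.show()
```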
Methods
- fit(X[, y]): Perform cluster analysis of a dataset.
- fit_predict(X[, y]): Perform cluster analysis of a dataset and return the predicted labels.
- get_metadata_routing(): Get metadata routing of this object.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
Notes
Genie is a robust and outlier-resistant hierarchical clustering algorithm [1], originally published in the R package genie. This new implementation is, amongst others, much faster and now features optional noise point detection (if M > 1).
Genie is based on the minimum spanning tree (MST) of the pairwise distance graph of a given point set (refer to genieclust.MSTClusterMixin and [3] for more details). Just like the single linkage, it consumes the edges of the MST in increasing order of weights. However, it prevents the formation of clusters of highly imbalanced sizes; once the Gini index of the cluster size distribution rises above an assumed threshold, a forced merge of a point group of the smallest size is undertaken. Its appealing simplicity goes hand in hand with its usability; Genie often outperforms other clustering approaches on benchmark data.
The Genie algorithm itself has \(O(n \sqrt{n})\) time and \(O(n)\) memory complexity provided that a minimum spanning tree of the pairwise distance graph is given. If the Euclidean distance is selected, then quitefastmst.mst_euclid is used to compute the MST; it is quite fast in low-dimensional spaces. Otherwise, an implementation of the Jarník (Prim/Dijkstra)-like \(O(n^2)\)-time algorithm is called.
The Genie correction together with the smoothing factor M > 1 controlling the mutual reachability distance gives a version of the HDBSCAN* [2] algorithm that, contrary to its predecessor, is able to detect a predefined number of clusters. Hence, it is independent of DBSCAN’s eps parameter and HDBSCAN’s min_cluster_size one.
Note that M = 2 corresponds to the original distance. If M > 1, all MST leaves are left out from the clustering process. They may be merged with the nearest clusters at the postprocessing stage, or left marked as “noise” observations.
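A hedged sketch of the HDBSCAN*-like mode just described (the data-generating code is ours; with postprocess="none", the MST leaves are reported with label -1):

```python
import numpy as np
import genieclust

rng = np.random.default_rng(0)
X = np.vstack((rng.standard_normal((200, 2)),          # two dense blobs
               rng.standard_normal((200, 2)) + 10.0,
               rng.uniform(-5, 15, size=(20, 2))))     # scattered outliers

g = genieclust.Genie(n_clusters=2, M=5, postprocess="none")
labels = g.fit_predict(X)
print((labels == -1).sum(), "points marked as noise")  # MST leaves -> -1
```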
- Environment variables:
- OMP_NUM_THREADS
Controls the number of threads used when computing the minimum spanning tree.
References
[1] Gagolewski, M., Bartoszuk, M., Cena, A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8-23, doi:10.1016/j.ins.2016.05.003.
[2] Campello, R.J.G.B., Moulavi, D., Sander, J., Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160-172, doi:10.1007/978-3-642-37456-2_14.
[3] Gagolewski, M., Cena, A., Bartoszuk, M., Brzozowski, L., Clustering with minimum spanning trees: How good can it be?, Journal of Classification 42, 2025, 90-112, doi:10.1007/s00357-024-09483-1.
- fit(X, y=None)¶
Perform cluster analysis of a dataset.
- Parameters:
- X : object
Typically a matrix with n_samples rows and n_features columns; see genieclust.MSTClusterMixin.fit_predict for more details.
- y : None
Ignored.
- Returns:
- self : genieclust.Genie
The object that the method was called on.
Notes
Refer to the labels_ and n_clusters_ attributes for the result.
- class genieclust.GIc(n_clusters=2, *, gini_thresholds=[0.1, 0.3, 0.5, 0.7], M=1, metric='l2', compute_full_tree=False, compute_all_cuts=False, postprocess='boundary', add_clusters=0, n_features=None, quitefastmst_params={}, verbose=False)¶
GIc (Genie+Information Criterion) clustering algorithm
- Parameters:
- n_clusters : int
The number of clusters to detect; see genieclust.Genie for more details.
- gini_thresholds : array_like
A list of Gini index thresholds between 0 and 1.
The GIc algorithm optimises the information criterion in an agglomerative way, starting from the intersection of the clusterings returned by Genie(n_clusters=n_clusters+add_clusters, gini_threshold=gini_thresholds[i]), for all i from 0 to len(gini_thresholds)-1.
- M : int
The smoothing factor for the mutual reachability distance; see genieclust.Genie for more details.
- metric : str
The metric used to compute the linkage; see genieclust.Genie for more details.
- compute_full_tree : bool
Whether to determine the entire cluster hierarchy and the linkage matrix; see genieclust.Genie for more details.
- compute_all_cuts : bool
Whether to compute the requested n_clusters-partition and all the coarser-grained ones; see genieclust.Genie for more details.
Note that if compute_all_cuts is True, then the i-th cut in the hierarchy behaves as if add_clusters was equal to n_clusters-i. In other words, the returned cuts might be different from those obtained by multiple calls to GIc, each time with a different n_clusters and a constant add_clusters requested.
- add_clusters : int
The number of additional clusters to work with internally.
- n_features : float or None
The dataset’s (intrinsic) dimensionality.
If None, it will be set based on the shape of the input matrix. However, for metric of "precomputed", it needs to be set manually.
- postprocess : {"boundary", "none", "all"}
Controls the treatment of noise/boundary points after the clusters are identified; see genieclust.MSTClusterMixin for more details.
- quitefastmst_params : dict
Additional parameters to be passed to quitefastmst.mst_euclid if metric is "l2".
- verbose : bool
Whether to print diagnostic messages and progress information onto stderr.
- Attributes:
- See genieclust.Genie.
Methods
- fit(X[, y]): Perform cluster analysis of a dataset.
- fit_predict(X[, y]): Perform cluster analysis of a dataset and return the predicted labels.
- get_metadata_routing(): Get metadata routing of this object.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
See also
genieclust.Genie
quitefastmst.mst_euclid
Notes
GIc (Genie+Information Criterion) is an information-theoretic clustering algorithm. It was proposed by Anna Cena in [1] and was inspired by ITM by Mueller et al. [2] and by Genie [3].
GIc computes an n_clusters-partition based on a pre-computed minimum spanning tree (MST) of the pairwise distance graph of a given point set (refer to genieclust.MSTClusterMixin and [4] for more details). Clusters are merged so as to maximise (heuristically) the information criterion discussed in [2].
GIc uses a bottom-up, agglomerative approach (as opposed to ITM, which follows a divisive scheme). It greedily selects for merging the pair of clusters that maximises the information criterion. By default, the initial partition is determined by considering the intersection of the partitions found by multiple runs of the Genie method with the thresholds [0.1, 0.3, 0.5, 0.7], which we observe to be a sensible choice for most clustering activities. Hence, contrary to the Genie method, we can say that GIc is virtually parameter-free. However, when run with a different n_clusters parameter, it does not yield a hierarchy of nested partitions (unless some more laborious parameter tuning is applied).
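As illustration, a hedged usage sketch with GIc’s default, virtually parameter-free settings (the toy data are ours):

```python
import numpy as np
import genieclust

X = np.random.default_rng(1).standard_normal((500, 2))  # toy data

# The default gini_thresholds=[0.1, 0.3, 0.5, 0.7] yield the initial
# partition as the intersection of four Genie runs.
gic = genieclust.GIc(n_clusters=3)
labels = gic.fit_predict(X)
```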
- Environment variables:
- OMP_NUM_THREADS
Controls the number of threads used when computing the minimum spanning tree.
References
[1] Cena, A., Adaptive hierarchical clustering algorithms based on data aggregation methods, PhD Thesis, Systems Research Institute, Polish Academy of Sciences, 2018.
[2] Mueller, A., Nowozin, S., Lampert, C.H., Information Theoretic Clustering using Minimum Spanning Trees, DAGM-OAGM, 2012.
[3] Gagolewski, M., Bartoszuk, M., Cena, A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8-23, doi:10.1016/j.ins.2016.05.003.
[4] Gagolewski, M., Cena, A., Bartoszuk, M., Brzozowski, L., Clustering with minimum spanning trees: How good can it be?, Journal of Classification 42, 2025, 90-112, doi:10.1007/s00357-024-09483-1.
- fit(X, y=None)¶
Perform cluster analysis of a dataset.
- Parameters:
- X : object
Typically a matrix with n_samples rows and n_features columns; see genieclust.MSTClusterMixin.fit_predict for more details.
- y : None
Ignored.
- Returns:
- self : genieclust.GIc
The object that the method was called on.
Notes
Refer to the labels_ and n_clusters_ attributes for the result.
Note that for metric of "precomputed", the n_features parameter must be set explicitly.
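A hedged sketch of this precomputed-distance workflow (the toy data are ours; n_features is supplied manually since the dimensionality cannot be inferred from a distance matrix):

```python
import numpy as np
import genieclust
from scipy.spatial.distance import pdist

X = np.random.default_rng(2).standard_normal((200, 3))  # toy data
D = pdist(X)   # condensed distance vector of length n*(n-1)/2

gic = genieclust.GIc(n_clusters=2, metric="precomputed", n_features=3)
labels = gic.fit_predict(D)
```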