genieclust
- class genieclust.Genie(n_clusters=2, *, gini_threshold=0.3, M=0, metric='l2', coarser=False, quitefastmst_params=None, verbose=False)
Genie: Fast and Robust Hierarchical Clustering
- Parameters:
- n_clusters : int
The number of clusters to detect.
- gini_threshold : float in [0, 1], default=0.3
The threshold for the Genie correction.
The Gini index is used to quantify the inequality of the cluster size distribution. Low thresholds strongly penalise the formation of small clusters. A threshold of 1.0 disables the correction; in that case, if M = 0, the method is equivalent to the single-linkage algorithm.
Empirically, the algorithm tends to be stable with respect to small changes to the threshold, as they usually do not affect the output clustering. Thresholds of 0.1, 0.3, 0.5, and 0.7 are usually worth trying.
- M : int, default=0
The smoothing factor for the mutual reachability distance [2]. M = 0 and M = 1 select the original distance as given by the metric parameter; see deadwood.MSTBase for more details.
- metric : str, default='l2'
The metric used to compute the linkage; see deadwood.MSTBase for more details. Defaults to the Euclidean distance.
- coarser : bool, default=False
Whether to compute the requested n_clusters-partition and all the coarser-grained ones.
If True, then the labels_matrix_ attribute will additionally be determined; see below.
- quitefastmst_params : dict
Additional parameters to be passed to quitefastmst.mst_euclid if metric is "l2".
- verbose : bool, default=False
Whether to print diagnostic messages and progress information to stderr.
- Attributes:
- labels_ : ndarray, shape (n_samples,)
Detected cluster labels.
An integer vector such that labels_[i] gives the cluster ID (between 0 and n_clusters_ - 1) of the i-th object.
- labels_matrix_ : None or ndarray, shape (n_clusters, n_samples)
Available if coarser is True. labels_matrix_[i, :] represents an (i+1)-partition.
- n_clusters_ : int
The actual number of clusters detected by the algorithm.
- n_samples_ : int
The number of points in the dataset.
- n_features_ : int
The number of features in the dataset.
If the information is not available, it will be set to -1.
- children_ : None or ndarray
A matrix whose i-th row provides the information on the clusters merged in the i-th iteration; see the description of Z[:, 0] and Z[:, 1] in scipy.cluster.hierarchy.linkage. Together with distances_ and counts_, this constitutes the linkage matrix that can be used for plotting the dendrogram.
- distances_ : None or ndarray
A vector giving the distance between the two clusters merged in each iteration; see the description of Z[:, 2] in scipy.cluster.hierarchy.linkage.
The original Genie algorithm does not guarantee that the distances are ordered increasingly (there are other hierarchical clustering linkages that violate the ultrametricity property too). Thus, the following correction is applied automatically: distances_ = genieclust.tools.cummin(distances_[::-1])[::-1].
- counts_ : None or ndarray
A vector giving the number of elements in the cluster created in each iteration; see the description of Z[:, 3] in scipy.cluster.hierarchy.linkage.
Methods
- fit(X[, y]): Perform cluster analysis of a dataset.
- fit_predict(X[, y]): Perform cluster or anomaly analysis of a dataset and return the predicted labels.
- get_metadata_routing(): Get metadata routing of this object.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
Notes
Genie is a robust hierarchical clustering algorithm [1]. Its original implementation was included in the R package
genie. This is its faster and more capable variant.
The idea behind Genie is beautifully simple. First, make each individual point the only member of its own cluster. Then, keep merging pairs of the closest clusters, one after another. However, to prevent the formation of clusters of highly imbalanced sizes, a point group of the smallest size is sometimes combined with its nearest counterpart. Its appealing simplicity goes hand in hand with its usability; Genie often outperforms other clustering approaches on benchmark data.
Genie is based on Euclidean minimum spanning trees (MST; refer to deadwood.MSTBase and [3] for more details). If the Euclidean distance is selected, then quitefastmst.mst_euclid is used to compute the MST; it is quite fast in low-dimensional spaces. Otherwise, an implementation of the Jarník (Prim/Dijkstra)-like \(O(n^2)\)-time algorithm is called. The Genie algorithm itself has \(O(n \sqrt{n})\) time and \(O(n)\) memory complexity if an MST is already provided.
As with all distance-based methods (this includes k-means and DBSCAN as well), applying data preprocessing and feature engineering techniques (e.g., feature scaling, feature selection, dimensionality reduction) might lead to more meaningful results.
genieclust also allows clustering with respect to mutual reachability distances, enabling it to act as an alternative to HDBSCAN* [2] that can identify any number of clusters or their entire hierarchy. When combined with the deadwood package, it can act as an outlier detector.
- Environment variables:
- OMP_NUM_THREADS
Controls the number of threads used when computing the minimum spanning tree.
References
[1] M. Gagolewski, M. Bartoszuk, A. Cena, Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8-23, https://doi.org/10.1016/j.ins.2016.05.003
[2] R.J.G.B. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160-172, https://doi.org/10.1007/978-3-642-37456-2_14
[3] M. Gagolewski, A. Cena, M. Bartoszuk, Ł. Brzozowski, Clustering with minimum spanning trees: How good can it be?, Journal of Classification 42, 2025, 90-112, https://doi.org/10.1007/s00357-024-09483-1
- fit(X, y=None)
Perform cluster analysis of a dataset.
- Parameters:
- X : object
Typically a matrix or a data frame with n_samples rows and n_features columns; see deadwood.MSTBase.fit_predict for more details.
- y : None
Ignored.
- Returns:
- self : genieclust.Genie
The object that the method was called on.
Notes
Refer to the labels_ and n_clusters_ attributes for the result.
- class genieclust.GIc(n_clusters=2, *, gini_thresholds=[0.1, 0.3, 0.5, 0.7], metric='l2', coarser=False, add_clusters=0, n_features=None, quitefastmst_params=None, verbose=False)
GIc (Genie+Information Criterion) clustering algorithm
[TESTING]
- Parameters:
- n_clusters : int
The number of clusters to detect; see genieclust.Genie for more details.
- gini_thresholds : array_like
A list of Gini index thresholds between 0 and 1.
The GIc algorithm optimises the information criterion agglomeratively, starting from the intersection of the clusterings returned by Genie(n_clusters=n_clusters+add_clusters, gini_threshold=gini_thresholds[i]) for all i from 0 to len(gini_thresholds)-1.
- metric : str, default='l2'
The metric used to compute the linkage; see genieclust.Genie for more details.
- coarser : bool, default=False
Whether to compute the requested n_clusters-partition and all the coarser-grained ones; see genieclust.Genie for more details.
Note that if coarser is True, then the i-th cut in the hierarchy behaves as if add_clusters was equal to n_clusters - i. In other words, the returned cuts might differ from those obtained by multiple calls to GIc, each time with a different n_clusters and a constant add_clusters.
- add_clusters : int
Number of additional clusters to work with internally.
- n_features : float or None
The dataset’s (intrinsic) dimensionality.
If None, it will be set based on the shape of the input matrix. However, a metric of "precomputed" requires it to be set manually.
- quitefastmst_params : dict
Additional parameters to be passed to quitefastmst.mst_euclid if metric is "l2".
- verbose : bool
Whether to print diagnostic messages and progress information to stderr.
- Attributes:
- See genieclust.Genie.
Methods
- fit(X[, y]): Perform cluster analysis of a dataset.
- fit_predict(X[, y]): Perform cluster or anomaly analysis of a dataset and return the predicted labels.
- get_metadata_routing(): Get metadata routing of this object.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
See also
genieclust.Genie, quitefastmst.mst_euclid
Notes
GIc (Genie+Information Criterion) is an information-theoretic clustering algorithm proposed by Anna Cena in [1]. It was inspired by ITM [2] and Genie [3].
GIc computes an n_clusters-partition based on a pre-computed minimum spanning tree (MST) of the pairwise distance graph of a given point set (refer to deadwood.MSTBase and [4] for more details). Clusters are merged so as to maximise (heuristically) the information criterion discussed in [2].
GIc uses a bottom-up, agglomerative approach (as opposed to ITM, which follows a divisive scheme). It greedily selects for merging a pair of clusters that maximises the information criterion. By default, the initial partition is determined by intersecting the partitions found by multiple runs of the Genie method with thresholds [0.1, 0.3, 0.5, 0.7], which we observe to be a sensible choice for most clustering tasks. Hence, contrary to the Genie method, GIc is virtually parameter-free. However, when run with different n_clusters parameters, it does not yield a hierarchy of nested partitions (unless some more laborious parameter tuning is applied).
- Environment variables:
- OMP_NUM_THREADS
Controls the number of threads used when computing the minimum spanning tree.
References
[1] A. Cena, Adaptive hierarchical clustering algorithms based on data aggregation methods, PhD Thesis, Systems Research Institute, Polish Academy of Sciences, 2018
[2] A. Mueller, S. Nowozin, C.H. Lampert, Information theoretic clustering using minimum spanning trees, DAGM-OAGM, 2012
[3] M. Gagolewski, M. Bartoszuk, A. Cena, Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8-23, https://doi.org/10.1016/j.ins.2016.05.003
[4] M. Gagolewski, A. Cena, M. Bartoszuk, Ł. Brzozowski, Clustering with minimum spanning trees: How good can it be?, Journal of Classification 42, 2025, 90-112, https://doi.org/10.1007/s00357-024-09483-1
- fit(X, y=None)
Perform cluster analysis of a dataset.
- Parameters:
- X : object
Typically a matrix or a data frame with n_samples rows and n_features columns; see deadwood.MSTBase.fit_predict for more details.
- y : None
Ignored.
- Returns:
- self : genieclust.GIc
The object that the method was called on.
Notes
Refer to the labels_ and n_clusters_ attributes for the result.
Note that for a metric of "precomputed", the n_features parameter must be set explicitly.