genieclust.GIc
- class genieclust.GIc(n_clusters=2, *, gini_thresholds=[0.1, 0.3, 0.5, 0.7], M=1, affinity='l2', exact=True, compute_full_tree=False, compute_all_cuts=False, postprocess='boundary', cast_float32=True, mlpack_enabled='auto', mlpack_leaf_size=1, nmslib_n_neighbors=64, nmslib_params_init={'method': 'hnsw'}, nmslib_params_index={'post': 2}, nmslib_params_query={}, add_clusters=0, n_features=None, verbose=False)
GIc (Genie+Information Criterion) clustering algorithm
- Parameters:
- n_clusters : int
See genieclust.Genie.
- gini_thresholds : array_like
A list of Gini index thresholds between 0 and 1.
The GIc algorithm optimises the information criterion in an agglomerative way, starting from the intersection of the clusterings returned by Genie(n_clusters=n_clusters+add_clusters, gini_threshold=gini_thresholds[i]), for all i from 0 to len(gini_thresholds)-1.
- M : int
See genieclust.Genie.
- affinity : str
See genieclust.Genie.
- exact : bool, default=True
See genieclust.Genie.
- compute_full_tree : bool
See genieclust.Genie.
- compute_all_cuts : bool
See genieclust.Genie.
Note that if compute_all_cuts is True, then the i-th cut in the hierarchy behaves as if add_clusters were equal to n_clusters-i. In other words, the returned cuts might differ from those obtained by multiple calls to GIc, each time with a different n_clusters and a constant add_clusters.
- postprocess : {"boundary", "none", "all"}
See genieclust.Genie.
- cast_float32 : bool
See genieclust.Genie.
- mlpack_enabled : "auto" or bool
See genieclust.Genie.
- mlpack_leaf_size : int
See genieclust.Genie.
- nmslib_n_neighbors : int
See genieclust.Genie.
- nmslib_params_init : dict
See genieclust.Genie.
- nmslib_params_index : dict
See genieclust.Genie.
- nmslib_params_query : dict
See genieclust.Genie.
- add_clusters : int
Number of additional clusters to work with internally.
- n_features : float or None
Dataset’s (intrinsic) dimensionality.
If None, it will be set based on the shape of the input matrix. However, an affinity of "precomputed" requires this to be set manually.
- verbose : bool
See genieclust.Genie.
Notes
GIc (Genie+Information Criterion) is an information-theoretic clustering algorithm. It was proposed by Anna Cena in [1] and was inspired by the ITM algorithm of Mueller et al. [2] and the Genie algorithm of Gagolewski et al. [3]; see also [4].
GIc computes an n_clusters-partition based on a pre-computed minimum spanning tree. Clusters are merged so as to maximise (heuristically) the information criterion discussed in [2].
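The heuristic merging can be pictured as a greedy loop: at each step, try every candidate merge, keep the one that gives the highest score, and stop once the requested number of clusters is reached. The sketch below is only an illustration of that idea. The function greedy_merge and the toy score are hypothetical; the score is a stand-in, not the information criterion of [2], and the loop examines all pairs of clusters rather than working on a minimum spanning tree.

```python
from itertools import combinations


def greedy_merge(clusters, n_clusters, score):
    """Greedily merge clusters until n_clusters remain.

    `score` is any function mapping a list of clusters to a number;
    higher is better. It is a placeholder for a real criterion.
    """
    clusters = [frozenset(c) for c in clusters]
    while len(clusters) > n_clusters:
        best_score, best_partition = None, None
        for a, b in combinations(range(len(clusters)), 2):
            # Candidate partition: everything except a and b, plus their union.
            candidate = [c for k, c in enumerate(clusters) if k not in (a, b)]
            candidate.append(clusters[a] | clusters[b])
            s = score(candidate)
            if best_score is None or s > best_score:
                best_score, best_partition = s, candidate
        clusters = best_partition
    return clusters


# Toy data: four points on a line, identified by their indices.
points = {0: 0.0, 1: 0.1, 2: 5.0, 3: 5.2}


def toy_score(partition):
    # Hypothetical objective: prefer partitions whose clusters are compact.
    return -sum(max(points[i] for i in c) - min(points[i] for i in c)
                for c in partition)


result = greedy_merge([{0}, {1}, {2}, {3}], n_clusters=2, score=toy_score)
```

On this toy input, the two nearby pairs of points end up in separate clusters.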
GIc uses a bottom-up, agglomerative approach (as opposed to the ITM, which follows a divisive scheme). It greedily selects for merging a pair of clusters that maximises the information criterion [2]. By default, the initial partition is determined by considering the intersection of the partitions found by multiple runs of the Genie++ method with thresholds [0.1, 0.3, 0.5, 0.7], which we observe to be a sensible choice for most clustering tasks. Hence, in contrast to the Genie method, GIc can be regarded as virtually parameter-less. However, when run with different values of the n_clusters parameter, it does not yield a hierarchy of nested partitions (unless some further manual parameter tuning is applied).
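To make the notion of an intersection of partitions concrete, here is a minimal pure-Python sketch. The helper intersect_partitions is hypothetical and is not part of genieclust; it only shows that two points stay together in the intersection when they share a cluster in every input partition.

```python
def intersect_partitions(label_vectors):
    """Return labels of the intersection of several partitions.

    Each element of `label_vectors` is a sequence of cluster labels,
    one per point. Points receive the same output label only if their
    labels agree in every input partition.
    """
    ids = {}      # maps a tuple of input labels to a new cluster id
    labels = []
    for combo in zip(*label_vectors):
        if combo not in ids:
            ids[combo] = len(ids)
        labels.append(ids[combo])
    return labels


# Two partitions of six points (e.g. from two different thresholds).
a = [0, 0, 0, 1, 1, 1]
b = [0, 0, 1, 1, 1, 1]
print(intersect_partitions([a, b]))  # [0, 0, 1, 2, 2, 2]
```

The third point is split off because the two partitions disagree about it, so the intersection has at least as many clusters as any of its inputs.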
- Environment variables:
- OMP_NUM_THREADS
See genieclust.Genie.
References
[1] Cena A., Adaptive hierarchical clustering algorithms based on data aggregation methods, PhD Thesis, Systems Research Institute, Polish Academy of Sciences, 2018.
[2] Mueller A., Nowozin S., Lampert C.H., Information Theoretic Clustering using Minimum Spanning Trees, DAGM-OAGM, 2012.
[3] Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8-23. doi:10.1016/j.ins.2016.05.003.
[4] Gagolewski M., Cena A., Bartoszuk M., Brzozowski L., Clustering with minimum spanning trees: How good can it be?, Journal of Classification, 2024, in press, doi:10.1007/s00357-024-09483-1.
- Attributes:
- See class `genieclust.Genie`.
Methods
fit(X[, y]): Perform cluster analysis of a dataset.
fit_predict(X[, y]): Perform cluster analysis of a dataset and return the predicted labels.
get_metadata_routing(): Get metadata routing of this object.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
- fit(X, y=None)
Perform cluster analysis of a dataset.
- Parameters:
- X : object
See genieclust.Genie.fit.
- y : None
Ignored.
- Returns:
- self : genieclust.GIc
The object that the method was called on.
See also
genieclust.Genie.fit
genieclust.GIc.fit_predict
Notes
Refer to the labels_ and n_clusters_ attributes for the result.
Note that for an affinity of "precomputed", the n_features parameter must be set explicitly.
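A minimal usage sketch, assuming genieclust is installed and X is an array-like of shape (n_samples, n_features). The wrapper fit_gic is a hypothetical helper introduced for illustration; it relies only on the constructor arguments, the fit method, and the labels_ and n_clusters_ attributes mentioned on this page.

```python
def fit_gic(X, n_clusters=2, add_clusters=0):
    """Fit GIc on X and return (labels, number of clusters found)."""
    import genieclust  # third-party package; imported lazily on purpose

    g = genieclust.GIc(n_clusters=n_clusters, add_clusters=add_clusters)
    g.fit(X)  # fit returns the estimator itself
    # The result is stored on the fitted object.
    return g.labels_, g.n_clusters_
```

Calling fit_gic(X, n_clusters=2) should return one label per row of X. Using fit_predict instead of fit would return the labels directly.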