genieclust.GIc

class genieclust.GIc(n_clusters=2, *, gini_thresholds=[0.1, 0.3, 0.5, 0.7], M=1, affinity='l2', exact=True, compute_full_tree=False, compute_all_cuts=False, postprocess='boundary', cast_float32=True, mlpack_enabled='auto', mlpack_leaf_size=1, nmslib_n_neighbors=64, nmslib_params_init={'method': 'hnsw'}, nmslib_params_index={'post': 2}, nmslib_params_query={}, add_clusters=0, n_features=None, verbose=False)

GIc (Genie+Information Criterion) clustering algorithm

Parameters:
n_clusters : int

See genieclust.Genie.

gini_thresholds : array_like

A list of Gini index thresholds between 0 and 1.

The GIc algorithm optimises the information criterion in an agglomerative way, starting from the intersection of the clusterings returned by Genie(n_clusters=n_clusters+add_clusters, gini_threshold=gini_thresholds[i]), for all i from 0 to len(gini_thresholds)-1.
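The intersection of several clusterings can be illustrated with a minimal NumPy sketch. It is not the library's internal code: the helper name `intersect_partitions` and the three label vectors are hypothetical stand-ins for the outputs of Genie runs at different thresholds. Two points end up in the same initial cluster only if every input clustering puts them together.

```python
import numpy as np


def intersect_partitions(label_vectors):
    """Return the common refinement of several partitions of the same points.

    Hypothetical helper for illustration only: two points share a cluster in
    the result only if they share a cluster in every input partition.
    """
    # One row per point, one column per input partition.
    stacked = np.column_stack(label_vectors)
    # Each distinct row (tuple of labels) defines one cluster of the refinement.
    _, refined = np.unique(stacked, axis=0, return_inverse=True)
    return refined.ravel()


# Made-up label vectors standing in for Genie runs at different thresholds.
labels_a = np.array([0, 0, 0, 1, 1, 1])
labels_b = np.array([0, 0, 1, 1, 1, 1])
labels_c = np.array([0, 0, 0, 0, 1, 1])

refined = intersect_partitions([labels_a, labels_b, labels_c])
print(refined)  # [0 0 1 2 3 3]
```

The intersection typically has more clusters than any single input, which is why the algorithm then merges clusters until n_clusters remain.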

M : int

See genieclust.Genie.

affinity : str

See genieclust.Genie.

exact : bool, default=True

See genieclust.Genie.

compute_full_tree : bool

See genieclust.Genie.

compute_all_cuts : bool

See genieclust.Genie.

Note that if compute_all_cuts is True, then the i-th cut in the hierarchy behaves as if add_clusters were equal to n_clusters-i. In other words, the returned cuts might differ from those obtained by multiple calls to GIc, each with a different n_clusters and the same add_clusters.

postprocess : {"boundary", "none", "all"}

See genieclust.Genie.

cast_float32 : bool

See genieclust.Genie.

mlpack_enabled : "auto" or bool

See genieclust.Genie.

mlpack_leaf_size : int

See genieclust.Genie.

nmslib_n_neighbors : int

See genieclust.Genie.

nmslib_params_init : dict

See genieclust.Genie.

nmslib_params_index : dict

See genieclust.Genie.

nmslib_params_query : dict

See genieclust.Genie.

add_clusters : int

Number of additional clusters to work with internally.

n_features : float or None

Dataset’s (intrinsic) dimensionality.

If None, it is inferred from the shape of the input matrix. However, when affinity is "precomputed", it must be set manually.

verbose : bool

See genieclust.Genie.

See also

genieclust.Genie

Notes

GIc (Genie+Information Criterion) is an information-theoretic clustering algorithm. It was proposed by Anna Cena in [1] and was inspired by the ITM algorithm of Mueller et al. [2] and the Genie algorithm of Gagolewski et al. [3]; see also [4].

GIc computes an n_clusters-partition based on a pre-computed minimum spanning tree. Clusters are merged so as to maximise (heuristically) the information criterion discussed in [2].

GIc uses a bottom-up, agglomerative approach (as opposed to the ITM, which follows a divisive scheme). At each step, it greedily merges the pair of clusters that maximises the information criterion [2]. By default, the initial partition is the intersection of the partitions found by multiple runs of the Genie++ method with thresholds [0.1, 0.3, 0.5, 0.7], which we observe to be a sensible choice for most clustering tasks. Hence, contrary to the Genie method, GIc can be regarded as virtually parameter-less. However, when run with different values of the n_clusters parameter, it does not yield a hierarchy of nested partitions (unless some additional manual parameter tuning is applied).
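The greedy bottom-up scheme can be sketched generically as follows. This is an illustration under stated assumptions, not the library's implementation: `greedy_merge` and `neg_within_ss` are hypothetical names, the objective is a simple stand-in (negative within-cluster sum of squares) rather than the information criterion of [2], and every pair of clusters is treated as a merge candidate, with no reference to the minimum spanning tree.

```python
import itertools

import numpy as np


def neg_within_ss(X, labels):
    """Stand-in objective (NOT the criterion of [2]):
    negative within-cluster sum of squares, so larger is better."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return -total


def greedy_merge(X, labels, n_clusters, objective):
    """Merge clusters pairwise, each time choosing the merge that yields
    the highest objective value, until n_clusters remain."""
    labels = np.asarray(labels).copy()
    while len(np.unique(labels)) > n_clusters:
        best_score, best_pair = -np.inf, None
        for a, b in itertools.combinations(np.unique(labels), 2):
            trial = np.where(labels == b, a, labels)  # tentatively merge b into a
            score = objective(X, trial)
            if score > best_score:
                best_score, best_pair = score, (a, b)
        a, b = best_pair
        labels[labels == b] = a  # commit the best merge
    # Relabel clusters as 0, 1, ..., n_clusters-1.
    _, labels = np.unique(labels, return_inverse=True)
    return labels


# Toy data: two tight groups; the initial partition is finer than needed.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
initial = np.array([0, 0, 1, 2, 3, 3])

final = greedy_merge(X, initial, n_clusters=2, objective=neg_within_ss)
print(final)  # [0 0 0 1 1 1]
```

Because each merge is chosen greedily for the requested n_clusters, runs with different n_clusters need not produce nested partitions, which matches the caveat above.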

Environment variables:
OMP_NUM_THREADS

See genieclust.Genie.

References

[1]

Cena A., Adaptive hierarchical clustering algorithms based on data aggregation methods, PhD Thesis, Systems Research Institute, Polish Academy of Sciences, 2018.

[2]

Mueller A., Nowozin S., Lampert C.H., Information Theoretic Clustering using Minimum Spanning Trees, DAGM-OAGM, 2012.

[3]

Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8-23. doi:10.1016/j.ins.2016.05.003.

[4]

Gagolewski M., Cena A., Bartoszuk M., Brzozowski L., Clustering with minimum spanning trees: How good can it be?, 2023, under review (preprint), doi:10.48550/arXiv.2303.05679.

Attributes:
See class `genieclust.Genie`.

Methods

fit(X[, y])

Perform cluster analysis of a dataset.

fit_predict(X[, y])

Perform cluster analysis of a dataset and return the predicted labels.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

fit(X, y=None)

Perform cluster analysis of a dataset.

Parameters:
X : object

See genieclust.Genie.fit.

y : None

Ignored.

Returns:
self : genieclust.GIc

The object that the method was called on.

See also

genieclust.Genie.fit
genieclust.GIc.fit_predict

Notes

Refer to the labels_ and n_clusters_ attributes for the result.

Note that for affinity of "precomputed", the n_features parameter must be set explicitly.