genieclust.Genie

class genieclust.Genie(n_clusters=2, *, gini_threshold=0.3, M=1, affinity='l2', exact=True, compute_full_tree=False, compute_all_cuts=False, postprocess='boundary', cast_float32=True, mlpack_enabled='auto', mlpack_leaf_size=1, nmslib_n_neighbors=64, nmslib_params_init={'method': 'hnsw'}, nmslib_params_index={'post': 2}, nmslib_params_query={}, verbose=False)

Genie++ hierarchical clustering algorithm
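For example (a minimal usage sketch on synthetic data; see the fit and fit_predict methods below):

    import numpy as np
    import genieclust

    # two well-separated point groups (synthetic data, for illustration only)
    rng = np.random.default_rng(123)
    X = np.vstack((
        rng.normal(0.0, 1.0, size=(100, 2)),
        rng.normal(10.0, 1.0, size=(100, 2)),
    ))

    g = genieclust.Genie(n_clusters=2)
    labels = g.fit_predict(X)   # integer labels: 0, ..., n_clusters_ - 1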

Parameters:
n_clusters : int

Number of clusters to detect.

If M > 1 and postprocess is not "all", n_clusters = 1 can act as a noise point/outlier detector. The approximate method (see parameter exact) can sometimes fail to detect the coarsest-grained partitions; in such a case, more clusters might be returned (with a warning).

gini_threshold : float

Threshold for the Genie correction in [0,1].

The Gini index is used to quantify the inequality of the cluster size distribution. Low thresholds highly penalise the formation of small clusters. A threshold of 1.0 disables the correction and, for M = 1, makes the method equivalent to the single linkage algorithm.

The algorithm tends to be stable with respect to small changes to the threshold; such perturbations usually do not affect the output clustering. Thresholds of 0.1, 0.3, 0.5, and 0.7 are usually worth a try.
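For example, a few candidate thresholds can be inspected as follows (a sketch; X denotes any data matrix and the comparison step is left open):

    import genieclust

    for gt in (0.1, 0.3, 0.5, 0.7):
        labels = genieclust.Genie(n_clusters=3, gini_threshold=gt).fit_predict(X)
        # ... compare or validate the resulting partitions here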

M : int

Smoothing factor for the mutual reachability distance [6].

M = 1 gives the original Genie algorithm [1] (with no noise point detection) with respect to the chosen affinity as-is. Note that for M > 1, we additionally need \(O(M n)\) working memory to store each point's nearest neighbours.

affinity : str

Metric used to compute the linkage.

  • For exact = True:

    One of: "l2" (synonym: "euclidean"), "l1" (synonym: "manhattan", "cityblock"), "cosinesimil" (synonym: "cosine"), or "precomputed".

    In the latter case, the X argument to the fit method must be a distance vector or a square-form distance matrix, see scipy.spatial.distance.pdist.

  • For exact = False:

    Any dissimilarity supported by nmslib, see [5] and https://github.com/nmslib/nmslib/blob/master/manual/spaces.md, for instance: "l2", "l2_sparse", "l1", "l1_sparse", "linf", "linf_sparse", "cosinesimil", "cosinesimil_sparse", "negdotprod", "negdotprod_sparse", "angulardist", "angulardist_sparse", "leven", "normleven", "jaccard_sparse", "bit_jaccard", "bit_hamming".

exact : bool

Whether to compute the minimum spanning tree exactly, or to estimate it based on an approximate near-neighbour graph.

The exact method has time complexity of \(O(d n^2)\) [2] (however, see mlpack_enabled) but only needs \(O(n)\) memory.

If exact is False, the minimum spanning tree is approximated based on an approximate \(k\)-nearest neighbours graph found by nmslib [5]. This is typically very fast but requires \(O(k n)\) memory.
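For instance, the approximate mode can be enabled as follows (a sketch; it requires the optional nmslib dependency and assumes X is a dense data matrix):

    import genieclust

    g = genieclust.Genie(n_clusters=10, exact=False, nmslib_n_neighbors=64)
    labels = g.fit_predict(X)   # relies on an approximate k-nearest-neighbour graph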

compute_full_tree : bool

Whether to determine the whole cluster hierarchy and the linkage matrix.

Only available if M = 1 and exact is True. This enables plotting dendrograms or cutting the hierarchy at an arbitrary level; see the children_, distances_, and counts_ attributes.
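For example, these attributes can be combined into a linkage matrix in the format expected by scipy.cluster.hierarchy (a sketch; X is assumed to be a data matrix, and plotting requires matplotlib):

    import numpy as np
    import scipy.cluster.hierarchy
    import genieclust

    g = genieclust.Genie(compute_full_tree=True).fit(X)
    # rows: [child1, child2, distance, cluster size];
    # see scipy.cluster.hierarchy.linkage for the format
    Z = np.column_stack((g.children_, g.distances_, g.counts_))
    scipy.cluster.hierarchy.dendrogram(Z)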

compute_all_cuts : bool

Whether to compute the requested n_clusters-partition and all the coarser-grained ones.

If True, then the labels_ attribute will be a matrix, see below.

postprocess : {"boundary", "none", "all"}

Controls the treatment of noise points after the clusters are identified.

This takes effect only if M > 1. Each leaf in the minimum spanning tree is treated as a noise point. We call it a boundary point if it is amongst its adjacent vertex's M - 1 nearest neighbours. By default, only boundary points are merged with their nearest core points.

To force a classical n_clusters-partition of a data set (with no notion of noise), choose "all". On the other hand, "none" leaves all noise points (including the boundary ones) as-is.
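To illustrate (a sketch; X is any data matrix):

    import genieclust

    # M > 1 enables noise point detection; by default ("boundary"),
    # only boundary points are merged back into their nearest clusters
    labels = genieclust.Genie(n_clusters=2, M=5, postprocess="none").fit_predict(X)
    print((labels == -1).sum(), "points left marked as noise")

    # a classical, complete 2-partition with no -1 labels:
    labels_all = genieclust.Genie(n_clusters=2, M=5, postprocess="all").fit_predict(X)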

cast_float32 : bool

Whether the input data should be cast to the float32 type.

If exact is True, this decreases the run-time roughly twofold, at the cost of greater memory use. Otherwise, note that nmslib requires float32 data anyway when dense or sparse numeric matrix inputs are used.

By setting cast_float32 to False, the user asserts that the inputs are already of an acceptable form.

mlpack_enabled : "auto" or bool

Whether mlpack.emst should be used for computing the Euclidean minimum spanning tree instead of the Jarník-Prim algorithm when exact is True.

This is often fast for spaces of very low dimensionality. As the name suggests, only the 'l2' affinity is supported (and M = 1). By default, we rely on mlpack if it is installed and n_features <= 6.

mlpack_leaf_size : int

Leaf size in the kd-tree when mlpack.emst is used.

According to the mlpack manual, leaves of size 1 give the best performance at the cost of greater memory use.

nmslib_n_neighbors : int

The number of approximate nearest neighbours used to estimate the minimum spanning tree when exact is False.

If the number of nearest neighbours is too small, the nearest neighbour graph might be disconnected and the number of obtained clusters might be greater than the requested one.

nmslib_n_neighbors must not be less than M - 1.

nmslib_params_init : dict

A dictionary of parameters to be passed to nmslib.init when exact is False.

See https://github.com/nmslib/nmslib/blob/master/manual/methods.md, https://github.com/nmslib/nmslib/blob/master/manual/spaces.md and https://nmslib.github.io/nmslib/ for more details. The space, data_type, and dtype parameters will be set based on the chosen affinity and the input X.

nmslib_params_index : dict

A dictionary of parameters to be passed to index.createIndex, where index is the object constructed with nmslib.init.

The indexThreadQty parameter will be set based on the OMP_NUM_THREADS environment variable.

nmslib_params_query : dict

A dictionary of parameters to be passed to index.setQueryTimeParams, where index is the object constructed with nmslib.init.

verbose : bool

Whether to print diagnostic messages and progress information on stderr.

Notes

Genie is a robust and outlier-resistant hierarchical clustering algorithm [1], originally published as an R package, genie. This new implementation is, amongst other improvements, much faster, and it now features optional smoothing and noise point detection (if M > 1).

Genie is based on a minimum spanning tree (MST) of the pairwise distance graph of a given point set. Just like single linkage, it consumes the edges of the MST in increasing order of weights. However, it prevents the formation of clusters of highly imbalanced sizes: once the Gini index of the cluster size distribution rises above an assumed threshold, a forced merge of the smallest point group is performed. Its appealing simplicity goes hand in hand with its usability; Genie often outperforms other clustering approaches on benchmark data.

The Genie algorithm itself has \(O(n \sqrt{n})\) time and \(O(n)\) memory complexity provided that a minimum spanning tree of the pairwise distance graph is given. Generally, our parallelised implementation of a Jarník (Prim/Dijkstra)-like method [2] will be called to compute an MST, which takes \(O(d n^2)\) time. However, mlpack [3] provides a very fast alternative in the case of Euclidean spaces of (very) low dimensionality and M = 1; see [4] and the mlpack_enabled parameter. Moreover, in the approximate method (exact = False), we apply the Kruskal algorithm on the near-neighbour graph determined by nmslib [5]. Although this only gives a spanning forest, such a data structure turns out to be very suitable for our clustering task (note that the set of connected components determines the top level of the identified cluster hierarchy).

Note that the mlpack and nmslib dependencies are optional. Make sure they are installed on your system.

The Genie correction together with the smoothing factor M > 1 gives a robustified version of the HDBSCAN* [6] algorithm that, contrary to its predecessor, is able to detect a predefined number of clusters. Hence, it is independent of DBSCAN's somewhat magical eps parameter and HDBSCAN's min_cluster_size one. If M > 1, then the minimum spanning tree is computed with respect to the mutual reachability distance (based, e.g., on the Euclidean metric). Formally, the distance \(m(i,j)\) is used instead of the chosen "raw" distance, \(d(i,j)\). It holds \(m(i,j)=\max(d(i,j), c(i), c(j))\), where the "core" distance \(c(i)\) is given by \(d(i,k)\) with \(k\) being the (M - 1)-th nearest neighbour of \(i\). This causes "noise" and "boundary" points to be "pulled away" from each other; note, however, that M = 2 corresponds to the original distance. During the clustering procedure, the leaves of the MST do not take part in the merging process. They may be merged with the nearest clusters during the postprocessing stage, or left marked as "noise" observations.
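As an illustration, the mutual reachability distance can be computed directly from its definition (a sketch for dense Euclidean data; this is not necessarily how the package computes it internally):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def mutual_reachability(X, M):
        D = squareform(pdist(X))          # raw distances d(i, j)
        # core distance c(i): distance to the (M-1)-th nearest neighbour;
        # index 0 in each sorted row is the point itself (distance 0)
        c = np.sort(D, axis=1)[:, M - 1]
        # m(i, j) = max(d(i, j), c(i), c(j))
        R = np.maximum(D, np.maximum(c[:, None], c[None, :]))
        np.fill_diagonal(R, 0.0)          # keep self-distances at zero
        return R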

Environment variables:
OMP_NUM_THREADS

Controls the number of threads used for computing the minimum spanning tree (not supported by mlpack).
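For example (a sketch; the variable must be set before the OpenMP runtime is initialised, hence preferably before genieclust is imported):

    import os
    os.environ["OMP_NUM_THREADS"] = "4"   # use at most 4 threads

    import genieclust   # imported afterwards, so the setting takes effect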

References

[1]

Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8-23. doi:10.1016/j.ins.2016.05.003.

[2]

Olson C.F., Parallel algorithms for hierarchical clustering, Parallel Computing 21(8), 1995, 1313-1325. doi:10.1016/0167-8191(95)00017-I.

[3]

Curtin R.R., Edel M., Lozhnikov M., Mentekidis Y., Ghaisas S., Zhang S., mlpack 3: A fast, flexible machine learning library, Journal of Open Source Software 3(26), 726, 2018. doi:10.21105/joss.00726.

[4]

March W.B., Ram P., Gray A.G., Fast Euclidean Minimum Spanning Tree: Algorithm, Analysis, and Applications, Proc. ACM SIGKDD’10, 2010, 603-611.

[5]

Naidan B., Boytsov L., Malkov Y., Novak D., Non-metric space library (NMSLIB) manual, version 2.0, 2019. https://github.com/nmslib/nmslib/blob/master/manual/latex/manual.pdf.

[6]

Campello R.J.G.B., Moulavi D., Sander J., Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160-172, doi:10.1007/978-3-642-37456-2_14.

[7]

Gagolewski M., Cena A., Bartoszuk M., Brzozowski L., Clustering with minimum spanning trees: How good can it be?, 2023, under review (preprint), doi:10.48550/arXiv.2303.05679.

Attributes:
labels_ : ndarray

Detected cluster labels.

If compute_all_cuts is False (the default), this is an integer vector such that labels_[i] gives the cluster ID (between 0 and n_clusters_ - 1) of the i-th object. If M > 1, noise points are labelled -1 (unless taken care of in the postprocessing stage).

Otherwise, i.e., if compute_all_cuts is True, all partitions of cardinality down to n_clusters are determined; labels_[j,i] denotes the cluster ID of the i-th point in the j-partition. We assume that both the 0- and the 1-partition distinguish only between noise and non-noise points; however, no postprocessing is conducted on the 0-partition (there might be points labelled -1 even if postprocess is "all").

Note that the approximate method (exact = False) might fail to determine the coarsest-grained partitions (if the approximate neighbour graph is disconnected); the actual number of clusters detected can then be larger than requested.
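For example (a sketch; X is any data matrix):

    import genieclust

    g = genieclust.Genie(n_clusters=5, compute_all_cuts=True).fit(X)
    labels3 = g.labels_[3, :]   # the 3-partition; row j gives the j-partition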

n_clusters_ : int

The actual number of clusters detected by the algorithm.

As we argued above, the approximate method might sometimes yield a more fine-grained partition than the requested one (with a warning). Moreover, the dataset might simply contain too many noise points for the requested partition to be determined.

n_samples_ : int

The number of points in the fitted dataset.

n_features_ : int

The number of features in the fitted dataset.

If this information is not available, it is set to -1.

children_ : None or ndarray

If compute_full_tree is True, this is a matrix whose i-th row provides the information on the clusters merged in the i-th iteration. See the description of Z[:,0] and Z[:,1] in scipy.cluster.hierarchy.linkage. Together with distances_ and counts_, this constitutes the linkage matrix that can be used for plotting the dendrogram.

distances_ : None or ndarray

If compute_full_tree is True, this is a vector that gives the distance between two clusters merged in each iteration, see the description of Z[:,2] in scipy.cluster.hierarchy.linkage.

As the original Genie algorithm does not guarantee that the distances are ordered increasingly (there are other hierarchical clustering linkages that violate the ultrametricity property as well), these are corrected by applying distances_ = genieclust.tools.cummin(distances_[::-1])[::-1].

counts_ : None or ndarray

If compute_full_tree is True, this is a vector giving the number of elements in a cluster created in each iteration. See the description of Z[:,3] in scipy.cluster.hierarchy.linkage.

Methods

fit(X[, y])

Perform cluster analysis of a dataset.

fit_predict(X[, y])

Perform cluster analysis of a dataset and return the predicted labels.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

fit(X, y=None)

Perform cluster analysis of a dataset.

Parameters:
X : object

Typically a matrix with n_samples rows and n_features columns, see below for more details and options.

y : None

Ignored.

Returns:
self : genieclust.Genie

The object that the method was called on.

Notes

Refer to the labels_ and n_clusters_ attributes for the result.

Acceptable X types depend on whether we use the exact or the approximate method.

  • X when exact = True.

    For affinity of "precomputed", X should either be a distance vector of length n_samples*(n_samples-1)/2 (see scipy.spatial.distance.pdist) or a square distance matrix of shape (n_samples, n_samples) (see scipy.spatial.distance.squareform).

Otherwise, X should be a real-valued matrix (a dense numpy.ndarray, or an object coercible to one) with n_samples rows and n_features columns.

In the latter case, it might be a good idea to standardise or at least somehow preprocess the coordinates of the input data points by calling, for instance, X = (X-X.mean(axis=0))/X.std(axis=None, ddof=1), so that the dataset is centred at 0 and has total variance of 1. This way, the method becomes translation and scale invariant. What's more, if the data are recorded with small precision (say, up to a few decimal digits), adding a tiny bit of Gaussian noise will ensure that the solution is unique (note that this generally applies to other distance-based clustering algorithms as well). See the example after this list.

  • X when exact = False.

The approximate method relies on nmslib for locating the nearest neighbours. Therefore, it supports all data types described in https://github.com/nmslib/nmslib/blob/master/manual/spaces.md. Depending on the chosen affinity, X may hence be a real-valued numpy.ndarray matrix with n_samples rows and n_features columns, a scipy.sparse.csr_matrix object, or an array of ASCII strings.
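For example, both input styles for the exact method can be exercised as follows (a sketch):

    import numpy as np
    import scipy.spatial.distance
    import genieclust

    X = np.random.default_rng(42).normal(size=(100, 3))

    # a precomputed condensed distance vector of length n*(n-1)/2:
    D = scipy.spatial.distance.pdist(X)
    labels = genieclust.Genie(n_clusters=2, affinity="precomputed").fit_predict(D)

    # or the raw matrix, standardised as suggested above:
    X_std = (X - X.mean(axis=0)) / X.std(axis=None, ddof=1)
    labels2 = genieclust.Genie(n_clusters=2).fit_predict(X_std)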
