genieclust.cluster_validity

Internal cluster validity indices

The greater the index value, the more valid (whatever that means) the assessed partition. For consistency, the Ball-Hall, Davies-Bouldin, and WCSS indices are negated, so they take non-positive values.

These measures were critically reviewed in (Gagolewski, Bartoszuk, Cena, 2021; https://doi.org/10.1016/j.ins.2021.10.004; preprint); see Section 2 therein for the respective definitions.

For even more details, see the Framework for Benchmarking Clustering Algorithms.
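The functions documented here take X as a C-contiguous array and y as integer labels in 0, ..., K-1. The following is a minimal NumPy sketch for bringing arbitrary inputs into that form. The helper name `prepare_inputs` is hypothetical (it is not part of genieclust), and converting to float64 is an assumption about what the functions accept.

```python
import numpy as np

def prepare_inputs(X, labels):
    """Hypothetical helper, not part of genieclust."""
    # Row-major (C-contiguous) layout; float64 is an assumption
    X = np.ascontiguousarray(X, dtype=float)
    # Map arbitrary labels onto consecutive integers 0, ..., K-1
    uniq, y = np.unique(np.asarray(labels).ravel(), return_inverse=True)
    return X, y.ravel(), len(uniq)

# A column-major array and non-consecutive labels as toy input
X_raw = np.asfortranarray([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
X, y, K = prepare_inputs(X_raw, [10, 10, 42, 7])
```

Labels are renumbered in sorted order of their original values, so the mapping is deterministic.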

genieclust.cluster_validity.calinski_harabasz_index(X, y)

Computes the value of the Caliński-Harabasz index [3].

See [1] and [2] for the definition and discussion.

Parameters:
X : c_contiguous ndarray, shape (n, d)

n data points in a feature space of dimensionality d

y : array_like

A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.

Returns:
index : float

Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.

See also

genieclust.cluster_validity.calinski_harabasz_index

The Caliński-Harabasz index

genieclust.cluster_validity.dunnowa_index

Generalised Dunn indices based on near-neighbours and OWA operators (by Gagolewski)

genieclust.cluster_validity.generalised_dunn_index

Generalised Dunn indices (by Bezdek and Pal)

genieclust.cluster_validity.negated_ball_hall_index

The Ball-Hall index (negated)

genieclust.cluster_validity.negated_davies_bouldin_index

The Davies-Bouldin index (negated)

genieclust.cluster_validity.negated_wcss_index

Within-cluster sum of squares (used as the objective function in the k-means and Ward algorithm) (negated)

genieclust.cluster_validity.silhouette_index

The Silhouette index (average silhouette score)

genieclust.cluster_validity.silhouette_w_index

The Silhouette W index (mean of the cluster average silhouette widths)

genieclust.cluster_validity.wcnn_index

The within-cluster near-neighbours index

References

[1]

Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.

[2]

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).

[3]

Caliński T., Harabasz J., A dendrite method for cluster analysis, Communications in Statistics 3(1), 1974, 1–27, https://doi.org/10.1080/03610927408827101.
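For orientation, here is a minimal NumPy sketch of the textbook Caliński-Harabasz formula: the between-cluster sum of squares divided by K-1, over the within-cluster sum of squares divided by n-K. It is not the library's implementation. The function name is hypothetical, and the sketch assumes K >= 2, no empty clusters, and a non-zero within-cluster sum of squares.

```python
import numpy as np

def calinski_harabasz_sketch(X, y):
    """Textbook formula; hypothetical name, not the genieclust code."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    K = int(y.max()) + 1
    overall = X.mean(axis=0)
    bcss = 0.0  # between-cluster sum of squares
    wcss = 0.0  # within-cluster sum of squares
    for k in range(K):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        bcss += Xk.shape[0] * np.sum((mu - overall) ** 2)
        wcss += np.sum((Xk - mu) ** 2)
    return float((bcss / (K - 1)) / (wcss / (n - K)))

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
ch_good = calinski_harabasz_sketch(X, [0, 0, 1, 1])  # two compact groups
ch_bad = calinski_harabasz_sketch(X, [0, 1, 0, 1])   # groups split across
```

The labelling that keeps nearby points together scores higher than the one that mixes them.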

genieclust.cluster_validity.dunnowa_index(X, y, M=25, owa_numerator='SMin:5', owa_denominator='Const')

Computes the generalised Dunn indices based on near-neighbours and OWA operators [2].

See [1] and [2] for the definition and discussion.

Parameters:
X : c_contiguous ndarray, shape (n, d)

n data points in a feature space of dimensionality d

y : array_like

A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.

M : int

number of nearest neighbours

owa_numerator, owa_denominator : str

specifies the OWA operators to use in the definition of the DuNN index; one of: "Mean", "Min", "Max", "Const", "SMin:D", "SMax:D", where D is an integer defining the degree of smoothness

Returns:
index : float

Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.

See also

genieclust.cluster_validity.calinski_harabasz_index

The Caliński-Harabasz index

genieclust.cluster_validity.dunnowa_index

Generalised Dunn indices based on near-neighbours and OWA operators (by Gagolewski)

genieclust.cluster_validity.generalised_dunn_index

Generalised Dunn indices (by Bezdek and Pal)

genieclust.cluster_validity.negated_ball_hall_index

The Ball-Hall index (negated)

genieclust.cluster_validity.negated_davies_bouldin_index

The Davies-Bouldin index (negated)

genieclust.cluster_validity.negated_wcss_index

Within-cluster sum of squares (used as the objective function in the k-means and Ward algorithm) (negated)

genieclust.cluster_validity.silhouette_index

The Silhouette index (average silhouette score)

genieclust.cluster_validity.silhouette_w_index

The Silhouette W index (mean of the cluster average silhouette widths)

genieclust.cluster_validity.wcnn_index

The within-cluster near-neighbours index

References

[1]

Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.

[2]

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).
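As background for the owa_numerator and owa_denominator options, an ordered weighted averaging (OWA) operator is a weighted sum of its arguments after sorting them. The sketch below covers only "Mean", "Min", and "Max". The weights for "Const", "SMin:D", and "SMax:D" are defined in [2] and are not reproduced here. Both function names are hypothetical and this is not how the library computes the index.

```python
import numpy as np

def owa(values, weights):
    """Weighted sum of the values sorted in nonincreasing order."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]
    w = np.asarray(weights, dtype=float)
    return float(np.dot(w, v))

def owa_weights(kind, m):
    """Weight vectors for three operators (hypothetical helper)."""
    w = np.zeros(m)
    if kind == "Mean":
        w[:] = 1.0 / m   # equal weight on every position
    elif kind == "Max":
        w[0] = 1.0       # all weight on the largest value
    elif kind == "Min":
        w[-1] = 1.0      # all weight on the smallest value
    else:
        raise ValueError("only Mean, Min and Max are sketched here")
    return w

x = [3.0, 1.0, 2.0]
results = {k: owa(x, owa_weights(k, len(x))) for k in ("Mean", "Min", "Max")}
```

The three operators differ only in where the weight vector places its mass along the sorted values.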

genieclust.cluster_validity.generalised_dunn_index(X, y, lowercase_d=1, uppercase_d=2)

Computes the generalised Dunn indices (by Bezdek and Pal) [3].

See [1] and [2] for the definition and discussion.

Parameters:
X : c_contiguous ndarray, shape (n, d)

n data points in a feature space of dimensionality d

y : array_like

A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.

lowercase_d : int

an integer between 1 and 5, denoting \(d_1\), …, \(d_5\) in the definition of the generalised Dunn index (numerator: min, max, and mean pairwise intercluster distance, distance between cluster centroids, weighted point-centroid distance, respectively)

uppercase_d : int

an integer between 1 and 3, denoting \(D_1\), …, \(D_3\) in the definition of the generalised Dunn index (denominator: max and min pairwise intracluster distance, average point-centroid distance, respectively)

Returns:
index : float

Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.

See also

genieclust.cluster_validity.calinski_harabasz_index

The Caliński-Harabasz index

genieclust.cluster_validity.dunnowa_index

Generalised Dunn indices based on near-neighbours and OWA operators (by Gagolewski)

genieclust.cluster_validity.generalised_dunn_index

Generalised Dunn indices (by Bezdek and Pal)

genieclust.cluster_validity.negated_ball_hall_index

The Ball-Hall index (negated)

genieclust.cluster_validity.negated_davies_bouldin_index

The Davies-Bouldin index (negated)

genieclust.cluster_validity.negated_wcss_index

Within-cluster sum of squares (used as the objective function in the k-means and Ward algorithm) (negated)

genieclust.cluster_validity.silhouette_index

The Silhouette index (average silhouette score)

genieclust.cluster_validity.silhouette_w_index

The Silhouette W index (mean of the cluster average silhouette widths)

genieclust.cluster_validity.wcnn_index

The within-cluster near-neighbours index

References

[1]

Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.

[2]

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).

[3]

Bezdek J., Pal N., Some new indexes of cluster validity, IEEE Transactions on Systems, Man, and Cybernetics, Part B 28, 1998, 301–315, https://doi.org/10.1109/3477.678624.
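To make the ratio structure concrete, here is a minimal NumPy sketch of the classical Dunn index: the smallest distance between points of different clusters divided by the largest distance between points of the same cluster. This is assumed to correspond to lowercase_d=1 with uppercase_d=1, which is not the default setting. The function name is hypothetical and the code is not the library's implementation. It assumes K >= 2 and at least one cluster with two distinct points.

```python
import numpy as np

def dunn_sketch(X, y):
    """Classical Dunn index; hypothetical name, O(n^2) memory."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Full matrix of pairwise Euclidean distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    same = y[:, None] == y[None, :]
    separation = D[~same].min()  # closest pair from different clusters
    diameter = D[same].max()     # farthest pair within one cluster
    return float(separation / diameter)

X = np.array([[0.0], [1.0], [10.0], [11.0]])
dunn_good = dunn_sketch(X, [0, 0, 1, 1])
dunn_bad = dunn_sketch(X, [0, 1, 0, 1])
```

Other choices of the numerator and denominator replace `separation` and `diameter` with different between-cluster and within-cluster distances.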

genieclust.cluster_validity.negated_ball_hall_index(X, y)

Computes the value of the negated Ball-Hall index [3].

See [1] and [2] for the definition and discussion.

Parameters:
X : c_contiguous ndarray, shape (n, d)

n data points in a feature space of dimensionality d

y : array_like

A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.

Returns:
index : float

Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.

See also

genieclust.cluster_validity.calinski_harabasz_index

The Caliński-Harabasz index

genieclust.cluster_validity.dunnowa_index

Generalised Dunn indices based on near-neighbours and OWA operators (by Gagolewski)

genieclust.cluster_validity.generalised_dunn_index

Generalised Dunn indices (by Bezdek and Pal)

genieclust.cluster_validity.negated_ball_hall_index

The Ball-Hall index (negated)

genieclust.cluster_validity.negated_davies_bouldin_index

The Davies-Bouldin index (negated)

genieclust.cluster_validity.negated_wcss_index

Within-cluster sum of squares (used as the objective function in the k-means and Ward algorithm) (negated)

genieclust.cluster_validity.silhouette_index

The Silhouette index (average silhouette score)

genieclust.cluster_validity.silhouette_w_index

The Silhouette W index (mean of the cluster average silhouette widths)

genieclust.cluster_validity.wcnn_index

The within-cluster near-neighbours index

References

[1]

Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.

[2]

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).

[3]

Ball G.H., Hall D.J., ISODATA: A novel method of data analysis and pattern classification, Technical report No. AD699616, Stanford Research Institute, 1965.
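One common textbook formulation of the Ball-Hall index averages, over clusters, the mean squared distance of each cluster's points to its centroid. The sketch below negates that quantity. The normalisation used by the library may differ (see [2] for the definition it follows), so treat this as an illustration with a hypothetical function name rather than a reference implementation.

```python
import numpy as np

def negated_ball_hall_sketch(X, y):
    """One textbook variant; the library's normalisation may differ."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    K = int(y.max()) + 1
    per_cluster = []
    for k in range(K):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        # Mean squared Euclidean distance to the cluster centroid
        per_cluster.append(np.mean(np.sum((Xk - mu) ** 2, axis=1)))
    return -float(np.mean(per_cluster))

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
bh_good = negated_ball_hall_sketch(X, [0, 0, 1, 1])
bh_bad = negated_ball_hall_sketch(X, [0, 1, 0, 1])
```

After negation, the more compact labelling gets the greater (less negative) value.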

genieclust.cluster_validity.negated_davies_bouldin_index(X, y)

Computes the value of the negated Davies-Bouldin index [3].

See [1] and [2] for the definition and discussion.

Parameters:
X : c_contiguous ndarray, shape (n, d)

n data points in a feature space of dimensionality d

y : array_like

A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.

Returns:
index : float

Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.

See also

genieclust.cluster_validity.calinski_harabasz_index

The Caliński-Harabasz index

genieclust.cluster_validity.dunnowa_index

Generalised Dunn indices based on near-neighbours and OWA operators (by Gagolewski)

genieclust.cluster_validity.generalised_dunn_index

Generalised Dunn indices (by Bezdek and Pal)

genieclust.cluster_validity.negated_ball_hall_index

The Ball-Hall index (negated)

genieclust.cluster_validity.negated_davies_bouldin_index

The Davies-Bouldin index (negated)

genieclust.cluster_validity.negated_wcss_index

Within-cluster sum of squares (used as the objective function in the k-means and Ward algorithm) (negated)

genieclust.cluster_validity.silhouette_index

The Silhouette index (average silhouette score)

genieclust.cluster_validity.silhouette_w_index

The Silhouette W index (mean of the cluster average silhouette widths)

genieclust.cluster_validity.wcnn_index

The within-cluster near-neighbours index

References

[1]

Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.

[2]

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).

[3]

Davies D.L., Bouldin D.W., A Cluster Separation Measure, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1 (2), 1979, 224–227, https://doi.org/10.1109/TPAMI.1979.4766909.
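A minimal NumPy sketch of the usual Davies-Bouldin formula, negated so that greater is better: for each cluster, take the worst ratio of summed within-cluster scatters to the distance between centroids, then average over clusters. The function name is hypothetical and the code is not the library's implementation. It assumes K >= 2, no empty clusters, and distinct centroids.

```python
import numpy as np

def negated_davies_bouldin_sketch(X, y):
    """Usual Davies-Bouldin formula, negated; hypothetical name."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    K = int(y.max()) + 1
    mu = np.vstack([X[y == k].mean(axis=0) for k in range(K)])
    # s[k]: average distance from the points of cluster k to its centroid
    s = np.array([np.linalg.norm(X[y == k] - mu[k], axis=1).mean()
                  for k in range(K)])
    # d[i, j]: distance between the centroids of clusters i and j
    d = np.linalg.norm(mu[:, None, :] - mu[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # excludes i == j from the maximum
    R = (s[:, None] + s[None, :]) / d
    return -float(R.max(axis=1).mean())

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
db_good = negated_davies_bouldin_sketch(X, [0, 0, 1, 1])
db_bad = negated_davies_bouldin_sketch(X, [0, 1, 0, 1])
```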

genieclust.cluster_validity.negated_wcss_index(X, y)

Computes the value of the negated within-cluster sum of squares (used as the objective function in the k-means and Ward algorithms).

See [1] and [2] for the definition and discussion.

Parameters:
X : c_contiguous ndarray, shape (n, d)

n data points in a feature space of dimensionality d

y : array_like

A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.

Returns:
index : float

Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.

See also

genieclust.cluster_validity.calinski_harabasz_index

The Caliński-Harabasz index

genieclust.cluster_validity.dunnowa_index

Generalised Dunn indices based on near-neighbours and OWA operators (by Gagolewski)

genieclust.cluster_validity.generalised_dunn_index

Generalised Dunn indices (by Bezdek and Pal)

genieclust.cluster_validity.negated_ball_hall_index

The Ball-Hall index (negated)

genieclust.cluster_validity.negated_davies_bouldin_index

The Davies-Bouldin index (negated)

genieclust.cluster_validity.negated_wcss_index

Within-cluster sum of squares (used as the objective function in the k-means and Ward algorithm) (negated)

genieclust.cluster_validity.silhouette_index

The Silhouette index (average silhouette score)

genieclust.cluster_validity.silhouette_w_index

The Silhouette W index (mean of the cluster average silhouette widths)

genieclust.cluster_validity.wcnn_index

The within-cluster near-neighbours index

References

[1]

Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.

[2]

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).
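The quantity can be illustrated with a short NumPy sketch: sum the squared Euclidean distances from each point to the centroid of its own cluster, then negate. Whether the library additionally rescales the sum (for example, by n) is not stated here, so the unnormalised form below is an assumption, and the function name is hypothetical.

```python
import numpy as np

def negated_wcss_sketch(X, y):
    """Unnormalised within-cluster sum of squares, negated (assumption)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    K = int(y.max()) + 1
    centroids = np.vstack([X[y == k].mean(axis=0) for k in range(K)])
    # centroids[y] pairs every point with the centroid of its cluster
    return -float(np.sum((X - centroids[y]) ** 2))

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
wcss_good = negated_wcss_sketch(X, [0, 0, 1, 1])
wcss_bad = negated_wcss_sketch(X, [0, 1, 0, 1])
```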

genieclust.cluster_validity.silhouette_index(X, y)

Computes the value of the Silhouette index (average silhouette score) [3].

See [1] and [2] for the definition and discussion.

Parameters:
X : c_contiguous ndarray, shape (n, d)

n data points in a feature space of dimensionality d

y : array_like

A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.

Returns:
index : float

Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.

See also

genieclust.cluster_validity.calinski_harabasz_index

The Caliński-Harabasz index

genieclust.cluster_validity.dunnowa_index

Generalised Dunn indices based on near-neighbours and OWA operators (by Gagolewski)

genieclust.cluster_validity.generalised_dunn_index

Generalised Dunn indices (by Bezdek and Pal)

genieclust.cluster_validity.negated_ball_hall_index

The Ball-Hall index (negated)

genieclust.cluster_validity.negated_davies_bouldin_index

The Davies-Bouldin index (negated)

genieclust.cluster_validity.negated_wcss_index

Within-cluster sum of squares (used as the objective function in the k-means and Ward algorithm) (negated)

genieclust.cluster_validity.silhouette_index

The Silhouette index (average silhouette score)

genieclust.cluster_validity.silhouette_w_index

The Silhouette W index (mean of the cluster average silhouette widths)

genieclust.cluster_validity.wcnn_index

The within-cluster near-neighbours index

References

[1]

Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.

[2]

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).

[3]

Rousseeuw P.J., Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis, Journal of Computational and Applied Mathematics 20, 1987, 53–65, https://doi.org/10.1016/0377-0427(87)90125-7.
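A minimal NumPy sketch of the average silhouette score as defined by Rousseeuw [3]: for each point, a is the mean distance to the other points of its cluster, b is the smallest mean distance to the points of any other cluster, and the width is (b - a) / max(a, b). The function name is hypothetical and the code is not the library's implementation. It assumes K >= 2 and no empty clusters, and it sets the width of points in singleton clusters to 0.

```python
import numpy as np

def silhouette_sketch(X, y):
    """Average silhouette width; hypothetical name, O(n^2) memory."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    K = int(y.max()) + 1
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    sizes = np.bincount(y, minlength=K)
    # sums[i, k]: total distance from point i to all points of cluster k
    sums = np.stack([D[:, y == k].sum(axis=1) for k in range(K)], axis=1)
    s = np.zeros(n)
    for i in range(n):
        own = y[i]
        if sizes[own] == 1:
            continue  # singleton cluster: width stays 0
        a = sums[i, own] / (sizes[own] - 1)
        b = min(sums[i, k] / sizes[k] for k in range(K) if k != own)
        s[i] = (b - a) / max(a, b)
    return float(s.mean())

X = np.array([[0.0], [1.0], [10.0], [11.0]])
sil = silhouette_sketch(X, [0, 0, 1, 1])
```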

genieclust.cluster_validity.silhouette_w_index(X, y)

Computes the value of the Silhouette W index (mean of the cluster average silhouette widths) [3].

See [1] and [2] for the definition and discussion.

Parameters:
X : c_contiguous ndarray, shape (n, d)

n data points in a feature space of dimensionality d

y : array_like

A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.

Returns:
index : float

Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.

See also

genieclust.cluster_validity.calinski_harabasz_index

The Caliński-Harabasz index

genieclust.cluster_validity.dunnowa_index

Generalised Dunn indices based on near-neighbours and OWA operators (by Gagolewski)

genieclust.cluster_validity.generalised_dunn_index

Generalised Dunn indices (by Bezdek and Pal)

genieclust.cluster_validity.negated_ball_hall_index

The Ball-Hall index (negated)

genieclust.cluster_validity.negated_davies_bouldin_index

The Davies-Bouldin index (negated)

genieclust.cluster_validity.negated_wcss_index

Within-cluster sum of squares (used as the objective function in the k-means and Ward algorithm) (negated)

genieclust.cluster_validity.silhouette_index

The Silhouette index (average silhouette score)

genieclust.cluster_validity.silhouette_w_index

The Silhouette W index (mean of the cluster average silhouette widths)

genieclust.cluster_validity.wcnn_index

The within-cluster near-neighbours index

References

[1]

Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.

[2]

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).

[3]

Rousseeuw P.J., Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis, Journal of Computational and Applied Mathematics 20, 1987, 53–65, https://doi.org/10.1016/0377-0427(87)90125-7.
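The difference between the Silhouette index and the Silhouette W index lies only in how per-point silhouette widths are aggregated. The sketch below uses made-up widths for four points to show both aggregations side by side. The numbers are purely illustrative and do not come from any dataset.

```python
import numpy as np

# Hypothetical per-point silhouette widths and cluster labels
s = np.array([0.9, 0.8, 0.7, 0.1])
y = np.array([0, 0, 0, 1])

# Silhouette index: plain average over all points
silhouette = float(s.mean())

# Silhouette W index: average within each cluster, then across clusters
cluster_means = np.bincount(y, weights=s) / np.bincount(y)
silhouette_w = float(cluster_means.mean())
```

With unbalanced clusters the two values differ, because the W variant gives every cluster the same weight regardless of its size.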

genieclust.cluster_validity.wcnn_index(X, y, M=25)

Computes the within-cluster near-neighbours index [2].

See [1] and [2] for the definition and discussion.

Parameters:
X : c_contiguous ndarray, shape (n, d)

n data points in a feature space of dimensionality d

y : array_like

A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.

M : int

number of nearest neighbours

Returns:
index : float

Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.

See also

genieclust.cluster_validity.calinski_harabasz_index

The Caliński-Harabasz index

genieclust.cluster_validity.dunnowa_index

Generalised Dunn indices based on near-neighbours and OWA operators (by Gagolewski)

genieclust.cluster_validity.generalised_dunn_index

Generalised Dunn indices (by Bezdek and Pal)

genieclust.cluster_validity.negated_ball_hall_index

The Ball-Hall index (negated)

genieclust.cluster_validity.negated_davies_bouldin_index

The Davies-Bouldin index (negated)

genieclust.cluster_validity.negated_wcss_index

Within-cluster sum of squares (used as the objective function in the k-means and Ward algorithm) (negated)

genieclust.cluster_validity.silhouette_index

The Silhouette index (average silhouette score)

genieclust.cluster_validity.silhouette_w_index

The Silhouette W index (mean of the cluster average silhouette widths)

genieclust.cluster_validity.wcnn_index

The within-cluster near-neighbours index

References

[1]

Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.

[2]

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).
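A rough NumPy sketch of the idea, under the assumption that the index is read as the fraction of each point's M nearest neighbours that share its cluster, averaged over all points. See [2] for the actual definition. The library's treatment of ties and of clusters with M or fewer points may differ. The function name is hypothetical, and M=2 is used here only because the toy data are tiny.

```python
import numpy as np

def wcnn_sketch(X, y, M=2):
    """Share of M nearest neighbours in the same cluster (assumption)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)        # a point is not its own neighbour
    nn = np.argsort(D, axis=1)[:, :M]  # indices of the M nearest neighbours
    return float(np.mean(y[nn] == y[:, None]))

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
wcnn_good = wcnn_sketch(X, [0, 0, 0, 1, 1, 1])
wcnn_bad = wcnn_sketch(X, [0, 1, 0, 1, 0, 1])
```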