genieclust.cluster_validity
Internal cluster validity indices
The greater the index value, the more valid (whatever that means) the assessed partition. For consistency, the Ball-Hall and Davies-Bouldin indices, as well as the within-cluster sum of squares, are negated so that greater values are better for every measure.
These measures were critically reviewed in (Gagolewski, Bartoszuk, Cena, 2021; https://doi.org/10.1016/j.ins.2021.10.004); see Section 2 therein for the respective definitions.
For even more details, see the Framework for Benchmarking Clustering Algorithms (https://clustering-benchmarks.gagolewski.com).
- genieclust.cluster_validity.calinski_harabasz_index(X, y)
Computes the value of the Caliński-Harabasz index [3].
See [1] and [2] for the definition and discussion.
- Parameters:
- X : c_contiguous ndarray, shape (n, d)
n data points in a feature space of dimensionality d
- y : array_like
A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.
- Returns:
- index : float
Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.
See also
genieclust.cluster_validity.calinski_harabasz_index
The Caliński-Harabasz index
genieclust.cluster_validity.dunnowa_index
Generalised Dunn indices based on near-neighbours and OWA operators (by Gagolewski)
genieclust.cluster_validity.generalised_dunn_index
Generalised Dunn indices (by Bezdek and Pal)
genieclust.cluster_validity.negated_ball_hall_index
The Ball-Hall index (negated)
genieclust.cluster_validity.negated_davies_bouldin_index
The Davies-Bouldin index (negated)
genieclust.cluster_validity.negated_wcss_index
Within-cluster sum of squares (used as the objective function in the k-means and Ward algorithm) (negated)
genieclust.cluster_validity.silhouette_index
The Silhouette index (average silhouette score)
genieclust.cluster_validity.silhouette_w_index
The Silhouette W index (mean of the cluster average silhouette widths)
genieclust.cluster_validity.wcnn_index
The within-cluster near-neighbours index
References
[1]Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.
[2]Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).
[3]Calinski T., Harabasz J., A dendrite method for cluster analysis, Communications in Statistics 3(1), 1974, 1–27, https://doi.org/10.1080/03610927408827101.
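The definition behind this index can be illustrated with a minimal plain-Python sketch. It is not genieclust's implementation; the helper names (`centroid`, `sq_dist`, `calinski_harabasz_sketch`) are hypothetical, and the formula assumed here is the textbook one: between-cluster dispersion divided by K - 1, over within-cluster dispersion divided by n - K.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def calinski_harabasz_sketch(X, y):
    """Textbook Calinski-Harabasz index; assumes 2 <= K < n, no empty cluster."""
    n, K = len(X), max(y) + 1
    clusters = [[x for x, lab in zip(X, y) if lab == k] for k in range(K)]
    mu = centroid(X)                       # centroid of the whole data set
    mus = [centroid(c) for c in clusters]  # one centroid per cluster
    between = sum(len(c) * sq_dist(m, mu) for c, m in zip(clusters, mus))
    within = sum(sq_dist(x, mus[lab]) for x, lab in zip(X, y))
    return (between / (K - 1)) / (within / (n - K))


# Two tight pairs of points, four units apart.
X = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good = calinski_harabasz_sketch(X, [0, 0, 1, 1])  # respects the two pairs
bad = calinski_harabasz_sketch(X, [0, 1, 0, 1])   # splits each pair
print(good, bad)
```

The partition that keeps each pair together scores far higher than the one that splits them, which matches the "greater is better" convention used throughout this module.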
- genieclust.cluster_validity.dunnowa_index(X, y, M=25, owa_numerator='SMin:5', owa_denominator='Const')
Computes the generalised Dunn indices based on near-neighbours and OWA operators [2].
See [1] and [2] for the definition and discussion.
- Parameters:
- X : c_contiguous ndarray, shape (n, d)
n data points in a feature space of dimensionality d
- y : array_like
A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.
- M : int
Number of nearest neighbours
- owa_numerator, owa_denominator : str
Specifies the OWA operators to use in the definition of the DuNN index; one of: "Mean", "Min", "Max", "Const", "SMin:D", "SMax:D", where D is an integer defining the degree of smoothness
- Returns:
- index : float
Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.
References
[1]Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.
[2]Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).
- genieclust.cluster_validity.generalised_dunn_index(X, y, lowercase_d=1, uppercase_d=2)
Computes the generalised Dunn indices (by Bezdek and Pal) [3].
See [1] and [2] for the definition and discussion.
- Parameters:
- X : c_contiguous ndarray, shape (n, d)
n data points in a feature space of dimensionality d
- y : array_like
A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.
- lowercase_d : int
an integer between 1 and 5, denoting \(d_1\), …, \(d_5\) in the definition of the generalised Dunn index (numerator: min, max, and mean pairwise intercluster distance, distance between cluster centroids, weighted point-centroid distance, respectively)
- uppercase_d : int
an integer between 1 and 3, denoting \(D_1\), …, \(D_3\) in the definition of the generalised Dunn index (denominator: max and min pairwise intracluster distance, average point-centroid distance, respectively)
- Returns:
- index : float
Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.
References
[1]Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.
[2]Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).
[3]Bezdek J., Pal N., Some new indexes of cluster validity, IEEE Transactions on Systems, Man, and Cybernetics, Part B 28, 1998, 301-315, https://doi.org/10.1109/3477.678624.
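To make the numerator/denominator roles concrete, here is a minimal plain-Python sketch of one member of this family: \(d_1\) (smallest distance between points in different clusters) over \(D_1\) (largest distance between points in the same cluster), which is the classic Dunn index. The function name `dunn_d1_D1_sketch` is hypothetical, this is not genieclust's code, and the combination shown differs from the default `uppercase_d=2`.

```python
import math
from itertools import combinations


def dunn_d1_D1_sketch(X, y):
    """Classic Dunn index: min between-cluster distance / max cluster diameter.

    Assumes at least two clusters and at least one cluster with two
    distinct points, so that both the min and the max are well defined.
    """
    pairs = list(combinations(range(len(X)), 2))
    # d_1: closest pair of points lying in different clusters
    separation = min(math.dist(X[i], X[j]) for i, j in pairs if y[i] != y[j])
    # D_1: farthest pair of points lying in the same cluster
    diameter = max(math.dist(X[i], X[j]) for i, j in pairs if y[i] == y[j])
    return separation / diameter


X = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good = dunn_d1_D1_sketch(X, [0, 0, 1, 1])  # compact, well-separated clusters
bad = dunn_d1_D1_sketch(X, [0, 1, 0, 1])   # elongated, interleaved clusters
print(good, bad)
```

Swapping in another \(d_i\) or \(D_j\) only changes how `separation` and `diameter` are computed; the ratio structure stays the same.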
- genieclust.cluster_validity.negated_ball_hall_index(X, y)
Computes the value of the negated Ball-Hall index [3].
See [1] and [2] for the definition and discussion.
- Parameters:
- X : c_contiguous ndarray, shape (n, d)
n data points in a feature space of dimensionality d
- y : array_like
A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.
- Returns:
- index : float
Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.
References
[1]Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.
[2]Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).
[3]Ball G.H., Hall D.J., ISODATA: A novel method of data analysis and pattern classification, Technical report No. AD699616, Stanford Research Institute, 1965.
- genieclust.cluster_validity.negated_davies_bouldin_index(X, y)
Computes the value of the negated Davies-Bouldin index [3].
See [1] and [2] for the definition and discussion.
- Parameters:
- X : c_contiguous ndarray, shape (n, d)
n data points in a feature space of dimensionality d
- y : array_like
A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.
- Returns:
- index : float
Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.
References
[1]Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.
[2]Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).
[3]Davies D.L., Bouldin D.W., A Cluster Separation Measure, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1 (2), 1979, 224-227, https://doi.org/10.1109/TPAMI.1979.4766909.
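A minimal plain-Python sketch of the negation convention, assuming the textbook Davies-Bouldin definition (for each cluster, the worst ratio of summed within-cluster scatters to the distance between centroids, averaged over clusters). The names `centroid` and `negated_davies_bouldin_sketch` are hypothetical and this is not genieclust's implementation.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def negated_davies_bouldin_sketch(X, y):
    """Negated textbook Davies-Bouldin index.

    Assumes K >= 2, no empty cluster, and pairwise distinct centroids.
    """
    K = max(y) + 1
    clusters = [[x for x, lab in zip(X, y) if lab == k] for k in range(K)]
    mus = [centroid(c) for c in clusters]
    # scatter[k]: average distance from the points of cluster k to its centroid
    scatter = [sum(math.dist(x, m) for x in c) / len(c)
               for c, m in zip(clusters, mus)]
    # For each cluster, the least favourable comparison with another cluster
    worst = [max((scatter[i] + scatter[j]) / math.dist(mus[i], mus[j])
                 for j in range(K) if j != i)
             for i in range(K)]
    return -sum(worst) / K  # negated, so that greater means better


X = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good = negated_davies_bouldin_sketch(X, [0, 0, 1, 1])
bad = negated_davies_bouldin_sketch(X, [0, 1, 0, 1])
print(good, bad)
```

Without the leading minus sign the better partition would receive the smaller value; negating restores the "greater is better" reading shared by the other indices.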
- genieclust.cluster_validity.negated_wcss_index(X, y)
Computes the value of the negated within-cluster sum of squares (used as the objective function in the k-means and Ward algorithms).
See [1] and [2] for the definition and discussion.
- Parameters:
- X : c_contiguous ndarray, shape (n, d)
n data points in a feature space of dimensionality d
- y : array_like
A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.
- Returns:
- index : float
Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.
References
[1]Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.
[2]Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).
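As a minimal illustration, the quantity can be sketched in plain Python as the sum of squared Euclidean distances between each point and the centroid of its cluster, with the sign flipped. The names `centroid` and `negated_wcss_sketch` are hypothetical; this is not genieclust's implementation.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def negated_wcss_sketch(X, y):
    """Minus the sum of squared distances to the cluster centroids."""
    K = max(y) + 1
    mus = [centroid([x for x, lab in zip(X, y) if lab == k]) for k in range(K)]
    wcss = sum(
        sum((u - v) ** 2 for u, v in zip(x, mus[lab]))
        for x, lab in zip(X, y)
    )
    return -wcss  # k-means minimises wcss, so maximising -wcss is equivalent


X = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good = negated_wcss_sketch(X, [0, 0, 1, 1])
bad = negated_wcss_sketch(X, [0, 1, 0, 1])
print(good, bad)
```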
- genieclust.cluster_validity.silhouette_index(X, y)
Computes the value of the Silhouette index (average silhouette score) [3].
See [1] and [2] for the definition and discussion.
- Parameters:
- X : c_contiguous ndarray, shape (n, d)
n data points in a feature space of dimensionality d
- y : array_like
A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.
- Returns:
- index : float
Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.
References
[1]Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.
[2]Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).
[3]Rousseeuw P.J., Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis, Journal of Computational and Applied Mathematics 20, 1987, 53-65, https://doi.org/10.1016/0377-0427(87)90125-7.
- genieclust.cluster_validity.silhouette_w_index(X, y)
Computes the value of the Silhouette W index (mean of the cluster average silhouette widths) [3].
See [1] and [2] for the definition and discussion.
- Parameters:
- X : c_contiguous ndarray, shape (n, d)
n data points in a feature space of dimensionality d
- y : array_like
A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.
- Returns:
- index : float
Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.
References
[1]Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.
[2]Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).
[3]Rousseeuw P.J., Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis, Journal of Computational and Applied Mathematics 20, 1987, 53-65, https://doi.org/10.1016/0377-0427(87)90125-7.
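The difference from the plain Silhouette index shows up only when cluster sizes are unequal: here the per-point scores are first averaged within each cluster and the K cluster averages are then averaged, so every cluster carries the same weight. Below is a minimal plain-Python sketch under that reading; the function names are hypothetical, singleton clusters are assumed to score 0, and this is not genieclust's implementation.

```python
import math


def silhouette_scores(X, y):
    """Per-point silhouette scores; assumes K >= 2 and no empty cluster."""
    K = max(y) + 1
    members = [[i for i, lab in enumerate(y) if lab == k] for k in range(K)]
    scores = []
    for i, lab in enumerate(y):
        own = [j for j in members[lab] if j != i]
        if not own:  # singleton cluster: score fixed at 0 (assumed convention)
            scores.append(0.0)
            continue
        a = sum(math.dist(X[i], X[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(X[i], X[j]) for j in members[k]) / len(members[k])
            for k in range(K) if k != lab
        )
        scores.append((b - a) / max(a, b))
    return scores


def silhouette_w_sketch(X, y):
    """Mean over clusters of the cluster-wise average silhouette score."""
    K = max(y) + 1
    s = silhouette_scores(X, y)
    cluster_means = []
    for k in range(K):
        own = [s[i] for i, lab in enumerate(y) if lab == k]
        cluster_means.append(sum(own) / len(own))
    return sum(cluster_means) / K


# Three points close together and one far-away singleton.
X = [(0.0,), (1.0,), (2.0,), (10.0,)]
y = [0, 0, 0, 1]
s = silhouette_scores(X, y)
plain = sum(s) / len(s)        # every point weighted equally
w = silhouette_w_sketch(X, y)  # every cluster weighted equally
print(plain, w)
```

In this example the singleton's score of 0 accounts for half of the W value but only a quarter of the plain average, so the W value is the smaller of the two.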
- genieclust.cluster_validity.wcnn_index(X, y, M=25)
Computes the within-cluster near-neighbours index [2].
See [1] and [2] for the definition and discussion.
- Parameters:
- X : c_contiguous ndarray, shape (n, d)
n data points in a feature space of dimensionality d
- y : array_like
A vector of “small” integers representing a partition of the n input points; y[i] is the cluster ID of the i-th point, where 0 <= y[i] < K and K is the number of clusters.
- M : int
Number of nearest neighbours
- Returns:
- index : float
Computed index value. The greater the index value, the more valid (whatever that means) the assessed partition.
References
[1]Gagolewski M., A Framework for Benchmarking Clustering Algorithms, SoftwareX 20, 2022, 101270. https://clustering-benchmarks.gagolewski.com.
[2]Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 620–636, 2021, https://doi.org/10.1016/j.ins.2021.10.004 (preprint).
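The idea behind the index can be sketched in plain Python as the fraction of each point's M nearest neighbours that share its cluster label, averaged over all points. This is an assumption-laden illustration rather than genieclust's code: the name `wcnn_sketch` is hypothetical, ties between equidistant neighbours are broken by index order, a small M is used instead of the default 25, and any special handling of clusters with M or fewer points is not modelled.

```python
import math


def wcnn_sketch(X, y, M=2):
    """Share of same-cluster points among the M nearest neighbours."""
    n = len(X)
    hits = 0
    for i in range(n):
        # All other points, ordered by distance to point i (stable sort,
        # so equidistant neighbours keep their index order).
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(X[i], X[j]))
        hits += sum(y[j] == y[i] for j in others[:M])
    return hits / (n * M)


# Two groups of three points on a line.
X = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
good = wcnn_sketch(X, [0, 0, 0, 1, 1, 1], M=2)  # labels follow the groups
bad = wcnn_sketch(X, [0, 1, 0, 1, 0, 1], M=2)   # labels alternate
print(good, bad)
```

A value of 1 means that every point's nearest neighbours lie in its own cluster.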