genieclust.compare_partitions#

External cluster validity measures and partition similarity scores

These indices can be used for comparing the outputs of clustering algorithms with reference (ground truth) labels.

For more details, see the Framework for Benchmarking Clustering Algorithms.

genieclust.compare_partitions.adjusted_fm_score(x, y)#

The Fowlkes-Mallows index adjusted for chance

Parameters:
xndarray

A confusion matrix (contingency table) or a vector of labels (if y is not None)

yNone or ndarray

a vector of labels or None

Returns:
double

Partition similarity measure.

See also

genieclust.compare_partitions.compare_partitions

Computes multiple similarity scores Computes multiple similarity scores based on two label vectors

Notes

See compare_partitions() for more details.

genieclust.compare_partitions.adjusted_mi_score(x, y)#

Adjusted mutual information score \((\mathrm{AMI}_\mathrm{sum})\)

Parameters:
xndarray

A confusion matrix (contingency table) or a vector of labels (if y is not None)

yNone or ndarray

a vector of labels or None

Returns:
double

Partition similarity measure.

See also

genieclust.compare_partitions.compare_partitions

Computes multiple similarity scores

Notes

See compare_partitions() for more details.

genieclust.compare_partitions.adjusted_rand_score(x, y)#

The Rand index adjusted for chance

Parameters:
xndarray

A confusion matrix (contingency table) or a vector of labels (if y is not None)

yNone or ndarray

a vector of labels or None

Returns:
double

Partition similarity measure.

See also

genieclust.compare_partitions.compare_partitions

Computes multiple similarity scores

Notes

See compare_partitions() for more details.

genieclust.compare_partitions.compare_partitions(C)#

Computes a series of external cluster validity measures

Parameters:
xndarray

A confusion matrix (contingency table) or a vector of labels (if y is not None)

yNone or ndarray

a vector of labels or None

psi_clippedbool

Whether pair sets index should be clipped to 0

Returns:
scoresdict

A dictionary with the following keys:

'ar'

Adjusted Rand index

'r'

Rand index (unadjusted for chance)

'afm'

Adjusted Fowlkes-Mallows index

'fm'

Fowlkes-Mallows index (unadjusted for chance)

'mi'

Mutual information score

'nmi'

Normalised mutual information \((\mathrm{NMI}_\mathrm{sum})\)

'ami'

Adjusted mutual information \((\mathrm{AMI}_\mathrm{sum})\)

'npa'

Normalised pivoted accuracy

'psi'

Pair sets index

'spsi'

Simplified pair sets index

'nca'

Normalised clustering accuracy; it is assumed that rows in C represent the ground-truth partition

Notes

Let x and y represent two partitions of the same set with \(n\) elements into, respectively, \(K\) and \(L\) nonempty and pairwise disjoint subsets. For instance, these can be two clusterings of a dataset with \(n\) observations specified as vectors of labels. Moreover, let C be the confusion matrix with \(K\) rows and \(L\) columns, corresponding to x and y; see also confusion_matrix().

This function implements a few scores that aim to quantify the similarity between x and y. They can be used as external cluster validity measures, where we assume that x is the reference (ground-truth) partition whilst y is the vector of predicted cluster memberships; compare [5].

All indices except normalized_clustering_accuracy can act as a pairwise partition similarity score: they are symmetric, i.e., index(x, y) == index(y, x).

Each index except mi_score (which computes the mutual information score) outputs the value of 1.0 if two identical partitions are given. Note that partitions are always defined up to a bijection of the set of possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4) represent the same 2-partition.

normalized_clustering_accuracy [2] is an asymmetric external cluster validity measure which assumes that the label vector x (or rows in the confusion matrix) represents the reference (ground truth) partition. It is an average proportion of correctly classified points in each cluster above the worst case scenario of uniform membership assignment, with cluster ID matching based on the solution to the maximal linear sum assignment problem; see normalized_confusion_matrix()). It is given by: \((\max_\sigma \frac{1}{K} \sum_{j=1}^K \frac{c_{\sigma(j), j}-c_{\sigma(j),\cdot}/K}{c_{\sigma(j),\cdot}-c_{\sigma(j),\cdot}/K})\), where \(C\) is a confusion matrix with \(K\) rows and \(L\) columns, \(\sigma\) is a permutation of the set \(\{1,\dots,\max(K,L)\}\), and and \(c_{i, \cdot}=c_{i, 1}+...+c_{i, L}\) is the i-th row sum, under the assumption that \(c_{i,j}=0\) for \(i>K\) or \(j>L\) and \(0/0=0\).

normalized_pivoted_accuracy is defined as \((\max_\sigma \sum_{j=1}^{\max(K,L)} c_{\sigma(j),j}/n-1/\max(K,L))/(1-1/\max(K,L))\), where \(\sigma\) is a permutation of the set \(\{1,\dots,\max(K,L)\}\), and \(n\) is the sum of all elements in \(C\). For non-square matrices, missing rows/columns are assumed to be filled with 0s.

pair_sets_index (PSI) was introduced in [3]. The simplified PSI assumes E=1 in the definition of the index, i.e., uses Eq. (20) in [3] instead of Eq. (18). For non-square matrices, missing rows/columns are assumed to be filled with 0s.

rand_score gives the Rand score (the “probability” of agreement between the two partitions) and adjusted_rand_score is its version corrected for chance [1] (especially Eqs. (2) and (4) therein): its expected value is 0.0 for two independent partitions. Due to the adjustment, the resulting index may be negative for some inputs.

Similarly, fm_score gives the Fowlkes-Mallows (FM) score and adjusted_fm_score is its adjusted-for-chance version [1].

mi_score, adjusted_mi_score and normalized_mi_score are information-theoretic indices based on mutual information, see the definition of \(\mathrm{AMI}_\mathrm{sum}\) and \(\mathrm{NMI}_\mathrm{sum}\) in [4].

References

[1] (1,2)

Hubert L., Arabie P., Comparing partitions, Journal of Classification 2(1), 1985, 193-218.

[2]

Gagolewski M., Normalised clustering accuracy: An asymmetric external cluster validity measure, 2023, under review (preprint). https://doi.org/10.48550/arXiv.2209.02935.

[3] (1,2)

Rezaei M., Franti P., Set matching measures for external cluster validity, IEEE Transactions on Knowledge and Data Mining 28(8), 2016, 2173-2186. https://doi.org/10.1109/TKDE.2016.2551240.

[4]

Vinh N.X., Epps J., Bailey J., Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance, Journal of Machine Learning Research 11, 2010, 2837-2854.

[5]

Gagolewski M., A Framework for Benchmarking Clustering Algorithms, https://clustering-benchmarks.gagolewski.com

Examples

>>> x = np.r_[1, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2]
>>> y = np.r_[2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1]
>>> C = genieclust.compare_partitions.confusion_matrix(x, y)
>>> C
array([[ 1, 10],
       [ 8,  2]])
>>> {k : round(v, 2) for k, v in
...      genieclust.compare_partitions.compare_partitions(C).items()}
{'ar': 0.49, 'r': 0.74, 'fm': 0.73, 'afm': 0.49, 'mi': 0.29, 'nmi': 0.41, 'ami': 0.39, 'npa': 0.71, 'psi': 0.65, 'spsi': 0.63, 'nca': 0.71}
>>> {k : round(v, 2) for k, v in
...      genieclust.compare_partitions.compare_partitions(x,y).items()}
{'ar': 0.49, 'r': 0.74, 'fm': 0.73, 'afm': 0.49, 'mi': 0.29, 'nmi': 0.41, 'ami': 0.39, 'npa': 0.71, 'psi': 0.65, 'spsi': 0.63, 'nca': 0.71}
>>> round(genieclust.compare_partitions.normalized_clustering_accuracy(x, y), 2)
0.71
genieclust.compare_partitions.confusion_matrix(x, y)#

Computes the confusion matrix for two label vectors

Parameters:
x, yarray_like

Two vectors of “small” integers of identical lengths.

Returns:
Cndarray

A (dense) confusion matrix (contingency table) with max(x)-min(x)+1 rows and max(y)-min(y)+1 columns.

See also

genieclust.compare_partitions.normalize_confusion_matrix

Permutes the rows and columns of a confusion matrix so that the sum of the elements of the main diagonal is the largest possible

Examples

>>> x = np.r_[1, 2, 1, 2, 2, 2, 3, 1, 2, 1, 2, 1, 2, 2]
>>> y = np.r_[3, 3, 3, 3, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2]
>>> C = genieclust.compare_partitions.confusion_matrix(x, y)
>>> C
array([[1, 0, 4],
       [0, 6, 2],
       [0, 0, 1]])
genieclust.compare_partitions.fm_score(x, y)#

The original Fowlkes-Mallows index (not adjusted for chance)

Parameters:
xndarray

A confusion matrix (contingency table) or a vector of labels (if y is not None)

yNone or ndarray

a vector of labels or None

Returns:
double

Partition similarity measure.

See also

genieclust.compare_partitions.compare_partitions

Computes multiple similarity scores

Notes

See compare_partitions() for more details.

genieclust.compare_partitions.mi_score(x, y)#

Mutual information score

Parameters:
xndarray

A confusion matrix (contingency table) or a vector of labels (if y is not None)

yNone or ndarray

a vector of labels or None

Returns:
double

Partition similarity measure.

See also

genieclust.compare_partitions.compare_partitions

Computes multiple similarity scores

Notes

See compare_partitions() for more details.

genieclust.compare_partitions.normalize_confusion_matrix(C)#

Permutes the rows and columns of a confusion matrix so that the sum of the elements on the main diagonal is the largest possible (by solving the maximal assignment problem).

Parameters:
Cndarray

A confusion matrix (contingency table), whose row count is not greater than the column count

Returns:
ndarray

A normalised confusion matrix of the same shape as C.

See also

genieclust.compare_partitions.confusion_matrix

Determines the confusion matrix

genieclust.compare_partitions.normalized_confusion_matrix

Determines the confusion matrix and permutes the rows and columns so that the sum of the elements of the main diagonal is the largest possible

genieclust.compare_partitions.normalizing_permutation

The underlying function to determine the ordering permutation of the columns

Notes

This function comes in handy when C summarises the results generated by clustering algorithms, where the actual label values do not matter (e.g., (1, 2, 0) can be remapped to (0, 2, 1) with no change in meaning).

Examples

>>> x = np.r_[1, 2, 1, 2, 2, 2, 3, 1, 2, 1, 2, 1, 2, 2]
>>> y = np.r_[3, 3, 3, 3, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2]
>>> C = genieclust.compare_partitions.confusion_matrix(x, y)
>>> C
array([[1, 0, 4],
       [0, 6, 2],
       [0, 0, 1]])
>>> genieclust.compare_partitions.normalize_confusion_matrix(C)
array([[4, 0, 1],
       [2, 6, 0],
       [1, 0, 0]])
genieclust.compare_partitions.normalized_clustering_accuracy(x, y)#

Normalised clustering accuracy (NCA) [1].

This measure is asymmetric – it is assumed that x represents the ground truth labels, whilst y give predicted cluster IDs.

Parameters:
xndarray

A confusion matrix (contingency table) or a vector of labels (if y is not None)

yNone or ndarray

a vector of labels or None

Returns:
double

Similarity score.

See also

genieclust.compare_partitions.compare_partitions

Computes multiple similarity scores

genieclust.compare_partitions.normalized_confusion_matrix

Determines the confusion matrix and permutes the rows and columns so that the sum of the elements of the main diagonal is the largest possible

Notes

See compare_partitions() for more details.

References

[1]

Gagolewski M., Normalised clustering accuracy: An asymmetric external cluster validity measure, 2023, under review (preprint). https://doi.org/10.48550/arXiv.2209.02935

genieclust.compare_partitions.normalized_confusion_matrix(x, y)#

Computes the confusion matrix for two label vectors and permutes its rows and columns so that the sum of the elements of the main diagonal is the largest possible (by solving the maximal assignment problem).

Parameters:
x, yarray_like

Two vectors of “small” integers of identical lengths.

force_doublebool

If the return dtype should be ‘double’.

Returns:
Cndarray

A (dense) confusion matrix (contingency table) with max(x)-min(x)+1 rows and max(y)-min(y)+1 columns.

See also

genieclust.compare_partitions.normalizing_permutation

The underlying function to determine the ordering permutation of the columns

genieclust.compare_partitions.confusion_matrix

Determines the confusion matrix

Examples

>>> x = np.r_[1, 2, 1, 2, 2, 2, 3, 1, 2, 1, 2, 1, 2, 2]
>>> y = np.r_[3, 3, 3, 3, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2]
>>> genieclust.compare_partitions.normalized_confusion_matrix(x, y)
array([[4, 0, 1],
       [2, 6, 0],
       [1, 0, 0]])
genieclust.compare_partitions.normalized_mi_score(x, y)#

Normalised mutual information score \((\mathrm{NMI}_\mathrm{sum})\)

Parameters:
xndarray

A confusion matrix (contingency table) or a vector of labels (if y is not None)

yNone or ndarray

a vector of labels or None

Returns:
double

Partition similarity measure.

See also

genieclust.compare_partitions.compare_partitions

Computes multiple similarity scores

Notes

See compare_partitions() for more details.

genieclust.compare_partitions.normalized_pivoted_accuracy(x, y)#

Normalised pivoted accuracy (NPA)

Parameters:
xndarray

A confusion matrix (contingency table) or a vector of labels (if y is not None)

yNone or ndarray

a vector of labels or None

Returns:
double

Partition similarity measure.

See also

genieclust.compare_partitions.compare_partitions

Computes multiple similarity scores

genieclust.compare_partitions.normalized_confusion_matrix

Determines the confusion matrix and permutes the rows and columns so that the sum of the elements of the main diagonal is the largest possible

Notes

See compare_partitions() for more details.

genieclust.compare_partitions.normalizing_permutation(C)#

Determines the permutation of columns of a confusion matrix so that the sum of the elements on the main diagonal is the largest possible (by solving the maximal assignment problem).

Parameters:
Cndarray

A confusion matrix (contingency table), whose row count is not greater than the column count; can be a matrix of elements of the double type

Returns:
ndarray

A vector of indexes

See also

genieclust.compare_partitions.confusion_matrix

Determines the confusion matrix

genieclust.compare_partitions.normalized_confusion_matrix

Determines the confusion matrix and permutes the rows and columns so that the sum of the elements of the main diagonal is the largest possible

Notes

This function comes in handy when C summarises the results generated by clustering algorithms, where the actual label values do not matter (e.g., (1, 2, 0) can be remapped to (0, 2, 1) with no change in meaning).

Examples

>>> x = np.r_[1, 2, 1, 2, 2, 2, 3, 1, 2, 1, 2, 1, 2, 2]
>>> y = np.r_[3, 3, 3, 3, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2]
>>> C = genieclust.compare_partitions.confusion_matrix(x, y)
>>> C
array([[1, 0, 4],
       [0, 6, 2],
       [0, 0, 1]])
>>> I = genieclust.compare_partitions.normalizing_permutation(C)
>>> I
array([2, 1, 0])
>>> C[:, I]
array([[4, 0, 1],
       [2, 6, 0],
       [1, 0, 0]])
genieclust.compare_partitions.pair_sets_index(x, y)#

Pair sets index (PSI) [1]

For non-square confusion matrices, missing rows/columns are assumed to be filled with 0s.

Parameters:
xndarray

A confusion matrix (contingency table) or a vector of labels (if y is not None)

yNone or ndarray

a vector of labels or None

simplifiedbool

Whether to assume E=1 in the definition of the index, i.e.,use Eq. (20) in [1] instead of Eq. (18).

clippedbool

Whether the result should be clipped to the unit interval.

Returns:
double

Partition similarity measure.

See also

genieclust.compare_partitions.compare_partitions

Computes multiple similarity scores

genieclust.compare_partitions.normalized_confusion_matrix

Determines the confusion matrix and permutes the rows and columns so that the sum of the elements of the main diagonal is the largest possible

Notes

See compare_partitions() for more details.

References

[1] (1,2)

Rezaei M., Franti P., Set matching measures for external cluster validity, IEEE Transactions on Knowledge and Data Mining 28(8), 2016, 2173-2186. https://doi.org/10.1109/TKDE.2016.2551240.

genieclust.compare_partitions.rand_score(x, y)#

The original Rand index not adjusted for chance

Parameters:
xndarray

A confusion matrix (contingency table) or a vector of labels (if y is not None)

yNone or ndarray

a vector of labels or None

Returns:
double

Partition similarity measure.

See also

genieclust.compare_partitions.compare_partitions

Computes multiple similarity scores

Notes

See compare_partitions() for more details.