genieclust.compare_partitions
Partition similarity scores

genieclust.compare_partitions.adjusted_fm_score(x, y)
The Fowlkes-Mallows index adjusted for chance
 Parameters
 x, y : array_like
Two vectors of “small” integers of identical lengths, representing two partitions of the same set.
 Returns
 double
Partition similarity measure.
See also
genieclust.compare_partitions.compare_partitions
Computes multiple similarity scores based on a confusion matrix
genieclust.compare_partitions.compare_partitions2
Computes multiple similarity scores based on two label vectors
Notes
See genieclust.compare_partitions.compare_partitions for more details.

genieclust.compare_partitions.adjusted_mi_score(x, y)
Adjusted mutual information score \((\mathrm{AMI}_\mathrm{sum})\)
 Parameters
 x, y : array_like
Two vectors of “small” integers of identical lengths, representing two partitions of the same set.
 Returns
 double
Partition similarity measure.
See also
genieclust.compare_partitions.compare_partitions
Computes multiple similarity scores based on a confusion matrix
genieclust.compare_partitions.compare_partitions2
Computes multiple similarity scores based on two label vectors
Notes
See genieclust.compare_partitions.compare_partitions for more details.

genieclust.compare_partitions.adjusted_rand_score(x, y)
The Rand index adjusted for chance
 Parameters
 x, y : array_like
Two vectors of “small” integers of identical lengths, representing two partitions of the same set.
 Returns
 double
Partition similarity measure.
See also
genieclust.compare_partitions.compare_partitions
Computes multiple similarity scores based on a confusion matrix
genieclust.compare_partitions.compare_partitions2
Computes multiple similarity scores based on two label vectors
Notes
See genieclust.compare_partitions.compare_partitions for more details.

genieclust.compare_partitions.compare_partitions(C)
Computes a series of partition similarity scores
 Parameters
 C : ndarray
A c_contiguous confusion matrix (contingency table) with \(K\) rows and \(L\) columns, where \(K \le L\).
 Returns
 scores : dict
A dictionary with the following keys:
'ar'
Adjusted Rand index
'r'
Rand index (unadjusted for chance)
'afm'
Adjusted Fowlkes-Mallows index
'fm'
Fowlkes-Mallows index (unadjusted for chance)
'mi'
Mutual information score
'nmi'
Normalised mutual information \((\mathrm{NMI}_\mathrm{sum})\)
'ami'
Adjusted mutual information \((\mathrm{AMI}_\mathrm{sum})\)
'nacc'
Normalised accuracy (purity)
'psi'
Pair sets index
See also
genieclust.compare_partitions.confusion_matrix
Computes a confusion matrix
genieclust.compare_partitions.compare_partitions2
A wrapper around this function that accepts two label vectors on input
genieclust.compare_partitions.adjusted_rand_score
genieclust.compare_partitions.rand_score
genieclust.compare_partitions.adjusted_fm_score
genieclust.compare_partitions.fm_score
genieclust.compare_partitions.mi_score
genieclust.compare_partitions.normalized_mi_score
genieclust.compare_partitions.adjusted_mi_score
genieclust.compare_partitions.normalized_accuracy
genieclust.compare_partitions.pair_sets_index
Notes
Let x and y represent two partitions of the same set with \(n\) elements into, respectively, \(K\) and \(L\) nonempty and pairwise disjoint subsets. For instance, these can be two clusterings of a dataset with \(n\) observations specified as vectors of labels. Moreover, let C be the confusion matrix (with \(K\) rows and \(L\) columns, \(K \leq L\)) corresponding to x and y, see also genieclust.compare_partitions.confusion_matrix.
This function implements a few scores that aim to quantify the similarity between x and y. Partition similarity scores can be used as external cluster validity measures — for comparing the outputs of clustering algorithms with reference (ground truth) labels, see, e.g., https://github.com/gagolews/clustering_benchmarks_v1 for a suite of benchmark datasets.
Every index except mi_score (which computes the mutual information score) outputs 1 given two identical partitions. Note that partitions are always defined up to a bijection of the set of possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4) represent the same 2-partition.
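The label-invariance property above can be illustrated with a small pure-Python sketch. The helper names canonical_labels and same_partition are hypothetical and are not part of genieclust; they only show that two label vectors describe the same partition when they agree after relabelling clusters in order of first appearance.

```python
def canonical_labels(labels):
    """Relabel clusters as 0, 1, 2, ... in order of first appearance."""
    mapping = {}
    # setdefault assigns the next free integer the first time a label is seen
    return [mapping.setdefault(v, len(mapping)) for v in labels]


def same_partition(x, y):
    """True if x and y encode the same partition up to a bijection of labels."""
    return canonical_labels(x) == canonical_labels(y)


print(same_partition([1, 1, 2, 1], [4, 4, 2, 4]))  # True
print(same_partition([1, 1, 2, 1], [1, 2, 2, 1]))  # False
```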
rand_score gives the Rand score (the “probability” of agreement between the two partitions) and adjusted_rand_score is its version corrected for chance [1] (especially Eqs. (2) and (4) therein): its expected value is 0.0 for two independent partitions. Due to the adjustment, the resulting index might also be negative for some inputs.
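As a rough illustration of the pair-counting definitions in [1], the sketch below computes the Rand index and its adjusted-for-chance version directly from a confusion matrix given as a list of rows. The function name rand_scores is hypothetical and this is not the library's implementation; it assumes at least two elements and a non-degenerate denominator (e.g., not both partitions consisting of a single cluster).

```python
from math import comb


def rand_scores(C):
    """Return (Rand, adjusted Rand) for a confusion matrix C (list of rows)."""
    n = sum(sum(row) for row in C)
    pairs_total = comb(n, 2)                                 # all pairs of elements
    pairs_both = sum(comb(v, 2) for row in C for v in row)   # together in x and in y
    pairs_x = sum(comb(sum(row), 2) for row in C)            # together in x
    pairs_y = sum(comb(sum(col), 2) for col in zip(*C))      # together in y
    # Rand: fraction of pairs on which the two partitions agree
    rand = (pairs_total + 2 * pairs_both - pairs_x - pairs_y) / pairs_total
    # Expected number of co-clustered pairs for independent partitions
    expected = pairs_x * pairs_y / pairs_total
    ar = (pairs_both - expected) / (0.5 * (pairs_x + pairs_y) - expected)
    return rand, ar


r, ar = rand_scores([[1, 10], [8, 2]])
print(round(r, 2), round(ar, 2))  # 0.74 0.49
```

The matrix used here is the one from the Examples section of this function, and the rounded values agree with the 'r' and 'ar' entries shown there.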
Similarly, fm_score gives the Fowlkes-Mallows (FM) score and adjusted_fm_score is its adjusted-for-chance version [1].
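A minimal pure-Python sketch of the FM score and one possible chance adjustment follows. The function name fm_scores is hypothetical, and the adjustment used, (FM - E)/(1 - E) with E = sqrt(pairs_x * pairs_y) / C(n, 2), is an assumption in the spirit of [1]; the library's exact formula may differ. Both partitions are assumed to contain at least one cluster with two or more elements.

```python
from math import comb, sqrt


def fm_scores(C):
    """Return (FM, adjusted FM) for a confusion matrix C (list of rows)."""
    n = sum(sum(row) for row in C)
    pairs_total = comb(n, 2)
    pairs_both = sum(comb(v, 2) for row in C for v in row)
    pairs_x = sum(comb(sum(row), 2) for row in C)
    pairs_y = sum(comb(sum(col), 2) for col in zip(*C))
    # FM: co-clustered pairs normalised by the geometric mean of the marginals
    fm = pairs_both / sqrt(pairs_x * pairs_y)
    # Assumed expected FM value for two independent partitions
    expected = sqrt(pairs_x * pairs_y) / pairs_total
    afm = (fm - expected) / (1.0 - expected)
    return fm, afm


fm, afm = fm_scores([[1, 10], [8, 2]])
print(round(fm, 2), round(afm, 2))  # 0.73 0.49
```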
Note that both the (unadjusted) Rand and FM scores are bounded from below by \(1/(K+1)\) if \(K = L\), hence their adjusted versions are preferred.
mi_score, adjusted_mi_score and normalized_mi_score are information-theoretic indices based on mutual information, see the definition of \(\mathrm{AMI}_\mathrm{sum}\) and \(\mathrm{NMI}_\mathrm{sum}\) in [4].
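For illustration, the mutual information and its sum-normalised variant can be sketched in pure Python as below. The function name mi_scores is hypothetical; natural logarithms are assumed, and \(\mathrm{NMI}_\mathrm{sum}\) is taken as MI divided by the mean of the two entropies, following [4]. The adjusted variant is omitted because it additionally requires the expected mutual information under a hypergeometric model.

```python
from math import log


def mi_scores(C):
    """Return (MI, NMI_sum) for a confusion matrix C (list of rows)."""
    n = sum(sum(row) for row in C)
    row_sums = [sum(row) for row in C]
    col_sums = [sum(col) for col in zip(*C)]
    mi = 0.0
    for i, row in enumerate(C):
        for j, v in enumerate(row):
            if v > 0:  # empty cells contribute nothing
                mi += v / n * log(v * n / (row_sums[i] * col_sums[j]))
    # Entropies of the two partitions
    h_x = -sum(a / n * log(a / n) for a in row_sums if a > 0)
    h_y = -sum(b / n * log(b / n) for b in col_sums if b > 0)
    nmi = mi / (0.5 * (h_x + h_y))
    return mi, nmi


mi, nmi = mi_scores([[1, 10], [8, 2]])
print(round(mi, 2), round(nmi, 2))  # 0.29 0.41
```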
normalized_accuracy is defined as \((\mathrm{Accuracy}(C_\sigma)-1/L)/(1-1/L)\), where \(C_\sigma\) is a version of the confusion matrix for given x and y, \(K \leq L\), with columns permuted based on the solution to the Maximal Linear Sum Assignment Problem. \(\mathrm{Accuracy}(C_\sigma)\) is sometimes referred to as Purity, e.g., in [2].
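The definition above can be sketched as follows for small matrices. The function name normalized_accuracy_from_matrix is hypothetical, and a brute-force search over column assignments stands in for a proper linear sum assignment solver, so this is only practical for a handful of clusters. It assumes \(K \leq L\) and \(L \geq 2\).

```python
from itertools import permutations


def normalized_accuracy_from_matrix(C):
    """Normalised accuracy for a K-by-L confusion matrix C (list of rows), K <= L."""
    K, L = len(C), len(C[0])
    n = sum(sum(row) for row in C)
    # Best one-to-one matching of rows to distinct columns (brute force)
    best = max(
        sum(C[i][cols[i]] for i in range(K))
        for cols in permutations(range(L), K)
    )
    accuracy = best / n
    # Rescale so that the baseline value 1/L maps to 0 and a perfect match to 1
    return (accuracy - 1.0 / L) / (1.0 - 1.0 / L)


print(round(normalized_accuracy_from_matrix([[1, 10], [8, 2]]), 2))  # 0.71
```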
pair_sets_index gives the Pair Sets Index (PSI) adjusted for chance [3], \(K \leq L\). Pairing is based on the solution to the Linear Sum Assignment Problem of a transformed version of the confusion matrix.
References
 [1] Hubert L., Arabie P., Comparing Partitions, Journal of Classification 2(1), 1985, 193-218.
 [2] Rendon E., Abundez I., Arizmendi A., Quiroz E.M., Internal versus external cluster validation indexes, International Journal of Computers and Communications 5(1), 2011, 27-34.
 [3] Rezaei M., Franti P., Set matching measures for external cluster validity, IEEE Transactions on Knowledge and Data Engineering 28(8), 2016, 2173-2186. doi:10.1109/TKDE.2016.2551240.
 [4] Vinh N.X., Epps J., Bailey J., Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance, Journal of Machine Learning Research 11, 2010, 2837-2854.
Examples
>>> x = np.r_[1, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2]
>>> y = np.r_[2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1]
>>> C = genieclust.compare_partitions.confusion_matrix(x, y)
>>> C
array([[ 1, 10],
       [ 8,  2]])
>>> {k : round(v, 2) for k, v in
...      genieclust.compare_partitions.compare_partitions(C).items()}
{'ar': 0.49, 'r': 0.74, 'fm': 0.73, 'afm': 0.49, 'mi': 0.29, 'nmi': 0.41, 'ami': 0.39, 'nacc': 0.71, 'psi': 0.65}
>>> {k : round(v, 2) for k, v in
...      genieclust.compare_partitions.compare_partitions2(x,y).items()}
{'ar': 0.49, 'r': 0.74, 'fm': 0.73, 'afm': 0.49, 'mi': 0.29, 'nmi': 0.41, 'ami': 0.39, 'nacc': 0.71, 'psi': 0.65}
>>> round(genieclust.compare_partitions.adjusted_rand_score(x, y), 2)
0.49

genieclust.compare_partitions.compare_partitions2(x, y)
Computes a series of partition similarity scores
 Parameters
 x, y : array_like
Two vectors of “small” integers of identical lengths, representing two partitions of the same set.
 Returns
 scores : dict
See genieclust.compare_partitions.compare_partitions.
See also
genieclust.compare_partitions.compare_partitions
The underlying function
genieclust.compare_partitions.confusion_matrix
Determines the contingency table
Notes
Calls genieclust.compare_partitions.compare_partitions(C), where C = genieclust.compare_partitions.confusion_matrix(x, y).

genieclust.compare_partitions.confusion_matrix(x, y)
Computes the confusion matrix for two label vectors
 Parameters
 x, y : array_like
Two vectors of “small” integers of identical lengths.
 Returns
 C : ndarray
A (dense) confusion matrix (contingency table) with max(x)-min(x)+1 rows and max(y)-min(y)+1 columns.
See also
genieclust.compare_partitions.normalize_confusion_matrix
Applies pivoting
Examples
>>> x = np.r_[1, 2, 1, 2, 2, 2, 3, 1, 2, 1, 2, 1, 2, 2]
>>> y = np.r_[3, 3, 3, 3, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2]
>>> C = genieclust.compare_partitions.confusion_matrix(x, y)
>>> C
array([[1, 0, 4],
       [0, 6, 2],
       [0, 0, 1]])
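To make the shape of the result concrete, here is a pure-Python sketch of how such a contingency table can be counted. The name confusion_matrix_sketch is hypothetical and this is not the library's code; it only mirrors the documented convention that row and column indices are the labels shifted by their respective minima.

```python
def confusion_matrix_sketch(x, y):
    """Count co-occurrences of integer labels in x and y (lists of equal length)."""
    if len(x) != len(y):
        raise ValueError("x and y must have identical lengths")
    xmin, ymin = min(x), min(y)
    K = max(x) - xmin + 1   # number of rows
    L = max(y) - ymin + 1   # number of columns
    C = [[0] * L for _ in range(K)]
    for xi, yi in zip(x, y):
        C[xi - xmin][yi - ymin] += 1
    return C


x = [1, 2, 1, 2, 2, 2, 3, 1, 2, 1, 2, 1, 2, 2]
y = [3, 3, 3, 3, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2]
print(confusion_matrix_sketch(x, y))  # [[1, 0, 4], [0, 6, 2], [0, 0, 1]]
```

Because the table is dense, labels that are large or sparse integers would produce many empty rows or columns, which is presumably why the inputs are described as vectors of "small" integers.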

genieclust.compare_partitions.fm_score(x, y)
The original Fowlkes-Mallows index (not adjusted for chance)
 Parameters
 x, y : array_like
Two vectors of “small” integers of identical lengths, representing two partitions of the same set.
 Returns
 double
Partition similarity measure.
See also
genieclust.compare_partitions.compare_partitions
Computes multiple similarity scores based on a confusion matrix
genieclust.compare_partitions.compare_partitions2
Computes multiple similarity scores based on two label vectors
Notes
See genieclust.compare_partitions.compare_partitions for more details.

genieclust.compare_partitions.mi_score(x, y)
Mutual information score
 Parameters
 x, y : array_like
Two vectors of “small” integers of identical lengths, representing two partitions of the same set.
 Returns
 double
Partition similarity measure.
See also
genieclust.compare_partitions.compare_partitions
Computes multiple similarity scores based on a confusion matrix
genieclust.compare_partitions.compare_partitions2
Computes multiple similarity scores based on two label vectors
Notes
See genieclust.compare_partitions.compare_partitions for more details.

genieclust.compare_partitions.normalize_confusion_matrix(C)
Applies pivoting to a given confusion matrix
 Parameters
 C : ndarray
A c_contiguous confusion matrix (contingency table).
 Returns
 ndarray
A normalised confusion matrix of the same shape as C.
See also
genieclust.compare_partitions.confusion_matrix
Determines the confusion matrix
Notes
This function permutes the columns of C so as to relocate the largest elements in each row onto the main diagonal.
It may come in handy whenever C summarises the results generated by clustering algorithms, where actual label values do not matter (e.g., (1, 2, 0) can be remapped to (0, 2, 1) with no change in meaning).
Examples
>>> x = np.r_[1, 2, 1, 2, 2, 2, 3, 1, 2, 1, 2, 1, 2, 2]
>>> y = np.r_[3, 3, 3, 3, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2]
>>> C = genieclust.compare_partitions.confusion_matrix(x, y)
>>> C
array([[1, 0, 4],
       [0, 6, 2],
       [0, 0, 1]])
>>> genieclust.compare_partitions.normalize_confusion_matrix(C)
array([[4, 0, 1],
       [2, 6, 0],
       [1, 0, 0]])
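One simple way to perform such pivoting is a greedy column swap, sketched below in pure Python. The name pivot_columns_greedy is hypothetical, and the greedy row-by-row strategy is an assumption; the library may select the permutation differently (for instance, by solving an assignment problem), so results can differ on other inputs.

```python
def pivot_columns_greedy(C):
    """Greedily permute columns so that large entries land on the main diagonal."""
    C = [list(row) for row in C]  # work on a copy; the input is left untouched
    K, L = len(C), len(C[0])
    for i in range(min(K, L)):
        # Among the columns not yet fixed, find the largest entry in row i
        w = max(range(i, L), key=lambda j: C[i][j])
        if w != i:
            for row in C:  # swap whole columns i and w
                row[i], row[w] = row[w], row[i]
    return C


C = [[1, 0, 4], [0, 6, 2], [0, 0, 1]]
print(pivot_columns_greedy(C))  # [[4, 0, 1], [2, 6, 0], [1, 0, 0]]
```

On the matrix from the example above, this greedy procedure happens to reproduce the documented output.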

genieclust.compare_partitions.normalized_accuracy(x, y)
Normalised accuracy
 Parameters
 x, y : array_like
Two vectors of “small” integers of identical lengths, representing two partitions of the same set.
 Returns
 double
Partition similarity measure.
See also
genieclust.compare_partitions.compare_partitions
Computes multiple similarity scores based on a confusion matrix
genieclust.compare_partitions.compare_partitions2
Computes multiple similarity scores based on two label vectors
Notes
See genieclust.compare_partitions.compare_partitions for more details.

genieclust.compare_partitions.normalized_confusion_matrix(x, y)
Computes the confusion matrix for two label vectors and applies pivoting
 Parameters
 x, y : array_like
Two vectors of “small” integers of identical lengths.
 Returns
 C : ndarray
A (dense) confusion matrix (contingency table) with max(x)-min(x)+1 rows and max(y)-min(y)+1 columns.
See also
genieclust.compare_partitions.normalize_confusion_matrix
Applies pivoting
genieclust.compare_partitions.confusion_matrix
Determines the confusion matrix
Examples
>>> x = np.r_[1, 2, 1, 2, 2, 2, 3, 1, 2, 1, 2, 1, 2, 2]
>>> y = np.r_[3, 3, 3, 3, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2]
>>> genieclust.compare_partitions.normalized_confusion_matrix(x, y)
array([[4, 0, 1],
       [2, 6, 0],
       [1, 0, 0]])

genieclust.compare_partitions.normalized_mi_score(x, y)
Normalised mutual information score \((\mathrm{NMI}_\mathrm{sum})\)
 Parameters
 x, y : array_like
Two vectors of “small” integers of identical lengths, representing two partitions of the same set.
 Returns
 double
Partition similarity measure.
See also
genieclust.compare_partitions.compare_partitions
Computes multiple similarity scores based on a confusion matrix
genieclust.compare_partitions.compare_partitions2
Computes multiple similarity scores based on two label vectors
Notes
See genieclust.compare_partitions.compare_partitions for more details.

genieclust.compare_partitions.pair_sets_index(x, y)
Pair Sets Index (PSI) adjusted for chance
 Parameters
 x, y : array_like
Two vectors of “small” integers of identical lengths, representing two partitions of the same set.
 Returns
 double
Partition similarity measure.
See also
genieclust.compare_partitions.compare_partitions
Computes multiple similarity scores based on a confusion matrix
genieclust.compare_partitions.compare_partitions2
Computes multiple similarity scores based on two label vectors
Notes
See genieclust.compare_partitions.compare_partitions for more details.

genieclust.compare_partitions.rand_score(x, y)
The original Rand index (not adjusted for chance)
 Parameters
 x, y : array_like
Two vectors of “small” integers of identical lengths, representing two partitions of the same set.
 Returns
 double
Partition similarity measure.
See also
genieclust.compare_partitions.compare_partitions
Computes multiple similarity scores based on a confusion matrix
genieclust.compare_partitions.compare_partitions2
Computes multiple similarity scores based on two label vectors
Notes
See genieclust.compare_partitions.compare_partitions for more details.