R Package genieclust Reference
adjusted_rand_score
Pairwise Partition Similarity Scores
Description

Let x and y represent two partitions of a set of \(n\) elements into \(K\) and \(L\) nonempty, pairwise disjoint subsets, respectively; e.g., two clusterings of a dataset with \(n\) observations represented as label vectors. These functions quantify the similarity between x and y. They can be used as external cluster validity measures, i.e., in the presence of reference (ground truth) partitions.
Usage
adjusted_rand_score(x, y = NULL)
rand_score(x, y = NULL)
adjusted_fm_score(x, y = NULL)
fm_score(x, y = NULL)
mi_score(x, y = NULL)
normalized_mi_score(x, y = NULL)
adjusted_mi_score(x, y = NULL)
normalized_accuracy(x, y = NULL)
pair_sets_index(x, y = NULL)
Arguments

x : an integer vector of length n (or an object coercible to one) representing a K-partition of an n-set, or a confusion matrix with K rows and L columns (see table(x, y))

y : an integer vector of length n (or an object coercible to one) representing an L-partition of the same set, or NULL (if x is a K*L confusion matrix)
Details

Every index except mi_score() (which computes the mutual information score) outputs 1 given two identical partitions. Note that partitions are always defined up to a bijection of the set of possible labels; e.g., (1, 1, 2, 1) and (4, 4, 2, 4) represent the same 2-partition.
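For instance, both label vectors below encode the same grouping, so the scores treat them as identical; a minimal sketch (assuming the genieclust package is attached):

library("genieclust")
# the same 2-partition under two different labellings:
adjusted_rand_score(c(1, 1, 2, 1), c(4, 4, 2, 4))  # yields 1
rand_score(c(1, 1, 2, 1), c(4, 4, 2, 4))           # yields 1, too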
rand_score() gives the Rand score (the “probability” of agreement between the two partitions) and adjusted_rand_score() is its version corrected for chance (see Hubert, Arabie, 1985); its expected value is 0.0 given two independent partitions. Due to the adjustment, the resulting index might also be negative for some inputs.
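An illustrative sketch of the chance correction with two random, independent labellings (the exact output varies with the random seed):

set.seed(123)
x <- sample(1:3, 100, replace = TRUE)  # a random 3-partition
y <- sample(1:3, 100, replace = TRUE)  # another, independent one
rand_score(x, y)           # typically noticeably greater than 0
adjusted_rand_score(x, y)  # close to 0 (and possibly slightly negative)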
Similarly, fm_score() gives the Fowlkes-Mallows (FM) score and adjusted_fm_score() is its adjusted-for-chance version; see (Hubert, Arabie, 1985).

Note that both the (unadjusted) Rand and FM scores are bounded from below by \(1/(K+1)\) if \(K=L\); hence, their adjusted versions are preferred.
mi_score(), adjusted_mi_score() and normalized_mi_score() are information-theoretic scores, based on mutual information; see the definition of \(AMI_{sum}\) and \(NMI_{sum}\) in (Vinh et al., 2010).
normalized_accuracy() is defined as \((Accuracy(C_\sigma)-1/L)/(1-1/L)\), where \(C_\sigma\) is a version of the confusion matrix for given x and y, \(K \leq L\), with columns permuted based on the solution to the Maximal Linear Sum Assignment Problem. \(Accuracy(C_\sigma)\) is sometimes referred to as Purity, e.g., in (Rendon et al., 2011).
pair_sets_index() gives the Pair Sets Index (PSI) adjusted for chance (Rezaei, Franti, 2016), \(K \leq L\). Pairing is based on the solution to the Linear Sum Assignment Problem of a transformed version of the confusion matrix.
Value
A single real value giving the similarity score.
References

Hubert L., Arabie P., Comparing partitions, Journal of Classification 2(1), 1985, 193-218, esp. Eqs. (2) and (4).
Rendon E., Abundez I., Arizmendi A., Quiroz E.M., Internal versus external cluster validation indexes, International Journal of Computers and Communications 5(1), 2011, 27-34.
Rezaei M., Franti P., Set matching measures for external cluster validity, IEEE Transactions on Knowledge and Data Engineering 28(8), 2016, 2173-2186.
Vinh N.X., Epps J., Bailey J., Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance, Journal of Machine Learning Research 11, 2010, 2837-2854.
Examples

y_true <- iris[[5]]
y_pred <- kmeans(as.matrix(iris[1:4]), 3)$cluster
adjusted_rand_score(y_true, y_pred)
rand_score(table(y_true, y_pred)) # the same
adjusted_fm_score(y_true, y_pred)
fm_score(y_true, y_pred)
mi_score(y_true, y_pred)
normalized_mi_score(y_true, y_pred)
adjusted_mi_score(y_true, y_pred)
normalized_accuracy(y_true, y_pred)
pair_sets_index(y_true, y_pred)
emst_mlpack

Euclidean Minimum Spanning Tree

Description

Provides access to an implementation of the Dual-Tree Borůvka algorithm based on K-d trees. It is fast for (very) low-dimensional Euclidean spaces. For higher-dimensional spaces (say, over 5 features) or other metrics, use the parallelised Prim-like algorithm implemented in mst().
Usage
emst_mlpack(X, verbose = FALSE)
Arguments

X : a numeric matrix (or an object coercible to one, e.g., a data frame with numeric-like columns)

verbose : logical; whether to print diagnostic messages
Details

Calls emstreeR::mlpack_mst() and converts the result so that it is compatible with the output of mst().

If the emstreeR package is not available, an error is generated.
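A defensive usage sketch that guards against the soft dependency being absent (assuming the iris data as input):

X <- as.matrix(iris[1:4])
if (requireNamespace("emstreeR", quietly = TRUE)) {
    tree <- emst_mlpack(X)  # fast Dual-Tree Boruvka MST
} else {
    tree <- mst(X)          # generic parallelised Prim-like fallback
}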
Value

An object of class mst; see mst() for details.
References

March W.B., Ram P., Gray A.G., Fast Euclidean Minimum Spanning Tree: Algorithm, Analysis, and Applications, Proc. ACM SIGKDD'10, 2010, 603-611, https://mlpack.org/papers/emst.pdf .
Curtin R.R., Edel M., Lozhnikov M., Mentekidis Y., Ghaisas S., Zhang S., mlpack 3: A fast, flexible machine learning library, Journal of Open Source Software 3(26), 726, 2018.
gclust

The Genie++ Hierarchical Clustering Algorithm

Description

A reimplementation of Genie - a robust and outlier-resistant clustering algorithm (see Gagolewski, Bartoszuk, Cena, 2016). The Genie algorithm is based on a minimum spanning tree (MST) of the pairwise distance graph of a given point set. Just like single linkage, it consumes the edges of the MST in increasing order of weights. However, it prevents the formation of clusters of highly imbalanced sizes; once the Gini index (see gini_index()) of the cluster size distribution rises above gini_threshold, a forced merge of a point group of the smallest size is performed. Its appealing simplicity goes hand in hand with its usability; Genie often outperforms other clustering approaches on benchmark data, such as https://github.com/gagolews/clustering_benchmarks_v1 .
The clustering can now also be computed with respect to the mutual reachability distance (based, e.g., on the Euclidean metric), which is used in the definition of the HDBSCAN* algorithm (see Campello et al., 2015). If M > 1, then the mutual reachability distance \(m(i,j)\) with smoothing factor M is used instead of the chosen “raw” distance \(d(i,j)\). It holds \(m(i,j)=\max(d(i,j), c(i), c(j))\), where \(c(i)\) is \(d(i,k)\) with \(k\) being the (M-1)-th nearest neighbour of \(i\). This causes “noise” and “boundary” points to be “pulled away” from each other.
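A minimal sketch of how the mutual reachability distance can be derived from a raw distance matrix - a plain R illustration of the formula above, not how the package computes it internally:

D <- as.matrix(dist(iris[1:4]))  # raw pairwise Euclidean distances
M <- 5                           # smoothing factor
# c(i) = distance to the (M-1)-th nearest neighbour of point i
# (each row of D includes d(i,i) = 0, hence the M-th order statistic):
core <- apply(D, 1, function(row) sort(row)[M])
# m(i,j) = max(d(i,j), c(i), c(j)):
Mreach <- pmax(D, outer(core, core, pmax))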
The Genie correction together with the smoothing factor M > 1 (note that M = 2 corresponds to the original distance) gives a robustified version of the HDBSCAN* algorithm that is able to detect a predefined number of clusters. Hence, it does not depend on DBSCAN's somewhat magical eps parameter or HDBSCAN's min_cluster_size one.
Usage
gclust(d, ...)
gclust.default(
d,
gini_threshold = 0.3,
distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"),
cast_float32 = TRUE,
verbose = FALSE,
...
)
gclust.dist(d, gini_threshold = 0.3, verbose = FALSE, ...)
gclust.mst(d, gini_threshold = 0.3, verbose = FALSE, ...)
genie(d, ...)
genie.default(
d,
k,
gini_threshold = 0.3,
distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"),
M = 1L,
postprocess = c("boundary", "none", "all"),
detect_noise = M > 1L,
cast_float32 = TRUE,
verbose = FALSE,
...
)
genie.dist(
d,
k,
gini_threshold = 0.3,
M = 1L,
postprocess = c("boundary", "none", "all"),
detect_noise = M > 1L,
verbose = FALSE,
...
)
genie.mst(
d,
k,
gini_threshold = 0.3,
postprocess = c("boundary", "none", "all"),
detect_noise = FALSE,
verbose = FALSE,
...
)
Arguments

d : a numeric matrix (or an object coercible to one, e.g., a data frame with numeric-like columns), an object of class dist, or an object of class mst

... : further arguments passed to other methods

gini_threshold : threshold for the Genie correction, i.e., the Gini index of the cluster size distribution; a threshold of 1.0 disables the correction; low thresholds highly penalise the formation of small clusters

distance : metric used to compute the linkage, one of: "euclidean" (synonym: "l2"), "manhattan" (synonyms: "cityblock", "l1"), "cosine"

cast_float32 : logical; whether to compute the distances using 32-bit instead of 64-bit precision floating-point arithmetic (up to 2x faster)

verbose : logical; whether to print diagnostic messages and progress information

k : the desired number of clusters to detect

M : smoothing factor (see Details)

postprocess : one of "boundary" (default), "none", "all"; in effect only if M > 1 - by default, only the "boundary" points are merged with their nearest "core" points; to force a classical k-partition of the data set (with no notion of noise), choose "all"

detect_noise : logical; whether the minimum spanning tree's leaves should be marked as noise points; defaults to M > 1L
Details

Note that, as in the case of all distance-based methods, standardising the input features is definitely worth giving a try.
If d is a numeric matrix or an object of class dist, mst() will be called to compute an MST, which generally takes at most \(O(n^2)\) time (the algorithm we provide is parallelised; the environment variable OMP_NUM_THREADS controls the number of threads in use). However, see emst_mlpack() for a very fast alternative in the case of Euclidean spaces of (very) low dimensionality and M = 1.
Given a minimum spanning tree, the algorithm runs in \(O(n \sqrt{n})\) time. Therefore, if you want to test different gini_thresholds (or ks), it is best to explicitly compute the MST first.
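For example, a sketch of reusing a single precomputed MST across several thresholds (assuming the iris data):

X <- as.matrix(iris[1:4])
tree <- mst(X)                            # computed once - the expensive part
h1 <- gclust(tree, gini_threshold = 0.3)  # dispatches to gclust.mst()
h2 <- gclust(tree, gini_threshold = 0.5)  # reuses the same tree
y1 <- cutree(h1, 3)
y2 <- cutree(h2, 3)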
According to the algorithm's original definition, the resulting partition tree (dendrogram) might violate the ultrametricity property (merges might occur at levels that are not increasing w.r.t. a between-cluster distance). Departures from ultrametricity are corrected by applying height = rev(cummin(rev(height))).
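To see why this fix restores monotonicity, note that rev(cummin(rev(height))) replaces each merge height with the minimum of itself and all later heights. A quick sanity check on the returned object:

h <- gclust(as.matrix(iris[1:4]))
all(diff(h$height) >= 0)  # TRUE - merge heights are nondecreasing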
Value

gclust() computes the whole clustering hierarchy; it returns a list of class hclust, see hclust(). Use cutree() to obtain an arbitrary k-partition.

genie() returns a k-partition - a vector with elements in 1,...,k, whose i-th element denotes the i-th input point's cluster identifier. Missing values (NA) denote noise points (if detect_noise is TRUE).
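A sketch of noise-point detection via the mutual reachability distance (M > 1 switches detect_noise on by default; how many points, if any, are flagged depends on the data):

y <- genie(as.matrix(iris[1:4]), k = 3, M = 5)
table(y, useNA = "ifany")  # NA entries, if any, are the detected noise points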
See Also

mst() for the minimum spanning tree routines.

adjusted_rand_score() (amongst others) for external cluster validity measures (partition similarity scores).
References

Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8-23.
Campello R., Moulavi D., Zimek A., Sander J., Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Transactions on Knowledge Discovery from Data 10(1), 2015, 5:1-5:51.
Examples

library("datasets")
data("iris")
X <- iris[1:4]
h <- gclust(X)
y_pred <- cutree(h, 3)
y_test <- iris[,5]
plot(iris[,2], iris[,3], col=y_pred,
   pch=as.integer(iris[,5]), asp=1, las=1)
adjusted_rand_score(y_test, y_pred)
pair_sets_index(y_test, y_pred)
## Fast for low-dimensional Euclidean spaces:
if (require("emstreeR")) h <- gclust(emst_mlpack(X))
genieclust-package

The Genie++ Hierarchical Clustering Algorithm (with Extras)

Description

See genie() for more details.

Author

Marek Gagolewski
gini_index

Inequity (Inequality) Measures

Description
gini_index() gives the normalised Gini index and bonferroni_index() implements the Bonferroni index.
Usage
gini_index(x)
bonferroni_index(x)
Arguments

x : numeric vector of non-negative values
Details

Both indices can be used to quantify the “inequity” of a numeric sample. They can be perceived as measures of data dispersion. For constant vectors (perfect equity), the indices yield values of 0. Vectors with all elements but one equal to 0 (perfect inequity) are assigned scores of 1. Both indices follow the Pigou-Dalton principle (are Schur-convex): setting \(x_i = x_i - h\) and \(x_j = x_j + h\) with \(h > 0\) and \(x_i - h \geq x_j + h\) (taking from the “rich” and giving to the “poor”) decreases the inequity.

These indices have applications in economics, amongst others. The Genie clustering algorithm uses the Gini index as a measure of the inequality of cluster sizes.

The normalised Gini index is given by:

\[G(x_1,\dots,x_n) = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} |x_i - x_j|}{(n-1) \sum_{i=1}^{n} x_i}.\]

The normalised Bonferroni index is given by:

\[B(x_1,\dots,x_n) = \frac{\sum_{i=1}^{n} \left(n - \sum_{j=1}^{i} \frac{n}{n-j+1}\right) x_{\sigma(n-i+1)}}{(n-1) \sum_{i=1}^{n} x_i}.\]

Time complexity: \(O(n)\) for sorted (increasingly) data. Otherwise, the vector will be sorted.

In particular, for ordered inputs, it holds:

\[G(x_1,\dots,x_n) = \frac{\sum_{i=1}^{n} (n - 2i + 1) x_{\sigma(n-i+1)}}{(n-1) \sum_{i=1}^{n} x_i},\]

where \(\sigma\) is an ordering permutation of \((x_1,\dots,x_n)\).
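A sketch cross-checking the ordered-input formula above against gini_index() (plain R, for illustration only):

x <- c(6, 0, 3, 1, 0)
n <- length(x)
xs <- sort(x)  # increasing order, so xs[n - i + 1] is the i-th largest value
G <- sum((n - 2 * seq_len(n) + 1) * xs[n - seq_len(n) + 1]) / ((n - 1) * sum(x))
G              # should match:
gini_index(x)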
Value
The value of the inequity index, a number in \([0, 1]\) .
References

Bonferroni C., Elementi di Statistica Generale, Libreria Seber, Firenze, 1930.
Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8-23. doi:10.1016/j.ins.2016.05.003
Gini C., Variabilità e Mutabilità, Tipografia di Paolo Cuppini, Bologna, 1912.
Examples
gini_index(c(2, 2, 2, 2, 2)) # no inequality
gini_index(c(0, 0, 10, 0, 0)) # one has it all
gini_index(c(7, 0, 3, 0, 0)) # give to the poor, take away from the rich
gini_index(c(6, 0, 3, 1, 0)) # (a.k.a. Pigou-Dalton principle)
bonferroni_index(c(2, 2, 2, 2, 2))
bonferroni_index(c(0, 0, 10, 0, 0))
bonferroni_index(c(7, 0, 3, 0, 0))
bonferroni_index(c(6, 0, 3, 1, 0))
mst

Minimum Spanning Tree of the Pairwise Distance Graph

Description

A parallelised implementation of a Jarník (Prim/Dijkstra)-like algorithm for determining a(*) minimum spanning tree (MST) of a complete undirected graph representing a set of n points with weights given by a pairwise distance matrix.

(*) Note that there might be multiple minimum trees spanning a given graph.
Usage
mst(d, ...)
mst.default(
d,
distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"),
M = 1L,
cast_float32 = TRUE,
verbose = FALSE,
...
)
mst.dist(d, M = 1L, verbose = FALSE, ...)
Arguments

d : either a numeric matrix (or an object coercible to one, e.g., a data frame with numeric-like columns) or an object of class dist

... : further arguments passed to or from other methods

distance : metric used to compute the linkage, one of: "euclidean" (synonym: "l2"), "manhattan" (synonyms: "cityblock", "l1"), "cosine"

M : smoothing factor (see Details)

cast_float32 : logical; whether to compute the distances using 32-bit instead of 64-bit precision floating-point arithmetic (up to 2x faster)

verbose : logical; whether to print diagnostic messages and progress information
Details

If d is a numeric matrix of size \(n \times p\), the \(n(n-1)/2\) distances are computed on the fly, so that \(O(n M)\) memory is used.
The algorithm is parallelised; set the OMP_NUM_THREADS environment variable (see Sys.setenv()) to control the number of threads used.
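For instance (an illustrative thread count; set it before the distances are computed):

Sys.setenv("OMP_NUM_THREADS" = 4)  # use up to 4 threads
tree <- mst(as.matrix(iris[1:4]))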
Time complexity is \(O(n^2)\) for the method accepting an object of class dist and \(O(p n^2)\) otherwise.
If M >= 2, then the mutual reachability distance \(m(i,j)\) with smoothing factor M (see Campello et al., 2015) is used instead of the chosen “raw” distance \(d(i,j)\). It holds \(m(i,j)=\max(d(i,j), c(i), c(j))\), where \(c(i)\) is \(d(i,k)\) with \(k\) being the (M-1)-th nearest neighbour of \(i\). This causes “noise” and “boundary” points to be “pulled away” from each other. The Genie++ clustering algorithm (see gclust()) with respect to the mutual reachability distance gains the ability to identify some observations as noise points.
Note that the case M = 2 corresponds to the original distance, but we are determining the 1-nearest neighbours separately as well, which is a bit suboptimal; you can file a feature request if this makes your data analysis tasks too slow.
Value

Matrix of class mst with n-1 rows and 3 columns: from, to, and dist. It holds from < to. Moreover, dist is sorted nondecreasingly. The i-th row gives the i-th edge of the MST. (from[i], to[i]) defines the vertices (in 1,...,n) and dist[i] gives the weight, i.e., the distance between the corresponding points.

The method attribute gives the name of the distance used. The Labels attribute gives the labels of all the input points.

If M > 1, the nn attribute gives the indices of the M-1 nearest neighbours of each point.
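A short sketch of inspecting the returned object (relying only on the structure documented above):

tree <- mst(as.matrix(iris[1:4]))
head(tree)                    # first few edges: from, to, dist
all(diff(tree[, 3]) >= 0)     # TRUE - edge weights (3rd column) sorted nondecreasingly
attr(tree, "method")          # name of the distance used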
See Also

emst_mlpack() for a very fast alternative in the case of (very) low-dimensional Euclidean spaces (and M = 1).
References
V. Jarník, O jistém problému minimálním, Práce Moravské Přírodovědecké Společnosti 6 (1930) 57–63.
C.F. Olson, Parallel algorithms for hierarchical clustering, Parallel Comput. 21 (1995) 1313–1325.
R. Prim, Shortest connection networks and some generalisations, Bell Syst. Tech. J. 36 (1957) 1389–1401.
Campello R., Moulavi D., Zimek A., Sander J., Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Transactions on Knowledge Discovery from Data 10(1) (2015) 5:1–5:51.
Examples

library("datasets")
data("iris")
X <- iris[1:4]
tree <- mst(X)