mst: Minimum Spanning Tree of the Pairwise Distance Graph
Description
A parallelised implementation of a Jarnik (Prim/Dijkstra)-like algorithm for determining a(*) minimum spanning tree (MST) of a complete undirected graph representing a set of n points, with edge weights given by a pairwise distance matrix.
(*) Note that there might be multiple minimum trees spanning a given graph.
Usage
mst(d, ...)
## Default S3 method:
mst(
d,
distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"),
M = 1L,
cast_float32 = TRUE,
verbose = FALSE,
...
)
## S3 method for class 'dist'
mst(d, M = 1L, verbose = FALSE, ...)
Arguments

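d
either a numeric matrix (or an object coercible to one, e.g., a data frame with numeric-like columns) or an object of class dist

...
further arguments passed to or from other methods

distance
metric used to compute the linkage, one of: "euclidean" (synonym: "l2"), "manhattan" (a.k.a. "l1" and "cityblock"), "cosine"

M
smoothing factor; M = 1 gives the selected distance; otherwise, the mutual reachability distance is used (see Details)

cast_float32
logical; whether to compute the distances using 32-bit instead of 64-bit precision floating-point arithmetic (up to 2x faster)

verbose
logical; whether to print diagnostic messages and progress information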
Details
If d is a numeric matrix of size \(n \times p\), the \(n(n-1)/2\) distances are computed on the fly, so that \(O(n M)\) memory is used.
The algorithm is parallelised; set the OMP_NUM_THREADS environment variable (see Sys.setenv) to control the number of threads used.
Time complexity is \(O(n^2)\) for the method accepting an object of class dist and \(O(p n^2)\) otherwise.
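For illustration, the thread count can be capped from within an R session as sketched below. The value "2" is arbitrary. Setting the variable before the package is loaded is the cautious choice assumed here, since whether an OpenMP runtime picks up a later change is not stated on this page.

```r
# Cap the number of OpenMP threads for this R session (the value is arbitrary).
# Assumption: this is done before the package is loaded.
Sys.setenv(OMP_NUM_THREADS = "2")

# Read the variable back to confirm what the session now exports
omp_threads <- Sys.getenv("OMP_NUM_THREADS")
print(omp_threads)
```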
If M >= 2, then the mutual reachability distance \(m(i,j)\) with smoothing factor M (see Campello et al. 2013) is used instead of the chosen "raw" distance \(d(i,j)\). It holds \(m(i, j)=\max(d(i,j), c(i), c(j))\), where \(c(i)\) is \(d(i, k)\) with \(k\) being the (M-1)-th nearest neighbour of \(i\). This makes "noise" and "boundary" points "pulled away" from each other. The Genie++ clustering algorithm (see gclust) applied to the mutual reachability distance gains the ability to identify some observations as noise points.
Note that the case M = 2 corresponds to the original distance, but the 1-nearest neighbours are still determined separately, which is a bit suboptimal; you can file a feature request if this makes your data analysis tasks too slow.
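The definition above can be reproduced in a few lines of base R on a full distance matrix. This is only a sketch for checking the formula, not the package's implementation: the helper name mutual_reachability is hypothetical, it needs \(O(n^2)\) memory, and it assumes that the point itself is not counted among its own nearest neighbours.

```r
# Hypothetical helper: mutual reachability distance from a full distance matrix D
mutual_reachability <- function(D, M) {
    stopifnot(M >= 2, M <= nrow(D))
    # c(i): distance from i to its (M-1)-th nearest neighbour; every sorted
    # row starts with the zero self-distance, hence position M
    core <- apply(D, 1, function(row) sort(row)[M])
    R <- pmax(D, outer(core, core, pmax))   # max(d(i,j), c(i), c(j))
    diag(R) <- 0                            # keep zero self-distances
    R
}

set.seed(42)
X <- matrix(rnorm(30), ncol = 3)            # 10 random points in 3 dimensions
D <- as.matrix(dist(X))

R2 <- mutual_reachability(D, 2)             # should coincide with D
R4 <- mutual_reachability(D, 4)             # distances can only grow
```

With M = 2, \(c(i)\) is the distance to the nearest neighbour, which never exceeds \(d(i,j)\), so the result equals the original distances, in line with the note above.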
Value
Matrix of class mst with n-1 rows and 3 columns: from, to and dist. It holds from < to. Moreover, dist is sorted nondecreasingly. The i-th row gives the i-th edge of the MST. (from[i], to[i]) defines the vertices (in 1,…,n) and dist[i] gives the weight, i.e., the distance between the corresponding points.
The method attribute gives the name of the distance used. The Labels attribute gives the labels of all the input points.
If M > 1, the nn attribute gives the indices of the M-1 nearest neighbours of each point.
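To make the layout of the returned matrix concrete, here is a naive, sequential \(O(n^2)\) Prim-like routine in base R that yields edges in the same from/to/dist form. It is a teaching sketch only: the function name prim_mst is hypothetical, it works on a precomputed full distance matrix, and it does not set the class or the attributes described above.

```r
# Hypothetical helper: naive Prim-like MST on a full distance matrix D
prim_mst <- function(D) {
    n <- nrow(D)
    in_tree <- logical(n)
    in_tree[1] <- TRUE                  # grow the tree from vertex 1
    best_dist <- D[1, ]                 # cheapest known link from the tree to each vertex
    best_from <- rep(1L, n)             # tree vertex realising that link
    edges <- matrix(NA_real_, nrow = n - 1, ncol = 3,
                    dimnames = list(NULL, c("from", "to", "dist")))
    for (e in seq_len(n - 1)) {
        cand <- which(!in_tree)
        j <- cand[which.min(best_dist[cand])]    # closest vertex outside the tree
        edges[e, ] <- c(min(best_from[j], j), max(best_from[j], j), best_dist[j])
        in_tree[j] <- TRUE
        closer <- !in_tree & D[j, ] < best_dist  # links improved by the new vertex
        best_dist[closer] <- D[j, closer]
        best_from[closer] <- j
    }
    # sort the edges by weight, as in the documented output
    edges[order(edges[, "dist"]), , drop = FALSE]
}

set.seed(123)
X <- matrix(rnorm(40), ncol = 2)        # 20 random points in the plane
tree <- prim_mst(as.matrix(dist(X)))
head(tree)
```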
References
Jarnik V., O jistem problemu minimalnim, Prace Moravske Prirodovedecke Spolecnosti 6, 1930, 57-63.
Olson C.F., Parallel algorithms for hierarchical clustering, Parallel Comput. 21, 1995, 1313-1325.
Prim R., Shortest connection networks and some generalisations, Bell Syst. Tech. J. 36, 1957, 1389-1401.
Campello R.J.G.B., Moulavi D., Sander J., Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160-172, doi:10.1007/978-3-642-37456-2_14.
See Also
The official online manual of genieclust at https://genieclust.gagolewski.com/
Gagolewski M., genieclust: Fast and robust hierarchical clustering, SoftwareX 15:100722, 2021, doi:10.1016/j.softx.2021.100722.
emst_mlpack() for a very fast alternative in case of (very) low-dimensional Euclidean spaces (and M = 1).
Examples
library("genieclust")
library("datasets")
data("iris")
X <- iris[1:4]
tree <- mst(X)
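A further sketch exercises the method for class dist and the M argument, and checks the properties listed under Value. It assumes that the installed genieclust version exports mst with the signature shown under Usage; M = 5L is an arbitrary choice, and the columns are addressed by position (from, to, dist in that order).

```r
if (requireNamespace("genieclust", quietly = TRUE)) {
    X <- as.matrix(datasets::iris[1:4])
    n <- nrow(X)

    tree_d <- genieclust::mst(dist(X))      # method for class 'dist'
    tree_m <- genieclust::mst(X, M = 5L)    # mutual reachability distance

    # properties stated in the Value section
    stopifnot(nrow(tree_m) == n - 1)
    stopifnot(all(tree_m[, 1] < tree_m[, 2]))   # from < to
    stopifnot(!is.unsorted(tree_m[, 3]))        # dist sorted nondecreasingly

    print(attr(tree_m, "method"))   # name of the distance used
    str(attr(tree_m, "nn"))         # indices of the M-1 nearest neighbours
}
```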