mst: Minimum Spanning Tree of the Pairwise Distance Graph¶
An parallelised implementation of a Jarnik (Prim/Dijkstra)-like algorithm for determining a(*) minimum spanning tree (MST) of a complete undirected graph representing a set of n points with weights given by a pairwise distance matrix.
(*) Note that there might be multiple minimum trees spanning a given graph.
mst(d, ...) ## Default S3 method: mst( d, distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"), M = 1L, cast_float32 = TRUE, verbose = FALSE, ... ) ## S3 method for class 'dist' mst(d, M = 1L, verbose = FALSE, ...)
either a numeric matrix (or an object coercible to one, e.g., a data frame with numeric-like columns) or an object of class
further arguments passed to or from other methods.
metric used to compute the linkage, one of:
logical; whether to compute the distances using 32-bit instead of 64-bit precision floating-point arithmetic (up to 2x faster).
logical; whether to print diagnostic messages and progress information.
d is a numeric matrix of size n p, the n (n-1)/2 distances are computed on the fly, so that O(n M) memory is used.
The algorithm is parallelised; set the
OMP_NUM_THREADS environment variable
Sys.setenv to control the number of threads used.
Time complexity is O(n^2) for the method accepting an object of class
dist and O(p n^2) otherwise.
M >= 2, then the mutual reachability distance m(i,j) with smoothing factor
M (see Campello et al. 2015) is used instead of the chosen “raw” distance d(i,j). It holds m(i, j)=\max(d(i,j), c(i), c(j)), where c(i) is d(i, k) with k being the (
M-1)-th nearest neighbour of i. This makes “noise” and “boundary” points being “pulled away” from each other. Genie++ clustering algorithm (see
gclust) with respect to the mutual reachability distance gains the ability to identify some observations are noise points.
Note that the case
M = 2 corresponds to the original distance, but we are determining the 1-nearest neighbours separately as well, which is a bit suboptimal; you can file a feature request if this makes your data analysis tasks too slow.
Matrix of class
mst with n-1 rows and 3 columns:
dist. It holds
dist is sorted nondecreasingly. The i-th row gives the i-th edge of the MST.
(from[i], to[i]) defines the vertices (in 1,…,n) and
dist[i] gives the weight, i.e., the distance between the corresponding points.
method attribute gives the name of the distance used. The
Labels attribute gives the labels of all the input points.
M > 1, the
nn attribute gives the indices of the
M-1 nearest neighbours of each point.
V. Jarnik, O jistem problemu minimalnim, Prace Moravske Prirodovedecke Spolecnosti 6 (1930) 57-63.
Olson C.F., Parallel algorithms for hierarchical clustering, Parallel Comput. 21 (1995) 1313-1325.
Prim R., Shortest connection networks and some generalisations, Bell Syst. Tech. J. 36 (1957) 1389-1401.
Campello R., Moulavi D., Zimek A., Sander J., Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Transactions on Knowledge Discovery from Data 10(1) (2015) 5:1-5:51.
The official online manual of genieclust at https://genieclust.gagolewski.com/
emst_mlpack() for a very fast alternative in case of (very) low-dimensional Euclidean spaces (and
M = 1).
library("datasets") data("iris") X <- iris[1:4] tree <- mst(X)