Basic Usage (R)

The latest stable release of the R package genieclust is available from the CRAN repository. Install it by calling:

install.packages("genieclust")

Below are a few basic examples of interacting with the package.

library("genieclust")

Let’s take the Sustainable Society Indices dataset that measures the Human, Environmental, and Economic Wellbeing in each country. There are seven categories on the scale \([0, 10]\).

# see https://github.com/gagolews/genieclust/tree/master/devel/sphinx/weave
ssi <- read.csv("ssi_2016_categories.csv", comment.char="#")
X <- as.matrix(ssi[,-1])    # everything except the Country (1st) column
dimnames(X)[[1]] <- ssi[,1] # set row names
head(X)  # preview
##           BasicNeeds PersonalDevelopmentAndHealth WellBalancedSociety
## Albania       9.6058                       7.9596              6.9926
## Algeria       9.0212                       7.3365              4.2039
## Angola        5.9728                       5.6928              2.1401
## Argentina     9.8320                       8.3506              3.8952
## Armenia       9.4469                       7.4205              6.2892
## Australia    10.0000                       8.5909              6.1055
##           NaturalResources ClimateAndEnergy Transition Economy
## Albania             6.6343           4.6217     2.1025  3.0565
## Algeria             5.2772           2.6627     3.0741  6.1543
## Angola              6.7594           6.2122     1.8988  3.7535
## Argentina           5.4535           3.3003     6.3899  5.3406
## Armenia             6.4363           2.8543     2.4342  3.8296
## Australia           4.1307           1.6278     7.5395  7.5931

The API of genieclust is compatible with R’s workhorse for hierarchical clustering, stats::hclust(), which accepts a complete pairwise distance matrix. Yet, for better performance, it is better to feed genieclust::gclust() with the input data matrix directly:

# faster than gclust(dist(X)):
h <- gclust(X)  # default: gini_threshold=0.3, distance="euclidean"
print(h)
## 
## Call:
## gclust.mst(d = tree, gini_threshold = gini_threshold, verbose = verbose)
## 
## Cluster method   : Genie(0.3) 
## Distance         : euclidean 
## Number of objects: 154

In order to extract a desired k-partition, we can call stats::cutree():

y_pred <- cutree(h, k=3)  # also: genie(X, 3)
sample(y_pred, 25)  # preview
##                Iran            Slovenia             Morocco 
##                   1                   3                   1 
## Trinidad and Tobago        Sierra Leone            Colombia 
##                   1                   2                   1 
##        Turkmenistan               Syria             Nigeria 
##                   1                   1                   2 
##             Tunisia               Ghana              Sweden 
##                   1                   2                   1 
##         Netherlands              Bhutan              Mexico 
##                   3                   1                   1 
##             Jamaica        Saudi Arabia             Hungary 
##                   1                   1                   3 
##                Peru          Kazakhstan               Spain 
##                   1                   1                   3 
##               Japan        Korea, North               China 
##                   1                   1                   1 
##             Liberia 
##                   2

This gives the cluster IDs allocated to each country.

Note

If we are only interested in the partition of a specific cardinality, calling genie() directly will be slightly faster than referring to cutree(gclust(...)).

Let’s depict the obtained partition using the rworldmap package:

library("rworldmap")  # see the package's manual for details
mapdata <- data.frame(Country=dimnames(X)[[1]], Cluster=y_pred)
mapdata <- joinCountryData2Map(mapdata, joinCode="NAME", nameJoinColumn="Country")
mapCountryData(mapdata, nameColumnToPlot="Cluster", catMethod="categorical",
    missingCountryCol="white", colourPalette=palette.colors(3, "R4"),
    mapTitle="")
../_images/ssi-map-1.png

Figure 15 Countries grouped with respect to the SSI categories

We can compute, for instance, the average indicators in each identified group:

t(aggregate(as.data.frame(X), list(Cluster=y_pred), mean))[-1, ]
##                                [,1]   [,2]   [,3]
## BasicNeeds                   9.0679 5.2689 9.8178
## PersonalDevelopmentAndHealth 7.5081 5.9312 8.2995
## WellBalancedSociety          4.8869 2.8682 6.8272
## NaturalResources             5.6633 7.0040 6.3743
## ClimateAndEnergy             3.6241 7.0818 3.5947
## Transition                   4.0749 2.6300 7.3402
## Economy                      5.5127 3.5411 4.2742

Dendrograms

Dendrogram plotting is also possible. For better readability, we will restrict ourselves to a smaller sample; namely, to the 37 members of the OECD:

oecd <- c("Australia", "Austria", "Belgium", "Canada", "Chile", "Colombia",
"Czech Republic", "Denmark", "Estonia", "Finland", "France", "Germany",
"Greece", "Hungary", "Iceland", "Ireland", "Israel", "Italy", "Japan",
"Korea, South", "Latvia", "Lithuania", "Luxembourg", "Mexico", "Netherlands",
"New Zealand", "Norway", "Poland", "Portugal", "Slovak Republic", "Slovenia",
"Spain", "Sweden", "Switzerland", "Turkey", "United Kingdom", "United States")
X_oecd <- X[dimnames(X)[[1]] %in% oecd, ]
h_oecd <- gclust(X_oecd)
plot(as.dendrogram(h_oecd), horiz=TRUE)
../_images/ssi-oecd-dendrogram-1.png

Figure 16 Cluster dendrogram for the OECD countries

Outlier Detection with Deadwood

The Deadwood outlier detection algorithm can be run on the clustered dataset to identify anomalous points in each cluster. See the deadwood package tutorials for more details.

Remarks

genieclust also features partition similarity scores (such as the adjusted Rand index) and internal cluster validity measures (e.g., the Caliński-Harabasz index).

For more details, refer to the package’s documentation.

To learn more about R, check out Marek’s open-access textbook Deep R Programming [6].