Example: String Data and Grouping of DNA¶
The genieclust package also allows for clustering of character string data. Let’s perform an example grouping based on Levenshtein’s edit distance.
We’ll use one of the benchmark datasets mentioned in [GBC16] as an example:
import numpy as np
# see https://github.com/gagolews/genieclust/tree/master/devel/sphinx/rmd
strings = np.loadtxt("actg1.data.gz", dtype=np.str).tolist()
strings[:5] # preview
## ['tataacaaccctgattacatcaagctacgctccggtgcgttgcctcggacgagtgctaatccctccccactgactgtattcatcttgacaata',
## 'atgtctccaaagcgtgaccttctagacccgagacgacatatggaggcttggagccgtacctgtgtgaggaaactgtagtacccaaagctattca',
## 'gcaattgaagtccagatctaggtatcgtccaagcatattgcctttaagaaatatatttgaccctgtctcttcgtggaggtacacgtcacggaatcgtaagatttccttgg',
## 'gacaattatcgcggctttcgccatgcagagtctcgtacaatttgtttcacgcccaatattttccgtgcttcgcgagctaggcagccagggcatttttgga',
## 'ttagagcgcttaaccccacaggaaccgagttcccctcatgtggcaaggttctcccgcctcaggtatcacagaaacaaggtatgtagccctaggctacgagc']
It comes with a set of reference labels, giving the “true” grouping assigned by an expert:
labels_true = np.loadtxt("actg1.labels0.gz", dtype=np.intp)-1
n_clusters = len(np.unique(labels_true))
print(n_clusters)
## 20
Clustering in the string domain relies on the near-neighbour search routines implemented in the nmslib package.
import genieclust
g = genieclust.Genie(
n_clusters=n_clusters,
exact=False, # use nmslib
cast_float32=False, # do not convert the string list to a matrix
nmslib_params_index=dict(post=0), # faster
affinity="leven")
labels_pred = g.fit_predict(strings)
The adjusted Rand index can be used as an external cluster validity metric:
genieclust.compare_partitions.adjusted_rand_score(labels_true, labels_pred)
## 0.9352814722212013
This indicates a very high degree of similarity between the reference and the obtained clusterings.