genieclust.internal#

Internal functions and classes

class genieclust.internal.DisjointSets#

Disjoint Sets (Union-Find)

Parameters:
nPy_ssize_t

The cardinality of the set whose partitions are generated.

Notes

Represents a partition of the set \(\{0,1,...,n-1\}\) for some \(n\).

Path compression for find() is implemented, but the union() operation is naive (neither it is union by rank nor by size), see https://en.wikipedia.org/wiki/Disjoint-set_data_structure. This is by design, as some other operations in the current package rely on the assumption that the parent ID of each element is always <= than itself.

find()#

Finds the subset ID for a given x

Parameters:
xPy_ssize_t

An integer in {0,…,n-1}, representing an element to find.

Returns:
Py_ssize_t

The ID of the parent of x.

get_k()#

Returns the current number of subsets

get_n()#

Returns the number of elements in the set being partitioned

to_list()#

Gets parent IDs of all the elements

Returns:
ndarray, shape (n,)

A list m such that m[x] denotes the (recursive) parent ID of x, for x = 0,1,…,n-1.

to_list_normalized()#

Gets the normalised elements’ membership information

Returns:
ndarray, shape (n,)

A list m such that m[x] denotes the normalised parent ID of x. The resulting values are in {0,1,…,k-1}, where k is the current number of subsets in the partition.

to_lists()#

Returns a list of lists representing the current partition

Returns:
list of lists

A list of length k, where k is the current number of sets in the partition. Each list element is a list with values in {0,…,n-1}.

Notes

This is a slow operation. Do you really need it?

union()#

Merges the sets containing given x and y

Parameters:
xPy_ssize_t

Integer in {0,…,n-1}, representing an element of the first set to be merged.

yPy_ssize_t

Integer in {0,…,n-1}, representing an element of the second set to be merged.

Returns:
parentPy_ssize_t

The id of the parent of x or y, whichever is smaller.

Notes

Let px be the parent ID of x, and py be the parent ID of y. If px < py, then the new parent ID of py will be set to py. Otherwise, px will have py as its parent.

If x and y are members of the same subset, an exception is thrown.

class genieclust.internal.GiniDisjointSets#

Augmented disjoint sets (Union-Find) over {0,1,…,n-1}

Parameters:
nPy_ssize_t

The cardinality of the set whose partitions are generated.

Notes

The class allows to compute the normalised Gini index of the distribution of subset sizes, i.e., \(G(x_1,\dots,x_k) = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^n |x_i-x_j|}{ (n-1) \sum_{i=1}^n x_i}\), where \(x_i\) is the number of elements in the i-th subset in the current partition.

For a use case, see: Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, pp. 8-23. doi:10.1016/j.ins.2016.05.003

find()#

Finds the subset ID for a given x

Parameters:
xPy_ssize_t

An integer in {0,…,n-1}, representing an element to find.

Returns:
Py_ssize_t

The ID of the parent of x.

get_count()#

Returns the size of the subset containing x

Parameters:
xPy_ssize_t

An integer in {0,…,n-1}, representing an element to find.

Notes

Run time: the cost of find(x)

get_counts()#

Generates an array of subsets’ sizes

Notes

The resulting vector is ordered nondecreasingly.

Run time is \(O(k)\), where k is the current number of subsets.

get_gini()#

Returns the Gini index of the distribution of subsets’ sizes

Notes

Run time is \(O(1)\), as the Gini index is updated during each call to union().

get_k()#

Returns the current number of subsets

get_n()#

Returns the number of elements in the set being partitioned

get_smallest_count()#

Returns the size of the smallest subset

Notes

Run time is O(1).

to_list()#

Gets parent IDs of all the elements

Returns:
ndarray, shape (n,)

A list m such that m[x] denotes the (recursive) parent ID of x, for x = 0,1,…,n-1.

to_list_normalized()#

Gets the normalised elements’ membership information

Returns:
ndarray, shape (n,)

A list m such that m[x] denotes the normalised parent ID of x. The resulting values are in {0,1,…,k-1}, where k is the current number of subsets in the partition.

to_lists()#

Returns a list of lists representing the current partition

Returns:
list of lists

A list of length k, where k is the current number of sets in the partition. Each list element is a list with values in {0,…,n-1}.

Notes

This is a slow operation. Do you really need it?

union()#

Merges the sets containing given x and y

Parameters:
xPy_ssize_t

Integer in {0,…,n-1}, representing an element of the first set to be merged.

yPy_ssize_t

Integer in {0,…,n-1}, representing an element of the second set to be merged.

Returns:
parentPy_ssize_t

The id of the parent of x or y, whichever is smaller.

Notes

Let px be the parent ID of x, and py be the parent ID of y. If px < py, then the new parent ID of py will be set to py. Otherwise, px will have py as its parent.

If x and y are members of the same subset, an exception is thrown.

Update time: pessimistically \(O(\sqrt{n})\), as the Gini index must be recomputed.