genieclust.internal¶
Internal functions and classes
- class genieclust.internal.DisjointSets¶
Disjoint Sets (Union-Find)
- Parameters:
- nPy_ssize_t
The cardinality of the set whose partitions are generated.
Notes
Represents a partition of the set \(\{0,1,...,n-1\}\) for some \(n\).
Path compression for find() is implemented, but the union() operation is naive (neither it is union by rank nor by size), see https://en.wikipedia.org/wiki/Disjoint-set_data_structure. This is by design, as some other operations in the current package rely on the assumption that the parent ID of each element is always <= than itself.
- find(x)¶
Finds the subset ID for a given x
- Parameters:
- xPy_ssize_t
An integer in {0,…,n-1}, representing an element to find.
- Returns:
- Py_ssize_t
The ID of the parent of x.
- get_k()¶
Returns the current number of subsets
- get_n()¶
Returns the number of elements in the set being partitioned
- to_list()¶
Gets parent IDs of all the elements
- Returns:
- ndarray, shape (n,)
A list
m
such thatm[x]
denotes the (recursive) parent ID of x, for x = 0,1,…,n-1.
- to_list_normalized()¶
Gets the normalised elements’ membership information
- Returns:
- ndarray, shape (n,)
A list
m
such thatm[x]
denotes the normalised parent ID of x. The resulting values are in {0,1,…,k-1}, where k is the current number of subsets in the partition.
- to_lists()¶
Returns a list of lists representing the current partition
- Returns:
- list of lists
A list of length k, where k is the current number of sets in the partition. Each list element is a list with values in {0,…,n-1}.
Notes
This is a slow operation. Do you really need it?
- union(x, y)¶
Merges the sets containing given x and y
- Parameters:
- xPy_ssize_t
Integer in {0,…,n-1}, representing an element of the first set to be merged.
- yPy_ssize_t
Integer in {0,…,n-1}, representing an element of the second set to be merged.
- Returns:
- parentPy_ssize_t
The id of the parent of x or y, whichever is smaller.
Notes
Let px be the parent ID of x, and py be the parent ID of y. If px < py, then the new parent ID of py will be set to py. Otherwise, px will have py as its parent.
If x and y are members of the same subset, an exception is thrown.
- class genieclust.internal.GiniDisjointSets¶
Augmented disjoint sets (Union-Find) over {0,1,…,n-1}
- Parameters:
- nPy_ssize_t
The cardinality of the set whose partitions are generated.
Notes
The class allows to compute the normalised Gini index of the distribution of subset sizes, i.e., \(G(x_1,\dots,x_k) = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^n |x_i-x_j|}{ (n-1) \sum_{i=1}^n x_i}\), where \(x_i\) is the number of elements in the i-th subset in the current partition.
For a use case, see: Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, pp. 8-23. doi:10.1016/j.ins.2016.05.003
- find(x)¶
Finds the subset ID for a given x
- Parameters:
- xPy_ssize_t
An integer in {0,…,n-1}, representing an element to find.
- Returns:
- Py_ssize_t
The ID of the parent of x.
- get_count(x)¶
Returns the size of the subset containing x
- Parameters:
- xPy_ssize_t
An integer in {0,…,n-1}, representing an element to find.
Notes
Run time: the cost of find(x)
- get_counts()¶
Generates an array of subsets’ sizes
Notes
The resulting vector is ordered nondecreasingly.
Run time is \(O(k)\), where k is the current number of subsets.
- get_gini()¶
Returns the Gini index of the distribution of subsets’ sizes
Notes
Run time is \(O(1)\), as the Gini index is updated during each call to union().
- get_k()¶
Returns the current number of subsets
- get_n()¶
Returns the number of elements in the set being partitioned
- get_smallest_count()¶
Returns the size of the smallest subset
Notes
Run time is O(1).
- to_list()¶
Gets parent IDs of all the elements
- Returns:
- ndarray, shape (n,)
A list
m
such thatm[x]
denotes the (recursive) parent ID of x, for x = 0,1,…,n-1.
- to_list_normalized()¶
Gets the normalised elements’ membership information
- Returns:
- ndarray, shape (n,)
A list
m
such thatm[x]
denotes the normalised parent ID of x. The resulting values are in {0,1,…,k-1}, where k is the current number of subsets in the partition.
- to_lists()¶
Returns a list of lists representing the current partition
- Returns:
- list of lists
A list of length k, where k is the current number of sets in the partition. Each list element is a list with values in {0,…,n-1}.
Notes
This is a slow operation. Do you really need it?
- union(x, y)¶
Merges the sets containing given x and y
- Parameters:
- xPy_ssize_t
Integer in {0,…,n-1}, representing an element of the first set to be merged.
- yPy_ssize_t
Integer in {0,…,n-1}, representing an element of the second set to be merged.
- Returns:
- parentPy_ssize_t
The id of the parent of x or y, whichever is smaller.
Notes
Let px be the parent ID of x, and py be the parent ID of y. If px < py, then the new parent ID of py will be set to py. Otherwise, px will have py as its parent.
If x and y are members of the same subset, an exception is thrown.
Update time: pessimistically \(O(\sqrt{n})\), as the Gini index must be recomputed.