Example: Sparse Data and Movie Recommendation#
To illustrate how genieclust handles sparse data, let’s perform a simple exercise in movie recommendation based on MovieLens data.
Important
Make sure that the nmslib package (an optional dependency) is installed.
import numpy as np
import scipy.sparse
import pandas as pd
First we load the ratings data frame and map the movie IDs to consecutive integers.
ratings = pd.read_csv("ml-9-2018-small/ratings.csv")
ratings["movieId"] -= 1
ratings["userId"] -= 1
old_movieId_map = np.unique(ratings["movieId"])
ratings["movieId"] = np.searchsorted(old_movieId_map, ratings["movieId"])
ratings.head()
## userId movieId rating timestamp
## 0 0 0 4.0 964982703
## 1 0 2 4.0 964981247
## 2 0 5 4.0 964982224
## 3 0 43 5.0 964983815
## 4 0 46 5.0 964982931
Then we read the movie metadata and transform the movie IDs in the same way:
movies = pd.read_csv("ml-9-2018-small/movies.csv")
movies["movieId"] -= 1
movies = movies.loc[movies.movieId.isin(old_movieId_map), :]
movies["movieId"] = np.searchsorted(old_movieId_map, movies["movieId"])
movies.iloc[:, :2].head()
## movieId title
## 0 0 Toy Story (1995)
## 1 1 Jumanji (1995)
## 2 2 Grumpier Old Men (1995)
## 3 3 Waiting to Exhale (1995)
## 4 4 Father of the Bride Part II (1995)
Conversion of ratings to a CSR-format sparse matrix:
n = ratings.movieId.max()+1
d = ratings.userId.max()+1
X = scipy.sparse.dok_matrix((n,d), dtype=np.float32)
X[ratings.movieId, ratings.userId] = ratings.rating
X = X.tocsr()
print(repr(X))
## <9724x610 sparse matrix of type '<class 'numpy.float32'>'
## with 100836 stored elements in Compressed Sparse Row format>
First few observations:
X[:5, :10].todense()
## matrix([[4. , 0. , 0. , 0. , 4. , 0. , 4.5, 0. , 0. , 0. ],
## [0. , 0. , 0. , 0. , 0. , 4. , 0. , 4. , 0. , 0. ],
## [4. , 0. , 0. , 0. , 0. , 5. , 0. , 0. , 0. , 0. ],
## [0. , 0. , 0. , 0. , 0. , 3. , 0. , 0. , 0. , 0. ],
## [0. , 0. , 0. , 0. , 0. , 5. , 0. , 0. , 0. , 0. ]],
## dtype=float32)
Let’s extract 200 clusters with Genie with respect to the cosine similarity between films’ ratings as given by users (two movies considered similar if they get similar reviews). Sparse inputs are supported by the approximate version of the algorithm which relies on the near-neighbour search routines implemented in the nmslib package.
import genieclust
g = genieclust.Genie(n_clusters=200, exact=False, affinity="cosinesimil_sparse")
movies["cluster"] = g.fit_predict(X)
## ---------------------------------------------------------------------------ValueError
## Traceback (most recent call last)Cell In[1], line 3
## 1 import genieclust
## 2 g = genieclust.Genie(n_clusters=200, exact=False,
## affinity="cosinesimil_sparse")
## ----> 3 movies["cluster"] = g.fit_predict(X)
## File ~/.virtualenvs/python3-default/lib/python3.11/site-
## packages/genieclust/genie.py:548, in GenieBase.fit_predict(self, X, y)
## 520 def fit_predict(self, X, y=None):
## 521 """
## 522 Perform cluster analysis of a dataset and return the
## predicted labels.
## 523
## (...)
## 546
## 547 """
## --> 548 self.fit(X)
## 549 return self.labels_
## File ~/.virtualenvs/python3-default/lib/python3.11/site-
## packages/genieclust/genie.py:1051, in Genie.fit(self, X, y)
## 972 """
## 973 Perform cluster analysis of a dataset.
## 974
## (...)
## 1047
## 1048 """
## 1049 cur_state = self._check_params() # re-check, they might have
## changed
## -> 1051 cur_state = self._get_mst(X, cur_state)
## 1053 if cur_state["verbose"]:
## 1054 print("[genieclust] Determining clusters with Genie++.",
## file=sys.stderr)
## File ~/.virtualenvs/python3-default/lib/python3.11/site-
## packages/genieclust/genie.py:511, in GenieBase._get_mst(self, X,
## cur_state)
## 509 cur_state = self._get_mst_exact(X, cur_state)
## 510 else:
## --> 511 cur_state = self._get_mst_approx(X, cur_state)
## 513 # this might be an "intrinsic" dimensionality:
## 514 self.n_features_ = cur_state["n_features"]
## File ~/.virtualenvs/python3-default/lib/python3.11/site-
## packages/genieclust/genie.py:380, in GenieBase._get_mst_approx(self,
## X, cur_state)
## 378 def _get_mst_approx(self, X, cur_state):
## 379 if nmslib is None:
## --> 380 raise ValueError("Package `nmslib` is not available.")
## 382 if cur_state["affinity"] == "precomputed":
## 383 raise ValueError(
## 384 "`affinity` of 'precomputed' can only be used "
## 385 "with `exact` = True.")
## ValueError: Package `nmslib` is not available.
Here are the members of an example cluster:
movies["cluster"] = g.fit_predict(X)
which_cluster = movies.cluster[movies.title=="Monty Python's The Meaning of Life (1983)"]
movies.loc[movies.cluster == int(which_cluster)].title.sort_values()
## ---------------------------------------------------------------------------ValueError
## Traceback (most recent call last)Cell In[1], line 1
## ----> 1 movies["cluster"] = g.fit_predict(X)
## 2 which_cluster = movies.cluster[movies.title=="Monty Python's
## The Meaning of Life (1983)"]
## 3 movies.loc[movies.cluster ==
## int(which_cluster)].title.sort_values()
## File ~/.virtualenvs/python3-default/lib/python3.11/site-
## packages/genieclust/genie.py:548, in GenieBase.fit_predict(self, X, y)
## 520 def fit_predict(self, X, y=None):
## 521 """
## 522 Perform cluster analysis of a dataset and return the
## predicted labels.
## 523
## (...)
## 546
## 547 """
## --> 548 self.fit(X)
## 549 return self.labels_
## File ~/.virtualenvs/python3-default/lib/python3.11/site-
## packages/genieclust/genie.py:1051, in Genie.fit(self, X, y)
## 972 """
## 973 Perform cluster analysis of a dataset.
## 974
## (...)
## 1047
## 1048 """
## 1049 cur_state = self._check_params() # re-check, they might have
## changed
## -> 1051 cur_state = self._get_mst(X, cur_state)
## 1053 if cur_state["verbose"]:
## 1054 print("[genieclust] Determining clusters with Genie++.",
## file=sys.stderr)
## File ~/.virtualenvs/python3-default/lib/python3.11/site-
## packages/genieclust/genie.py:511, in GenieBase._get_mst(self, X,
## cur_state)
## 509 cur_state = self._get_mst_exact(X, cur_state)
## 510 else:
## --> 511 cur_state = self._get_mst_approx(X, cur_state)
## 513 # this might be an "intrinsic" dimensionality:
## 514 self.n_features_ = cur_state["n_features"]
## File ~/.virtualenvs/python3-default/lib/python3.11/site-
## packages/genieclust/genie.py:380, in GenieBase._get_mst_approx(self,
## X, cur_state)
## 378 def _get_mst_approx(self, X, cur_state):
## 379 if nmslib is None:
## --> 380 raise ValueError("Package `nmslib` is not available.")
## 382 if cur_state["affinity"] == "precomputed":
## 383 raise ValueError(
## 384 "`affinity` of 'precomputed' can only be used "
## 385 "with `exact` = True.")
## ValueError: Package `nmslib` is not available.
The above was performed on an abridged version of the MovieLens dataset. The project’s website also features a full database that yields a 53,889x283,228 ratings table (with 27,753,444 non-zero elements) – such a matrix would definitely not fit into our RAM if it was in the dense form. Determining the whole cluster hierarchy takes only 144 seconds. Here is one of 500 clusters extracted:
## 13327 Blackadder Back & Forth (1999)
## 13328 Blackadder's Christmas Carol (1988)
## 3341 Creature Comforts (1989)
## 1197 Grand Day Out with Wallace and Gromit, A (1989)
## 2778 Hard Day's Night, A (1964)
## 2861 Help! (1965)
## 2963 How I Won the War (1967)
## 6006 Monty Python Live at the Hollywood Bowl (1982)
## 1113 Monty Python and the Holy Grail (1975)
## 2703 Monty Python's And Now for Something Completel...
## 1058 Monty Python's Life of Brian (1979)
## 6698 Monty Python's The Meaning of Life (1983)
## 27284 Oliver Twist (1997)
## 2216 Producers, The (1968)
## 4716 Quadrophenia (1979)
## 6027 Secret Policeman's Other Ball, The (1982)
## 27448 The Basket (2000)
## 2792 Tommy (1975)
## 10475 Wallace & Gromit in The Curse of the Were-Rabb...
## 732 Wallace & Gromit: A Close Shave (1995)
## 708 Wallace & Gromit: The Best of Aardman Animatio...
## 1125 Wallace & Gromit: The Wrong Trousers (1993)
## 13239 Wallace and Gromit in 'A Matter of Loaf and De...
## 2772 Yellow Submarine (1968)
## 1250 Young Frankenstein (1974)
## Name: title, dtype: object