Example: Sparse Data and Movie Recommendation

To illustrate how genieclust handles sparse data, let’s perform a simple exercise in movie recommendation based on MovieLens data.

import numpy as np
import scipy.sparse
import pandas as pd

First we load the ratings data frame and map the movie IDs to consecutive integers.

ratings = pd.read_csv("ml-9-2018-small/ratings.csv")
ratings["movieId"] -= 1
ratings["userId"] -= 1
old_movieId_map = np.unique(ratings["movieId"])
ratings["movieId"] = np.searchsorted(old_movieId_map, ratings["movieId"])
ratings.head()
##    userId  movieId  rating  timestamp
## 0       0        0     4.0  964982703
## 1       0        2     4.0  964981247
## 2       0        5     4.0  964982224
## 3       0       43     5.0  964983815
## 4       0       46     5.0  964982931

Then we read the movie meta data and transform the movie IDs in the same way:

movies = pd.read_csv("ml-9-2018-small/movies.csv")
movies["movieId"] -= 1
movies = movies.loc[movies.movieId.isin(old_movieId_map), :]
movies["movieId"] = np.searchsorted(old_movieId_map, movies["movieId"])
movies.iloc[:, :2].head()
##    movieId                               title
## 0        0                    Toy Story (1995)
## 1        1                      Jumanji (1995)
## 2        2             Grumpier Old Men (1995)
## 3        3            Waiting to Exhale (1995)
## 4        4  Father of the Bride Part II (1995)

Conversion of ratings to a CSR-format sparse matrix:

n = ratings.movieId.max()+1
d = ratings.userId.max()+1
X = scipy.sparse.dok_matrix((n,d), dtype=np.float32)
X[ratings.movieId, ratings.userId] = ratings.rating
X = X.tocsr()
print(repr(X))
## <9724x610 sparse matrix of type '<class 'numpy.float32'>'
##         with 100836 stored elements in Compressed Sparse Row format>

First few observations:

X[:5, :10].todense()
## matrix([[4. , 0. , 0. , 0. , 4. , 0. , 4.5, 0. , 0. , 0. ],
##         [0. , 0. , 0. , 0. , 0. , 4. , 0. , 4. , 0. , 0. ],
##         [4. , 0. , 0. , 0. , 0. , 5. , 0. , 0. , 0. , 0. ],
##         [0. , 0. , 0. , 0. , 0. , 3. , 0. , 0. , 0. , 0. ],
##         [0. , 0. , 0. , 0. , 0. , 5. , 0. , 0. , 0. , 0. ]],
## dtype=float32)

Let’s extract 200 clusters with Genie with respect to cosine similarity between films’ ratings as given by users (two movies considered similar if they get similar reviews). Sparse inputs are supported by the approximate version of the algorithm which relies on the near-neighbour search routines implemented in the nmslib package.

import genieclust
g = genieclust.Genie(n_clusters=200, exact=False, affinity="cosinesimil_sparse")
movies["cluster"] = g.fit_predict(X)

Here are the members of an example cluster:

movies["cluster"] = g.fit_predict(X)
which_cluster = movies.cluster[movies.title=="Monty Python's The Meaning of Life (1983)"]
movies.loc[movies.cluster == int(which_cluster)].title.sort_values()
## 2097                                     Airplane! (1980)
## 2907                                 Almost Famous (2000)
## 1914                                  Analyze This (1999)
## 969                             Back to the Future (1985)
## 1486                    Back to the Future Part II (1989)
## 1487                   Back to the Future Part III (1990)
## 2259                          Being John Malkovich (1999)
## 2916                                  Best in Show (2000)
## 921                            Blues Brothers, The (1980)
## 89                                   Bottle Rocket (1996)
## 2084                                     Bowfinger (1999)
## 2190                                Boys Don't Cry (1999)
## 2888                                     Cell, The (2000)
## 955                                      Duck Soup (1933)
## 836                     E.T. the Extra-Terrestrial (1982)
## 1960                                      Election (1999)
## 2036                                Eyes Wide Shut (1999)
## 819                           Fish Called Wanda, A (1988)
## 1232                               Full Monty, The (1997)
## 964                                  Groundhog Day (1993)
## 2605                                 High Fidelity (2000)
## 1211                     Hunt for Red October, The (1990)
## 2382                                      Magnolia (1999)
## 863                Monty Python and the Holy Grail (1975)
## 2094    Monty Python's And Now for Something Completel...
## 820                   Monty Python's Life of Brian (1979)
## 4581            Monty Python's The Meaning of Life (1983)
## 2892       Naked Gun 2 1/2: The Smell of Fear, The (1991)
## 2891    Naked Gun: From the Files of Police Squad!, Th...
## 3010                    O Brother, Where Art Thou? (2000)
## 1394                                  Out of Sight (1998)
## 2443                                 Patriot Games (1992)
## 850                    People vs. Larry Flynt, The (1996)
## 899                            Princess Bride, The (1987)
## 2020                     Run Lola Run (Lola rennt) (1998)
## 1796                                      Rushmore (1998)
## 1979     Star Wars: Episode I - The Phantom Menace (1999)
## 224             Star Wars: Episode IV - A New Hope (1977)
## 898     Star Wars: Episode V - The Empire Strikes Back...
## 911     Star Wars: Episode VI - Return of the Jedi (1983)
## 934                                     Sting, The (1973)
## 987                             This Is Spinal Tap (1984)
## 2174                                   Three Kings (1999)
## 839                                        Top Gun (1986)
## 3016                                       Traffic (2000)
## 1113                           Waiting for Guffman (1996)
## 977                             Young Frankenstein (1974)
## Name: title, dtype: object

The above was performed on an abridged version of the MovieLens dataset. The project’s website also features a full database that yields a 53889x283228 ratings table (with 27753444 non-zero elements) – such a matrix would definitely not fit into our RAM if it was in the dense form. Determining the whole cluster hierarchy takes only 144 secs. Here is one of 500 clusters extracted:

## 13327                       Blackadder Back & Forth (1999)
## 13328                  Blackadder's Christmas Carol (1988)
## 3341                              Creature Comforts (1989)
## 1197       Grand Day Out with Wallace and Gromit, A (1989)
## 2778                            Hard Day's Night, A (1964)
## 2861                                          Help! (1965)
## 2963                              How I Won the War (1967)
## 6006        Monty Python Live at the Hollywood Bowl (1982)
## 1113                Monty Python and the Holy Grail (1975)
## 2703     Monty Python's And Now for Something Completel...
## 1058                   Monty Python's Life of Brian (1979)
## 6698             Monty Python's The Meaning of Life (1983)
## 27284                                  Oliver Twist (1997)
## 2216                                 Producers, The (1968)
## 4716                                   Quadrophenia (1979)
## 6027             Secret Policeman's Other Ball, The (1982)
## 27448                                    The Basket (2000)
## 2792                                          Tommy (1975)
## 10475    Wallace & Gromit in The Curse of the Were-Rabb...
## 732                 Wallace & Gromit: A Close Shave (1995)
## 708      Wallace & Gromit: The Best of Aardman Animatio...
## 1125           Wallace & Gromit: The Wrong Trousers (1993)
## 13239    Wallace and Gromit in 'A Matter of Loaf and De...
## 2772                               Yellow Submarine (1968)
## 1250                             Young Frankenstein (1974)
## Name: title, dtype: object