% Using Ratings & Posters\newline for Anime & Manga Recommendations
% Jill-Jênn Vie\newline Researcher at Inria
% RIKEN Center for Advanced Intelligence Project (Tokyo)\newline Mangaki (Paris)

---
header-includes:
  - \usepackage{tikz}
  - \usepackage{array}
  - \usepackage{icomma}
  - \usepackage{multicol,booktabs}
  - \def\R{\mathcal{R}}
---

Recommendation System

Problem


Every supervised machine learning algorithm

fit($X$, $y$)

\centering
\begin{tabular}{ccc} \toprule
\multicolumn{2}{c}{$X$} & $y$\\ \cmidrule{1-2}
\texttt{user\_id} & \texttt{work\_id} & \texttt{rating}\\ \midrule
24 & 823 & like\\
12 & 823 & dislike\\
12 & 25 & favorite\\
\ldots & \ldots & \ldots\\ \bottomrule
\end{tabular}

\pause

$\hat{y}$ = predict($X$)

\centering
\begin{tabular}{ccc} \toprule
\multicolumn{2}{c}{$X$} & $\hat{y}$\\ \cmidrule{1-2}
\texttt{user\_id} & \texttt{work\_id} & \texttt{rating}\\ \midrule
24 & 25 & \only<2>{?}\only<3>{\alert{disliked}}\\
12 & 42 & \only<2>{?}\only<3>{\alert{liked}}\\ \bottomrule
\end{tabular}
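
To make the fit/predict contract concrete, here is a toy baseline with the same interface. This is our own illustrative class (not Mangaki's model), with ratings encoded numerically, e.g. dislike $= -1$, like $= 1$, favorite $= 2$:

```python
import numpy as np

class MeanRatingBaseline:
    """Toy recommender with the fit/predict API above: predicts each
    item's mean observed rating, or the global mean for unseen items."""
    def fit(self, X, y):
        self.global_mean = np.mean(y)
        ratings_per_item = {}
        for (user_id, work_id), rating in zip(X, y):
            ratings_per_item.setdefault(work_id, []).append(rating)
        self.item_means = {w: np.mean(r) for w, r in ratings_per_item.items()}
        return self

    def predict(self, X):
        return np.array([self.item_means.get(work_id, self.global_mean)
                         for user_id, work_id in X])

X_train = [(24, 823), (12, 823), (12, 25)]
y_train = [1, -1, 2]                      # like, dislike, favorite
model = MeanRatingBaseline().fit(X_train, y_train)
model.predict([(24, 25), (12, 42)])       # ≈ [2.0, 0.67]
```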

Evaluation: Root Mean Squared Error (RMSE)

If I predict $\hat{y}_i$ for each of the $n$ user-work pairs of the test set,
while the truth is $y^*_i$:

\[RMSE(\hat{y}, y^*) = \sqrt{\frac1n \sum_i (\hat{y}_i - y^*_i)^2}.\]
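
As a sanity check, here is the same metric in a few lines of numpy (a minimal sketch, with made-up numbers):

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean squared error between predicted and true ratings."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

rmse([2.0, 0.5], [1.0, -1.0])  # ≈ 1.27
```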

Dataset: Mangaki


\pause

Visualizing movies: movie2vec

\alert{Closer} points mean similar taste


Find your taste

You will \alert{like} movies that are \alert{close to you}


Variants of Matrix Factorization for Recommendation

$R$: rating matrix, $C$: user coefficients, $P$: item profiles, where $P = F^T$ ($F$: item features).

$R = CP = CF^T \Rightarrow r_{ij} \simeq \hat{r}_{ij} \triangleq C_i \cdot F_j$.

Objective functions (reconstruction error) to minimize

SVD: $\sum_{i, j}~(r_{ij} - C_i \cdot F_j)^2$ (deterministic)

\pause

ALS: $\sum_{i, j \textnormal{\alert{ known}}}~(r_{ij} - C_i \cdot F_j)^2$

\pause

\alert<6>{ALS-WR}: $\sum_{i, j \textnormal{\alert{ known}}}~(r_{ij} - C_i \cdot F_j)^2 + \lambda \left(\sum_i \alert<6>{N_i} {\lVert C_i \rVert}^2 + \sum_j \alert<6>{M_j} {\lVert F_j \rVert}^2\right)$

where $N_i$ is the number of items rated by user $i$ and $M_j$ the number of users who rated item $j$ (weighted regularization).

\pause

WALS by TensorFlow™: $$\sum_{i, j} w_{ij} \cdot (r_{ij} - C_i \cdot F_j)^2 + \lambda \left(\sum_i {\lVert C_i \rVert}^2 + \sum_j {\lVert F_j \rVert}^2\right)$$

\pause

Who do you think wins?
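
To make the alternating updates concrete, here is a minimal numpy sketch of ALS over known entries only, with the plain $\lambda$ regularizer (ALS-WR would additionally weight it by $N_i$ and $M_j$). Toy random data, not Mangaki's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k, lam = 50, 80, 10, 0.1
R = rng.normal(size=(n_users, n_items))       # toy ratings
mask = rng.random((n_users, n_items)) < 0.3   # which (i, j) are known

C = rng.normal(scale=0.1, size=(n_users, k))  # user coefficients
F = rng.normal(scale=0.1, size=(n_items, k))  # item features

for _ in range(10):
    # Fix F; each C_i is a ridge-regression solution over user i's known ratings
    for i in range(n_users):
        J = mask[i]
        C[i] = np.linalg.solve(F[J].T @ F[J] + lam * np.eye(k), F[J].T @ R[i, J])
    # Fix C; solve symmetrically for each item's F_j
    for j in range(n_items):
        I = mask[:, j]
        F[j] = np.linalg.solve(C[I].T @ C[I] + lam * np.eye(k), C[I].T @ R[I, j])

R_hat = C @ F.T   # r_ij ≈ C_i · F_j
```

Note that an item with no known ratings gets $F_j = 0$: this is exactly the cold-start issue below.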

ALS for feature extraction

$R = CP$

Issue: Item Cold-Start

A brand-new item has no known ratings, so ALS cannot learn any profile $F_j$ for it.

\pause

But we have posters!

Illustration2Vec (Saito and Matsui, 2015)

(figure: example posters and the tags Illustration2Vec predicts for them)
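
For reference, tagging a poster with the published i2v library looks roughly like this. The snippet follows the library's README; the model and tag-list files must be downloaded separately, and the exact file names here should be treated as assumptions:

```python
import i2v
from PIL import Image

# Pretrained tag model from the Illustration2Vec release (downloaded beforehand)
illust2vec = i2v.make_i2v_with_chainer(
    "illust2vec_tag_ver200.caffemodel", "tag_list.json")

img = Image.open("poster.jpg")                     # hypothetical poster file
illust2vec.estimate_plausible_tags([img], threshold=0.5)
```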

LASSO for explanation of user preferences

$T$: matrix of 15,000 works $\times$ 502 tags

Interpretation and explanation

Least Absolute Shrinkage and Selection Operator (LASSO)

\[\frac1{2 N_i} {\lVert \R_i - P_i T^T \rVert}_2^2 + \alpha \alert{ {\lVert P_i \rVert}_1}.\]

\noindent where $N_i$ is the number of items rated by user $i$.
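
This objective is exactly what scikit-learn's Lasso minimizes, so user $i$'s sparse tag profile $P_i$ can be fitted directly. A minimal sketch with made-up data standing in for the real $T$ and ratings:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
T = rng.random((15000, 502))             # works x tags (toy stand-in)
rated = np.array([3, 42, 823, 2048])     # items rated by user i (made up)
R_i = np.array([1.0, -1.0, 2.0, 1.0])    # their ratings, e.g. dislike=-1

# sklearn minimizes 1/(2 n_samples) ||y - Xw||² + alpha ||w||₁,
# matching the objective above with X = T[rated] and w = P_i
lasso = Lasso(alpha=0.01).fit(T[rated], R_i)
P_i = lasso.coef_                        # sparse: few nonzero tag weights

top_tags = np.argsort(-np.abs(P_i))[:5]  # tags that explain user i's taste
```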

Blending

We would like to do:

\[\hat{r}_{ij}^{BALSE} = \begin{cases} \hat{r}_{ij}^{ALS} & \text{if item $j$ was rated at least $\gamma$ times}\\ \hat{r}_{ij}^{LASSO} & \text{otherwise} \end{cases}\]

But we can’t. Why?

\pause

\[\hat{r}_{ij}^{BALSE} = \alert{\sigma(\beta(R_j - \gamma))} \hat{r}_{ij}^{ALS} + \left(1 - \alert{\sigma(\beta(R_j - \gamma))}\right) \hat{r}_{ij}^{LASSO}\]

\noindent where $R_j$ denotes the number of ratings of item $j$;
$\beta$ and $\gamma$ are learned by stochastic gradient descent. Unlike the hard threshold above, the sigmoid gate is differentiable, which is what makes this learning possible.
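
As a sketch, the gate is just a logistic blend of the two predictions (the parameter values below are made up; in BALSE they are learned):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

beta, gamma = 2.0, 3.0   # hypothetical values; learned by SGD in BALSE

def balse_predict(r_als, r_lasso, n_ratings_j):
    """Softly switch from LASSO (cold items) to ALS (well-rated items)."""
    gate = sigmoid(beta * (n_ratings_j - gamma))   # ≈ 0 if item j has few ratings
    return gate * r_als + (1 - gate) * r_lasso
```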

\pause

\centering

We call this gate the \alert{Steins;Gate}.

Blended Alternate Least Squares with Explanation

\centering


\pause

We call this model \alert{BALSE}.

Results


Follow us on Twitter: \alert{@MangakiFR}!


Read our article (it made the front page of Hacker News!)

Using Posters to Recommend Anime and Mangas in a Cold-Start Scenario

https://github.com/mangaki/balse (PDF is on arXiv)

Compete in the Mangaki Data Challenge: \alert{research.mangaki.fr}