% Variational factorization machines for preference elicitation in large-scale recommender systems
% Jill-Jênn Vie¹; Tomas Rigaux¹; Hisashi Kashima²
% ¹ Inria Saclay, SODA team\newline ² Kyoto University

---
aspectratio: 169
handout: false
institute: \hfill \includegraphics[height=1cm]{figures/inria.png} \includegraphics[height=1cm]{figures/kyoto.png}
section-titles: false
theme: metropolis
header-includes:
  - \usepackage{bm}
  - \usepackage{multicol,booktabs}
  - \usepackage{algorithm,algpseudocode,algorithmicx}
  - \DeclareMathOperator\logit{logit}
  - \def\u{\bm{u}}
  - \def\v{\bm{v}}
  - \def\X{\bm{X}}
  - \def\x{\bm{x}}
  - \def\y{\bm{y}}
  - \def\E{\mathbb{E}}
  - \def\N{\mathcal{N}}
  - \def\Lb{\mathcal{L}_B}
  - \def\KL{\textnormal{KL}}
  - \metroset{sectionpage=none}
---

Introduction

Outline

- Preference elicitation
- Factorization Machines (FMs)
- In this paper

Recommender Systems

Recommender Systems as Matrix Completion

Problem

Some ratings are observed, but most entries are missing: predict the missing entries.

Example

\centering \begin{tabular}{ccccc}
& \includegraphics[height=2.5cm]{figures/1.jpg} & \includegraphics[height=2.5cm]{figures/2.jpg} & \includegraphics[height=2.5cm]{figures/3.jpg} & \includegraphics[height=2.5cm]{figures/4.jpg}\\
Satoshi & ? & 5 & 2 & ?\\
Kasumi & 4 & 1 & ? & 5\\
Takeshi & 3 & 3 & 1 & 4\\
Joy & 5 & ? & 2 & ? \end{tabular}

Recommender Systems as Matrix Completion

Problem

Example

\centering \begin{tabular}{ccccc}
& \includegraphics[height=2.5cm]{figures/1.jpg} & \includegraphics[height=2.5cm]{figures/2.jpg} & \includegraphics[height=2.5cm]{figures/3.jpg} & \includegraphics[height=2.5cm]{figures/4.jpg}\\
Satoshi & \alert{3} & 5 & 2 & \alert{2}\\
Kasumi & 4 & 1 & \alert{4} & 5\\
Takeshi & 3 & 3 & 1 & 4\\
Joy & 5 & \alert{2} & 2 & \alert{5} \end{tabular}

Preference Elicitation: select an informative batch of $K$ items

\centering


Preference Elicitation: learn user embeddings in latent space

\centering

\includegraphics{figures/svd2.png}

Matrix factorization for collaborative filtering

Approximate the $n \times m$ rating matrix $R$ by learning embeddings for users and items:

\[\left.\begin{array}{r} U \textnormal{ user embeddings } n \times d\\ V \textnormal{ item embeddings } m \times d \end{array}\right\} \textnormal{ such that } R \simeq \alert{UV^T}\]

\begin{block}{Fit} Learn $U$ and $V$ to \alert{minimize} $ {||R - UV^T||}_2^2 + \lambda \cdot \textnormal{regularization} $ \end{block}

\begin{block}{Predict: Will user $i$ like item $j$?} Just compute $\langle \u_i, \v_j \rangle$ \end{block}

The actual model also contains bias terms for user $i$ and item $j$

\[r_{ij} = \mu + \alert{w^u_i + w^v_j} + \langle \u_i, \v_j \rangle\]
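To make the fit and predict steps concrete, here is a minimal NumPy sketch (toy hyper-parameters of our own choosing, not the paper's code) of the biased model above, trained by SGD on the observed entries of the example table:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, lam, lr = 4, 4, 2, 0.1, 0.01        # toy sizes and hyper-parameters

mu = 3.0                                     # global mean
w_u, w_v = np.zeros(n), np.zeros(m)          # user and item biases
U = 0.1 * rng.standard_normal((n, d))        # user embeddings
V = 0.1 * rng.standard_normal((m, d))        # item embeddings

def predict(i, j):
    return mu + w_u[i] + w_v[j] + U[i] @ V[j]

# One SGD pass over observed (user, item, rating) triplets from the example
observed = [(0, 1, 5), (0, 2, 2), (1, 0, 4), (1, 1, 1), (1, 3, 5)]
for i, j, r in observed:
    err = r - predict(i, j)                  # prediction error
    w_u[i] += lr * (err - lam * w_u[i])
    w_v[j] += lr * (err - lam * w_v[j])
    U[i], V[j] = (U[i] + lr * (err * V[j] - lam * U[i]),
                  V[j] + lr * (err * U[i] - lam * V[j]))
```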

How to model side information?

If you know user $i$ watched item $j$ at the \alert{cinema} (or on TV, or on smartphone), how to model it?

$r_{ij}$: rating of user $i$ on item $j$

Collaborative filtering

\[r_{ij} = w_{\textnormal{user } i} + w_{\textnormal{item } j} + \langle \bm{v}_{\textnormal{user $i$}}, \bm{v}_{\textnormal{item $j$}} \rangle\]

\pause

With side information

\[r_{ij} = w_{\textnormal{user } i} + w_{\textnormal{item } j} + \alert{w_{cinema}} + \langle \bm{v}_{\textnormal{user $i$}}, \bm{v}_{\textnormal{item $j$}} \rangle + \langle \bm{v}_{\textnormal{user $i$}}, \alert{\bm{v}_{\textnormal{cinema}}} \rangle + \langle \bm{v}_{\textnormal{item $j$}}, \alert{\bm{v}_{\textnormal{cinema}}} \rangle\]

Encoding the problem using sparse features

\centering \begin{tabular}{rrr|rrrr|rrr} \toprule
\multicolumn{3}{c}{Users} & \multicolumn{4}{c}{Items} & \multicolumn{3}{c}{Formats}\\
\cmidrule(lr){1-3} \cmidrule(lr){4-7} \cmidrule(lr){8-10}
$U_1$ & $U_2$ & $U_3$ & $I_1$ & $I_2$ & $I_3$ & $I_4$ & cinema & TV & mobile\\
\midrule
0 & \alert1 & 0 & 0 & \alert1 & 0 & 0 & 0 & \alert1 & 0\\
0 & 0 & \alert1 & 0 & 0 & \alert1 & 0 & 0 & \alert1 & 0\\
0 & \alert1 & 0 & 0 & 0 & \alert1 & 0 & \alert1 & 0 & 0\\
0 & \alert1 & 0 & 0 & \alert1 & 0 & 0 & \alert1 & 0 & 0\\
\alert1 & 0 & 0 & 0 & 0 & 0 & \alert1 & 0 & \alert1 & 0\\
\bottomrule \end{tabular}
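For illustration, a sketch (assuming SciPy is available; the event list mirrors the rows of the table above, variable names are ours) of how such sparse one-hot rows can be assembled:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Each event (user, item, format) becomes one sparse row with three ones,
# matching the encoding table above.
n_users, n_items, formats = 3, 4, ["cinema", "TV", "mobile"]
events = [(1, 1, "TV"), (2, 2, "TV"), (1, 2, "cinema"),
          (1, 1, "cinema"), (0, 3, "TV")]

rows, cols = [], []
for row, (u, i, f) in enumerate(events):
    cols += [u, n_users + i, n_users + n_items + formats.index(f)]
    rows += [row] * 3
X = csr_matrix((np.ones(len(cols)), (rows, cols)),
               shape=(len(events), n_users + n_items + len(formats)))
print(X.toarray())
```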

Graphically: factorization machines

\centering

Formally: factorization machines

Learn bias \alert{$w_k$} and embedding \alert{$\bm{v_k}$} for each feature $k$ such that: \(y(\bm{x}) = \mu + \underbrace{\sum_{k = 1}^K \alert{w_k} x_k}_{\textnormal{linear regression}} + \underbrace{\sum_{1 \leq k < l \leq K} x_k x_l \langle \alert{\bm{v_k}}, \alert{\bm{v_l}} \rangle}_{\textnormal{pairwise interactions}}\)

This model is for regression.
For classification, use a link function such as the sigmoid/softmax or the Gaussian CDF (probit).

\small \fullcite{rendle2012factorization}
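A minimal NumPy sketch of this prediction rule (dense features for readability; real implementations keep $X$ sparse), using the standard identity that turns the pairwise sum into an $O(Kd)$ computation:

```python
import numpy as np

def fm_predict(X, mu, w, V):
    """FM output y(x): X is (B, K) features, w is (K,) biases, V is (K, d) embeddings.

    Uses sum_{k<l} x_k x_l <v_k, v_l> = 0.5 * sum_f [(X V)_f^2 - (X^2 V^2)_f].
    """
    linear = X @ w
    pairwise = 0.5 * ((X @ V) ** 2 - (X ** 2) @ (V ** 2)).sum(axis=1)
    return mu + linear + pairwise

# Toy check against the slow double sum (sizes are illustrative)
rng = np.random.default_rng(0)
X, w, V = rng.random((2, 6)), rng.random(6), rng.random((6, 3))
slow = [sum(X[b, k] * X[b, l] * V[k] @ V[l]
            for k in range(6) for l in range(k + 1, 6)) for b in range(2)]
assert np.allclose(fm_predict(X, 0.0, w, V) - X @ w, slow)
```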

Training using, for example, SGD

Take a batch $(\X_B, y_B)$ and update the parameters such that the error is minimized.

\begin{algorithm}[H]
\begin{algorithmic}
\For {each batch $(\bm{X}_B, y_B)$}
  \For {each feature $k$ involved in this batch $\bm{X}_B$}
    \State Update $w_k, \bm{v}_k$ to decrease the loss estimate $\mathcal{L}$ on $\bm{X}_B$
  \EndFor
\EndFor
\end{algorithmic}
\caption{SGD}
\label{algo-sgd}
\end{algorithm}
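A sketch of this loop, assuming PyTorch autograd on synthetic data of our own (an FM library would typically write the per-feature updates by hand):

```python
import torch

torch.manual_seed(0)
N, K, d = 256, 10, 5                   # toy sizes, chosen for illustration
X = (torch.rand(N, K) < 0.3).float()   # synthetic sparse binary features
y = torch.randn(N)                     # synthetic ratings

mu = torch.nn.Parameter(torch.zeros(()))
w = torch.nn.Parameter(torch.zeros(K))
V = torch.nn.Parameter(0.01 * torch.randn(K, d))
opt = torch.optim.SGD([mu, w, V], lr=0.05)

def fm(X):  # same O(Kd) identity as in the previous sketch
    return mu + X @ w + 0.5 * ((X @ V) ** 2 - (X ** 2) @ (V ** 2)).sum(1)

for epoch in range(10):
    for X_B, y_B in zip(X.split(32), y.split(32)):  # mini-batches
        loss = ((y_B - fm(X_B)) ** 2).mean()        # loss estimate on the batch
        opt.zero_grad()
        loss.backward()   # only features present in X_B get non-zero gradients
        opt.step()
```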

Bayesian variational inference

Why do we prefer distributions over point estimates?

Variational inference

\only<1>{Approximate true posterior with an easier distribution (Gaussian)}

\only<2>{\begin{align}
\textnormal{Priors } & p(w_k) = \N(\nu^w_{g(k)}, 1/\lambda^w_{g(k)}) \qquad p(v_{kf}) = \N(\nu^{v,f}_{g(k)}, 1/\lambda^{v,f}_{g(k)})\\
\textnormal{Approx. posteriors } & q(w_k) = \N(\mu^w_k, (\sigma^w_k)^2) \qquad q(v_{kf}) = \N(\mu^{v,f}_k, (\sigma^{v,f}_k)^2)
\end{align}}

Idea: increasing the ELBO increases a lower bound on the objective $\log p(\y)$

\begin{align}
\log p(\y) & \geq \underbrace{\sum_{i = 1}^N \E_{q(\theta)} [\log p(y_i|x_i,\theta)] - \KL(q(\theta)||p(\theta))}_{\textrm{Evidence Lower Bound (ELBO)}}\\
& = \sum_{i = 1}^N \E_{q(\theta)} [ \log p(y_i|x_i,\theta) ] - \KL(q(w_0)||p(w_0)) - \sum_{k = 1}^K \KL(q(\theta_k)||p(\theta_k))
\end{align}

The ELBO needs to be rescaled for mini-batching (see the paper)
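For intuition, a minimal PyTorch sketch (a hypothetical one-weight regression model, not the paper's code) of a one-sample reparameterized ELBO estimate with a $\N(0, 1)$ prior:

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
# Variational posterior q(w) = N(mu, sigma^2) for one weight, prior p(w) = N(0, 1)
mu = torch.nn.Parameter(torch.zeros(1))
log_sigma = torch.nn.Parameter(torch.zeros(1))
prior = Normal(0.0, 1.0)

x, y = torch.tensor([1.0, 2.0, 3.0]), torch.tensor([1.1, 1.9, 3.2])  # toy data

q = Normal(mu, log_sigma.exp())
w = q.rsample()                                 # reparameterization trick
log_lik = Normal(w * x, 1.0).log_prob(y).sum()  # one-sample E_q[log p(y|x, w)]
elbo = log_lik - kl_divergence(q, prior).sum()
(-elbo).backward()                              # maximize ELBO = minimize -ELBO
```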

Graphically: Variational Factorization Machines

\centering

VFM training

\begin{algorithm}[H]
\begin{algorithmic}
\For {each batch $B \subseteq \{1, \ldots, N\}$}
  \State Sample $w_0 \sim q(w_0)$
  \For {each feature $k \in F(B)$ involved in batch $B$}
    \State Sample $S$ times $w_k \sim q(w_k)$, $\bm{v}_k \sim q(\bm{v}_k)$
  \EndFor
  \For {each feature $k \in F(B)$ involved in batch $B$}
    \State Update parameters $\mu_k^w, \sigma_k^w, \bm{\mu}_k^v, \bm{\sigma}_k^v$ to increase the ELBO estimate
  \EndFor
  \State Update hyper-parameters $\mu_0, \sigma_0, \nu, \lambda, \alpha$
  \State Keep a moving average of the parameters to compute mean predictions
\EndFor
\end{algorithmic}
\caption{Variational Training (SGVB) of FMs}
\label{algo-vfm}
\end{algorithm}
\vspace{-5mm}
Then $\sigma$ can be reused for preference elicitation (see how in the paper)
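In the same spirit, a compact sketch (our own toy data; one Monte-Carlo sample, fixed $\N(0,1)$ priors, no bias term or hyper-prior updates) of SGVB training of an FM, including the mini-batch rescaling of the KL term:

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
N, K, d = 256, 10, 5                              # toy sizes (illustrative)
X = (torch.rand(N, K) < 0.3).float()              # synthetic features
y = torch.randn(N)                                # synthetic ratings

# One mean and one log-std per bias w_k and per embedding entry v_kf
mu_w = torch.nn.Parameter(torch.zeros(K))
rho_w = torch.nn.Parameter(-3 * torch.ones(K))
mu_V = torch.nn.Parameter(0.01 * torch.randn(K, d))
rho_V = torch.nn.Parameter(-3 * torch.ones(K, d))
prior = Normal(0.0, 1.0)                          # simplifying assumption
opt = torch.optim.Adam([mu_w, rho_w, mu_V, rho_V], lr=0.01)

def fm(X, w, V):
    return X @ w + 0.5 * ((X @ V) ** 2 - (X ** 2) @ (V ** 2)).sum(1)

for X_B, y_B in zip(X.split(32), y.split(32)):
    q_w, q_V = Normal(mu_w, rho_w.exp()), Normal(mu_V, rho_V.exp())
    w, V = q_w.rsample(), q_V.rsample()           # one reparameterized sample (S = 1)
    log_lik = Normal(fm(X_B, w, V), 1.0).log_prob(y_B).sum()
    kl = kl_divergence(q_w, prior).sum() + kl_divergence(q_V, prior).sum()
    loss = -(log_lik - kl * len(X_B) / N)         # KL rescaled for mini-batching
    opt.zero_grad()
    loss.backward()
    opt.step()
```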

Stochastic weight averaging

A beneficial regularization: keep the weights from successive training epochs and average them.

Connections to Polyak-Ruppert averaging, a.k.a. stochastic weight averaging.
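A generic sketch of the idea (not tied to the paper's code): the average of the iterates can be maintained incrementally, so no weight history needs to be stored.

```python
import torch

torch.manual_seed(0)
w = torch.randn(10)                 # stands in for any model parameter
w_avg, n = torch.zeros_like(w), 0

for step in range(100):
    w = w - 0.01 * torch.randn(10)  # stand-in for an SGD update
    n += 1
    w_avg += (w - w_avg) / n        # incremental mean of the iterates

# Predict with w_avg (the averaged weights) rather than the last iterate w.
```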

Experiments on real data

\centering \begin{tabular}{llrrrr} \toprule
Task & Dataset & \#users & \#items & \#entries & Sparsity\\
\midrule
%fraction & 536 & 20 & 10720 & 0.000\\
%duolingo & 1213 & 2417 & 1064228 & 0.637\\
Regression & movie100k & 944 & 1683 & 100000 & 0.937\\
& movie1M & 6041 & 3707 & 1000209 & 0.955\\
% & movie10M & 69879 & 10678 & 10000054 & 0.987\\
\midrule
Classification & movie100 & 100 & 100 & 10000 & 0\\
& movie100k & 944 & 1683 & 100000 & 0.937\\
& movie1M & 6041 & 3707 & 1000209 & 0.955\\
& Duolingo & 1213 & 2416 & 1199732 & 0.828\\
\bottomrule \end{tabular}
\vspace{-5pt}

Models

Results on regression

:::::: {.columns}
::: {.column}
\vspace{1cm}
\centering
\begin{tabular}{ccc} \toprule
RMSE & Movie100k & Movie1M\\
\midrule
MCMC & \textbf{0.906} & \textbf{0.840}\\
\textbf{VFM} & \textbf{0.906} & 0.854\\
VBFM & 0.907 & 0.856\\
OVBFM & 0.912 & 0.846\\
\bottomrule \end{tabular}

\vspace{2cm}
\footnotetext{OVBFM is the online version (batch size = 1) of VBFM.}
:::
::: {.column}
:::
::::::

Results on classification

:::::: {.columns}
::: {.column}
\vspace{1cm}
\centering

| ACC  | Movie100k | Movie1M | Duolingo |
|------|-----------|---------|----------|
| MCMC | 0.717     | 0.739   | 0.848    |
| VFM  | 0.722     | 0.746   | 0.846    |
| VBFM | 0.692     | 0.732   | 0.842    |

\vspace{1cm}
\footnotetext{In the paper, we also report AUC and mean average precision.}
:::
::: {.column}
:::
::::::

Conclusion

Conclusion

Thanks for listening!

:::::: {.columns}
::: {.column}
VFM is implemented in TF & PyTorch\bigskip

$\E_{q(\theta)} [\log p(y_i|\bm{x}_i,\theta)]$ becomes
`outputs.log_prob(observed).mean()`

Same implementation for classification and regression: the only difference is the distribution (Bernoulli vs. Gaussian).
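For instance, a sketch assuming the PyTorch `torch.distributions` API (the names `y_pred` and `observed` are illustrative):

```python
import torch
from torch.distributions import Bernoulli, Normal

y_pred = torch.tensor([0.2, -0.4, 1.3])    # FM outputs for a toy batch
observed = torch.tensor([0.0, 1.0, 1.0])   # observed targets

outputs = Normal(y_pred, 1.0)              # regression; Bernoulli(logits=y_pred) for classification
loss = -outputs.log_prob(observed).mean()  # negative expected log-likelihood term
```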

\vspace{1cm}

Feel free to try it on GitHub (vfm.py):

github.com/jilljenn/vae
:::
::: {.column}
See more benchmarks on \href{https://github.com/mangaki/zero}{github.com/mangaki/zero}
:::
::::::