% Variational factorization machines for preference elicitation in large-scale recommender systems
% Jill-Jênn Vie¹; Tomas Rigaux¹; Hisashi Kashima²
% ¹ Inria Saclay, SODA team\newline ² Kyoto University

---
aspectratio: 169
handout: false
institute: \hfill \includegraphics[height=1cm]{figures/inria.png} \includegraphics[height=1cm]{figures/kyoto.png}
section-titles: false
theme: metropolis
header-includes:
  - \usepackage{bm}
  - \usepackage{multicol,booktabs}
  - \usepackage{algorithm,algpseudocode,algorithmicx}
  - \DeclareMathOperator\logit{logit}
  - \def\u{\bm{u}}
  - \def\v{\bm{v}}
  - \def\X{\bm{X}}
  - \def\x{\bm{x}}
  - \def\y{\bm{y}}
  - \def\E{\mathbb{E}}
  - \def\N{\mathcal{N}}
  - \def\Lb{\mathcal{L}_B}
  - \def\KL{\textnormal{KL}}
  - \metroset{sectionpage=none}
---
\centering
\begin{tabular}{ccccc}
& \includegraphics[height=2.5cm]{figures/1.jpg} & \includegraphics[height=2.5cm]{figures/2.jpg} & \includegraphics[height=2.5cm]{figures/3.jpg} & \includegraphics[height=2.5cm]{figures/4.jpg} \\
Satoshi & ? & 5 & 2 & ? \\
Kasumi & 4 & 1 & ? & 5 \\
Takeshi & 3 & 3 & 1 & 4 \\
Joy & 5 & ? & 2 & ?
\end{tabular}
\centering
\begin{tabular}{ccccc}
& \includegraphics[height=2.5cm]{figures/1.jpg} & \includegraphics[height=2.5cm]{figures/2.jpg} & \includegraphics[height=2.5cm]{figures/3.jpg} & \includegraphics[height=2.5cm]{figures/4.jpg} \\
Satoshi & \alert{3} & 5 & 2 & \alert{2} \\
Kasumi & 4 & 1 & \alert{4} & 5 \\
Takeshi & 3 & 3 & 1 & 4 \\
Joy & 5 & \alert{2} & 2 & \alert{5}
\end{tabular}
\centering
\includegraphics{figures/svd2.png}
Approximate the $n \times m$ rating matrix $R$ by learning an embedding for each user and each item
\[\left.\begin{array}{r} U \textnormal{ user embeddings } n \times d\\ V \textnormal{ item embeddings } m \times d \end{array}\right\} \textnormal{ such that } R \simeq \alert{UV^T}\]\begin{block}{Fit} Learn $U$ and $V$ to \alert{minimize} $ {||R - UV^T||}_2^2 + \lambda \cdot \textnormal{regularization} $ \end{block}
\begin{block}{Predict: Will user $i$ like item $j$?} Just compute $\langle \u_i, \v_j \rangle$ \end{block}
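As a toy sketch (made-up matrices and sizes, not the paper's code), prediction with the learned embeddings is a single matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 4, 4, 2
U = rng.normal(size=(n, d))   # user embeddings
V = rng.normal(size=(m, d))   # item embeddings

R_hat = U @ V.T               # all predicted ratings at once: R ~ U V^T
print(R_hat[0, 1])            # predicted rating of user 0 on item 1: <u_0, v_1>
```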
The actual model also contains bias terms for user $i$ and item $j$:
\[r_{ij} = \mu + \alert{w^u_i + w^v_j} + \langle \u_i, \v_j \rangle\]

If you know that user $i$ watched item $j$ at the \alert{cinema} (or on TV, or on a smartphone), how can we model it?
$r_{ij}$: rating of user $i$ on item $j$
\pause
\centering
\begin{tabular}{rrr|rrrr|rrr}
\toprule
\multicolumn{3}{c}{Users} & \multicolumn{4}{c}{Items} & \multicolumn{3}{c}{Formats} \\
\cmidrule{1-3} \cmidrule{4-7} \cmidrule{8-10}
$U_1$ & $U_2$ & $U_3$ & $I_1$ & $I_2$ & $I_3$ & $I_4$ & cinema & TV & mobile \\
\midrule
0 & \alert1 & 0 & 0 & \alert1 & 0 & 0 & 0 & \alert1 & 0 \\
0 & 0 & \alert1 & 0 & 0 & \alert1 & 0 & 0 & \alert1 & 0 \\
0 & \alert1 & 0 & 0 & 0 & \alert1 & 0 & \alert1 & 0 & 0 \\
0 & \alert1 & 0 & 0 & \alert1 & 0 & 0 & \alert1 & 0 & 0 \\
\alert1 & 0 & 0 & 0 & 0 & 0 & \alert1 & 0 & \alert1 & 0 \\
\bottomrule
\end{tabular}
\centering
Learn bias \alert{$w_k$} and embedding \alert{$\bm{v_k}$} for each feature $k$ such that: \(y(\bm{x}) = \mu + \underbrace{\sum_{k = 1}^K \alert{w_k} x_k}_{\textnormal{linear regression}} + \underbrace{\sum_{1 \leq k < l \leq K} x_k x_l \langle \alert{\bm{v_k}}, \alert{\bm{v_l}} \rangle}_{\textnormal{pairwise interactions}}\)
This model is for regression
For classification, use a link function such as the sigmoid/softmax or the Gaussian CDF (probit)
\small \fullcite{rendle2012factorization}
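For concreteness, here is a minimal NumPy sketch of $y(\x)$ (variable names are mine, not Rendle's code); the pairwise sum is computed with the standard $O(Kd)$ identity:

```python
import numpy as np

def fm_output(x, mu, w, V):
    """Factorization machine y(x) for one feature vector x of size K.

    mu: global bias, w: per-feature biases (K,), V: feature embeddings (K, d).
    Uses sum_{k<l} x_k x_l <v_k, v_l>
         = 0.5 * (||V^T x||^2 - sum_k x_k^2 ||v_k||^2).
    """
    linear = mu + w @ x
    projected = V.T @ x                     # shape (d,)
    pairwise = 0.5 * (projected @ projected - (x ** 2) @ (V ** 2).sum(axis=1))
    return linear + pairwise                # apply a sigmoid here for classification
```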
Take a batch $(\X_B, y_B)$ and update the parameters so as to decrease the error on that batch.
\begin{algorithm}[H] \begin{algorithmic} \For {each batch $\bm{X}_B, y_B$} \For {each feature $k$ involved in this batch $\bm{X}_B$} \State Update $w_k, \bm{v}_k$ to decrease the loss estimate $\mathcal{L}$ on $\bm{X}_B$ \EndFor \EndFor \end{algorithmic} \caption{SGD} \label{algo-sgd} \end{algorithm}
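A compact PyTorch sketch of this loop on a toy dense batch (illustrative names and made-up ratings, not the paper's implementation):

```python
import torch

class FM(torch.nn.Module):
    """Tiny dense factorization machine: global bias + linear + pairwise terms."""
    def __init__(self, n_features, d):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(1))
        self.w = torch.nn.Parameter(torch.zeros(n_features))
        self.V = torch.nn.Parameter(0.01 * torch.randn(n_features, d))

    def forward(self, X):                     # X: (batch, K) dense feature matrix
        proj = X @ self.V                     # (batch, d)
        pairwise = 0.5 * ((proj ** 2).sum(1) - (X ** 2) @ (self.V ** 2).sum(1))
        return self.mu + X @ self.w + pairwise

# Toy one-hot batch matching the (user, item, format) table above; ratings are made up
rows = [[1, 4, 8], [2, 5, 8], [1, 5, 7], [1, 4, 7], [0, 6, 8]]   # active feature ids
X = torch.zeros(5, 10)
for i, cols in enumerate(rows):
    X[i, cols] = 1.0
y = torch.tensor([5., 3., 3., 5., 4.])

model = FM(n_features=10, d=3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(100):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), y)  # loss estimate L on the batch
    loss.backward()    # features absent from the batch receive zero gradient
    optimizer.step()   # update mu, w_k, v_k
```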
\only<1>{Approximate true posterior with an easier distribution (Gaussian)}
\only<2>{\begin{align}
\textnormal{Priors } p(w_k) = \N(\nu^w_{g(k)}, 1/\lambda^w_{g(k)}) \qquad p(v_{kf}) = \N(\nu^{v,f}_{g(k)}, 1/\lambda^{v,f}_{g(k)}) \\
\textnormal{Approx. posteriors } q(w_k) = \N(\mu^w_k, (\sigma^w_k)^2) \qquad q(v_{kf}) = \N(\mu^{v,f}_k, (\sigma^{v,f}_k)^2)
\end{align}}
Idea: increasing the ELBO increases a lower bound on the objective $\log p(\y)$
\begin{align}
\log p(\y) & \geq \underbrace{\sum_{i = 1}^N \E_{q(\theta)} [\log p(y_i|x_i,\theta)] - \KL(q(\theta)||p(\theta))}_{\textrm{Evidence Lower Bound (ELBO)}} \\
& \quad = \sum_{i = 1}^N \E_{q(\theta)} [ \log p(y_i|x_i,\theta) ] - \KL(q(w_0)||p(w_0)) - \sum_{k = 1}^K \KL(q(\theta_k)||p(\theta_k))
\end{align}
The ELBO needs to be rescaled for mini-batching (see the paper)
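One common way to rescale, with $B$ the current batch (my paraphrase; the exact form is in the paper):
\[\Lb \approx \frac{N}{|B|} \sum_{i \in B} \E_{q(\theta)}[\log p(y_i \mid \x_i, \theta)] - \KL(q(\theta)\,\|\,p(\theta))\]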
\centering
\begin{algorithm}[H] \begin{algorithmic} \For {each batch $B \subseteq \{1, \ldots, N\}$} \State Sample $w_0 \sim q(w_0)$ \For {each feature $k \in F(B)$ involved in batch $B$} \State Sample $S$ times $w_k \sim q(w_k)$, $\bm{v}_k \sim q(\bm{v}_k)$ \EndFor \For {each feature $k \in F(B)$ involved in batch $B$} \State Update parameters $\mu_k^w, \sigma_k^w, \bm{\mu}_k^v, \bm{\sigma}_k^v$ to increase the ELBO estimate \EndFor \State Update hyperparameters $\mu_0, \sigma_0, \nu, \lambda, \alpha$ \State Keep a moving average of the parameters to compute mean predictions \EndFor \end{algorithmic} \caption{Variational Training (SGVB) of FMs} \label{algo-vfm} \end{algorithm} \vspace{-5mm} Then $\sigma$ can be reused for preference elicitation (see how in the paper)
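A simplified, self-contained PyTorch sketch of one SGVB step (single sample $S = 1$, standard-normal priors, no global bias; the paper's vfm.py differs):

```python
import torch
from torch.distributions import Normal, kl_divergence

K, d = 10, 3  # number of features, embedding dimension
# Variational parameters: means and log-stds of q(w_k) and q(v_kf)
mu_w = torch.zeros(K, requires_grad=True)
rho_w = torch.full((K,), -2., requires_grad=True)
mu_V = torch.zeros(K, d, requires_grad=True)
rho_V = torch.full((K, d), -2., requires_grad=True)
prior = Normal(0., 1.)  # simplest choice; the paper learns per-group priors
opt = torch.optim.Adam([mu_w, rho_w, mu_V, rho_V], lr=0.01)

def sgvb_step(X, y, N):
    """One SGVB update on a dense one-hot batch (X, y); N is the dataset size."""
    q_w, q_V = Normal(mu_w, rho_w.exp()), Normal(mu_V, rho_V.exp())
    w, V = q_w.rsample(), q_V.rsample()            # reparameterized samples
    proj = X @ V
    pred = X @ w + 0.5 * ((proj ** 2).sum(1) - (X ** 2) @ (V ** 2).sum(1))
    log_lik = Normal(pred, 1.).log_prob(y).mean() * N   # (N/|B|) * sum over batch
    kl = kl_divergence(q_w, prior).sum() + kl_divergence(q_V, prior).sum()
    loss = -(log_lik - kl)                         # negative ELBO estimate
    opt.zero_grad(); loss.backward(); opt.step()   # update mu_k, sigma_k
    return loss.item()
```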
A beneficial regularization: keep a running average of the weights over the training epochs and predict with it.
This connects to Polyak-Ruppert averaging, a.k.a. stochastic weight averaging
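For instance, with PyTorch's built-in helper (a generic sketch, not the paper's code):

```python
import torch
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Linear(10, 1)          # stand-in for the FM / VFM model
swa_model = AveragedModel(model)        # keeps a running mean of the weights
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    # ... one SGD / SGVB update of `model` would go here ...
    swa_model.update_parameters(model)  # averaged weights used for mean predictions
```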
\centering
\begin{tabular}{llrrrr}
\toprule
Task & Dataset & \#users & \#items & \#entries & Sparsity \\
\midrule
%fraction & 536 & 20 & 10720 & 0.000
%duolingo & 1213 & 2417 & 1064228 & 0.637
Regression & movie100k & 944 & 1683 & 100000 & 0.937 \\
& movie1M & 6041 & 3707 & 1000209 & 0.955 \\
% & movie10M & 69879 & 10678 & 10000054 & 0.987
\midrule
Classification & movie100 & 100 & 100 & 10000 & 0 \\
& movie100k & 944 & 1683 & 100000 & 0.937 \\
& movie1M & 6041 & 3707 & 1000209 & 0.955 \\
& Duolingo & 1213 & 2416 & 1199732 & 0.828 \\
\bottomrule
\end{tabular}
\vspace{-5pt}
:::::: {.columns}
::: {.column}
\vspace{1cm}
\centering
\begin{tabular}{ccc}
\toprule
RMSE & Movie100k & Movie1M \\
\midrule
MCMC & \textbf{0.906} & \textbf{0.840} \\
\textbf{VFM} & \textbf{0.906} & 0.854 \\
VBFM & 0.907 & 0.856 \\
OVBFM & 0.912 & 0.846 \\
\bottomrule
\end{tabular}
\vspace{2cm} \footnotetext{OVBFM is the online version (batch size = 1) of VBFM} ::: ::: {.column} ::: ::::::
:::::: {.columns} ::: {.column} \vspace{1cm} \centering
ACC | Movie100k | Movie1M | Duolingo |
---|---|---|---|
MCMC | 0.717 | 0.739 | 0.848 |
VFM | 0.722 | 0.746 | 0.846 |
VBFM | 0.692 | 0.732 | 0.842 |
\vspace{1cm} \footnotetext{In the paper, we also report AUC and mean average precision.}
::: ::: {.column} ::: ::::::
:::::: {.columns} ::: {.column} VFM is implemented in TF & PyTorch\bigskip
$\E_{q(\theta)} [\log p(y_i | \bm{x}_i,\theta)]$ becomes

`outputs.log_prob(observed).mean()`
The same implementation works for classification and regression: the only difference is the output distribution (Bernoulli vs. Gaussian)
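Concretely, with `torch.distributions` (tensor values made up for illustration):

```python
import torch
from torch.distributions import Normal, Bernoulli

pred = torch.tensor([3.2, 4.1])        # model outputs for two entries
observed = torch.tensor([3.0, 5.0])    # observed ratings

# Regression: Gaussian output distribution
outputs = Normal(pred, 1.0)
print(outputs.log_prob(observed).mean())

# Classification: Bernoulli on logits, observed labels in {0, 1}
outputs = Bernoulli(logits=pred)
print(outputs.log_prob(torch.tensor([1., 1.])).mean())
```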
\vspace{1cm}
Feel free to try it on GitHub (`vfm.py`):
github.com/jilljenn/vae ::: ::: {.column}
See more benchmarks on \href{https://github.com/mangaki/zero}{github.com/mangaki/zero} ::: ::::::