Jill-Jênn Vie

Researcher at Inria

% (Deep?) Factorization Machines\newline for Optimizing Human Learning
% Jill-Jênn Vie\newline RIKEN Center for Advanced Intelligence Project
% April 16, 2018

---
header-includes:
  - \usepackage{booktabs}
  - \usepackage{multirow}
  - \usepackage{biblatex}
  - \addbibresource{biblio.bib}
  - \usepackage{bm,bbm}
  - \def\R{\mathbf{R}}
biblio-style: authoryear
handout: true
suppress-bibliography: true
nocite: |
  @Vie2018
---

Tokyo lights by night (Shibuya)

\includegraphics[width=\linewidth]{figures/shibuya.jpg}

Tokyo cherry blossoms by day (Shinjuku Gyoen)

\includegraphics[width=\linewidth]{figures/hanami.jpg}

#

\includegraphics[width=\linewidth]{figures/aip.jpg}

Teams (Tokyo, Nihonbashi)

Research interests

Modeling data that comes from humans

Terminology:

Context: Educational Data Mining

How can we use these logs to optimize the sequence of exercises?
(user $i$ attempted item $j$ and got it correct or incorrect)
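Such logs can be viewed as a list of triplets; a toy example (made-up values, not from the real datasets used later):

```python
# Each interaction: (user id, item id, 1 if correct else 0).
logs = [
    (0, 12, 1),  # user 0 solved item 12
    (0,  7, 0),  # user 0 failed item 7
    (1, 12, 0),  # user 1 failed item 12
]
```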

\pause This talk [@Vie2018]:

Collaborative filtering

\includegraphics[width=\linewidth]{figures/cf.jpg}

Sparse data

Feature extraction

\includegraphics[width=\linewidth]{figures/svd.png}

Feature extraction

\includegraphics[width=\linewidth]{figures/svd2.png}

What about educational data?

\centering \includegraphics[width=0.5\linewidth]{figures/mirt-here.png}

Interpreting the components

\centering \includegraphics[width=\linewidth]{figures/inter1.jpg}

Interpreting the components

\centering \includegraphics[width=\linewidth]{figures/inter2.jpg}

Useful for providing \alert{feedback} to the user

A first simple, yet reliable model: Item Response Theory

\[\Pr(R_{ij} = 1) = \sigma(\alert{\theta_i} - \alert{d_j})\]

where $\theta_i$ is the ability of user $i$, $d_j$ the difficulty of item $j$, and $\sigma$ the logistic function.
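As a minimal numeric sketch (the parameter values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irt_proba(theta_i, d_j):
    """Rasch/IRT model: probability that user i answers item j correctly."""
    return sigmoid(theta_i - d_j)

print(irt_proba(theta_i=1.2, d_j=0.5))  # ~0.67: strong user, medium item
```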

Training

How to add side information?

Usually, collaborative filtering:

\[r_{ij} = \mu + \mu_{ui} + \mu_{vj} + \bm{u_i}^T \bm{v_j}\]

How do we model whether the movie was seen on TV or at the cinema?

\pause

\[r_{ij} = \mu + \mu_{ui} + \mu_{vj} + \mu_{TV} + \bm{u_i}^T \bm{v_j} + \bm{u_i}^T \bm{w_{TV}} + \bm{v_j}^T \bm{w_{TV}}\]

Factorization machines [@rendle2012factorization]

\[y_{FM}(\bm{x}) = \mu + \sum_{k = 1}^N \alert{w_k} x_k + \sum_{1 \leq k < l \leq N} x_k x_l \alert{\bm{v_k}^T} \alert{\bm{v_l}}\]
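This prediction can be transcribed directly in NumPy (a minimal sketch, not the libFM implementation used in the experiments), using the classic $O(Nd)$ rewriting of the pairwise sum:

```python
import numpy as np

def y_fm(x, mu, w, V):
    """FM prediction: bias + linear terms + pairwise interactions.

    x: (N,) features, w: (N,) weights, V: (N, d) embeddings v_k.
    """
    xv = V.T @ x  # sum_k x_k v_k, shape (d,)
    # sum_{k<l} x_k x_l <v_k, v_l>
    #   = (||sum_k x_k v_k||^2 - sum_k x_k^2 ||v_k||^2) / 2
    pairwise = 0.5 * (xv @ xv - ((V ** 2).T @ (x ** 2)).sum())
    return mu + w @ x + pairwise
```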

Special case: collaborative filtering

If $\bm{x} = (\mathbf{1}_i^n, \mathbf{1}_j^m)$ where $\mathbf{1}_i^n$ is the $n$-dimensional one-hot vector with a 1 in position $i$:

\begin{align}
y_{FM}(\bm{x}) & = \mu + w_i + w_{n + j} + \bm{v_i}^T \bm{v_{n + j}}\\
r_{ij} & = \mu + \mu_{ui} + \mu_{vj} + \bm{u_i}^T \bm{v_j}
\end{align}
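Concretely, the input is just two concatenated one-hot blocks; a toy encoding (sizes made up), which can be fed to the `y_fm` sketch above:

```python
import numpy as np

n_users, n_items = 5, 3
i, j = 2, 1  # user 2 attempts item 1

x = np.zeros(n_users + n_items)
x[i] = 1            # user one-hot block
x[n_users + j] = 1  # item one-hot block
# Only w_i, w_{n+j} and the pair (v_i, v_{n+j}) contribute,
# recovering the matrix-factorization prediction above.
```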

Example of FM with educational data

\centering \includegraphics[width=\linewidth]{figures/fm-poster.jpg}

Rows are instances $\bm{x}$, columns are entities $k$
Roses are red, violets are blue

Item Response Theory is a FM

\[y_{FM}(\bm{x}) = \mu + \sum_{k = 1}^N \alert{w_k} x_k + \sum_{1 \leq k < l \leq N} x_k x_l \alert{\bm{v_k}^T} \alert{\bm{v_l}}\]

If the embedding dimension $d = 0$ (so the pairwise term vanishes) and $\bm{x} = (\mathbf{1}_i^n, \mathbf{1}_j^m)$ for $n$ users, $m$ items:

\[\sigma(y_{FM}(\bm{x})) = \sigma(w_i + w_{n + j}) = \sigma(\theta_i - d_j)\]

where $\bm{w} = (\bm{\theta}, -\bm{d})$ (concatenation).

We made similar observations for other educational data mining models.

Training of FMs

\[y_{FM}(\bm{x}) = \mu + \sum_{k = 1}^N \alert{w_k} x_k + \sum_{1 \leq k < l \leq N} x_k x_l \alert{\bm{v_k}^T} \alert{\bm{v_l}}\]

\begin{align}
w_k, v_{kf} & \sim \mathcal{N}(\mu, 1/\lambda)\\
\mu & \sim \mathcal{N}(0, 1)\\
\lambda & \sim \Gamma(1, 1).
\end{align}

$\Phi = \textnormal{probit}$[^1], so Gibbs sampling can be used [@freudenthaler2011bayesian].
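In practice this is what libFM's MCMC solver does. A hedged sketch with the pywFM wrapper (assuming libFM is installed and `LIBFM_PATH` is set; `X_train`, `y_train`, `X_test`, `y_test` are placeholder sparse feature matrices and 0/1 labels, and the hyperparameter values are illustrative, not the paper's):

```python
import pywFM  # thin Python wrapper around libFM

# MCMC (Gibbs sampling) is libFM's default learning method.
fm = pywFM.FM(task='classification', num_iter=300, k2=20)
model = fm.run(X_train, y_train, X_test, y_test)
print(model.predictions[:5])  # predicted success probabilities
```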

Experiments

Datasets

\small \input{tables/datasets}

In the Berkeley dataset, each user makes 2 attempts per item on average.

Entries

Better models found

FMs match or outperform other models

\tiny \input{tables/summary-poster}

\normalsize

Results Berkeley

\vspace{1cm} \small \includegraphics[width=\linewidth]{figures/berkeley0-poster.pdf}

Results Assistments

\vspace{1cm} \includegraphics[width=\linewidth]{figures/assistments0-poster.pdf}

Results Berkeley

\input{tables/berkeley0-table-poster}

DeepFM? [@guo2017deepfm]

Combine output of FM with DNN:

\[\hat{y} = \sigma(y_{FM} + y_{DNN})\]

Inspired by Wide and Deep Learning [@cheng2016wide].
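A minimal sketch of the combination step, reusing `sigmoid` and `y_fm` from the sketches above; `dnn` is a hypothetical MLP callable, and note that the real DeepFM feeds the concatenated per-field embeddings to the deep part, whereas this sketch simply sums the active embeddings:

```python
def deepfm_predict(x, mu, w, V, dnn):
    """DeepFM-style output: sigma(y_FM(x) + y_DNN(x)).

    The two components share the same embedding matrix V.
    """
    shared = V.T @ x  # pooled embeddings fed to the deep component
    return sigmoid(y_fm(x, mu, w, V) + dnn(shared))
```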

FM component

\includegraphics[width=\linewidth]{figures/fm.jpg}

Deep component

\includegraphics[width=\linewidth]{figures/deep.jpg}

DeepFM

\includegraphics[width=\linewidth]{figures/monster.jpg}

Discussion

Next steps

Take-away message

Thanks for your attention!

Please come to our workshop in Montréal on June 12:
\alert{Optimizing Human Learning}
https://humanlearn.io

\footnotesize
\begin{thebibliography}{9}
\bibitem{Vie2018}
\fullcite{Vie2018}
\bibitem{rendle2012factorization}
\fullcite{rendle2012factorization}
\end{thebibliography}

[^1]: Cumulative distribution function of the standard normal distribution.