% Learning representations that evolve over time
% JJ Vie
% November 22, 2019

---
aspectratio: 169
handout: true
theme: Frankfurt
section-titles: false
header-includes:
  - \usepackage{bm}
  - \newcommand\mycite[3]{\textcolor{blue}{#1} “#2”.~#3.}
---
Ex. skill at chess, calculus skills, preferences
that evolve over time
and optimize human learning
Elo ratings are updated after each match
If player 1 (550) beats player 2 (600)
Then player 1 will $\uparrow$ (560) and player 2 will $\downarrow$ (590)
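For concreteness, a minimal Python sketch of this update (standard 400-point logistic scale; the value $K = 20$ here is an assumption that roughly reproduces the numbers above):

```python
def expected_score(r_a, r_b, scale=400):
    """Probability that A beats B under the logistic Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / scale))

def elo_update(r_a, r_b, outcome, k=20):
    """Updated ratings after a match; outcome = 1 if A won, 0 if A lost."""
    p = expected_score(r_a, r_b)
    return r_a + k * (outcome - p), r_b - k * (outcome - p)

# Player 1 (550) beats player 2 (600): ratings move toward ~560 / ~590
print(elo_update(550, 600, outcome=1))  # ≈ (561.4, 588.6)
```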
\centering
(The Social Network)
\centering
(Not The Social Network)
Used in PISA, GMAT, Pix.
Given outcomes $r \in \{0, 1\}$, how to estimate $\theta$?
$p = \frac1{1 + e^{-(\theta - d)}} = \sigma(\theta - d)$
Thanks to the logistic function: $p' = p(1 - p)$
$L(\theta) = \log p^r (1 - p)^{1 - r} = r \log p + (1 - r) \log (1 - p)$
$\nabla_\theta L = \frac{\partial L}{\partial \theta} = r - p$
$\theta_{t + 1} = \theta_t + \gamma \underbrace{\nabla_\theta L}_{r - p}$
\pause
Thus it is \alert{online gradient ascent}! K-factor = $\gamma$ = learning rate.\bigskip
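A tiny sketch of that equivalence: one online gradient-ascent step on the log-likelihood, where the gradient is just $r - p$ (values below are illustrative; $\gamma$ plays the role of the K-factor):

```python
import math

def sigma(x):
    return 1 / (1 + math.exp(-x))

def sgd_step(theta, d, r, gamma=0.1):
    """One online gradient-ascent step on L = r log p + (1-r) log(1-p)."""
    p = sigma(theta - d)            # predicted probability of success
    return theta + gamma * (r - p)  # gradient of L w.r.t. theta is r - p

theta, d = 0.0, 0.5    # learner ability, item difficulty
theta = sgd_step(theta, d, r=1)  # a success pushes theta up by gamma * (1 - p)
print(theta)
```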
The chess statistician Jeff Sonas believes that Elo's original value $K = 10$ (for players rated above 2400) is inaccurate.
Players' ratings increase as they win matches against other players
So players may adopt an optimistic strategy when planning their matches
\pause
(If you want to write this paper, I’m in A09.)
\centering
\begin{tabular}{ccccc}
& \includegraphics[height=2.5cm]{figures/1.jpg} & \includegraphics[height=2.5cm]{figures/2.jpg} & \includegraphics[height=2.5cm]{figures/3.jpg} & \includegraphics[height=2.5cm]{figures/4.jpg}\\
Sacha & \only<2>{\alert{3}} & 5 & 2 & \only<2>{\alert{2}}\\
Ondine & 4 & 1 & \only<2>{\alert{4}} & 5\\
Pierre & 3 & 3 & 1 & 4\\
Joëlle & 5 & \only<2>{\alert{2}} & 2 & \only<2>{\alert{5}}
\end{tabular}
From some observed entries $R_{ij}$, infer the other $R_{ij}$
Learn model $U, V$ such that $R \simeq UV \quad \widehat{r_{ij}} = \langle \bm{u}_i, \bm{v}_j \rangle$
Optimize regularized least squares $\sum_{i, j} (\widehat{r_{ij}} - r_{ij})^2 + \lambda (||U||^2_F + ||V||^2_F)$
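A minimal SGD sketch for this objective, on toy hypothetical ratings (a real system would use a dedicated library):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 4, 4, 2
lam, gamma = 0.1, 0.05
U = rng.normal(scale=0.1, size=(n_users, d))
V = rng.normal(scale=0.1, size=(n_items, d))

# Observed entries (i, j, r_ij) of the rating matrix R (toy data)
ratings = [(0, 0, 3), (0, 1, 5), (1, 2, 4), (3, 3, 5)]

for epoch in range(200):
    for i, j, r in ratings:
        u, v = U[i].copy(), V[j].copy()
        err = u @ v - r                      # \widehat{r_ij} - r_ij
        U[i] -= gamma * (err * v + lam * u)  # gradient step on U
        V[j] -= gamma * (err * u + lam * v)  # gradient step on V

print(U @ V.T)  # reconstructed \widehat{R}
```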
Learn model $U, V$ such that $R \simeq \sigma(UV) \quad \widehat{r_{ij}} = \sigma(\langle \bm{u}_i, \bm{v}_j \rangle)$
Optimize likelihood
EM algorithm via MCMC: sample $U$, optimize $V$ (Cai, 2010)
Slow, $d \leq 6$
$\bm{x}$ concatenation of one-hot vectors \only<3->{(ex. at positions $s$ and $t$)}
$\langle \bm{w}, \bm{x} \rangle = \sum_i w_i x_i \only<3->{= w_s + w_t}$
$\| V\bm{x} \|^2 = \sum_{\alert{i, j}} x_i x_j \langle \bm{v}_i, \bm{v}_j \rangle \geq 0$
\pause
$\frac12 \left( \| V\bm{x} \|^2 - \mathbf{1}^T (V \circ V) (\bm{x} \circ \bm{x}) \right) = \sum_{\alert{i < j}} x_i x_j \langle \bm{v}_i, \bm{v}_j \rangle \only<3->{= \langle \bm{v}_s, \bm{v}_t \rangle}$
\pause \pause
Factorization machines (Rendle 2012)
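A quick numerical check of the identity above, assuming the columns of $V$ are the $\bm{v}_i$: the pairwise sum costs $O(nd)$ instead of $O(n^2 d)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
V = rng.normal(size=(d, n))  # column i is the embedding v_i
x = rng.normal(size=n)       # feature vector (one-hot concatenations in practice)

# Naive O(n^2 d) sum over pairs i < j
naive = sum(x[i] * x[j] * (V[:, i] @ V[:, j])
            for i in range(n) for j in range(i + 1, n))

# FM trick: O(n d)
fast = 0.5 * ((V @ x) @ (V @ x) - ((V * V) @ (x * x)).sum())

assert np.isclose(naive, fast)
```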
$P(\langle \bm{x}, \bm{v}_i \rangle)$ for a polynomial $P$
The Blondel Trilogy
$\theta_{t + 1} = \theta_t - \gamma \nabla_\theta \mathcal{L} \Rightarrow$ Replace $\nabla_\theta \mathcal{L}$ with an unbiased estimate $\tilde\nabla_\theta \mathcal{L}$

\centering

\includegraphics[width=0.7\linewidth]{figures/cfirt.pdf}
\centering
<!--
$enc : \mathbf{R}^n \to \mathbf{R}^d$, $\bm{u} = enc(\bm{r})$
$dec : \mathbf{R}^d \to \mathbf{R}^n$, $\bm{r} = dec(\bm{u})$
$\bm{\mu}, \bm{\sigma} = enc(\bm{r})$
$u \sim \mathcal{N}(\mu, \sigma)$
$r = dec(u)$
-->
The decoder is the response model $P(X_j = 1) = \sigma(\langle \bm{z}, \bm{v}_j \rangle)$
The encoder is an approximate maximum likelihood estimator
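A minimal numpy sketch of one forward pass of such an encoder/decoder pair (the linear encoder and all shapes are assumptions; training the actual VAE requires a framework such as PyTorch):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 2                                    # n questions, latent dim d
W_enc = rng.normal(scale=0.1, size=(2 * d, n))  # linear encoder (an assumption)
V = rng.normal(scale=0.1, size=(n, d))          # item embeddings v_j

def encode(r):
    """Approximate posterior: responses r -> (mu, sigma)."""
    h = W_enc @ r
    return h[:d], np.exp(h[d:])                 # sigma > 0 via exp

def decode(z):
    """Response model: P(X_j = 1) = sigma(<z, v_j>)."""
    return 1 / (1 + np.exp(-V @ z))

r = rng.integers(0, 2, size=n).astype(float)    # one student's 0/1 outcomes
mu, sig = encode(r)
z = mu + sig * rng.normal(size=d)               # reparameterization trick
print(decode(z))                                # predicted success probabilities
```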
Observe outcomes of students over questions:
$i: (q_t, a_t)_t = (j_t, r_{ij_t})_{0 \leq t \leq T}$
How to predict future performance of students, given the questions asked to them $(q_t)_t$?
Evolution model of $\bm{u}_T$ over time
$\bm{u}_T = enc_\theta\left((q_t, a_t)_{0 \leq t \leq T}\right)$
$\widehat{r_{ij_T}} = \sigma(\langle \bm{u}_T, \bm{v}_{j_T} \rangle)$
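One possible instantiation of $enc_\theta$, sketched in PyTorch: a GRU over embedded $(q_t, a_t)$ pairs (a hypothetical architecture, not necessarily the one used here):

```python
import torch
import torch.nn as nn

n_items, d = 100, 20

class SequenceEncoder(nn.Module):
    """enc_theta: embeds (q_t, a_t) pairs and summarizes them with a GRU."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(2 * n_items, d)  # (question, answer) -> id
        self.gru = nn.GRU(d, d, batch_first=True)
        self.V = nn.Embedding(n_items, d)          # item embeddings v_j

    def forward(self, q, a, next_q):
        x = self.embed(q * 2 + a)                  # encode (q_t, a_t) jointly
        _, u_T = self.gru(x)                       # u_T: final hidden state
        logits = (u_T.squeeze(0) * self.V(next_q)).sum(-1)
        return torch.sigmoid(logits)               # \widehat{r_{i j_T}}

model = SequenceEncoder()
q = torch.tensor([[3, 7, 7, 12]])  # questions asked so far
a = torch.tensor([[1, 0, 1, 1]])   # binary outcomes
print(model(q, a, next_q=torch.tensor([5])))
```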
\centering
\includegraphics[width=0.5\linewidth]{figures/anki.png}\includegraphics[width=0.5\linewidth]{figures/leitner.png}
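For intuition, a toy Leitner-style scheduler with queues of increasing review intervals (a generic sketch, not the model cited below):

```python
from collections import deque

class LeitnerScheduler:
    """Cards move up a queue after a success, back to queue 0 after a failure."""
    def __init__(self, n_queues=5):
        self.queues = [deque() for _ in range(n_queues)]
        self.intervals = [1, 2, 4, 8, 16]  # review delay in days per queue

    def add(self, card):
        self.queues[0].append((card, 0))

    def review(self, card, level, success):
        new_level = min(level + 1, len(self.queues) - 1) if success else 0
        self.queues[new_level].append((card, new_level))
        return self.intervals[new_level]   # next review in this many days

sched = LeitnerScheduler()
sched.add("chain rule")
card, level = sched.queues[0].popleft()
print(sched.review(card, level, success=True))  # -> 2 (days)
```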
Our solution (implemented using queues):
Learn using logistic regression or FMs:
\scriptsize \mycite{Benoît Choffin, Fabrice Popineau, Yolaine Bourda, and Jill-Jênn Vie (2019)}{DAS3H: Modeling Student Learning and Forgetting for Optimally Scheduling Distributed Practice of Skills}{Best Paper Award at EDM 2019}
\centering
\alert{Maximize information} $\rightarrow$ learners fail 50% of the time (good for the examiner, not for the learners)
\pause
\alert{Maximize success rate} $\rightarrow$ we ask too easy questions
\pause
\alert{Maximize the growth of success rate} (see the sketch after this list)
Multi-Armed Bandits for Intelligent Tutoring Systems (Clement et al., JEDM 2015)
\pause
\alert{Identify a learner's knowledge gap as soon as possible}
Rotting bandits are not harder than stochastic ones (Seznec et al., AISTATS 2019)
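A toy sketch in the spirit of the growth-of-success-rate objective: pick the activity whose recent success rate grows fastest (a heavy simplification of the bandit strategies cited above; all names and parameters are illustrative):

```python
import random
from collections import defaultdict, deque

recent = defaultdict(lambda: deque(maxlen=10))  # last outcomes per activity

def learning_progress(activity):
    """Recent success rate minus older success rate (toy estimate)."""
    h = list(recent[activity])
    if len(h) < 4:
        return float("inf")  # force exploration of rarely-seen activities
    half = len(h) // 2
    return sum(h[half:]) / (len(h) - half) - sum(h[:half]) / half

def choose(activities, eps=0.1):
    if random.random() < eps:
        return random.choice(activities)  # keep exploring
    return max(activities, key=learning_progress)

activities = ["fractions", "decimals", "percentages"]
a = choose(activities)
recent[a].append(1)  # record the observed outcome (1 = success)
```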