---
title: Modeling uncertainty for policy learning in education
author: Jill-Jênn Vie¹; Tomas Rigaux¹; Koh Takeuchi²; Hisashi Kashima²
date: ¹ Inria Saclay, SODA team\newline ² Kyoto University
aspectratio: 169
handout: false
institute: \hfill \includegraphics[height=1cm]{figures/inria.png} \includegraphics[height=1cm]{figures/kyoto.png}
section-titles: false
theme: metropolis
header-includes:
  - \usepackage{bm}
  - \usepackage{multicol,booktabs}
  - \usepackage{algorithm,algpseudocode,algorithmicx}
  - \DeclareMathOperator\argmax{argmax}
  - \DeclareMathOperator\logit{logit}
  - \def\bm{\boldsymbol}
  - \def\u{\bm{u}}
  - \def\v{\bm{v}}
  - \def\X{\bm{X}}
  - \def\x{\bm{x}}
  - \def\y{\bm{y}}
  - \def\E{\mathbb{E}}
  - \def\F{\mathcal{F}}
  - \def\I{\mathcal{I}}
  - \def\N{\mathcal{N}}
  - \def\Lb{\mathcal{L}_B}
  - \def\KL{\textnormal{KL}}
  - \metroset{sectionpage=none}
---
Positive prompts: \includegraphics{anime} \hfill Negative prompts: world cup, world history
\centering \pause
Psychometrics (50s-60s): measuring \alert{latent constructs $\theta$} of examinees
given their \alert{observed answers $X$} to a questionnaire
Led to computerized adaptive tests (70s)
\centering
Learner $p(X | \theta)$

Teacher $\pi(j | X)$ based on estimate $q(\theta | X)$

$X = {\left(\underbrace{j_t}_{\in \{1, \ldots, M\}},\ \underbrace{r_{ij_t}}_{\in \{0, 1\}}\right)}_t$
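A minimal sketch of this loop (a toy simulation with a Rasch-style learner and a random placeholder policy; `theta`, the difficulties `d` and the loop length are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.8                          # hidden ability of the simulated learner
d = rng.normal(size=20)              # difficulty of each of M = 20 items

X = []                               # X = ((j_1, r_1), (j_2, r_2), ...)
for t in range(10):
    j = rng.integers(len(d))         # placeholder policy pi(j | X)
    p_success = 1 / (1 + np.exp(-(theta - d[j])))  # learner model p(X_j = 1 | theta)
    r = int(rng.random() < p_success)              # observed response in {0, 1}
    X.append((j, r))
```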
\centering
Green: interactive training set \hfill Red: validation set
Postdoc: assume $\theta$ can evolve and build a better model $p(X | \theta)$
\centering
:::::: {.columns} ::: {.column width="15%"} ::: ::: {.column width="15%"} ::: ::: {.column width="70%"} \(p(x_{1:N}, z_{1:N}) = p(z_1) \prod_{n = 2}^N p(z_n|z_{n - 1}) \prod_{n = 1}^N p(x_n|z_n)\)
\(p(V) = \prod_{v \in V} p(v|\textnormal{parents}(v))\) ::: ::::::
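For instance, the HMM factorization can be evaluated directly (toy transition and emission tables, chosen arbitrarily):

```python
import numpy as np

pi0 = np.array([0.6, 0.4])                # p(z_1)
A = np.array([[0.7, 0.3], [0.2, 0.8]])    # transition p(z_n | z_{n-1})
B = np.array([[0.9, 0.1], [0.3, 0.7]])    # emission p(x_n | z_n)

def joint(z, x):
    """p(x_{1:N}, z_{1:N}) = p(z_1) prod_n p(z_n|z_{n-1}) prod_n p(x_n|z_n)"""
    p = pi0[z[0]]
    for n in range(1, len(z)):
        p *= A[z[n - 1], z[n]]            # chain of transitions
    return p * np.prod(B[z, x])           # multiply in all emission terms

print(joint(z=np.array([0, 0, 1]), x=np.array([0, 1, 1])))
```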
\centering
\raggedright
Modeling learning using HMM
or LSTM: $\hat\theta = f_\psi(X_{1:t}) = f_\psi\left(\left\{(j_t, r_{ij_t})\right\}_t\right)$
i.e. deep knowledge tracing (Piech et al., NeurIPS 2015)
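For intuition, a minimal DKT-style sketch in PyTorch (the one-hot encoding of $(j_t, r_t)$ pairs and the layer sizes follow the usual DKT setup; this is an illustration, not Piech et al.'s exact architecture):

```python
import torch
import torch.nn as nn

class DKT(nn.Module):
    """An LSTM reads the (item, response) history and outputs, at each step,
    the probability of success on each of the M items."""
    def __init__(self, n_items, hidden_size=64):
        super().__init__()
        # A one-hot vector over 2 * n_items encodes the (j_t, r_t) pair.
        self.lstm = nn.LSTM(2 * n_items, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, n_items)

    def forward(self, x):                  # x: (batch, time, 2 * n_items)
        h, _ = self.lstm(x)                # h plays the role of theta_hat
        return torch.sigmoid(self.out(h))  # success probability per item

model = DKT(n_items=20)
x = torch.zeros(1, 5, 40)                  # a dummy history of 5 interactions
probs = model(x)                           # shape (1, 5, 20)
```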
We want to maximize the (log-)likelihood: $\argmax_\theta \log p(X|\theta)$ or $\E_{p(\theta)} LL(X|\theta)$

i.e. find the zeroes of its gradient $\nabla_\theta LL$ or move in the gradient's direction

Property: $\E_{p(X | \theta)} \nabla_\theta LL = 0$ at the true parameter $\theta$

If $Var_{p(X | \theta)}(\nabla_\theta LL)$ is low, the observation brings little information; this variance is the \alert{Fisher information} $\I(\theta)$

Interesting because of the Cramér-Rao bound: $Var(\hat\theta) \geq \I(\theta)^{-1}$
Another index for choosing a question (Chang and Ying, 1996):
\[KLI(\theta) = \int_{B(\theta, r_n)} \KL_{X}(\theta_0||\theta) \,d\theta_0 = \int_{B(\theta, r_n)} \E_{p(X|\theta_0)} \log \frac{p(X|\theta_0)}{p(X|\theta)} \,d\theta_0 \qquad r_n \to 0\]

Taking the simple model $p_j \triangleq p(X_j = 1 | \theta) = \sigma(\theta - d_j)$:
$\nabla_\theta LL = X_j - p_j$
$\I(\theta) = -\E_{p(X | \theta)} \nabla^2_\theta LL = p_j (1 - p_j)$
So in the scalar case, the item of maximum Fisher information is the one with success probability \alert{closest to 1/2}, given the current maximum-likelihood estimate.
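A sketch of this selection rule for the simple model above (the item difficulties are made-up values):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def pick_max_fisher_item(theta_hat, difficulties):
    """Select the item maximizing Fisher information I_j(theta) = p_j (1 - p_j)
    under the model p_j = sigma(theta - d_j)."""
    p = sigmoid(theta_hat - np.asarray(difficulties))
    info = p * (1 - p)              # scalar Fisher information per item
    return int(np.argmax(info))     # equivalently: p_j closest to 1/2

# With theta_hat = 0, success probabilities are ~0.88, ~0.52, ~0.18:
print(pick_max_fisher_item(0.0, [-2.0, -0.1, 1.5]))  # -> 1
```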
\alert{Maximize information} $\to$ learners fail 50% of the time (good for the examiner, not for the learners)
\alert{Minimize uncertainty} i.e. the entropy of $p(\theta | X)$
\alert{Maximize success rate} $\to$ we ask too easy questions
\alert{Maximize the growth of success rate} $\to$ Clement et al. (JEDM 2015) use $\varepsilon$-greedy, where the reward is the success rate over the $k$ latest attempts minus the success rate over the $k$ previous attempts.

What we are interested in: measuring the learner's knowledge precisely without making them fail too much.
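A sketch of that reward and the $\varepsilon$-greedy choice (our reading of Clement et al.; the values of `k`, `eps` and the neutral reward for short histories are arbitrary assumptions):

```python
import numpy as np

def learning_progress(successes, k=5):
    """Success rate over the k latest attempts minus success rate
    over the k previous attempts (reward of Clement et al., JEDM 2015)."""
    if len(successes) < 2 * k:
        return 0.0                  # assumption: neutral reward early on
    return np.mean(successes[-k:]) - np.mean(successes[-2 * k:-k])

def pick_activity(rewards, eps=0.2, rng=np.random.default_rng()):
    """eps-greedy over activities: explore with probability eps, else exploit."""
    if rng.random() < eps:
        return int(rng.integers(len(rewards)))
    return int(np.argmax(rewards))
```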
\centering
\begin{tabular}{ccccc}
& \includegraphics[height=2.5cm]{figures/1.jpg} & \includegraphics[height=2.5cm]{figures/2.jpg} & \includegraphics[height=2.5cm]{figures/3.jpg} & \includegraphics[height=2.5cm]{figures/4.jpg}\\
Satoshi & ? & 5 & 2 & ?\\
Kasumi & 4 & 1 & ? & 5\\
Takeshi & 3 & 3 & 1 & 4\\
Joy & 5 & ? & 2 & ?\\
\end{tabular}
\centering
\begin{tabular}{ccccc}
& \includegraphics[height=2.5cm]{figures/1.jpg} & \includegraphics[height=2.5cm]{figures/2.jpg} & \includegraphics[height=2.5cm]{figures/3.jpg} & \includegraphics[height=2.5cm]{figures/4.jpg}\\
Satoshi & \alert{3} & 5 & 2 & \alert{2}\\
Kasumi & 4 & 1 & \alert{4} & 5\\
Takeshi & 3 & 3 & 1 & 4\\
Joy & 5 & \alert{2} & 2 & \alert{5}\\
\end{tabular}
\centering
\centering
\includegraphics{figures/svd2.png}
Approximate the $n \times m$ rating matrix $R$ by learning embeddings for users and items

\[\left.\begin{array}{r} U \textnormal{ user embeddings } n \times d\\ V \textnormal{ item embeddings } m \times d \end{array}\right\} \textnormal{ such that } R \simeq \alert{UV^T}\]

\begin{block}{Fit}
Learn $U$ and $V$ to \alert{minimize} ${||R - UV^T||}_2^2 + \lambda \cdot \textnormal{regularization}$
\end{block}
\begin{block}{Predict: Will user $i$ like item $j$?} Just compute $\langle \u_i, \v_j \rangle$ \end{block}
The actual model also contains bias terms for user $i$ and item $j$
\[r_{ij} = \mu + \alert{w^u_i + w^v_j} + \langle \u_i, \v_j \rangle\]

If you know user $i$ watched item $j$ at the \alert{cinema} (or on TV, or on a smartphone), how to model it?
$r_{ij}$: rating of user $i$ on item $j$
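A numpy sketch of this biased model with one SGD step (toy sizes; the initialization scale, learning rate and regularization strength are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 4, 4, 2                              # users, items, embedding dim
U = rng.normal(scale=0.1, size=(n, d))         # user embeddings
V = rng.normal(scale=0.1, size=(m, d))         # item embeddings
w_u, w_v, mu = np.zeros(n), np.zeros(m), 3.0   # biases and global mean

def predict(i, j):
    """r_ij = mu + w^u_i + w^v_j + <u_i, v_j>"""
    return mu + w_u[i] + w_v[j] + U[i] @ V[j]

def sgd_step(i, j, r, lr=0.1, lam=0.01):
    """One step on the regularized squared error for rating (i, j, r)."""
    e = predict(i, j) - r                      # prediction error
    w_u[i] -= lr * (e + lam * w_u[i])
    w_v[j] -= lr * (e + lam * w_v[j])
    U[i], V[j] = (U[i] - lr * (e * V[j] + lam * U[i]),
                  V[j] - lr * (e * U[i] + lam * V[j]))

sgd_step(0, 1, 5)                              # e.g. Satoshi rated item 2 a 5
```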
\pause
\centering
\begin{tabular}{rrr|rrrr|rrr}
\toprule
\multicolumn{3}{c}{Users} & \multicolumn{4}{c}{Items} & \multicolumn{3}{c}{Formats}\\
\cmidrule(lr){1-3} \cmidrule(lr){4-7} \cmidrule(lr){8-10}
$U_1$ & $U_2$ & $U_3$ & $I_1$ & $I_2$ & $I_3$ & $I_4$ & cinema & TV & mobile\\
\midrule
0 & \alert1 & 0 & 0 & \alert1 & 0 & 0 & 0 & \alert1 & 0\\
0 & 0 & \alert1 & 0 & 0 & \alert1 & 0 & 0 & \alert1 & 0\\
0 & \alert1 & 0 & 0 & 0 & \alert1 & 0 & \alert1 & 0 & 0\\
0 & \alert1 & 0 & 0 & \alert1 & 0 & 0 & \alert1 & 0 & 0\\
\alert1 & 0 & 0 & 0 & 0 & 0 & \alert1 & 0 & \alert1 & 0\\
\bottomrule
\end{tabular}
\centering
Learn bias \alert{$w_k$} and embedding \alert{$\bm{v_k}$} for each feature $k$ such that: \(y(\bm{x}) = \mu + \underbrace{\sum_{k = 1}^K \alert{w_k} x_k}_{\textnormal{linear regression}} + \underbrace{\sum_{1 \leq k < l \leq K} x_k x_l \langle \alert{\bm{v_k}}, \alert{\bm{v_l}} \rangle}_{\textnormal{pairwise interactions}}\)
This model is for regression

For classification, use a link function such as the sigmoid/softmax or the Gaussian CDF (probit)
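A numpy sketch of this prediction, using Rendle's $O(Kd)$ rewriting of the pairwise sum (the identity is standard; the sizes and values are made up):

```python
import numpy as np

def fm_predict(x, mu, w, V):
    """y(x) = mu + sum_k w_k x_k + sum_{k<l} x_k x_l <v_k, v_l>, where the
    pairwise term equals 0.5 (||V^T x||^2 - sum_k x_k^2 ||v_k||^2)."""
    Vx = V.T @ x
    pairwise = 0.5 * (Vx @ Vx - (x ** 2) @ (V ** 2).sum(axis=1))
    return mu + w @ x + pairwise

rng = np.random.default_rng(0)
K, d = 10, 3                                  # features, embedding dimension
x = np.zeros(K); x[[1, 4, 8]] = 1             # sparse row: user, item, format
y = fm_predict(x, mu=3.0, w=rng.normal(size=K), V=rng.normal(size=(K, d)))
# For classification, apply the link: p = 1 / (1 + np.exp(-y))
```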
\small \fullcite{rendle2012factorization}
Take a batch $(\X_B, y_B)$ and update the parameters such that the error is minimized.
\begin{algorithm}[H]
\begin{algorithmic}
\For {batch $\bm{X}_B, y_B$}
\For {feature $k$ involved in this batch $\bm{X}_B$}
\State Update $w_k, \bm{v}_k$ to decrease the loss estimate $\mathcal{L}$ on $\bm{X}_B$
\EndFor
\EndFor
\end{algorithmic}
\caption{SGD}
\label{algo-sgd}
\end{algorithm}
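In code, the inner update could look like this for the squared loss (a sketch with hand-written gradients; it reuses `fm_predict` from the earlier sketch, and the learning rate is an assumption):

```python
import numpy as np

def sgd_epoch(batches, mu, w, V, lr=0.01):
    """One pass over the data; only features involved in a row are updated."""
    for X_B, y_B in batches:                  # X_B: (batch, K), y_B: (batch,)
        for x, y in zip(X_B, y_B):
            err = fm_predict(x, mu, w, V) - y # gradient factor of squared loss
            for k in np.nonzero(x)[0]:        # features involved in this row
                grad_vk = x[k] * (V.T @ x - x[k] * V[k])
                w[k] -= lr * err * x[k]
                V[k] -= lr * err * grad_vk
```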
\only<1>{Approximate true posterior with an easier distribution (Gaussian)}
\only<2>{\begin{align}
\textnormal{Priors } p(w_k) = \N(\nu^w_{g(k)}, 1/\lambda^w_{g(k)}) \qquad p(v_{kf}) = \N(\nu^{v,f}_{g(k)}, 1/\lambda^{v,f}_{g(k)})\\
\textnormal{Approx. posteriors } q(w_k) = \N(\mu^w_k, (\sigma^w_k)^2) \qquad q(v_{kf}) = \N(\mu^{v,f}_k, (\sigma^{v,f}_k)^2)
\end{align}}
Idea: increase the ELBO, a lower bound on the objective $\log p(\y)$
\begin{align}
\log p(\y) & \geq \underbrace{\sum_{i = 1}^N \E_{q(\theta)} [\log p(y_i|x_i,\theta)] - \KL(q(\theta)||p(\theta))}_{\textrm{Evidence Lower Bound (ELBO)}}\\
& = \sum_{i = 1}^N \E_{q(\theta)} [ \log p(y_i|x_i,\theta) ] - \KL(q(w_0)||p(w_0)) - \sum_{k = 1}^K \KL(q(\theta_k)||p(\theta_k))
\end{align}
This needs to be rescaled for mini-batching (see the paper)
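A common unbiased rescaling (a sketch of the idea; the exact scheme is in the paper) reweights the likelihood term for a batch $B$:

\[\Lb = \frac{N}{|B|} \sum_{i \in B} \E_{q(\theta)} [\log p(y_i|x_i,\theta)] - \KL(q(\theta)||p(\theta))\]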
\centering
\begin{algorithm}[H]
\begin{algorithmic}
\For {each batch $B \subseteq \{1, \ldots, N\}$}
\State Sample $w_0 \sim q(w_0)$
\For {each feature $k \in F(B)$ involved in batch $B$}
\State Sample $S$ times $w_k \sim q(w_k)$, $\bm{v}_k \sim q(\bm{v}_k)$
\EndFor
\For {each feature $k \in F(B)$ involved in batch $B$}
\State Update parameters $\mu_k^w, \sigma_k^w, \bm{\mu}_k^v, \bm{\sigma}_k^v$ to increase the ELBO estimate
\EndFor
\State Update hyperparameters $\mu_0, \sigma_0, \nu, \lambda, \alpha$
\State Keep a moving average of the parameters to compute mean predictions
\EndFor
\end{algorithmic}
\caption{Variational Training (SGVB) of FMs}
\label{algo-vfm}
\end{algorithm}
\vspace{-5mm}

Then $\sigma$ can be reused for preference elicitation (see how in the paper)
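The sampling and update steps rely on the reparameterization trick; here is a PyTorch toy version for a single weight (the prior, the fake observation and the optimizer settings are placeholders, not the paper's configuration):

```python
import torch
from torch.distributions import Normal, kl_divergence

mu_k = torch.zeros(1, requires_grad=True)    # variational mean
rho_k = torch.zeros(1, requires_grad=True)   # sigma_k = softplus(rho_k) > 0
opt = torch.optim.Adam([mu_k, rho_k], lr=0.01)

y = torch.tensor([0.5])                      # a fake observation
for step in range(100):
    sigma_k = torch.nn.functional.softplus(rho_k)
    w_k = mu_k + sigma_k * torch.randn(1)    # reparameterized sample ~ q(w_k)
    log_lik = Normal(w_k, 1.0).log_prob(y)   # Monte Carlo likelihood term
    kl = kl_divergence(Normal(mu_k, sigma_k), Normal(0.0, 1.0))  # KL(q || p)
    loss = -(log_lik - kl).sum()             # maximize the ELBO
    opt.zero_grad(); loss.backward(); opt.step()
```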
A beneficial regularization: keep all weights over the training epochs and average them.

This connects to Polyak-Ruppert averaging, a.k.a. stochastic weight averaging.
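A sketch of such averaging as an incremental mean over checkpoints (the interface is hypothetical):

```python
import numpy as np

class RunningAverage:
    """Incremental mean of the weights across training steps
    (Polyak-Ruppert / stochastic weight averaging)."""
    def __init__(self):
        self.n, self.mean = 0, None

    def update(self, weights):
        self.n += 1
        if self.mean is None:
            self.mean = np.array(weights, dtype=float)
        else:
            self.mean += (np.asarray(weights) - self.mean) / self.n

avg = RunningAverage()
for w in ([1.0, 2.0], [3.0, 4.0]):   # weights after each epoch (toy values)
    avg.update(w)
print(avg.mean)                      # -> [2. 3.]
```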
\centering
\begin{tabular}{llrrrr}
\toprule
Task & Dataset & \#users & \#items & \#entries & Sparsity\\
\midrule
Regression & movie100k & 944 & 1683 & 100000 & 0.937\\
& movie1M & 6041 & 3707 & 1000209 & 0.955\\
\midrule
Classification & movie100 & 100 & 100 & 10000 & 0\\
& movie100k & 944 & 1683 & 100000 & 0.937\\
& movie1M & 6041 & 3707 & 1000209 & 0.955\\
& Duolingo & 1213 & 2416 & 1199732 & 0.828\\
\bottomrule
\end{tabular}
:::::: {.columns}
::: {.column}
\vspace{1cm}
\centering
\begin{tabular}{ccc}
\toprule
RMSE & Movie100k & Movie1M\\
\midrule
MCMC & \textbf{0.906} & \textbf{0.840}\\
\textbf{VFM} & \textbf{0.906} & 0.854\\
VBFM & 0.907 & 0.856\\
OVBFM & 0.912 & 0.846\\
\bottomrule
\end{tabular}
\vspace{2cm} \footnotetext{OVBFM is the online version (batch size = 1) of VBFM} ::: ::: {.column} ::: ::::::
:::::: {.columns} ::: {.column} \vspace{1cm} \centering
ACC | Movie100k | Movie1M | Duolingo |
---|---|---|---|
MCMC | 0.717 | 0.739 | 0.848 |
VFM | 0.722 | 0.746 | 0.846 |
VBFM | 0.692 | 0.732 | 0.842 |
\vspace{1cm} \footnotetext{In the paper, we also report AUC and mean average precision.}
::: ::: {.column} ::: ::::::
:::::: {.columns} ::: {.column} VFM is implemented in TF & PyTorch\bigskip
$\E_{q(\theta)} [\log p(y_i | \bm{x}_i,\theta)]$ becomes

`outputs.log_prob(observed).mean()`
Same implementation for classification and regression: the only difference is the distribution (Bernoulli vs. Gaussian)
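Concretely, with `torch.distributions` (the tensors are toy stand-ins for a batch of FM outputs and targets):

```python
import torch
from torch.distributions import Normal, Bernoulli

raw = torch.randn(8)                  # stand-in for FM outputs y(x) on a batch

# Regression: Gaussian likelihood around the prediction.
observed = torch.randn(8)
loss = -Normal(raw, 1.0).log_prob(observed).mean()

# Classification: the one-line change is the distribution.
labels = torch.randint(0, 2, (8,)).float()
loss = -Bernoulli(logits=raw).log_prob(labels).mean()
```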
\vspace{1cm}
Feel free to try it on GitHub (`vfm.py`):

github.com/jilljenn/vae ::: ::: {.column}
See more benchmarks on \href{https://github.com/mangaki/zero}{github.com/mangaki/zero} ::: ::::::