
% Knowledge Tracing\newline \& Optimizing Human Learning
% Jill-\only<1>{Jênn}\only<2>{\jin} Vie \and Hisashi Kashima
% SJTU-AIP workshop, March 8, 2019

---
theme: Frankfurt
handout: false
institute: \includegraphics[height=9mm]{figures/aip-logo.png} \quad \includegraphics[height=1cm]{figures/kyoto.png}
section-titles: false
biblio-style: authoryear
header-includes:
  - \usefonttheme{professionalfonts}
  - \usepackage{booktabs}
  - \usepackage{multicol,multirow}
  - \usepackage{algorithm,algpseudocode}
  - \usepackage{fontspec}
  - \usepackage{xeCJK}
  - \setCJKmainfont{Noto Sans CJK SC}
  - \usepackage{unicode-math}
  - \usepackage{bm}
  - \usepackage[italic]{mathastext}
  - \renewcommand*\familydefault{\sfdefault}
  - \def\jin{仁}
  - \usepackage{tikz}
  - \def\ReLU{\textnormal{ReLU}}
  - \def\correct{\includegraphics{figures/win.pdf}}
  - \def\mistake{\includegraphics{figures/fail.pdf}}
  - \newcommand\logit{\mathop{\mathrm{logit}}}
biblatexoptions:
  - maxbibnames=99
  - maxcitenames=5
---

Knowledge Tracing

Topics

Crowdsourcing

Data: worker $i$ labels item $j$ with class $k$
What is the true label of each item?

Mixture of experts

Model what algorithm works on what kind of data

Machine teaching

Feed the best sequence of samples to train a known algorithm

Practical intro

When exercises are too easy (or difficult),
students get bored (or discouraged).

To personalize assessment,
$\rightarrow$ we need a \alert{model} of how people respond to exercises.

\centering \includegraphics{figures/adaptive.pdf}

Figure!

\centering \includegraphics[width=0.6\linewidth]{figures/map.png}

Maths!

Let us assume $\bm{x} \in \mathbf{R}^d$ is \alert{sparse}.

\pause

Linear regression

$y = \langle \bm{w}, \bm{x} \rangle$

Logistic regression

$y = \sigma(\langle \bm{w}, \bm{x} \rangle)$ where $\sigma$ is sigmoid.

Neural network

$\bm{x}^{(L + 1)} = \sigma(\langle \bm{w}, \bm{x}^{(L)} \rangle)$ where $\sigma$ is ReLU.

What if $\sigma : x \mapsto x^2$ for example?

\pause

Polynomial kernel

$y = \sigma(1 + \langle \bm{w}, \bm{x} \rangle)$ where $\sigma$ is a monomial.

Factorization machine
$y = \langle \bm{w}, \bm{x} \rangle + {\left\| V \bm{x} \right\|}^2$

\vspace{5mm} \small \fullcite{blondel2016polynomial}
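
To see why the squared term creates pairwise feature interactions, here is a small numerical check (a numpy sketch added for this writeup, not part of the original slides): $\left\| V \bm{x} \right\|^2$ expands into $\sum_{k, l} x_k x_l \langle \bm{v_k}, \bm{v_l} \rangle$, where $\bm{v_k}$ is the $k$-th column of $V$.

```python
# Sketch: numerical check that ||V x||^2 expands into pairwise interactions
# sum_{k, l} x_k x_l <v_k, v_l>, where v_k is the k-th column of V.
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5                                   # embedding size, number of features
V = rng.normal(size=(d, n))
x = rng.normal(size=n)

lhs = np.linalg.norm(V @ x) ** 2
rhs = sum(x[k] * x[l] * V[:, k] @ V[:, l] for k in range(n) for l in range(n))
print(np.isclose(lhs, rhs))                   # True
```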

Students try exercises

Math Learning

\centering
\begin{tabular}{cccc} \toprule
Items & 5 – 5 = ? & \uncover<2->{17 – 3 = ?} & \uncover<3->{13 – 7 = ?}\\ \midrule
New student & \alert{$\mathbf{\circ}$} & \only<2->{\alert{$\mathbf{\circ}$}} & \only<3->{\alert{$\symbf{\times}$}}\\ \bottomrule
\end{tabular}

\raggedright \only<4->{Language Learning

\includegraphics{figures/duolingo0.png}}

\pause\pause\pause\pause

Challenges

Predicting student performance: knowledge tracing

Data

A population of users answering items

Side information

Goal: classification problem

Predict the performance of new users on existing items
Metric: AUC

Method

Learn parameters of questions from historical data \hfill \emph{e.g., difficulty}
Measure parameters of new students \hfill \emph{e.g., expertise}

Dimension 1

Our small dataset

\begin{columns} \begin{column}{0.6\linewidth} \begin{itemize} \item User 1 answered Item 1 correct \item User 1 answered Item 2 incorrect \item User 2 answered Item 1 incorrect \item User 2 answered Item 1 correct \item User 2 answered Item 2 ??? \end{itemize} \end{column} \begin{column}{0.4\linewidth} \centering \input{tables/dummy-ui-weak}\vspace{5mm}

\texttt{dummy.csv} \end{column} \end{columns}
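
For concreteness, here is a minimal sketch of the triplet format behind this table (assuming pandas; the column names are illustrative, not necessarily those of the real \texttt{dummy.csv}):

```python
# Sketch of the (user, item, correct) triplet format behind dummy.csv;
# column names are illustrative.
import pandas as pd

df = pd.DataFrame([
    (1, 1, 1),   # User 1 answered Item 1 correct
    (1, 2, 0),   # User 1 answered Item 2 incorrect
    (2, 1, 0),   # User 2 answered Item 1 incorrect
    (2, 1, 1),   # User 2 answered Item 1 correct
], columns=['user', 'item', 'correct'])
df.to_csv('dummy.csv', index=False)
```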

Our approach

\includegraphics[width=\linewidth]{figures/archi.pdf}

Simplest baseline: Item Response Theory (Rasch, 1960)

Learn abilities $\alert{\theta_i}$ for each user $i$
Learn easiness $\alert{e_j}$ for each item $j$ such that: \(\begin{aligned} Pr(\textnormal{User $i$ Item $j$ OK}) & = \sigma(\alert{\theta_i} + \alert{e_j}) \quad \sigma : x \mapsto 1/(1 + \exp(-x))\\ \logit Pr(\textnormal{User $i$ Item $j$ OK}) & = \alert{\theta_i} + \alert{e_j} \end{aligned}\)

Really popular model, used for the PISA assessment

Can be encoded as logistic regression

Learn $\alert{\bm{w}}$ such that $\logit Pr(\bm{x}) = \langle \alert{\bm{w}}, \bm{x} \rangle + b$

Graphically: IRT as logistic regression

Encoding “User $i$ answered Item $j$” with \alert{sparse features}:

\centering

\[\langle \bm{w}, \bm{x} \rangle = \theta_i + e_j = \logit Pr(\textnormal{User $i$ Item $j$ OK})\]

Encoding into sparse features

\centering

\input{tables/show-ui}

\raggedright Then logistic regression can be run on the sparse features.
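
A minimal sketch of that pipeline (the choice of scikit-learn's `OneHotEncoder` and `LogisticRegression` is mine, not necessarily the original implementation); the learned weights play the roles of $\theta_i$ and $e_j$:

```python
# Sketch: IRT as logistic regression over sparse one-hot (user, item) features.
# The scikit-learn tooling is a choice made here, not necessarily the original code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

pairs = np.array([[1, 1], [1, 2], [2, 1], [2, 1]])   # (user, item) from dummy.csv
outcomes = np.array([1, 0, 0, 1])

X = OneHotEncoder().fit_transform(pairs)             # sparse features
clf = LogisticRegression(C=1e3).fit(X, outcomes)
print(clf.coef_)                                     # weights ~ (theta_1, theta_2, e_1, e_2)
```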

Oh, there’s a problem

\input{tables/pred-ui}

We predict the same thing when there are several attempts.

Performance Factor Analysis (Pavlik et al., 2009)

Keep counters over time:
$W_{ik}$ ($F_{ik}$): how many successes (failures) of user $i$ over skill $k$ \begin{columns} \begin{column}{0.5\linewidth} \includegraphics[width=\linewidth]{figures/lr-swf.pdf} \end{column} \begin{column}{0.5\linewidth} \(\begin{aligned} \logit Pr(\textnormal{User $i$ Item $j$ OK})\\ = \sum_{\textnormal{Skill } k \textnormal{ of Item } j} \alert{\beta_k} + W_{ik} \alert{\gamma_k} + F_{ik} \alert{\delta_k} \end{aligned}\) \end{column} \end{columns}

\small \input{tables/pred-swf}
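
A sketch of how the $W_{ik}$ / $F_{ik}$ counters can be computed before each attempt (assuming pandas; column names are illustrative):

```python
# Sketch: per-(user, skill) success/failure counters *before* each attempt,
# as used by PFA; column names are illustrative.
import pandas as pd

df = pd.DataFrame({'user':    [1, 1, 2, 2],
                   'skill':   [1, 1, 1, 1],
                   'correct': [1, 0, 0, 1]})          # rows in chronological order

grouped = df.groupby(['user', 'skill'])['correct']
df['wins'] = grouped.cumsum() - df['correct']         # W_ik: successes so far
df['fails'] = grouped.cumcount() - df['wins']         # F_ik: failures so far
print(df)
```

These counters are then appended to the sparse skill features before refitting the logistic regression.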

Model 3: a new model (but still logistic regression)

346,860 attempts of 4,217 students over 26,688 items on 123 skills.

\centering \input{tables/assistments42-afm-pfa-iswf}

Dimension $n$

Here comes a new challenger

How to model \alert{pairwise interactions} with \alert{side information}?

Logistic Regression

Learn a 1-dim \alert{bias} for each feature (each user, item, etc.)

Factorization Machines

Learn a 1-dim \alert{bias} and a $k$-dim \alert{embedding} for each feature

How to model pairwise interactions with side information?

If you know user $i$ attempted item $j$ on \alert{mobile} (not desktop)
How to model it?

$y$: score of the event “user $i$ solves item $j$ correctly”

IRT

\[y = \theta_i + e_j\]

Multidimensional IRT (similar to collaborative filtering)

\[y = \theta_i + e_j + \langle \bm{v_{\textnormal{user $i$}}}, \bm{v_{\textnormal{item $j$}}} \rangle\]

\pause

With side information

\small \vspace{-3mm} \(y = \theta_i + e_j + \alert{w_{\textnormal{mobile}}} + \langle \bm{v_{\textnormal{user $i$}}}, \bm{v_{\textnormal{item $j$}}} \rangle + \langle \bm{v_{\textnormal{user $i$}}}, \alert{\bm{v_{\textnormal{mobile}}}} \rangle + \langle \bm{v_{\textnormal{item $j$}}}, \alert{\bm{v_{\textnormal{mobile}}}} \rangle\)

Graphically: logistic regression

\centering

Graphically: factorization machines

\centering

Formally: factorization machines

Each \textcolor{blue!80}{user}, \textcolor{orange}{item}, \textcolor{green!50!black}{skill} $k$ is modeled by bias $\alert{w_k}$ and embedding $\alert{\bm{v_k}}$.\vspace{2mm} \begin{columns} \begin{column}{0.47\linewidth} \includegraphics[width=\linewidth]{figures/fm.pdf} \end{column} \begin{column}{0.53\linewidth} \includegraphics[width=\linewidth]{figures/fm2.pdf} \end{column} \end{columns}\vspace{-2mm}

\hfill \(\logit p(\bm{x}) = \mu + \underbrace{\sum_{k = 1}^N \alert{w_k} x_k}_{\textnormal{logistic regression}} + \underbrace{\sum_{1 \leq k < l \leq N} x_k x_l \langle \alert{\bm{v_k}}, \alert{\bm{v_l}} \rangle}_{\textnormal{pairwise relationships}}\)

\small \fullcite{rendle2012factorization}
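
A direct numpy transcription of this prediction function (a sketch with illustrative variable names), using the usual identity $\sum_{k < l} x_k x_l \langle \bm{v_k}, \bm{v_l} \rangle = \frac12 \big( \| \sum_k x_k \bm{v_k} \|^2 - \sum_k x_k^2 \|\bm{v_k}\|^2 \big)$ to keep the pairwise term linear in the number of features:

```python
# Sketch: FM prediction  mu + sum_k w_k x_k + sum_{k<l} x_k x_l <v_k, v_l>,
# computed in O(N d) via 0.5 * (||sum_k x_k v_k||^2 - sum_k x_k^2 ||v_k||^2).
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fm_probability(x, mu, w, V):
    """x: (N,) features, w: (N,) biases, V: (N, d) embeddings (one row per feature)."""
    linear = w @ x
    pairwise = 0.5 * (np.linalg.norm(V.T @ x) ** 2
                      - np.sum(x ** 2 * np.sum(V ** 2, axis=1)))
    return sigmoid(mu + linear + pairwise)

rng = np.random.default_rng(0)
N, d = 6, 3                                   # features (user + item + skills), embedding size
x = rng.integers(0, 2, size=N).astype(float)  # sparse 0/1 encoding of one attempt
p = fm_probability(x, 0.0, rng.normal(size=N), rng.normal(size=(N, d)))
```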

Training using MCMC

Priors: $w_k \sim \mathcal{N}(\mu_0, 1/\lambda_0) \quad \bm{v_k} \sim \mathcal{N}(\symbf{\mu}, \symbf{\Lambda}^{-1})$
Hyperpriors: $\mu_0, \ldots, \mu_n \sim \mathcal{N}(0, 1), \lambda_0, \ldots, \lambda_n \sim \Gamma(1, 1) = \mathrm{Exp}(1)$

\begin{algorithm}[H]
\begin{algorithmic}
\For {each iteration}
    \State Sample hyperp. $(\lambda_i, \mu_i)_i$ from posterior using Gibbs sampling
    \State Sample weights $\bm{w}$
    \State Sample vectors $\bm{V}$
    \State Sample predictions $\bm{y}$
\EndFor
\end{algorithmic}
\caption{MCMC implementation of FMs}
\label{mcmc-fm}
\end{algorithm}

Implementation in C++ (libFM) with Python wrapper (pyWFM).

\small \fullcite{rendle2012factorization}
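
A hedged usage sketch of pywFM (argument names such as `task`, `num_iter` and `k2` follow the pywFM README as I recall it and may differ across versions; running it also requires a local libFM install):

```python
# Hedged sketch of calling libFM's MCMC solver through the pywFM wrapper;
# argument names may differ across versions -- check the pywFM documentation.
import numpy as np
import pywFM
from sklearn.preprocessing import OneHotEncoder

pairs = np.array([[1, 1], [1, 2], [2, 1], [2, 1], [2, 2]])  # (user, item)
X = OneHotEncoder().fit_transform(pairs)
y = np.array([1, 0, 0, 1, 0])          # last outcome is unknown: dummy test label

fm = pywFM.FM(task='classification',   # binary outcome: correct / incorrect
              num_iter=500,            # MCMC iterations
              k2=20)                   # dimension of the pairwise embeddings
model = fm.run(X[:4], y[:4], X[4:], y[4:])
print(model.predictions)               # predicted probability for the test row
```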

Datasets

Fraction

500 middle-school students, 20 fraction subtraction questions,
8 skills (full matrix)

Assistments

346,860 attempts of 4,217 students over 26,688 math items
on 123 skills (sparsity 0.997)

Berkeley

On a Computer Science MOOC, 562,201 attempts
of 1,730 students over 234 items of 29 categories

Existing work on Assistments

\footnotesize
\begin{tabular}{cccc} \toprule
\multirow{2}{*}{Model} & \multirow{2}{*}{Basically} & Original & \only<3->{Fixed}\\
 & & AUC & \only<3->{AUC}\\ \midrule
Bayesian Knowledge Tracing & \multirow{2}{*}{Hidden Markov Model} & \multirow{2}{*}{0.67} & \only<3->{\multirow{2}{*}{0.63}}\\
(\cite{corbett1994knowledge})\\ \midrule
\only<2->{Deep Knowledge Tracing & \multirow{2}{*}{Recurrent Neural Network} & \multirow{2}{*}{0.86} & \only<3->{\multirow{2}{*}{0.75}}\\}
\only<2->{(\cite{piech2015deep})\\ \midrule}
\only<4->{Item Response Theory & \multirow{3}{*}{Online Logistic Regression} & \multirow{3}{*}{} & \multirow{3}{*}{0.76}\\}
\only<4->{(\cite{rasch1960studies})\\}
\only<4->{(Wilson et al., 2016) \\ \midrule}
\only<5->{Knowledge Tracing Machines & Factorization Machines & & 0.82\\ \bottomrule}
\end{tabular}

\small \only<5->{\fullcite{KTM2019}}

AUC results on the Assistments dataset

\centering \includegraphics[width=0.6\linewidth]{figures/barchart.pdf}

\scriptsize \input{tables/assistments42-full}

Bonus: interpreting the learned embeddings

\centering

\includegraphics{figures/viz.pdf}

???

What about recurrent neural networks?

Deep Knowledge Tracing: knowledge tracing as sequence prediction

\small \fullcite{piech2015deep} \normalsize

Our approach: encoder-decoder

\def\xin{\bm{x^{in}_t}} \def\xout{\bm{x^{out}_t}} \(\left\{\begin{array}{ll} \bm{h_t} = Encoder(\bm{h_{t - 1}}, \xin)\\ p_t = Decoder(\bm{h_t}, \xout)\\ \end{array}\right. t = 1, \ldots, T\)
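
A minimal PyTorch sketch of this template (the GRU encoder, feature concatenation and linear decoder are simplifications of the architectures compared below, not an exact reimplementation):

```python
# Minimal sketch (not the exact architecture from the paper) of the
# encoder-decoder template above, assuming PyTorch.
import torch
import torch.nn as nn

class EncoderDecoderKT(nn.Module):
    def __init__(self, n_in, n_out, d):
        super().__init__()
        self.encoder = nn.GRU(n_in, d, batch_first=True)  # h_t = Encoder(h_{t-1}, x_t^in)
        self.decoder = nn.Linear(n_out + d, 1)            # p_t = Decoder(h_t, x_t^out)

    def forward(self, x_in, x_out):
        h, _ = self.encoder(x_in)                              # (batch, T, d)
        logits = self.decoder(torch.cat([x_out, h], dim=-1))   # (batch, T, 1)
        return torch.sigmoid(logits).squeeze(-1)               # p_t for each step

model = EncoderDecoderKT(n_in=10, n_out=10, d=2)
p = model(torch.randn(2, 5, 10), torch.randn(2, 5, 10))  # 2 students, 5 time steps
```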

Graphically: deep knowledge tracing

\centering

Deep knowledge tracing with dynamic student classification

\centering

\raggedright\small \fullcite{Minn2018}

DKT seen as encoder-decoder

\centering

Results on Fraction dataset

500 middle-school students, 20 Fraction subtraction questions,
8 skills (full matrix)

\begin{table} \centering
\begin{tabular}{cccccc} \toprule
Model & Encoder & Decoder & $\xout$ & ACC & AUC\\ \midrule
\textbf{Ours} & GRU $d = 2$ & bias & iswf & \textbf{0.880} & \textbf{0.944}\\
KTM & counter & bias & iswf & 0.853 & 0.918\\
PFA & counter & bias & swf & 0.854 & 0.917\\
Ours & $\varnothing$ & bias & iswf & 0.849 & 0.917\\
Ours & GRU $d = 50$ & $\varnothing$ & & 0.814 & 0.880\\
DKT & GRU $d = 2$ & $d = 2$ & s & 0.772 & 0.844\\
Ours & GRU $d = 2$ & $\varnothing$ & & 0.751 & 0.800\\ \bottomrule
\end{tabular}
\label{results-fraction}
\end{table}

Results on Berkeley dataset

562,201 attempts of 1,730 students over 234 CS-related items of 29 categories.

\begin{table} \centering
\begin{tabular}{cccccc} \toprule
Model & Encoder & Decoder & $\xout$ & ACC & AUC\\ \midrule
\textbf{Ours} & GRU $d = 50$ & bias & iswf & \textbf{0.707} & \textbf{0.778}\\
\textbf{KTM} & counter & bias & iswf & \textbf{0.704} & \textbf{0.775}\\
Ours & $\varnothing$ & bias & iswf & 0.700 & 0.770\\
DKT & GRU $d = 50$ & $d = 50$ & s & 0.684 & 0.751\\
Ours & GRU $d = 100$ & $\varnothing$ & & 0.682 & 0.750\\
PFA & counter & bias & swf & 0.630 & 0.683\\
DKT & GRU $d = 2$ & $d = 2$ & s & 0.637 & 0.656\\ \bottomrule
\end{tabular}
\label{results-assistments}
\end{table}

\raggedright \small \fullcite{Vie2019encode}

Conclusion

Take home message

\alert{Factorization machines} unify many existing EDM models

They can be combined with \alert{deep neural networks}

Then we can \alert{optimize learning}

Thank you! Do you have any questions?

\centering \url{https://jilljenn.github.io}

\raggedright I’m interested in:

\vspace{7mm} \begin{columns} \begin{column}{0.2\linewidth} \includegraphics[width=\linewidth]{figures/tryalgo-cn.jpg} \end{column} \begin{column}{0.8\linewidth} We also wrote a book about programming in Python: 高效算法: 竞赛、应试与提高必修128例, 人民邮电出版社 (Posts \& Telecom Press). Code is available on GitHub. \end{column} \end{columns}

\centering vie@jill-jenn.net