% Knowledge Tracing Machines:\newline Families of models\newline for predicting student performance
% Jill-Jênn Vie \and Hisashi Kashima
% November 9, 2018\bigskip\newline \url{https://arxiv.org/abs/1811.03388}

---
theme: Frankfurt
handout: true
institute: \includegraphics[height=1cm]{figures/aip-logo.png} \quad \includegraphics[height=1cm]{figures/kyoto.png}
section-titles: false
biblio-style: authoryear
header-includes:
  - \usepackage{booktabs}
  - \usepackage{multicol}
  - \usepackage{bm}
  - \DeclareMathOperator\logit{logit}
  - \def\ReLU{\textnormal{ReLU}}
biblatexoptions:
  - maxbibnames=99
  - maxcitenames=5
---
AI can:
as long as you have enough data.
Can it also:
as long as you have enough data?
\pause
A population of students answering questions
Good model for prediction $\rightarrow$ Good adaptive policy for teaching
\vspace{5mm}
\fullcite{rendle2012factorization}
\fullcite{piech2015deep}
Filling the blanks: some students did not attempt all questions
Cold-start: some new students do not appear in the training set
\includegraphics[width=\linewidth]{figures/dkt.png}
\emph{From the DKT paper.}
\begin{columns} \begin{column}{0.6\linewidth} \begin{itemize} \item User 1 answered Item 1 correct \item User 1 answered Item 2 incorrect \item User 2 answered Item 1 incorrect \item User 2 answered Item 1 correct \item User 2 answered Item 2 ??? \end{itemize} \end{column} \begin{column}{0.4\linewidth} \centering \input{tables/dummy-ui-weak}\vspace{5mm}
\texttt{dummy.csv} \end{column} \end{columns}
\begin{columns} \begin{column}{0.6\linewidth} \begin{itemize} \item User 1 answered Item 1 correct \item User 1 answered Item 2 incorrect \item User 2 answered Item 1 ??? \item User 2 answered Item 1 ??? \item User 2 answered Item 2 ??? \end{itemize} \end{column} \begin{column}{0.4\linewidth} \centering \input{tables/dummy-ui-strong}\vspace{5mm}
\texttt{dummy.csv} \end{column} \end{columns}
Learn abilities $\theta_i$ for each user $i$
Learn easiness $e_j$ for each item $j$ such that:
\(\begin{aligned}
Pr(\textnormal{User $i$ Item $j$ OK}) & = \sigma(\theta_i + e_j)\\
\logit Pr(\textnormal{User $i$ Item $j$ OK}) & = \theta_i + e_j
\end{aligned}\)
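As a sanity check, here is a minimal NumPy sketch of this prediction rule; the parameter values below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up learned parameters: abilities of 2 users, easiness of 2 items
theta = np.array([0.8, -0.4])  # theta_i
e = np.array([0.2, -1.0])      # e_j

# Probability that user 0 answers item 1 correctly
print(sigmoid(theta[0] + e[1]))  # sigmoid(0.8 - 1.0) ~ 0.45
```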
Learn $\alert{\bm{w}}$ such that $\logit Pr(\bm{x}) = \langle \alert{\bm{w}}, \bm{x} \rangle$
Usually with L2 regularization: $\|\bm{w}\|_2^2$ penalty $\leftrightarrow$ Gaussian prior
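For concreteness, a minimal scikit-learn sketch of this step (the repository's `lr.py` may differ; the toy matrix below is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy encoded observations (rows are feature vectors x) and outcomes y
X = np.array([[1, 0, 1, 0],   # user 1, item 1
              [1, 0, 0, 1],   # user 1, item 2
              [0, 1, 1, 0]])  # user 2, item 1
y = np.array([1, 0, 0])

# L2 penalty on w; C is the *inverse* regularization strength
clf = LogisticRegression(penalty='l2', C=1.0).fit(X, y)
print(clf.predict_proba(X)[:, 1])  # Pr(correct) per observation
```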
Encoding of “User $i$ answered Item $j$”:
\centering
\[\logit Pr(\textnormal{User $i$ Item $j$ OK}) = \langle \bm{w}, \bm{x} \rangle = \theta_i + e_j\]

python encode.py --users --items
\centering
\input{tables/show-ui}
data/dummy/X-ui.npz
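A minimal sketch of how such a sparse encoding can be built with SciPy (the actual `encode.py` may differ), using the five \texttt{dummy.csv} rows above:

```python
import numpy as np
from scipy.sparse import coo_matrix, hstack, save_npz

# The five (user, item) pairs from dummy.csv, 0-indexed
user = np.array([1, 1, 2, 2, 2]) - 1
item = np.array([1, 2, 1, 1, 2]) - 1
rows = np.arange(len(user))
ones = np.ones(len(user))

# One-hot blocks for users and items, concatenated side by side
X_users = coo_matrix((ones, (rows, user)), shape=(len(rows), 2))
X_items = coo_matrix((ones, (rows, item)), shape=(len(rows), 2))
save_npz('data/dummy/X-ui.npz', hstack([X_users, X_items]).tocsr())
```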
\raggedright Then logistic regression can be run on the sparse features:
python lr.py data/dummy/X-ui.npz
python encode.py --users --items
python lr.py data/dummy/X-ui.npz
\input{tables/pred-ui}
We predict the same thing even when a student makes several attempts at the same item.
Keep track of what the student has done before:
\centering
\input{tables/dummy-uiswf}
data/dummy/data.csv
$W_{ik}$: number of successes of user $i$ on skill $k$ ($F_{ik}$: number of failures)
Learn $\alert{\beta_k}$, $\alert{\gamma_k}$, $\alert{\delta_k}$ for each skill $k$ such that: \(\logit Pr(\textnormal{User $i$ Item $j$ OK}) = \sum_{\textnormal{Skill } k \textnormal{ of Item } j} \alert{\beta_k} + W_{ik} \alert{\gamma_k} + F_{ik} \alert{\delta_k}\)
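The counters can be computed with one cumulative pass over the log; a minimal pandas sketch, assuming one skill per item and a time-sorted log with these (hypothetical) column names:

```python
import pandas as pd

df = pd.DataFrame({'user':    [1, 1, 1, 2],
                   'skill':   [1, 1, 1, 1],
                   'correct': [1, 0, 1, 1]})  # sorted by time within user

# W_ik / F_ik: successes / failures *before* the current attempt
grouped = df.groupby(['user', 'skill'])['correct']
df['wins'] = grouped.cumsum() - df['correct']   # exclude current outcome
df['fails'] = grouped.cumcount() - df['wins']   # prior attempts minus wins
print(df)
```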
python encode.py --skills --wins --fails
\centering \input{tables/show-swf}
data/dummy/X-swf.npz
python encode.py --skills --wins --fails
python lr.py data/dummy/X-swf.npz
\input{tables/pred-swf}
278,608 attempts by 4,163 students over 196,457 items on 124 skills.
http://jiji.cat/weasel2018/data.csv
data/assistments09
python fm.py data/assistments09/X-ui.npz
(or \texttt{make big})
\vspace{1cm}
\centering \input{tables/assistments42-afm-pfa}
python encode.py --items --skills --wins --fails
python lr.py data/dummy/X-iswf.npz
\centering \input{tables/assistments42-afm-pfa-iswf}
How to model \alert{side information} in, say, recommender systems?
Learn a 1-dim \alert{bias} for each feature (each user, item, etc.)
Learn a 1-dim \alert{bias} and a $k$-dim \alert{embedding} for each feature
If you know user $i$ attempted item $j$ on \alert{mobile} (not desktop)
How to model it?
$y$: score of event “user $i$ solves item $j$ correctly”
\pause
Learn bias \alert{$w_k$} and embedding \alert{$\bm{v_k}$} for each feature $k$ such that: \(\logit p(\bm{x}) = \mu + \underbrace{\sum_{k = 1}^N \alert{w_k} x_k}_{\textnormal{logistic regression}} + \underbrace{\sum_{1 \leq k < l \leq N} x_k x_l \langle \alert{\bm{v_k}}, \alert{\bm{v_l}} \rangle}_{\textnormal{pairwise interactions}}\)
Multidimensional item response theory, $\logit p(\bm{x}) = \langle \bm{u_i}, \bm{v_j} \rangle + e_j$, is a particular case.
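To see why: if $\bm{x}$ one-hot encodes user $i$ and item $j$, only $x_i = x_j = 1$ are nonzero, so the FM formula above collapses to
\[\logit p(\bm{x}) = \mu + w_i + w_j + \langle \bm{v_i}, \bm{v_j} \rangle\]
i.e., MIRT with $\mu + w_i + w_j$ playing the role of $e_j$.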
\small \fullcite{rendle2012factorization}
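For intuition, a minimal NumPy sketch of the FM prediction; the pairwise term uses Rendle's identity $\sum_{k<l} x_k x_l \langle \bm{v_k}, \bm{v_l} \rangle = \frac12 \sum_f \big[ (\sum_k v_{kf} x_k)^2 - \sum_k v_{kf}^2 x_k^2 \big]$, which costs $O(Nd)$ instead of $O(N^2 d)$. All values below are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 3                          # N features, d-dim embeddings
x = rng.integers(0, 2, N)            # a sparse feature vector
mu, w = 0.0, rng.normal(size=N)      # global and per-feature biases
V = rng.normal(size=(N, d))          # one embedding v_k per feature

# Pairwise interactions via Rendle's identity: O(Nd), not O(N^2 d)
pairwise = 0.5 * (((x @ V) ** 2).sum() - ((x ** 2) @ (V ** 2)).sum())
p = 1 / (1 + np.exp(-(mu + w @ x + pairwise)))
print(p)
```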
\centering \input{tables/duolingo}
Available on \url{http://sharedtask.duolingo.com}
Learn layers \alert{$W^{(\ell)}$} and \alert{$b^{(\ell)}$} such that: \(\begin{aligned}[c] \bm{a}^{0}(\bm{x}) & = (\alert{\bm{v_{\texttt{user}}}}, \alert{\bm{v_{\texttt{item}}}}, \alert{\bm{v_{\texttt{skill}}}}, \ldots)\\ \bm{a}^{(\ell + 1)}(\bm{x}) & = \ReLU(\alert{W^{(\ell)}} \bm{a}^{(\ell)}(\bm{x}) + \alert{\bm{b}^{(\ell)}}) \quad \ell = 0, \ldots, L - 1\\ y_{DNN}(\bm{x}) & = \ReLU(\alert{W^{(L)}} \bm{a}^{(L)}(\bm{x}) + \alert{\bm{b}^{(L)}}) \end{aligned}\)
\[\logit p(\bm{x}) = y_{FM}(\bm{x}) + y_{DNN}(\bm{x})\]

\small \fullcite{Duolingo2018}
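A minimal PyTorch sketch of the deep part (the FM part is as above; layer sizes and fields are made up):

```python
import torch
import torch.nn as nn

d, n_fields, hidden = 5, 3, 16   # embedding dim; fields: user, item, skill

# a^0 is the concatenation of the feature embeddings, then ReLU layers
deep = nn.Sequential(
    nn.Linear(n_fields * d, hidden), nn.ReLU(),
    nn.Linear(hidden, 1), nn.ReLU())

a0 = torch.randn(2, n_fields * d)   # batch of 2 concatenated embeddings
y_dnn = deep(a0).squeeze(-1)        # y_DNN(x)
# Final prediction: sigmoid(y_FM(x) + y_DNN(x))
```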
\centering
\begin{tabular}{cccc} \toprule
Rank & Team & Algo & AUC\\ \midrule
1 & SanaLabs & RNN + GBDT & .857\\
2 & singsound & RNN & .854\\
2 & NYU & GBDT & .854\\
4 & CECL & LR + L1 (13M feat.) & .843\\
5 & TMU & RNN & .839\\ \midrule
(7) & JJV & KTM = FM & .822\\
(8) & JJV & DeepFM & .814\\
10 & JJV & DeepFM & .809\\ \midrule
– & JJV & KTM = LR + L2 & .783\\
15 & Duolingo & LR + L1 & .771\\ \bottomrule
\end{tabular}
\raggedright \small \fullcite{Settles2018}
Deep Knowledge Tracing: model the problem as sequence prediction
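A minimal PyTorch sketch of the DKT architecture: each interaction is a one-hot (item, correctness) pair fed to an LSTM, which outputs a probability of success per item (sizes are made up):

```python
import torch
import torch.nn as nn

n_items, hidden = 50, 64

class DKT(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: one-hot over 2 * n_items (item crossed with correctness)
        self.rnn = nn.LSTM(2 * n_items, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_items)  # one logit per item

    def forward(self, x):
        h, _ = self.rnn(x)
        return torch.sigmoid(self.out(h))

x = torch.zeros(1, 10, 2 * n_items)  # one fake sequence of 10 interactions
print(DKT()(x).shape)                # (1, 10, n_items)
```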
DKT does not model individual differences.
Actually, Wilson even managed to beat DKT with plain (1-dim!) IRT.
By estimating the student's learning ability on the fly, we managed to get an even better model.
\centering \input{tables/results-dkt}
\raggedright \small \fullcite{Minn2018}
\input{tables/assistments42-full}
\alert{Factorization machines} are a strong baseline that unifies many existing EDM models
\alert{Recurrent neural networks} are powerful because they track the evolution of the latent state
\fullcite{KTM2018}
Read our article:
\begin{block}{Knowledge Tracing Machines} \url{https://arxiv.org/abs/1811.03388} \end{block}
Try the code:
\centering \url{https://github.com/jilljenn/ktm}
\raggedright Feel free to chat:
\centering vie@jill-jenn.net
\raggedright Do you have any questions?