Jill-Jênn Vie

Researcher at Inria

% Information geometry & adaptive assessment % JJ Vie % \includegraphics[height=1cm]{figures/inria.png} — handout: true header-includes: - \usepackage{bm} - \usepackage{booktabs} - \usepackage{tikz} - \DeclareMathOperator\logit{logit} - \newcommand\mycite[3]{\textcolor{blue}{#1} “#2”.~#3.} —

JJV

\centering \begin{tikzpicture}[xshift=-14cm,xscale=0.7] \draw (6,0) – node[above,label={below:Lyon}] {Bachelor} (9,0); \draw (9,1) – node[above,label={below:Paris}] {Master(s)} (14.5,1); \draw (14,0) – node[above,label={below:Paris-Saclay}] {PhD} (16.5,0); \draw (17.5,1) – node[above,label={below:Tokyo \& Kyoto}] {Postdoc} (19.5,1); \only<2>{\draw[red] (13.5,-1) – node[above,label={below:Cachan}] {Agrégation ?!} (14.5,-1);} \draw[->] (6,-2) – (19.5,-2) node[above] {time}; \end{tikzpicture}

SequeL

#

\centering

Akinator

\centering

GuessWhat (de Vries et al. 2016)

\centering

{width=90%}

Real-world example: certifying digital skills

We want to assess your skills in some domain,
by asking you to complete some tasks.

\centering \begin{tabular}{rlcccc} \toprule & & \multicolumn{4}{c}{Knowledge components}
& & \textbf{form} & \textbf{mail} & \textbf{copy} & \textbf{url}\ \midrule T1 & Send a mail & \textbf{form} & \textbf{mail}
T2 & Fill a form & \textbf{form}
T3 & Share a link & & & \textbf{copy} & \textbf{url}
T4 & Type a URL & \textbf{form} & & & \textbf{url}\ \bottomrule \end{tabular}

\raggedright \def\correct{\textcolor{green!50!black}{Correct !}} \def\incorrect{\textcolor{red}{Incorrect.}}

\pause

We administer task 1. \correct{}
$\Rightarrow$ \textbf{form} & \textbf{mail} : mastered. Task 2 brings few information.

\pause

We administer task 4. \incorrect{}
$\Rightarrow$ \textbf{url} seems unmastered. Task 3 will bring few information.

Feedback

You seem to master form & mail but not url.

Discrete adaptive assessments

Trying to find a \alert{target} in ${0, 1}^K$ where $K$ is the number of skills.

Maximum entropy: uniform distribution

\alert{Minimizing} the expected entropy

But the support is $O(2^K)$, what to do when $K$ is big?

Structure on the assessed domain (prerequisite graph)

\centering

Digital competencies curriculum DIGCOMP 2.0

5 domains, 16 competencies, 800 skills, what should we do?

\centering \includegraphics[width=0.5\linewidth]{figures/digcomp.png}

Certification of digital skills

Before: B2i.

Now:

{width=2cm}

Pix replaces B2i for high school students (JO September 1\textsuperscript{st} 2019) Some companies use it to measure the impact of their trainings

An example of Pix challenge

\centering \Large In the French village of Montrésor,
what street is crossing Perrières street?

\vspace{1cm} \pause

\normalsize $\rightarrow$ can get skill \@rechercheInfo3

Different tests, different objectives

Placement tests (self-assessment, low stake)

Assess your level adaptively
Know your strong and weak points
Recommend tutorials

Certification tests (high stake)

Few questions to certify a rough estimate
“This person is level 4 in safety.”

Progression tests

“What should I learn next?”
Optimizing human learning

Continuous adaptive assessments

Rasch (1960)

Rasch model

\[p(success) = \Pr(R_{ij} = 1) = \sigma(\alert{\theta_i} - \alert{d_j}).\]

Training

Adaptive assessment

Combining discrete and Rasch

Ask question that maximizes average number of validated/invalidated skills:

\centering

$\textnormal{Maximize } p(success) N_{validated} + (1 - p(success)) N_{invalidated}$

\raggedright

Code is on GitHub (AGPLv3 license) in JavaScript

\small \fullcite{Vie2017PIX}

Are we really unidimensional?

In language learning, people from different countries have different difficulties.

Continuous multivariate:

Rasch (item response theory)

\[\Pr(R_{ij} = 1) = \sigma(\alert{\theta} - \alert{d}).\]

Multidimensional item response theory

\[\Pr(R_{ij} = 1) = \sigma(\langle \bm{a}, \alert{\bm{\theta}} \rangle + b).\]

Example of multidimensional adaptive assessment

\[\Pr(R_{ij} = 1) = \sigma(\langle \bm{a}, \alert{\bm{\theta}} \rangle + b).\]

Black points are items, red point is user.

\centering \includegraphics[width=0.5\linewidth]{figures/mirt-here.png}

Interpreting the components

\centering \includegraphics[width=\linewidth]{figures/inter1.jpg}

Interpreting the components

\centering \includegraphics[width=\linewidth]{figures/inter2.jpg}

Prior

\centering

Prior + Posterior given $(1, 1)$ is answered correctly

\centering

What information?

We want to maximize likelihood $\Rightarrow \max LL = \max \log p(X \theta)$

Find the zeroes, or go in the direction of the gradient:

\[\nabla_\theta LL = \frac{\partial LL}{\partial \theta}\]
Property (fun fact): $\mathbb{E}_{p(X \theta)} \nabla_\theta LL = 0$

\pause

If $Var_{p(X \theta)} (\nabla_\theta LL)$ is low, the observation is \alert{useless}.
\[\mathcal{F}(\theta) = Var_{p(X|\theta)} (\nabla_\theta LL) = -\mathbb{E}_{p(X|\theta)} \frac{\partial^2 LL}{\partial^2 \theta}\]

\pause

Another index for choosing a question:

\[KL(\theta) = \int_{B(\theta, c/\sqrt{n})} KL(\theta||\theta_0) = \int_{B(\theta, c/\sqrt{n})} \mathbb{E}_{p(X|\theta_0)} \log \frac{P(X|\theta_0)}{P(X|\theta)}\]

A toy example

Let’s take the Rasch model $p(X_j \theta) = \sigma(\theta - d_j) = p_j$

$\nabla_\theta LL = X_j - p_j$

$\mathcal{F}(\theta) = - \frac{\partial^2 LL}{\partial^2 \theta} = p_j (1 - p_j)$

Which means the item of maximum Fisher information is the one of probability \alert{closest to $1/2$}, given the current maximum likelihood estimate.

\pause

Other rewards & policies have been considered:

Here comes a new challenger

How to model \alert{pairwise interactions} with \alert{side information}?

Logistic Regression

Learn a 1-dim \alert{bias} for each feature (each user, item, etc.)

Factorization Machines

Learn a 1-dim \alert{bias} and a $k$-dim \alert{embedding} for each feature

How to model pairwise interactions with side information?

If you know user $i$ attempted item $j$ on \alert{mobile} (not desktop)
How to model it?

$y$: score of event “user $i$ solves correctly item $j$”

IRT

\[y = \theta_i + e_j\]

Multidimensional IRT (similar to collaborative filtering)

\[y = \theta_i + e_j + \langle \bm{v_{\textnormal{user $i$}}}, \bm{v_{\textnormal{item $j$}}} \rangle\]

\pause

With side information

\small \vspace{-3mm} \(y = \theta_i + e_j + \alert{w_{\textnormal{mobile}}} + \langle \bm{v_{\textnormal{user $i$}}}, \bm{v_{\textnormal{item $j$}}} \rangle + \langle \bm{v_{\textnormal{user $i$}}}, \alert{\bm{v_{\textnormal{mobile}}}} \rangle + \langle \bm{v_{\textnormal{item $j$}}}, \alert{\bm{v_{\textnormal{mobile}}}} \rangle\)

Graphically: logistic regression

\centering

Graphically: factorization machines

\centering

Formally: factorization machines

Learn bias \alert{$w_k$} and embedding \alert{$\bm{v_k}$} for each feature $k$ such that: \(\logit p(\bm{x}) = \mu + \underbrace{\sum_{k = 1}^N \alert{w_k} x_k}_{\textnormal{logistic regression}} + \underbrace{\sum_{1 \leq k < l \leq N} x_k x_l \langle \alert{\bm{v_k}}, \alert{\bm{v_l}} \rangle}_{\textnormal{pairwise interactions}}\)

Multidimensional item response theory: $\logit p(\bm{x}) = \langle \bm{u_i}, \bm{v_j} \rangle + e_j$
is a particular case.

\small \fullcite{rendle2012factorization}

\normalsize Use temporal features

\small \fullcite{KTM2019}

Learners evolve over time!

Simple assumptions:

Forgetting model

\centering

{width=80%}

\small\raggedright \mycite{Benoît Choffin, Fabrice Popineau, Yolaine Bourda, and Jill-Jênn Vie (2019)}{DAS3H: Modeling Student Learning and Forgetting for Optimally Scheduling Distributed Practice of Skills}{Best Paper Award at EDM 2019}

Thank you!

“Information geometry” was coined by Shunichi Amari (RIKEN)

Fisher information defines a \alert{Riemannian metric} on probability distributions

\vspace{1cm}

Questions?