Jill-Jênn Vie

Researcher at Inria

% Apprendre de données d’humains :\newline applications du crowdsourcing \only<2>{(\alert{observer})}\newline et de l’inférence causale \only<2>{(\alert{agir}) }pour l’éducation % Jill-Jênn Vie % 31 août 2023 — aspectratio: 169 institute: \includegraphics[height=1cm]{figures/soda.png} \includegraphics[height=0.9cm]{figures/inria.png} header-includes: |

  \usepackage{tikz}
  \usepackage{subfig}
  \usepackage{ulem}
  \usepackage{eurosym}
  \usepackage{graphbox}
  \usepackage{tabularx}
  \def\hfilll{\hspace{0pt plus 1 filll}}
  \def\E{\mathbb{E}}
  \renewcommand{\arraystretch}{1.2}
  \newcolumntype{C}{>{\centering\arraybackslash}X}

Crowdsourcing : apprendre de données d’humains $\to$ bruitées

\centering

\resizebox{0.9\linewidth}{!}{$\displaystyle \substack{\normalsize \Pr(\textnormal{“joueur A bat joueur B”})\ \Pr(\textnormal{“étudiant A résout question B”})\ \Pr(\textnormal{“A préféré à B”})} = \frac1{1 + \exp(-(score_A - score_B))}$}

\raggedright

Les gens affrontent des questions plus dures qu’eux (Rasch) et potentiellement apprennent au passage (Elo)

\begin{figure} \captionsetup[subfigure]{labelformat=empty,justification=centering} \subfloat[reCAPTCHA\ (Luis von Ahn, 2008)]{\raisebox{2mm}{\includegraphics[width=0.25\linewidth]{figures/captcha.png}}} \subfloat[Elo (1967)\ TrueSkill (2007)]{\includegraphics[width=0.25\linewidth]{figures/tournament-nyt.png}} \subfloat[Tests adaptatifs\ (Rasch, 1960)]{\includegraphics[width=0.25\linewidth]{figures/irt.pdf}} \subfloat[Modèles de préférences\ (Bradley \& Terry, 1952)]{\raisebox{3mm}{\includegraphics[width=0.25\linewidth]{figures/elo2.jpg}}} \end{figure}

\vfill \small

\textcolor{gray}{Raykar, Yu, Zhao, Valadez, Florin, Bogoni, \& Moy (JMLR 2010).
Learning from crowds.}

Tests adaptatifs

:::::: {.columns} ::: {.column width=45%} Optimiser la question suivante (ex. Pix)

::: ::: {.column width=55%} Apprendre des détracteurs qui piègent les gens\bigskip

Apprendre de nouvelles bonnes réponses \small \textcolor{gray}{Bachrach, Graepel, Minka \& Guiver (ICML 2012). How to grade a test without knowing the answers—A Bayesian graphical model for adaptive crowdsourcing and aptitude testing.}\bigskip

\normalsize

Comportement adversarial ::: ::::::

\[\parbox{3.6cm}{Poser une question\\ A / B / C / D\\ Demander réponse\\ + explication\\ + degré de confiance} \only<1>{\quad \rightarrow}\only<2>{\alert{\xrightarrow{\text{intervention !}}}} \quad \parbox[c]{3.6cm}{Présenter à l'étudiant une réponse différente\\ + explication} \quad \to \quad \parbox{3.6cm}{Souhaitez-vous\\ modifier votre\\ réponse ?\\ « 30 \% des étudiants s'améliorent »}\]

\small

\textcolor{gray}{Silvestre, Vidal, \& Broisin (EC-TEL 2015). Reflexive learning, socio-cognitive conflict and peer-assessment to improve the quality of feedbacks in online tests.}

Tests randomisés controlés \hfilll vs. \hfilll Étude de cohorte

\centering

\raggedright

:::::: {.columns} ::: {.column width=”30%”} Dans un monde idéal, on peut \alert{contrôler} le traitement :

\resizebox{\linewidth}{!}{$\underbrace{P(X

T = 1)}_{\text{pop traitée}} = \underbrace{P(X

T = 0)}_{\text{pop non traitée}}$}

\begin{tikzpicture}[var/.style={draw,rounded corners=2pt,align=center}, every edge/.style={draw,->,>=stealth,very thick},xscale=2.5,yscale=2] \node (x) [var] {caractéristiques \ $X$}; \node (t) at (-0.5,-1) [var] {traitement\ $T$}; \node (y) at (0.5,-1) [var] {outcome\ $Y$}; \draw (x) edge (t); \draw (t) edge (y); \draw (x) edge (y); \end{tikzpicture} ::: ::: {.column width=”50%”} \includegraphics{figures/rct.png}

\small

\textcolor{gray}{Source : https://quantifyinghealth.com/cohort-vs-randomized-controlled-trials/} ::: ::: {.column width=”20%”} On n’a pas pu contrôler, il faut enlever le biais\bigskip

(ex. \textit{inverse probability weighting})\bigskip ::: ::::::

\pause

Inférence causale : quelles quantités d’intérêt mesure-t-on ?

Effet de traitement moyen : $ATE = \E [Y^1 - Y^0]$ (les traités s’en sortent-ils mieux ?)
Effet de traitement individuel : $uplift(x) = \E [Y^1|X = x] - \E [Y^0|X = x]$
(en fonction des caractéristiques $X$)

\small

\textcolor{gray}{Hsieh, Kasiviswanathan \& Kveton (NeurIPS 2022). Uplifting bandits.}

Et la théorie du \alert{contrôle} optimal, alors ?

Plutôt que d’attendre d’avoir suffisamment d’échantillons pour être statistiquement significatif (A/B test $\downarrow$)

\hfill $p(T)$ uniforme \qquad $p(T)$ non uniforme \quad $p(T|X)$ dépend de $X$ \hspace{5mm}

\raggedleft

Pourquoi ne pas plutôt : allouer dynamiquement davantage de trafic aux actions qui marchent (par opposition à celles qui ne marchent pas $\uparrow$) ?

Décider de donner le traitement ou pas en fonction des caractéristiques $X$ :
\alert{politique} $p(T = 1|X)$ (ex. \textit{dynamic treatment regime} en médecine personnalisée)

\raggedright \small

\textcolor{gray}{Source : dynamicyield.com}

Exemples de bandits contextuels

Agir ou pas ? Quel traitement choisir ? \centering

{width=50%}

\raggedright

Médecin : voit un patient $x$ choisit un traitement $y$ et obtient \sout{25 \euro} une récompense si le patient guérit

Classifieur : voit des données $x$ (une image) choisit une classe $y$ et obtient une récompense de 1 s’il a eu bon, 0 s’il s’est trompé

ChatGPT : reçoit un prompt $x$ choisit une réponse $y$ et obtient un reward de $r(x, y)$ qu’il estime lui-même à partir de préférences

Professeur : voit un élève $x$ choisit un exercice $y$ et… quelle récompense ?

Retour à l’éducation : exemples de récompenses

:::::: {.columns} ::: {.column width=70%} Effet de traitement : différence entre post-test et pre-test\bigskip

Minimiser l’incertitude sur les connaissances (active learning) : mais les apprenants échouent 50 \% du temps ::: ::: {.column width=30%} ::: ::::::

Différence entre le taux de succès après et avant (Clément et al. 2015 ; Shabana et al. AIED 2022 Best Paper Award) : doit aussi dépendre des questions posées

Est-ce que je cherche à :

emmagasiner le plus de connaissances (maximiser le nombre de composantes de connaissance acquises : Yessad, 2022) ;
ou juste bachoter l’examen suivant (ex. réviser ce qui a le plus de chances de tomber, Lan et al., 2016) ;
ou étant donné un objectif d’apprentissage, planifier les activités pour y parvenir ?

Des bandits à l’apprentissage par renforcement

\scriptsize

\begin{tabularx}{\columnwidth}{l*{4}{C}} \rule{0pt}{4.2ex} & Actions ne changent pas l’état & Actions changent l’état & Pas de contrôle\[3ex] \cline{2-4} \rule{0pt}{5.2ex} État observable & \multicolumn{1}{|c|}{Bandits contextuels} & Processus de décision markovien (PDM) & \multicolumn{1}{|c|}{Chaîne de Markov}\[3ex] \cline{2-4} \rule{0pt}{4.2ex} État caché & \multicolumn{1}{|c|}{Bandits à plusieurs bras} & PDM partiellement observable & \multicolumn{1}{|c|}{Modèle de Markov caché}\[3ex] \cline{2-4} \rule{0pt}{4.2ex} & Bandits & Apprentissage par renforcement (RL) & Modèles graphiques \end{tabularx}

\pause

\normalsize

Bandit : $A_0 \to R_0$

Bandit contextuel : $S_0 \to^\pi A_0 \to R_0$
\hfill Optimiser $V(\pi) = \E_{s,a,r} r = \int_s \int_a \int_r r\, p(r|s,a)\, \pi(a|s)\, p(s)\, ds\, da\, dr$

Épisode de RL : $S_0 \to^\pi A_0 \to R_0 \to S_1 \to^\pi A_1 \to R_1 \to S_2 \to^\pi \cdots \to R_T$
\hfill Trouver $\pi(a|s)$ qui optimise $\E_\pi [G_t | S_t = s]$ où $G_t = R_{t + 1} + \gamma R_{t + 2} + \cdots$.

Les bandits contextuels sont l’équivalent de RL pour des épisodes de taille 1

Reinforcement Learning from Human Feedback : ChatGPT

Collecter des données de démonstration et entraîner une politique supervisée $\pi_0(y x)$ (basée sur GPT-3)
Collecter des données de comparaisons, entraîner un modèle de récompense
(ex. modèle Elo à partir de 50k annotations)

\centering

$\displaystyle \textnormal{loss}(\alert\theta) = -\E_{(x, y_w, y_\ell) \sim D} \log \underbrace{\sigma(r_{\alert\theta}(x, y_k) - r_{\alert\theta}(x, y_\ell))}{\Pr(\textnormal{“réponse } y_k \textnormal{ est préférée à } y\ell \text{“})}$

\raggedright

Apprendre une politique qui optimise le modèle de récompense (avec PPO).

\centering

$\displaystyle \textnormal{objective}(\alert\phi) = \E_{(x, y) \sim \pi_{\alert\phi}} r_\theta(x, y) - \beta \textnormal{KL}(\pi_{\alert\phi}, \pi_0)$

\raggedright \small

\textcolor{gray}{Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, … \& Lowe (NeurIPS 2022).\ Training language models to follow instructions with human feedback.}

Un papier récent suggère de laisser tomber la partie 3 et de surtout se concentrer sur la partie 2

\textcolor{gray}{Rafailov, Sharma, Mitchell, Ermon, Manning and Finn (arXiv 2023.06).\ Direct Preference Optimization: Your Language Model is Secretly a Reward Model.}

Un dernier exemple : same-language subtitling (SLS)

\centering

{height=3cm} {height=3cm}

\raggedright

Étude sur 13000 écoliers entre 2002 et 2007.\bigskip

Purely from schooling, without any exposure to SLS, we found that \alert{24\%} children became good readers after 5 years of schooling. But in the group of school children that was exposed to SLS regularly, at most 30 min a week over five years, \alert{56\%} became good readers.

\small

\textcolor{gray}{Kothari, B. (2008). Let a billion readers bloom: Same language subtitling (SLS) on television for mass literacy. International review of education, 54(5-6), 773-780.}

Un récent outil : AxTongue.com \only<2>{basé sur un prompt ChatGPT}

\centering

{width=80%}

Merci pour votre attention !

:::::: {.columns} ::: {.column width=30%} {width=100%}

\centering

Richard Bellman (1920–1984)

Man of the century
Invented dynamic programming (1952) before programming was invented (1953) ::: ::: {.column width=70%}

Bellman’s Principle of Optimality

\bigskip

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.\bigskip

Applications à Soda en :

santé (observation de cohortes longitudinales à l’AP-HP)
culture (diversité du Pass Culture) ;
et éducation (récompenses court terme vs long terme).

\centering \vspace{1cm}

\hfilll jill-jenn.vie@inria.fr

::: ::::::