% Contextual bandits\newline and reinforcement learning from human feedback
% Jill-Jênn Vie
% Optimizing Human Learning workshop @ LAK 2024\newline March 19, 2024

---
handout: true
aspectratio: 169
institute: \includegraphics[height=1cm]{figures/soda.png} \includegraphics[height=1cm]{figures/inria.png}
biblio-style: authoryear
colorlinks: true
header-includes: |
  \usepackage{tikz}
  \usepackage{subfig}
  \usepackage{eurosym}
  \usepackage{graphbox}
  \usepackage{tabularx}
  \usepackage{annotate-equations}
  \usepackage{xcolor}
  \def\hfilll{\hspace{0pt plus 1 filll}}
  \def\D{\mathcal{D}}
  \def\E{\mathbb{E}}
  \renewcommand{\arraystretch}{1.2}
  \newcolumntype{C}{>{\centering\arraybackslash}X}
---
Reinforcement learning is popular in simulated environments such as games.
EDM and LAK communities have extensive experience in designing and fitting student models, but not in RL.
Challenges:
ITS: domain model, student model, and tutoring model, i.e. a policy $\pi(a | \theta)$
\centering
\vspace{1cm}
For part 2, a good reference is the following tutorial:
\small
\fullcite{saito2021counterfactual}
\centering
\begin{tikzpicture}[var/.style={draw,rounded corners=2pt,align=center}, every edge/.style={draw,->,>=stealth,very thick},xscale=2.5,yscale=2] \node (x) [var] {covariates \ $X$}; \node (t) at (-0.5,-1) [var] {treatment\ $T$}; \node (y) at (0.5,-1) [var] {outcome\ $Y$}; \draw (x) edge (t); \draw (t) edge (y); \draw (x) edge (y); \end{tikzpicture}
#
\centering
\raggedright
:::::: {.columns}
::: {.column width="30%"}
In an ideal world, one can \alert{control} treatment:

\resizebox{\linewidth}{!}{$\underbrace{P(X | T = 1)}_{\text{treated pop.}} = \underbrace{P(X | T = 0)}_{\text{untreated pop.}}$}

\begin{tikzpicture}[var/.style={draw,rounded corners=2pt,align=center}, every edge/.style={draw,->,>=stealth,very thick},xscale=2.5,yscale=2] \node (x) [var] {covariates \ $X$}; \node (t) at (-0.5,-1) [var] {treatment\ $T$}; \node (y) at (0.5,-1) [var] {outcome\ $Y$}; \draw (x) edge (t); \draw (t) edge (y); \draw (x) edge (y); \end{tikzpicture}
:::
::: {.column width="50%"}
\includegraphics{figures/rct.png}

\small
\textcolor{gray}{Source: https://quantifyinghealth.com/cohort-vs-randomized-controlled-trials/}
:::
::: {.column width="20%"}
In general we cannot control allocation, so we have to remove the bias\bigskip

(e.g. \textit{inverse probability weighting})\bigskip
:::
::::::
\pause
The policy is $p(T | X)$: deciding whether to give the treatment, given covariates $X$
Instead of waiting to collect enough samples to reach statistical significance (A/B test)
\hfill static uniform \alert{policy} $p(T)$ \quad dynamic $p(T)$ \quad $p(T|X)$ depends on $X$ \hspace{5mm}
\raggedleft
Why not dynamically allocate traffic to actions that work
(as opposed to those that don't)? This is bandit learning.
\raggedright
Therefore, estimating the average treatment effect amounts to policy evaluation (without policy improvement)
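A minimal sketch of inverse probability weighting for the average treatment effect, assuming the propensity scores $p(T = 1 | X)$ are known or already estimated (function names and toy data are illustrative):

```python
import numpy as np

def ipw_ate(t, y, propensity):
    """Inverse-probability-weighted estimate of the average treatment effect.

    t          -- binary treatment indicators, shape (n,)
    y          -- observed outcomes, shape (n,)
    propensity -- estimated p(T = 1 | X) for each unit, shape (n,)
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    e = np.asarray(propensity, dtype=float)
    treated = np.mean(t * y / e)              # reweighted treated outcomes
    control = np.mean((1 - t) * y / (1 - e))  # reweighted untreated outcomes
    return treated - control

# Toy data with confounding: treatment is given more often to high-X units.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
e = 1 / (1 + np.exp(-x))                      # true propensity p(T = 1 | X)
t = rng.binomial(1, e)
y = 2 * t + x + rng.normal(size=x.size)       # true average treatment effect = 2
print(ipw_ate(t, y, e))                       # close to 2; the naive mean difference is biased
```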
\small
\textcolor{gray}{Source: dynamicyield.com}
\centering
\raggedright
Recommender system: receives reward 1 if the user clicks on the recommendation, 0 otherwise.
ChatGPT: receives a prompt $x$, selects an answer $y$, and obtains a reward $r(x, y)$ estimated from human preferences
Tutor: sees a student $x$, chooses an exercise $y$, and… where is the reward?
\small
\fullcite{doroudi2019s}
What is the tutor's objective? Ask as few questions as possible: measure efficiently.
It assumes IRT-1PL as the student model. \only<2>{\hfill Problem: students fail 50\% of the time.}
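A minimal sketch of a common selection rule under IRT-1PL: pick the item with maximal Fisher information, which for the Rasch model is the item whose difficulty is closest to the current ability estimate, i.e. a predicted success probability around 50\% (function names and toy values are illustrative):

```python
import numpy as np

def pick_item(theta_hat, difficulties, asked):
    """Select the unasked item with maximal Fisher information under IRT-1PL.

    For the Rasch model, item information is p * (1 - p), which peaks when the
    item difficulty equals the ability estimate theta_hat, i.e. when the
    predicted success probability is 50%.
    """
    p = 1 / (1 + np.exp(-(theta_hat - difficulties)))
    info = p * (1 - p)
    info[list(asked)] = -np.inf   # never re-ask an item
    return int(np.argmax(info))

difficulties = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(pick_item(theta_hat=0.3, difficulties=difficulties, asked={2}))  # -> item 3
```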
\centering
\centering
\resizebox{0.9\linewidth}{!}{$\displaystyle \substack{\normalsize \Pr(\textnormal{“student A solves question B”})\\ \Pr(\textnormal{“player A beats player B”})\\ \Pr(\textnormal{“A is preferred to B”})} = \frac1{1 + \exp(-(score_A - score_B))}$}
\raggedright
People attempt questions (Rasch) and possibly learn by attempting (Elo)
\begin{figure} \captionsetup[subfigure]{labelformat=empty,justification=centering} \subfloat[reCAPTCHA\ (Luis von Ahn, 2008)]{\raisebox{2mm}{\includegraphics[width=0.25\linewidth]{figures/captcha.png}}} \subfloat[Elo (1967)\ TrueSkill (2007)]{\includegraphics[width=0.25\linewidth]{figures/tournament-nyt.png}} \subfloat[Adaptive tests\ (Rasch, 1960)]{\includegraphics[width=0.25\linewidth]{figures/irt.pdf}} \subfloat[Preference models\ (Bradley \& Terry, 1952)]{\raisebox{3mm}{\includegraphics[width=0.25\linewidth]{figures/elo2.jpg}}} \end{figure}
\vfill \small
\fullcite{rasch1960studies}
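A toy sketch of the shared logistic form above, with an Elo-style update (the step size $k$ is an illustrative value):

```python
import math

def win_prob(score_a, score_b):
    """Pr(A solves / beats / is preferred to B) under the logistic model."""
    return 1 / (1 + math.exp(-(score_a - score_b)))

def elo_update(score_a, score_b, outcome, k=0.4):
    """Elo-style update: move both scores toward the observed outcome (1 if A wins, else 0)."""
    delta = k * (outcome - win_prob(score_a, score_b))
    return score_a + delta, score_b - delta

student, question = 0.0, 1.0                      # ability and difficulty on the same scale
print(win_prob(student, question))                # ~0.27: the question is hard for this student
print(elo_update(student, question, outcome=1))   # a success raises ability, lowers difficulty
```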
Treatment effect: difference between post-test and pre-test\bigskip
Difference between success rate after and before (\cite{clement2015multi,shabana2022curriculumtutor}[^1]): but it should depend on which questions were asked
What is my reward?
Observe student context $s$ (user ability, user history, day of the week, etc.)
$\to$ select activity $a$ $\to$ observe reward $r$
Find the policy $\pi(a \mid s)$ that maximizes average reward:\bigskip
\begin{equation} V(\pi) = \E r = \int_s \int_a \int_r \eqnmarkbox[NavyBlue]{s}{p(s)}\, \eqnmarkbox[OliveGreen]{a}{\pi(a \mid s)}\, \eqnmarkbox[WildStrawberry]{r}{p(r \mid s, a) r}\, ds\, da\, dr \end{equation}
\annotate[yshift=1em]{above,left}{s}{observe student context $s$} \annotate[yshift=1em]{above}{a}{select activity $a$ using policy} \annotate[yshift=-0.5em]{below}{r}{observe reward $r$}
\raggedright Given a dataset $\D_0 = (s_i, a_i, r_i)_i$ collected with policy $\pi_0(a \mid s)$:
As you can see, these questions go beyond the application to education.
Find $\pi$ that optimizes $V$
\[V(\pi) = \int_s \int_a \int_r p(s)\, \pi(a \mid s)\, p(r \mid s, a)\, r\, ds\, da\, dr\]

\small \fullcite{saito2021open}
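A minimal hand-rolled sketch of the inverse propensity scoring (IPS) estimator of $V(\pi)$ from such logged data, assuming the logging probabilities $\pi_0(a_i \mid s_i)$ were recorded (variable names and toy data are illustrative):

```python
import numpy as np

def ips_value(rewards, logging_probs, target_probs):
    """IPS estimate of V(pi) from logs (s_i, a_i, r_i) collected under pi_0.

    rewards       -- logged rewards r_i
    logging_probs -- pi_0(a_i | s_i), probability the logging policy chose a_i
    target_probs  -- pi(a_i | s_i), probability the evaluated policy would choose a_i
    """
    w = np.asarray(target_probs) / np.asarray(logging_probs)  # importance weights
    return float(np.mean(w * np.asarray(rewards)))

# Toy check: two actions, uniform logging policy, target policy always plays action 1.
rng = np.random.default_rng(0)
n = 100_000
a = rng.integers(0, 2, size=n)                    # logged actions under pi_0 (uniform)
r = rng.binomial(1, np.where(a == 1, 0.7, 0.3))   # action 1 pays off 0.7 on average
print(ips_value(r, np.full(n, 0.5), (a == 1).astype(float)))  # close to 0.7, estimated offline
```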
Transformers / are / a / new / machine / [learning]
Transformers / are / a / new / machine / learning / [architecture]
Query: put the first letters in uppercase in "optimizing human learning"
Answer: Optimizing Human Learning
Query: write a poem
Answer 1: Roses are red
Answer 2: Once upon a time, a prince in a castle
Where answer 2 is voted better by experts
A reward model takes a query $x$ and an answer $y$ and should satisfy $r(x, y_1) < r(x, y_2)$ whenever experts prefer $y_2$ over $y_1$
Predict the next word $\pi(y | x)$ (GPT)
Collect demonstration data, and train a supervised policy $\pi_0(y | x)$ (based on GPT)
\centering
$\displaystyle \textnormal{loss}(\alert\theta) = -\E_{(x, y_k, y_\ell) \sim D} \log \underbrace{\sigma(r_{\alert\theta}(x, y_k) - r_{\alert\theta}(x, y_\ell))}_{\Pr(“\textnormal{answer } y_k \textnormal{ is preferred to } y_\ell”)}$
$\displaystyle \textnormal{objective}(\alert\phi) = \E_{(x, y) \sim \pi_{\alert\phi}} r_\theta(x, y) - \beta \textnormal{KL}(\pi_{\alert\phi}, \pi_0)$
\raggedright \small
\fullcite{ouyang2022training}
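A toy sketch of the two quantities above on plain arrays (no actual language model; names and values are illustrative):

```python
import numpy as np

def pairwise_loss(r_preferred, r_rejected):
    """-log sigma(r_theta(x, y_k) - r_theta(x, y_l)), averaged over preference pairs."""
    return float(np.mean(np.log1p(np.exp(-(r_preferred - r_rejected)))))

def kl_regularized_objective(rewards, logp_pi, logp_pi0, beta=0.1):
    """Monte-Carlo estimate of E[r_theta(x, y)] - beta * KL(pi_phi || pi_0)
    over samples y ~ pi_phi, using log pi_phi(y|x) - log pi_0(y|x) per sample."""
    return float(np.mean(rewards - beta * (logp_pi - logp_pi0)))

# Toy reward-model scores for three (preferred, rejected) answer pairs
print(pairwise_loss(np.array([2.0, 1.5, 0.3]), np.array([0.5, 1.0, 0.8])))
```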
Part 2: A teacher should be better than the general population (if 50\% of the population believes something wrong, we do not want the LLM to imitate that behavior).
Part 3–4: We can remove the reward model, according to the following paper.
\fullcite{rafailov2024direct}
Dynamic, sequential decision making, using contextual bandits
Adaptive trials to replace randomized controlled trials
Importance weighting to remove the bias from collected data in offline RL
Sometimes we may still need a student model: model-based RL
:::::: {.columns}
::: {.column width=30%}
\centering
Richard Bellman (1920–1984)
\bigskip
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.\bigskip
In our lab, applications to:
\centering \vspace{1cm}
:::
::::::
\scriptsize
\begin{tabularx}{\columnwidth}{l*{4}{C}}
\rule{0pt}{4.2ex} & Actions don’t change state & Actions change state & Cannot control \\[3ex] \cline{2-4}
\rule{0pt}{5.2ex} Observable & \multicolumn{1}{|c|}{Contextual bandits} & Markov Decision Process & \multicolumn{1}{|c|}{Markov Chain} \\[3ex] \cline{2-4}
\rule{0pt}{4.2ex} Hidden & \multicolumn{1}{|c|}{Multi-armed bandits} & Partially observable MDP & \multicolumn{1}{|c|}{Hidden Markov Model} \\[3ex] \cline{2-4}
\rule{0pt}{4.2ex} & Bandits & Reinforcement Learning & Graphical Models
\end{tabularx}
\pause
\normalsize
Episode: $S_0 \to^\pi A_0 \to R_1 \to S_1 \to^\pi A_1 \to R_2 \to S_2 \to^\pi \cdots \to R_T$
$G_t = R_{t + 1} + \gamma R_{t + 2} + \cdots = \sum_{k = t + 1}^T \gamma^{k - t - 1} R_k$
Find $\pi(a | s)$ that optimizes $\E_\pi [G_t | S_t = s]$
Bandits are the special case of episodes of length 1: $S \to A \to R$
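A minimal sketch of the return computation, with the reward list ordered $R_1, \dots, R_T$ (values are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """G_0 = R_1 + gamma * R_2 + ..., computed right to left."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0, 0, 1]))   # 0.81: one final reward, discounted twice
print(discounted_return([1]))         # bandit case, episode of length 1: G = R
```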
[^1]: Best Paper Award AIED 2022