% Contextual bandits\newline and reinforcement learning from human feedback
% Jill-Jênn Vie
% Optimizing Human Learning workshop @ LAK 2024\newline March 19, 2024

---
handout: true
aspectratio: 169
institute: \includegraphics[height=1cm]{figures/soda.png} \includegraphics[height=1cm]{figures/inria.png}
biblio-style: authoryear
colorlinks: true
header-includes: |

  \usepackage{tikz}
  \usepackage{subfig}
  \usepackage{eurosym}
  \usepackage{graphbox}
  \usepackage{tabularx}
  \usepackage{annotate-equations}
  \usepackage{xcolor}
  \def\hfilll{\hspace{0pt plus 1 filll}}
  \def\D{\mathcal{D}}
  \def\E{\mathbb{E}}
  \renewcommand{\arraystretch}{1.2}
  \newcolumntype{C}{>{\centering\arraybackslash}X}
---

RL on human feedback

Reinforcement learning is popular in simulated environments such as games.

EDM and LAK communities have extensive experience in designing and fitting student models, but not in RL.

Challenges:

ITS: domain model, student model, tutoring model (policy $\pi(a \mid \theta)$)

\centering

Outline

  1. How to conduct experiments on real students better than A/B testing?
  2. How to conduct experiments (on real student data) without new interactions with students? (Offline RL)
    • What would have been the outcomes if we had asked the questions in a different order? (counterfactual learning)
  3. What is the reward function that ChatGPT is optimizing?

\vspace{1cm}

For part 2, a good reference is the following tutorial:

\small

\fullcite{saito2021counterfactual}

A/B testing, randomized controlled trials

\centering

\begin{tikzpicture}[var/.style={draw,rounded corners=2pt,align=center}, every edge/.style={draw,->,>=stealth,very thick},xscale=2.5,yscale=2] \node (x) [var] {covariates\\ $X$}; \node (t) at (-0.5,-1) [var] {treatment\\ $T$}; \node (y) at (0.5,-1) [var] {outcome\\ $Y$}; \draw (x) edge (t); \draw (t) edge (y); \draw (x) edge (y); \end{tikzpicture}

#

\centering

\raggedright

:::::: {.columns} ::: {.column width="30%"} In an ideal world, one can \alert{control} treatment:

\resizebox{\linewidth}{!}{$\underbrace{P(X \mid T = 1)}_{\text{treated pop.}} = \underbrace{P(X \mid T = 0)}_{\text{untreated pop.}}$}

\begin{tikzpicture}[var/.style={draw,rounded corners=2pt,align=center}, every edge/.style={draw,->,>=stealth,very thick},xscale=2.5,yscale=2] \node (x) [var] {covariates\\ $X$}; \node (t) at (-0.5,-1) [var] {treatment\\ $T$}; \node (y) at (0.5,-1) [var] {outcome\\ $Y$}; \draw (x) edge (t); \draw (t) edge (y); \draw (x) edge (y); \end{tikzpicture} ::: ::: {.column width="50%"} \includegraphics{figures/rct.png}

\small

\textcolor{gray}{Source: https://quantifyinghealth.com/cohort-vs-randomized-controlled-trials/} ::: ::: {.column width="20%"} In general we cannot control allocation, so we have to remove the bias\bigskip

(e.g. \textit{inverse probability weighting})\bigskip ::: ::::::
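Below is a minimal sketch of inverse probability weighting on synthetic data (all numbers and variable names are illustrative): the naive difference in means is confounded by $X$, while the weighted estimate recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic observational data: one covariate x, a treatment assignment
# that depends on x (confounding), and an outcome depending on both.
n = 10_000
x = rng.normal(size=n)                    # covariates X
p_treat = 1 / (1 + np.exp(-x))            # propensity P(T = 1 | X)
t = rng.binomial(1, p_treat)              # treatment T
y = 2 * x + 1.0 * t + rng.normal(size=n)  # outcome Y, true effect = 1.0

# Naive difference in means is biased: treated units tend to have higher x
naive = y[t == 1].mean() - y[t == 0].mean()

# Inverse probability weighting removes the bias (here we reuse the true
# propensity; in practice it would be estimated, e.g. by logistic regression)
ipw = np.mean(t * y / p_treat) - np.mean((1 - t) * y / (1 - p_treat))

print(f"naive: {naive:.2f}, IPW: {ipw:.2f} (true effect: 1.00)")
```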

\pause

Causal inference: what quantities of interest?

The policy is $p(T \mid X)$: deciding to give the treatment or not given covariates $X$

How about optimal \alert{control} theory?

Instead of waiting until we have enough samples to reach statistical significance (A/B testing)…


\hfill static uniform \alert{policy} $p(T)$ \quad dynamic $p(T)$ \quad $p(T|X)$ depends on $X$ \hspace{5mm}

\raggedleft

Why not dynamically allocate traffic to the actions that work
(as opposed to those that don't)? This is bandit learning.
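A minimal sketch of this idea with Thompson sampling on two arms and Bernoulli rewards; the success rates and horizon below are made up for illustration, and other bandit algorithms (UCB, epsilon-greedy) would work as well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two variants with unknown success rates (illustrative values)
true_rates = np.array([0.05, 0.08])
successes = np.ones(2)  # Beta(1, 1) prior on each arm
failures = np.ones(2)

for _ in range(10_000):
    # Sample a plausible success rate per arm, play the most promising one
    samples = rng.beta(successes, failures)
    arm = int(np.argmax(samples))
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

print("traffic per arm:", successes + failures - 2)           # most goes to arm 1
print("posterior means:", successes / (successes + failures))
```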

\raggedright

Therefore, estimating the average treatment effect is policy evaluation (without policy improvement)

\small

\textcolor{gray}{Source: dynamicyield.com}

Applications of bandits

\centering


\raggedright

Recommender system: receives reward 1 if the user clicks on the recommendation, 0 otherwise.

ChatGPT: receives a prompt $x$, selects an answer $y$, and obtains a reward $r(x, y)$ estimated from human preferences

Tutor: sees a student $x$, chooses an exercise $y$, and… where is the reward?

\small

\fullcite{doroudi2019s}

A first example of reward: adaptive tests

What is the tutor's objective? Ask as few questions as possible. Measure efficiently.

It assumes IRT-1PL as the student model. \only<2>{\hfill Problem: students fail 50\% of the time.}

\centering

My favorite student model: item response theory IRT-1PL

\centering

\resizebox{0.9\linewidth}{!}{$\displaystyle \substack{\normalsize \Pr(\textnormal{“student A solves question B”})\\ \Pr(\textnormal{“player A beats player B”})\\ \Pr(\textnormal{“A is preferred to B”})} = \frac1{1 + \exp(-(score_A - score_B))}$}

\raggedright

People attempt questions (Rasch) and possibly learn by attempting (Elo)

\begin{figure} \captionsetup[subfigure]{labelformat=empty,justification=centering} \subfloat[reCAPTCHA\\ (Luis von Ahn, 2008)]{\raisebox{2mm}{\includegraphics[width=0.25\linewidth]{figures/captcha.png}}} \subfloat[Elo (1967)\\ TrueSkill (2007)]{\includegraphics[width=0.25\linewidth]{figures/tournament-nyt.png}} \subfloat[Adaptive tests\\ (Rasch, 1960)]{\includegraphics[width=0.25\linewidth]{figures/irt.pdf}} \subfloat[Preference models\\ (Bradley \& Terry, 1952)]{\raisebox{3mm}{\includegraphics[width=0.25\linewidth]{figures/elo2.jpg}}} \end{figure}

\vfill \small

\fullcite{rasch1960studies}
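A minimal sketch of the IRT-1PL probability above together with an Elo-style online update; the ability, difficulty and step size `K` are illustrative values, not fitted parameters.

```python
import numpy as np

def p_success(theta, d):
    """IRT-1PL / Rasch / Bradley-Terry: logistic function of the score gap."""
    return 1 / (1 + np.exp(-(theta - d)))

def elo_update(theta, d, outcome, K=0.3):
    """Elo-style online update after one attempt (outcome in {0, 1})."""
    surprise = outcome - p_success(theta, d)
    return theta + K * surprise, d - K * surprise

theta, d = 0.0, 1.0         # student ability, item difficulty (assumed values)
print(p_success(theta, d))  # ~0.27: the item is harder than the student's level
theta, d = elo_update(theta, d, outcome=1)
print(theta, d)             # ability goes up, difficulty goes down slightly
```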

Other examples of rewards

Treatment effect: difference between post-test and pre-test\bigskip

Difference between success rate after and before (\cite{clement2015multi,shabana2022curriculumtutor}\footnote{Best Paper Award AIED 2022}): but it should depend on which questions were asked

What is my reward?

Contextual bandits

Observe student context $s$ (user ability, user history, day of the week, etc.)
$\to$ select activity $a$ $\to$ observe reward $r$

Find the policy $\pi(a \mid s)$ that maximizes average reward:\bigskip

\begin{equation} V(\pi) = \E r = \int_s \int_a \int_r \eqnmarkbox[NavyBlue]{s}{p(s)}\, \eqnmarkbox[OliveGreen]{a}{\pi(a \mid s)}\, \eqnmarkbox[WildStrawberry]{r}{p(r \mid s, a) r}\, ds\, da\, dr \end{equation}

\annotate[yshift=1em]{above,left}{s}{observe student context $s$} \annotate[yshift=1em]{above}{a}{select activity $a$ using policy} \annotate[yshift=-0.5em]{below}{r}{observe reward $r$}

\raggedright Given a dataset $\D_0 = (s_i, a_i, r_i)_i$ collected with a logging policy $\pi_0(a \mid s)$, can we evaluate a new policy $\pi$ without deploying it?

As you can see, these questions go beyond the application to education.

We usually observe only one outcome

Offline RL: what if we cannot collect new samples from real students?
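A minimal sketch of off-policy evaluation with inverse propensity scoring, assuming the logged data also recorded the probability $\pi_0(a_i \mid s_i)$ of each chosen action; the numbers are purely illustrative.

```python
import numpy as np

def ips_value(rewards, pi_probs, pi0_probs):
    """Importance-weighted (IPS) estimate of V(pi) from logged interactions
    (s_i, a_i, r_i) collected under the logging policy pi_0."""
    weights = pi_probs / pi0_probs  # pi(a_i | s_i) / pi_0(a_i | s_i)
    return float(np.mean(weights * rewards))

# Toy log: reward of each logged action, and the probability that the
# logging policy pi_0 / the new policy pi would have chosen that action
rewards   = np.array([1.0, 0.0, 1.0, 0.0])
pi0_probs = np.array([0.5, 0.5, 0.25, 0.25])
pi_probs  = np.array([0.9, 0.1, 0.80, 0.20])

print(ips_value(rewards, pi_probs, pi0_probs))  # unbiased if pi_0 > 0 everywhere
```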

Bandit pipeline

Contextual bandits

Find $\pi$ that optimizes $V$

\[V(\pi) = \int_s \int_a \int_r p(s)\, \pi(a \mid s)\, p(r \mid s, a)\, r\, ds\, da\, dr\]

Pipeline

\small \fullcite{saito2021open}

Off-policy estimation

Large Language Models

  1. Transformer: predict next word given first words
Transformers / are / a / new / machine / [learning]  
Transformers / are / a / new / machine / learning / [architecture]
  2. Demonstration data:
Query: put the first letters in uppercase in "optimizing human learning"
Answer: Optimizing Human Learning
  3. Comparison data:
Query: write a poem
Answer 1: Roses are red
Answer 2: Once upon a time, a prince in a castle

Where answer 2 is voted better by experts

A reward model takes a query $x$ and an answer $y$ and should satisfy $r(x, y_1) < r(x, y_2)$ whenever experts prefer $y_2$ to $y_1$
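A minimal sketch of the corresponding pairwise (Bradley–Terry) loss on reward-model scores, where the scores below are illustrative rather than produced by a trained model:

```python
import numpy as np

def pairwise_loss(r_preferred, r_rejected):
    """Bradley-Terry loss: push r(x, y_preferred) above r(x, y_rejected).
    Equals -log sigma(margin) = log(1 + exp(-margin)), averaged over pairs."""
    margin = r_preferred - r_rejected
    return float(np.mean(np.log1p(np.exp(-margin))))

# Reward-model scores for three (query, preferred answer, rejected answer) triples
print(pairwise_loss(np.array([2.0, 0.5, 1.0]),
                    np.array([1.0, 1.5, 0.9])))  # lower when preferences are respected
```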

Reinforcement Learning from Human Feedback: InstructGPT, ChatGPT

  1. Predict the next word $\pi(y \mid x)$ (GPT)
  2. Collect demonstration data, and train a supervised policy $\pi_0(y \mid x)$ (based on GPT)
  3. Collect comparison data (“only” 50k preferences), train a reward model using Elo

\centering

$\displaystyle \textnormal{loss}(\alert\theta) = -\E_{(x, y_k, y_\ell) \sim \D} \log \underbrace{\sigma(r_{\alert\theta}(x, y_k) - r_{\alert\theta}(x, y_\ell))}_{\Pr(\textnormal{“answer } y_k \textnormal{ is preferred to } y_\ell\textnormal{”})}$

  4. Optimize a policy against the reward model using PPO (“without going too far”).

$\displaystyle \textnormal{objective}(\alert\phi) = \E_{(x, y) \sim \pi_{\alert\phi}} r_\theta(x, y) - \beta \textnormal{KL}(\pi_{\alert\phi}, \pi_0)$

\raggedright \small

\fullcite{ouyang2022training}
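A minimal sketch of a sample-based estimate of this objective, where the KL term is approximated by the average log-ratio on answers sampled from $\pi_\phi$; the function name and all values are assumptions for illustration, not the InstructGPT implementation.

```python
import numpy as np

def rlhf_objective(rewards, logp_pi, logp_pi0, beta=0.02):
    """Estimate E[r_theta(x, y)] - beta * KL(pi_phi || pi_0) from samples
    y ~ pi_phi(. | x), approximating the KL by the mean log-ratio."""
    kl_estimate = np.mean(logp_pi - logp_pi0)  # >= 0 in expectation
    return float(np.mean(rewards) - beta * kl_estimate)

# Illustrative per-sample quantities
rewards  = np.array([0.8, 1.2, 0.3])        # r_theta(x, y)
logp_pi  = np.array([-12.0, -15.5, -9.0])   # log pi_phi(y | x)
logp_pi0 = np.array([-13.0, -15.0, -10.5])  # log pi_0(y | x)

print(rlhf_objective(rewards, logp_pi, logp_pi0))
```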

Step 2: a teacher should be better than the general population (if 50\% of the population believes something wrong, we do not want the LLM to imitate this behavior).
Steps 3–4: we can remove the reward model, according to the following paper.

\fullcite{rafailov2024direct}

Take home message

Dynamic, sequential decision making, using contextual bandits
Adaptive trials to replace randomized controlled trials

Importance weighting to remove the bias from collected data in offline RL

Sometimes we may still need a student model: model-based RL

Thanks for your attention!

:::::: {.columns} ::: {.column width=30%}

\centering

Richard Bellman (1920–1984)

Bellman’s Principle of Optimality

\bigskip

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.\bigskip

In our lab, applications to:

\centering \vspace{1cm}

\hfilll jill-jenn.vie@inria.fr

::: ::::::

From bandits to reinforcement learning

\scriptsize

\begin{tabularx}{\columnwidth}{l*{4}{C}}
\rule{0pt}{4.2ex} & Actions don’t change state & Actions change state & Cannot control\\[3ex] \cline{2-4}
\rule{0pt}{5.2ex} Observable & \multicolumn{1}{|c|}{Contextual bandits} & Markov Decision Process & \multicolumn{1}{|c|}{Markov Chain}\\[3ex] \cline{2-4}
\rule{0pt}{4.2ex} Hidden & \multicolumn{1}{|c|}{Multi-armed bandits} & Partially observable MDP & \multicolumn{1}{|c|}{Hidden Markov Model}\\[3ex] \cline{2-4}
\rule{0pt}{4.2ex} & Bandits & Reinforcement Learning & Graphical Models
\end{tabularx}

\pause

\normalsize

Episode: $S_0 \to^\pi A_0 \to R_1 \to S_1 \to^\pi A_1 \to R_2 \to S_2 \to^\pi \cdots \to R_T$

$G_t = R_{t + 1} + \gamma R_{t + 2} + \cdots = \sum_{k = t + 1}^T \gamma^{k - t - 1} R_k$

Find $\pi(a \mid s)$ that optimizes $\E_\pi [G_t \mid S_t = s]$

Bandits are the equivalent for episodes of length 1: $S \to A \to R$
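A minimal sketch of computing the returns $G_t$ by a backward pass over the rewards (the rewards and $\gamma$ below are illustrative):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.9):
    """With rewards = [R_1, ..., R_T], element t of the result is
    G_t = R_{t+1} + gamma R_{t+2} + ... (computed backwards)."""
    G = np.zeros(len(rewards))
    running = 0.0
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        G[k] = running
    return G

print(discounted_returns([0.0, 0.0, 1.0]))  # [0.81, 0.9, 1.0]
```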
