Jill-Jênn Vie

Researcher at Inria

\textcolor{gray}{Source: https://quantifyinghealth.com/cohort-vs-randomized-controlled-trials/}

Randomized controlled trials vs. cohort study


\begin{tikzpicture}[var/.style={draw,rounded corners=2pt,align=center}, every edge/.style={draw,->,>=stealth,very thick},xscale=2.5,yscale=2]
\node (x) [var] {covariates\\ $X$};
\node (t) at (-0.5,-1) [var] {treatment\\ $T$};
\node (y) at (0.5,-1) [var] {outcome\\ $Y$};
\draw (x) edge (t);
\draw (t) edge (y);
\draw (x) edge (y);
\end{tikzpicture}


Randomized controlled trial (A/B testing)

We can \alert{control} the treatment, so the covariates follow the same distribution in the treated and non-treated groups: $P(X \mid T = 0) = P(X \mid T = 1)$

Cohort study (observational data)

We cannot control the treatment, so we have to remove bias from the estimates
(e.g. inverse probability weighting)\bigskip
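As a minimal sketch of inverse probability weighting, assume the propensity scores $e(x) = P(T = 1 \mid X = x)$ are known or already estimated. The function name `ipw_ate` and its arguments are hypothetical, chosen for illustration only.

```python
def ipw_ate(treatments, outcomes, propensities):
    """Inverse-probability-weighted estimate of the average treatment effect.

    treatments:   0/1 flags, one per unit
    outcomes:     observed outcome Y, one per unit
    propensities: assumed P(T = 1 | X) for each unit, strictly between 0 and 1
    """
    n = len(treatments)
    rows = list(zip(treatments, outcomes, propensities))
    # Treated units are reweighted by 1 / e(x), controls by 1 / (1 - e(x))
    treated = sum(t * y / e for t, y, e in rows)
    control = sum((1 - t) * y / (1 - e) for t, y, e in rows)
    return (treated - control) / n
```

On a toy cohort where $Y = X + T$ and units with $X = 1$ are treated with probability 0.8 (0.2 for $X = 0$), the naive difference of group means is about 1.6, while the weighted estimate recovers the true effect of 1.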

Randomized \alert{controlled} trials $\to$ How about optimal \alert{control} theory?

Here come bandits

Instead of having to wait for sufficient sample size and high statistical significance (A/B test)

How about: dynamically allocating traffic to actions that are performing well
(while allocating less and less traffic to underperforming actions)
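One common way to allocate traffic dynamically is Thompson sampling. The sketch below assumes binary rewards (e.g. click or no click) and simulates feedback from hypothetical `true_rates`; it illustrates the idea and is not the method of the cited blog post.

```python
import random

def thompson_sampling(true_rates, n_rounds, seed=0):
    """Bernoulli Thompson sampling; returns how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw a plausible success rate for each arm from its Beta posterior
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Simulated feedback (assumption: true_rates is unknown to the learner)
        reward = 1 if rng.random() < true_rates[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Calling `thompson_sampling([0.1, 0.9], 2000)` should send most of the traffic to the second arm, since the posterior of the weaker arm quickly concentrates on low values.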

\textcolor{gray}{Source: blog post on dynamicyield.com}

Quantities of interest – causal inference

Average treatment effect

\[ATE = \E [Y^1 - Y^0]\]

Individual or heterogeneous treatment effect

Uplift: the incremental profit brought by treatment conditioned on features of each individual

\[u(x) = \E [Y^1|X = x] - \E [Y^0|X = x]\]
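The two quantities are related: by the law of total expectation, the ATE is the uplift averaged over the covariate distribution.

\[ATE = \E [Y^1 - Y^0] = \E_X \big[ \E [Y^1 - Y^0 \mid X] \big] = \E_X [u(X)]\]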


\textcolor{gray}{Yamane, Yger, Atif \& Sugiyama (NeurIPS 2018). Uplift Modeling from Separate Labels.}

\textcolor{gray}{Hsieh, Kasiviswanathan \& Kveton (NeurIPS 2022). Uplifting bandits.}


Deciding whether to treat or not given $x_i$: \alert{policy}
(seen in dynamic treatment regime)

Contextual bandits

Optimize average reward

Or regret: how do we perform compared to the best possible action at each time?

Or best arm identification: which action/treatment is the best?
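To make regret concrete, here is a minimal sketch for the simpler non-contextual case. It assumes the true mean reward of each arm is known, which only holds in simulation; `cumulative_regret` is a hypothetical helper name.

```python
def cumulative_regret(chosen_arms, mean_rewards):
    """Cumulative (pseudo-)regret after each round.

    chosen_arms:  index of the arm played at each round
    mean_rewards: assumed true mean reward of each arm
    """
    best = max(mean_rewards)
    regret, total = [], 0.0
    for arm in chosen_arms:
        # Gap between the best arm and the arm actually played
        total += best - mean_rewards[arm]
        regret.append(total)
    return regret
```

A policy that always plays the best arm has zero regret; a good bandit algorithm has regret that grows sublinearly with the number of rounds.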

Quantities of interest – stochastic bandits


Off-policy estimation
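A standard estimator for off-policy estimation is inverse propensity scoring (IPS). The sketch below assumes each logged interaction stores the probability with which the logging policy chose its action; the log format and the names `ips_value` and `target_policy` are hypothetical.

```python
def ips_value(logs, target_policy):
    """IPS estimate of the average reward of a target policy.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy gave to `action`
    target_policy: function (context, action) -> probability under the new policy
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Importance weight: how much more likely the new policy makes this action
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)
```

The estimate is unbiased only if the logging policy gives nonzero probability to every action the target policy may take.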

From bandits to reinforcement learning


\begin{tabularx}{\columnwidth}{l*{4}{C}}
\rule{0pt}{4.2ex} & Actions don’t change state & Actions change state & Cannot control\\[3ex]
\cline{2-4}
\rule{0pt}{5.2ex} Observable & \multicolumn{1}{|c|}{Contextual bandits} & Markov Decision Process & \multicolumn{1}{|c|}{Markov Chain}\\[3ex]
\cline{2-4}
\rule{0pt}{4.2ex} Hidden & \multicolumn{1}{|c|}{Multi-armed bandits} & Partially observable MDP & \multicolumn{1}{|c|}{Hidden Markov Model}\\[3ex]
\cline{2-4}
\rule{0pt}{4.2ex} & Bandits & Reinforcement Learning & Graphical Models
\end{tabularx}



Episode: $S_0 \to^\pi A_0 \to R_0 \to S_1 \to^\pi A_1 \to R_1 \to S_2 \to^\pi \cdots \to R_T$

$G_t = R_{t + 1} + \gamma R_{t + 2} + \cdots = \sum_{k = t + 1}^T \gamma^{k - t - 1} R_k$

Find $\pi(a \mid s)$ that optimizes $\E_\pi [G_t \mid S_t = s]$

Bandits are the equivalent for episodes of length 1
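The return $G_t$ can be computed with a single backward pass over the rewards. This is a minimal sketch; the function name `discounted_return` is hypothetical, and the list is assumed to hold $R_{t+1}, \ldots, R_T$ in order.

```python
def discounted_return(rewards, gamma):
    """Discounted return G_t given the rewards [R_{t+1}, ..., R_T]."""
    g = 0.0
    # Work backwards using G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With a single reward, as in a bandit, the return is just that reward and $\gamma$ plays no role.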

On-policy vs. off-policy

Problem: old episodes were collected under an older policy, so does it make sense to learn from them?

$Q$-learning is an off-policy algorithm
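The tabular update makes the off-policy property visible: the target takes a maximum over next actions, whatever action the data-collecting policy actually took next. This is a minimal sketch with hypothetical names and default hyperparameters.

```python
from collections import defaultdict

def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """One tabular Q-learning step on a logged transition (s, a, r, s')."""
    # Greedy value of the next state, independent of the behaviour policy
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]

# Unseen state-action pairs start at 0
Q = defaultdict(float)
q_learning_update(Q, "s0", "a", 1.0, "s1", ["a", "b"])
```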



\textcolor{gray}{Hasselt (NeurIPS 2010). Double $Q$-Learning.}\bigskip



\textcolor{gray}{Kumar, Zhou, Tucker \& Levine (NeurIPS 2020). Conservative $Q$-Learning for Offline Reinforcement Learning.}

Dynamic programming (1952)

Richard Bellman (1920–1984)


Invented dynamic programming before programming was invented (Autocode, 1953)

Principle of Optimality

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
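The principle of optimality is what justifies value iteration: the value of a state is the best immediate reward plus the discounted value of the state it leads to. The sketch below assumes a deterministic toy MDP encoded as a nested dictionary, a hypothetical format chosen for brevity.

```python
def value_iteration(transitions, gamma=0.9, n_iters=100):
    """Optimal state values for a deterministic toy MDP.

    transitions[state][action] = (next_state, reward);
    a state with no actions is terminal and has value 0.
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(n_iters):
        # Bellman optimality backup applied to every state
        values = {
            s: max((r + gamma * values[s2] for s2, r in acts.values()),
                   default=0.0)
            for s, acts in transitions.items()
        }
    return values
```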


Learning from human (noisy) labels (e.g. reCAPTCHA, Duolingo) based on graphical models.

\textcolor{gray}{Raykar, Yu, Zhao, Valadez, Florin, Bogoni, \& Moy (JMLR 2010).
Learning from crowds.}

Ex. \textcolor{gray}{Bachrach, Graepel, Minka \& Guiver (ICML 2012). How to grade a test without knowing the answers—A Bayesian graphical model for adaptive crowdsourcing and aptitude testing.}

Reinforcement Learning from Human Feedback

  1. Collect demonstration data, and train a supervised policy $\pi_0(y \mid x)$ (based on GPT-3)
  2. Collect comparison data, train a reward model
    (Elo rating, or BPR; “only” 50k annotations)
\[\textnormal{loss}(\alert\theta) = -\E_{(x, y_w, y_\ell) \sim D} \log \underbrace{\sigma(r_{\alert\theta}(x, y_w) - r_{\alert\theta}(x, y_\ell))}_{\Pr(\textnormal{output } y_w \textnormal{ is preferred to } y_\ell)}\]
  3. Optimize a policy against the reward model using PPO.
\[\textnormal{objective}(\alert\phi) = \E_{(x, y) \sim \pi_{\alert\phi}} r_\theta(x, y) - \beta \textnormal{KL}(\pi_{\alert\phi}, \pi_0)\]
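The reward-model loss for a single comparison can be sketched as follows, taking the two scalar rewards as given. The function name `preference_loss` is hypothetical, and the expression is rearranged only for numerical stability.

```python
import math

def preference_loss(reward_preferred, reward_rejected):
    """-log sigmoid(r_w - r_l) for one comparison of two outputs."""
    margin = reward_preferred - reward_rejected
    # log(1 + exp(-margin)), written so that exp never overflows
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
```

The loss equals $\log 2$ when both outputs get the same reward and decreases towards 0 as the preferred output is scored higher than the rejected one.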


\textcolor{gray}{Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, … \& Lowe (NeurIPS 2022). Training language models to follow instructions with human feedback.}



Clémence Réda, Marie Skłodowska-Curie postdoc
Drug repurposing: given a disease, find a drug


Tomas Rigaux, engineer
Recommendation of cultural content with diversity
(Pass Culture, 15–18 years old)


Samuel Girard, intern
Off-policy estimation of new policies for asking exercises
Compromise between short-term reward (learner solves hard problems) and long-term reward (they progress a lot)