---
title: |
  \only<1>{Constrained Decision Transformer\newline for Offline Safe Reinforcement Learning}
  \only<2>{Thompson Sampling with Diffusion Generative Prior}
title-meta: Thompson Sampling with Diffusion Generative Prior
author: Jill-Jênn Vie (Researcher at Inria)
date: MILLE CILS 2023
aspectratio: 169
institute: \includegraphics[height=1cm]{figures/inria.png} \includegraphics[height=1cm]{figures/soda.png}
header-includes:
  - \newcommand{\E}{\mathbb{E}}
---

:::::: {.columns}
::: {.column width=80%}

We make the assumption that the dataset is both \alert{clean} and \alert{reproducible}, meaning that any trajectory in the dataset can be reliably reproduced by a policy.

\vspace{5mm}

To capture realistic bandit scenarios, we propose a novel diffusion model training procedure that trains from \alert{incomplete} and \alert{noisy} data, which could be of independent interest.
:::
::: {.column width=20%}
\vspace{1cm}

:::
::::::

## Context

It is boring to learn from scratch in bandits; it is important to have a good prior.

How to learn\only<2>{ \alert{a diffusion model}} from noisy and partial information?

\centering

\small (Figure from arXiv:2205.10113)

## Setting

Thompson sampling: at each round, sample mean rewards from the posterior, pull the arm with the highest sampled mean, then update the posterior with the observed reward.

The idea is to use a diffusion model to learn the prior over rewards, in a meta-training scheme where either the mean rewards of past tasks are observed exactly, or only noisy and partial observations of them are available (see the sketch below).
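
For concreteness, a minimal Python sketch of this loop; the names are illustrative, with `sample_posterior` standing in for the diffusion posterior sampler and `pull` for the bandit environment:

```python
import numpy as np

def thompson_sampling(sample_posterior, pull, n_rounds):
    """Generic Thompson sampling loop.

    sample_posterior(history): draws one plausible mean-reward vector,
        here from the diffusion prior conditioned on past observations.
    pull(a): plays arm a and returns its (noisy) reward.
    """
    history = []                          # (arm, reward) pairs seen so far
    for t in range(n_rounds):
        mu = sample_posterior(history)    # one posterior sample of the means
        a = int(np.argmax(mu))            # act greedily on the sample
        r = pull(a)                       # observe only arm a's reward
        history.append((a, r))
    return history
```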

## Meta-Challenge

For a vector $x$, $x^a$ denotes its $a$-th coordinate and $x^2$ its coordinate-wise square.

\pause

\centering

\scalebox{4}{$x_{i,\ell}^a$}

\raggedright

Three kinds of indices coexist: the training task $i$, the diffusion step $\ell$, and the coordinate (arm) $a$.

## Challenge I: partial information

At round $t$ we only observe component $a_t$; still, we want to update the posterior of the diffusion model.

Sample from $q(X_0 \mid y) \propto q(y \mid X_0)\, q(X_0)$ where $y$ contains partial noisy observations of $X_0$.
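
One generic way to do this (a sketch, not necessarily the paper's exact sampler) is guided reverse diffusion: at each step, nudge the denoiser's estimate of $X_0$ toward the observed coordinates of $y$. The sketch below assumes a variance-exploding forward process and Gaussian observation noise; all names are illustrative.

```python
import numpy as np

def posterior_sample(h, y, mask, sigmas, obs_std, guide=1.0, rng=None):
    """Draw x0 approximately from q(x0 | y) by guided reverse diffusion.

    h(x, l):  trained denoiser, returns an estimate of x0 at step l.
    y, mask:  noisy observations; mask[a] = 1 iff coordinate a was observed.
    sigmas:   noise scales of the forward process, indexed by step.
    """
    rng = rng or np.random.default_rng()
    L = len(sigmas) - 1
    x = sigmas[L] * rng.standard_normal(y.shape)   # start from pure noise
    for l in range(L, 0, -1):
        x0_hat = h(x, l)                           # denoiser's guess of x0
        # Likelihood term q(y | x0): pull observed coordinates toward y
        x0_hat = x0_hat + guide * mask * (y - x0_hat) / obs_std**2
        # Re-noise to the next (smaller) level, DDIM-style
        x = x0_hat + sigmas[l - 1] * rng.standard_normal(y.shape)
    return x0_hat
```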


## About diffusion

The forward process is Markovian: $q(x_{1:L} \mid x_0) = \prod_{\ell = 0}^{L - 1} q(x_{\ell + 1} \mid x_\ell)$.
A denoising diffusion model is an approximation of the reverse process: $q(x_{0:L - 1} \mid x_L) = \prod_{\ell = 0}^{L - 1} q(x_\ell \mid x_{\ell + 1})$.

Therefore, we train a denoiser $h_\theta$ to estimate $x_0$:

\[\textnormal{maximize } \mathrm{ELBO} \Leftrightarrow \textnormal{minimize } \sum_{\ell = 1}^L \E_{x_0 \sim Q_0,\, x_\ell \sim X_\ell \mid x_0} \| x_0 - h_\theta(x_\ell, \ell) \|^2\]
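
A minimal PyTorch sketch of a Monte-Carlo estimate of this objective, assuming for simplicity a variance-exploding forward process $x_\ell = x_0 + \sigma_\ell \varepsilon$; `h_theta` and the schedule `sigmas` are placeholders:

```python
import torch

def diffusion_loss(h_theta, x0, sigmas):
    """x0-prediction loss: regress h_theta(x_l, l) onto the clean x0.

    x0:     batch of clean vectors sampled from Q_0, shape (batch, dim).
    sigmas: tensor of noise scales, sigmas[l] for steps l = 0..L.
    """
    batch = x0.shape[0]
    L = sigmas.shape[0] - 1
    l = torch.randint(1, L + 1, (batch,))          # random step per sample
    x_l = x0 + sigmas[l].unsqueeze(-1) * torch.randn_like(x0)
    return ((x0 - h_theta(x_l, l)) ** 2).sum(dim=-1).mean()
```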

## Challenge II: partial and noisy information

We cannot observe $x_0$, only a noisy version of it $\to$ stochastic EM-like procedure.
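
One way such a loop could look (a sketch under the same assumptions as above; `posterior_sample` is the guided sampler sketched earlier, and `fit` stands for a few epochs of the denoising objective):

```python
def train_from_noisy(h_theta, fit, ys, masks, sigmas, obs_std, n_iters=10):
    """Alternate imputation (E-like) and refitting (M-like) steps.

    ys, masks: noisy, partial observations of the unknown clean vectors x0.
    fit(h_theta, x0s): retrains the denoiser on imputed clean vectors.
    """
    for _ in range(n_iters):
        # E-like step: impute plausible clean vectors with the current model
        x0s = [posterior_sample(h_theta, y, m, sigmas, obs_std)
               for y, m in zip(ys, masks)]
        # M-like step: refit the diffusion model on the imputations
        h_theta = fit(h_theta, x0s)
    return h_theta
```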


## What loss $\mathcal{L}$?

Instead, they choose the following loss (Metzler et al. 2018; Zhussip et al. 2019):

\centering

[Figure: definition of the loss $\mathcal{L}$]

\raggedright

where $\lambda = 0$ encourages exploration.

## Results

Recommender system \hfill Bidding in auctions \hfill Escaping a maze

Top: real means; bottom: perturbed means.

## Thanks

Yu-Guan Hsieh, Shiva Kasiviswanathan, Branislav Kveton, Patrick Blöbaum.
\alert{Thompson Sampling with Diffusion Generative Prior.} Proceedings of the 40th International Conference on Machine Learning, PMLR 202:13434-13468, 2023. \url{https://arxiv.org/abs/2301.05182}

\vspace{1cm}

jill-jenn.vie@inria.fr / jjv.ie