% Multilayer Perceptrons:\newline Expressiveness, overfitting, regularization
% Marc Lelarge; Kevin Scaman; Jill-Jênn Vie
% Oct 21, 2022

---
aspectratio: 169
header-includes: |
  \usepackage{bm}
  \usepackage{datetime2}
  \def\E{\mathbb{E}}
  \def\R{\mathbf{R}}
  \def\N{\mathbf{N}}
  \def\C{\mathcal{C}}
  \def\L{\mathcal{L}}
  \def\NN{\mathcal{N}}
  \def\MLP{\textnormal{MLP}}
  \def\x{\bm{x}}
  \def\y{\bm{y}}
  \def\W{\bm{W}}
  \def\b{\bm{b}}
  \def\CC{\bm{C}}
  \def\boldth{\bm{\theta}}
  \def\hf{\hat{f}}
  \usepackage[skins,minted]{tcolorbox}
  \definecolor{bgm}{rgb}{0.95,0.95,0.95}
  \newtcblisting{myminted}[3][]{listing engine=minted,listing only,#1,minted language=#3,colback=bgm,minted options={linenos,fontsize=\footnotesize,numbersep=2mm,escapeinside=||,mathescape=true}}
---
\pause
The $\ell$th layer has $d_\ell$ neurons. Input layer $\ell = 0$, output layer $\ell = L$.
$\sigma$ is the link function. Usually, $\sigma = \textnormal{ReLU} = \max(\mathbf{0}, \x)$. How many parameters?
\pause
\begin{myminted}{MLP}{python}
from torch import nn

mlp = nn.Sequential(
    nn.Linear(|$d_0$|, |$d_1$|), nn.ReLU(),
    nn.Linear(|$d_1$|, |$d_2$|),
    …
    nn.Linear(|$d_{L - 1}$|, |$d_L$|)
)
\end{myminted}
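To answer the parameter-count question above (a quick count, not spelled out on the slide): each layer $\ell \geq 1$ contributes a weight matrix $\W_\ell \in \R^{d_\ell \times d_{\ell - 1}}$ and a bias vector $\b_\ell \in \R^{d_\ell}$, hence
$$\#\textnormal{params} = \sum_{\ell = 1}^{L} (d_{\ell - 1} + 1)\, d_\ell.$$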
$y = \sigma(\W \x + b)$ where $\sigma = \textnormal{sigmoid} = 1 / (1 + \exp(-x))$
#
\centering
{width=70%}
\pause
\raggedright A ReLU-based MLP with inputs in $\R^n$ and $L$ layers of width $k \geq n$ can compute functions with $\Omega((k / n)^{n (L - 1)} k^n)$ linear regions.
\fullcite{montufar2014number}
The number of activation patterns ($\sim$ linear regions) of a ReLU-based MLP with inputs in $\R^n$ and $L$ layers of width $k$ is upper bounded by $O(k^{n L})$, and this bound is tight as $L \to \infty$.
\fullcite{raghu2017expressive}
Let $\alert\sigma \in \C(\R)$ be a continuous function from $\R$ to $\R$.
Then: $\alert\sigma$ is not polynomial $\iff$
for all $\varepsilon > 0$, $n, m \in \N$, compact $K \subseteq \R^n$, function $f \in \C(K, \R^m)$,
there exist a latent dimension $k$ and weights $\W, \b, \CC$ such that
$$\sup_{\x \in K} \left\| f(\x) - \CC \, \alert\sigma(\W \x + \b) \right\| \leq \varepsilon.$$
\pause
In other words, 2-layer MLPs are \alert{dense} in $\C(K, \R^m)$.
They are \alert{universal approximators} of continuous functions.
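As a concrete (non-authoritative) illustration, the sketch below fits a single-hidden-layer ReLU MLP to a continuous function on a compact interval; the target function, the width $k$ and the optimizer settings are arbitrary choices for the demo, not from the slides.

\begin{myminted}{Universal approximation}{python}
import torch
from torch import nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 512).unsqueeze(1)  # compact set K = [-3, 3]
f = torch.sin(2 * x)                         # a continuous target function

k = 256                                      # latent dimension
mlp = nn.Sequential(nn.Linear(1, k), nn.ReLU(), nn.Linear(k, 1))
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-2)
for _ in range(2000):
    optimizer.zero_grad()
    ((mlp(x) - f) ** 2).mean().backward()
    optimizer.step()
max_err = (mlp(x) - f).abs().max().item()    # sup-norm error on K, shrinks as k grows
\end{myminted}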
For any function $f \in L^p(\R^n, \R^m)$ and any $\varepsilon > 0$,
there exists an MLP with ReLU activations of width $\max(n + 1, m)$ such that
$$\| f - \MLP \|_{L^p} \leq \varepsilon.$$
\pause
Moreover:
Let $\NN$ be the space of MLPs from $\R^n$ to $\R^m$ with any number of layers, each having $n + m + 2$ neurons.
Then: $\NN$ is \alert{dense} in $\C(K, \R^m)$ for any compact $K \subseteq \R^n$.
The Fisher information matrix, i.e. $\frac{\partial^2 \L}{\partial \theta^2}$, has eigenvalues with mean $O(1/M)$, variance $O(1)$ and maximum $O(M)$.
:::::: {.columns}
::: {.column}
Polynomial degree 1 $Y = wX + b$
Linear regression
scipy.stats.linregress
:::
::: {.column}
Polynomial degree 6 $Y = \sum_{k = 0}^6 w_k X^k$
Lagrange interpolation
scipy.interpolate.lagrange
:::
::::::
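A runnable sketch of the two fits (the synthetic data and noise level are assumptions chosen for the example, not from the slides):

\begin{myminted}{Fitting}{python}
import numpy as np
from scipy.stats import linregress
from scipy.interpolate import lagrange

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 7)                       # 7 noisy samples of a linear trend
y = 2 * x + 1 + rng.normal(scale=0.1, size=7)

slope, intercept, *_ = linregress(x, y)        # degree 1: smooth, may underfit
poly6 = lagrange(x, y)                         # degree 6: interpolates every sample,
grid = np.linspace(0, 1, 100)                  # but oscillates between them (overfits)
preds = poly6(grid)
\end{myminted}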
Samples $x_i, y_i \in \R$. We train a model $\hf$.
Let us assume that $y_i = f(x_i) + \varepsilon_i$ where $\varepsilon_i \sim \NN(0, \sigma^2)$.
Then: the generalization error for the squared loss satisfies
\(\begin{aligned}
\E[(Y - \hf(X))^2] & = \alert{\textnormal{Bias}(\hf)^2} + \textcolor{blue}{\textnormal{Var}(\hf)} + \sigma^2\\
& = \alert{\E[\hf - f]^2} + \textcolor{blue}{\E[(\hf - \E \hf)^2]} + \sigma^2.
\end{aligned}\)
\only<1>{\centering\includegraphics[width=0.6\linewidth]{figures/bias-var.png}}
\only<2>{Proof.
\(\E f = f \quad \E Y = f \quad \textnormal{Var}(Y) = \sigma^2\)
As $\varepsilon$ and $\hf$ are independent: \small
\({\displaystyle \begin{aligned}
\E\left[(Y - \hf)^2\right] &= \E\left[Y^2 + \hf^2 - 2 Y \hf\right]\\
&= \E\left[Y^2\right] + \E\left[\hf^2\right] - \E[2 Y \hf]\\
&= \textnormal{Var}(Y) + \E[Y]^2 + \textnormal{Var}(\hf) + \E[\hf]^2 - 2 f \E[\hf]\\
&= \textnormal{Var}(Y) + \textnormal{Var}(\hf) + (f - \E[\hf])^2\\
&= \textnormal{Var}(Y) + \textnormal{Var}(\hf) + \E[f - \hf]^2\\
&= \sigma^2 + \textnormal{Var}(\hf) + \textnormal{Bias}(\hf)^2.
\end{aligned}}\)}
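A small Monte Carlo sanity check of this decomposition (a sketch; the target function, noise level and degree-6 estimator below are illustrative assumptions):

\begin{myminted}{Bias-variance}{python}
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                    # true function
x0, noise = 0.5, 0.3          # evaluation point and noise std

preds = []
for _ in range(2000):         # resample the training set many times
    x = np.linspace(0, 1, 10)
    y = f(x) + rng.normal(scale=noise, size=x.size)
    coefs = np.polyfit(x, y, deg=6)
    preds.append(np.polyval(coefs, x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
# bias2 + var + noise**2 approximates the expected squared error at x0
\end{myminted}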
:::::: {.columns}
::: {.column}
:::
::: {.column}
\pause
By keeping a validation set
:::
::::::
\centering
{width=40%}
\begin{myminted}{CV}{python}
trainval, test = split data into 80:20
train, valid = split trainval into 80:20
for each hyperparameter |$\lambda$|:
    minimize error on train using |$\lambda$|
    |$\texttt{valid\_score}_\lambda \gets$| evaluate metric on valid
|$\lambda^* \gets \lambda$| achieving best |$\texttt{valid\_score}_\lambda$|
minimize error on trainval using |$\lambda^*$| (= refit)
\end{myminted}
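The pseudocode above, made concrete with scikit-learn (Ridge regression and its `alpha` as the hyperparameter $\lambda$ are stand-ins chosen for this sketch):

\begin{myminted}{CV}{python}
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=0.5, size=500)

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=0)

valid_score = {}
for lam in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=lam).fit(X_train, y_train)      # minimize error on train
    valid_score[lam] = mean_squared_error(y_valid, model.predict(X_valid))
best = min(valid_score, key=valid_score.get)            # best valid_score
final = Ridge(alpha=best).fit(X_trainval, y_trainval)   # refit on trainval
\end{myminted}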
\centering
{width=40%}
\raggedright
\begin{myminted}{CV}{python}
def training_loop(train, valid):
    for each epoch:
        for |$\x$|, |$y$| in train:
            do one step of gradient
        valid_score |$\gets$| evaluate metric on valid
        if valid_score is worse than before:
            return
\end{myminted}
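A minimal runnable version of this early-stopping loop in PyTorch (the synthetic data, model size and learning rate are assumptions for the sketch):

\begin{myminted}{Early stopping}{python}
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(200, 10)
y = (X[:, 0] > 0).float().unsqueeze(1)                 # toy binary targets
X_train, y_train, X_valid, y_valid = X[:160], y[:160], X[160:], y[160:]

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

best_valid = float("inf")
for epoch in range(100):
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()        # one gradient step per epoch
    optimizer.step()
    with torch.no_grad():
        valid_loss = loss_fn(model(X_valid), y_valid).item()
    if valid_loss > best_valid:                        # metric got worse: stop
        break
    best_valid = valid_loss
\end{myminted}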
Let us consider logistic regression i.e. 1-layer MLP:
$f(\x_i) = \sigma(\W \x_i + b)$
Logistic loss: $\L = \sum_i -(1 - y_i) \log (1 - f(\x_i)) - y_i \log f(\x_i)$
If all samples have the same target $y_i = 1$ (or if there’s only 1 sample), what will happen?
\pause
MLP believes everything is a cat.
Parameters $\|\W\|$ and $b$ diverge to $+\infty$
Add penalty to loss: $\|\W\|_2^2 + \|b\|_2^2$ (= assuming a Gaussian prior centered at $\bm{0}$), called $L_2$ regularization
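A small PyTorch sketch of this failure mode and its fix (the single-sample setup mirrors the question above; `weight_decay` implements the $L_2$ penalty; values are illustrative):

\begin{myminted}{L2 regularization}{python}
import torch
from torch import nn

torch.manual_seed(0)
x = torch.ones(1, 2)                  # a single sample
y = torch.ones(1, 1)                  # with target y = 1 ("everything is a cat")

for weight_decay in [0.0, 1e-2]:
    model = nn.Linear(2, 1)
    opt = torch.optim.SGD(model.parameters(), lr=1.0, weight_decay=weight_decay)
    loss_fn = nn.BCEWithLogitsLoss()  # logistic loss on the logit W x + b
    for _ in range(5000):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    # without the penalty the weight norm keeps growing with more steps;
    # with weight_decay it converges to a finite value
    print(weight_decay, model.weight.norm().item())
\end{myminted}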
\begin{columns}
\begin{column}{0.6\linewidth}
Minimizing loss:
May fall in local minima or diverge to $\infty$\\[2cm]
Minimizing loss + regularization:
Easier to optimize
\end{column}
\begin{column}{0.4\linewidth}
\hfill \includegraphics[width=\linewidth]{figures/nonreg.pdf}
\hfill \includegraphics[width=\linewidth]{figures/reg.pdf}
\end{column}
\end{columns}
We will see an example next week.
There is incompressible error due to inherent noise
\centering
{width=20%}
\raggedright $1797 \times 8 \times 8$ images representing numbers between 0 and 9.
1599 wines $\times$ 11 features; the goal is to predict the quality, an integer between 0 and 10 (in practice between 3 and 8).
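Loading sketches for these two datasets (assumptions: the digits set matches scikit-learn's `load_digits`; the wine file is the usual UCI red-wine-quality CSV, with a hypothetical local path):

\begin{myminted}{Datasets}{python}
from sklearn.datasets import load_digits
import pandas as pd

digits = load_digits()
print(digits.images.shape)      # (1797, 8, 8); labels 0..9 in digits.target

# hypothetical local copy of the UCI red wine quality CSV
wines = pd.read_csv("winequality-red.csv", sep=";")
X, quality = wines.drop(columns="quality"), wines["quality"]
\end{myminted}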