---
title: "Embeddings: learning representations for unsupervised learning"
author:
- Marc Lelarge
- Kevin Scaman
- Jill-Jênn Vie
date: Oct 28, 2022
aspectratio: 169
header-includes: |
  \usepackage{tikz}
  \usepackage{bm}
  \def\x{\bm{x}}
  \def\y{\bm{y}}
  \def\u{\bm{u}}
  \def\v{\bm{v}}
  \def\X{\bm{X}}
  \def\e{\bm{e}}
  \def\W{\bm{W}}
  \def\L{\mathcal{L}}
  \def\R{\mathbf{R}}
  \usepackage{datetime2}
---
:::::: {.columns}
::: {.column}
We observe $\x, y$

i.e. classification, regression
:::
::: {.column}
\pause

We just observe $\x$

i.e. raw text, music, ratings
:::
::::::
\pause
Both settings boil down to learning a representation of the data: embedding entities into $\R^d$
\centering {width=70%}
\raggedright \small \fullcite{mikolov2013}
\centering {width=40%}
\raggedright \fullcite{lample2018word}
\begin{tabular}{ccccc}
& \includegraphics[height=2.5cm]{figures/1.jpg} & \includegraphics[height=2.5cm]{figures/2.jpg} & \includegraphics[height=2.5cm]{figures/3.jpg} & \includegraphics[height=2.5cm]{figures/4.jpg} \\
Sacha & ? & 5 & 2 & ? \\
Ondine & 4 & 1 & ? & 5 \\
Pierre & 3 & 3 & 1 & 4 \\
Joëlle & 5 & ? & 2 & ?
\end{tabular}
\begin{tabular}{ccccc}
& \includegraphics[height=2.5cm]{figures/1.jpg} & \includegraphics[height=2.5cm]{figures/2.jpg} & \includegraphics[height=2.5cm]{figures/3.jpg} & \includegraphics[height=2.5cm]{figures/4.jpg} \\
Sacha & \alert{3} & 5 & 2 & \alert{2} \\
Ondine & 4 & 1 & \alert{4} & 5 \\
Pierre & 3 & 3 & 1 & 4 \\
Joëlle & 5 & \alert{2} & 2 & \alert{5}
\end{tabular}
Given a sparse set of observed entries $M_{ij}$, recover the missing entries of $M$
Embedding $n$ entities into $\R^d$ amounts to learning an $n \times d$ matrix $\W$ whose $i$-th row $\W_i$ is the embedding of entity $i$.
\centering
Input: $n \times m$ matrix $\X$
Minimize: $\|\X - M\|_F^2 = \sum_{i, j} (X_{ij} - M_{ij})^2$ where $\operatorname{rank}(M) = k$
\pause
Solution: compute the singular value decomposition (SVD) $\X = U \Sigma V^T$, where $U$ and $V$ are orthogonal and $\Sigma$ is diagonal with nonnegative entries.

The truncated $k$-SVD keeps the $k$ largest singular values: $M = U_k \Sigma_k V_k^T$ is a solution, and $\W = U_k \Sigma_k$ is an $n \times k$ matrix of embeddings (what `pca.transform` returns).
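A minimal NumPy sketch of this recipe on a toy matrix (missing entries imputed to zero, as in plain SVD); scikit-learn's `TruncatedSVD(n_components=k).fit_transform(X)` would return the same $\W = U_k \Sigma_k$ up to sign.

```python
import numpy as np

# Toy ratings matrix; 0 marks a missing entry, imputed to zero
X = np.array([[0., 5., 2., 0.],
              [4., 1., 0., 5.],
              [3., 3., 1., 4.],
              [5., 0., 2., 0.]])

k = 2  # embedding dimension

# Full SVD, then keep the k largest singular values
U, S, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

M = U_k @ np.diag(S_k) @ Vt_k   # best rank-k approximation of X
W = U_k @ np.diag(S_k)          # n x k matrix of embeddings

print(W.shape)        # (4, 2)
print(np.round(M, 2)) # reconstructed ("completed") matrix
```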
\pause
If $R’$ the $N \times M$ matrix of rows $\frac{R_u}{ | R_u | }$, we can get the $N \times N$ score matrix by computing $R’ R’^T$. May not fit in memory. |
\vspace{-7mm}
\(R = \left(\begin{array}{c} R_1\\ R_2\\ \vdots\\ R_n \end{array}\right) = \raisebox{-1cm}{\begin{tikzpicture} \draw (0,0) rectangle (2.5,2); \end{tikzpicture}} = \raisebox{-1cm}{\begin{tikzpicture} \draw (0,0) rectangle ++(1,2); \draw node at (0.5,1) {$U$}; \draw (1.1,1) rectangle ++(2.5,1); \draw node at (2.35,1.5) {$P$}; \end{tikzpicture}}\)

\(\text{$R$: 600 users $\times$ 9k works} \iff \left\{\begin{array}{l} \text{$U$: 600 users $\times$ \alert{20 profiles}}\\ \text{$P$: \alert{20 profiles} $\times$ 9k works}\\ \end{array}\right.\)

$R_\text{Bob}$ is a linear combination of the profiles $P_1$, $P_2$, etc.
\pause
\begin{tabular}{@{}lccc@{}}
If $P$ & $P_1$: adventure & $P_2$: romance & $P_3$: plot twist \\
And $U_{\textnormal{Bob}}$ & $0.2$ & $-0.5$ & $0.6$
\end{tabular}
$\Rightarrow$ Bob \alert{likes a bit} adventure, \alert{hates} romance, \alert{loves} plot twists.
\pause
SVD: $\sum_{\textnormal{\alert{all} } i, j}~(\u_i^T \v_j - r_{ij})^2$ (missing pairs are imputed to zero)
ALS-WR: $\sum_{i, j \textnormal{\alert{ known}}}~(\u_i^T \v_j - r_{ij})^2 + \lambda (\sum_i N_i ||\u_i||^2 + \sum_j M_j ||\v_j||^2)$

where $N_i$ is the number of items rated by user $i$ and $M_j$ the number of users who rated item $j$
\pause
$\L$ is (polynomial and) convex in each $\u_k$ or $\v_\ell$.
Finding \alert{$\u_k$} that minimizes $\L$ $\Rightarrow$ Finding the zeroes of \(\frac{\partial \L}{\partial \u_k} = \sum_{j \textnormal{ rated by } k} 2 (\alert{\u_k}^T \v_j - r_{kj}) \v_j + 2 \lambda \alert{\u_k} = 0\) can be rewritten $A\alert{\u_k} = B$ so $\alert{\u_k} = A^{-1}B$ (closed form)
Complexity: $O(d^3)$ per update, where $d$ is the embedding dimension, i.e. the size of the $d \times d$ matrix $A$ (but updates can be parallelized across users and items)
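Below is a minimal NumPy sketch of this closed-form update (hypothetical names: `V` stores the item embeddings $\v_j$ as rows, `rated` the indices of the items rated by user $k$); it follows the gradient above with a plain $\lambda$, which ALS-WR would scale by $N_k$.

```python
import numpy as np

def als_update_user(V, ratings_k, rated, lam):
    """Closed-form update of u_k: solve (V_k^T V_k + lambda I) u_k = V_k^T r_k."""
    V_k = V[rated]                      # items rated by user k, shape (n_k, d)
    r_k = ratings_k[rated]              # their ratings, shape (n_k,)
    d = V.shape[1]
    A = V_k.T @ V_k + lam * np.eye(d)   # d x d matrix, so solving costs O(d^3)
    B = V_k.T @ r_k
    return np.linalg.solve(A, B)        # u_k = A^{-1} B

# Example: 5 items with d = 3; the user rated items 0, 2 and 4
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 3))
ratings = np.array([4., 0., 1., 0., 5.])
u_k = als_update_user(V, ratings, rated=[0, 2, 4], lam=0.1)
print(u_k)
```

Alternating these updates over all users, then all items (with the symmetric formula), decreases $\L$ at every step.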
\pause
\only<1>{\includegraphics[width=0.9\linewidth]{figures/embed0.pdf}} \only<2>{\includegraphics[width=0.9\linewidth]{figures/embed1.pdf}} \only<3>{\includegraphics[width=0.9\linewidth]{figures/embed2.pdf}} \only<4>{\includegraphics[width=0.9\linewidth]{figures/embed3.pdf}} \only<5>{\includegraphics[width=0.9\linewidth]{figures/embed4.pdf}} \only<6>{\includegraphics[width=0.9\linewidth]{figures/embed5.pdf}} \only<7>{\includegraphics[width=0.9\linewidth]{figures/embed6.pdf}} \only<8>{\includegraphics[width=0.9\linewidth]{figures/embed7.pdf}} \only<9>{\includegraphics[width=0.9\linewidth]{figures/embed8.pdf}} \only<10>{\includegraphics[width=0.9\linewidth]{figures/embed9.pdf}} \only<11>{\includegraphics[width=0.9\linewidth]{figures/embed10.pdf}} \only<12>{\includegraphics[width=0.9\linewidth]{figures/embed11.pdf}} \only<13>{\includegraphics[width=0.9\linewidth]{figures/embed12.pdf}} \only<14>{\includegraphics[width=0.9\linewidth]{figures/embed13.pdf}} \only<15>{\includegraphics[width=0.9\linewidth]{figures/embed14.pdf}} \only<16>{\includegraphics[width=0.9\linewidth]{figures/embed15.pdf}} \only<17>{\includegraphics[width=0.9\linewidth]{figures/embed16.pdf}} \only<18>{\includegraphics[width=0.9\linewidth]{figures/embed17.pdf}} \only<19>{\includegraphics[width=0.9\linewidth]{figures/embed18.pdf}} \only<20>{\includegraphics[width=0.9\linewidth]{figures/embed19.pdf}} \only<21>{\includegraphics[width=0.9\linewidth]{figures/embed20.pdf}} \only<22>{\includegraphics[width=0.9\linewidth]{figures/embed21.pdf}} \only<23>{\includegraphics[width=0.9\linewidth]{figures/embed22.pdf}} \only<24>{\includegraphics[width=0.9\linewidth]{figures/embed23.pdf}} \only<25>{\includegraphics[width=0.9\linewidth]{figures/embed24.pdf}} \only<26>{\includegraphics[width=0.9\linewidth]{figures/embed25.pdf}} \only<27>{\includegraphics[width=0.9\linewidth]{figures/embed26.pdf}} \only<28>{\includegraphics[width=0.9\linewidth]{figures/embed27.pdf}} \only<29>{\includegraphics[width=0.9\linewidth]{figures/embed28.pdf}} \only<30>{\includegraphics[width=0.9\linewidth]{figures/embed29.pdf}} \only<31>{\includegraphics[width=0.9\linewidth]{figures/embed30.pdf}} \only<32>{\includegraphics[width=0.9\linewidth]{figures/embed31.pdf}} \only<33>{\includegraphics[width=0.9\linewidth]{figures/embed32.pdf}} \only<34>{\includegraphics[width=0.9\linewidth]{figures/embed33.pdf}} \only<35>{\includegraphics[width=0.9\linewidth]{figures/embed34.pdf}} \only<36>{\includegraphics[width=0.9\linewidth]{figures/embed35.pdf}} \only<37>{\includegraphics[width=0.9\linewidth]{figures/embed36.pdf}} \only<38>{\includegraphics[width=0.9\linewidth]{figures/embed37.pdf}} \only<39>{\includegraphics[width=0.9\linewidth]{figures/embed38.pdf}}
\centering {width=80%}
\centering {width=80%}
We want, given a word $w \sim P(W)$, to predict its \alert{context} (i.e. the words around it).
Oh, it’s a classification problem! With 60,000 classes. \(P(c | w) = \textnormal{softmax}(\u_w^T \v_c) \propto \exp(\u_w^T \v_c)\)
The cross-entropy would look like: \(\L = \log \sigma(\u_w^T \v_c) + \sum_{\substack{i = 1\\c_i \neq c}}^{60000} \log (1 - \sigma(\u_w^T \v_{c_i}))\) with one term per word of the vocabulary at every step: too expensive.
So instead let's just sample $k$ words from the vocabulary, proportionally to their occurrence: \(\L = \sum_{w, c} \log \sigma(\u_w^T \v_c) + \sum_{\substack{i = 1\\c_i \sim P(W)}}^k \log (1 - \sigma(\u_w^T \v_{c_i}))\)
\vspace{-5mm} \(\L = \sum_{w, c} \log \sigma(\u_w^T \v_c) + \sum_{\substack{i = 1\\c_i \sim P(W)}}^k \log (1 - \sigma(\u_w^T \v_{c_i}))\)
\vspace{-5mm} \(\L = \sum_{w, c} \log \sigma(\u_w^T \v_c) + \sum_{\substack{i = 1\\c_i \sim P(W)}}^k \log \sigma(-\u_w^T \v_{c_i})\)

This objective is equivalent to factorizing the pointwise mutual information matrix: \(\log \frac{\#(w, c) N}{\#w \#c} - \log k = \u_w^T \v_c \textnormal{ where $N$ is the number of pairs.}\)
\footnotesize \fullcite{levy2014neural}
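As a rough sketch (not the implementation of the repository cited in the practical below), this negative-sampling loss can be written in PyTorch as follows; `U` and `V` are the centre- and context-embedding tables, `w`, `pos`, `neg` are index tensors, and all names are hypothetical:

```python
import torch
import torch.nn.functional as F

def sgns_loss(U, V, w, pos, neg):
    """Skip-gram negative-sampling loss (to minimize, hence the minus sign).

    U: (vocab, d) centre-word embeddings u_w
    V: (vocab, d) context-word embeddings v_c
    w:   (batch,)   centre word indices
    pos: (batch,)   observed context word indices
    neg: (batch, k) negative samples drawn from P(W)
    """
    u_w = U[w]                          # (batch, d)
    pos_score = (u_w * V[pos]).sum(-1)  # u_w^T v_c
    neg_score = torch.bmm(V[neg], u_w.unsqueeze(-1)).squeeze(-1)  # (batch, k)
    # log sigma(u^T v_c) + sum_i log sigma(-u^T v_{c_i})
    loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(-1))
    return loss.mean()

# Toy usage
vocab, d, k = 100, 16, 5
U = torch.randn(vocab, d, requires_grad=True)
V = torch.randn(vocab, d, requires_grad=True)
w, pos = torch.randint(vocab, (8,)), torch.randint(vocab, (8,))
neg = torch.randint(vocab, (8, k))
sgns_loss(U, V, w, pos, neg).backward()
```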
You are given some entries of $M = U V$, where $U$ has shape $(10, 3)$ and $V$ has shape $(3, 5)$
The goal is to recover the missing entries
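One possible (assumed) way to generate such an instance with NumPy, hiding about half of the entries:

```python
import numpy as np

rng = np.random.default_rng(42)
U = rng.normal(size=(10, 3))
V = rng.normal(size=(3, 5))
M = U @ V                          # rank-3 matrix of shape (10, 5)

# Reveal each entry with probability 0.5; NaN marks an unknown entry
mask = rng.random(M.shape) < 0.5
observed = np.where(mask, M, np.nan)
print(observed)
```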
100,000 ratings (1–5 stars) of 600 users on 9,000 movies.
\url{https://files.grouplens.org/datasets/movielens/ml-latest-small.zip}
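A sketch for loading the ratings into a sparse user $\times$ item matrix, assuming the archive is extracted locally and contains `ratings.csv` with columns `userId`, `movieId`, `rating`:

```python
import pandas as pd
from scipy.sparse import csr_matrix

ratings = pd.read_csv("ml-latest-small/ratings.csv")

# Re-index users and movies to contiguous integers
user_idx = ratings["userId"].astype("category").cat.codes
item_idx = ratings["movieId"].astype("category").cat.codes

R = csr_matrix((ratings["rating"], (user_idx, item_idx)))
print(R.shape)   # roughly 600 users x 9,000 movies
```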
By Arthur Conan Doyle
\url{https://github.com/theeluwin/pytorch-sgns}