% Fairness and privacy in AI for education:\newline risks and opportunities
% Jill-Jênn Vie
% June 29, 2022

---
institute: \includegraphics[height=1cm]{figures/inria.png} \includegraphics[height=1cm]{figures/soda.png}
colorlinks: true
lang: fr
aspectratio: 169
biblio-style: authoryear
biblatexoptions: natbib
header-includes:
  - \usepackage{bm}
  - \usepackage{tikz}
  - \usepackage{booktabs}
  - \usepackage{colortbl}
  - \DeclareMathOperator\logit{logit}
  - \def\Dt{D_\theta}
  - \def\E{\mathbb{E}}
  - \def\logDt{\log \Dt(x)}
  - \def\logNotDt{\log(1 - \Dt(x))}
  - \newcommand\mycite[3]{\textcolor{blue}{#1} “#2”.~#3.}
  - \usepackage{etoolbox}
  - \AtEndPreamble{\DefineBibliographyExtras{french}{\restorecommand\mkbibnamefamily}}
---
Danish (Rasch, 1961), American (Lord, 1986), or French (Binet, 1905)
Trade-off between measuring well and asking few questions
\centering \includegraphics[width=0.6\linewidth]{figures/cfirt.pdf}
\raggedright \small\fullcite{Bergner2021}
Math exercises (e.g., ASSISTments)
\centering \begin{tabular}{cccc} \toprule
Items & 5 – 5 = ? & 17 – 3 = ? & 13 – 7 = ? \\ \midrule
New student & \alert{$\circ$} & \alert{$\circ$} & \alert{$\mathbf{\times}$} \\ \bottomrule
\end{tabular}
\raggedright Language learning (Duolingo dataset)
\includegraphics{figures/duolingo0.png}
\includegraphics[width=\linewidth]{figures/dkt.png}
Learn question parameters from historical data \hfill \emph{e.g., difficulty}
Measure the parameters of new learners \hfill \emph{e.g., expertise}
\centering \includegraphics[width=\linewidth]{figures/ktm.pdf}
\includegraphics{figures/ktm-cite.pdf}
Learn a bias \alert{$w_k$} and an embedding \alert{$\bm{v_k}$} for each feature $k$ such that: \(\logit p(\bm{x}) = \mu + \underbrace{\sum_{k = 1}^N \alert{w_k} x_k}_{\textnormal{logistic regression}} + \underbrace{\sum_{1 \leq k < l \leq N} x_k x_l \langle \alert{\bm{v_k}}, \alert{\bm{v_l}} \rangle}_{\textnormal{pairwise interactions}}\)
\small \fullcite{rendle2012factorization}
\fullcite{Minn2018}
\fullcite{KTM2019}
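For concreteness, a minimal NumPy sketch of the factorization-machine prediction rule above, using the $O(Nd)$ rewriting of the pairwise term from Rendle (2012); the variable names and toy data are mine, not from a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fm_logit(x, mu, w, V):
    """Factorization machine logit for one feature vector x.

    x:  (N,) binary/real features, mu: global bias,
    w:  (N,) per-feature biases, V: (N, d) per-feature embeddings.
    """
    linear = w @ x  # logistic regression part
    # Pairwise part, via the O(Nd) identity of Rendle (2012):
    # sum_{k<l} x_k x_l <v_k, v_l> = (||V^T x||^2 - sum_k x_k^2 ||v_k||^2) / 2
    xv = V.T @ x
    pairwise = 0.5 * (xv @ xv - (x ** 2) @ (V ** 2).sum(axis=1))
    return mu + linear + pairwise

# Toy example: an event encoding one user (2), one item (5), one skill (7)
rng = np.random.default_rng(0)
N, d = 10, 3
mu, w, V = 0.0, rng.normal(size=N), rng.normal(size=(N, d))
x = np.zeros(N)
x[[2, 5, 7]] = 1
print(sigmoid(fm_logit(x, mu, w, V)))  # probability of a correct answer
```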
\includegraphics[width=0.5\linewidth]{figures/anki.png}\includegraphics[width=0.5\linewidth]{figures/leitner.png} \centering
A simple principle:
:::::: {.columns}
::: {.column}
Learning by logistic regression:
\scriptsize \mycite{Benoît Choffin, Fabrice Popineau, Yolaine Bourda, and Jill-Jênn Vie (2019)}{DAS3H: Modeling Student Learning and Forgetting for Optimally Scheduling Distributed Practice of Skills}{Best Paper Award at EDM 2019}
Recurrent neural networks (DKT), with attention (SAKT, AKT)
A well-trained IRT model can outperform deep knowledge tracing (Wilson, EDM 2016).
On large datasets, when the sequential aspect matters, deep models can achieve better performance.
\fullcite{gervet2020deep}
It is important to know the foundations, thanks Michel:
\fullcite{desmarais2012review}
@narayanan2008robust managed to de-anonymize a pseudonymized Netflix dataset of watched movies by cross-referencing it with IMDb
for all datasets $D_1$ and $D_2$ that differ on a single element
for all possible subsets $S$ (of $\textnormal{Im } A$)
However, we need a dynamic model
Knowledge embeddings are safe to share
User embeddings, however, should be drawn from a distribution (see the sketch below)
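One way to read this, as a hedged sketch (the actual generator may differ): release item parameters as-is, but resample user embeddings from a distribution fitted on the real ones, so that no individual user's embedding is published.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for learned user embeddings (n_users, d); in practice these
# come out of the trained model.
user_emb = rng.normal(size=(500, 5))

# Item ("knowledge") embeddings can be released as-is. User embeddings
# are replaced by fresh draws from a Gaussian fitted on the real ones.
mean = user_emb.mean(axis=0)
cov = np.cov(user_emb, rowvar=False)
fake_user_emb = rng.multivariate_normal(mean, cov, size=len(user_emb))
```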
\begin{table}[h]
%\caption{Example of minimal tabular dataset.}
\label{example-dataset}
\centering
\resizebox{\textwidth}{!}{
\begin{tabular}{ccc} \toprule
user ID & action ID & outcome \\ \midrule
2487 & 384 & 1 \\
2487 & 242 & 0 \\
2487 & 39 & 1 \\
2487 & 65 & 1 \\ \bottomrule
\end{tabular}
\arrayrulecolor{white}
\begin{tabular}{l} \toprule
description \\ \midrule
user 2487 got token ``I'' correct \\
user 2487 got token ``ate'' incorrect \\
user 2487 got token ``an'' correct \\
user 2487 got token ``apple'' correct \\ \bottomrule
\end{tabular}
}
\arrayrulecolor{black}
\end{table}
So in our case there are two models:
Ex. $r_{ij}$ is 1 if user $i$ got a positive outcome on action (item) $j$
\[p_{ij} = \Pr(R_{ij} = 1) = \sigma(\theta_i + e_j)\]\noindent where $\theta_i$ is ability of user $i$ and $e_j$ is easiness of action $j$
\vspace{1cm}
Trained using Newton’s method: minimize the log-loss $\mathcal{L} = -\sum_{i, j} \left[ (1 - r_{ij}) \log (1 - p_{ij}) + r_{ij} \log p_{ij} \right]$
Let us encode the event (user $i$, item $j$) as a two-hot vector $\bm{x}$:
\centering
$p_{ij} = \sigma(\langle \alert{\bm{w}}, \bm{x} \rangle) = \sigma(\sum_k \alert{w_k} x_k) = \sigma(\alert{\theta_i} + \alert{e_j})$
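This equivalence can be checked directly; a sketch with scikit-learn (the toy triplets and the regularization strength `C` are arbitrary choices of mine):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Triplets (user i, item j, outcome r_ij)
users = np.array([0, 0, 0, 0, 1, 1])
items = np.array([0, 1, 2, 3, 0, 2])
r = np.array([1, 0, 1, 1, 1, 0])
n_users, n_items = users.max() + 1, items.max() + 1

# Two-hot encoding: a 1 at position i and at position n_users + j
rows = np.repeat(np.arange(len(r)), 2)
cols = np.ravel(np.column_stack([users, n_users + items]))
X = csr_matrix((np.ones(2 * len(r)), (rows, cols)),
               shape=(len(r), n_users + n_items))

# The logistic regression weights are the IRT parameters:
# theta_i = coef[:n_users], e_j = coef[n_users:]
clf = LogisticRegression(C=1e3).fit(X, r)
theta, e = clf.coef_[0][:n_users], clf.coef_[0][n_users:]
```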
\centering Practitioners who conduct a study on the real and on the fake dataset should reach similar findings
$\downarrow$
A model trained on the fake dataset should recover parameters that are close, in RMSE, to those of a model trained on the original dataset
\raggedright
We also consider weighted RMSE:
\[wRMSE = \sqrt{\sum_{i = 1}^N w_i (d_i - \widehat{d_i})^2}\]where $w_i \in [0, 1]$ is the frequency of action $i$ in the training set.
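As a sketch (assuming the frequencies $w_i$ are occurrence counts normalized to sum to 1):

```python
import numpy as np

def wrmse(d, d_hat, counts):
    """Weighted RMSE between real and fake item parameters.

    counts[i]: occurrences of action i in the training set, normalized
    here to frequencies w_i in [0, 1].
    """
    w = counts / counts.sum()
    return np.sqrt(np.sum(w * (d - d_hat) ** 2))
```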
\centering It should not be easy to re-identify people / the fake dataset should not leak too much information about participants
$\downarrow$
An attacker has to guess, from a broader population, who was in the training set
\centering \begin{tikzpicture}[ xscale=3, yscale=2, data/.style={draw}, >=stealth ] \node[data] (original) at (0,0) {Original}; \node[data] (training) at (1,0) {Training set}; \node[data] (fake) at (1,-1) {Fake set}; \node[data,text width=1.6cm,text centered] (real-irt) at (2,0) {Real item params $d$}; \node[data,text width=1.6cm,text centered] (fake-irt) at (2,-1) {Fake item params $\hat{d}$}; \draw[->] (original) edge node[above=3mm] {sampling half users} (training); \draw[->] (training) edge node[right] {generator} (fake); \draw[<->] (real-irt) edge node[right] {RMSE} (fake-irt); \draw[->,dashed,bend right] (original) edge (training); \draw[->,dashed,bend left=60,text width=2cm,text centered] (fake) edge node[below left] {reidentify AUC} (training); \draw[->] (training) edge node[above] {IRT} (real-irt); \draw[->] (fake) edge node[above] {IRT} (fake-irt); \end{tikzpicture}
(framework inspired by NeurIPS “Hide and Seek” challenge in healthcare by \cite{jordon2020hide})
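A sketch of how the reidentification arrow can be scored: the attacker outputs a membership score per candidate user, and we measure the AUC against true membership (the scores below are toy placeholders, not the actual attack):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# in_training[u] = 1 if user u was in the training set, 0 otherwise;
# attack_score[u] = attacker's confidence that u was in the training set
# (toy scores: slightly informative, so the AUC ends up above 0.5)
in_training = rng.integers(0, 2, size=1000)
attack_score = 0.3 * in_training + rng.normal(size=1000)

# AUC = 0.5 means the attacker does no better than chance
print(roc_auc_score(in_training, attack_score))
```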
\centering
Actions
“Different models with the same reported accuracy can have a very different distribution of error across population” (Hardt, 2017)
\pause
Crime prediction (watch Psycho-Pass):
Many results are being renamed and rediscovered :(
\fullcite{hutchinson201950}
\fullcite{zemel2013learning}
\centering
See \emph{Attacking discrimination with smarter machine learning}
\centering
\raggedleft High if $x_n$ is close to $\alert{v_k}$
\raggedright
\[\hat{x_n} = \sum_k M_{n, k} \alert{v_k}\]\only<1>{\(\displaystyle \widehat{y_n} = \sum_k M_{n, k} \alert{w_k}\)} \only<2>{\(\widehat{y_n} = \sum_k \underbrace{M_{n, k}}_{\in \{0, 1\}} \alert{w_k}\)} \only<3>{\(\widehat{y_n} = \sum_k \underbrace{M_{n, k}}_{\in \{0, 1\}} \underbrace{\alert{w_k}}_{\in \{0, 1\}}\)}
$\alert{v_k} \in \mathbb{R}^d$, $\alert{w_k} \in \mathbb{R}$ are \alert{learned}
$L_y = \sum_n - y_n \log \hat{y_n} - (1 - y_n) \log (1 - \hat{y_n})$
$L_x = \sum_n \| x_n - \hat{x}_n \|^2$
$L_z = \sum_k | M_k^+ - M_k^- |$
where $M_k^+ = \underbrace{\mathbb{E}_+ M_{n, k}}_{\textnormal{average across subgroup}}$
\only<1>{\(L = A_z L_z + A_x L_x + A_y L_y\)} \only<2>{\(L = A_z L_z + A_x L_x + A_y \alert{N_D}\)}
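A NumPy sketch of this objective, evaluation only (in Zemel et al. (2013) the prototypes and weights are optimized jointly, e.g. with L-BFGS; the softmax form of $M$ follows the equations above):

```python
import numpy as np

def lfr_loss(X, y, s, V, w, Az=1.0, Ax=0.01, Ay=1.0):
    """Evaluate the LFR objective L = Az*Lz + Ax*Lx + Ay*Ly.

    X: (n, d) inputs, y: (n,) binary labels, s: (n,) protected attribute,
    V: (K, d) prototypes, w: (K,) prototype label weights in [0, 1].
    """
    # Soft assignments M[n, k]: high if x_n is close to prototype v_k
    sq_dist = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)  # (n, K)
    M = np.exp(-sq_dist)
    M /= M.sum(axis=1, keepdims=True)       # softmax over prototypes

    X_hat = M @ V                           # reconstruction
    y_hat = np.clip(M @ w, 1e-6, 1 - 1e-6)  # label prediction

    L_x = ((X - X_hat) ** 2).sum()
    L_y = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).sum()
    # Per-prototype gap between the two subgroups' average assignments
    L_z = np.abs(M[s == 1].mean(axis=0) - M[s == 0].mean(axis=0)).sum()
    return Az * L_z + Ax * L_x + Ay * L_y
```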
LR: Logistic Regression
FNB: Fair Naive Bayes
RLR: Regularized LR
LFR: Learning Fair Representations
Accuracy (high)
Discrimination (low)
\[D = | \mathbb{E}_+ \hat{y}_n - \mathbb{E}_- \hat{y}_n |\]
Consistency (high)
\[y_{nn} = 1 - \frac1{Nk} \sum_n \left| \hat{y}_n - \sum_{j \in kNN(x_n)} \hat{y}_j \right|\]
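Both metrics are easy to compute; a NumPy sketch (consistency below uses the common reading where $\hat{y}_n$ is compared to the \emph{mean} prediction over the $k$ nearest neighbors, which matches the $1/Nk$ normalization):

```python
import numpy as np

def discrimination(y_hat, s):
    """|E_+ y_hat - E_- y_hat|: gap in mean prediction between subgroups."""
    return abs(y_hat[s == 1].mean() - y_hat[s == 0].mean())

def consistency(X, y_hat, k=5):
    """1 - (1/Nk) sum_n |k * y_hat_n - sum of y_hat over the kNN of x_n|."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq_dist, np.inf)         # exclude self-neighbors
    knn = np.argsort(sq_dist, axis=1)[:, :k]  # naive O(N^2) kNN
    return 1 - np.abs(k * y_hat - y_hat[knn].sum(axis=1)).sum() / (len(X) * k)
```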
Constraints on AUC or area between ROC curves (ABROCA)
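A possible sketch of ABROCA with scikit-learn, interpolating both subgroups' ROC curves on a common FPR grid (an assumption about the exact computation; see Gardner et al. for the reference definition):

```python
import numpy as np
from sklearn.metrics import roc_curve

def abroca(y_true, y_score, s, grid=np.linspace(0, 1, 1001)):
    """Area between the ROC curves of the two subgroups."""
    tprs = []
    for group in (0, 1):
        fpr, tpr, _ = roc_curve(y_true[s == group], y_score[s == group])
        tprs.append(np.interp(grid, fpr, tpr))  # common FPR grid
    return np.trapz(np.abs(tprs[0] - tprs[1]), grid)
```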
\raggedright
Evaluating the Fairness of Predictive Student Models Through Slicing Analysis (Gardner, Brooks and Baker, 2019)
See also the work of Bellet, next door
\pause
\[\begin{aligned} \forall S \subseteq \textnormal{Im}\,A,\ \forall D_1, D_2 \textnormal{ ``close''}, &\quad \Pr(A(D_1) \in S) \leq e^\varepsilon \Pr(A(D_2) \in S)\\ \forall S \subseteq \textnormal{Im}\,A,\ \forall D_1, D_2 \textnormal{ ``close''}, &\quad \left|\log \frac{\Pr(A(D_1) \in S)}{\Pr(A(D_2) \in S)}\right| \leq \varepsilon \end{aligned}\]\raggedright
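For intuition (not on the original slides), the Laplace mechanism is the textbook way to satisfy this definition for a numeric query; a minimal sketch:

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Release value + Laplace(sensitivity / epsilon) noise.

    If |query(D1) - query(D2)| <= sensitivity for all neighboring
    datasets, the released value satisfies epsilon-DP.
    """
    if rng is None:
        rng = np.random.default_rng()
    return value + rng.laplace(scale=sensitivity / epsilon)

# Example: a count query has sensitivity 1 (one person changes it by <= 1)
print(laplace_mechanism(42, sensitivity=1, epsilon=0.5))
```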
For more on this beautiful relationship:
Fairness through Awareness (Dwork et al., 2011)
\vspace{1cm}
\pause
Thank you! Questions? \hfill These slides at \href{https://jjv.ie/slides/pfia.pdf}{jjv.ie/slides/pfia.pdf}