% Coding best practices % Jill-Jênn Vie % New in ML workshop — aspectratio: 169 colorlinks: true institute:
\newfontfamily{\Emoji}[Renderer=Harfbuzz]{Noto Color Emoji}
\def\Clap
\centering \texttt{pip install tryalgo}
\centering
{width=75%}
Quote:\medskip
Unlike pro developers, researchers code for themselves only.
But at minimum you code for \alert{your future self}.
\footnotesize \raggedleft (e.g., the version of you who is going to write your PhD thesis)
\normalsize \raggedright Versioning is a way to keep a backup of your work
\footnotesize \raggedleft (in case your house burns, for example)
\centering \pause
\includegraphics[width=0.4\linewidth]{figures/git-push.png}
\footnotesize \raggedleft (sometimes I had to guess the version using binary search) \normalsize
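
The binary search mentioned above can be sketched as follows. This is only an illustration: the version names and the `is_bad` check are hypothetical, and the sketch assumes that once a version is broken, all later ones are broken too. With git, `git bisect` automates the same search over commits.

```python
def first_bad_version(versions, is_bad):
    """Return the index of the first version for which is_bad(version) is True.

    Assumes the last version is bad and that badness is monotone:
    every version after the first bad one is bad as well.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid        # the first bad version is at mid or earlier
        else:
            lo = mid + 1    # the first bad version is after mid
    return lo


# Hypothetical history: eight saved versions, bug introduced in "v6".
versions = ["v{}".format(i) for i in range(1, 9)]
checked = []


def is_bad(version):
    checked.append(version)  # record which versions had to be tested
    return int(version[1:]) >= 6


culprit = versions[first_bad_version(versions, is_bad)]
```

Only three versions out of eight need to be tested here, which is the point of searching by halves instead of one by one.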
`pycodestyle` follows PEP 8 (Style Guide for Python Code)
\footnotesize \raggedleft
(`autopep8` also has an `--aggressive` mode that will directly fix your code; use with caution)
\normalsize \raggedright
`pylint` does static analysis: it warns you about unused variables, etc.
\vspace{1cm}
Existing equivalents in other languages
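
As a small illustration of what such tools report, here is a hypothetical function (not from the talk) that runs correctly but contains a variable that is assigned and never read. Pylint reports this kind of issue under the message name `unused-variable`.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    total = 0
    count = 0  # assigned but never read: pylint reports unused-variable here
    for value in values:
        total += value
    return total / len(values)


result = mean([1, 2, 3])
```

The code still works, which is why a static checker is useful: nothing at runtime would point to the leftover variable.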
:::::: {.columns}
::: {.column width="55%"}
\tiny

    def dijkstra(graph, weight, source=0, target=None):
        """single source shortest paths by Dijkstra
        :param graph: directed graph in listlist or listdict format
        :param weight: in matrix format or same listdict graph
        :assumes: weights are non-negative
        :param source: source vertex
        :type source: int
        :param target: if given, stops once distance to target found
        :type target: int
        :returns: distance table, precedence table
        :complexity: `O(|V| + |E|log|V|)`
        """
        n = len(graph)
        assert all(weight[u][v] >= 0 for u in range(n) for v in graph[u])
        prec = [None] * n
        dist = [float('inf')] * n
        dist[source] = 0
        heap = OurHeap([(dist[node], node) for node in range(n)])
        while heap:
            dist_node, node = heap.pop()  # Closest node from source
            if node == target:
                break
            for neighbor in graph[node]:
                old = dist[neighbor]
                new = dist_node + weight[node][neighbor]
                if new < old:
                    dist[neighbor] = new
                    prec[neighbor] = node
                    heap.update((old, neighbor), (new, neighbor))
        return dist, prec

:::
::: {.column width="45%"}
:::
::::::
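
The function above relies on `OurHeap`, which is not defined on the slide. A minimal self-contained sketch of the same algorithm, assuming the standard-library `heapq` with lazy deletion in place of `OurHeap.update`, could look like this (the name `dijkstra_heapq` and the small example graph are made up for illustration):

```python
import heapq


def dijkstra_heapq(graph, weight, source=0, target=None):
    """Single-source shortest paths, same inputs and outputs as dijkstra.

    Swapped-in technique: instead of updating a heap entry in place,
    push a new entry and skip outdated ones when they are popped.
    """
    n = len(graph)
    dist = [float('inf')] * n
    prec = [None] * n
    done = [False] * n
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        dist_node, node = heapq.heappop(heap)
        if done[node]:
            continue  # outdated entry, a shorter path was already processed
        done[node] = True
        if node == target:
            break
        for neighbor in graph[node]:
            new = dist_node + weight[node][neighbor]
            if new < dist[neighbor]:
                dist[neighbor] = new
                prec[neighbor] = node
                heapq.heappush(heap, (new, neighbor))
    return dist, prec


# Example in listlist / listdict format (hypothetical data).
graph = [[1, 2], [2, 3], [3], []]
weight = [{1: 4, 2: 1}, {2: 2, 3: 3}, {3: 8}, {}]
dist, prec = dijkstra_heapq(graph, weight)
```

The trade-off is that the heap may hold several entries per vertex, one per successful relaxation, in exchange for not needing a heap with an update operation.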
No need to do it for everything, only for some parts of your code.
\small
jj@altaria:~/code/tryalgo$ python -m unittest
......................................................................................................
----------------------------------------------------------------------
Ran 102 tests in 1.185s
OK
:::::: {.columns}
::: {.column width="50%"}
\tiny

    class OurQueue:
        """
        A queue for counting efficiently the number of events
        within time windows.
        Complexity:
            All operators in amortized O(W) time
            where W is the number of windows.
        From JJ's KTM repository: https://github.com/jilljenn/ktm.
        """
        def __init__(self):
            self.queue = []
            self.window_lengths = [
                3600 * 24 * 30, 3600 * 24 * 7, 3600 * 24, 3600]
            self.cursors = [0] * len(self.window_lengths)
        def __len__(self):
            return len(self.queue)
        def get_counters(self, t):
            self.update_cursors(t)
            return [len(self.queue)] + [len(self.queue) - cursor
                                        for cursor in self.cursors]
        def push(self, time):
            self.queue.append(time)
        def update_cursors(self, t):
            for pos, length in enumerate(self.window_lengths):
                while (self.cursors[pos] < len(self.queue) and
                       t - self.queue[self.cursors[pos]] >= length):
                    self.cursors[pos] += 1

:::
::: {.column width="50%"}
\tiny

    import unittest
    from utils.this_queue import OurQueue
    class TestOurQueue(unittest.TestCase):
        def test_simple(self):
            q = OurQueue()
            q.push(0)
            q.push(0.8 * 3600 * 24)
            q.push(5 * 3600 * 24)
            q.push(40 * 3600 * 24)
            self.assertEqual(
                q.get_counters(40 * 3600 * 24),
                [4, 1, 1, 1, 1])
        def test_complex(self):
            q = OurQueue()
            q.push(0)
            q.push(10)
            q.push(3599)
            q.push(3600)
            q.push(3601)
            q.push(3600 * 24)
            q.push(3600 * 24 + 1)
            q.push(3600 * 24 * 7)
            q.push(3600 * 24 * 7 + 1)
            q.push(3600 * 24 * 7 * 30)
            q.push(3600 * 24 * 7 * 30 + 1)
            self.assertEqual(
                q.get_counters(3600 * 24 * 7 * 30 + 1),
                [11, 2, 2, 2, 2])

:::
::::::
Make sure you don’t break the existing version,
so that other people’s software, which relies on yours, won’t break.
E.g., Travis CI, CircleCI, GitHub Actions
Sometimes on servers you can only run jobs as bash scripts.
It’s good to accept hyper-parameters on the command line.
\scriptsize

    import argparse

    if __name__ == '__main__':
        parser = argparse.ArgumentParser(description='Generates tokens.')
        parser.add_argument('filename', type=str, nargs='?', default='text',
                            help='Try files in demo/ e.g. "demo/text.txt"')
        parser.add_argument('--n', type=int, nargs='?', default=1,
                            help='How many sequences should be printed')
        parser.add_argument('--l', type=int, nargs='?', default=42,
                            help='Length of these sequences')
        args = parser.parse_args()


    (venv) jj@altaria:~/code/markov.py$ python markov.py -h
    usage: markov.py [-h] [--n [N]] [--l [L]] [filename]

    Generates tokens.

    positional arguments:
      filename    Try files in demo/ e.g. "demo/text.txt"

    optional arguments:
      -h, --help  show this help message and exit
      --n [N]     How many sequences should be printed
      --l [L]     Length of these sequences

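
To see what the parser above produces without launching a shell, one can pass an explicit argument list to `parse_args`. The sketch below wraps the same three arguments in a helper named `build_parser`, which is an assumption for illustration and not part of the original script.

```python
import argparse


def build_parser():
    """Rebuild the parser shown above (helper name is hypothetical)."""
    parser = argparse.ArgumentParser(description='Generates tokens.')
    parser.add_argument('filename', type=str, nargs='?', default='text',
                        help='Try files in demo/ e.g. "demo/text.txt"')
    parser.add_argument('--n', type=int, nargs='?', default=1,
                        help='How many sequences should be printed')
    parser.add_argument('--l', type=int, nargs='?', default=42,
                        help='Length of these sequences')
    return parser


# Equivalent to: python markov.py demo/text.txt --n 3
args = build_parser().parse_args(['demo/text.txt', '--n', '3'])
options = vars(args)  # plain dict, convenient for logging the run

# With nargs='?' and no const, a bare "--n" yields None, not the default.
bare = build_parser().parse_args(['--n'])
```

Because `vars(args)` is an ordinary dictionary, the exact hyper-parameters of a run can be stored next to its results.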
\scriptsize

    saved_results = {
        'options': vars(args),
        'predictions': predictions,
        'results': results,
        'folds': folds
    }
    with open(os.path.join(folder, 'results-{}.json'.format(iso_date)), 'w') as f:
        json.dump(saved_results, f)

\normalsize
Creates assistments09/results-2022-07-06T23:04:22.097694.json
which contains:
\scriptsize

    {
        "options": {"data": "assist09", "n_iter": 20, "d": 5},
        "predictions": [
            {"fold": 0, "pred": [0.6, 0.4], "truth": [1, 0]},
            {"fold": 1, "pred": [0.7, 0.2], "truth": [1, 1]}
        ],
        "results": [
            {"name": "auc", "value": 0.76},
            {"name": "accuracy", "value": 0.82}
        ],
        "folds": "<path-to-fold-file>"
    }

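
Reading such a file back is just as short. The sketch below writes the example content to a temporary folder and reloads it; the file name `results-example.json` is hypothetical and stands in for the timestamped name used above.

```python
import json
import os
import tempfile

# Same content as the example file shown above.
saved_results = {
    'options': {'data': 'assist09', 'n_iter': 20, 'd': 5},
    'predictions': [
        {'fold': 0, 'pred': [0.6, 0.4], 'truth': [1, 0]},
        {'fold': 1, 'pred': [0.7, 0.2], 'truth': [1, 1]},
    ],
    'results': [
        {'name': 'auc', 'value': 0.76},
        {'name': 'accuracy', 'value': 0.82},
    ],
    'folds': '<path-to-fold-file>',
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'results-example.json')  # hypothetical name
    with open(path, 'w') as f:
        json.dump(saved_results, f)
    with open(path) as f:
        loaded = json.load(f)

# Index the metrics by name for easy lookup.
metrics = {entry['name']: entry['value'] for entry in loaded['results']}
n_predictions = sum(len(fold['pred']) for fold in loaded['predictions'])
```

Since the options are stored in the same file as the metrics, each result can always be traced back to the hyper-parameters that produced it.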
Using a Makefile
It defines a dependency graph, for example:
\centering LaTeX figures $\to$ figures PDF $\to$ article / slides
\pause \vspace{1cm}
Here is the Makefile of my CV:
\includegraphics{figures/makefile-cv.png}
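
The rule Make applies to each edge of the dependency graph can be sketched in a few lines of Python: a target is rebuilt if it is missing or older than one of its sources. The function name `needs_rebuild` and the file names are assumptions for illustration; this is not how Make is implemented, only the idea behind it.

```python
import os
import tempfile


def needs_rebuild(target, sources):
    """True if target is missing or older than at least one source."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)


with tempfile.TemporaryDirectory() as folder:
    source = os.path.join(folder, 'figure.tex')  # hypothetical file names
    target = os.path.join(folder, 'figure.pdf')
    open(source, 'w').close()
    missing = needs_rebuild(target, [source])    # target does not exist yet

    open(target, 'w').close()
    os.utime(source, (1000, 1000))               # set explicit timestamps
    os.utime(target, (2000, 2000))
    up_to_date = needs_rebuild(target, [source])  # target newer than source

    os.utime(source, (3000, 3000))               # source edited afterwards
    stale = needs_rebuild(target, [source])
```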
\setbeamercovered{transparent}
\centering data $\to$ preprocessed data $\to$ \alert<2>{logs} (incl. hyper-parameters) $\to$ \alert<3>{plots}
\vspace{1cm} \raggedright
\uncover<2>{Log metrics as much as possible
(it’s OK to log several times at different locations: files, standard output)}
\uncover<3>{Then your plots are generated only from your logs, so you only need to recompute what has changed}
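
A minimal sketch of the "plots from logs" step, assuming each run wrote a `results-*.json` file in the format shown earlier: the helper `collect_metric` (a hypothetical name) gathers one metric against one hyper-parameter, and the resulting points can be handed to any plotting library. The two log files and their values are made up for the demonstration.

```python
import glob
import json
import os
import tempfile


def collect_metric(folder, metric, option):
    """Return sorted (option value, metric value) pairs from all logs."""
    points = []
    for path in glob.glob(os.path.join(folder, 'results-*.json')):
        with open(path) as f:
            log = json.load(f)
        values = {entry['name']: entry['value'] for entry in log['results']}
        if metric in values:
            points.append((log['options'][option], values[metric]))
    return sorted(points)


with tempfile.TemporaryDirectory() as folder:
    # Two made-up runs that differ only in the hyper-parameter d.
    runs = [('a', 10, 0.78), ('b', 5, 0.76)]
    for name, d, auc in runs:
        log = {'options': {'d': d},
               'results': [{'name': 'auc', 'value': auc}]}
        with open(os.path.join(folder, 'results-{}.json'.format(name)), 'w') as f:
            json.dump(log, f)
    points = collect_metric(folder, 'auc', 'd')
```

Changing the look of a plot then only requires rereading the logs, not rerunning the experiments.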
You can open issues; \alert<2->{sometimes} researchers are happy to know their code is useful
Do not forget to cite other people’s software, e.g. \includegraphics[width=2cm]{figures/sklearn.png}
\small \fullcite{pedregosa2011scikit}
\fullcite{pandoc2.17}
\pause \pause
\hfill $\uparrow$ \mintinline{latex}{\fullcite} made using \mintinline{latex}{\usepackage{biblatex-software}}, thanks \raisebox{-8pt}{\includegraphics[width=2cm]{figures/swh.jpg}}
`screen` for keeping an interactive session open even after you disconnect from the server.
`tmux` is similar and better: great for pair programming.
\mintinline{shell}{python -m pdb}, the Python debugger.
`magic-wormhole` for sending (non-sensitive) data over the Internet
one-click banner that opens a temporary Jupyter notebook or Lab in the browser with your repo (amazing, JupyterHub Core Team! \Clap)
`ansible-playbook` for automating a remote install
(`git`, `requirements.txt`, doc, examples)

Thanks! Questions?
\raggedleft
jill-jenn.vie@inria.fr