Git and Jupyter Notebooks Guide

Curious that they discuss several options, but ignore the totally obvious one: just use jupytext [0]. Jupytext is a (tiny) Jupyter extension that reads/writes notebooks as Python files, with text cells represented as comments. With jupytext, you do away with the stupid .ipynb format. As long as you don't need to save the cell outputs, which is the case for version control, jupytext is the way to go.

People: pip install jupytext. All your Python files will become notebooks, and your notebooks will become Python files.

[0] https://jupytext.readthedocs.io/en/latest/
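
To make "notebooks as Python files, with text cells as comments" concrete, here is a minimal standard-library sketch that renders a notebook's JSON in the `# %%` cell-marker style. It is an illustration of the idea only, not Jupytext's implementation: the function name `notebook_to_percent_script` and the sample notebook are made up for this example, and real Jupytext also handles metadata, other cell types, and the reverse direction.

```python
import json


def notebook_to_percent_script(notebook_json: str) -> str:
    """Render notebook cells as a '# %%' style script, dropping all outputs."""
    nb = json.loads(notebook_json)
    chunks = []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # The notebook format allows source as a list of lines or one string.
        if isinstance(source, list):
            source = "".join(source)
        lines = source.splitlines()
        if cell.get("cell_type") == "markdown":
            # Text cells become comment lines under a markdown marker.
            body = ["# " + line if line else "#" for line in lines]
            chunks.append("\n".join(["# %% [markdown]"] + body))
        elif cell.get("cell_type") == "code":
            # Code cells are copied verbatim under a plain cell marker.
            chunks.append("\n".join(["# %%"] + lines))
    return "\n\n".join(chunks) + "\n"


# Hypothetical two-cell notebook, one text cell and one executed code cell.
example = json.dumps({
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Title\n", "Some notes"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 3,
         "source": "x = 1 + 1\nprint(x)",
         "outputs": [{"output_type": "stream", "name": "stdout", "text": "2\n"}]},
    ],
})

script = notebook_to_percent_script(example)
print(script)
```

The resulting text contains only the inputs, which is why it diffs like ordinary source code.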

pletnes · 3 years ago

What happens to the outputs in this case? I found the outputs to be both the most useful parts of notebooks, but also the most troublesome for diffing and versioning.

enriquto · 3 years ago

Why would you commit the outputs into git? That would be like committing compiled binary objects or pdfs. Of course the outputs are useful, but you just want to commit the sources.

The .ipynb stores inputs and outputs together in an unholy way. It is much cleaner to separate them. The inputs are python (or markdown) files that you can edit with a text editor and version control with git. The outputs are html, pdf, or whatever you want to nbconvert to and share.

The .ipynb file would only be useful if you want to share a stateful notebook, whose state cannot be easily reproduced by the people who you share it with. But that would be really bizarre and definitely in bad taste. Sharing the .ipynb is akin to sharing your .pyc files.

I love working with notebooks, but as a measure of hygiene I avoid .ipynb files altogether.

Helmut10001 · 3 years ago

I have used Jupytext for years. It allows me to have three types of synced notebook versions: 1) .ipynb (for opening/running), 2) .md (formatted code+comments, without outputs) and 3) .py (Python formatted, code+comments).

I commit the Markdown version, but I also use the .py version of notebooks for chained notebook imports. This allows me to split larger notebooks into multiple smaller ones. Both of these options are a blessing, and Jupytext is super robust.

Finally, when I want to archive (and share) notebooks _with_ outputs once in a while, I have a cell at the end to convert (nbconvert) to HTML, and I commit this html file. The Markdown-version remains as a clean basis for commit history. The HTML file is much better suited for sharing and archiving than the ipynb file.

kzrdude · 3 years ago

I use jupytext paired with ipynb files. Only store the .py files in git. The ipynb files act as a local cache of outputs. Outputs are loaded from the ipynb even if you open the .py notebook.
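
One way to picture the "local cache of outputs" idea: when the .py file is opened, each code cell gets back the outputs stored for the same code in the .ipynb. The sketch below assumes cells are matched by exact source text, which is an assumption made for illustration and not a description of how Jupytext actually pairs cells; `restore_outputs` and the sample cells are hypothetical.

```python
def restore_outputs(text_cells, cached_cells):
    """Copy cached outputs onto text cells whose source is identical.

    text_cells:   code cells parsed from the .py file (they carry no outputs).
    cached_cells: code cells from the local .ipynb (they carry outputs).
    """
    # Index cached outputs by exact source text; the first occurrence wins.
    cache = {}
    for cell in cached_cells:
        cache.setdefault(cell["source"], cell.get("outputs", []))

    merged = []
    for cell in text_cells:
        merged.append({
            "cell_type": "code",
            "source": cell["source"],
            # Cells that were edited or added get no stale outputs.
            "outputs": cache.get(cell["source"], []),
        })
    return merged


# Hypothetical cached notebook cells (from the .ipynb).
cached = [
    {"cell_type": "code", "source": "x = 2", "outputs": []},
    {"cell_type": "code", "source": "print(x)",
     "outputs": [{"output_type": "stream", "name": "stdout", "text": "2\n"}]},
]

# Hypothetical cells read from the .py file; the middle one is new.
from_py = [
    {"cell_type": "code", "source": "x = 2"},
    {"cell_type": "code", "source": "print(x * 10)"},
    {"cell_type": "code", "source": "print(x)"},
]

merged = restore_outputs(from_py, cached)
```

Under this assumption, only the .py file needs to be in git, and the .ipynb can stay untracked on each machine.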

bootsmann · 3 years ago

We use jupytext with dvc. You can generate the notebook in dvc.yaml using the jupytext cli and then push this alongside the .py file.

cycomanic · 3 years ago

This was the first thing I wanted to post when reading the article. Jupytext is excellent, although I typically use MyST (an extended Markdown syntax).

joouha · 3 years ago

Euporie (my terminal Jupyter notebook editor) also supports Jupytext.

kortex · 3 years ago

Wow, no mention of DVC (http://www.dvc.org)? That has been invaluable for data scientist workflows.

I definitely do like to strip notebooks and make them run-idempotent to the best of my ability, but sometimes you just need stateful notebooks. And since .ipynb are technically json but in reality act more like a binary file format (with respect to diffing), DVC is the ideal tool to store them. Don't get me started on git annex or LFS, both of those took years off my life due to stress of using them and them bugging out.

Also I am hardly a fan of XML, but does anyone feel like notebook files would have been a near-ideal use-case of it? It's literally a collection of markup. The fact that json was chosen over xml I think is somewhat damning of xml as an application data storage format. I think xml is perfectly cromulent as a write-once-read-many presentation format or rendering target (html, svg, GeniCam api info), but it seems to flounder in virtually every other domain it's been shoehorned into, with the exception of office application formats.

Actually, downthread there is a link to a Jupyter enhancement proposal for a .nb.md markdown based format. I think this is great. One theme I keep coming across in my computer science journey is that formats which have mandatory closing endcaps are kind of a PITA. It seems the stream-of-containers (with state machines as needed) is all-around better. JSON-LD is better than JSON, streaming video formats are better than ones that stick metadata at the end, zip is... an eldritch horror, etc.
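
The remark about .ipynb behaving like a binary format under diffing can be illustrated with a small standard-library sketch. It builds the same one-cell notebook twice, as if the cell had been re-run, and diffs the serialized JSON. The helper names and the output values are made up for this example; the point is only that the diff is non-empty although the code did not change.

```python
import difflib
import json


def notebook(execution_count, output_text):
    """Build a one-cell notebook dict in the nbformat 4 layout."""
    return {
        "nbformat": 4, "nbformat_minor": 5, "metadata": {},
        "cells": [{
            "cell_type": "code", "metadata": {},
            "execution_count": execution_count,
            "source": ["import random\n", "print(random.random())"],
            "outputs": [{"output_type": "stream", "name": "stdout",
                         "text": [output_text]}],
        }],
    }


def diff_lines(a, b):
    """Return only the added/removed lines of a unified diff of two notebooks."""
    a_text = json.dumps(a, indent=1, sort_keys=True).splitlines()
    b_text = json.dumps(b, indent=1, sort_keys=True).splitlines()
    diff = difflib.unified_diff(a_text, b_text, lineterm="", n=0)
    return [line for line in diff
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]


# Same source, run twice; the printed numbers are invented sample values.
first_run = notebook(1, "0.8444218515250481\n")
second_run = notebook(2, "0.7579544029403025\n")

changes = diff_lines(first_run, second_run)
print("\n".join(changes))
```

Every changed line here comes from the execution counter or the stored output, none from the source, which is the noise that stripping outputs or storing notebooks outside git is meant to avoid.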

wdroz · 3 years ago

If you don't need to "commit" the output, you can just use nbconvert[0]:

    jupyter nbconvert --clear-output --inplace my_notebook.ipynb

So you can use git as usual, like for code.

[0] -- https://nbconvert.readthedocs.io/en/latest/

andrecosta · 3 years ago

nbstripout[0] does that and installs a pre-commit hook

[0] -- https://github.com/kynan/nbstripout

milliams · 3 years ago

There is a draft JEP (Jupyter Enhancement Proposal) for Markdown-based notebooks (https://github.com/jupyter/enhancement-proposals/pull/103) which will make it a little more RMarkdown-like.

nvy · 3 years ago

Seems to me that this article does a great job explaining why jupyter notebooks are a poor collaboration tool.

I wish that non-emacs implementations of org were more commonplace, as it's a pretty sane markup language and supports embedded code and graphics, diffs nicely, and doesn't introduce the insanity of JSON.

joelschw · 3 years ago

The native GitHub feature in preview will make this a lot better for those able to use it https://github.blog/changelog/2023-03-01-feature-preview-ric...

SalsaCrotch · 3 years ago

This feature has resolved the problem for our team.

sashk · 3 years ago

You don't need to commit outputs to git. I used a pre-commit filter in git that strips all output from the notebook before it is committed to the repository. This allowed us to review the code changes in notebooks.
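
The comment does not say which mechanism was used, so the following is only a sketch of one possibility: a stream-to-stream function of the kind a git clean filter runs, where git pipes the file in and stores whatever comes out. The function name `strip_outputs` and the sample notebook are hypothetical, and only the standard library is used; tools such as nbstripout do this more thoroughly.

```python
import io
import json


def strip_outputs(src, dst):
    """Read notebook JSON from src and write it to dst without cell outputs."""
    nb = json.load(src)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            # Drop results and the run counter; keep the source untouched.
            cell["outputs"] = []
            cell["execution_count"] = None
    json.dump(nb, dst, indent=1, ensure_ascii=False)
    dst.write("\n")


# In a real filter script this would be called as
# strip_outputs(sys.stdin, sys.stdout); in-memory streams are used here
# so the example runs on its own.
raw = json.dumps({
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["Notes"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 7,
         "source": ["print('hello')"],
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": ["hello\n"]}]},
    ],
})

out = io.StringIO()
strip_outputs(io.StringIO(raw), out)
cleaned = json.loads(out.getvalue())
```

Assuming the clean-filter route, such a script would be registered under a filter name of your choosing with `git config filter.<name>.clean` and applied to notebooks with a `*.ipynb filter=<name>` line in `.gitattributes`, so the working copy keeps its outputs while the committed version does not.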

TeeWEE · 3 years ago

My quick solution is to not commit the result cells, only the commands. So it's just code.