Curious that they discuss several options, but ignore the totally obvious one: just use jupytext [0]. Jupytext is a (tiny) jupyter extension that reads/writes notebooks as python files, with text cells being represented as comments. With jupytext, you do away with the stupid .ipynb format. As long as you don't need to save the cell outputs, which is the case for version control, jupytext is the way to go.
People: pip install jupytext. All your python files will become notebooks, and your notebooks will become python files.
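For anyone wondering what "notebooks as python files, with text cells as comments" looks like in practice, here is a rough stdlib-only sketch of the idea. To be clear, this is not jupytext's implementation: `notebook_to_percent` and the `sample` notebook are hypothetical names made up for illustration, and the sketch assumes the nbformat 4 JSON layout and the `# %%` cell markers of the percent format. It ignores raw cells and all metadata.

```python
def notebook_to_percent(nb: dict) -> str:
    """Render a notebook dict (nbformat 4 layout assumed) as a percent-style script."""
    chunks = []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # nbformat allows "source" to be a single string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if cell.get("cell_type") == "code":
            # Code cells are written as plain code; outputs are simply dropped.
            chunks.append("# %%\n" + source)
        elif cell.get("cell_type") == "markdown":
            # Text cells become comments, one "# " prefix per line.
            commented = "\n".join(
                "# " + line if line else "#" for line in source.split("\n")
            )
            chunks.append("# %% [markdown]\n" + commented)
        # Raw cells and other cell types are skipped in this sketch.
    return "\n\n".join(chunks) + "\n"


# Hypothetical two-cell notebook, including an output that should not survive.
sample = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Some text"]},
        {
            "cell_type": "code",
            "source": "x = 1\nprint(x)",
            "execution_count": 3,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": "1\n"}],
        },
    ],
    "nbformat": 4,
}

text = notebook_to_percent(sample)
print(text)
```

The result is an ordinary Python file that diffs line by line, which is the whole point for version control. The real tool also handles the reverse direction and metadata, so use it rather than something like this.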
What happens to the outputs in this case? I found the outputs to be both the most useful part of notebooks and the most troublesome for diffing and versioning.
Why would you commit the outputs into git? That would be like committing compiled binary objects or pdfs. Of course the outputs are useful, but you just want to commit the sources.
The .ipynb stores inputs and outputs together in an unholy way. It is much cleaner to separate them. The inputs are python (or markdown) files that you can edit with a text editor and version control with git. The outputs are html, pdf, or whatever you want to nbconvert to and share.
The .ipynb file would only be useful if you want to share a stateful notebook, whose state cannot be easily reproduced by the people who you share it with. But that would be really bizarre and definitely in bad taste. Sharing the .ipynb is akin to sharing your .pyc files.
I love working with notebooks, but as a measure of hygiene I avoid .ipynb files altogether.
I have used Jupytext for years. It lets me keep three synced versions of each notebook: 1) .ipynb (for opening/running), 2) .md (formatted code+comments, without outputs) and 3) .py (Python formatted, code+comments).
I commit the Markdown version, but I also use the .py version of notebooks for chained notebook imports, which lets me split larger notebooks into multiple smaller ones. Both of these options are a blessing, and Jupytext is very robust.
Finally, when I want to archive (and share) notebooks _with_ outputs once in a while, I have a cell at the end to convert (nbconvert) to HTML, and I commit this html file. The Markdown-version remains as a clean basis for commit history. The HTML file is much better suited for sharing and archiving than the ipynb file.
I use jupytext paired with ipynb files. Only store the .py files in git. The ipynb files act as a local cache of outputs. Outputs are loaded from the ipynb even if you open the .py notebook.
This was the first thing I wanted to post when reading the article. Jupytext is excellent, although I typically use MyST (an extended Markdown syntax).
Wow, no mention of DVC (http://www.dvc.org)? That has been invaluable for data scientist workflows.
I definitely do like to strip notebooks and make them run-idempotent to the best of my ability, but sometimes you just need stateful notebooks. And since .ipynb are technically json but in reality act more like a binary file format (with respect to diffing), DVC is the ideal tool to store them. Don't get me started on git annex or LFS, both of those took years off my life due to stress of using them and them bugging out.
Also I am hardly a fan of XML, but does anyone feel like notebook files would have been a near-ideal use-case of it? It's literally a collection of markup. The fact that json was chosen over xml I think is somewhat damning of xml as an application data storage format. I think xml is perfectly cromulent as a write-once-read-many presentation format or rendering target (html, svg, GeniCam api info), but it seems to flounder in virtually every other domain it's been shoehorned into, with the exception of office application formats.
Actually, downthread there is a link to a Jupyter enhancement proposal for a .nb.md Markdown-based format. I think this is great. One theme I keep coming across in my computer science journey is that formats which have mandatory closing endcaps are kind of a PITA. It seems the stream-of-containers (with state machines as needed) is all-around better. JSON-LD is better than JSON, streaming video formats are better than ones that stick metadata at the end, zip is... an eldritch horror, etc.
Seems to me that this article does a great job explaining why jupyter notebooks are a poor collaboration tool.
I wish that non-emacs implementations of org were more commonplace, as it's a pretty sane markup language and supports embedded code and graphics, diffs nicely, and doesn't introduce the insanity of JSON.
You don't need to commit outputs to git. I used a pre-commit filter in git that strips all output from the notebook before it is committed to the repository. This allowed us to review the code changes in notebooks.
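The stripping step itself is simple enough to sketch with the standard library. This is only an illustration of the idea, assuming the nbformat 4 JSON layout; `strip_outputs` and the `sample` notebook are hypothetical names, and this is not how nbstripout or any particular filter is implemented.

```python
import copy
import json


def strip_outputs(nb: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs and counters cleared."""
    stripped = copy.deepcopy(nb)  # leave the caller's notebook untouched
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Hypothetical notebook with one executed code cell.
sample = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["Some notes"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["print(1 + 1)"],
            "execution_count": 7,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

clean = strip_outputs(sample)
# Serialize with stable formatting so diffs only show source changes.
print(json.dumps(clean, indent=1, sort_keys=True))
```

With outputs and execution counts gone, re-running a notebook no longer changes the committed file, so a review diff only shows edits to the cells. For real repositories a maintained tool such as nbstripout is the safer choice than a hand-rolled script.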
[0] https://jupytext.readthedocs.io/en/latest/
[0] -- https://nbconvert.readthedocs.io/en/latest/
[0] -- https://github.com/kynan/nbstripout