
Conventions I established while using a Notebook every day during my PhD

Jupyter Notebooks saved my sanity in graduate school. I genuinely don’t know how I would have stayed organized without them. In fact, I didn’t know about them for the first few months of graduate school, and ran into multiple issues during that time.

On the other hand, people have a lot of legitimate concerns and issues with programming in notebooks. The global mutable state is probably the biggest. There are plenty of other issues, some of which I don’t address because they didn’t affect my workflow (e.g. playing nice with git). Again, these concerns are legitimate, but a lot of them come from professional software developers who have only recently tried notebooks. In practice, I don’t find any of these concerns to be a reason not to use notebooks.

Here are ten tips I’ve picked up over the past five years of programming in notebooks on an almost daily basis to do bioinformatics research:

  1. Make a new notebook every month. This worked well for me because I only had one main project, so everything was chronological. If you have multiple long projects, consider keeping one notebook a month per project. A month is long enough that your code isn’t too fragmented, but short enough that your notebooks aren’t unwieldy.
  2. Do your best to run cells in order and to not rerun cells after you’ve written new ones. This is aspirational. The real rule is:
    • Don’t rerun a cell you didn’t write that day. Copy and paste the cell at the bottom of the notebook, and run it there. You’ll be in a hell of hidden global state if you don’t have some rule similar to this. Even 24 hours later, I’d think I had all of the necessary context in my head to rerun the cell, but I always ended up hitting an Exception I thought I’d accounted for.
  3. DRY doesn’t apply here; feel free to copy and paste code. In fact, do it maybe 5-7 times. After that, pull it into a .py file. You should be keeping a repository for “core” code that you repeatedly use during the project. If you pull the code out of your notebook after 2-4 times, you’re probably too early and don’t yet know how to abstract the code in a way that makes it pleasant to use down the road. If you wait longer, your notebook becomes bloated, and you’re pretty likely to start using the wrong copy of the code snippet.
  4. Speaking of DRY, this isn’t software engineering, it’s data analysis; don’t over-engineer things. But do stick to good practices wherever possible. Good variable naming is a huge bonus.
  5. Use full, explicit strings when doing IO. It’s tempting to do things like outfile = f'{base}/{cell_type}_{rna_seq}.csv' inside of a loop. Sometimes it’s necessary. However, if you can avoid the temptation, your notebooks become a (mostly) transactional history of your filesystem. One of the biggest problems I faced as a first-year graduate student was looking at the name of a file and thinking “How in the world did I produce ES_polyA_v3.csv, and how is it different from v2? That was a month ago!” With notebooks, all I had to do was grep "ES_polyA_v3.csv" path/to/notebooks/*.ipynb, and I’d know exactly what code (and upstream data!) produced it. This is the most powerful feature of notebooks.
    • I fell in and out of this habit, but all data-related file IO should happen in a notebook. Want to restructure some subdirectories? Run all your cp, mv, and rm commands in a notebook, so you can always trace where things finally ended up.
  6. Graphs were similar to data files, but even worse. It was easy to make a lot of slightly different visualizations of the same main dataset. All of the above rules applied to graphs, with a couple of additions.
    1. To be honest, I don’t even know what the notebook’s default graphing settings are anymore. All of my notebooks began with, among other things, %matplotlib inline, so graphs were embedded in my notebooks. But! Even if you’re sure you don’t need it, go ahead and save the graph. Graphs are usually pretty small, and it’s usually a hassle to recreate the exact conditions needed to remake the graph.
    2. Save the graph as a .pdf. This becomes more important as your graphs are more likely to be used in a presentation or publication. matplotlib is powerful, but don’t underestimate the flexibility of Adobe Illustrator. .pdf manipulations in Illustrator will likely take you hours, instead of the literal weeks you could spend perfecting your matplotlib script. If it turns out you do want a .png, just run an image converter.
  7. .pdf should also be the fate of the notebooks themselves. At the end of the month, freeze the notebook. I just did this by convention: when the next month hit, I stopped adding new cells to the previous notebook. I made a .pdf copy using jupyter nbconvert --to pdf nb.ipynb. In theory, converting to a .pdf provides you with an easy way of sharing your notebook with your Python-deficient colleagues. In practice, nobody cares about your notebook but you. The real reason to convert is for searching later. When I would run grep on a filename, as described earlier, it could easily be found in a half dozen notebooks. Opening and scrolling through those was a hassle. Instead, I could quickly open the relevant .pdf files and find what I needed. Usually, this immediately led to another grep (perhaps on the input data to the file I originally searched for), and the process started over.
  8. Use %load_ext watermark and %watermark -a 'Jessime Kirk' -nmv --packages as basic dependency management. Python packaging and dependency management are a mess. If you have a system that works for you, keep it. My strategy was to keep a single main conda virtual environment. At the end of each month, I’d upgrade all of the packages, run watermark at the beginning of my new notebook, and not touch anything until next month. Over the 5 years, this saved me probably 3-5 times. pandas or some other library would subtly change in a way that broke code that had been working before. Knowing that I was on a different version of pandas from six months ago saved me a lot of headache.
  9. Clean up useless noise. This can be a little bit tricky if you’re not sure what counts as noise. But a lot of my earliest notebooks are large and unwieldy because I output too many useless lines. The number of bytes matters less than the fact that the likelihood of finding a useful line in those notebooks is significantly lower. Deciding how to do this is a matter of practice, and figuring out what works for your workflow.
  10. Don’t listen to me! Working with notebooks was a joy for me. I was intentional and experimental. I thought a lot about what I needed in a workflow, and I tried a lot of different possible solutions until I found a set that worked for me. You should do the same! Maybe you’ll find some extensions you’ll fall in love with.
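
As a rough illustration of the search described in tip 5: plain grep runs over the raw .ipynb JSON, so it also matches saved outputs and markdown cells. The sketch below restricts the search to code cells. It assumes notebooks in the nbformat 4 layout (a top-level "cells" list), and the function name find_in_notebooks is hypothetical, not part of Jupyter.

```python
import json
from pathlib import Path


def find_in_notebooks(needle, notebook_dir):
    """Return (notebook name, cell index) pairs for code cells containing `needle`.

    Hypothetical helper; assumes the nbformat 4 layout with a top-level "cells" list.
    """
    hits = []
    for nb_path in sorted(Path(notebook_dir).glob("*.ipynb")):
        with open(nb_path, encoding="utf-8") as handle:
            notebook = json.load(handle)
        for index, cell in enumerate(notebook.get("cells", [])):
            if cell.get("cell_type") != "code":
                continue  # skip markdown and raw cells
            # `source` is stored either as one string or as a list of lines.
            source = cell.get("source", "")
            if isinstance(source, list):
                source = "".join(source)
            if needle in source:
                hits.append((nb_path.name, index))
    return hits
```

Calling find_in_notebooks("ES_polyA_v3.csv", "path/to/notebooks") would list every code cell that mentions that filename, which only works if the filename was written out in full rather than assembled from an f-string.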
