End of Year Review (Part I)
My PI has requested that we answer a few questions as a reflection on the past year and as mental preparation for the coming one. I figured it would be worthwhile to record at least a few of my answers here. This seems like a useful exercise, and my answers are decently in depth, so to keep any single post from getting overwhelming I’m going to dedicate a separate post to each question. To set the stage, here is what my analysis code looked like a year ago:
- Most of the scripts have virtually no comments or documentation.
- Everything is hard-coded. Occasionally there are chunks of data stored directly in the script, and strings referring to data files are scattered throughout the code. There’s no way I could have reliably found and changed all of those strings if I had wanted to run on different data. (A made-up sketch of this problem, and one simple fix, follows this list.)
- There’s only minimal use of libraries like numpy and scipy; I didn’t feel very comfortable with them yet.
- Several scripts have no functions whatsoever, just 100 lines of straight code. All of these scripts were obviously capable of doing exactly one thing.
- Other scripts have one main function that takes a couple of basic parameters (like the names of input files). Unfortunately, I didn’t yet know how to add command-line options to a Python script, so at the bottom of each script there’s a long list of strings corresponding to the files I had run the code on. As I performed different runs, I would comment out the previous strings… (A sketch of the command-line alternative also follows this list.)
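To make the hard-coding problem concrete, here is a made-up sketch of the kind of cleanup that would have saved me: every file name and magic number lives in one dict at the top of the script instead of being scattered through the code. (The file names and the `load_inputs` helper are invented for illustration.)

```python
# All external files and tunable constants in one place: switching to a
# different dataset means editing this single dict, not hunting through
# the whole script. (Paths and names here are hypothetical.)
CONFIG = {
    "counts_file": "data/sample_counts.txt",
    "labels_file": "data/sample_labels.txt",
    "threshold": 0.05,
}

def load_inputs(config):
    """Read the raw inputs named in the config dict."""
    with open(config["counts_file"]) as f:
        counts = f.read().splitlines()
    with open(config["labels_file"]) as f:
        labels = f.read().splitlines()
    return counts, labels
```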
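As for the commented-out lists of file names: the standard-library `argparse` module is one way to turn them into proper command-line options. A minimal sketch, with hypothetical function and argument names:

```python
import argparse

def run_analysis(data_file, output_file):
    """Placeholder for whatever the script actually computes."""
    print(f"analyzing {data_file} -> {output_file}")

def main():
    parser = argparse.ArgumentParser(
        description="Run the analysis on a single data file.")
    parser.add_argument("data_file", help="path to the input data file")
    parser.add_argument("-o", "--output", default="results.csv",
                        help="where to write the results")
    args = parser.parse_args()
    run_analysis(args.data_file, args.output)

if __name__ == "__main__":
    main()
```

Each run is then just something like `python analyze.py some_data.txt -o some_results.csv`, with no editing of the script itself.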
A year later, things look very different:

- Surprisingly, https://www.reddit.com/r/Python has been a great resource. The community is active, and new packages, updates, and other resources are constantly posted and discussed. I now have a way to stay up to date with virtually everything Python has to offer as a data analysis tool.
- Numpy/Scipy/Pandas/Matplotlib/sklearn are now part of my everyday tools. I can navigate their APIs and documentation, and I’m generally familiar with their capabilities and limitations. This means I can import, manipulate, analyze, plot, and save large datasets much more quickly than I could last year. A great example: using Numpy and a bit of linear algebra, I can compute Pearson’s correlations roughly 500 times faster than I could before (a sketch of the idea follows this list).
- Discovering the IPython/Jupyter Notebook has been the biggest game changer for me. The notebook lets me record essentially everything I do in a given month. A lot of the code I write is one-and-done; there’s no reason to store it as its own script, and now I don’t have to. I can make notes to myself right next to my code, save graphs directly below the code that made them, and section off my projects in a way that makes sense. Basically, I now have a method for reproducibility that has stopped my code directory structure from becoming an incomprehensible disaster. Organization is not my natural forte, so being able to use the notebooks this way feels like a major accomplishment.
- Another large benefit of the notebook is that the .py files I do write are readable and reusable. When I write code I know I’ll need again, I take a bit of time to rewrite it in a .py file as a class, break it up into modular functions, and write a docstring for each one. Inevitably, when I go to rerun my code, I want to try something slightly different from what I did the first time. Now all of my core logic is abstracted away in a .py file; all I have to do is instantiate the class I want inside the notebook and choose the functions I want to use this time. Every slight variation from run to run is saved right there in the notebook. (A rough sketch of this pattern also follows this list.)
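To give a flavor of that Pearson-correlation speedup (this is a generic sketch of the idea, not my exact code): a Pearson correlation is just the dot product of two centered, unit-length vectors, so every pairwise correlation between the rows of a matrix can be computed with a single matrix multiplication instead of a Python loop.

```python
import numpy as np

def row_correlations(X):
    """Pearson correlation of every row of X with every other row.

    Same result as np.corrcoef(X): center each row, scale it to unit
    length, and then every pairwise correlation is a dot product --
    one matrix multiplication in total.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=1, keepdims=True)
    normed = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    return normed.dot(normed.T)

# Made-up example: 1,000 variables measured 200 times each.
data = np.random.randn(1000, 200)
corr = row_correlations(data)              # 1000 x 1000 correlation matrix
assert np.allclose(corr, np.corrcoef(data))
```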
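And a rough sketch of that notebook-plus-module pattern (the class, method, and file names below are invented for illustration): the reusable logic lives in a small, documented class in a .py file, and the notebook only imports it and tries variations.

```python
# analysis.py -- the reusable, documented core logic
import numpy as np

class ExpressionAnalysis:
    """One analysis workflow, kept out of the notebook so it can be reused."""

    def __init__(self, data_file):
        """Load a whitespace-delimited numeric matrix from data_file."""
        self.data = np.loadtxt(data_file)

    def normalize(self):
        """Z-score each row in place; return self so calls can be chained."""
        mean = self.data.mean(axis=1, keepdims=True)
        std = self.data.std(axis=1, keepdims=True)
        self.data = (self.data - mean) / std
        return self

    def row_correlations(self):
        """Return the matrix of pairwise correlations between rows."""
        return np.corrcoef(self.data)
```

A notebook cell for one particular run then stays tiny, and any variation I try is recorded right there in the cell:

```python
from analysis import ExpressionAnalysis

corr = ExpressionAnalysis("data/experiment_42.txt").normalize().row_correlations()
```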
While I’m still a long way from being a data analysis expert, I can honestly say that I no longer feel on the brink of disaster the way I did this time last year. There’s room for improvement in each of the four areas listed above, but I’m happy to say that I believe I’m on the right track.