Recipe 4: Project Preparation

Data science projects are meant to be shared! Taking time to structure your projects in an understandable, reproducible way will save you lots of headaches when you share with colleagues/other units or when you return to the project after some time.



Version Control

Version control allows you to maintain your stable code (code that is fully working) separately from code that you are working on fixing or adding new features. These various versions are maintained in separate ‘branches.’ This allows you to change and fix code without breaking what you know is working. Version control tracks your changes, and once you have a new working version, the changes can be merged into your stable code. There are various version control systems, but here’s an intro to general version control.

One of the most commonly used version control options is Git. GitHub is a cloud repository that utilizes Git. Version control is necessary for making sure you don’t end up with 10 different files named ‘my_project_v1’, ‘my_project_exploring’, ‘my_project_final’, ‘my_project_definitely_FINAL’, etc etc. Here’s a beginner guide to Git/Github from Towards Data Science. Scroll down to option 2, to make a repository on the website using their user interface. Use the Git Flow toolkit if you’re practicing version control within a team. For more information on advanced project workflows, checkout these guides provided by Beanstalk.



Virtual environment

In short, virtual environments are important for developers that are working on multiple projects with different dependencies. A virtual environment, from the onset, will not seem like a useful endeavor because it creates more friction in your work routine. However, just like the other suggestions in this section, there will come the day where you become grateful for putting in the work to build a professional development environment for yourself and your team.

Let’s capture an example to help illustrate the purpose of virtual environments (or virtualenv, or virtualenvwrapper for those Python nerds). Say you’re a developer working on multiple tools at the same time – an older tool that you’re just updating with v1.2.3 of jinnytool in the PyPi library and a fresh new exciting tool that might also use jinnytool but v2.3.4. Additionally, the old tool is using Python 2.7 and the “customer” likes it that way. The new tool, will of course, use Python 3.6 or maybe even 3.8 or hey, how about Python 4! Life will become very complicated without separate virtual environments for each of the two projects. A smart, organized developer will activate and deactivate environments depending on which project she is working on at the time. This developer names one virtual environment oldtool and one newtool, each using different installations of Python. When she runs pip freeze, the different jinnytool versions are installed at that time. Not only does this help in development, this also matters when it comes to deployment. With an easy pip freeze > requirements.txt, you can easily generate a text file with the list of installed libraries and the versions that the user should be using for the desired, expected output.

Here is a guide to Python’s virtual environments (TDS), and some more advice/tips here. Note that if you’re using Jupyter notebooks that your virtual environment won’t automatically be available within Jupyter. Here’s a guide to making virtual environments available in Jupyter notebooks. An easy way to install a package management environment is by downloading miniconda/anaconda (downside being you’ll be stuck with a lot of things you likely won’t ever need).



Structure (cookiecutter)

A standardized file format like cookiecutter is a must. Here’s another quick walkthrough of using the cookiecutter structure. Notebooks are used for exploring (not production), raw data is kept unchanged, and src holds your .py scripts.



Cookiecutter
Cookiecutter project structure. Source



Notebooks vs Source Code

Some people love notebooks and others hate them. How you use them is up to personal preference. Most people agree that notebooks are great for data exploration, data visualization, and tinkering on new projects, but they can lead to bad habits like hidden state and poor code and unlike Python scripts are not transferable for replicable, production code. That said, sometimes you can add a lot of value to some organizations with just a Jupyter notebook and iPython widgets can give your notebooks some cool interactive capabilities!



» Previous recipe
» Next recipe


» About
» Ingredients
» All recipes
» Resources
» Code examples