How to use Git for your data science projects

Learn how to use Git for your data science projects so you can keep your code backed up and share it with your colleagues or the data science community.

Picture by Joao Jesus, Pexels.

Git is the world’s most widely used version control system and is an essential tool for data scientists, especially those collaborating on projects with others. Most data science roles will expect you to be comfortable with Git, so it’s an important tool to master.

Git tracks the changes made within your projects, allows several people to collaborate on the same code, and keeps files remotely backed up and synchronised between different users, workstations, and servers. Here’s a quick guide to the basics of Git.

Installing Git

Git is a command line application and is so commonly used that it may already be installed on your machine. Assuming you’re using Ubuntu Linux, you can check whether Git is installed by typing git --version into your terminal.

If Git is installed, you’ll see a message like git version 2.25.1 telling you which release is installed. If it isn’t installed you can install it by entering the following command in your terminal, and entering your password:

sudo apt install -y git
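
If you’d rather check for Git from a script before installing anything, a minimal sketch might look like the one below. It assumes a POSIX-style shell, and the git_report variable name is just a placeholder for illustration. It only prints the install command rather than running it.

```shell
# Check whether a git executable is on the PATH and report what was found.
if command -v git >/dev/null 2>&1; then
    git_report="installed: $(git --version)"
else
    git_report="missing: on Ubuntu, try 'sudo apt install -y git'"
fi
echo "$git_report"
```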

Git repositories

Each time you work on a project you want to keep under version control, you create something called a local Git “repository”. Not to be confused with a suppository, a repository (or “repo”) comprises two main parts: the directories and files containing your code, and a hidden directory called .git, which lies in the root of your repository directory.

The .git directory contains all the information that Git uses to track changes to your files, and allows you to create new versions of your code, store any changes, and roll back to previous versions. This all sits on your local machine, so it isn’t backed up unless you explicitly do so.

While you can use your local Git repository without a third-party service such as GitHub.com, to get the full benefits of Git (including off-site backup, collaboration, and code sharing) you’ll likely want to upload or “push” your commits, meaning your code together with its version history, to a remote Git repository server. Therefore, you may want to head to GitHub.com to create an account.

Create a GitHub account, if you don’t have one already.

Creating Git repositories

There are three main ways to create a Git repository, and you’ll probably use all three methods at some stage, so we’ll go through them one by one. The three methods are:

  1. Cloning a remote Git repository
  2. Creating a Git repository for a new project
  3. Creating a Git repository for an existing project

1. Cloning a remote Git repository

The GitHub.com site acts as a sort of app store for open source code, and you can download or “clone” a whole range of public Git repositories to your local machine. You can also publish or “push” your own Git repositories to GitHub and make them private to you only, share them with your colleagues, or make them freely available for anyone to use.

To clone a remote repository to your local machine, open a terminal and navigate to your desired destination directory using cd, then enter git clone followed by the URL of the Git repository you want to clone. In the example below, I’m cloning my datasets repository, which includes some sample datasets I use for learning new modeling techniques.

cd /home/matt/projects
git clone https://github.com/flyandlure/datasets

After the command is executed, Git will fetch the code from the repository and place it in a new directory named after the repository (here, datasets) inside your current directory, along with a hidden .git directory containing the version control information. You can then use the code as you wish, including making your own modifications, which get recorded in the .git directory each time you commit them.

Cloning into 'datasets'...
remote: Enumerating objects: 18, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 18 (delta 4), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (18/18), 1.02 MiB | 1.91 MiB/s, done.
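
To see where a clone ends up without needing network access, you can try a rough sketch like this one. It uses an empty bare repository on the local disk as a stand-in for a remote, so Git will warn that the repository is empty. The my_datasets name and the temporary directory are assumptions for illustration only.

```shell
set -eu
# Work in a throwaway directory so nothing in your real projects is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for a remote: an empty bare repository on the local disk.
git init --bare datasets.git

# Clone it. Git creates ./datasets, dropping the ".git" suffix from the name.
git clone "$workdir/datasets.git"

# Pass a second argument to choose the directory name yourself.
git clone "$workdir/datasets.git" my_datasets

# The hidden .git directory only shows up with the -a flag.
ls -a datasets
```

Either way, the clone lands in a new subdirectory of wherever you ran the command, not in the current directory itself.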

2. Creating a Git repository for a new project

To create a repository for a new project you can use the git init command in your terminal. Here, you’d navigate to your projects directory using cd and then enter git init followed by the name you want to assign to the repo, for example nptb_model.

cd /home/matt/projects
git init nptb_model

When you run this command, Git will create a directory called nptb_model for your repository files, inside which you’ll find the hidden .git directory where all the version control information is stored. Each time you make changes to your project you can “commit” them to the repository (more on this later) and you have the option of “pushing” them to a remote repository on a site like GitHub.
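
If you want to try this safely first, here’s a small sketch that does the same thing inside a temporary directory (an assumption made so the example doesn’t touch your real projects folder) and then lists the result.

```shell
set -eu
# Throwaway location standing in for your projects directory.
workdir=$(mktemp -d)
cd "$workdir"

# Create a new, empty repository in a directory called nptb_model.
git init nptb_model

# List the new directory, including the hidden .git directory.
ls -a nptb_model
```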

3. Creating a Git repository for an existing project

The process for creating a new Git repository from existing files is fairly similar. First, navigate to the project directory that contains your existing code using cd, then enter git init to initialise the new repository.

cd /home/matt/projects/nptb_model
git init

This will create a hidden .git directory containing the version control information within the nptb_model directory, allowing you to have local version control and push your repository to a remote server.
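
As a quick illustration, the sketch below initialises a repository in a directory that already contains a file. The model.py file and the temporary directory are hypothetical stand-ins for your existing project. Note that git init doesn’t start tracking the existing files by itself, so they show up as untracked.

```shell
set -eu
# A throwaway stand-in for an existing project directory.
project=$(mktemp -d)
cd "$project"
echo "print('hello')" > model.py   # hypothetical existing file

# Turn the existing directory into a Git repository.
git init

# Existing files start out untracked, shown as "??" in the short status format.
status_output=$(git status --short)
echo "$status_output"
```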

Picture by Yancy Min, Unsplash.

Git branches

Like most version control systems, Git is built around a concept called “branching”. Branches are basically different versions of your project, each with a specific name. The neat thing about branches is that they allow you and your colleagues to work on the same project simultaneously. They also allow you to create separate versions of your project in which you can test new ideas, without the risk of them impacting the main project’s code.

You are free to pick the names for your branches, but you’ll normally have one default branch, traditionally called master (newer repositories, including those created on GitHub, often name it main instead), that contains your final production-ready code. You would not normally work directly on the master branch. Instead, you’d create other branches and merge them into master. You can view your current branch via the terminal:

git status

Branch naming conventions vary between companies, but most lead engineers and architects I’ve worked with tend to have a development branch for stuff that’s in progress, and a staging branch for work nearing completion.

Each data scientist also works on their own local branch (often named to match a ticket number, like ticket123), which is then pushed to the repository for code review by the rest of the team, before being merged with development, staging, and eventually master once everyone is happy with the code quality. You can view your local branches with git branch.

git branch

1. Creating a branch

There are a couple of ways to create new branches in Git. You can type git branch ticket123 to do this, but you’ll then need to type git checkout ticket123 to change to that branch.

To make this quicker, there’s a useful shorthand you can use to both create the new branch and change to the new branch.

git checkout -b ticket123
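
Here’s a minimal sketch of that shorthand in a throwaway repository. The user name and email are placeholder values set only for this repository, and the README.txt file is a hypothetical stand-in, included because a branch needs at least one commit to point at.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init

# Placeholder identity so the commit below works on a fresh machine.
git config user.name "Example User"
git config user.email "user@example.com"

# Make a first commit for the new branch to start from.
echo "notes" > README.txt
git add README.txt
git commit -m "Initial commit."

# Create ticket123 and change to it in one step.
git checkout -b ticket123

# List local branches; the asterisk marks the current one.
git branch
current_branch=$(git symbolic-ref --short HEAD)
echo "Now on: $current_branch"
```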

2. Changing to a different branch

As I mentioned above, you wouldn’t normally work on the master branch. Instead, you’d create a ticket for the specific item you’re working on (e.g. ticket123 - Log transform data) in your ticketing system (such as GitLab, Jira or similar) and then create a local branch with a matching branch name, e.g. ticket123.

This allows others to easily identify your branch when it’s pushed to the remote repository. If you’re currently working on the ticket122 branch and want to change to the ticket125 branch that your colleague has been working on, you first need to fetch that branch from the remote repository to your local machine, and then check it out to change to it.
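
The fetch-then-checkout step can be sketched as follows. To keep it runnable offline, a bare repository on the local disk stands in for the shared remote, and two clones called mine and colleague stand in for the two workstations. All names, identities, and files here are assumptions for illustration. In a real project you’d only need the last two commands, and the remote is assumed to be called origin, which is the name Git gives it by default when you clone.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Local stand-in for the shared remote repository.
git init --bare shared.git

# Your clone: make a first commit, push it, and start work on ticket122.
git clone shared.git mine
cd mine
git config user.name "Example User"
git config user.email "user@example.com"
echo "project notes" > README.txt
git add README.txt
git commit -m "Initial commit."
git push origin HEAD
git checkout -b ticket122
cd ..

# Your colleague's clone: they commit on ticket125 and push it.
git clone shared.git colleague
cd colleague
git config user.name "Colleague"
git config user.email "colleague@example.com"
git checkout -b ticket125
echo "log transform" > ticket125.txt
git add ticket125.txt
git commit -m "Log transform data."
git push origin ticket125
cd ..

# Back in your clone: fetch the new branch, then change to it.
cd mine
git fetch origin
git checkout ticket125
```

Because no local ticket125 branch exists yet, git checkout creates one that tracks the copy fetched from origin.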

3. Merging code branches

Let’s say we’ve got a project with a master branch and have been working on some changes in a branch called ticket1. We can check the branches in the project by typing:

git branch
  master
* ticket1

We’ve already added our files to Git and committed our changes to the local ticket1 branch in our repository along with a commit message, so we’ll change to the master branch with the following command:

git checkout master
Switched to branch 'master'

Then, to merge the code from our ticket1 branch with the master we can use:

git merge ticket1

Updating 1624dbe..1f1f6c7
Fast-forward
 ticket1.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
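
If you’d like to reproduce a fast-forward merge like this from start to finish, here’s a rough sketch in a throwaway repository. The identity values and file contents are placeholders. Because the default branch might be called master or main on your machine, the sketch reads its name into a variable instead of assuming it.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init
git config user.name "Example User"
git config user.email "user@example.com"

# First commit on the default branch (master or main, depending on setup).
main_branch=$(git symbolic-ref --short HEAD)
echo "version 1" > ticket1.txt
git add ticket1.txt
git commit -m "Add ticket1.txt."

# Make and commit a change on the ticket1 branch.
git checkout -b ticket1
echo "version 2" > ticket1.txt
git add ticket1.txt
git commit -m "Update ticket1.txt."

# Change back to the default branch and merge. With no new commits on the
# default branch, Git can simply fast-forward it to match ticket1.
git checkout "$main_branch"
git merge ticket1
cat ticket1.txt
```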

Adding files to Git

When you work on a file you need to add it to Git so that your changes are tracked. By default, Git won’t track any changes unless you add them. If you integrate Git with your IDE, such as PyCharm, it can handle this for you. However, should you need to do it manually, you can do so with the command below:

git add filename.txt
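
To see what git add actually changes, you can compare the short status output before and after. This is a small sketch in a throwaway repository, and the file contents are a placeholder.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init
echo "useful information" > filename.txt

# Before adding, the file is untracked, shown as "??".
before=$(git status --short)

git add filename.txt

# After adding, the file is staged for the next commit, shown as "A".
after=$(git status --short)

echo "before: $before"
echo "after:  $after"
```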

Committing code to a repository

Each time you want to store a change, or a batch of related changes, you “commit” your changes to your local repository along with a commit message to explain what you changed. This basically creates a checkpoint that you can roll back to later if you wish.

The commit message is there to help you, and any other people examining your code, see what was changed, so is really useful for debugging and code review. To commit your file with a commit message you can use this command:

git add filename.txt
git commit -m "Added file to hold useful information."

If you want to change the message of your most recent commit, you can amend it afterwards using this command:

git commit --amend -m "Added file to hold useful information on Git."
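
Here’s a minimal sketch of a commit followed by an amend, run in a throwaway repository with placeholder identity values. Note that the --amend flag goes before -m, because -m treats whatever comes straight after it as the message. Amending replaces the last commit rather than adding a second one, so it’s generally best kept for commits you haven’t pushed yet.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init
git config user.name "Example User"
git config user.email "user@example.com"

# Stage and commit a file with an initial message.
echo "useful information" > filename.txt
git add filename.txt
git commit -m "Added file to hold useful information."

# Replace the message of that most recent commit.
git commit --amend -m "Added file to hold useful information on Git."

# There is still only one commit, now carrying the new message.
git log --oneline
commit_count=$(git rev-list --count HEAD)
last_message=$(git log -1 --format=%s)
```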

When you commit code the changes are stored in your local repository only until you push them to a remote Git repo. Therefore, they’ll only be available to you, and won’t get backed up, unless you push them.

Pushing your local commits to GitHub

Once you’re happy that your code is working, has been thoroughly tested, and is ready to be included within your project, you can push your commits to your remote repository. For example, let’s say I’d been fixing a bug in ticket123 and had tested it locally and wanted to push it to the repository.

I’d normally push this to GitHub using a branch of the same name as my ticket, and then let my co-workers know that the ticket was ready for their review by creating a “pull request”. One of my team would then pull my branch from GitHub and run it on their machine. They’d pore over my code and check it made sense, worked correctly, and didn’t break anything else.

Then, if they were happy, they’d mark that it had passed code review and another member of the team would then merge the branch onto the dev branch for further testing, or onto the master branch, so it could be deployed to production. Different teams have different approaches, but it’s generally something along these lines.

git push origin ticket123
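
The general form is git push followed by the name of the remote and then the name of the branch. The sketch below shows a push end to end without needing a GitHub account, using a bare repository on the local disk as a stand-in for GitHub. The remote name origin, the file names, and the identity values are all assumptions for illustration. With a real GitHub repository you’d pass its URL to git remote add instead, or skip that step if you cloned the repository, since cloning sets up origin for you.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Local stand-in for a GitHub repository.
git init --bare remote.git

# A new local project, linked to the stand-in remote under the name "origin".
git init project
cd project
git config user.name "Example User"
git config user.email "user@example.com"
git remote add origin "$workdir/remote.git"

# A first commit, then a fix committed on the ticket123 branch.
echo "notes" > README.txt
git add README.txt
git commit -m "Initial commit."
git checkout -b ticket123
echo "bug fix" > fix.txt
git add fix.txt
git commit -m "Fix bug for ticket123."

# Push the branch. The -u flag remembers origin/ticket123 as its upstream,
# so a plain "git push" works from this branch afterwards.
git push -u origin ticket123

# Show each local branch along with the remote branch it tracks.
git branch -vv
```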

Ignoring certain files with .gitignore

In many data science projects there will be files that you don’t want to put into your repository. For example, if you use a Python virtual environment you may have a massive venv directory, or you might have credentials that must be kept secret, or large data files that are better stored in the cloud.

You can prevent these selected files from being added to Git by creating a .gitignore file. This is a simple text file that sits within the root directory of your project and contains a list of the files and directories you want Git to ignore. Here’s an example from one of my projects.

venv
config
client_secrets.json
example.py
.idea
gapandas.egg-info
dist
build
*.ipynb
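
To check that a .gitignore is doing what you expect, you can try something like the sketch below. It uses a shortened version of the list above in a throwaway repository, and the analysis.ipynb, model.py, and venv/pyvenv.cfg files are hypothetical examples created only for the test.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init

# A shortened version of the .gitignore shown above.
printf '%s\n' venv client_secrets.json '*.ipynb' > .gitignore

# Create some files that should be ignored, and one that should not.
mkdir venv
touch venv/pyvenv.cfg client_secrets.json analysis.ipynb model.py

# Ignored files are left out of the status listing.
visible=$(git status --short)
echo "$visible"

# git check-ignore prints whichever of the given paths are ignored.
git check-ignore client_secrets.json analysis.ipynb
```

Only .gitignore itself and model.py should appear as untracked. Bear in mind that .gitignore only affects files Git isn’t already tracking, so a file committed earlier stays tracked even if you list it later.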

Matt Clarke, Monday, March 08, 2021

Matt Clarke

Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.