Git is the world’s most widely used version control system and is an essential tool for data scientists, especially those collaborating on projects with others. Most data science roles will expect you to be comfortable with it, so it’s an important one to master.
Git tracks the changes made within your projects, allows several people to collaborate on the same code, and keeps files remotely backed up and synchronised between different users, workstations, and servers. Here’s a quick guide to the basics of Git.
Git is a command line application and is so commonly used that it may already be installed on your machine. Assuming you’re using Ubuntu Linux, you can check whether Git is installed by typing
git --version into your terminal.
If Git is installed, you’ll see a message like
git version 2.25.1 telling you which release is installed. If it isn’t installed you can install it by entering the following command in your terminal, and entering your password:
sudo apt install -y git
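If you want to do the check and the install hint in one go, here’s a small sketch, assuming a POSIX shell. It only reports what it finds and prints the install command rather than running it, since that step needs sudo. The variable name git_state is just for illustration.

```shell
# Report the installed Git version, or print an install hint if Git is missing.
if command -v git >/dev/null 2>&1; then
    git_state="installed"
    git --version
else
    git_state="missing"
    echo "Git not found; on Ubuntu, run: sudo apt install -y git"
fi
```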
For each project you want to keep under version control, you create something called a local Git “repository”. Not to be confused with a suppository, a repository (or “repo”) comprises two main parts: the directories and files containing your code, and a hidden directory called
.git, which lies in the root of your repository directory.
The .git directory contains all the information that Git uses to track changes to your files, and allows you to create new versions of your code, store any changes, and roll back to previous versions. This all sits on your local machine, so it isn’t backed up unless you explicitly push it somewhere else.
While you can use your local Git repository without a third party service such as GitHub.com, to get the full benefits of Git (including off-site backup, collaboration, and code sharing) you’ll likely want to upload or “push” your code, along with the version history held in your
.git directory, to a remote Git repository server. Therefore, you may want to head to GitHub.com to create an account.
Create a GitHub account, if you don’t have one already.
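There are three main ways to create a Git repository, and you’ll probably use all three at some stage, so we’ll go through them one by one: cloning an existing remote repository, creating a new repository for a new project, and creating a repository from existing files.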
The GitHub.com site acts as a sort of app store for open source code, and you can download or “clone” a whole range of public Git repositories to your local machine. You can also publish or “push” your own Git repositories to GitHub and make them private to you only, share them with your colleagues, or make them freely available for anyone to use.
To clone a remote repository to your local machine, open a terminal and navigate to your desired destination directory using
cd, then enter
git clone followed by the URL of the Git repository you want to clone. In the example below, I’m cloning my
datasets repository, which includes some sample datasets I use for learning new modeling techniques.
cd /home/matt/projects/datasets
git clone https://github.com/flyandlure/datasets
After the command is executed, Git will fetch the code from the repository and place it in your desired directory, along with a hidden
.git directory containing the version control information. You can then use the code as you wish, including making your own modifications, which all get stored in the
.git directory once you commit them.
Cloning into 'datasets'...
remote: Enumerating objects: 18, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 18 (delta 4), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (18/18), 1.02 MiB | 1.91 MiB/s, done.
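git clone also accepts an optional second argument naming the target directory, and git remote -v shows where a clone came from. The sketch below doesn’t touch GitHub at all: a throwaway bare repository in a temporary directory stands in for the remote, and the names remote.git and my_copy are hypothetical.

```shell
set -e
# Work in a scratch directory so nothing touches real projects.
work_dir=$(mktemp -d)
cd "$work_dir"

# A bare repository plays the part of the remote server (stand-in for a GitHub URL).
git init -q --bare remote.git

# Clone it; the second argument sets the name of the local directory.
# Git warns that the repository is empty, which is expected here, so stderr is silenced.
git clone -q remote.git my_copy 2>/dev/null

# The clone has its own hidden .git directory and remembers where it came from.
cd my_copy
ls -a
git remote -v
origin_url=$(git remote get-url origin)
```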
To create a repository for a new project you can use the
git init command in your terminal. Here, you’d navigate to your projects directory using
cd and then enter
git init followed by the name you want to assign to the repo, for example:
cd /home/matt/projects
git init nptb_model
When you run this, Git will create a directory called
nptb_model for your repository files, inside which you’ll find the hidden
.git directory where all the version control information is stored. Each time you make changes to your project you can “commit” them to the repository (more on this later) and you have the option of “pushing” them to a remote repository on a site like GitHub.
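To see this for yourself, here’s a minimal sketch that runs the same git init step inside a temporary directory (a stand-in for /home/matt/projects, so it’s safe to try) and then looks for the hidden directory.

```shell
set -e
projects_dir=$(mktemp -d)   # stand-in for /home/matt/projects
cd "$projects_dir"

# Create the project directory and an empty repository inside it.
git init -q nptb_model

# The hidden .git directory is what makes this folder a repository.
ls -a nptb_model

# git status confirms Git recognises the folder; there are no commits yet.
cd nptb_model
git status
```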
The process for creating a new Git repository from existing files is fairly similar. First, navigate to the project directory that contains your existing code using
cd, then enter
git init to initialise the new repository.
cd /home/matt/projects/nptb_model
git init
This will create a hidden
.git directory containing the version control information within the
nptb_model directory, allowing you to have local version control and push your repository to a remote server.
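One thing worth knowing is that initialising a repository doesn’t start tracking your existing files automatically. The sketch below assumes a temporary directory containing a single hypothetical file called model.py, and uses git status --short to show that the file starts out untracked (marked with ??).

```shell
set -e
project_dir=$(mktemp -d)            # stand-in for an existing project folder
cd "$project_dir"
echo "print('hello')" > model.py    # hypothetical existing file

# Initialise the repository in place, around the files that are already there.
git init -q

# Existing files show up as untracked (??) until you add and commit them.
untracked=$(git status --short)
echo "$untracked"
```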
Picture by Yancy Min, Unsplash.
Most version control systems, including Git, support a concept called “branching”. Branches are basically different versions of your project, each with a specific name. The neat thing about branches is that they allow you and your colleagues to work on the same project simultaneously. They also allow you to create separate versions of your project in which you can test new ideas, without the risk of them impacting the main project’s code.
You are free to pick the names for your branches, but by convention you’ll have one called the
master branch (newer repositories often call it main instead) that contains your final production-ready code. You would not normally work directly on the
master branch. Instead, you’d create other branches and merge them with the
master. You can check which branch you’re currently on from the terminal using git branch, which marks the current branch with an asterisk.
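Here’s one way to check the current branch, sketched in a throwaway repository. It assumes Git 2.22 or newer for the --show-current option. The branch name you see will be master or main depending on how your Git is configured.

```shell
set -e
cd "$(mktemp -d)"
git init -q

# Print only the name of the branch you are on (needs Git 2.22 or newer).
current_branch=$(git branch --show-current)
echo "On branch: $current_branch"

# git status reports the same thing in its first line.
git status
```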
Branch naming conventions vary between companies, but most lead engineers and architects I’ve worked with tend to have a
development branch for stuff that’s in progress, and a
staging branch for work nearing completion.
Each data scientist also works on their own local branch (often named to match a ticket number, like
ticket123), which is then pushed to the repository for code review by the rest of the team, before being merged with
staging, and eventually
master once everyone is happy with the code quality. You can list the branches in your repository from the terminal using git branch.
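As a rough sketch of listing branches, the example below builds a throwaway repository with development and staging branches like the ones described above. The user name and email are hypothetical values set only for this demo repository.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Demo User"            # hypothetical identity, local to this repo
git config user.email "demo@example.com"
git commit -q --allow-empty -m "Initial commit."

# Create two extra branches without switching to them.
git branch development
git branch staging

git branch        # local branches; the current one is marked with *
git branch -a     # also lists remote-tracking branches, if any exist
branch_count=$(git branch | wc -l | tr -d ' ')
```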
There are a couple of ways to create new branches in Git. You can type
git branch ticket123 to do this, but you’ll then need to type
git checkout ticket123 to change to that branch.
To make this quicker, there’s a useful shorthand you can use to both create the new branch and change to the new branch.
git checkout -b ticket123
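The two approaches can be compared side by side in a throwaway repository. This is only a sketch: the identity values are hypothetical, and ticket122 is used for the two-step version so that both branches can exist at once.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Demo User"            # hypothetical identity, local to this repo
git config user.email "demo@example.com"
git commit -q --allow-empty -m "Initial commit."

# Two steps: create the branch, then switch to it.
git branch ticket122
git checkout -q ticket122

# One step: create the branch and switch to it in a single command.
git checkout -q -b ticket123
current_branch=$(git branch --show-current)
echo "Now on: $current_branch"
```

On Git 2.23 and later, git switch -c ticket123 does the same job as git checkout -b ticket123.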
As I mentioned above, you wouldn’t normally work on the
master branch. Instead, you’d create a ticket for the specific item you’re working on (e.g.
ticket123 - Log transform data) in your ticketing system (such as GitLab, Jira or similar) and then create a local branch with a matching name, in this case ticket123.
This allows others to easily identify your branch when it’s pushed to the remote repository. If you’re currently working on the
ticket122 branch, and want to change to the
ticket125 branch that your colleague has been working on, you first need to fetch it from the remote repository to your local machine,
then check it out to change to that branch.
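Here’s a sketch of that workflow using git fetch followed by git checkout --track. Everything runs in a temporary directory, with a local bare repository standing in for the shared remote and two clones (colleague and mine, both hypothetical names) standing in for two people’s machines.

```shell
set -e
work_dir=$(mktemp -d)
cd "$work_dir"
git init -q --bare remote.git                 # stand-in for the shared remote

# Colleague's clone: push an initial commit so the remote has some history.
git clone -q remote.git colleague 2>/dev/null
cd colleague
git config user.name "Colleague"              # hypothetical identity
git config user.email "colleague@example.com"
git commit -q --allow-empty -m "Initial commit."
git push -q origin HEAD

# Your clone, where you are busy on ticket122.
cd "$work_dir"
git clone -q remote.git mine
cd mine
git checkout -q -b ticket122

# Meanwhile, the colleague creates ticket125 and pushes it.
cd "$work_dir/colleague"
git checkout -q -b ticket125
git commit -q --allow-empty -m "Work on ticket125."
git push -q origin ticket125

# Back in your clone: download the new branch, then switch to a local copy that tracks it.
cd "$work_dir/mine"
git fetch -q origin
git checkout -q --track origin/ticket125
current_branch=$(git branch --show-current)
echo "Now on: $current_branch"
```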
Let’s say we’ve got a project with a
master branch and have been working on some changes in a branch called
ticket1. We can check the branches in the project by typing:
git branch
* master
  ticket1
We’ve already added our files to Git and committed our changes to the local
ticket1 branch in our repository along with a commit message, so we’ll change to the
master branch with the following command:
git checkout master
Switched to branch 'master'
Then, to merge the code from our
ticket1 branch with the
master we can use:
git merge ticket1
Updating 1624dbe..1f1f6c7
Fast-forward
 ticket1.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
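The whole merge can be reproduced end to end in a throwaway repository. The sketch below assumes hypothetical file contents and identity values. Because nothing new is committed on the main branch in the meantime, Git performs a fast-forward merge, like the one in the output above.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Demo User"            # hypothetical identity, local to this repo
git config user.email "demo@example.com"

echo "first draft" > ticket1.txt
git add ticket1.txt
git commit -q -m "Add ticket1.txt."
main_branch=$(git branch --show-current)    # master or main, depending on your Git settings

# Make a change on the ticket1 branch.
git checkout -q -b ticket1
echo "second draft" > ticket1.txt
git commit -q -am "Update ticket1.txt."

# Switch back and merge the ticket1 branch in.
git checkout -q "$main_branch"
git merge ticket1
merged_text=$(cat ticket1.txt)
```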
When you work on a file you need to add it to Git so that your changes are tracked. By default, Git won’t track any changes unless you add them. If you integrate Git with your IDE, such as PyCharm, it can handle this for you. However, should you need to do it manually you can do so with the below command:
git add filename.txt
Each time you want to store a change, or a batch of related changes, you “commit” your changes to your local repository along with a commit message to explain what you changed. This basically creates a checkpoint that you can roll back to later if you wish.
The commit message is there to help you, and any other people examining your code, see what was changed, so is really useful for debugging and code review. To commit your file with a commit message you can use this command:
git add filename.txt
git commit -m "Added file to hold useful information."
If you want to change the message on your most recent commit, you can amend it afterwards using this command. Note that the message must come directly after the -m option:
git commit --amend -m "Added file to hold useful information on Git."
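Here’s a sketch of the add, commit, and amend cycle in a throwaway repository, with hypothetical identity values. It checks that amending replaces the last commit rather than adding a second one.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Demo User"            # hypothetical identity, local to this repo
git config user.email "demo@example.com"

echo "notes" > filename.txt
git add filename.txt
git commit -q -m "Added file to hold useful information."

# Replace the message of the most recent commit.
# --amend rewrites that commit instead of creating a new one.
git commit -q --amend -m "Added file to hold useful information on Git."

last_message=$(git log -1 --format=%s)
commit_count=$(git rev-list --count HEAD)
git log --oneline
```

Since amending rewrites the commit, it’s safest to do this only on commits you haven’t pushed yet.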
When you commit code the changes are stored in your local repository only until you push them to a remote Git repo. Therefore, they’ll only be available to you, and won’t get backed up, unless you push them.
Once you’re happy that your code is working, has been thoroughly tested, and is ready to be included within your project, you can push your commits to your remote repository. For example, let’s say I’d been fixing a bug in
ticket123, had tested it locally, and wanted to push it to the repository.
I’d normally push this to GitHub using a branch of the same name as my ticket, and then let my co-workers know that the ticket was ready for their review by creating a “pull request”. One of my team would then pull my branch from GitHub and run it on their machine. They’d pore over my code and check it made sense, worked correctly, and didn’t break anything else.
Then, if they were happy, they’d mark that it had passed code review and another member of the team would then merge the branch into the
dev branch for further testing, or into the
master branch, so it could be deployed to production. Different teams have different approaches, but it’s generally something along these lines.
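Assuming your remote is called origin (the default name Git gives it when you clone), pushing the ticket123 branch looks like this:
git push origin ticket123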
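git push takes the name of the remote first and the branch second. The sketch below pushes a ticket123 branch to a remote called origin, using a local bare repository in a temporary directory as a stand-in for GitHub. The file name fix.txt and the identity values are hypothetical.

```shell
set -e
work_dir=$(mktemp -d)
cd "$work_dir"
git init -q --bare remote.git               # stand-in for GitHub
git clone -q remote.git project 2>/dev/null
cd project
git config user.name "Demo User"            # hypothetical identity, local to this repo
git config user.email "demo@example.com"

git checkout -q -b ticket123
echo "bug fix" > fix.txt
git add fix.txt
git commit -q -m "Fix bug for ticket123."

# Push the branch to the remote named origin.
# -u records origin/ticket123 as the upstream for later pushes and pulls.
git push -q -u origin ticket123

# Confirm the remote now has the branch.
remote_heads=$(git ls-remote --heads origin ticket123)
echo "$remote_heads"
```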
In many cases in data science there are going to be files in your project that you don’t want to put into your repository. For example, if you use a Python virtual environment you may have a massive
venv directory, or you might have secure credentials, or data, that should be placed in the cloud.
You can prevent these selected files from being added to Git by creating a
.gitignore file. This is a simple text file that sits within the root directory of your project and contains a list of the files and directories you want Git to ignore. Here’s an example from one of my projects.
venv
config
client_secrets.json
example.py
.idea
gapandas.egg-info
dist
build
*.ipynb
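You can check that an ignore list is doing what you expect with git status and git check-ignore. The sketch below uses a shortened version of the list above in a throwaway repository. The files it creates (pyvenv.cfg, notebook.ipynb, model.py) are hypothetical examples.

```shell
set -e
cd "$(mktemp -d)"
git init -q

# A shortened ignore list based on the example above.
printf '%s\n' 'venv' 'client_secrets.json' '*.ipynb' > .gitignore

# Create some paths that should be ignored, plus one that should not.
mkdir venv
touch venv/pyvenv.cfg client_secrets.json notebook.ipynb model.py

# git status only lists paths Git is willing to track; ignored ones are absent.
visible=$(git status --short)
echo "$visible"

# git check-ignore -v shows which pattern matched each ignored path.
git check-ignore -v client_secrets.json notebook.ipynb
```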
Matt Clarke, Monday, March 08, 2021