How to use Docker for your data science projects

Learning to use Docker for data science projects will make configuring, deploying, and sharing models with colleagues much easier.


How many times have you struggled to get Python packages like TensorFlow, Keras, or PyTorch working together? How many times have you downloaded code or shared yours with others only to find that it didn’t work due to some kind of software conflict? If you’re like most data scientists, the answer is probably “almost every day”. Thankfully, there is a piece of software called Docker designed to solve this very problem, as well as several others. Learning how to use Docker for data science could change the way you work forever.

What is Docker?

Docker is a “platform as a service” system that lets you run, deploy, and distribute software in packages called containers. Containers are a little like virtual machines. You might be familiar with VMs if you’ve used VMware or VirtualBox. Both VMs and containers give you an isolated, self-contained environment running inside your physical machine, but containers virtualise at the operating system level and share the host’s kernel, rather than running a full guest operating system. Containers have some notable benefits over VMs and have grown rapidly in popularity as a result. Most large businesses, including Google, Facebook, and Amazon, use containers for data science development, deploying models, and powering software services.

Why use Docker over a virtual machine?

There are two key reasons: size and speed. To create a virtual machine image (VMI) in VMware or VirtualBox you need to install an entire operating system, such as Linux or Windows, from a DVD ISO image into a virtual environment within your main machine. These VMIs are typically many gigabytes in size, so they’re not very portable and are slow to download, install, and configure.

They also behave like a full operating system: they take a minute or more to boot up and need to be shut down like a physical computer. They’re great for providing an entire self-contained virtual operating system, or for creating a standardised development environment, but they are less useful for deploying applications.

Docker containers are much smaller than VMs because they all share the host’s operating system kernel, rather than each running its own kernel as a VM does. They are quicker to download, take up less disk space, and use far fewer system resources. The Alpine Linux Docker image, for example, is just 5MB in size! They also start up much more quickly - many in just a split second.

The combination of lower resource usage, smaller size, and faster start-up means that you can run several Docker containers on a single machine, with each one performing a particular task. For example, you might have a container for your application, a container for MySQL, a container for Elasticsearch, and a container for Kibana. If you tried doing this with virtual machines, you’d quickly run out of resources and your machine would grind to a halt.


Why do people use Docker for data science?

The complex software we use in data science can take time and effort to set up, and things don’t always work consistently when software versions differ slightly. Python virtual environments and using pip to install the packages listed in your requirements.txt can help, but they’re not perfect. Using Docker means you can give everyone in your team the same data science environment, so the software runs correctly and identically on each machine, and you can deploy the same Docker container to a production server, so it works perfectly there too. Your projects will be shareable and reproducible.

Docker, therefore, lets you create an environment to your exact specifications, save the configuration, and share it with others so they can recreate an identical environment on their local machines - and it lets you do the same with your server, which just isn’t practical using a VM. There are tens of thousands of freely available Docker images you can download to obtain a pre-configured environment, application, or OS, so learning how to use Docker can be a big time saver.

As everything in a Docker container is separated or “containerized”, what’s installed in your Docker container won’t affect your machine’s main software configuration either, so you can run both TensorFlow 1 and 2 on one machine at the same time. If you need a MySQL database or Apache Spark, you can run these in containers too and they can all speak to each other seamlessly. If you update the software on your machine, it won’t impact your Docker containers. You can even store the configurations in Git and just pull them down to any machine you need them on.
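As a quick illustration (a sketch, assuming the tensorflow/tensorflow image tags shown here are still available on Docker Hub), you could check two different TensorFlow versions side by side without installing either on your host:

docker run --rm tensorflow/tensorflow:1.15.5-py3 python -c "import tensorflow as tf; print(tf.__version__)"
docker run --rm tensorflow/tensorflow:2.4.1 python -c "import tensorflow as tf; print(tf.__version__)"

The --rm flag simply removes each container once the command has finished.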

What type of machine is required to run Docker?

Docker started out as a Linux application but is now also available for Windows and macOS, as well as various other systems, such as FreeBSD. If you prefer a graphical user interface, the Mac and Windows versions of Docker Desktop include one, though the majority of data scientists use Linux. It’s not resource hungry - the minimum requirements are a 64-bit processor, at least 4 GB of RAM, and a BIOS that supports hardware virtualization. These are common to most modern PCs, including laptops and Macs.

Containers work across devices. You can create a Docker container on your Mac laptop and share it with colleagues who use Linux, and then deploy it on a server running Linux or in AWS.


How does Docker work?

At the most basic level, Docker consists of four main things: the Docker daemon, the Docker client, the Docker registry, and Docker objects.

Docker daemon: The daemon or dockerd is an application which runs in the background on your machine, listens for Docker API requests, and manages Docker objects such as images and containers. It’s basically the middleman between you and the Docker system.

Docker client: The client or docker is the application used to send commands to the Docker daemon. For example, you might tell it to run a particular container, or pull one from the registry.

Docker registry: The registry is a bit like an app store for Docker. You can download any of thousands of pre-built Docker images from here and you can upload your own to share with your colleagues or others.

Docker objects: Docker objects are the things you create or pull from the registry. Most commonly these objects are images and containers. An image is a read-only template containing the instructions and layers needed to assemble an environment; a container is a “runnable” instance of an image that works out of the box.


Getting started

1. Install Docker

To install Docker on an Ubuntu 20.04 workstation, open a terminal and run the commands below. This will install Docker, add your user to the docker group so you can run Docker commands without sudo, and start the Docker service every time you boot. You’ll need to log out and back in for the group change to take effect.

sudo apt install docker.io
sudo usermod -aG docker $USER
sudo systemctl enable --now docker
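If you don’t want to log out and back in straight away, you can start a new shell with the updated group membership instead:

newgrp docker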

2. Test Docker

To confirm that Docker has been installed, enter docker --version, which will return the details of the Docker version and build you’re running.

(base) matt@SonOfAnton:~$ docker --version
Docker version 19.03.8, build afacb8b7f0

Next, we’ll run Docker’s tiny “hello-world” image to confirm everything works. You can do this by entering docker run hello-world. If this works, Docker will download the hello-world image from the Docker registry, run it, and you should see the words below.
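Using the same prompt as before, the command and the start of its output look like this:

(base) matt@SonOfAnton:~$ docker run hello-world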

Hello from Docker!

3. Basic Docker commands

There are lots of commands you can use with Docker. Some of the main ones are below, with a short example session after the list, but you can see the full list in the Docker documentation or by typing docker --help into your terminal.

  • Run a container from an image: docker run image-name-here
  • Start a stopped container: docker start container-name-here
  • Stop a running container: docker stop container-name-here
  • Pull an image from a registry: docker pull image-name-here
  • Push an image to a registry: docker push image-name-here
  • Build a Docker image from a Dockerfile: docker build -t image-name-here .
  • List all Docker images: docker images --all
  • Get information on your Docker installation: docker info
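To give a sense of how these fit together, here’s a short example session using the tiny Alpine Linux image:

docker pull alpine
docker run -it alpine sh
docker ps --all
docker images --all

The first command downloads the image, the second starts a container from it and opens a shell inside it (type exit to leave), and the last two list your containers and images.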


Using a pre-built Docker container for data science

Where Git has GitHub, allowing you to find thousands of open source applications you can use in your work and store or share your own, Docker has Docker Hub. It’s packed full of thousands of pre-built Docker images, containers, and plugins that you can install quickly and easily, just like the hello-world example we used above.

If you’ve ever pulled out your hair while trying to configure Keras, TensorFlow and other packages so they all work properly together, you’ll love this, because you can download pre-configured environments in which everything just works. You can also save your own to re-use it in future or share it with your colleagues.

1. Find a Docker container

You can search for Docker images and containers that might be of interest by typing docker search "search term here" in your terminal, but the easiest way is to browse them on Docker Hub, as each one includes some information on what’s included and how you can use it. Let’s try jupyter/tensorflow-notebook. This provides a full TensorFlow setup, plus Scikit-Learn, Pandas, Jupyter notebooks, and loads of other packages data scientists commonly use. It’s one of a range of excellent Docker containers for data science offered by the Jupyter Project team.

2. Install the container

To install the jupyter/tensorflow-notebook environment with Docker, open up a terminal and enter the command docker run jupyter/tensorflow-notebook. This will pull the image, install all the packages, and set everything up so it’s ready to use. As this is quite a full-featured Docker image, expect the process to take a while, as it needs to download all the requirements and run the install scripts.
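If you’d rather download the image before starting anything, you can pull it explicitly first; docker run will then reuse the local copy rather than downloading it again:

docker pull jupyter/tensorflow-notebook
docker run jupyter/tensorflow-notebook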

Once Docker has finished installing the code and setting everything up, it will automatically run the jupyter notebook command you’d normally enter in your terminal. Close this by pressing ctrl c in your terminal. If you try to open the link shown in the terminal, you’ll get a “site cannot be reached” error, as there’s an extra step we need to take before it’s ready for use.

3. Fire up the container

To gain access to the Jupyter TensorFlow Notebook, open up a terminal and enter docker run -p 8888:8888 jupyter/tensorflow-notebook. This will map port 8888 of the Docker container to port 8888 of your machine. As the image is already present, Docker will just fire this up rather than downloading everything again. The terminal will show Jupyter starting up and will display the URL of the notebook. Hold ctrl and click the link to access your containerized Jupyter environment.
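If port 8888 is already in use on your machine, you can map the container’s port 8888 to a different host port instead, for example:

docker run -p 8889:8888 jupyter/tensorflow-notebook

You’d then browse to localhost:8889, changing the port in the URL Jupyter prints.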

4. Connect your host filesystem to the container

If you create a Jupyter notebook file now, it will be stored inside the container rather than on your machine, and it will disappear when the container is removed, because Docker isn’t yet configured to work with your local filesystem. What we’re going to do next is get Jupyter to create and store notebooks on your machine, but do everything else within the container. First, create a directory in your preferred location on your machine.

mkdir /home/matt/Development/notebooks

The Jupyter notebook containers, weirdly, use a home directory on the container called /home/jovyan, so next we need to close the running Jupyter instance with ctrl c and issue a modified command. This will run the container, map the 8888 port on the container to the 8888 port on our machine, and map the /home/matt/Development/notebooks directory to /home/jovyan on the container. Here’s the command:

docker run -p 8888:8888 -v ~/Development/notebooks:/home/jovyan jupyter/tensorflow-notebook

Now, when you create a notebook, it will be saved to the notebooks directory on your machine, while all of the applications are kept within the container.

5. Create an alias

To finish off, we’ll create a helpful shortcut command called an alias to let us start the TensorFlow notebook without typing a long and complicated command every time.

Open up the .bashrc file on your Linux machine using your favourite editor, e.g. gedit ~/.bashrc (you don’t need sudo to edit your own .bashrc). Scroll to the bottom and add the line below, modified to suit your specific set-up. Now, when you type tfnb, the long command will be run and the TensorFlow Jupyter Notebook will start up. (The other way to do this is using Docker Compose, which we’ll cover separately.) Once it’s saved, open a new terminal and enter tfnb and you should see the TensorFlow Jupyter environment start up on the right ports, with the directory mapped to the one on your workstation.

alias tfnb='docker run -p 8888:8888 -v ~/Development/notebooks:/home/jovyan jupyter/tensorflow-notebook'
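If you don’t want to open a new terminal, you can reload your shell configuration in the current one and test the alias straight away:

source ~/.bashrc
tfnb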


Containerize a data science application

Finally, let’s create a completely custom Docker container for a Python data science application. While the application we’ll be containerizing is going to be very basic, the concepts we’ll be using are identical for larger, more complicated models and applications, so you can just swap out the Python code for your own and modify the configuration to suit your own project.

Our container is going to do the following:

  • Start from an official Python base image (which is built on Debian Linux)
  • Set up a working directory and copy in our project code
  • Use pip to install our project dependencies
  • Run a simple Python application we can interact with

1. Create your Python code

For this example, we’ll create a simple script and place it in a directory so we can containerize it. Here, we’re using the Hugging Face Transformers sentiment-analysis pipeline to load a model, accept some text from the user, and return a prediction on whether the text has a positive or negative sentiment. Save the Python code to a file called sentiment.py.

from transformers import pipeline

# Load the default sentiment-analysis pipeline (downloads the model on first run)
nlp = pipeline('sentiment-analysis')

# Prompt the user for some text and print the predicted sentiment
text = input("Enter text: \n")
output = nlp(text)
print(output)

I’d suggest trying this in a Jupyter notebook before continuing. To get it to work you’ll need to pip install transformers and torch. On the first run, it will download the model from Hugging Face.
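If you’re testing the script outside the container first, you can install both dependencies in one go:

pip install transformers torch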

2. Create a Dockerfile

Next, we’ll create a Dockerfile. This is a basic text-based configuration file into which we can provide a series of commands to Docker, and to the operating system, if we need to. Add the contents below and save the file as Dockerfile with no file extension.

FROM python:3.8.5-buster
WORKDIR /app
COPY . /app
RUN pip install transformers
RUN pip install torch
CMD ["python3", "sentiment.py"]

As you can probably tell already, the FROM command tells Docker to use the official Python Docker image for version 3.8.5, which is built on Debian Buster. It loads this up, sets /app as the working directory, then copies the contents of the current directory . to the /app folder. It then uses pip to install the required packages and runs the Python script.
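If your project has more than a couple of dependencies, a common pattern (not required for this example) is to list them in a requirements.txt file alongside your Dockerfile and install them in a single step. A minimal sketch, assuming such a file exists:

FROM python:3.8.5-buster
WORKDIR /app
COPY requirements.txt /app/
RUN pip install -r requirements.txt
COPY . /app
CMD ["python3", "sentiment.py"]

Copying requirements.txt before the rest of the code means Docker can cache the dependency layer, so rebuilding after a code change doesn’t reinstall everything.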

3. Build a Docker image

Finally, cd into the directory containing your Python code and your Dockerfile, then enter the command below in your terminal. This will take the contents of the current directory . and build them into a Docker image tagged sentiment-analysis.

docker build -t sentiment-analysis .

The build process will take a few minutes, as Docker will need to download the base image you’re using (python:3.8.5-buster in our case) and install Transformers and PyTorch, which are quite large packages. If you make a mistake and want to rebuild the image from scratch, you can use docker build -t sentiment-analysis . --no-cache and repeat the process until you’ve resolved your issue.

4. Run the Docker container

To run the Docker container, enter the command below in your terminal (the -it flags attach your terminal to the container so the input prompt works). You should see it fire up, load the model, and then show you a prompt into which you can enter some text. Hit return and the model should provide a prediction on its sentiment. You can now share your script and Dockerfile with others, and they can run docker build on the Dockerfile to get exactly the same environment as you.

docker run -it sentiment-analysis

Obviously, this only runs the model once, which is a big drawback. However, you could easily modify it with a while loop (a rough sketch follows below) or create a simple API to turn it into an application that runs as a service. I’ll cover how to do that in a later article.
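For the while-loop option, here’s an illustrative sketch (keeping the rest of the setup unchanged) of how sentiment.py could be adapted:

from transformers import pipeline

# Load the pipeline once, then keep prompting until the user types 'quit'
nlp = pipeline('sentiment-analysis')

while True:
    text = input("Enter text (or 'quit' to exit): \n")
    if text.strip().lower() == 'quit':
        break
    print(nlp(text))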

Matt Clarke, Tuesday, March 02, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.