How to Dockerize a data science application

Learn how to Dockerize your data science application in five minutes so your code runs perfectly when shared with colleagues.


The phrase “It works on my machine” is heard in most offices with a data science department. It’s common to write an application that works perfectly in your local data science environment, only to find that your colleagues run into problems with the exact same code.

Small differences in the version numbers of the packages you’re using can cause compatibility issues that prevent others from reviewing your code or running your model on their development environment.

Docker is designed to overcome this issue and has become a standard tool for many data scientists and data engineers. Docker is a containerization application that lets you package your code in a portable, isolated environment, complete with the operating system libraries and the specific package versions required to run it.

In this project I’ll show you how you can Dockerize a data science application. This will take your local application, wrap it up with a base operating system image and your preferred Python version, and then install the packages required to run it.

Install Docker

Firstly, I’m assuming that, like many data scientists, you’re running an Ubuntu data science workstation. Open a terminal and enter the commands below to install Docker using the apt package management system on Ubuntu.

After you’ve installed docker.io, add your user to the docker group so you can run docker commands without sudo. The group change only takes effect the next time you log in. Then enable and start the docker service. To confirm that Docker has installed, run docker --version to see the version number.

sudo apt install docker.io
sudo usermod -aG docker $USER
sudo systemctl enable --now docker
docker --version
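
Group changes made with usermod only apply to new login sessions, so a quick check can save some confusion. The sketch below is one way to confirm that the current session is in the docker group and that the daemon can run a container. It assumes a POSIX shell; hello-world is Docker's small test image, and the messages are just illustrative.

```shell
# List the groups of the current shell session (not the group database).
session_groups="$(id -nG)"

if printf '%s\n' "$session_groups" | grep -qw docker; then
    echo "This session is in the docker group."
else
    echo "Not in the docker group yet: log out and back in, or run 'newgrp docker'."
fi

# Run Docker's small test image if the docker CLI is installed.
if command -v docker >/dev/null 2>&1; then
    docker run --rm hello-world || echo "Could not run hello-world; check the daemon and your group membership."
fi
```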

Create a Dockerfile

Next, we’ll create a Dockerfile. This is a simple text file that should be added to the folder in which your data science application resides. The file needs to be called Dockerfile and have no file extension.

The Dockerfile we’re creating is a really simple one. It first pulls the python:3.8.5-buster image from Docker Hub, then sets /app as the working directory and copies your files into it. The image is based on Debian Buster and includes Python 3.8.5 and everything else you need to run your code.

During the build, Docker runs the pip install command to install pandas and numpy from PyPI, the Python Package Index. The CMD line then sets python3 automation.py as the command that runs whenever a container is started from the image, which executes the script for the specific application I’m Dockerizing.

FROM python:3.8.5-buster
WORKDIR /app
COPY . /app
RUN pip install pandas
RUN pip install numpy
CMD ["python3", "automation.py"]
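
The Dockerfile above installs whatever versions of pandas and numpy are newest at build time. If you want every build to use the same versions, one common approach is to pin them in a requirements.txt file. The sketch below writes both files into a scratch directory; the directory name docker-demo and the version numbers are illustrative assumptions, not taken from this project.

```shell
# Work in a scratch directory so existing files are not overwritten.
mkdir -p docker-demo

# Pin exact package versions (illustrative numbers, not from the article).
cat > docker-demo/requirements.txt <<'EOF'
pandas==1.1.5
numpy==1.19.5
EOF

cat > docker-demo/Dockerfile <<'EOF'
FROM python:3.8.5-buster
WORKDIR /app
# Copy the requirements first so the install layer stays cached
# until requirements.txt itself changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app
CMD ["python3", "automation.py"]
EOF
```

Copying requirements.txt before the rest of the code means that editing automation.py does not force Docker to reinstall the packages on the next build.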

Dockerize the application

To test that your Dockerfile correctly Dockerizes your application, cd into the directory containing your application and the Dockerfile, then run the docker build command to build the image.

The -t automation option tags the resulting image with the name automation, while the . sets the build context to the current directory, which is where Docker looks for the Dockerfile and the files to copy. The first build will usually take a few minutes, depending on the size of the base image you used, because it has to be downloaded from Docker Hub.

docker build -t automation .
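
Because the . sends the whole directory to Docker as the build context, and COPY . /app copies it into the image, it can help to exclude files the application does not need. A .dockerignore file does this. The sketch below writes one into a scratch directory; the directory name docker-demo and the list of patterns are assumptions you would adapt to your own project.

```shell
# Work in a scratch directory so existing files are not overwritten.
mkdir -p docker-demo

cat > docker-demo/.dockerignore <<'EOF'
# Version control and notebook state
.git
**/.ipynb_checkpoints
# Python caches and local virtual environments
**/__pycache__
**/*.pyc
venv
EOF
```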

Run the Dockerized application

Next, you can run your Dockerized application by typing docker run automation (where automation is the name you gave to your image). This starts a new container from the image and runs python3 automation.py to execute your code. The packages were already installed when the image was built, so they are not reinstalled on each run.

docker run automation
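
Two small variations on the run command are often useful. This is a sketch, assuming the image was tagged automation as above: --rm removes the stopped container when the script finishes, and any arguments after the image name replace the CMD from the Dockerfile, which makes it easy to list the package versions inside the image.

```shell
IMAGE="automation"   # the tag given to docker build -t

if command -v docker >/dev/null 2>&1; then
    # --rm deletes the stopped container once the script exits.
    docker run --rm "$IMAGE" || echo "Run failed; has the image been built?"
    # Arguments after the image name replace the Dockerfile's CMD,
    # here to list the package versions installed in the image.
    docker run --rm "$IMAGE" pip freeze || echo "Could not list packages."
else
    echo "docker CLI not found; install Docker first."
fi
```

Comparing the pip freeze output with a colleague's is a quick way to confirm you are both running the same package versions.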

Share your code

Now you can share your application with your colleagues with the Dockerfile included in the application directory or Git repository. When they download your code, your colleagues can run the docker build -t automation . command (including the trailing dot) to build an image from the Dockerfile and then use docker run automation to execute it. It should then behave just as it did on your local machine.

Matt Clarke, Saturday, December 18, 2021

Matt Clarke Matt is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
