The phrase “It works on my machine” is a common one in most offices with a data science department. It’s very common to write an application that works perfectly in your local data science environment only to find your colleagues run into problems with the exact same code.
Small differences in the version numbers of the packages you’re using can cause compatibility issues that prevent others from reviewing your code or running your model on their development environment.
Docker is designed to overcome this issue and has become an industry standard tool for most data scientists and data engineers. Docker is a containerization application that lets you create a portable virtual operating system environment, complete with your code and the custom package versions required to run it.
In this project I’ll show you how you can Dockerize a data science application. This will take your local application, wrap it up with a host operating system and your preferred Python version, and then install the packages required to run it flawlessly.
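To make the walkthrough concrete, here is the kind of script this project assumes lives in your application folder. The filename automation.py matches the Dockerfile later in the article, but the contents are a placeholder of my own; substitute your actual application.

```python
# automation.py - a hypothetical stand-in for your data science application.
# It computes some simple statistics with numpy and prints them, so there is
# visible output when the container runs.

import numpy as np


def summarise(values):
    """Return basic descriptive statistics for a sequence of numbers."""
    arr = np.asarray(values, dtype=float)
    return {
        "mean": float(arr.mean()),
        "std": float(arr.std()),
        "n": int(arr.size),
    }


if __name__ == "__main__":
    stats = summarise([3.0, 5.0, 7.0])
    print(stats)
```

Any script works here, as long as the CMD line in the Dockerfile points at it and the packages it imports are installed in the image.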
Firstly, I’m assuming that, like most data scientists, you’re running an Ubuntu data science workstation. Open a terminal and enter the commands below to install Docker using the apt package manager.
After you’ve installed docker.io, you’ll need to add your user to the docker group to grant the necessary permissions, then enable the docker service. To confirm that Docker installed correctly, you can run docker --version to see the version number.
sudo apt install docker.io
sudo usermod -aG docker $USER
sudo systemctl enable --now docker
docker --version
Next, we’ll create a Dockerfile. This is a simple text file placed in the folder in which your data science application resides. The file must be named Dockerfile, with no file extension.
The Dockerfile we’re creating is a really simple one. It pulls the python:3.8.5-buster image from Docker Hub, sets /app as the working directory, and copies your files into it. The base image is a Debian Linux operating system with Python 3.8.5 preinstalled, so it includes everything you need to run your code.
During the build, Docker runs the pip install commands to install pandas and numpy from PyPI. The final CMD instruction, python3 automation.py, tells the container what to run when it starts; here that’s the script of the specific application I’m Dockerizing.
FROM python:3.8.5-buster
WORKDIR /app
COPY . /app
RUN pip install pandas
RUN pip install numpy
CMD ["python3", "automation.py"]
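If your application depends on more than a couple of packages, a common variant (not part of the Dockerfile above) is to list them in a requirements.txt file and install them in one step. This sketch assumes such a file exists alongside your code; copying it before the rest of the source also lets Docker cache the installed packages between builds.

```dockerfile
FROM python:3.8.5-buster
WORKDIR /app
# Copy only the dependency list first, so this layer stays cached
# until requirements.txt itself changes.
COPY requirements.txt /app/
RUN pip install -r requirements.txt
# Now copy the rest of the application code.
COPY . /app
CMD ["python3", "automation.py"]
```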
To test that your Dockerfile correctly Dockerizes your application, you can cd into the directory containing your application and the Dockerfile and run the docker build command below. The -t automation flag tags the generated image automation, while the . tells Docker to use the current directory as the build context. This process will usually take a few minutes, depending on the size of the base image you used.
docker build -t automation .
Next, you can run your Dockerized application by typing docker run automation (where automation is the tag you gave the image). This will start a container from the image and execute python3 automation.py; the packages were already installed when the image was built, so they don’t need to be installed again.
docker run automation
Now you can share your application with your colleagues by including the Dockerfile in the application directory, or Git repository. When they download your code, your colleagues can run docker build -t automation . to build the image from the Dockerfile and then use docker run automation to execute it. It will then work just as it did on your local machine.
Matt Clarke, Saturday, December 18, 2021