The phrase “It works on my machine” is a common one in most offices with a data science department. It’s very common to write an application that works perfectly in your local data science environment only to find your colleagues run into problems with the exact same code.
Small differences in the version numbers of the packages you’re using can cause compatibility issues that prevent others from reviewing your code or running your model on their development environment.
Docker is designed to overcome this issue and has become an industry standard tool for most data scientists and data engineers. Docker is a containerization application that lets you create a portable virtual operating system environment, complete with your code and the custom package versions required to run it.
In this project I’ll show you how you can Dockerize a data science application. This will take your local application, wrap it up with a host operating system and your preferred Python version, and then install the packages required to run it flawlessly.
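To make the walkthrough concrete, here is the kind of script this project assumes lives in your application folder. The filename automation.py matches the Dockerfile later in the article, but the contents are a placeholder of my own; substitute your actual application.

```python
# automation.py - a hypothetical stand-in for your data science application.
# It computes some simple statistics with numpy and prints them, so there is
# visible output when the container runs.

import numpy as np


def summarise(values):
    """Return basic descriptive statistics for a sequence of numbers."""
    arr = np.asarray(values, dtype=float)
    return {
        "mean": float(arr.mean()),
        "std": float(arr.std()),
        "n": int(arr.size),
    }


if __name__ == "__main__":
    stats = summarise([3.0, 5.0, 7.0])
    print(stats)
```

Any script works here, as long as the CMD line in the Dockerfile points at it and the packages it imports are installed in the image.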
Firstly, I’m assuming that, like most data scientists, you’re running an Ubuntu data science workstation. Open a terminal and enter the commands below to install Docker using the apt package manager.
After you’ve installed docker.io, you’ll need to add your user to the docker group to grant the necessary permissions, then enable the docker service. To confirm that Docker installed correctly, you can run docker --version to see the version number.
sudo apt install docker.io
sudo usermod -aG docker $USER
sudo systemctl enable --now docker
docker --version
Next, we’ll create a Dockerfile. This is a simple text file placed in the folder in which your data science application resides. The file must be named Dockerfile, with no file extension.
The Dockerfile we’re creating is a really simple one. It pulls the python:3.8.5-buster image from Docker Hub, sets /app as the working directory, and copies your files into it. The base image is a Debian Linux operating system with Python 3.8.5 preinstalled, so it includes everything you need to run your code.
During the build, Docker runs the pip install commands to install pandas and numpy from PyPI. The final CMD instruction, python3 automation.py, tells the container what to run when it starts; here that’s the script of the specific application I’m Dockerizing.
FROM python:3.8.5-buster
WORKDIR /app
COPY . /app
RUN pip install pandas
RUN pip install numpy
CMD ["python3", "automation.py"]
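If your application depends on more than a couple of packages, a common variant (not part of the Dockerfile above) is to list them in a requirements.txt file and install them in one step. This sketch assumes such a file exists alongside your code; copying it before the rest of the source also lets Docker cache the installed packages between builds.

```dockerfile
FROM python:3.8.5-buster
WORKDIR /app
# Copy only the dependency list first, so this layer stays cached
# until requirements.txt itself changes.
COPY requirements.txt /app/
RUN pip install -r requirements.txt
# Now copy the rest of the application code.
COPY . /app
CMD ["python3", "automation.py"]
```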
To test that your Dockerfile correctly Dockerizes your application, you can cd into the directory containing your application and the Dockerfile and run the docker build command below. The -t automation flag tags the generated image automation, while the . tells Docker to use the current directory as the build context. This process will usually take a few minutes, depending on the size of the base image you used.
docker build -t automation .
Next, you can run your Dockerized application by typing docker run automation (where automation is the tag you gave the image). This will start a container from the image and execute python3 automation.py; the packages were already installed when the image was built, so they don’t need to be installed again.
docker run automation
Now you can share your application with your colleagues by including the Dockerfile in the application directory, or Git repository. When they download your code, your colleagues can run docker build -t automation . to build the image from the Dockerfile and then use docker run automation to execute it. It will then work just as it did on your local machine.
Matt Clarke, Saturday, December 18, 2021