Like most people who work in ecommerce data science, I regularly need to access data stored in a database - usually MySQL or MariaDB, but sometimes also MSSQL. Although it is possible to connect to these database servers directly and run SQL queries in situ, this can be risky. Write your query in the wrong way and you could place the production server under severe load, potentially impacting customers.
The solution, of course, is to transfer this data to a data warehouse, data lake, or local server and examine it there instead. Docker makes it fairly straightforward to set up a MySQL or MariaDB environment on your local machine upon which you can install one or many databases.
In this tutorial, I’ll show you how to set up a Docker container for MySQL or MariaDB, how to import a database dump to create a local copy of your production database, and finally how to connect to the Docker MySQL server using a data manager like Beekeeper Studio or Data Grip, and how to analyse the data using Jupyter Notebooks and SQLAlchemy.
MySQL comes in a few different flavours now. To minimise any potential compatibility issues, I am going to use the exact same version that’s running on the production server used to generate the MySQL dump. This is MariaDB version 10.1. If you visit Docker Hub, you’ll find pre-built Docker containers for the official version of MySQL and the MariaDB version. If you browse the tags section, you should be able to find a version number which matches the one used in your MySQL dump. By passing this tag value to the docker pull
command, you can ensure you download the exact same version, or one very close to it.
First we’ll use docker pull mariadb:10.1
to download and install the official MariaDB container from Docker Hub. This will install version 10.1 of the MariaDB version of MySQL and a compact Linux environment in a container on your machine. You’ll be able to use this for a single project or multiple projects, with everything neatly self-contained.
docker pull mariadb:10.1
Now that Docker is installed, we will run the container give it a name docker_mysql
and set a root password so we can login. Finally, we define the MariaDB version using a tag.
docker run --name docker_mysql -e MYSQL_ROOT_PASSWORD=SecretPasswordHere -d mariadb:10.1
To access your Docker container using the Bash shell you can use docker exec
, define the Docker container name and tell it to initialise Bash. This will give you a command prompt to issue commands on the Docker container itself. Once you have shell access you can then login to MySQL using the password you added in the previous step.
docker exec -it docker_mysql bash
mysql -uroot -p
Now we have a working database server we can install an existing MySQL database from a dump. If you’re not already logged in, repeat the steps above to open a Bash shell on the Docker container and then login to MySQL. Then create a new MySQL database with the name specified in your dump with create database your_db_name;
. If you don’t know the database name, open a terminal and type head /path/to/dump.sql
and you’ll see the database name in the code, without the need to load what might be a huge file.
docker exec -it docker_mysql bash
mysql -uroot -p
create database your_db_name;
Once you’ve created the database, you can import it using docker exec
. This will take a while if you’ve taken a massive dump…
docker exec -i docker_mysql mysql -uroot -pSecretPasswordHere your_db_name < dump.sql
Most people who work with databases a lot prefer to use an SQL manager application for browsing, administering, and querying their database, rather than doing everything from the command line. I tend to use DBeaver on Linux most of the time, but Beekeeper Studio and DataGrip are also very good. All of them can be configured to access the Docker container running your MySQL server.
First, find the enter docker ps
to view the running Docker containers and check the PORTS
column of the output. By default, you’ll usually see 3306/tcp
on your MySQL or MariaDB containers, as this is the default port.
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
cfe480cf160a mariadb:10.1 "docker-entrypoint.s…" 8 minutes ago Up 8 minutes 3306/tcp docker_mysql
01012c4f4759 jupyter/tensorflow-notebook "tini -g -- start-no…" About an hour ago Up About an hour 0.0.0.0:8888->8888/tcp pensive_grothendieck
Next, enter docker inspect docker_mysql
and you’ll get back a JSON object containing all of the data on your MySQL container. Scroll through this and look for the IPAddress
field. This is the IP of your Docker container. Mine was 172.17.0.3.
docker inspect docker_mysql
Finally, open up your preferred database manager - I’m using Beekeeper Studio, which is free and open source. Select the connection type as MariaDB (or MySQL if you installed that), enter the IP address in the Host field and ensure the Port value matches the one you saw above (usually 3306), then enter your username and password and test the connection. If you only have one database, you can set it as the default, or you can select it from the menu manually.
One really neat trick you can use with MySQL and other databases is performing your Exploratory Data Analysis inside Jupyter. The SQLAlchemy package makes this very easy to do. Install SQLAlchemy if you don’t already have it, and the PyMySQL engine.
!pip3 install sqlalchemy
!pip3 install pymysql
Requirement already satisfied: sqlalchemy in /opt/conda/lib/python3.8/site-packages (1.3.19)
Requirement already satisfied: pymysql in /opt/conda/lib/python3.8/site-packages (0.10.0)
Next, load up Pandas and use the create_engine
feature to connect to the MySQL database using a connection string like the one below. This passes in the username, password, IP address, port number and database name and opens a connection.
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://root:SecretPasswordHere@172.17.0.3:3306/your_db_name')
query = """
SELECT * FROM orders
"""
df = pd.read_sql(query, con=engine)
df.head()
Matt Clarke, Wednesday, March 03, 2021