For years, I spent much of my time performing Exploratory Data Analysis (EDA) directly in SQL. Over time, the queries I wrote became very complicated, and it was often a struggle to make them run efficiently. Then I discovered that I could run SQL queries in Jupyter notebooks via Pandas and SQLAlchemy, and the way I worked changed forever.
Rather than writing one slow, hideously complex SQL query, I found I could write a series of shorter, faster, simpler queries and execute them independently via Pandas in a Jupyter notebook, allowing me to manipulate the selected data and join it to a range of other sources.
Not only that, but the Jupyter notebook meant that I could add commentary to my analysis, generate useful plots, calculate things that are much harder to do in SQL, and create models and complex analyses. If you work with SQL regularly, I’d highly recommend switching to Jupyter and Pandas for your EDA work if you want to become more efficient. Here’s how it’s done.
Our aim is to run SQL queries via an SQLAlchemy connection, using Pandas to execute the query and display the results in a dataframe. Open a Jupyter notebook and load the packages below. You'll need to install the `pandas`, `sqlalchemy`, and `pymysql` packages, which you can do by typing `pip3 install pandas sqlalchemy pymysql` into your terminal.
I’m assuming that you’re using MySQL as your database, and that it’s already set up and running. If you need to set up a local MySQL server, the easiest way is via Docker. If you prefer to connect to a remote MySQL server, you can do that via an SSH tunnel.
import pandas as pd
from sqlalchemy import create_engine
The `create_engine()` function in SQLAlchemy allows you to connect to a wide range of different database platforms, providing you have the necessary driver (like `pymysql`) installed. To create a connection, you need to configure a connection string like the one below.
I’m connecting to `mysql` using the `pymysql` driver, as the `root` user with the password `password-here`, on the IP address `172.17.0.3` and port `3306`, to the database named `database-here`. Once connected, we’ll store the connection in an object called `engine` so we can pass it to each of the Pandas functions we run.
engine = create_engine('mysql+pymysql://root:password-here@172.17.0.3:3306/database-here')
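If you'd rather not assemble the connection string by hand, SQLAlchemy (1.4 and later) can build it from named components via `URL.create()`. A minimal sketch using the same example credentials as above:

```python
from sqlalchemy.engine import URL

# Build the connection URL from named components rather than string
# formatting; this avoids mistakes with separators and special characters.
url = URL.create(
    drivername="mysql+pymysql",
    username="root",
    password="password-here",
    host="172.17.0.3",
    port=3306,
    database="database-here",
)

# The URL object can be passed straight to create_engine(), just like a string:
# engine = create_engine(url)
print(url.render_as_string(hide_password=False))
```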
Now that we have a working database connection, we can create a SQL statement and execute it on the MySQL server using the Pandas `read_sql()` function. We’ll pass the `engine` object above to the `con` argument, return our results in a dataframe, and display the first five rows using `head()`.
df = pd.read_sql("SELECT * FROM brands", con=engine)
df.head()
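If you want to try this pattern without a MySQL server, the same code runs against an in-memory SQLite database. A minimal sketch, with a made-up stand-in for the brands table:

```python
import pandas as pd
from sqlalchemy import create_engine

# An in-memory SQLite engine needs no server, so it's handy for experimenting.
engine = create_engine("sqlite://")

# A made-up stand-in for the brands table.
pd.DataFrame({
    "brand_id": [1, 2, 3],
    "status": ["A", "A", "A"],
    "product_count": [85, 140, 43],
}).to_sql("brands", con=engine, index=False)

# Exactly the same read_sql() call as against MySQL.
df = pd.read_sql("SELECT * FROM brands", con=engine)
print(df.head())
```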
| | brand_id | siteid | parent_id | id_path | level | status | product_count | image_id | position | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 1 | A | 85 | 25196 | 160 | 0 |
| 1 | 2 | 1 | 0 | 2 | 1 | A | 140 | 3860 | 30 | 0 |
| 2 | 3 | 1 | 0 | 3 | 1 | A | 43 | 25323 | 80 | 0 |
| 3 | 4 | 1 | 0 | 4 | 1 | A | 16 | 4213 | 40 | 0 |
| 4 | 5 | 1 | 0 | 5 | 1 | A | 198 | 12705 | 40 | 0 |
If you want to run a more complex SQL query which spans multiple lines, you can use triple quotes instead of single quotes. This allows you to format the query in a more readable manner, which will make your code easier to review and debug. For additional readability, it’s a good idea to place larger queries into variables, or even separate files and load them in.
query = """
SELECT
shoporders.id AS order_id,
shoporders.siteid AS site_id,
shoporders.customerid AS customer_id,
shoporders.amazon_orderid AS amazon_order_id,
shoporders.ordertotal AS total_revenue,
shoporders.datecreated AS date_created
FROM shoporders
LEFT JOIN shoporderorigins ON shoporderorigins.id = shoporders.orderoriginid
LEFT JOIN countries ON countries.id = shoporders.deliverycountry
"""
df = pd.read_sql(query, con=engine)
df.head()
| | order_id | site_id | customer_id | amazon_order_id | total_revenue | date_created |
|---|---|---|---|---|---|---|
| 0 | 299527 | 1 | 166958 | None | 74.01 | 2017-04-07 04:54:37 |
| 1 | 299528 | 1 | 191708 | None | 44.62 | 2017-04-07 06:32:54 |
| 2 | 299529 | 1 | 199961 | 205-3584248-4822709 | 16.99 | 2017-04-07 07:18:45 |
| 3 | 299530 | 1 | 199962 | 203-8816967-5772328 | 11.99 | 2017-04-07 07:20:00 |
| 4 | 299531 | 1 | 199963 | 026-3437725-0197969 | 14.49 | 2017-04-07 07:21:34 |
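The point about separate files is worth showing: keeping long queries in `.sql` files keeps notebooks tidy and makes the SQL easier to version and reuse. A minimal sketch — the file name and location here are hypothetical, and for illustration the file is written to a temporary directory first:

```python
import tempfile
from pathlib import Path

# For illustration only: write the query to a temporary .sql file.
# In practice you'd keep your .sql files alongside the notebook.
sql_dir = Path(tempfile.mkdtemp())
(sql_dir / "orders.sql").write_text("SELECT * FROM shoporders")

# Load the query text and execute it as before.
query = (sql_dir / "orders.sql").read_text()
# df = pd.read_sql(query, con=engine)
```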
As with regular SQL queries, you can “bind” parameters to placeholders to create dynamic queries. Here, we’ll add a `%s` placeholder (the style used by the `pymysql` driver) to our query to hold a `customer_id`.
query = """
SELECT
shoporders.id AS order_id,
shoporders.siteid AS site_id,
shoporders.customerid AS customer_id,
shoporders.amazon_orderid AS amazon_order_id,
shoporders.ordertotal AS total_revenue,
shoporders.datecreated AS date_created
FROM shoporders
LEFT JOIN shoporderorigins ON shoporderorigins.id = shoporders.orderoriginid
LEFT JOIN countries ON countries.id = shoporders.deliverycountry
WHERE shoporders.customerid = %s
"""
To pass the bound parameter to the query and replace `%s` with our value, we’ll assign the value 166958 to `customer_id`, and then pass this to the `params` argument. This can take a `tuple`, `list`, or `dict`.
customer_id = "166958"
df = pd.read_sql(query, con=engine, params=[customer_id])
df.tail()
| | order_id | site_id | customer_id | amazon_order_id | total_revenue | date_created |
|---|---|---|---|---|---|---|
| 235 | 581710 | 1 | 0 | None | 0.0 | 2020-09-03 09:17:02 |
| 236 | 581758 | 1 | 0 | None | 0.0 | 2020-09-03 11:05:33 |
| 237 | 581874 | 1 | 0 | None | 0.0 | 2020-09-03 15:37:28 |
| 238 | 582133 | 1 | 0 | None | 0.0 | 2020-09-04 11:06:20 |
| 239 | 582270 | 1 | 0 | None | 0.0 | 2020-09-04 15:48:57 |
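One caveat: the placeholder style depends on the underlying driver — `pymysql` uses `%s`, while Python's built-in SQLite driver uses `?`. A runnable sketch of the same technique against an in-memory SQLite database (the table and values here are made up):

```python
import sqlite3
import pandas as pd

# Pandas accepts a plain DBAPI connection for SQLite, no SQLAlchemy needed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shoporders (id INTEGER, customerid INTEGER)")
conn.executemany("INSERT INTO shoporders VALUES (?, ?)",
                 [(1, 166958), (2, 166958), (3, 199961)])

# SQLite's placeholder is ? rather than pymysql's %s.
customer_id = 166958
df = pd.read_sql("SELECT * FROM shoporders WHERE customerid = ?",
                 con=conn, params=(customer_id,))
print(df)
```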
Now that you’ve extracted data from MySQL using a `SELECT` query, you can manipulate the dataframe just as you would one loaded from any other source. This means you can clean it, merge it, connect it to other tables or other data sources, and save it out to almost any file type you like, all using the functions built into Pandas.
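For example, two result sets returned by separate `read_sql()` calls can be joined in Pandas with `merge()` and written out with `to_csv()`. A sketch using made-up dataframes in place of real query results:

```python
import pandas as pd

# Made-up stand-ins for two separate read_sql() results.
orders = pd.DataFrame({"customer_id": [1, 2], "total_revenue": [74.01, 44.62]})
customers = pd.DataFrame({"customer_id": [1, 2], "country": ["GB", "DE"]})

# Join the two dataframes in Pandas rather than in SQL, then save the result.
merged = orders.merge(customers, on="customer_id", how="left")
merged.to_csv("orders_with_country.csv", index=False)
```

Doing the join in Pandas rather than SQL means the two queries can come from entirely different databases, or from a database and a CSV file.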
Matt Clarke, Monday, March 08, 2021