How to query MySQL and other databases using Pandas

Querying MySQL and other databases using Pandas in Jupyter notebooks will change the way you work forever. Here’s how it's done...


For years, I spent much of my time performing Exploratory Data Analysis directly in SQL. Over time, the queries I wrote became very complicated, and it was often a struggle to make them run efficiently. Then I discovered that I could run SQL queries in Jupyter notebooks via Pandas and SQLAlchemy, and the way I worked changed forever.

Rather than writing one slow, hideously complex SQL query, I found that I could write a series of shorter, faster, simpler queries and execute them independently via Pandas in a Jupyter notebook, allowing me to manipulate and join the selected data to a range of other sources.

Not only that, but the Jupyter notebook meant that I could add commentary to my analysis, generate useful plots, calculate things that are much harder to do in SQL, and create models and complex analyses. If you work with SQL regularly, I’d highly recommend switching to Jupyter and Pandas for your EDA work if you want to become more efficient. Here’s how it’s done.

Load the packages

Our aim is to run SQL queries via an SQLAlchemy connection, using Pandas to execute the query and display the data in a dataframe. Open a Jupyter notebook and load the packages below. You’ll need to install the pandas, sqlalchemy, and pymysql packages, which you can do by typing pip3 install pandas sqlalchemy pymysql into your terminal.

I’m assuming that you’re using MySQL as your database, and that this is already set up and running. If you need to set up a local MySQL server, the easiest way is to run one via Docker. If you prefer to connect to a remote MySQL server, you can do that via an SSH tunnel.

import pandas as pd
from sqlalchemy import create_engine
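
If you do need to reach a remote MySQL server over an SSH tunnel, as mentioned above, here’s a minimal sketch using the third-party sshtunnel package (pip3 install sshtunnel). The hostnames, credentials, and key path below are placeholders you’d replace with your own.

from sshtunnel import SSHTunnelForwarder

# Open an SSH tunnel to the remote server and forward its MySQL port locally
# (server address, username, and key path are placeholders)
tunnel = SSHTunnelForwarder(
    ('remote-server-here', 22),
    ssh_username='ssh-user-here',
    ssh_pkey='/path/to/private-key',
    remote_bind_address=('127.0.0.1', 3306)
)
tunnel.start()

# The remote MySQL server is now reachable on 127.0.0.1 at tunnel.local_bind_port,
# so you can point your connection string at that port instead of 3306.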

Configure your database connection

The create_engine() function in SQLAlchemy allows you to connect to a wide range of different database platforms, providing you have the necessary driver (like pymysql) installed. To create a connection you need to configure a connection string like the one below.

I’m connecting to MySQL using the pymysql driver, as the root user with the password password-here, on the IP address 172.17.0.3 and port 3306, using a database called database-here. Once connected, we’ll store the connection in an object called engine so we can pass it to each of the SQLAlchemy functions we run.

engine = create_engine('mysql+pymysql://root:password-here@172.17.0.3:3306/database-here')
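
The same pattern works for the other databases SQLAlchemy supports, as long as the relevant driver is installed. As a rough sketch, connection strings for PostgreSQL and SQLite (with placeholder credentials and filenames) look like this:

# PostgreSQL via the psycopg2 driver (placeholder credentials)
postgres_engine = create_engine('postgresql+psycopg2://user:password-here@localhost:5432/database-here')

# SQLite needs no server at all, just a path to a local file
sqlite_engine = create_engine('sqlite:///local-database.db')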

Running MySQL SELECT queries using Pandas

Now that we have a working database connection, we can create an SQL statement and execute it on the MySQL server using the Pandas read_sql() function. We’ll pass the engine object above to the con argument, return our results in a dataframe, and display the first five rows using head().

df = pd.read_sql("SELECT * FROM brands", con=engine)
df.head()
   brand_id  siteid  parent_id  id_path  level  status  product_count  image_id  position  timestamp
0         1       1          0        1      1       A             85     25196       160          0
1         2       1          0        2      1       A            140      3860        30          0
2         3       1          0        3      1       A             43     25323        80          0
3         4       1          0        4      1       A             16      4213        40          0
4         5       1          0        5      1       A            198     12705        40          0
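
If the table you’re selecting from is very large, read_sql() also accepts a chunksize argument, which returns an iterator of smaller dataframes rather than loading everything into memory at once. Here’s a minimal sketch using the same brands table:

# Read the brands table in chunks of 10,000 rows to keep memory usage down
for chunk in pd.read_sql("SELECT * FROM brands", con=engine, chunksize=10000):
    print(chunk.shape)  # each chunk is an ordinary dataframe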

If you want to run a more complex SQL query that spans multiple lines, you can use triple quotes instead of single quotes. This allows you to format the query in a more readable manner, which will make your code easier to review and debug. For additional readability, it’s a good idea to place larger queries in variables, or even in separate files that you load in.

query = """
SELECT
    shoporders.id AS order_id,
    shoporders.siteid AS site_id,
    shoporders.customerid AS customer_id,
    shoporders.amazon_orderid AS amazon_order_id,
    shoporders.ordertotal AS total_revenue,
    shoporders.datecreated AS date_created
FROM shoporders
LEFT JOIN shoporderorigins ON shoporderorigins.id = shoporders.orderoriginid
LEFT JOIN countries ON countries.id = shoporders.deliverycountry
"""
df = pd.read_sql(query, con=engine)
df.head()
   order_id  site_id  customer_id      amazon_order_id  total_revenue         date_created
0    299527        1       166958                 None          74.01  2017-04-07 04:54:37
1    299528        1       191708                 None          44.62  2017-04-07 06:32:54
2    299529        1       199961  205-3584248-4822709          16.99  2017-04-07 07:18:45
3    299530        1       199962  203-8816967-5772328          11.99  2017-04-07 07:20:00
4    299531        1       199963  026-3437725-0197969          14.49  2017-04-07 07:21:34
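
As mentioned above, you can also store larger queries in separate .sql files and read them in before executing them. Here’s a minimal sketch, assuming the query above has been saved to a file called orders.sql (a hypothetical filename):

# Load a query from a separate .sql file and run it with read_sql()
with open('orders.sql') as f:
    query = f.read()

df = pd.read_sql(query, con=engine)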

Adding bound parameters to SQL queries

As with regular SQL queries, you can “bind” values to placeholder parameters to create dynamic queries. Here, we’ll add a %s placeholder to our query to hold a customer_id.

query = """
SELECT
    shoporders.id AS order_id,
    shoporders.siteid AS site_id,
    shoporders.customerid AS customer_id,
    shoporders.amazon_orderid AS amazon_order_id,
    shoporders.ordertotal AS total_revenue,
    shoporders.datecreated AS date_created
FROM shoporders
LEFT JOIN shoporderorigins ON shoporderorigins.id = shoporders.orderoriginid
LEFT JOIN countries ON countries.id = shoporders.deliverycountry
WHERE shoporders.customerid = %s
"""

To pass the bound parameter to the query and replace %s with our value, we’ll assign the value 166958 to customer_id, and then pass this to the params argument. This can take a tuple, list, or dict.

customer_id = "166958"
df = pd.read_sql(query, con=engine, params=[customer_id])
df.tail()
     order_id  site_id  customer_id amazon_order_id  total_revenue         date_created
235    581710        1            0            None            0.0  2020-09-03 09:17:02
236    581758        1            0            None            0.0  2020-09-03 11:05:33
237    581874        1            0            None            0.0  2020-09-03 15:37:28
238    582133        1            0            None            0.0  2020-09-04 11:06:20
239    582270        1            0            None            0.0  2020-09-04 15:48:57
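
Because params can also take a dict, another option is to use SQLAlchemy’s text() construct with a named parameter instead of a positional %s placeholder. Here’s a shortened sketch of the same filter, selecting just a couple of columns for brevity:

from sqlalchemy import text

# Bind a named :customer_id parameter using a dict of values
query = text("""
SELECT
    shoporders.id AS order_id,
    shoporders.ordertotal AS total_revenue
FROM shoporders
WHERE shoporders.customerid = :customer_id
""")

df = pd.read_sql(query, con=engine, params={'customer_id': 166958})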

Working with dataframes from MySQL

Now that you’ve extracted data from MySQL using a SELECT query, you can manipulate the dataframe just as you would one that’s been loaded from any other source. This means you can clean it, merge it, connect it to other tables or other data sources, and save it out to almost any file type you like, all using the functions built into Pandas.
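
As a quick illustration, here’s a hedged sketch of the kind of follow-up work you might do on the orders dataframe above (the output filename is just an example):

# Summarise total revenue by customer and write the result to a CSV file
summary = df.groupby('customer_id')['total_revenue'].sum().reset_index()
summary.to_csv('revenue_by_customer.csv', index=False)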

Matt Clarke, Monday, March 08, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.