For years, I spent much of my time performing Exploratory Data Analysis (EDA) directly in SQL. Over time, the queries I wrote became very complicated, and it was often a struggle to make them run efficiently. Then I discovered that I could run SQL queries in Jupyter notebooks via Pandas and SQLAlchemy, and the way I worked changed forever.
Rather than writing one slow, hideously complex SQL query, I found I could write a series of shorter, faster, simpler queries and execute them independently via Pandas in a Jupyter notebook, allowing me to manipulate the selected data and join it to a range of other sources.
Not only that, but the Jupyter notebook meant that I could add commentary to my analysis, generate useful plots, calculate things that are much harder to do in SQL, and create models and complex analyses. If you work with SQL regularly, I’d highly recommend switching to Jupyter and Pandas for your EDA work if you want to become more efficient. Here’s how it’s done.
Our aim is to run SQL queries via an SQLAlchemy connection, using Pandas to execute the query and display the results in a dataframe. Open a Jupyter notebook and load the packages below. You'll need to install the `pandas`, `sqlalchemy`, and `pymysql` packages, which you can do by typing `pip3 install pandas sqlalchemy pymysql` into your terminal.
I’m assuming that you’re using MySQL as your database, and that it’s already set up and running. If you need to set up a local MySQL server, the easiest way is via Docker. If you prefer to connect to a remote MySQL server, you can do that via an SSH tunnel.
import pandas as pd
from sqlalchemy import create_engine
The `create_engine()` function in SQLAlchemy allows you to connect to a wide range of different database platforms, providing you have the necessary driver (like `pymysql`) installed. To create a connection, you need to configure a connection string like the one below.
I’m connecting to `mysql` using the `pymysql` driver, as the `root` user with the password `password-here`, on the IP address `172.17.0.3` and port `3306`, to the database named `database-here`. Once connected, we’ll store the connection in an object called `engine` so we can pass it to each of the Pandas functions we run.
engine = create_engine('mysql+pymysql://root:password-here@172.17.0.3:3306/database-here')
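If you'd rather not assemble the connection string by hand, SQLAlchemy (1.4 and later) can build it from named components via `URL.create()`. A minimal sketch using the same example credentials as above:

```python
from sqlalchemy.engine import URL

# Build the connection URL from named components rather than string
# formatting; this avoids mistakes with separators and special characters.
url = URL.create(
    drivername="mysql+pymysql",
    username="root",
    password="password-here",
    host="172.17.0.3",
    port=3306,
    database="database-here",
)

# The URL object can be passed straight to create_engine(), just like a string:
# engine = create_engine(url)
print(url.render_as_string(hide_password=False))
```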
Now that we have a working database connection, we can create a SQL statement and execute it on the MySQL server using the Pandas `read_sql()` function. We’ll pass the `engine` object above to the `con` argument, return our results in a dataframe, and display the first five rows using `head()`.
df = pd.read_sql("SELECT * FROM brands", con=engine)
df.head()
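If you want to try this pattern without a MySQL server, the same code runs against an in-memory SQLite database. A minimal sketch, with a made-up stand-in for the brands table:

```python
import pandas as pd
from sqlalchemy import create_engine

# An in-memory SQLite engine needs no server, so it's handy for experimenting.
engine = create_engine("sqlite://")

# A made-up stand-in for the brands table.
pd.DataFrame({
    "brand_id": [1, 2, 3],
    "status": ["A", "A", "A"],
    "product_count": [85, 140, 43],
}).to_sql("brands", con=engine, index=False)

# Exactly the same read_sql() call as against MySQL.
df = pd.read_sql("SELECT * FROM brands", con=engine)
print(df.head())
```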
| | brand_id | siteid | parent_id | id_path | level | status | product_count | image_id | position | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 1 | A | 85 | 25196 | 160 | 0 |
| 1 | 2 | 1 | 0 | 2 | 1 | A | 140 | 3860 | 30 | 0 |
| 2 | 3 | 1 | 0 | 3 | 1 | A | 43 | 25323 | 80 | 0 |
| 3 | 4 | 1 | 0 | 4 | 1 | A | 16 | 4213 | 40 | 0 |
| 4 | 5 | 1 | 0 | 5 | 1 | A | 198 | 12705 | 40 | 0 |
If you want to run a more complex SQL query which spans multiple lines, you can use triple quotes instead of single quotes. This allows you to format the query in a more readable manner, which will make your code easier to review and debug. For additional readability, it’s a good idea to place larger queries into variables, or even separate files and load them in.
query = """
SELECT
shoporders.id AS order_id,
shoporders.siteid AS site_id,
shoporders.customerid AS customer_id,
shoporders.amazon_orderid AS amazon_order_id,
shoporders.ordertotal AS total_revenue,
shoporders.datecreated AS date_created
FROM shoporders
LEFT JOIN shoporderorigins ON shoporderorigins.id = shoporders.orderoriginid
LEFT JOIN countries ON countries.id = shoporders.deliverycountry
"""
df = pd.read_sql(query, con=engine)
df.head()
| | order_id | site_id | customer_id | amazon_order_id | total_revenue | date_created |
|---|---|---|---|---|---|---|
| 0 | 299527 | 1 | 166958 | None | 74.01 | 2017-04-07 04:54:37 |
| 1 | 299528 | 1 | 191708 | None | 44.62 | 2017-04-07 06:32:54 |
| 2 | 299529 | 1 | 199961 | 205-3584248-4822709 | 16.99 | 2017-04-07 07:18:45 |
| 3 | 299530 | 1 | 199962 | 203-8816967-5772328 | 11.99 | 2017-04-07 07:20:00 |
| 4 | 299531 | 1 | 199963 | 026-3437725-0197969 | 14.49 | 2017-04-07 07:21:34 |
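The point about separate files is worth showing: keeping long queries in `.sql` files keeps notebooks tidy and makes the SQL easier to version and reuse. A minimal sketch — the file name and location here are hypothetical, and for illustration the file is written to a temporary directory first:

```python
import tempfile
from pathlib import Path

# For illustration only: write the query to a temporary .sql file.
# In practice you'd keep your .sql files alongside the notebook.
sql_dir = Path(tempfile.mkdtemp())
(sql_dir / "orders.sql").write_text("SELECT * FROM shoporders")

# Load the query text and execute it as before.
query = (sql_dir / "orders.sql").read_text()
# df = pd.read_sql(query, con=engine)
```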
As with regular SQL queries, you can “bind” parameters to placeholders to create dynamic queries. Here, we’ll add a `%s` placeholder (the style used by the `pymysql` driver) to our query to hold a `customer_id`.
query = """
SELECT
shoporders.id AS order_id,
shoporders.siteid AS site_id,
shoporders.customerid AS customer_id,
shoporders.amazon_orderid AS amazon_order_id,
shoporders.ordertotal AS total_revenue,
shoporders.datecreated AS date_created
FROM shoporders
LEFT JOIN shoporderorigins ON shoporderorigins.id = shoporders.orderoriginid
LEFT JOIN countries ON countries.id = shoporders.deliverycountry
WHERE shoporders.customerid = %s
"""
To pass the bound parameter to the query and replace `%s` with our value, we’ll assign the value 166958 to `customer_id`, and then pass this to the `params` argument. This can take a `tuple`, `list`, or `dict`.
customer_id = "166958"
df = pd.read_sql(query, con=engine, params=[customer_id])
df.tail()
| | order_id | site_id | customer_id | amazon_order_id | total_revenue | date_created |
|---|---|---|---|---|---|---|
| 235 | 581710 | 1 | 0 | None | 0.0 | 2020-09-03 09:17:02 |
| 236 | 581758 | 1 | 0 | None | 0.0 | 2020-09-03 11:05:33 |
| 237 | 581874 | 1 | 0 | None | 0.0 | 2020-09-03 15:37:28 |
| 238 | 582133 | 1 | 0 | None | 0.0 | 2020-09-04 11:06:20 |
| 239 | 582270 | 1 | 0 | None | 0.0 | 2020-09-04 15:48:57 |
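One caveat: the placeholder style depends on the underlying driver — `pymysql` uses `%s`, while Python's built-in SQLite driver uses `?`. A runnable sketch of the same technique against an in-memory SQLite database (the table and values here are made up):

```python
import sqlite3
import pandas as pd

# Pandas accepts a plain DBAPI connection for SQLite, no SQLAlchemy needed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shoporders (id INTEGER, customerid INTEGER)")
conn.executemany("INSERT INTO shoporders VALUES (?, ?)",
                 [(1, 166958), (2, 166958), (3, 199961)])

# SQLite's placeholder is ? rather than pymysql's %s.
customer_id = 166958
df = pd.read_sql("SELECT * FROM shoporders WHERE customerid = ?",
                 con=conn, params=(customer_id,))
print(df)
```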
Now that you’ve extracted data from MySQL using a `SELECT` query, you can manipulate the dataframe just as you would one loaded from any other source. This means you can clean it, merge it, connect it to other tables or other data sources, and save it out to almost any file type you like, all using the functions built into Pandas.
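For example, two result sets returned by separate `read_sql()` calls can be joined in Pandas with `merge()` and written out with `to_csv()`. A sketch using made-up dataframes in place of real query results:

```python
import pandas as pd

# Made-up stand-ins for two separate read_sql() results.
orders = pd.DataFrame({"customer_id": [1, 2], "total_revenue": [74.01, 44.62]})
customers = pd.DataFrame({"customer_id": [1, 2], "country": ["GB", "DE"]})

# Join the two dataframes in Pandas rather than in SQL, then save the result.
merged = orders.merge(customers, on="customer_id", how="left")
merged.to_csv("orders_with_country.csv", index=False)
```

Doing the join in Pandas rather than SQL means the two queries can come from entirely different databases, or from a database and a CSV file.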
Matt Clarke, Monday, March 08, 2021