If, like me, you’ve come from a background where you made heavy use of SQL, then getting to grips with filtering, subsetting, and selecting data in Pandas can be a steep learning curve. I often spend my working days writing ridiculously complex MySQL queries to perform various ecommerce-related analyses, but I initially found it tough to do the same things in Pandas.
However, I recently stumbled across a superb Python package called PandaSQL, which was developed by data science company Yhat, and which aims to make it possible to run SQL queries on data stored within Pandas dataframes. While I would still highly recommend you learn how to do things in Pandas itself, PandaSQL is nonetheless a brilliant and powerful tool.
In this project, I’ll assume you already know how to write SQL, so I’ll forgo teaching you that and just cover the use of PandaSQL. To get started, open up a Jupyter notebook or Python script and import the Pandas and PandaSQL packages. You can install PandaSQL from PyPI by entering pip3 install pandasql in your terminal.
import pandas as pd
from pandasql import sqldf
PandaSQL’s sqldf() function needs access to the namespace in which your dataframes are defined, so we need to pass globals() to it when running a query. To save the hassle of doing this every time, we’ll create a helper function called pysqldf using a lambda. Passing our SQL query to this function will run it and return the result as a Pandas dataframe.
pysqldf = lambda q: sqldf(q, globals())
Next, load up the data of your choice. I’ve used the Online Retail dataset, which you can download from the UCI Machine Learning Repository. Since this dataset is missing the line price for each order line, I’ve calculated it and added it to the df_order_items dataframe.
df_order_items = pd.read_csv('online_retail.csv')
df_order_items['LinePrice'] = df_order_items['Quantity'] * df_order_items['UnitPrice']
df_order_items.head()
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | LinePrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
We can now write a query to analyse the data in our dataframe and use PandaSQL’s sqldf() function to execute it via our pysqldf lambda. Here, I’m creating a new dataframe called df_orders, which contains aggregate data based on a GROUP BY of the InvoiceNo (aliased as order_id). Note that the query is run via SQLite, not MySQL, so your SQL syntax needs to be SQLite compatible. Its SQL dialect is slightly different, so not everything you’d write in MySQL will work.
query = """
SELECT DISTINCT(InvoiceNo) AS order_id,
       CustomerID AS customer_id,
       InvoiceDate AS order_date,
       COUNT(DISTINCT(StockCode)) AS total_skus,
       SUM(Quantity) AS total_items,
       SUM(LinePrice) AS total_revenue
FROM df_order_items
GROUP BY order_id
"""
df_orders = pysqldf(query)
df_orders.head()
Essentially, anything you can do in SQLite, you can do to query your dataframes. In the next step, I’ve created a customer-level dataframe by using a GROUP BY on the CustomerID and then calculating some statistics on the columns using aggregate functions such as SUM(). You can even do joins and subqueries!
query = """
SELECT DISTINCT(CustomerID) AS customer_id,
       COUNT(DISTINCT(InvoiceNo)) AS total_orders,
       COUNT(DISTINCT(StockCode)) AS total_skus,
       SUM(Quantity) AS total_items,
       SUM(LinePrice) AS total_revenue,
       (SUM(LinePrice) / COUNT(DISTINCT(InvoiceNo))) AS aov
FROM df_order_items
WHERE CustomerID IS NOT NULL
GROUP BY CustomerID
ORDER BY total_revenue DESC
"""
df_customers = pysqldf(query)
df_customers.head()
While I definitely still recommend learning how to use Pandas to select, subset, merge, and filter your data, there’s no denying that PandaSQL is a very useful package for those more skilled in SQL than in Pandas. I’ve used it for a few things recently, where a task has taken significantly less code in SQL than the equivalent would in Pandas.
Matt Clarke, Friday, March 12, 2021