How to analyse Pandas dataframes using SQL with PandaSQL

Learn how to use PandaSQL and query the data in your Pandas dataframes using SQL queries instead of complex Pandas code.

If, like me, you’ve come from a background where you made heavy use of SQL, then getting to grips with filtering, subsetting, and selecting data in Pandas can involve a steep learning curve. I often spend my working days writing ridiculously complex MySQL queries to perform various ecommerce-related analyses, but I initially found it tough to do the same things in Pandas.

However, I recently stumbled across a superb Python package called PandaSQL, developed by the data science company Yhat, which lets you run SQL queries on data stored within Pandas dataframes. While I would still highly recommend you learn how to do things in Pandas itself, PandaSQL is nonetheless a brilliant and powerful tool.

Load the packages

In this project, I’ll assume you already know how to write SQL, so I’ll forgo teaching you that and just cover the use of PandaSQL instead. To get started, install PandaSQL from PyPI by entering pip3 install pandasql in your terminal, then open up a Jupyter notebook or Python script and import the pandas and pandasql packages.

import pandas as pd
from pandasql import sqldf

Create a helper function

PandaSQL needs access to your session’s global variables so it can find the dataframes referenced in your queries, which means we need to pass globals() to its sqldf() function every time we run a query. To save that hassle, we’ll create a helper function called pysqldf using a lambda. Passing a SQL query to this function will run it and return the result in a Pandas dataframe.

pysqldf = lambda q: sqldf(q, globals())
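
To check the helper works, you can run a quick query against any dataframe in your global scope. This is just a throwaway sanity check, so the df name and values below are purely illustrative rather than part of the dataset we’ll use.

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})  # throwaway dataframe purely for illustration
pysqldf("SELECT * FROM df WHERE a > 1")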

Load your raw data

Next, load up the data of your choice. I’ve used the Online Retail dataset, which you can download from the UCI Machine Learning Repository. Since this dataset is missing the line price for each order line, I’ve calculated this and added it as a new column to the df_order_items dataframe.

df_order_items = pd.read_csv('online_retail.csv')
df_order_items['LinePrice'] = df_order_items['Quantity'] * df_order_items['UnitPrice']
df_order_items.head()
Unnamed: 0 InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country LinePrice
0 0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30
1 1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
2 2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00
3 3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
4 4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34

Create an order-level dataset

We can now write a query to analyse the data in our dataframe and use PandaSQL’s sqldf() function to execute it via our pysqldf lambda function. Here, I’m creating a new dataframe called df_orders which contains aggregate data based on a GROUP BY of the order_id.

Under the hood, the query is run via SQLite, not MySQL, so your SQL syntax needs to be SQLite compatible. The dialect is slightly different, so not everything from MySQL or regular ANSI SQL will work.

query = """
SELECT 
    DISTINCT(InvoiceNo) AS order_id, 
    CustomerID AS customer_id, 
    InvoiceDate AS order_date,
    COUNT(DISTINCT(StockCode)) AS total_skus,
    SUM(Quantity) AS total_items,
    SUM(LinePrice) AS total_revenue
FROM df_order_items
GROUP BY order_id
"""
df_orders = pysqldf(query)
df_orders.head()
order_id customer_id order_date total_skus total_items total_revenue
0 536365 17850.0 2010-12-01 08:26:00 7 40 139.12
1 536366 17850.0 2010-12-01 08:28:00 2 12 22.20
2 536367 13047.0 2010-12-01 08:34:00 12 83 278.73
3 536368 13047.0 2010-12-01 08:34:00 4 15 70.05
4 536369 13047.0 2010-12-01 08:35:00 1 3 17.85
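
As an example of the dialect point above, MySQL date functions such as DATE_FORMAT() aren’t available, but SQLite’s strftime() is. A rough sketch of a monthly revenue breakdown using the same approach might look like the below; the df_monthly name is my own and not part of the original example.

query = """
SELECT
    strftime('%Y-%m', InvoiceDate) AS order_month,  -- SQLite's way of truncating a timestamp to a month
    COUNT(DISTINCT InvoiceNo) AS total_orders,
    SUM(LinePrice) AS total_revenue
FROM df_order_items
GROUP BY order_month
ORDER BY order_month
"""
df_monthly = pysqldf(query)
df_monthly.head()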

Create a customer-level dataset

Essentially, anything you can do in SQLite, you can do against your dataframes. In the next step, I’ve created a customer-level dataframe by using a GROUP BY on the CustomerID and then calculating some statistics on the columns using DISTINCT(), COUNT(), and SUM(). You can even do joins and subqueries (there’s a join example after the customer-level output below).

query = """
SELECT 
    DISTINCT(CustomerID) AS customer_id,
    COUNT(DISTINCT(InvoiceNo)) AS total_orders,
    COUNT(DISTINCT(StockCode)) AS total_skus,
    SUM(Quantity) AS total_items,
    SUM(LinePrice) AS total_revenue,
    (SUM(LinePrice) / COUNT(DISTINCT(InvoiceNo))) AS aov
FROM df_order_items
WHERE CustomerID IS NOT NULL 
GROUP BY CustomerID
ORDER BY total_revenue DESC
"""
df_customers = pysqldf(query)
df_customers.head()
customer_id total_orders total_skus total_items total_revenue aov
0 14646.0 77 703 196719 279489.02 3629.727532
1 18102.0 62 151 64122 256438.49 4136.104677
2 17450.0 55 127 69029 187482.17 3408.766727
3 14911.0 248 1794 77180 132572.62 534.567016
4 12415.0 26 444 77242 123725.45 4758.671154
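
As mentioned above, joins also work across multiple dataframes. Here’s a rough sketch that joins the two aggregate dataframes created earlier to pull back the individual orders placed by the highest-revenue customers; the df_top name and the columns selected are just illustrative.

query = """
SELECT
    c.customer_id,
    c.total_revenue AS customer_revenue,
    o.order_id,
    o.order_date,
    o.total_revenue AS order_revenue
FROM df_customers c
JOIN df_orders o ON o.customer_id = c.customer_id
ORDER BY customer_revenue DESC, o.order_date
LIMIT 10
"""
df_top = pysqldf(query)
df_top.head()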

While I definitely still recommend learning how to use Pandas to select, subset, merge, and filter your data, there’s no denying that PandaSQL is a very useful package for those more skilled in SQL than Pandas. I’ve used it for a few things recently where the SQL took significantly less code than the equivalent Pandas would have.

Matt Clarke, Friday, March 12, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.