If, like me, you’ve come from a background where you made heavy use of SQL, then getting to grips with filtering, subsetting, and selecting data in Pandas can be a steep learning curve. I often spend my working days writing ridiculously complex MySQL queries to perform various ecommerce-related analyses, but I initially found it tough to do the same things in Pandas.
However, I recently stumbled across a superb Python package called PandaSQL, which was developed by data science company Yhat, and which aims to make it possible to run SQL queries on data stored within Pandas dataframes. While I would still highly recommend you learn how to do things in Pandas itself, PandaSQL is nonetheless a brilliant and powerful tool.
In this project, I’ll assume you already know how to write SQL, so I’ll forgo teaching you that and just cover the use of PandaSQL. To get started, open up a Jupyter notebook or Python script and import the Pandas and PandaSQL packages. You can install PandaSQL from PyPI by entering pip3 install pandasql in your terminal.
import pandas as pd
from pandasql import sqldf
PandaSQL’s sqldf() function needs access to the namespace in which your dataframes are defined, so we need to pass globals() to it when running a query. To save the hassle of doing this every time, we’ll create a helper function called pysqldf using a lambda. Passing our SQL query to this function will run it and return the result as a Pandas dataframe.
pysqldf = lambda q: sqldf(q, globals())
Next, load up the data of your choice. I’ve used the Online Retail dataset, which you can download from the UCI Machine Learning Repository. Since this dataset is missing the line price for each order line, I’ve calculated it and added it to the df_order_items dataframe.
df_order_items = pd.read_csv('online_retail.csv')
df_order_items['LinePrice'] = df_order_items['Quantity'] * df_order_items['UnitPrice']
df_order_items.head()
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | LinePrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
We can now write a query to analyse the data in our dataframe and use PandaSQL’s sqldf() function to execute it via our pysqldf lambda. Here, I’m creating a new dataframe called df_orders, which contains aggregate data based on a GROUP BY of the InvoiceNo (aliased as order_id). Note that the query is run via SQLite, not MySQL, so your SQL syntax needs to be SQLite compatible. Its SQL dialect is slightly different, so not everything you’d write in MySQL will work.
query = """
SELECT DISTINCT(InvoiceNo) AS order_id,
       CustomerID AS customer_id,
       InvoiceDate AS order_date,
       COUNT(DISTINCT(StockCode)) AS total_skus,
       SUM(Quantity) AS total_items,
       SUM(LinePrice) AS total_revenue
FROM df_order_items
GROUP BY order_id
"""
df_orders = pysqldf(query)
df_orders.head()
Essentially, anything you can do in SQLite, you can do to query your dataframes. In the next step, I’ve created a customer-level dataframe by using a GROUP BY on the CustomerID and then calculating some statistics on the columns using aggregate functions such as SUM(). You can even do joins and subqueries!
query = """
SELECT DISTINCT(CustomerID) AS customer_id,
       COUNT(DISTINCT(InvoiceNo)) AS total_orders,
       COUNT(DISTINCT(StockCode)) AS total_skus,
       SUM(Quantity) AS total_items,
       SUM(LinePrice) AS total_revenue,
       (SUM(LinePrice) / COUNT(DISTINCT(InvoiceNo))) AS aov
FROM df_order_items
WHERE CustomerID IS NOT NULL
GROUP BY CustomerID
ORDER BY total_revenue DESC
"""
df_customers = pysqldf(query)
df_customers.head()
While I definitely still recommend learning how to use Pandas to select, subset, merge, and filter your data, there’s no denying that PandaSQL is a very useful package for those more skilled in SQL than in Pandas. I’ve used it for a few things recently, where a task has taken significantly less code in SQL than the equivalent would in Pandas.
Matt Clarke, Friday, March 12, 2021