If, like me, you’ve come from a background where you made heavy use of SQL, then getting to grips with filtering, subsetting, and selecting data in Pandas can be a steep learning curve. I often spend my working days writing ridiculously complex MySQL queries to perform various ecommerce-related analyses, but I initially found it tough to do the same things in Pandas.
However, I recently stumbled across a superb Python package called PandaSQL, which was developed by data science company Yhat, and which aims to make it possible to run SQL queries on data stored within Pandas dataframes. While I would still highly recommend you learn how to do things in Pandas itself, PandaSQL is nonetheless a brilliant and powerful tool.
In this project, I’ll assume you already know how to write SQL, so I’ll forgo teaching you that and just cover the use of PandaSQL. To get started, open up a Jupyter notebook or Python script and import the Pandas and PandaSQL packages. You can install PandaSQL from PyPI by entering pip3 install pandasql in your terminal.
import pandas as pd
from pandasql import sqldf
PandaSQL’s sqldf() function needs access to the namespace in which your dataframes are defined, so we need to pass globals() to it when running a query. To save the hassle of doing this every time, we’ll create a helper function called pysqldf using a lambda. Passing our SQL query to this function will run it and return the result as a Pandas dataframe.
pysqldf = lambda q: sqldf(q, globals())
Next, load up the data of your choice. I’ve used the Online Retail dataset, which you can download from the UCI Machine Learning Repository. Since this dataset is missing the line price for each order line, I’ve calculated it and added it to the df_order_items dataframe.
df_order_items = pd.read_csv('online_retail.csv')
df_order_items['LinePrice'] = df_order_items['Quantity'] * df_order_items['UnitPrice']
df_order_items.head()
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | LinePrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
We can now write a query to analyse the data in our dataframe and use PandaSQL’s sqldf() function to execute it via our pysqldf lambda. Here, I’m creating a new dataframe called df_orders, which contains aggregate data based on a GROUP BY of the InvoiceNo (aliased as order_id). Note that the query is run via SQLite, not MySQL, so your SQL syntax needs to be SQLite compatible. Its SQL dialect is slightly different, so not everything you’d write in MySQL will work.
query = """
SELECT DISTINCT(InvoiceNo) AS order_id,
       CustomerID AS customer_id,
       InvoiceDate AS order_date,
       COUNT(DISTINCT(StockCode)) AS total_skus,
       SUM(Quantity) AS total_items,
       SUM(LinePrice) AS total_revenue
FROM df_order_items
GROUP BY order_id
"""
df_orders = pysqldf(query)
df_orders.head()
Essentially, anything you can do in SQLite, you can do to query your dataframes. In the next step, I’ve created a customer-level dataframe by using a GROUP BY on the CustomerID and then calculating some statistics on the columns using aggregate functions such as SUM(). You can even do joins and subqueries!
query = """
SELECT DISTINCT(CustomerID) AS customer_id,
       COUNT(DISTINCT(InvoiceNo)) AS total_orders,
       COUNT(DISTINCT(StockCode)) AS total_skus,
       SUM(Quantity) AS total_items,
       SUM(LinePrice) AS total_revenue,
       (SUM(LinePrice) / COUNT(DISTINCT(InvoiceNo))) AS aov
FROM df_order_items
WHERE CustomerID IS NOT NULL
GROUP BY CustomerID
ORDER BY total_revenue DESC
"""
df_customers = pysqldf(query)
df_customers.head()
While I definitely still recommend learning how to use Pandas to select, subset, merge, and filter your data, there’s no denying that PandaSQL is a very useful package for those more skilled in SQL than in Pandas. I’ve used it for a few things recently, where a task has taken significantly less code in SQL than the equivalent would in Pandas.
Matt Clarke, Friday, March 12, 2021