How to group and aggregate transactional data using Pandas

Learn how to group and aggregate transactional data using Pandas to create new datasets allowing you to analyse products and customers.


Transactional item data can be used to create a number of other useful datasets to help you analyse ecommerce products and customers. From the core list of items purchased you can create additional datasets containing order data, customer data, and product data, which can be used for analysis, reporting, or modeling.

Pandas makes this process really straightforward, thanks to its helpful groupby() and agg() functions, which allow you to group by a given column and then calculate aggregate statistics on the selected data. As well as standard metrics, such as sum, count, min, max, nunique, and std, you can also pass in NumPy functions or create your own custom metrics. Here’s how it’s done.
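As a quick illustration of that flexibility, here’s a minimal sketch on a small made-up dataframe (not the retail data used below), mixing a built-in metric name, a NumPy function, and a custom lambda in a single agg() call:

```python
import pandas as pd
import numpy as np

# A tiny hypothetical transactions dataframe, purely for illustration
df = pd.DataFrame({
    'invoice': ['A1', 'A1', 'A2', 'A2', 'A2'],
    'qty': [2, 1, 5, 3, 4],
})

# agg() accepts built-in metric names, NumPy functions, and custom callables
summary = df.groupby('invoice').agg(
    total_qty=('qty', 'sum'),                        # built-in string metric
    max_qty=('qty', np.max),                         # NumPy function
    qty_range=('qty', lambda s: s.max() - s.min()),  # custom metric
)
print(summary)
```

Each keyword argument becomes a column in the result, named after the keyword, which keeps the output tidy without a rename step.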

Load the order items

The dataset I’m using here is the Online Retail dataset from the UCI Machine Learning Repository. It’s typical of the transactional item data you’ll deal with every day if you work in ecommerce, and contains one line per SKU, with details on the order ID, the customer ID, the order date, and the price paid. To tidy it up, I’ve renamed some of the columns and skipped the first row of headers.

import pandas as pd
df = pd.read_csv('online-retail.csv',
                 names=['invoice', 'sku', 'description', 'qty', 'order_date',
                        'unit_price', 'customer_id', 'country'],
                 low_memory=False,
                 skiprows=1)
df.head()
invoice sku description qty order_date unit_price customer_id country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom

Feature engineering

This dataset doesn’t include a line price, so to save hassle in later steps, we’ll first calculate one by multiplying the qty field by the unit_price field. The order_date field is also an object instead of a datetime, so we’ll convert it to the correct type using the Pandas to_datetime() function.

df['line_price'] = df['qty'] * df['unit_price']
df['order_date'] = pd.to_datetime(df['order_date'])
df.head()
invoice sku description qty order_date unit_price customer_id country line_price
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34

Create an orders dataframe

From our single dataframe of transactional items, we can now use grouping and aggregation to create other dataframes containing data on orders, customers, and products. To create the orders dataframe, we first groupby() the invoice column and then use the agg() function to calculate new metrics for each of the columns of interest.

To calculate the revenue on each order we use the revenue=('line_price', 'sum') argument, which calculates the sum of the line_price column. We’ll calculate the number of unique SKUs in the order with nunique and the total number of items by calculating the sum of the qty field.

df_orders = df.groupby('invoice').agg(
    revenue=('line_price', 'sum'),
    skus=('sku', 'nunique'),
    items=('qty', 'sum'),
)

df_orders.head()
revenue skus items
invoice
536365 139.12 7 40
536366 22.20 2 12
536367 278.73 12 83
536368 70.05 4 15
536369 17.85 1 3
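Because df_orders now holds one row per order, order-level statistics become simple column operations. Here’s a self-contained sketch using a couple of made-up order lines (not the full dataset) that repeats the same aggregation and then computes an average order value:

```python
import pandas as pd

# Hypothetical order items across two invoices, mirroring the structure above
items = pd.DataFrame({
    'invoice': ['536365', '536365', '536366'],
    'sku': ['85123A', '71053', '22633'],
    'qty': [6, 6, 6],
    'line_price': [15.30, 20.34, 11.10],
})

# Same named aggregation as in the article: one row per order
df_orders = items.groupby('invoice').agg(
    revenue=('line_price', 'sum'),
    skus=('sku', 'nunique'),
    items=('qty', 'sum'),
)

# With one row per order, the average order value is just the column mean
aov = df_orders['revenue'].mean()
print(df_orders)
print(round(aov, 2))
```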

Create a customers dataframe

This should be pretty self-explanatory, as we’re just repeating the above approach to create a customers dataframe. This groups on customer_id and then uses agg() to calculate the total revenue, the number of unique orders, the number of unique SKUs, the total number of units, and the first and last dates of the customer’s orders.

df_customers = df.groupby('customer_id').agg(
    revenue=('line_price', 'sum'),
    orders=('invoice', 'nunique'),
    skus=('sku', 'nunique'),
    units=('qty', 'sum'),
    first_order_date=('order_date', 'min'),
    last_order_date=('order_date', 'max')    
).reset_index().sort_values(by='revenue', ascending=False)

df_customers.head()
customer_id revenue orders skus units first_order_date last_order_date
1703 14646.0 279489.02 77 703 196719 2010-12-20 10:09:00 2011-12-08 12:12:00
4233 18102.0 256438.49 62 151 64122 2010-12-07 16:42:00 2011-12-09 11:50:00
3758 17450.0 187482.17 55 127 69029 2010-12-07 09:23:00 2011-12-01 13:29:00
1895 14911.0 132572.62 248 1794 77180 2010-12-01 14:05:00 2011-12-08 15:54:00
55 12415.0 123725.45 26 444 77242 2011-01-06 11:12:00 2011-11-15 14:22:00
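The first and last order dates are handy for derived customer metrics. As a sketch on made-up dates (not the real customer rows), subtracting the two datetime columns yields a Timedelta, from which the .dt.days accessor gives each customer’s tenure in whole days:

```python
import pandas as pd

# Hypothetical customer summary rows with first/last order dates
df_customers = pd.DataFrame({
    'customer_id': [14646.0, 18102.0],
    'first_order_date': pd.to_datetime(['2010-12-20 10:09:00',
                                        '2010-12-07 16:42:00']),
    'last_order_date': pd.to_datetime(['2011-12-08 12:12:00',
                                       '2011-12-09 11:50:00']),
})

# Subtracting two datetime columns produces a Timedelta series;
# .dt.days extracts the whole-day component
df_customers['tenure_days'] = (
    df_customers['last_order_date'] - df_customers['first_order_date']
).dt.days
print(df_customers[['customer_id', 'tenure_days']])
```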

Create a products dataframe

Finally, to create a products dataframe, we can groupby() the sku column and use the agg() function to calculate the total number of units sold, the average number of units purchased by each customer, the average line price paid, the total number of unique invoices and the total number of customers who’ve purchased the product.

df_products = df.groupby('sku').agg(
    units=('qty', 'sum'),
    avg_units=('qty', 'mean'),    
    revenue=('line_price', 'sum'),
    orders=('invoice', 'nunique'),
    customers=('customer_id', 'nunique'),
)

df_products.head()
units avg_units revenue orders customers
sku
10002 1037 14.205479 759.89 73 40
10080 495 20.625000 119.09 24 19
10120 193 6.433333 40.53 29 25
10123C -13 -3.250000 3.25 4 3
10123G -38 -38.000000 0.00 1 0
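Note the negative units on some SKUs above: in this dataset, returns appear as rows with negative quantities. If you want product metrics based on sales only, one option (sketched here on made-up rows, not a step from the article) is to filter out negative quantities before grouping:

```python
import pandas as pd

# Hypothetical order items including a return (negative qty)
items = pd.DataFrame({
    'sku': ['10002', '10002', '10123C'],
    'invoice': ['536365', '536370', 'C536379'],
    'customer_id': [17850.0, 12583.0, 14527.0],
    'qty': [12, 24, -4],
    'line_price': [10.20, 20.40, -3.25],
})

# Excluding returns (qty < 0) before aggregating avoids negative unit totals
df_products = items[items['qty'] > 0].groupby('sku').agg(
    units=('qty', 'sum'),
    revenue=('line_price', 'sum'),
    customers=('customer_id', 'nunique'),
)
print(df_products)
```

Whether to exclude returns depends on the analysis: for demand forecasting you may want sales only, while for net revenue reporting you would keep them in.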

Matt Clarke, Saturday, March 06, 2021

