How to group and aggregate transactional data using Pandas

Learn how to group and aggregate transactional data using Pandas to create new datasets allowing you to analyse products and customers.


Transactional item data can be used to create a number of other useful datasets to help you analyse ecommerce products and customers. From the core list of items purchased you can create additional datasets containing order data, customer data, and product data, which can be used for analysis, reporting, or modeling.

Pandas makes this process really straightforward, thanks to its helpful groupby() and agg() functions, which allow you to group by a given column and then calculate aggregate statistics on the selected data. As well as standard metrics, such as sum, count, min, max, nunique, and std, you can also pass in NumPy functions or create your own custom metrics. Here’s how it’s done.
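As a quick illustration of that flexibility, here’s a minimal sketch on a small made-up dataframe (not the retail data used below), mixing a built-in metric name, a NumPy function, and a custom lambda in a single agg() call:

```python
import pandas as pd
import numpy as np

# A tiny hypothetical transactions dataframe, purely for illustration
df = pd.DataFrame({
    'invoice': ['A1', 'A1', 'A2', 'A2', 'A2'],
    'qty': [2, 1, 5, 3, 4],
})

# agg() accepts built-in metric names, NumPy functions, and custom callables
summary = df.groupby('invoice').agg(
    total_qty=('qty', 'sum'),                        # built-in string metric
    max_qty=('qty', np.max),                         # NumPy function
    qty_range=('qty', lambda s: s.max() - s.min()),  # custom metric
)
print(summary)
```

Each keyword argument becomes a column in the result, named after the keyword, which keeps the output tidy without a rename step.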

Load the order items

The dataset I’m using here is the Online Retail dataset from the UCI Machine Learning Repository. It’s typical of the transactional item data you’ll deal with every day if you work in ecommerce, and contains one line per SKU, with details on the order ID, the customer ID, the order date, and the price paid. To tidy it up, I’ve renamed some of the columns and skipped the first row of headers.

import pandas as pd
df = pd.read_csv('online-retail.csv',
                 names=['invoice', 'sku', 'description', 'qty', 'order_date',
                        'unit_price', 'customer_id', 'country'],
                 low_memory=False,
                 skiprows=1)
df.head()
invoice sku description qty order_date unit_price customer_id country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom

Feature engineering

This dataset doesn’t include a line price, so to save hassle in later steps, we’ll first calculate one by multiplying the qty field by the unit_price field. The order_date field is also an object instead of a datetime, so we’ll convert it to the correct type using the Pandas to_datetime() function.

df['line_price'] = df['qty'] * df['unit_price']
df['order_date'] = pd.to_datetime(df['order_date'])
df.head()
invoice sku description qty order_date unit_price customer_id country line_price
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 15.30
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 22.00
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 20.34

Create an orders dataframe

From our single dataframe of transactional items, we can now use grouping and aggregation to create other dataframes containing data on orders, customers, and products. To create the orders dataframe, we first groupby() the invoice column and then use the agg() function to calculate new metrics for each of the columns of interest.

To calculate the revenue on each order we use the revenue=('line_price', 'sum') argument, which calculates the sum of the line_price column. We’ll calculate the number of unique SKUs in the order with nunique and the total number of items by calculating the sum of the qty field.

df_orders = df.groupby('invoice').agg(
    revenue=('line_price', 'sum'),
    skus=('sku', 'nunique'),
    items=('qty', 'sum'),
)

df_orders.head()
revenue skus items
invoice
536365 139.12 7 40
536366 22.20 2 12
536367 278.73 12 83
536368 70.05 4 15
536369 17.85 1 3
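Because df_orders now holds one row per order, order-level statistics become simple column operations. Here’s a self-contained sketch using a couple of made-up order lines (not the full dataset) that repeats the same aggregation and then computes an average order value:

```python
import pandas as pd

# Hypothetical order items across two invoices, mirroring the structure above
items = pd.DataFrame({
    'invoice': ['536365', '536365', '536366'],
    'sku': ['85123A', '71053', '22633'],
    'qty': [6, 6, 6],
    'line_price': [15.30, 20.34, 11.10],
})

# Same named aggregation as in the article: one row per order
df_orders = items.groupby('invoice').agg(
    revenue=('line_price', 'sum'),
    skus=('sku', 'nunique'),
    items=('qty', 'sum'),
)

# With one row per order, the average order value is just the column mean
aov = df_orders['revenue'].mean()
print(df_orders)
print(round(aov, 2))
```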

Create a customers dataframe

This should be pretty self-explanatory, as we’re just repeating the above approach to create a customers dataframe. This groups on customer_id and then uses agg() to calculate the total revenue, the number of unique orders, the number of unique SKUs, the total number of units, and the first and last dates of the customer’s orders.

df_customers = df.groupby('customer_id').agg(
    revenue=('line_price', 'sum'),
    orders=('invoice', 'nunique'),
    skus=('sku', 'nunique'),
    units=('qty', 'sum'),
    first_order_date=('order_date', 'min'),
    last_order_date=('order_date', 'max')    
).reset_index().sort_values(by='revenue', ascending=False)

df_customers.head()
customer_id revenue orders skus units first_order_date last_order_date
1703 14646.0 279489.02 77 703 196719 2010-12-20 10:09:00 2011-12-08 12:12:00
4233 18102.0 256438.49 62 151 64122 2010-12-07 16:42:00 2011-12-09 11:50:00
3758 17450.0 187482.17 55 127 69029 2010-12-07 09:23:00 2011-12-01 13:29:00
1895 14911.0 132572.62 248 1794 77180 2010-12-01 14:05:00 2011-12-08 15:54:00
55 12415.0 123725.45 26 444 77242 2011-01-06 11:12:00 2011-11-15 14:22:00
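The first and last order dates are handy for derived customer metrics. As a sketch on made-up dates (not the real customer rows), subtracting the two datetime columns yields a Timedelta, from which the .dt.days accessor gives each customer’s tenure in whole days:

```python
import pandas as pd

# Hypothetical customer summary rows with first/last order dates
df_customers = pd.DataFrame({
    'customer_id': [14646.0, 18102.0],
    'first_order_date': pd.to_datetime(['2010-12-20 10:09:00',
                                        '2010-12-07 16:42:00']),
    'last_order_date': pd.to_datetime(['2011-12-08 12:12:00',
                                       '2011-12-09 11:50:00']),
})

# Subtracting two datetime columns produces a Timedelta series;
# .dt.days extracts the whole-day component
df_customers['tenure_days'] = (
    df_customers['last_order_date'] - df_customers['first_order_date']
).dt.days
print(df_customers[['customer_id', 'tenure_days']])
```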

Create a products dataframe

Finally, to create a products dataframe, we can groupby() the sku column and use the agg() function to calculate the total number of units sold, the average number of units purchased by each customer, the average line price paid, the total number of unique invoices and the total number of customers who’ve purchased the product.

df_products = df.groupby('sku').agg(
    units=('qty', 'sum'),
    avg_units=('qty', 'mean'),    
    revenue=('line_price', 'sum'),
    orders=('invoice', 'nunique'),
    customers=('customer_id', 'nunique'),
)

df_products.head()
units avg_units revenue orders customers
sku
10002 1037 14.205479 759.89 73 40
10080 495 20.625000 119.09 24 19
10120 193 6.433333 40.53 29 25
10123C -13 -3.250000 3.25 4 3
10123G -38 -38.000000 0.00 1 0
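Note the negative units on some SKUs above: in this dataset, returns appear as rows with negative quantities. If you want product metrics based on sales only, one option (sketched here on made-up rows, not a step from the article) is to filter out negative quantities before grouping:

```python
import pandas as pd

# Hypothetical order items including a return (negative qty)
items = pd.DataFrame({
    'sku': ['10002', '10002', '10123C'],
    'invoice': ['536365', '536370', 'C536379'],
    'customer_id': [17850.0, 12583.0, 14527.0],
    'qty': [12, 24, -4],
    'line_price': [10.20, 20.40, -3.25],
})

# Excluding returns (qty < 0) before aggregating avoids negative unit totals
df_products = items[items['qty'] > 0].groupby('sku').agg(
    units=('qty', 'sum'),
    revenue=('line_price', 'sum'),
    customers=('customer_id', 'nunique'),
)
print(df_products)
```

Whether to exclude returns depends on the analysis: for demand forecasting you may want sales only, while for net revenue reporting you would keep them in.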

Matt Clarke, Saturday, March 06, 2021

