How to find the differences between two Pandas dataframes

Picture by Ylanite Koppens, Pexels.

14 minutes to read

Data Science Pandas

When working with data, one common thing you’ll be tasked with doing is identifying what’s changed. For example, let’s say you’ve used your web scraping skills to build an ecommerce product price scraper and are gathering data every day on your competitors. Your product team need to know which products have changed, so they can determine whether they drop prices to match, or increase prices to make some extra margin.

In this tutorial we’ll go over some practical code examples that show how you can spot differences between two Pandas dataframes, identify which columns have values that have changed, and identify the specific rows that differ, and calculate by how much they’ve changed. For this we’ll be working with some real data and using the equals(), merge(), concat(), and compare() Pandas functions. Let’s get started

Load the data

First, open a Jupyter notebook and load up a couple of dataframes. I’ve created a couple of dataframes that include product SKUs, prices, and the stock inventory for a fictitious competitor’s ecommerce site. One dataset was collected on Monday and the other on Tuesday. Our challenge is to see if we can identify what’s changed between the two dataframes to see how many units the competitor sold and what changes they made to their pricing.

import pandas as pd

monday = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/monday.csv')
monday

	sku	price	quantity
0	ABC1	9.99	10
1	ABC2	10.99	8
2	ABC3	78.34	394
3	ABC4	372.43	291
4	ABC5	99.99	1

tuesday = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/tuesday.csv')
tuesday

	sku	price	quantity
0	ABC1	9.99	5
1	ABC2	10.99	2
2	ABC3	69.99	288
3	ABC4	399.99	500
4	ABC5	99.99	1

Use equals() to check if the dataframes are identical

First, we’ll perform a basic test to see if the two dataframes are identical or different. We can do that by appending the equals() function to our first Pandas dataframe and then passing in the name of the second dataframe as an argument. If they’re the same, the equals() function will return True, and if they’re different it will return False.

monday.equals(tuesday)

False

Use equals() to check which dataframe columns have changed

To identify which specific columns in our Pandas dataframe have values that have changed, we can repeat the process but append the equals() function to the Pandas column, or series, we want to check. Running this on our three columns shows that the sku column is the same in both dataframes, but the price and quantity column in each dataframe has changed.

monday['sku'].equals(tuesday['sku'])

True

monday['price'].equals(tuesday['price'])

False

monday['quantity'].equals(tuesday['quantity'])

False

Find rows that are the same in both dataframes

To find the unchanged rows that are the same in both dataframes we can use the merge() function to merge the monday and tuesday dataframes together. We’ll perform an inner join on the data. The indicator argument contains whether a column is added to state whether the value was present in the left_only, right_only, or both, but we don’t really need this, so we’ll set it to False.

unchanged_rows = monday.merge(tuesday, how='inner', indicator=False)
unchanged_rows

	sku	price	quantity
0	ABC5	99.99	1

Find rows that are different in both dataframes

Next, we’ll find the rows that are different in both dataframes because the quantity or price column value has changed. To identify the rows that are different in each dataframe, we’ll use the Pandas concat() function, to which we’ll pass a Python list containing the monday and tuesday dataframes, which joins the two dataframes vertically. Finally, we’ll use the drop_duplictes() function to drop any rows that are the same.

This does show us the rows that are different in both dataframes, but we really want to see only the data from the tuesday dataframe, not both sets of data, so we need to do a bit more to make this useful.

changed_rows = pd.concat([monday, tuesday]).drop_duplicates(keep=False)
changed_rows

	sku	price	quantity
0	ABC1	9.99	10
1	ABC2	10.99	8
2	ABC3	78.34	394
3	ABC4	372.43	291
0	ABC1	9.99	5
1	ABC2	10.99	2
2	ABC3	69.99	288
3	ABC4	399.99	500

Find rows that have changed in the latest dataframe

There are various ways to find only the rows that have changed in the latest dataframe. The easiest way is to add a new column called day and assign it the name of the day or dataframe. We can then repeat the concat() process to merge the two dataframes vertically, drop the duplicates, and filter the dataframe so it only shows data from the most recent tuesday dataframe.

monday['day'] = 'monday'
tuesday['day'] = 'tuesday'

changed_rows = pd.concat([monday, tuesday]).drop_duplicates(keep=False)
changed_rows[changed_rows['day']=='tuesday']

	sku	price	quantity	day
0	ABC1	9.99	5	tuesday
1	ABC2	10.99	2	tuesday
2	ABC3	69.99	288	tuesday
3	ABC4	399.99	500	tuesday
4	ABC5	99.99	1	tuesday

Show how values have changed between two dataframes

Arguably the coolest way to compare Pandas dataframes and show how the values have changed is to use the useful compare() function that was added in Pandas version 1.1. This merges the two dataframes but adds a nifty MultiIndex to show the two sets of data alongside each other so you can see the differences between column values.

To use the compare() function we append .compare() to the monday dataframe, then pass the tuesday dataframe as the first argument. The keep_shape=True and keep_equal=True arguments are needed to ensure we see both full dataframes, including the data that hasn’t changed. Without these, Pandas will drop the sku column making the data much harder to compare.

We get back a MultiIndex dataframe in which our first dataframe, monday, is labeled self, while our joined dataframe, tuesday, is labeled other. We can now see the full dataframes side by side and spot what’s changed between the two.

comparison = monday.compare(tuesday, keep_shape=True, keep_equal=True)
comparison

	sku		price		quantity		day
	self	other	self	other	self	other	self	other
0	ABC1	ABC1	9.99	9.99	10	5	monday	tuesday
1	ABC2	ABC2	10.99	10.99	8	2	monday	tuesday
2	ABC3	ABC3	78.34	69.99	394	288	monday	tuesday
3	ABC4	ABC4	372.43	399.99	291	500	monday	tuesday
4	ABC5	ABC5	99.99	99.99	1	1	monday	tuesday

Calculate changes in column values manually

Finally, there’s the manual approach to finding differences between column values in dataframes. Although a bit more laborious, this is generally the method I favour, since you can calculate useful metrics, such as change or percentage change with a couple of extra lines of code.

To do this effectively, the best technique is to use the Pandas rename() function to rename Pandas column names so they describe the pre- and post-period data, and to use the drop() function to get rid of any columns that are shared across both, with the exception of the column that you’re using for the join. Then it’s just a case of using merge() to join the data on their shared column and calculating the changes.

monday = monday.rename(columns={'price': 'price_was', 'quantity': 'quantity_was'})
monday = monday.drop(columns=['day'])
tuesday = tuesday.rename(columns={'price': 'price_now', 'quantity': 'quantity_now'})
tuesday = tuesday.drop(columns=['day'])

merged = monday.merge(tuesday, how='left', on='sku')
merged['price_change'] = merged['price_now'] - merged['price_was']
merged['quantity_change'] = merged['quantity_now'] - merged['quantity_was']
merged = merged[['sku', 'price_was', 'price_now', 'price_change', 'quantity_was', 'quantity_now', 'quantity_change']]

merged

	sku	price_was	price_now	price_change	quantity_was	quantity_now	quantity_change
0	ABC1	9.99	9.99	0.00	10	5	-5
1	ABC2	10.99	10.99	0.00	8	2	-6
2	ABC3	78.34	69.99	-8.35	394	288	-106
3	ABC4	372.43	399.99	27.56	291	500	209
4	ABC5	99.99	99.99	0.00	1	1	0

If you’re only interested in the rows that have had a price change or a quantity change, it’s just a case of filtering the dataframe based on rows where the change is not equal to zero. Therefore, to show only the rows where the price has increased or decreased we can use merged[merged['price_change']!=0].

merged[merged['price_change']!=0]

	sku	price_was	price_now	price_change	quantity_was	quantity_now	quantity_change
2	ABC3	78.34	69.99	-8.35	394	288	-106
3	ABC4	372.43	399.99	27.56	291	500	209

Matt Clarke, Thursday, August 25, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.