How to find the differences between two Pandas dataframes

Learn how to compare Pandas dataframes and individual columns to see if there are any differences using the equals(), merge(), concat(), and compare() Pandas functions.

How to find the differences between two Pandas dataframes
Picture by Ylanite Koppens, Pexels.
14 minutes to read

When working with data, one common thing you’ll be tasked with doing is identifying what’s changed. For example, let’s say you’ve used your web scraping skills to build an ecommerce product price scraper and are gathering data every day on your competitors. Your product team need to know which products have changed, so they can determine whether they drop prices to match, or increase prices to make some extra margin.

In this tutorial we’ll go over some practical code examples that show how you can spot differences between two Pandas dataframes, identify which columns have values that have changed, and identify the specific rows that differ, and calculate by how much they’ve changed. For this we’ll be working with some real data and using the equals(), merge(), concat(), and compare() Pandas functions. Let’s get started

Load the data

First, open a Jupyter notebook and load up a couple of dataframes. I’ve created a couple of dataframes that include product SKUs, prices, and the stock inventory for a fictitious competitor’s ecommerce site. One dataset was collected on Monday and the other on Tuesday. Our challenge is to see if we can identify what’s changed between the two dataframes to see how many units the competitor sold and what changes they made to their pricing.

import pandas as pd
monday = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/monday.csv')
monday
sku price quantity
0 ABC1 9.99 10
1 ABC2 10.99 8
2 ABC3 78.34 394
3 ABC4 372.43 291
4 ABC5 99.99 1
tuesday = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/tuesday.csv')
tuesday
sku price quantity
0 ABC1 9.99 5
1 ABC2 10.99 2
2 ABC3 69.99 288
3 ABC4 399.99 500
4 ABC5 99.99 1

Use equals() to check if the dataframes are identical

First, we’ll perform a basic test to see if the two dataframes are identical or different. We can do that by appending the equals() function to our first Pandas dataframe and then passing in the name of the second dataframe as an argument. If they’re the same, the equals() function will return True, and if they’re different it will return False.

monday.equals(tuesday)
False

Use equals() to check which dataframe columns have changed

To identify which specific columns in our Pandas dataframe have values that have changed, we can repeat the process but append the equals() function to the Pandas column, or series, we want to check. Running this on our three columns shows that the sku column is the same in both dataframes, but the price and quantity column in each dataframe has changed.

monday['sku'].equals(tuesday['sku'])
True
monday['price'].equals(tuesday['price'])
False
monday['quantity'].equals(tuesday['quantity'])
False

Find rows that are the same in both dataframes

To find the unchanged rows that are the same in both dataframes we can use the merge() function to merge the monday and tuesday dataframes together. We’ll perform an inner join on the data. The indicator argument contains whether a column is added to state whether the value was present in the left_only, right_only, or both, but we don’t really need this, so we’ll set it to False.

unchanged_rows = monday.merge(tuesday, how='inner', indicator=False)
unchanged_rows
sku price quantity
0 ABC5 99.99 1

Find rows that are different in both dataframes

Next, we’ll find the rows that are different in both dataframes because the quantity or price column value has changed. To identify the rows that are different in each dataframe, we’ll use the Pandas concat() function, to which we’ll pass a Python list containing the monday and tuesday dataframes, which joins the two dataframes vertically. Finally, we’ll use the drop_duplictes() function to drop any rows that are the same.

This does show us the rows that are different in both dataframes, but we really want to see only the data from the tuesday dataframe, not both sets of data, so we need to do a bit more to make this useful.

changed_rows = pd.concat([monday, tuesday]).drop_duplicates(keep=False)
changed_rows
sku price quantity
0 ABC1 9.99 10
1 ABC2 10.99 8
2 ABC3 78.34 394
3 ABC4 372.43 291
0 ABC1 9.99 5
1 ABC2 10.99 2
2 ABC3 69.99 288
3 ABC4 399.99 500

Find rows that have changed in the latest dataframe

There are various ways to find only the rows that have changed in the latest dataframe. The easiest way is to add a new column called day and assign it the name of the day or dataframe. We can then repeat the concat() process to merge the two dataframes vertically, drop the duplicates, and filter the dataframe so it only shows data from the most recent tuesday dataframe.

monday['day'] = 'monday'
tuesday['day'] = 'tuesday'
changed_rows = pd.concat([monday, tuesday]).drop_duplicates(keep=False)
changed_rows[changed_rows['day']=='tuesday']
sku price quantity day
0 ABC1 9.99 5 tuesday
1 ABC2 10.99 2 tuesday
2 ABC3 69.99 288 tuesday
3 ABC4 399.99 500 tuesday
4 ABC5 99.99 1 tuesday

Show how values have changed between two dataframes

Arguably the coolest way to compare Pandas dataframes and show how the values have changed is to use the useful compare() function that was added in Pandas version 1.1. This merges the two dataframes but adds a nifty MultiIndex to show the two sets of data alongside each other so you can see the differences between column values.

To use the compare() function we append .compare() to the monday dataframe, then pass the tuesday dataframe as the first argument. The keep_shape=True and keep_equal=True arguments are needed to ensure we see both full dataframes, including the data that hasn’t changed. Without these, Pandas will drop the sku column making the data much harder to compare.

We get back a MultiIndex dataframe in which our first dataframe, monday, is labeled self, while our joined dataframe, tuesday, is labeled other. We can now see the full dataframes side by side and spot what’s changed between the two.

comparison = monday.compare(tuesday, keep_shape=True, keep_equal=True)
comparison
sku price quantity day
self other self other self other self other
0 ABC1 ABC1 9.99 9.99 10 5 monday tuesday
1 ABC2 ABC2 10.99 10.99 8 2 monday tuesday
2 ABC3 ABC3 78.34 69.99 394 288 monday tuesday
3 ABC4 ABC4 372.43 399.99 291 500 monday tuesday
4 ABC5 ABC5 99.99 99.99 1 1 monday tuesday

Calculate changes in column values manually

Finally, there’s the manual approach to finding differences between column values in dataframes. Although a bit more laborious, this is generally the method I favour, since you can calculate useful metrics, such as change or percentage change with a couple of extra lines of code.

To do this effectively, the best technique is to use the Pandas rename() function to rename Pandas column names so they describe the pre- and post-period data, and to use the drop() function to get rid of any columns that are shared across both, with the exception of the column that you’re using for the join. Then it’s just a case of using merge() to join the data on their shared column and calculating the changes.

monday = monday.rename(columns={'price': 'price_was', 'quantity': 'quantity_was'})
monday = monday.drop(columns=['day'])
tuesday = tuesday.rename(columns={'price': 'price_now', 'quantity': 'quantity_now'})
tuesday = tuesday.drop(columns=['day'])
merged = monday.merge(tuesday, how='left', on='sku')
merged['price_change'] = merged['price_now'] - merged['price_was']
merged['quantity_change'] = merged['quantity_now'] - merged['quantity_was']
merged = merged[['sku', 'price_was', 'price_now', 'price_change', 'quantity_was', 'quantity_now', 'quantity_change']]
merged
sku price_was price_now price_change quantity_was quantity_now quantity_change
0 ABC1 9.99 9.99 0.00 10 5 -5
1 ABC2 10.99 10.99 0.00 8 2 -6
2 ABC3 78.34 69.99 -8.35 394 288 -106
3 ABC4 372.43 399.99 27.56 291 500 209
4 ABC5 99.99 99.99 0.00 1 1 0

If you’re only interested in the rows that have had a price change or a quantity change, it’s just a case of filtering the dataframe based on rows where the change is not equal to zero. Therefore, to show only the rows where the price has increased or decreased we can use merged[merged['price_change']!=0].

merged[merged['price_change']!=0]
sku price_was price_now price_change quantity_was quantity_now quantity_change
2 ABC3 78.34 69.99 -8.35 394 288 -106
3 ABC4 372.43 399.99 27.56 291 500 209

Matt Clarke, Thursday, August 25, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.