When working with data, one common thing you’ll be tasked with doing is identifying what’s changed. For example, let’s say you’ve used your web scraping skills to build an ecommerce product price scraper and are gathering data every day on your competitors. Your product team need to know which products have changed, so they can determine whether they drop prices to match, or increase prices to make some extra margin.
In this tutorial we’ll go over some practical code examples that show how you can spot differences between two Pandas dataframes, identify which columns have values that have changed, and identify the specific rows that differ, and calculate by how much they’ve changed. For this we’ll be working with some real data and using the equals()
, merge()
, concat()
, and compare()
Pandas functions. Let’s get started
First, open a Jupyter notebook and load up a couple of dataframes. I’ve created a couple of dataframes that include product SKUs, prices, and the stock inventory for a fictitious competitor’s ecommerce site. One dataset was collected on Monday and the other on Tuesday. Our challenge is to see if we can identify what’s changed between the two dataframes to see how many units the competitor sold and what changes they made to their pricing.
import pandas as pd
monday = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/monday.csv')
monday
sku | price | quantity | |
---|---|---|---|
0 | ABC1 | 9.99 | 10 |
1 | ABC2 | 10.99 | 8 |
2 | ABC3 | 78.34 | 394 |
3 | ABC4 | 372.43 | 291 |
4 | ABC5 | 99.99 | 1 |
tuesday = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/tuesday.csv')
tuesday
sku | price | quantity | |
---|---|---|---|
0 | ABC1 | 9.99 | 5 |
1 | ABC2 | 10.99 | 2 |
2 | ABC3 | 69.99 | 288 |
3 | ABC4 | 399.99 | 500 |
4 | ABC5 | 99.99 | 1 |
First, we’ll perform a basic test to see if the two dataframes are identical or different. We can do that by appending the equals()
function to our first Pandas dataframe and then passing in the name of the second dataframe as an argument. If they’re the same, the equals()
function will return True
, and if they’re different it will return False
.
monday.equals(tuesday)
False
To identify which specific columns in our Pandas dataframe have values that have changed, we can repeat the process but append the equals()
function to the Pandas column, or series, we want to check. Running this on our three columns shows that the sku
column is the same in both dataframes, but the price
and quantity
column in each dataframe has changed.
monday['sku'].equals(tuesday['sku'])
True
monday['price'].equals(tuesday['price'])
False
monday['quantity'].equals(tuesday['quantity'])
False
To find the unchanged rows that are the same in both dataframes we can use the merge()
function to merge the monday
and tuesday
dataframes together. We’ll perform an inner
join on the data. The indicator
argument contains whether a column is added to state whether the value was present in the left_only
, right_only
, or both
, but we don’t really need this, so we’ll set it to False
.
unchanged_rows = monday.merge(tuesday, how='inner', indicator=False)
unchanged_rows
sku | price | quantity | |
---|---|---|---|
0 | ABC5 | 99.99 | 1 |
Next, we’ll find the rows that are different in both dataframes because the quantity
or price
column value has changed. To identify the rows that are different in each dataframe, we’ll use the Pandas concat()
function, to which we’ll pass a Python list containing the monday
and tuesday
dataframes, which joins the two dataframes vertically. Finally, we’ll use the drop_duplictes()
function to drop any rows that are the same.
This does show us the rows that are different in both dataframes, but we really want to see only the data from the tuesday
dataframe, not both sets of data, so we need to do a bit more to make this useful.
changed_rows = pd.concat([monday, tuesday]).drop_duplicates(keep=False)
changed_rows
sku | price | quantity | |
---|---|---|---|
0 | ABC1 | 9.99 | 10 |
1 | ABC2 | 10.99 | 8 |
2 | ABC3 | 78.34 | 394 |
3 | ABC4 | 372.43 | 291 |
0 | ABC1 | 9.99 | 5 |
1 | ABC2 | 10.99 | 2 |
2 | ABC3 | 69.99 | 288 |
3 | ABC4 | 399.99 | 500 |
There are various ways to find only the rows that have changed in the latest dataframe. The easiest way is to add a new column called day
and assign it the name of the day or dataframe. We can then repeat the concat()
process to merge the two dataframes vertically, drop the duplicates, and filter the dataframe so it only shows data from the most recent tuesday
dataframe.
monday['day'] = 'monday'
tuesday['day'] = 'tuesday'
changed_rows = pd.concat([monday, tuesday]).drop_duplicates(keep=False)
changed_rows[changed_rows['day']=='tuesday']
sku | price | quantity | day | |
---|---|---|---|---|
0 | ABC1 | 9.99 | 5 | tuesday |
1 | ABC2 | 10.99 | 2 | tuesday |
2 | ABC3 | 69.99 | 288 | tuesday |
3 | ABC4 | 399.99 | 500 | tuesday |
4 | ABC5 | 99.99 | 1 | tuesday |
Arguably the coolest way to compare Pandas dataframes and show how the values have changed is to use the useful compare()
function that was added in Pandas version 1.1. This merges the two dataframes but adds a nifty MultiIndex
to show the two sets of data alongside each other so you can see the differences between column values.
To use the compare()
function we append .compare()
to the monday
dataframe, then pass the tuesday
dataframe as the first argument. The keep_shape=True
and keep_equal=True
arguments are needed to ensure we see both full dataframes, including the data that hasn’t changed. Without these, Pandas will drop the sku
column making the data much harder to compare.
We get back a MultiIndex
dataframe in which our first dataframe, monday
, is labeled self
, while our joined dataframe, tuesday
, is labeled other. We can now see the full dataframes side by side and spot what’s changed between the two.
comparison = monday.compare(tuesday, keep_shape=True, keep_equal=True)
comparison
sku | price | quantity | day | |||||
---|---|---|---|---|---|---|---|---|
self | other | self | other | self | other | self | other | |
0 | ABC1 | ABC1 | 9.99 | 9.99 | 10 | 5 | monday | tuesday |
1 | ABC2 | ABC2 | 10.99 | 10.99 | 8 | 2 | monday | tuesday |
2 | ABC3 | ABC3 | 78.34 | 69.99 | 394 | 288 | monday | tuesday |
3 | ABC4 | ABC4 | 372.43 | 399.99 | 291 | 500 | monday | tuesday |
4 | ABC5 | ABC5 | 99.99 | 99.99 | 1 | 1 | monday | tuesday |
Finally, there’s the manual approach to finding differences between column values in dataframes. Although a bit more laborious, this is generally the method I favour, since you can calculate useful metrics, such as change or percentage change with a couple of extra lines of code.
To do this effectively, the best technique is to use the Pandas rename()
function to rename Pandas column names so they describe the pre- and post-period data, and to use the drop()
function to get rid of any columns that are shared across both, with the exception of the column that you’re using for the join. Then it’s just a case of using merge()
to join the data on their shared column and calculating the changes.
monday = monday.rename(columns={'price': 'price_was', 'quantity': 'quantity_was'})
monday = monday.drop(columns=['day'])
tuesday = tuesday.rename(columns={'price': 'price_now', 'quantity': 'quantity_now'})
tuesday = tuesday.drop(columns=['day'])
merged = monday.merge(tuesday, how='left', on='sku')
merged['price_change'] = merged['price_now'] - merged['price_was']
merged['quantity_change'] = merged['quantity_now'] - merged['quantity_was']
merged = merged[['sku', 'price_was', 'price_now', 'price_change', 'quantity_was', 'quantity_now', 'quantity_change']]
merged
sku | price_was | price_now | price_change | quantity_was | quantity_now | quantity_change | |
---|---|---|---|---|---|---|---|
0 | ABC1 | 9.99 | 9.99 | 0.00 | 10 | 5 | -5 |
1 | ABC2 | 10.99 | 10.99 | 0.00 | 8 | 2 | -6 |
2 | ABC3 | 78.34 | 69.99 | -8.35 | 394 | 288 | -106 |
3 | ABC4 | 372.43 | 399.99 | 27.56 | 291 | 500 | 209 |
4 | ABC5 | 99.99 | 99.99 | 0.00 | 1 | 1 | 0 |
If you’re only interested in the rows that have had a price change or a quantity change, it’s just a case of filtering the dataframe based on rows where the change is not equal to zero. Therefore, to show only the rows where the price
has increased or decreased we can use merged[merged['price_change']!=0]
.
merged[merged['price_change']!=0]
sku | price_was | price_now | price_change | quantity_was | quantity_now | quantity_change | |
---|---|---|---|---|---|---|---|
2 | ABC3 | 78.34 | 69.99 | -8.35 | 394 | 288 | -106 |
3 | ABC4 | 372.43 | 399.99 | 27.56 | 291 | 500 | 209 |
Matt Clarke, Thursday, August 25, 2022