Picture by Ylanite Koppens, Pexels.

When working with data, one common task you’ll face is identifying what’s changed. For example, let’s say you’ve used your web scraping skills to build an ecommerce product price scraper and are gathering data every day on your competitors. Your product team needs to know which products have changed, so it can determine whether to drop prices to match, or increase prices to make some extra margin.

In this tutorial we’ll go over some practical code examples that show how to spot differences between two Pandas dataframes: identifying which columns have changed, pinpointing the specific rows that differ, and calculating by how much the values have changed. For this we’ll be working with some real data and using the `equals()`, `merge()`, `concat()`, and `compare()` Pandas functions. Let’s get started.

First, open a Jupyter notebook and load up a couple of dataframes. I’ve created a couple of dataframes that include product SKUs, prices, and the stock inventory for a fictitious competitor’s ecommerce site. One dataset was collected on Monday and the other on Tuesday. Our challenge is to see if we can identify what’s changed between the two dataframes to see how many units the competitor sold and what changes they made to their pricing.

``````
import pandas as pd
``````

``````
monday = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/monday.csv')
monday
``````
sku price quantity
0 ABC1 9.99 10
1 ABC2 10.99 8
2 ABC3 78.34 394
3 ABC4 372.43 291
4 ABC5 99.99 1
``````
tuesday = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/tuesday.csv')
tuesday
``````
sku price quantity
0 ABC1 9.99 5
1 ABC2 10.99 2
2 ABC3 69.99 288
3 ABC4 399.99 500
4 ABC5 99.99 1
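If the CSV files above aren’t reachable, the same two dataframes can be recreated inline from the values shown in the outputs. This is a minimal sketch assuming you only need the sample data itself; the column names and values match the printouts above.

```python
import pandas as pd

# Recreate the Monday snapshot inline (values match the CSV output above)
monday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 78.34, 372.43, 99.99],
    'quantity': [10, 8, 394, 291, 1],
})

# Recreate the Tuesday snapshot inline
tuesday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 69.99, 399.99, 99.99],
    'quantity': [5, 2, 288, 500, 1],
})
```

The rest of the tutorial works the same whether the dataframes were loaded from the CSVs or built this way.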

### Use equals() to check if the dataframes are identical

First, we’ll perform a basic test to see if the two dataframes are identical or different. We can do that by appending the `equals()` function to our first Pandas dataframe and then passing in the name of the second dataframe as an argument. If they’re the same, the `equals()` function will return `True`, and if they’re different it will return `False`.

``````
monday.equals(tuesday)
``````

``````
False
``````
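If you want a quick count of how many cells differ per column, rather than a single `True`/`False`, an elementwise comparison works too. A minimal sketch using inline copies of the sample data; note that unlike `equals()`, the `!=` operator treats `NaN` values as different, which doesn’t matter here because the data contains no `NaN`s.

```python
import pandas as pd

monday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 78.34, 372.43, 99.99],
    'quantity': [10, 8, 394, 291, 1],
})
tuesday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 69.99, 399.99, 99.99],
    'quantity': [5, 2, 288, 500, 1],
})

# Compare elementwise, then count the differing cells in each column
diff_counts = (monday != tuesday).sum()
print(diff_counts)  # sku: 0, price: 2, quantity: 4
```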

### Use equals() to check which dataframe columns have changed

To identify which specific columns in our Pandas dataframe have values that have changed, we can repeat the process but call `equals()` on the individual Pandas column, or series, we want to check. Running this on our three columns shows that the `sku` column is the same in both dataframes, but the `price` and `quantity` columns have changed.

``````
monday['sku'].equals(tuesday['sku'])
``````

``````
True
``````

``````
monday['price'].equals(tuesday['price'])
``````

``````
False
``````

``````
monday['quantity'].equals(tuesday['quantity'])
``````

``````
False
``````
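Rather than checking each column by hand, the same test can be run in a loop. A short sketch, assuming both dataframes share the same columns and using inline copies of the sample data:

```python
import pandas as pd

monday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 78.34, 372.43, 99.99],
    'quantity': [10, 8, 394, 291, 1],
})
tuesday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 69.99, 399.99, 99.99],
    'quantity': [5, 2, 288, 500, 1],
})

# Collect the names of the columns whose values differ
changed_columns = [col for col in monday.columns
                   if not monday[col].equals(tuesday[col])]
print(changed_columns)  # ['price', 'quantity']
```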

### Find rows that are the same in both dataframes

To find the unchanged rows that are the same in both dataframes we can use the `merge()` function to merge the `monday` and `tuesday` dataframes together, performing an `inner` join on the data. The `indicator` argument controls whether an extra column is added stating whether each row was present in the left dataframe only, the right dataframe only, or both, but we don’t really need this, so we’ll leave it set to `False`.

``````
unchanged_rows = monday.merge(tuesday, how='inner', indicator=False)
unchanged_rows
``````
sku price quantity
0 ABC5 99.99 1
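If you do want to see where every row came from, flipping `indicator` to `True` on an `outer` join adds a `_merge` column flagging each row as `both`, `left_only`, or `right_only`. A sketch with inline copies of the same data:

```python
import pandas as pd

monday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 78.34, 372.43, 99.99],
    'quantity': [10, 8, 394, 291, 1],
})
tuesday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 69.99, 399.99, 99.99],
    'quantity': [5, 2, 288, 500, 1],
})

# Outer join on all shared columns; _merge shows each row's origin:
# 'both' = unchanged, 'left_only'/'right_only' = changed rows
audit = monday.merge(tuesday, how='outer', indicator=True)
print(audit)
```

Only the `ABC5` row is flagged `both`, matching the single unchanged row found above.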

### Find rows that are different in both dataframes

Next, we’ll find the rows that are different in both dataframes because the `quantity` or `price` column value has changed. To identify them, we’ll use the Pandas `concat()` function, passing a Python list containing the `monday` and `tuesday` dataframes, which joins the two dataframes vertically. Finally, we’ll use the `drop_duplicates()` function to drop any rows that appear in both.

This does show us the rows that are different in both dataframes, but we really want to see only the data from the `tuesday` dataframe, not both sets of data, so we need to do a bit more to make this useful.

``````
changed_rows = pd.concat([monday, tuesday]).drop_duplicates(keep=False)
changed_rows
``````
sku price quantity
0 ABC1 9.99 10
1 ABC2 10.99 8
2 ABC3 78.34 394
3 ABC4 372.43 291
0 ABC1 9.99 5
1 ABC2 10.99 2
2 ABC3 69.99 288
3 ABC4 399.99 500
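One small refinement that makes the combined output easier to eyeball is sorting by `sku`, so each product’s before and after rows sit next to each other. A sketch, using a stable sort so the `monday` row stays above the `tuesday` row for each SKU:

```python
import pandas as pd

monday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 78.34, 372.43, 99.99],
    'quantity': [10, 8, 394, 291, 1],
})
tuesday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 69.99, 399.99, 99.99],
    'quantity': [5, 2, 288, 500, 1],
})

# Stack the dataframes, drop unchanged rows, then pair up each SKU's
# before/after rows; kind='stable' preserves the monday-then-tuesday order
changed_rows = pd.concat([monday, tuesday]).drop_duplicates(keep=False)
paired = changed_rows.sort_values('sku', kind='stable')
print(paired)
```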

### Find rows that have changed in the latest dataframe

There are various ways to find only the rows that have changed in the latest dataframe. The easiest way is to add a new column called `day` and assign it the name of the day or dataframe. We can then repeat the `concat()` process to merge the two dataframes vertically, drop the duplicates, and filter the dataframe so it only shows data from the most recent `tuesday` dataframe.

``````
monday['day'] = 'monday'
tuesday['day'] = 'tuesday'
``````

``````
changed_rows = pd.concat([monday, tuesday]).drop_duplicates(keep=False)
changed_rows[changed_rows['day']=='tuesday']
``````
sku price quantity day
0 ABC1 9.99 5 tuesday
1 ABC2 10.99 2 tuesday
2 ABC3 69.99 288 tuesday
3 ABC4 399.99 500 tuesday
4 ABC5 99.99 1 tuesday
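An alternative that avoids adding a helper `day` column is to combine `merge()` with `indicator=True` and keep only the `right_only` rows, i.e. the rows that exist solely in the `tuesday` dataframe. A sketch, reconstructing the dataframes inline without the `day` column:

```python
import pandas as pd

monday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 78.34, 372.43, 99.99],
    'quantity': [10, 8, 394, 291, 1],
})
tuesday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 69.99, 399.99, 99.99],
    'quantity': [5, 2, 288, 500, 1],
})

# Rows flagged 'right_only' appear only in tuesday, i.e. they changed
changed_in_tuesday = (
    monday.merge(tuesday, how='outer', indicator=True)
          .query("_merge == 'right_only'")
          .drop(columns='_merge')
)
print(changed_in_tuesday)
```

This leaves the source dataframes untouched, which matters if you plan to reuse them for other comparisons.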

### Show how values have changed between two dataframes

Arguably the coolest way to compare Pandas dataframes and show how the values have changed is to use the useful `compare()` function that was added in Pandas version 1.1. This aligns the two dataframes and adds a nifty `MultiIndex` to show the two sets of data alongside each other so you can see the differences between column values.

To use the `compare()` function we append `.compare()` to the `monday` dataframe, then pass the `tuesday` dataframe as the first argument. The `keep_shape=True` and `keep_equal=True` arguments are needed to ensure we see both full dataframes, including the data that hasn’t changed. Without these, Pandas will drop the `sku` column making the data much harder to compare.

We get back a `MultiIndex` dataframe in which our first dataframe, `monday`, is labeled `self`, while the dataframe we compared it to, `tuesday`, is labeled `other`. We can now see the full dataframes side by side and spot what’s changed between the two.

``````
comparison = monday.compare(tuesday, keep_shape=True, keep_equal=True)
comparison
``````
sku price quantity day
self other self other self other self other
0 ABC1 ABC1 9.99 9.99 10 5 monday tuesday
1 ABC2 ABC2 10.99 10.99 8 2 monday tuesday
2 ABC3 ABC3 78.34 69.99 394 288 monday tuesday
3 ABC4 ABC4 372.43 399.99 291 500 monday tuesday
4 ABC5 ABC5 99.99 99.99 1 1 monday tuesday
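For contrast, with the default `keep_shape=False` and `keep_equal=False`, `compare()` drops the unchanged rows and columns and blanks out equal cells with `NaN`, leaving only the differences. A sketch using the original dataframes without the `day` column:

```python
import pandas as pd

monday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 78.34, 372.43, 99.99],
    'quantity': [10, 8, 394, 291, 1],
})
tuesday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 69.99, 399.99, 99.99],
    'quantity': [5, 2, 288, 500, 1],
})

# Defaults: only changed rows and columns survive, so the unchanged
# sku column and the fully unchanged ABC5 row are dropped
diff = monday.compare(tuesday)
print(diff)
```

This compact view is handy when the dataframes are large and you only care about what moved.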

### Calculate changes in column values manually

Finally, there’s the manual approach to finding differences between column values in dataframes. Although a bit more laborious, this is generally the method I favour, since you can calculate useful metrics, such as change or percentage change with a couple of extra lines of code.

To do this effectively, the best technique is to use the Pandas `rename()` function to rename Pandas column names so they describe the pre- and post-period data, and to use the `drop()` function to get rid of any columns that are shared across both, with the exception of the column that you’re using for the join. Then it’s just a case of using `merge()` to join the data on their shared column and calculating the changes.

``````
monday = monday.rename(columns={'price': 'price_was', 'quantity': 'quantity_was'})
monday = monday.drop(columns=['day'])
tuesday = tuesday.rename(columns={'price': 'price_now', 'quantity': 'quantity_now'})
tuesday = tuesday.drop(columns=['day'])
``````

``````
merged = monday.merge(tuesday, how='left', on='sku')
merged['price_change'] = merged['price_now'] - merged['price_was']
merged['quantity_change'] = merged['quantity_now'] - merged['quantity_was']
merged = merged[['sku', 'price_was', 'price_now', 'price_change', 'quantity_was', 'quantity_now', 'quantity_change']]
``````

``````
merged
``````
sku price_was price_now price_change quantity_was quantity_now quantity_change
0 ABC1 9.99 9.99 0.00 10 5 -5
1 ABC2 10.99 10.99 0.00 8 2 -6
2 ABC3 78.34 69.99 -8.35 394 288 -106
3 ABC4 372.43 399.99 27.56 291 500 209
4 ABC5 99.99 99.99 0.00 1 1 0
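The percentage change mentioned earlier takes just one extra line from here. A sketch that rebuilds the merged dataframe from inline data, using the same column names as the example:

```python
import pandas as pd

# Inline copies of the sample data, renamed to pre/post column names
monday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 78.34, 372.43, 99.99],
    'quantity': [10, 8, 394, 291, 1],
}).rename(columns={'price': 'price_was', 'quantity': 'quantity_was'})

tuesday = pd.DataFrame({
    'sku': ['ABC1', 'ABC2', 'ABC3', 'ABC4', 'ABC5'],
    'price': [9.99, 10.99, 69.99, 399.99, 99.99],
    'quantity': [5, 2, 288, 500, 1],
}).rename(columns={'price': 'price_now', 'quantity': 'quantity_now'})

merged = monday.merge(tuesday, how='left', on='sku')
merged['price_change'] = merged['price_now'] - merged['price_was']

# Percentage change relative to the earlier price, rounded to 2 dp
merged['price_change_pct'] = (merged['price_change'] / merged['price_was'] * 100).round(2)
print(merged[['sku', 'price_was', 'price_now', 'price_change', 'price_change_pct']])
```

The same pattern works for `quantity`, or any other numeric column you want to track.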

If you’re only interested in the rows that have had a price change or a quantity change, it’s just a case of filtering the dataframe based on rows where the change is not equal to zero. Therefore, to show only the rows where the `price` has increased or decreased we can use `merged[merged['price_change']!=0]`.

``````
merged[merged['price_change']!=0]
``````
sku price_was price_now price_change quantity_was quantity_now quantity_change
2 ABC3 78.34 69.99 -8.35 394 288 -106
3 ABC4 372.43 399.99 27.56 291 500 209

Matt Clarke, Thursday, August 25, 2022

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.