When working with Pandas dataframes you’ll often need to calculate the percentage change or percentage difference between the values in two columns. There are various ways to do this in Pandas.
In this quick and easy tutorial, I’ll show you three different approaches you can use to calculate the percentage change between two columns, including the Pandas pct_change()
function, lambda
functions, and custom functions added using both apply()
and assign()
.
To get started, open a Jupyter notebook, import the Pandas library using the import pandas as pd
naming convention, and create a Pandas dataframe that contains two columns of numeric data to compare. The simple example dataset below the number of orders placed from each of five countries over two years.
import pandas as pd
df = pd.DataFrame({'country':['USA','Canada','Mexico','Brazil','UK'],
'orders_2022': [23, 12, 34, 45, 67],
'orders_2023': [12, 34, 56, 78, 90]})
df
country | orders_2022 | orders_2023 | |
---|---|---|---|
0 | USA | 23 | 12 |
1 | Canada | 12 | 34 |
2 | Mexico | 34 | 56 |
3 | Brazil | 45 | 78 |
4 | UK | 67 | 90 |
Pandas, rather helpfully, includes a built-in function called pct_change()
that allows you to calculate the percentage change across rows or columns in a dataframe. By default, pct_change()
sets the optional axis
parameter to 0
which means that it will calculate the percentage change between one row and the next. However, by setting axis=1
we can calculate the percentage change between columns instead.
df['pct_change'] = df[['orders_2022', 'orders_2023']].pct_change(axis=1)['orders_2023']
df
country | orders_2022 | orders_2023 | pct_change | |
---|---|---|---|---|
0 | USA | 23 | 12 | -0.478261 |
1 | Canada | 12 | 34 | 1.833333 |
2 | Mexico | 34 | 56 | 0.647059 |
3 | Brazil | 45 | 78 | 0.733333 |
4 | UK | 67 | 90 | 0.343284 |
As you’ll notice above, pct_change()
really returns a fractional change rather than a percentage change, so the -47.8% change in orders for the USA between 2022 and 2023 is shown as -0.478261 instead of -0.478261%. You need to multiply the value by 100 to get the actual percentage difference or change. You may also wish to use round()
to round to two decimal places and cast the value to a str
dtype and append a percentage symbol to aid readability.
df['pct_change_str'] = ((df[['orders_2022', 'orders_2023']].pct_change(axis=1)['orders_2023']) * 100).round(2).map(str) + '%'
df[['country', 'orders_2022', 'orders_2023', 'pct_change_str']]
country | orders_2022 | orders_2023 | pct_change_str | |
---|---|---|---|---|
0 | USA | 23 | 12 | -47.83% |
1 | Canada | 12 | 34 | 183.33% |
2 | Mexico | 34 | 56 | 64.71% |
3 | Brazil | 45 | 78 | 73.33% |
4 | UK | 67 | 90 | 34.33% |
Another way to calculate percentage difference or percentage change between Pandas columns is via a lambda function. These anonymous functions are useful when you want to apply a function to a Pandas dataframe or series without having to define a function first. We can calculate the percentage difference and multiply it by 100 to get the percentage in a single line of code using the apply()
method.
df['pct_change_lambda'] = df[['orders_2022', 'orders_2023']].apply(lambda x: ((x[1] - x[0]) / x[0]) * 100, axis=1)
df[['country', 'orders_2022', 'orders_2023', 'pct_change_lambda']]
If you prefer to use the Pandas assign()
method, you can do so as well. The assign()
method also avoids the potential of getting the SettingWithCopyWarning
error.
df = df.assign(pct_change_assign = lambda x: ((x['orders_2023'] - x['orders_2022']) / x['orders_2022']) * 100)
df[['country', 'orders_2022', 'orders_2023', 'pct_change_assign']]
country | orders_2022 | orders_2023 | pct_change_assign | |
---|---|---|---|---|
0 | USA | 23 | 12 | -47.826087 |
1 | Canada | 12 | 34 | 183.333333 |
2 | Mexico | 34 | 56 | 64.705882 |
3 | Brazil | 45 | 78 | 73.333333 |
4 | UK | 67 | 90 | 34.328358 |
Finally, the other way to calculate the percentage difference between two columns is to create a custom function and apply it to the dataframe. While this means creating a custom function, it can result in cleaner code than using a lambda
function, so it’s worth considering if you want to avoid using pct_change()
so you’ve got total control over the output.
def custom_calc(x):
return (((x[1] - x[0]) / x[0]) * 100).round(2)
df['custom_calc_apply'] = df[['orders_2022', 'orders_2023']].apply(custom_calc, axis=1)
df[['country', 'orders_2022', 'orders_2023', 'custom_calc_apply']]
country | orders_2022 | orders_2023 | custom_calc_apply | |
---|---|---|---|---|
0 | USA | 23 | 12 | -47.83 |
1 | Canada | 12 | 34 | 183.33 |
2 | Mexico | 34 | 56 | 64.71 |
3 | Brazil | 45 | 78 | 73.33 |
4 | UK | 67 | 90 | 34.33 |
Matt Clarke, Sunday, January 01, 2023