When working with Pandas dataframes, it’s a very common task to calculate the difference between two rows. For example, you might want to calculate the difference in the number of visitors to your website between two days, or the difference in the price of a stock between two days.
There are actually a number of different ways to calculate the difference between two rows in Pandas and calculate their percentage change. In this post, we’ll look at two of the most common methods: diff()
and pct_change()
, which are designed specifically for this task, and doing the same thing across column values.
To get started, open a new Jupyter notebook and import the data. We’ll use the pandas
library to read the data from a CSV file into a dataframe using the read_csv()
function. Pandas supports importing data from a number of different file formats, including CSV, Excel, JSON, and SQL.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/causal_impact_dataset.csv')
df.head()
date | clicks | impressions | ctr | position | |
---|---|---|---|---|---|
0 | 2021-07-04 | 136 | 6301 | 2.16 | 31.97 |
1 | 2021-07-05 | 264 | 8697 | 3.04 | 27.66 |
2 | 2021-07-06 | 299 | 9236 | 3.24 | 26.38 |
3 | 2021-07-07 | 276 | 10008 | 2.76 | 26.77 |
4 | 2021-07-08 | 283 | 9725 | 2.91 | 25.83 |
Crucially, you need to ensure your Pandas dataframe has been sorted into a logical order before you calculate the differences between rows or their percentage change. You can do this by appending .sort_values(by='column_name_here')
to the end of your dataframe, and passing in the column name you want to sort by.
To calculate the difference between selected values in each row of our dataframe we’ll simply append .diff()
to the end of our column name and then assign the value to a new column in our dataframe. We don’t need to do it here, but the axis
parameter can be used to calculate the difference between columns instead of rows, and the periods
parameter can be used to calculate the difference between rows that are further apart than the next row by using shift()
.
```python
df['clicks_diff'] = df['clicks'].diff()
df['impressions_diff'] = df['impressions'].diff()
df['ctr_diff'] = df['ctr'].diff()
df['position_diff'] = df['position'].diff()
df.head(10)
date | clicks | impressions | ctr | position | clicks_diff | impressions_diff | ctr_diff | position_diff | |
---|---|---|---|---|---|---|---|---|---|
0 | 2021-07-04 | 136 | 6301 | 2.16 | 31.97 | NaN | NaN | NaN | NaN |
1 | 2021-07-05 | 264 | 8697 | 3.04 | 27.66 | 128.0 | 2396.0 | 0.88 | -4.31 |
2 | 2021-07-06 | 299 | 9236 | 3.24 | 26.38 | 35.0 | 539.0 | 0.20 | -1.28 |
3 | 2021-07-07 | 276 | 10008 | 2.76 | 26.77 | -23.0 | 772.0 | -0.48 | 0.39 |
4 | 2021-07-08 | 283 | 9725 | 2.91 | 25.83 | 7.0 | -283.0 | 0.15 | -0.94 |
5 | 2021-07-09 | 205 | 8273 | 2.48 | 27.65 | -78.0 | -1452.0 | -0.43 | 1.82 |
6 | 2021-07-10 | 106 | 7186 | 1.48 | 33.74 | -99.0 | -1087.0 | -1.00 | 6.09 |
7 | 2021-07-11 | 164 | 7431 | 2.21 | 31.75 | 58.0 | 245.0 | 0.73 | -1.99 |
8 | 2021-07-12 | 316 | 11101 | 2.85 | 26.94 | 152.0 | 3670.0 | 0.64 | -4.81 |
9 | 2021-07-13 | 314 | 11494 | 2.73 | 26.90 | -2.0 | 393.0 | -0.12 | -0.04 |
The same kind of approach can be used to calculate the percentage change between selected values in each row of our dataframe. As with diff()
, we simply append .pct_change()
to the end of the column name and then assign the value to a new column.
The pct_change()
function will calculate the percentage change between each row and the previous row. This means that the first row will always be NaN
as there is no previous row to compare it to. As with diff()
, the pct_change()
function has some other arguments that can be used to change the behaviour of the function.
For example, we can use the periods
argument to specify the number of rows to compare to. This is useful if we want to compare the current row to a row that is not the previous row. For example, if we wanted to compare the current row to the row that was 3 rows ago, we could use periods=3
.
df['clicks_pct_change'] = df['clicks'].pct_change()
df['impressions_pct_change'] = df['impressions'].pct_change()
df['ctr_pct_change'] = df['ctr'].pct_change()
df['position_pct_change'] = df['position'].pct_change()
df.head(10)
date | clicks | impressions | ctr | position | clicks_diff | impressions_diff | ctr_diff | position_diff | clicks_pct_change | impressions_pct_change | ctr_pct_change | position_pct_change | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2021-07-04 | 136 | 6301 | 2.16 | 31.97 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2021-07-05 | 264 | 8697 | 3.04 | 27.66 | 128.0 | 2396.0 | 0.88 | -4.31 | 0.941176 | 0.380257 | 0.407407 | -0.134814 |
2 | 2021-07-06 | 299 | 9236 | 3.24 | 26.38 | 35.0 | 539.0 | 0.20 | -1.28 | 0.132576 | 0.061975 | 0.065789 | -0.046276 |
3 | 2021-07-07 | 276 | 10008 | 2.76 | 26.77 | -23.0 | 772.0 | -0.48 | 0.39 | -0.076923 | 0.083586 | -0.148148 | 0.014784 |
4 | 2021-07-08 | 283 | 9725 | 2.91 | 25.83 | 7.0 | -283.0 | 0.15 | -0.94 | 0.025362 | -0.028277 | 0.054348 | -0.035114 |
5 | 2021-07-09 | 205 | 8273 | 2.48 | 27.65 | -78.0 | -1452.0 | -0.43 | 1.82 | -0.275618 | -0.149306 | -0.147766 | 0.070461 |
6 | 2021-07-10 | 106 | 7186 | 1.48 | 33.74 | -99.0 | -1087.0 | -1.00 | 6.09 | -0.482927 | -0.131391 | -0.403226 | 0.220253 |
7 | 2021-07-11 | 164 | 7431 | 2.21 | 31.75 | 58.0 | 245.0 | 0.73 | -1.99 | 0.547170 | 0.034094 | 0.493243 | -0.058980 |
8 | 2021-07-12 | 316 | 11101 | 2.85 | 26.94 | 152.0 | 3670.0 | 0.64 | -4.81 | 0.926829 | 0.493877 | 0.289593 | -0.151496 |
9 | 2021-07-13 | 314 | 11494 | 2.73 | 26.90 | -2.0 | 393.0 | -0.12 | -0.04 | -0.006329 | 0.035402 | -0.042105 | -0.001485 |
To calculate the percentage change in a metric versus the same day last week we can pass in a value to the periods
argument of the pct_change()
function. This will calculate the percentage change in the metric versus the same day last week.
# Calculate the percentage change between each row and the previous week
df['clicks_pct_change_week'] = df['clicks'].pct_change(periods=7)
df['impressions_pct_change_week'] = df['impressions'].pct_change(periods=7)
df['ctr_pct_change_week'] = df['ctr'].pct_change(periods=7)
df['position_pct_change_week'] = df['position'].pct_change(periods=7)
# Show the original data and the weekly percentage changes
df[['clicks', 'clicks_pct_change_week', 'impressions', 'impressions_pct_change_week', 'ctr', 'ctr_pct_change_week', 'position', 'position_pct_change_week']].tail(10)
clicks | clicks_pct_change_week | impressions | impressions_pct_change_week | ctr | ctr_pct_change_week | position | position_pct_change_week | |
---|---|---|---|---|---|---|---|---|
18 | 424 | 0.452055 | 15988 | 0.367080 | 2.65 | 0.060000 | 26.65 | -0.026662 |
19 | 357 | 0.469136 | 14590 | 0.343586 | 2.45 | 0.093750 | 26.92 | -0.094517 |
20 | 256 | 0.651613 | 11987 | 0.548508 | 2.14 | 0.070000 | 30.34 | -0.081997 |
21 | 257 | 0.502924 | 12227 | 0.365078 | 2.10 | 0.099476 | 29.55 | -0.054097 |
22 | 471 | 0.539216 | 15934 | 0.232900 | 2.96 | 0.248945 | 24.28 | -0.104390 |
23 | 462 | 0.350877 | 15516 | 0.191339 | 2.98 | 0.133080 | 24.55 | -0.051757 |
24 | 418 | 0.208092 | 15590 | 0.104029 | 2.68 | 0.093878 | 24.90 | -0.058245 |
25 | 477 | 0.125000 | 16842 | 0.053415 | 2.83 | 0.067925 | 23.83 | -0.105816 |
26 | 340 | -0.047619 | 13867 | -0.049554 | 2.45 | 0.000000 | 26.18 | -0.027489 |
27 | 194 | -0.242188 | 9749 | -0.186702 | 1.99 | -0.070093 | 31.42 | 0.035597 |
Matt Clarke, Saturday, September 10, 2022