The Pandas rolling()
method can be used to calculate a rolling mean or rolling average (also known as a moving average), which is simply the mean of a specific time series data column value over a specified number of periods.
Rolling means and rolling averages are commonly used in finance and retail sectors, where they are used to smooth out short-term fluctuations in data and highlight longer-term trends or cycles.
To get started, open a Jupyter notebook, import the Pandas library, find some time series data and load it into a Pandas dataframe. I’m using a dataset of ecommerce sales data I created for test purposes. You can access the file via my GitHub repository. It contains 19 months of ecommerce sales data and was exported from Google Analytics.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/monthly-ecommerce-data.csv')
df
year_month | sessions | transactions | conversion_rate | revenue | aov | |
---|---|---|---|---|---|---|
0 | 201912 | 21,133 | 129 | 0.610 | 18,377.60 | 142.462 |
1 | 202001 | 25,801 | 241 | 0.934 | 15,019.35 | 62.321 |
2 | 202002 | 21,712 | 193 | 0.889 | 8,214.80 | 42.564 |
3 | 202003 | 16,836 | 547 | 3.249 | 34,988.27 | 63.964 |
4 | 202004 | 27,635 | 692 | 2.504 | 50,703.73 | 73.271 |
5 | 202005 | 55,834 | 966 | 1.730 | 68,150.07 | 70.549 |
6 | 202006 | 51,665 | 773 | 1.496 | 47,291.56 | 61.179 |
7 | 202007 | 46,257 | 596 | 1.288 | 62,172.81 | 104.317 |
8 | 202008 | 46,016 | 563 | 1.223 | 39,721.21 | 70.553 |
9 | 202009 | 31,176 | 418 | 1.341 | 33,729.31 | 80.692 |
10 | 202010 | 27,008 | 499 | 1.848 | 43,778.32 | 87.732 |
11 | 202011 | 25,318 | 467 | 1.845 | 28,227.67 | 60.445 |
12 | 202012 | 21,052 | 274 | 1.302 | 18,856.75 | 68.820 |
13 | 202101 | 28,682 | 451 | 1.572 | 27,603.02 | 61.204 |
14 | 202102 | 26,895 | 290 | 1.078 | 16,280.24 | 56.139 |
15 | 202103 | 28,585 | 628 | 2.197 | 56,863.03 | 90.546 |
16 | 202104 | 26,686 | 499 | 1.870 | 90,540.51 | 181.444 |
17 | 202105 | 27,716 | 676 | 2.439 | 44,683.35 | 66.100 |
18 | 202106 | 28,022 | 451 | 1.609 | 21,595.98 | 47.885 |
In order to use the Pandas rolling()
function, the data obviously needs to be numeric. When data are exported from Google Analytics, it often adds commas to the numbers to improve their readability. However, this results in what was a numeric value, becoming a categorical value with an object dtype
. You can identify the dtypes of your columns by using the Pandas info()
function. This shows we have two object
columns to fix.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year_month 19 non-null int64
1 sessions 19 non-null object
2 transactions 19 non-null int64
3 conversion_rate 19 non-null float64
4 revenue 19 non-null object
5 aov 19 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 1.0+ KB
To convert the non-numeric object
data in the sessions
and revenue
columns to float
and int
dtypes, we can first use the str.replace()
method to remove the comma and replace it with an empty string, then we can use the astype()
method to cast the column data to the correct dtype. If you run this code, then re-run df.info()
, you should find that the columns are now numeric, and we can use the rolling()
function without issues.
df['sessions'] = df['sessions'].str.replace(',', '').astype(int)
df['revenue'] = df['revenue'].str.replace(',', '').astype(float)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year_month 19 non-null int64
1 sessions 19 non-null int64
2 transactions 19 non-null int64
3 conversion_rate 19 non-null float64
4 revenue 19 non-null float64
5 aov 19 non-null float64
dtypes: float64(3), int64(3)
memory usage: 1.0 KB
Now we’ve ensured our data are in the numeric format required, we can calculate the rolling mean or rolling average of a column using the Pandas rolling()
function. The default argument for the rolling()
is called window
, and is an int
specifying the number of rows to use.
We’ll use rolling()
to calculate rolling means or rolling averages for the conversion_rate
column. We’ll calculate a three-month, six-month, nine-month, and 12-month rolling average, and assign the values back to the original dataframe.
df['mean_conversion_rate_3m'] = df['conversion_rate'].rolling(3).mean()
df['mean_conversion_rate_6m'] = df['conversion_rate'].rolling(6).mean()
df['mean_conversion_rate_9m'] = df['conversion_rate'].rolling(9).mean()
df['mean_conversion_rate_12m'] = df['conversion_rate'].rolling(12).mean()
If you print the dataframe you’ll see that the time series data are sorted with the oldest values at the top of the dataframe and the latest values at the bottom. For each of the rolling()
calculations, Pandas will look at each row and then calculate the rolling mean based on the previous n rows.
For example, for the year_month
202002, when calculating a three-month rolling average with rolling(3).mean()
, Pandas will look back at the values of 202002, 202001, and 201912, to calculate the rolling mean based on the current value and the previous two. When there are insufficient prior values to use for the calculation, NaN
values are added.
df
year_month | sessions | transactions | conversion_rate | revenue | aov | mean_conversion_rate_3m | mean_conversion_rate_6m | mean_conversion_rate_12m | |
---|---|---|---|---|---|---|---|---|---|
0 | 201912 | 21133 | 129 | 0.610 | 18377.60 | 142.462 | NaN | NaN | NaN |
1 | 202001 | 25801 | 241 | 0.934 | 15019.35 | 62.321 | NaN | NaN | NaN |
2 | 202002 | 21712 | 193 | 0.889 | 8214.80 | 42.564 | 0.811000 | NaN | NaN |
3 | 202003 | 16836 | 547 | 3.249 | 34988.27 | 63.964 | 1.690667 | NaN | NaN |
4 | 202004 | 27635 | 692 | 2.504 | 50703.73 | 73.271 | 2.214000 | NaN | NaN |
5 | 202005 | 55834 | 966 | 1.730 | 68150.07 | 70.549 | 2.494333 | 1.652667 | NaN |
6 | 202006 | 51665 | 773 | 1.496 | 47291.56 | 61.179 | 1.910000 | 1.800333 | NaN |
7 | 202007 | 46257 | 596 | 1.288 | 62172.81 | 104.317 | 1.504667 | 1.859333 | NaN |
8 | 202008 | 46016 | 563 | 1.223 | 39721.21 | 70.553 | 1.335667 | 1.915000 | NaN |
9 | 202009 | 31176 | 418 | 1.341 | 33729.31 | 80.692 | 1.284000 | 1.597000 | NaN |
10 | 202010 | 27008 | 499 | 1.848 | 43778.32 | 87.732 | 1.470667 | 1.487667 | NaN |
11 | 202011 | 25318 | 467 | 1.845 | 28227.67 | 60.445 | 1.678000 | 1.506833 | 1.579750 |
12 | 202012 | 21052 | 274 | 1.302 | 18856.75 | 68.820 | 1.665000 | 1.474500 | 1.637417 |
13 | 202101 | 28682 | 451 | 1.572 | 27603.02 | 61.204 | 1.573000 | 1.521833 | 1.690583 |
14 | 202102 | 26895 | 290 | 1.078 | 16280.24 | 56.139 | 1.317333 | 1.497667 | 1.706333 |
15 | 202103 | 28585 | 628 | 2.197 | 56863.03 | 90.546 | 1.615667 | 1.640333 | 1.618667 |
16 | 202104 | 26686 | 499 | 1.870 | 90540.51 | 181.444 | 1.715000 | 1.644000 | 1.565833 |
17 | 202105 | 27716 | 676 | 2.439 | 44683.35 | 66.100 | 2.168667 | 1.743000 | 1.624917 |
18 | 202106 | 28022 | 451 | 1.609 | 21595.98 | 47.885 | 1.972667 | 1.794167 | 1.634333 |
Matt Clarke, Tuesday, November 29, 2022