How to calculate a rolling average or rolling mean in Pandas

Learn how to use the Pandas rolling() method to calculate the rolling mean, rolling average, or moving average of dataframe rows in a time series dataset.

How to calculate a rolling average or rolling mean in Pandas
Picture by Mikhail Nilov, Pexels.
8 minutes to read

The Pandas rolling() method can be used to calculate a rolling mean or rolling average (also known as a moving average), which is simply the mean of a specific time series data column value over a specified number of periods.

Rolling means and rolling averages are commonly used in finance and retail sectors, where they are used to smooth out short-term fluctuations in data and highlight longer-term trends or cycles.

Import a Pandas dataframe

To get started, open a Jupyter notebook, import the Pandas library, find some time series data and load it into a Pandas dataframe. I’m using a dataset of ecommerce sales data I created for test purposes. You can access the file via my GitHub repository. It contains 19 months of ecommerce sales data and was exported from Google Analytics.

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/monthly-ecommerce-data.csv')
df
year_month sessions transactions conversion_rate revenue aov
0 201912 21,133 129 0.610 18,377.60 142.462
1 202001 25,801 241 0.934 15,019.35 62.321
2 202002 21,712 193 0.889 8,214.80 42.564
3 202003 16,836 547 3.249 34,988.27 63.964
4 202004 27,635 692 2.504 50,703.73 73.271
5 202005 55,834 966 1.730 68,150.07 70.549
6 202006 51,665 773 1.496 47,291.56 61.179
7 202007 46,257 596 1.288 62,172.81 104.317
8 202008 46,016 563 1.223 39,721.21 70.553
9 202009 31,176 418 1.341 33,729.31 80.692
10 202010 27,008 499 1.848 43,778.32 87.732
11 202011 25,318 467 1.845 28,227.67 60.445
12 202012 21,052 274 1.302 18,856.75 68.820
13 202101 28,682 451 1.572 27,603.02 61.204
14 202102 26,895 290 1.078 16,280.24 56.139
15 202103 28,585 628 2.197 56,863.03 90.546
16 202104 26,686 499 1.870 90,540.51 181.444
17 202105 27,716 676 2.439 44,683.35 66.100
18 202106 28,022 451 1.609 21,595.98 47.885

Prepare the data

In order to use the Pandas rolling() function, the data obviously needs to be numeric. When data are exported from Google Analytics, it often adds commas to the numbers to improve their readability. However, this results in what was a numeric value, becoming a categorical value with an object dtype. You can identify the dtypes of your columns by using the Pandas info() function. This shows we have two object columns to fix.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   year_month       19 non-null     int64  
 1   sessions         19 non-null     object 
 2   transactions     19 non-null     int64  
 3   conversion_rate  19 non-null     float64
 4   revenue          19 non-null     object 
 5   aov              19 non-null     float64
dtypes: float64(2), int64(2), object(2)
memory usage: 1.0+ KB

To convert the non-numeric object data in the sessions and revenue columns to float and int dtypes, we can first use the str.replace() method to remove the comma and replace it with an empty string, then we can use the astype() method to cast the column data to the correct dtype. If you run this code, then re-run df.info(), you should find that the columns are now numeric, and we can use the rolling() function without issues.

df['sessions'] = df['sessions'].str.replace(',', '').astype(int)
df['revenue'] = df['revenue'].str.replace(',', '').astype(float)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   year_month       19 non-null     int64  
 1   sessions         19 non-null     int64  
 2   transactions     19 non-null     int64  
 3   conversion_rate  19 non-null     float64
 4   revenue          19 non-null     float64
 5   aov              19 non-null     float64
dtypes: float64(3), int64(3)
memory usage: 1.0 KB

Use rolling() to calculate the rolling mean of a column

Now we’ve ensured our data are in the numeric format required, we can calculate the rolling mean or rolling average of a column using the Pandas rolling() function. The default argument for the rolling() is called window, and is an int specifying the number of rows to use.

We’ll use rolling() to calculate rolling means or rolling averages for the conversion_rate column. We’ll calculate a three-month, six-month, nine-month, and 12-month rolling average, and assign the values back to the original dataframe.

df['mean_conversion_rate_3m'] = df['conversion_rate'].rolling(3).mean()
df['mean_conversion_rate_6m'] = df['conversion_rate'].rolling(6).mean()
df['mean_conversion_rate_9m'] = df['conversion_rate'].rolling(9).mean()
df['mean_conversion_rate_12m'] = df['conversion_rate'].rolling(12).mean()

If you print the dataframe you’ll see that the time series data are sorted with the oldest values at the top of the dataframe and the latest values at the bottom. For each of the rolling() calculations, Pandas will look at each row and then calculate the rolling mean based on the previous n rows.

For example, for the year_month 202002, when calculating a three-month rolling average with rolling(3).mean(), Pandas will look back at the values of 202002, 202001, and 201912, to calculate the rolling mean based on the current value and the previous two. When there are insufficient prior values to use for the calculation, NaN values are added.

df
year_month sessions transactions conversion_rate revenue aov mean_conversion_rate_3m mean_conversion_rate_6m mean_conversion_rate_12m
0 201912 21133 129 0.610 18377.60 142.462 NaN NaN NaN
1 202001 25801 241 0.934 15019.35 62.321 NaN NaN NaN
2 202002 21712 193 0.889 8214.80 42.564 0.811000 NaN NaN
3 202003 16836 547 3.249 34988.27 63.964 1.690667 NaN NaN
4 202004 27635 692 2.504 50703.73 73.271 2.214000 NaN NaN
5 202005 55834 966 1.730 68150.07 70.549 2.494333 1.652667 NaN
6 202006 51665 773 1.496 47291.56 61.179 1.910000 1.800333 NaN
7 202007 46257 596 1.288 62172.81 104.317 1.504667 1.859333 NaN
8 202008 46016 563 1.223 39721.21 70.553 1.335667 1.915000 NaN
9 202009 31176 418 1.341 33729.31 80.692 1.284000 1.597000 NaN
10 202010 27008 499 1.848 43778.32 87.732 1.470667 1.487667 NaN
11 202011 25318 467 1.845 28227.67 60.445 1.678000 1.506833 1.579750
12 202012 21052 274 1.302 18856.75 68.820 1.665000 1.474500 1.637417
13 202101 28682 451 1.572 27603.02 61.204 1.573000 1.521833 1.690583
14 202102 26895 290 1.078 16280.24 56.139 1.317333 1.497667 1.706333
15 202103 28585 628 2.197 56863.03 90.546 1.615667 1.640333 1.618667
16 202104 26686 499 1.870 90540.51 181.444 1.715000 1.644000 1.565833
17 202105 27716 676 2.439 44683.35 66.100 2.168667 1.743000 1.624917
18 202106 28022 451 1.609 21595.98 47.885 1.972667 1.794167 1.634333

Matt Clarke, Tuesday, November 29, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.