How to calculate a rolling average or rolling mean in Pandas

Picture by Mikhail Nilov, Pexels.

8 minutes to read

Data Science Pandas

The Pandas rolling() method can be used to calculate a rolling mean or rolling average (also known as a moving average), which is simply the mean of a specific time series data column value over a specified number of periods.

Rolling means and rolling averages are commonly used in finance and retail sectors, where they are used to smooth out short-term fluctuations in data and highlight longer-term trends or cycles.

Import a Pandas dataframe

To get started, open a Jupyter notebook, import the Pandas library, find some time series data and load it into a Pandas dataframe. I’m using a dataset of ecommerce sales data I created for test purposes. You can access the file via my GitHub repository. It contains 19 months of ecommerce sales data and was exported from Google Analytics.

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/monthly-ecommerce-data.csv')
df

	year_month	sessions	transactions	conversion_rate	revenue	aov
0	201912	21,133	129	0.610	18,377.60	142.462
1	202001	25,801	241	0.934	15,019.35	62.321
2	202002	21,712	193	0.889	8,214.80	42.564
3	202003	16,836	547	3.249	34,988.27	63.964
4	202004	27,635	692	2.504	50,703.73	73.271
5	202005	55,834	966	1.730	68,150.07	70.549
6	202006	51,665	773	1.496	47,291.56	61.179
7	202007	46,257	596	1.288	62,172.81	104.317
8	202008	46,016	563	1.223	39,721.21	70.553
9	202009	31,176	418	1.341	33,729.31	80.692
10	202010	27,008	499	1.848	43,778.32	87.732
11	202011	25,318	467	1.845	28,227.67	60.445
12	202012	21,052	274	1.302	18,856.75	68.820
13	202101	28,682	451	1.572	27,603.02	61.204
14	202102	26,895	290	1.078	16,280.24	56.139
15	202103	28,585	628	2.197	56,863.03	90.546
16	202104	26,686	499	1.870	90,540.51	181.444
17	202105	27,716	676	2.439	44,683.35	66.100
18	202106	28,022	451	1.609	21,595.98	47.885

Prepare the data

In order to use the Pandas rolling() function, the data obviously needs to be numeric. When data are exported from Google Analytics, it often adds commas to the numbers to improve their readability. However, this results in what was a numeric value, becoming a categorical value with an object dtype. You can identify the dtypes of your columns by using the Pandas info() function. This shows we have two object columns to fix.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   year_month       19 non-null     int64  
 1   sessions         19 non-null     object 
 2   transactions     19 non-null     int64  
 3   conversion_rate  19 non-null     float64
 4   revenue          19 non-null     object 
 5   aov              19 non-null     float64
dtypes: float64(2), int64(2), object(2)
memory usage: 1.0+ KB

To convert the non-numeric object data in the sessions and revenue columns to float and int dtypes, we can first use the str.replace() method to remove the comma and replace it with an empty string, then we can use the astype() method to cast the column data to the correct dtype. If you run this code, then re-run df.info(), you should find that the columns are now numeric, and we can use the rolling() function without issues.

df['sessions'] = df['sessions'].str.replace(',', '').astype(int)
df['revenue'] = df['revenue'].str.replace(',', '').astype(float)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   year_month       19 non-null     int64  
 1   sessions         19 non-null     int64  
 2   transactions     19 non-null     int64  
 3   conversion_rate  19 non-null     float64
 4   revenue          19 non-null     float64
 5   aov              19 non-null     float64
dtypes: float64(3), int64(3)
memory usage: 1.0 KB

Use rolling() to calculate the rolling mean of a column

Now we’ve ensured our data are in the numeric format required, we can calculate the rolling mean or rolling average of a column using the Pandas rolling() function. The default argument for the rolling() is called window, and is an int specifying the number of rows to use.

We’ll use rolling() to calculate rolling means or rolling averages for the conversion_rate column. We’ll calculate a three-month, six-month, nine-month, and 12-month rolling average, and assign the values back to the original dataframe.

df['mean_conversion_rate_3m'] = df['conversion_rate'].rolling(3).mean()
df['mean_conversion_rate_6m'] = df['conversion_rate'].rolling(6).mean()
df['mean_conversion_rate_9m'] = df['conversion_rate'].rolling(9).mean()
df['mean_conversion_rate_12m'] = df['conversion_rate'].rolling(12).mean()

If you print the dataframe you’ll see that the time series data are sorted with the oldest values at the top of the dataframe and the latest values at the bottom. For each of the rolling() calculations, Pandas will look at each row and then calculate the rolling mean based on the previous n rows.

For example, for the year_month 202002, when calculating a three-month rolling average with rolling(3).mean(), Pandas will look back at the values of 202002, 202001, and 201912, to calculate the rolling mean based on the current value and the previous two. When there are insufficient prior values to use for the calculation, NaN values are added.

df

	year_month	sessions	transactions	conversion_rate	revenue	aov	mean_conversion_rate_3m	mean_conversion_rate_6m	mean_conversion_rate_12m
0	201912	21133	129	0.610	18377.60	142.462	NaN	NaN	NaN
1	202001	25801	241	0.934	15019.35	62.321	NaN	NaN	NaN
2	202002	21712	193	0.889	8214.80	42.564	0.811000	NaN	NaN
3	202003	16836	547	3.249	34988.27	63.964	1.690667	NaN	NaN
4	202004	27635	692	2.504	50703.73	73.271	2.214000	NaN	NaN
5	202005	55834	966	1.730	68150.07	70.549	2.494333	1.652667	NaN
6	202006	51665	773	1.496	47291.56	61.179	1.910000	1.800333	NaN
7	202007	46257	596	1.288	62172.81	104.317	1.504667	1.859333	NaN
8	202008	46016	563	1.223	39721.21	70.553	1.335667	1.915000	NaN
9	202009	31176	418	1.341	33729.31	80.692	1.284000	1.597000	NaN
10	202010	27008	499	1.848	43778.32	87.732	1.470667	1.487667	NaN
11	202011	25318	467	1.845	28227.67	60.445	1.678000	1.506833	1.579750
12	202012	21052	274	1.302	18856.75	68.820	1.665000	1.474500	1.637417
13	202101	28682	451	1.572	27603.02	61.204	1.573000	1.521833	1.690583
14	202102	26895	290	1.078	16280.24	56.139	1.317333	1.497667	1.706333
15	202103	28585	628	2.197	56863.03	90.546	1.615667	1.640333	1.618667
16	202104	26686	499	1.870	90540.51	181.444	1.715000	1.644000	1.565833
17	202105	27716	676	2.439	44683.35	66.100	2.168667	1.743000	1.624917
18	202106	28022	451	1.609	21595.98	47.885	1.972667	1.794167	1.634333

Matt Clarke, Tuesday, November 29, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.