Time series data have a reputation for being somewhat complicated, partly because they’re made up of a number of different components that work together. At the most basic level these consist of the trend - indicating whether a time series metric is going up or down over time - and the seasonality, which can be yearly, monthly, or daily.
Most time series forecasting models use a technique called time series decomposition to split out these components from the time series, so they can separate the trend and seasonality to identify noise and other changes in the underlying metric being forecast. Ordinarily, this can be quite a complex procedure, but it’s fairly straightforward in the Prophet model.
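To make the idea concrete, here is a minimal sketch of a classical additive decomposition on synthetic data, using only NumPy and Pandas. It is not how Prophet works internally; the series, the 7-day window, and every variable name are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: upward trend + weekly cycle + random noise
rng = np.random.default_rng(42)
dates = pd.date_range("2020-01-01", periods=140, freq="D")
true_trend = np.linspace(10, 50, len(dates))
true_weekly = 5 * np.sin(2 * np.pi * np.arange(len(dates)) / 7)
noise = rng.normal(0, 1, len(dates))
y = pd.Series(true_trend + true_weekly + noise, index=dates)

# Trend: a centred 7-day moving average smooths out the weekly cycle
trend = y.rolling(window=7, center=True).mean()

# Seasonality: average detrended value for each day of the week
detrended = y - trend
seasonal = detrended.groupby(detrended.index.dayofweek).transform("mean")

# Residual: whatever the trend and seasonality don't explain
residual = y - trend - seasonal

parts = pd.DataFrame(
    {"y": y, "trend": trend, "seasonal": seasonal, "residual": residual}
)
print(parts.dropna().head())
```

The three parts add back up to the original series wherever the moving average is defined (the first and last three days are lost to the window).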
Prophet includes an automatic time series decomposition feature which allows you to remove the trend and noise from your data to see the underlying seasonality, or remove the noise and seasonality to see the underlying trend. Without seasonal decomposition, these things can be much harder to identify. Here’s how it’s done.
We’ll be using three packages for this project: Pandas for displaying and manipulating our data, GAPandas for fetching Google Analytics data using the Reporting API, and the Prophet forecasting model for time series decomposition. Any packages you don’t have can be installed by entering pip3 install package-name.
import pandas as pd
import gapandas as gp
from fbprophet import Prophet
from fbprophet.plot import plot_plotly
from fbprophet.plot import plot_components_plotly
from fbprophet.plot import add_changepoints_to_plot
from fbprophet.plot import plot_yearly
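Newer releases of Prophet are published under the package name prophet rather than fbprophet (the rename came with version 1.0), so the imports above may fail on a fresh install. The following is a minimal sketch of an import that tolerates either name; the helper function find_prophet_package is hypothetical and not part of Prophet.

```python
import importlib
import importlib.util


def find_prophet_package():
    """Return the installed Prophet package name, or None if neither is found."""
    for name in ("prophet", "fbprophet"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None


package = find_prophet_package()
if package is None:
    print("Prophet not found; try: pip3 install prophet")
else:
    # Import the class from whichever package is available
    Prophet = importlib.import_module(package).Prophet
    print(f"Using Prophet from the {package} package")
```

If the newer package is the one installed, the plotting imports would likewise come from prophet.plot instead of fbprophet.plot.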
If you already have a time series dataset you can skip this step. If you want to perform time series decomposition on data from a Google Analytics account, you’ll need to set up GAPandas. This is explained in more detail in this guide, but you’ll require your JSON keyfile and the view ID for the account you want to access.
service = gp.get_service('client-secret.json', verbose=False)
view = '123456789'
Next, we’ll create a simple API query payload and pass it to Google Analytics using GAPandas. This will return a time series dataframe in which your chosen metric is shown alongside the date. Google Analytics will automatically fill in any blanks with zeros. I’m examining some web analytics data for one of my personal sites.
payload = {
'start_date': '2016-01-01',
'end_date': '2021-01-01',
'metrics': 'ga:entrances',
'dimensions': 'ga:date'
}
df = gp.run_query(service, view, payload)
df.head()
| | date | entrances |
|---|---|---|
| 0 | 2016-01-01 | 7 |
| 1 | 2016-01-02 | 2 |
| 2 | 2016-01-03 | 2 |
| 3 | 2016-01-04 | 1 |
| 4 | 2016-01-05 | 6 |
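If your data comes from a source that doesn’t fill in blank days for you, something similar can be done in Pandas. This is a rough sketch on a small hypothetical dataframe with two days missing; it assumes a missing day genuinely means zero traffic. If a gap means "unknown" instead, leaving it alone is probably the better choice.

```python
import pandas as pd

# Hypothetical raw data with 2016-01-03 and 2016-01-04 missing
raw = pd.DataFrame({
    "date": ["2016-01-01", "2016-01-02", "2016-01-05"],
    "entrances": [7, 2, 6],
})
raw["date"] = pd.to_datetime(raw["date"], format="%Y-%m-%d")

# Build the complete daily calendar between the first and last date
full_range = pd.date_range(raw["date"].min(), raw["date"].max(), freq="D")

# Reindex onto the full calendar, filling absent days with zero
filled = (
    raw.set_index("date")
       .reindex(full_range, fill_value=0)
       .rename_axis("date")
       .reset_index()
)
print(filled)
```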
The Prophet model requires a dataframe containing two columns: a datetime column called ds and a field containing your metric called y. Use the Pandas rename() function to rename the columns and ensure the date column is set to datetime using the to_datetime() function.
df = df.rename(columns={'date':'ds', 'entrances':'y'})
df['ds'] = pd.to_datetime(df['ds'], format='%Y-%m-%d')
df.head()
| | ds | y |
|---|---|---|
| 0 | 2016-01-01 | 7 |
| 1 | 2016-01-02 | 2 |
| 2 | 2016-01-03 | 2 |
| 3 | 2016-01-04 | 1 |
| 4 | 2016-01-05 | 6 |
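Before fitting, it can be worth checking that the dataframe really has the shape described above. The helper below is a hypothetical sketch, not part of Prophet or Pandas, and the checks it runs are only the ones mentioned in this article plus a duplicate-date check.

```python
import pandas as pd


def check_prophet_frame(df):
    """Return a list of problems found in a ds/y dataframe (empty if none)."""
    problems = [f"missing column: {c}" for c in ("ds", "y") if c not in df.columns]
    if problems:
        return problems
    if not pd.api.types.is_datetime64_any_dtype(df["ds"]):
        problems.append("ds is not a datetime column")
    if not pd.api.types.is_numeric_dtype(df["y"]):
        problems.append("y is not numeric")
    if df["ds"].duplicated().any():
        problems.append("ds contains duplicate dates")
    return problems


# The first five rows from the article, before and after date conversion
unconverted = pd.DataFrame({
    "ds": ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04", "2016-01-05"],
    "y": [7, 2, 2, 1, 6],
})
converted = unconverted.assign(
    ds=pd.to_datetime(unconverted["ds"], format="%Y-%m-%d")
)

print(check_prophet_frame(unconverted))
print(check_prophet_frame(converted))
```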
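Next, we’ll create the Prophet model with daily_seasonality enabled and fit it to our df dataframe containing the ds and y columns. The traffic on the site I’ve used varies according to the day of the week, which is a weekly pattern; Prophet picks this up through its weekly seasonality, which it enables automatically for daily data like this. The daily_seasonality setting models cycles within a single day, so with one observation per day it adds little, but it does add a daily panel to the component plots later.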
model = Prophet(daily_seasonality=True)
model.fit(df)
INFO:numexpr.utils:Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
<fbprophet.forecaster.Prophet at 0x7f102da58490>
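A quick way to see whether a metric varies by day of the week, independently of Prophet, is to average it per weekday in Pandas. The sketch below uses a synthetic stand-in for the ds and y dataframe (weekdays busier than weekends); the numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ds/y dataframe: weekdays busier than weekends
ds = pd.date_range("2020-01-06", periods=56, freq="D")  # starts on a Monday
y = np.where(ds.dayofweek < 5, 100, 40)
df = pd.DataFrame({"ds": ds, "y": y})

# Mean of y for each day of the week (0 = Monday ... 6 = Sunday)
by_weekday = df.groupby(df["ds"].dt.dayofweek)["y"].mean()
print(by_weekday)
```

A clear difference between the weekday means suggests there is a weekly pattern for the model to pick up.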
To give Prophet somewhere to store its predictions, we need to extend the dataframe of dates to include the future period we want to forecast. I’ve done this using the make_future_dataframe() function, with periods set to 365 to cover the next 365 days.
future = model.make_future_dataframe(periods=365)
future.tail()
| | ds |
|---|---|
| 2188 | 2021-12-28 |
| 2189 | 2021-12-29 |
| 2190 | 2021-12-30 |
| 2191 | 2021-12-31 |
| 2192 | 2022-01-01 |
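To see where those index numbers and dates come from, here is a rough Pandas-only sketch of the same idea: the historical dates followed by 365 new daily dates. It illustrates the result, not Prophet’s implementation, and assumes the history runs daily from 2016-01-01 to 2021-01-01 as in the query above.

```python
import pandas as pd

# Same daily date range as the history in this article
history = pd.DataFrame({"ds": pd.date_range("2016-01-01", "2021-01-01", freq="D")})

# 365 new daily dates starting the day after the last observation
last_date = history["ds"].max()
new_dates = pd.date_range(last_date + pd.Timedelta(days=1), periods=365, freq="D")

# History followed by the future period, with a fresh 0..n-1 index
future = pd.concat([history, pd.DataFrame({"ds": new_dates})], ignore_index=True)

print(len(history), len(future))
print(future.tail())
```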
Now we can get the Prophet model to predict y for the next 365 days using the predict() function. By examining the fields in the forecast dataframe, we can see that we get a yhat holding our predicted value, plus a yhat_lower and a yhat_upper, representing the lower and upper bounds of the uncertainty interval around it (80% by default).
forecast = model.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
For 2021-12-28, Prophet is forecasting a yhat of roughly 1,032 entrances, with an uncertainty interval running from a yhat_lower of about 875 to a yhat_upper of about 1,177.
| | ds | yhat | yhat_lower | yhat_upper |
|---|---|---|---|---|
| 2188 | 2021-12-28 | 1031.545626 | 875.017722 | 1176.853593 |
| 2189 | 2021-12-29 | 1022.972801 | 866.592976 | 1174.910263 |
| 2190 | 2021-12-30 | 1024.170165 | 874.078189 | 1175.733781 |
| 2191 | 2021-12-31 | 1023.099677 | 876.396980 | 1176.541808 |
| 2192 | 2022-01-01 | 1048.272269 | 906.088285 | 1203.786461 |
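One simple way to judge how much to trust these numbers is to look at how wide the interval is relative to the prediction. The sketch below rebuilds the five rows above by hand so that it runs without Prophet; the width and relative_width columns are names chosen for this example.

```python
import pandas as pd

# The five forecast rows shown above, typed in by hand
forecast_tail = pd.DataFrame({
    "ds": pd.to_datetime(
        ["2021-12-28", "2021-12-29", "2021-12-30", "2021-12-31", "2022-01-01"]
    ),
    "yhat": [1031.545626, 1022.972801, 1024.170165, 1023.099677, 1048.272269],
    "yhat_lower": [875.017722, 866.592976, 874.078189, 876.396980, 906.088285],
    "yhat_upper": [1176.853593, 1174.910263, 1175.733781, 1176.541808, 1203.786461],
})

# Absolute width of the interval, and width as a share of the prediction
forecast_tail["width"] = forecast_tail["yhat_upper"] - forecast_tail["yhat_lower"]
forecast_tail["relative_width"] = forecast_tail["width"] / forecast_tail["yhat"]

print(forecast_tail[["ds", "yhat", "width", "relative_width"]].round(3))
```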
Next, we can draw the forecast on a time series plot using Prophet’s plot() function. The dark blue line represents the prediction from yhat and the pale blue band around it represents the range between yhat_lower and yhat_upper. The black dots represent the actual data.
On my data set there are some outliers, caused by site issues and sudden traffic spikes, as well as a level shift anomaly caused by traffic increasing and then dropping as a result of the pandemic. The period after 2021 represents our future period, so there are no black dots here.
forecast_plot = model.plot(forecast)
Finally, now that the model has been fitted and the forecast made, we can perform the time series decomposition step.
Prophet makes this really easy: simply call the plot_components() function and pass it the forecast dataframe. This generates four separate plots, one for each component extracted from the time series: the trend, plus the weekly, yearly, and daily seasonality.
forecast_components = model.plot_components(forecast)
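The same components are also available as columns in the forecast dataframe, so they can be inspected as numbers instead of plots. The sketch below assumes Prophet’s default additive seasonality with no holidays or extra regressors, in which case yhat should be the trend plus the seasonal columns. To keep it runnable without Prophet, it uses a small stand-in dataframe with invented values, and components_add_up is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def components_add_up(forecast, components=("weekly", "yearly", "daily")):
    """Check that yhat equals trend plus the listed seasonal components."""
    present = [c for c in components if c in forecast.columns]
    rebuilt = forecast["trend"] + forecast[present].sum(axis=1)
    return bool(np.allclose(rebuilt, forecast["yhat"]))


# Stand-in for a forecast dataframe; all values are invented for illustration
stand_in = pd.DataFrame({
    "ds": pd.to_datetime(["2021-12-28", "2021-12-29", "2021-12-30"]),
    "trend": [900.0, 901.0, 902.0],
    "weekly": [20.0, -15.0, 5.0],
    "yearly": [100.0, 101.0, 102.0],
    "daily": [10.0, 10.0, 10.0],
})
stand_in["yhat"] = stand_in[["trend", "weekly", "yearly", "daily"]].sum(axis=1)

print(components_add_up(stand_in))
```

With a real forecast, calling components_add_up(forecast) would be a quick sanity check that the plotted components account for the whole prediction.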
Matt Clarke, Saturday, March 13, 2021