When dealing with temporal or time series data, the dates themselves often yield information that can vastly improve the performance of your model. However, to get the best from these dates you first need to do some feature engineering to change them into a more usable format.
There are a wide range of different ways of performing calendrical calculations in Python. Both Pandas and Numpy include built-in functions to let you access specific information about dates, calculate date ranges and determine useful things, like whether a given date is a weekday, weekend, holiday or business day.
Here, we’re going to build a simple Pandas DataFrame and calculate which days within the period are weekdays, weekends, working or business days or holidays. To get started, open up a Jupyter notebook and load the Pandas, Numpy, and Datetime packages up.
import pandas as pd
import numpy as np
import datetime as dt
pd.set_option('max_rows',1000)
Next, we’re going to create a Pandas DataFrame containing all of the dates over a particular date range. Pandas makes it quite straightforward to create a DataFrame of dates using its helpful date_range()
function.
Simply pass this an initial start date, define the number of periods to look ahead and set the frequency using the freq
offset alias. I’ve used D
here for day and 365
for periods, but there are a huge range of others available, including MS
for month start. You can even pass in custom values.
period = pd.date_range('2020-01-01', periods=365, freq='D')
df = pd.DataFrame({ 'date': period })
df.head(365)
date | |
---|---|
0 | 2020-01-01 |
1 | 2020-01-02 |
2 | 2020-01-03 |
3 | 2020-01-04 |
4 | 2020-01-05 |
5 | 2020-01-06 |
6 | 2020-01-07 |
7 | 2020-01-08 |
8 | 2020-01-09 |
9 | 2020-01-10 |
10 | 2020-01-11 |
Now that we have our DataFrame of dates, we can use some other techniques to identify any special features they may have that could influence our models or analyses. We’ll start by identifying which dates fall on weekends and which do not. The dayofweek
method can be used to determine the day number and from that we can tell whether it fell on a weekend or not.
df['is_weekend'] = ((pd.DatetimeIndex(df['date']).dayofweek) // 5 == 1).astype(int)
df.head(14)
date | is_weekend | |
---|---|---|
0 | 2020-01-01 | 0 |
1 | 2020-01-02 | 0 |
2 | 2020-01-03 | 0 |
3 | 2020-01-04 | 1 |
4 | 2020-01-05 | 1 |
5 | 2020-01-06 | 0 |
6 | 2020-01-07 | 0 |
7 | 2020-01-08 | 0 |
8 | 2020-01-09 | 0 |
9 | 2020-01-10 | 0 |
10 | 2020-01-11 | 1 |
11 | 2020-01-12 | 1 |
12 | 2020-01-13 | 0 |
13 | 2020-01-14 | 0 |
Identifying weekdays and weekends is pretty straightforward, but identifying holidays is much harder because they vary by year, by country and sometimes even within a country. There is a Python package called Holidays which makes it a bit easier to identify and label holidays, but it’s not perfect for those of us in the UK, where we have separate holidays in England, Wales and Scotland.
Instead, we’ll use Pandas.tseries.holiday
and extend the AbstractHolidayCalendar
to create our own one for England and Wales that we can use whenever we need it. Although some UK holidays do not occur on fixed dates, they follow similar patterns every year, making it possible to make a class to determine whether a date is a holiday, without the need to maintain a massive database of dates.
Some holidays, like GoodFriday and EasterMonday, are handled automatically as they’re common across the world, but for the others we can create simple rules to define them. When a given holiday falls on a weekend, we get a “substitute day” on either the next Monday or next Monday or Tuesday, allowing us to have the same number each year.
from pandas.tseries.holiday import (
AbstractHolidayCalendar, Holiday, DateOffset, EasterMonday, GoodFriday, MO, next_monday, next_monday_or_tuesday
)
class EnglandWalesHolidays(AbstractHolidayCalendar):
rules = [
Holiday('New Year\'s Day', month=1, day=1, observance=next_monday),
GoodFriday,
EasterMonday,
Holiday('Early May bank holiday', month=5, day=1, offset=DateOffset(weekday=MO(1))),
Holiday('Spring bank holiday', month=5, day=31, offset=DateOffset(weekday=MO(-1))),
Holiday('Summer bank holiday', month=8, day=1, offset=DateOffset(weekday=MO(1))),
Holiday('Christmas Day', month=12, day=25, observance=next_monday),
Holiday('Boxing Day', month=12, day=26, observance=next_monday_or_tuesday)
]
Next we can use our class, passing in the start
date as the first date in our dataframe and the end
date as the last date in our dataframe, to return all of the dates on which holidays fell, based on the rules we defined in our class.
holidays = EnglandWalesHolidays().holidays(start=df.date.min(), end=df.date.max())
holidays
DatetimeIndex(['2020-01-01', '2020-04-10', '2020-04-13', '2020-05-04',
'2020-05-25', '2020-08-03', '2020-12-25', '2020-12-28'],
dtype='datetime64[ns]', freq=None)
Finally, we can use Numpy’s where()
function to check whether each date in the dataframe is present in the holidays object and assign a 1 or 0 accordingly.
df['is_holiday'] = np.where(df['date'].isin(holidays), 1, 0)
df.head()
date | is_weekend | is_holiday | |
---|---|---|---|
0 | 2020-01-01 | 0 | 1 |
1 | 2020-01-02 | 0 | 0 |
2 | 2020-01-03 | 0 | 0 |
3 | 2020-01-04 | 1 | 0 |
4 | 2020-01-05 | 1 | 0 |
To identify working days all we need to do now is check whether is_holiday
or is_weekend
contains a 1. If it does, it’s a non-working day and we assign a 0 to is_working_day
, otherwise we assign a 1.
df['is_working_day'] = np.where((df['is_weekend']==1) | (df['is_holiday']==1), 0, 1)
df.head(14)
date | is_weekend | is_holiday | is_working_day | |
---|---|---|---|---|
0 | 2020-01-01 | 0 | 1 | 0 |
1 | 2020-01-02 | 0 | 0 | 1 |
2 | 2020-01-03 | 0 | 0 | 1 |
3 | 2020-01-04 | 1 | 0 | 0 |
4 | 2020-01-05 | 1 | 0 | 0 |
5 | 2020-01-06 | 0 | 0 | 1 |
6 | 2020-01-07 | 0 | 0 | 1 |
7 | 2020-01-08 | 0 | 0 | 1 |
8 | 2020-01-09 | 0 | 0 | 1 |
9 | 2020-01-10 | 0 | 0 | 1 |
10 | 2020-01-11 | 1 | 0 | 0 |
11 | 2020-01-12 | 1 | 0 | 0 |
12 | 2020-01-13 | 0 | 0 | 1 |
13 | 2020-01-14 | 0 | 0 | 1 |
As models are rarely, if ever, able to use dates and times as they are, you will benefit from encoding them as categorical variables as these can often yield useful information about temporal patterns. There are a few ways to create features from existing dates. However, the most versatile is to use strftime()
. However, I have used the handy datetime quarter
function for one feature.
df['day_of_month'] = df['date'].dt.strftime("%d").astype(int)
df['day_of_week'] = df['date'].dt.strftime("%w").astype(int)
df['day_of_year'] = df['date'].dt.strftime("%j").astype(int)
df['week'] = df['date'].dt.strftime("%W").astype(int)
df['month'] = df['date'].dt.strftime("%m").astype(int)
df['year'] = df['date'].dt.strftime("%Y").astype(int)
df['year_month'] = df['date'].dt.strftime("%Y%m").astype(int)
df['quarter'] = df['date'].dt.quarter.astype(int)
df.head()
date | is_weekend | is_holiday | is_working_day | day_of_month | month | day_of_week | day_of_year | week | year | year_month | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-01-01 | 0 | 1 | 0 | 1 | 1 | 3 | 1 | 0 | 2020 | 202001 |
1 | 2020-01-02 | 0 | 0 | 1 | 2 | 1 | 4 | 2 | 0 | 2020 | 202001 |
2 | 2020-01-03 | 0 | 0 | 1 | 3 | 1 | 5 | 3 | 0 | 2020 | 202001 |
3 | 2020-01-04 | 1 | 0 | 0 | 4 | 1 | 6 | 4 | 0 | 2020 | 202001 |
4 | 2020-01-05 | 1 | 0 | 0 | 5 | 1 | 0 | 5 | 0 | 2020 | 202001 |
To drop the time component from a column containing a datetime, you can use the Pandas to_datetime()
function and then append .dt.date
to extract only the date part.
df['date'] = pd.to_datetime(df['date']).dt.date
2021-03-13
Matt Clarke, Monday, March 01, 2021