How to engineer date features using Pandas

In time series datasets dates often hold the key to improving performance, but they need to be transformed via feature engineering to reveal their secrets.

How to engineer date features using Pandas
13 minutes to read

When dealing with temporal or time series data, the dates themselves often yield information that can vastly improve the performance of your model. However, to get the best from these dates you first need to do some feature engineering to change them into a more usable format.

There are a wide range of different ways of performing calendrical calculations in Python. Both Pandas and Numpy include built-in functions to let you access specific information about dates, calculate date ranges and determine useful things, like whether a given date is a weekday, weekend, holiday or business day.

Here, we’re going to build a simple Pandas DataFrame and calculate which days within the period are weekdays, weekends, working or business days or holidays. To get started, open up a Jupyter notebook and load the Pandas, Numpy, and Datetime packages up.

import pandas as pd
import numpy as np
import datetime as dt
pd.set_option('max_rows',1000)

Create a DataFrame containing each date

Next, we’re going to create a Pandas DataFrame containing all of the dates over a particular date range. Pandas makes it quite straightforward to create a DataFrame of dates using its helpful date_range() function.

Simply pass this an initial start date, define the number of periods to look ahead and set the frequency using the freq offset alias. I’ve used D here for day and 365 for periods, but there are a huge range of others available, including MS for month start. You can even pass in custom values.

period = pd.date_range('2020-01-01', periods=365, freq='D')
df = pd.DataFrame({ 'date': period })
df.head(365)
date
0 2020-01-01
1 2020-01-02
2 2020-01-03
3 2020-01-04
4 2020-01-05
5 2020-01-06
6 2020-01-07
7 2020-01-08
8 2020-01-09
9 2020-01-10
10 2020-01-11

Identify weekdays and weekends

Now that we have our DataFrame of dates, we can use some other techniques to identify any special features they may have that could influence our models or analyses. We’ll start by identifying which dates fall on weekends and which do not. The dayofweek method can be used to determine the day number and from that we can tell whether it fell on a weekend or not.

df['is_weekend'] = ((pd.DatetimeIndex(df['date']).dayofweek) // 5 == 1).astype(int)
df.head(14)
date is_weekend
0 2020-01-01 0
1 2020-01-02 0
2 2020-01-03 0
3 2020-01-04 1
4 2020-01-05 1
5 2020-01-06 0
6 2020-01-07 0
7 2020-01-08 0
8 2020-01-09 0
9 2020-01-10 0
10 2020-01-11 1
11 2020-01-12 1
12 2020-01-13 0
13 2020-01-14 0

Identify holidays

Identifying weekdays and weekends is pretty straightforward, but identifying holidays is much harder because they vary by year, by country and sometimes even within a country. There is a Python package called Holidays which makes it a bit easier to identify and label holidays, but it’s not perfect for those of us in the UK, where we have separate holidays in England, Wales and Scotland.

Instead, we’ll use Pandas.tseries.holiday and extend the AbstractHolidayCalendar to create our own one for England and Wales that we can use whenever we need it. Although some UK holidays do not occur on fixed dates, they follow similar patterns every year, making it possible to make a class to determine whether a date is a holiday, without the need to maintain a massive database of dates.

Some holidays, like GoodFriday and EasterMonday, are handled automatically as they’re common across the world, but for the others we can create simple rules to define them. When a given holiday falls on a weekend, we get a “substitute day” on either the next Monday or next Monday or Tuesday, allowing us to have the same number each year.

from pandas.tseries.holiday import (
    AbstractHolidayCalendar, Holiday, DateOffset, EasterMonday, GoodFriday, MO, next_monday, next_monday_or_tuesday
)

class EnglandWalesHolidays(AbstractHolidayCalendar):
    rules = [
        Holiday('New Year\'s Day', month=1, day=1, observance=next_monday),
        GoodFriday,
        EasterMonday,
        Holiday('Early May bank holiday', month=5, day=1, offset=DateOffset(weekday=MO(1))),
        Holiday('Spring bank holiday', month=5, day=31, offset=DateOffset(weekday=MO(-1))),
        Holiday('Summer bank holiday', month=8, day=1, offset=DateOffset(weekday=MO(1))), 
        Holiday('Christmas Day', month=12, day=25, observance=next_monday), 
        Holiday('Boxing Day', month=12, day=26, observance=next_monday_or_tuesday) 
    ]

Next we can use our class, passing in the start date as the first date in our dataframe and the end date as the last date in our dataframe, to return all of the dates on which holidays fell, based on the rules we defined in our class.

holidays = EnglandWalesHolidays().holidays(start=df.date.min(), end=df.date.max())
holidays
DatetimeIndex(['2020-01-01', '2020-04-10', '2020-04-13', '2020-05-04',
               '2020-05-25', '2020-08-03', '2020-12-25', '2020-12-28'],
              dtype='datetime64[ns]', freq=None)

Finally, we can use Numpy’s where() function to check whether each date in the dataframe is present in the holidays object and assign a 1 or 0 accordingly.

df['is_holiday'] = np.where(df['date'].isin(holidays), 1, 0)
df.head()
date is_weekend is_holiday
0 2020-01-01 0 1
1 2020-01-02 0 0
2 2020-01-03 0 0
3 2020-01-04 1 0
4 2020-01-05 1 0

Identifying working days

To identify working days all we need to do now is check whether is_holiday or is_weekend contains a 1. If it does, it’s a non-working day and we assign a 0 to is_working_day, otherwise we assign a 1.

df['is_working_day'] = np.where((df['is_weekend']==1) | (df['is_holiday']==1), 0, 1)
df.head(14)
date is_weekend is_holiday is_working_day
0 2020-01-01 0 1 0
1 2020-01-02 0 0 1
2 2020-01-03 0 0 1
3 2020-01-04 1 0 0
4 2020-01-05 1 0 0
5 2020-01-06 0 0 1
6 2020-01-07 0 0 1
7 2020-01-08 0 0 1
8 2020-01-09 0 0 1
9 2020-01-10 0 0 1
10 2020-01-11 1 0 0
11 2020-01-12 1 0 0
12 2020-01-13 0 0 1
13 2020-01-14 0 0 1

Identifying other numeric calendrical features

As models are rarely, if ever, able to use dates and times as they are, you will benefit from encoding them as categorical variables as these can often yield useful information about temporal patterns. There are a few ways to create features from existing dates. However, the most versatile is to use strftime(). However, I have used the handy datetime quarter function for one feature.

df['day_of_month'] = df['date'].dt.strftime("%d").astype(int) 
df['day_of_week'] = df['date'].dt.strftime("%w").astype(int)
df['day_of_year'] = df['date'].dt.strftime("%j").astype(int)
df['week'] = df['date'].dt.strftime("%W").astype(int)
df['month'] = df['date'].dt.strftime("%m").astype(int)
df['year'] = df['date'].dt.strftime("%Y").astype(int)
df['year_month'] = df['date'].dt.strftime("%Y%m").astype(int)
df['quarter'] = df['date'].dt.quarter.astype(int)
df.head()
date is_weekend is_holiday is_working_day day_of_month month day_of_week day_of_year week year year_month
0 2020-01-01 0 1 0 1 1 3 1 0 2020 202001
1 2020-01-02 0 0 1 2 1 4 2 0 2020 202001
2 2020-01-03 0 0 1 3 1 5 3 0 2020 202001
3 2020-01-04 1 0 0 4 1 6 4 0 2020 202001
4 2020-01-05 1 0 0 5 1 0 5 0 2020 202001

Drop the time from a Pandas datetime

To drop the time component from a column containing a datetime, you can use the Pandas to_datetime() function and then append .dt.date to extract only the date part.

df['date'] = pd.to_datetime(df['date']).dt.date
2021-03-13

Matt Clarke, Monday, March 01, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.