How to use Pandas shift() to create lagged variables

Picture by Sohel Patel, Pexels.

7 minutes to read

Data Science Pandas Time series

The Pandas shift() function is used to shift the position of a dataframe or series by a specified number of periods. It’s commonly used for the creation of so-called lagged variables or lagged features in time series datasets.

In this quick tutorial, we’ll cover the basics of the shift() function and show how you can create lagged variables from values in a Pandas dataframe.

Create a Pandas dataframe

To get started, open a Jupyter notebook and import the Pandas library. Then, either import your own dataset into Pandas, or create a dataframe of dummy data. We’ll use a dictionary to create a dataframe of dates and order volumes.

import pandas as pd

df = pd.DataFrame({'date': ['2023-01-01', '2023-01-02', '2023-01-03', 
                            '2023-01-04', '2023-01-05', '2023-01-06', 
                            '2023-01-07', '2023-01-08', '2023-01-09'], 
                   'orders': [3432, 3233, 3434, 3235, 3436, 
                              3237, 3438, 3239, 3440]})
df

	date	orders
0	2023-01-01	3432
1	2023-01-02	3233
2	2023-01-03	3434
3	2023-01-04	3235
4	2023-01-05	3436
5	2023-01-06	3237
6	2023-01-07	3438
7	2023-01-08	3239
8	2023-01-09	3440

Create lagged variables with shift()

To use the shift() method you simply append the function to your dataframe column and pass in an integer argument to define how many rows you want to shift. To create a lag based on one day in our date column we’ll use shift(1), and to create a lag based on one week we’ll use shift(7), then we’ll use the Pandas assign() method to create a new column, and we’ll save it back to the dataframe.

If you look at new columns you’ll see that the orders_lag1 column contains either a NaN value when there was no previous row to shift, or the value of the row one row above. Similarly, for the orders_lag7 column, we have NaN values for the first seven dates because there were insufficient prior values.

df = df.assign(
    orders_lag1 = df['orders'].shift(1),
    orders_lag7 = df['orders'].shift(7)
)
df

	date	orders	orders_lag1	orders_lag7
0	2023-01-01	3432	NaN	NaN
1	2023-01-02	3233	3432.0	NaN
2	2023-01-03	3434	3233.0	NaN
3	2023-01-04	3235	3434.0	NaN
4	2023-01-05	3436	3235.0	NaN
5	2023-01-06	3237	3436.0	NaN
6	2023-01-07	3438	3237.0	NaN
7	2023-01-08	3239	3438.0	3432.0
8	2023-01-09	3440	3239.0	3233.0

Using shift() with groupby()

The other really neat thing you can do with shift() is combine it with a groupby() operation. To show this in action we’ll first create another dataframe containing various customer IDs, order IDs, order dates, and order amounts.

df = pd.DataFrame({'order_id': [100, 101, 102, 103, 104, 105],
                   'customer_id': [1, 2, 3, 1, 2, 2],
                   'order_date': ['2023-01-01', '2023-01-01', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06'],
                   'order_amount': [129.23, 82.23, 343.23, 12.22, 53.22, 292.11]})
df

	order_id	customer_id	order_date	order_amount
0	100	1	2023-01-01	129.23
1	101	2	2023-01-01	82.23
2	102	3	2023-01-03	343.23
3	103	1	2023-01-04	12.22
4	104	2	2023-01-05	53.22
5	105	2	2023-01-06	292.11

Next, we’ll use method chaining and a lambda function to calculate the date of each customer’s previous order, then we’ll use the Pandas to_datetime() method to calculate how many days have passed between the date of the current order and the date of the previous one, and we’ll return the value in days.

# Group by customer_id and return the date of the previous order
df = (df
     .assign(
        last_order_date = lambda x: x.groupby('customer_id')['order_date'].shift(1),
        days_since_last_order = lambda x: (pd.to_datetime(x['order_date']) 
        - pd.to_datetime(x['last_order_date'])).dt.days
     )
)
df

	order_id	customer_id	order_date	order_amount	last_order_date	days_since_last_order
0	100	1	2023-01-01	129.23	NaN	NaN
1	101	2	2023-01-01	82.23	NaN	NaN
2	102	3	2023-01-03	343.23	NaN	NaN
3	103	1	2023-01-04	12.22	2023-01-01	3.0
4	104	2	2023-01-05	53.22	2023-01-01	4.0
5	105	2	2023-01-06	292.11	2023-01-05	1.0

Matt Clarke, Monday, January 16, 2023

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.