How to use Pandas shift() to create lagged variables

Learn how to use the Pandas filter() function to filter or subset a dataframe based on the column names using their names, substring matches, or regular expressions.

How to use Pandas shift() to create lagged variables
Picture by Sohel Patel, Pexels.
7 minutes to read

The Pandas shift() function is used to shift the position of a dataframe or series by a specified number of periods. It’s commonly used for the creation of so-called lagged variables or lagged features in time series datasets.

In this quick tutorial, we’ll cover the basics of the shift() function and show how you can create lagged variables from values in a Pandas dataframe.

Create a Pandas dataframe

To get started, open a Jupyter notebook and import the Pandas library. Then, either import your own dataset into Pandas, or create a dataframe of dummy data. We’ll use a dictionary to create a dataframe of dates and order volumes.

import pandas as pd
df = pd.DataFrame({'date': ['2023-01-01', '2023-01-02', '2023-01-03', 
                            '2023-01-04', '2023-01-05', '2023-01-06', 
                            '2023-01-07', '2023-01-08', '2023-01-09'], 
                   'orders': [3432, 3233, 3434, 3235, 3436, 
                              3237, 3438, 3239, 3440]})
df
date orders
0 2023-01-01 3432
1 2023-01-02 3233
2 2023-01-03 3434
3 2023-01-04 3235
4 2023-01-05 3436
5 2023-01-06 3237
6 2023-01-07 3438
7 2023-01-08 3239
8 2023-01-09 3440

Create lagged variables with shift()

To use the shift() method you simply append the function to your dataframe column and pass in an integer argument to define how many rows you want to shift. To create a lag based on one day in our date column we’ll use shift(1), and to create a lag based on one week we’ll use shift(7), then we’ll use the Pandas assign() method to create a new column, and we’ll save it back to the dataframe.

If you look at new columns you’ll see that the orders_lag1 column contains either a NaN value when there was no previous row to shift, or the value of the row one row above. Similarly, for the orders_lag7 column, we have NaN values for the first seven dates because there were insufficient prior values.

df = df.assign(
    orders_lag1 = df['orders'].shift(1),
    orders_lag7 = df['orders'].shift(7)
)
df
date orders orders_lag1 orders_lag7
0 2023-01-01 3432 NaN NaN
1 2023-01-02 3233 3432.0 NaN
2 2023-01-03 3434 3233.0 NaN
3 2023-01-04 3235 3434.0 NaN
4 2023-01-05 3436 3235.0 NaN
5 2023-01-06 3237 3436.0 NaN
6 2023-01-07 3438 3237.0 NaN
7 2023-01-08 3239 3438.0 3432.0
8 2023-01-09 3440 3239.0 3233.0

Using shift() with groupby()

The other really neat thing you can do with shift() is combine it with a groupby() operation. To show this in action we’ll first create another dataframe containing various customer IDs, order IDs, order dates, and order amounts.

df = pd.DataFrame({'order_id': [100, 101, 102, 103, 104, 105],
                   'customer_id': [1, 2, 3, 1, 2, 2],
                   'order_date': ['2023-01-01', '2023-01-01', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06'],
                   'order_amount': [129.23, 82.23, 343.23, 12.22, 53.22, 292.11]})
df
order_id customer_id order_date order_amount
0 100 1 2023-01-01 129.23
1 101 2 2023-01-01 82.23
2 102 3 2023-01-03 343.23
3 103 1 2023-01-04 12.22
4 104 2 2023-01-05 53.22
5 105 2 2023-01-06 292.11

Next, we’ll use method chaining and a lambda function to calculate the date of each customer’s previous order, then we’ll use the Pandas to_datetime() method to calculate how many days have passed between the date of the current order and the date of the previous one, and we’ll return the value in days.

# Group by customer_id and return the date of the previous order
df = (df
     .assign(
        last_order_date = lambda x: x.groupby('customer_id')['order_date'].shift(1),
        days_since_last_order = lambda x: (pd.to_datetime(x['order_date']) 
        - pd.to_datetime(x['last_order_date'])).dt.days
     )
)
df
order_id customer_id order_date order_amount last_order_date days_since_last_order
0 100 1 2023-01-01 129.23 NaN NaN
1 101 2 2023-01-01 82.23 NaN NaN
2 102 3 2023-01-03 343.23 NaN NaN
3 103 1 2023-01-04 12.22 2023-01-01 3.0
4 104 2 2023-01-05 53.22 2023-01-01 4.0
5 105 2 2023-01-06 292.11 2023-01-05 1.0

Matt Clarke, Monday, January 16, 2023

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.