How to use the Pandas apply function on dataframe rows and columns

The Pandas apply function lets you apply a function to columns or rows in your Pandas dataframe. Here's how you use it.

How to use the Pandas apply function on dataframe rows and columns
Picture by Life of Pix, Pexels.
9 minutes to read

The Pandas apply() function allows you to run custom functions on the values in a Series or column of your Pandas dataframe. The Pandas apply function can be used for a wide range of data science tasks including Exploratory Data Analysis (or EDA) and in the feature engineering process that precedes machine learning model training.

In this tutorial I’ll show you how you can use the Pandas apply function to run a function on a column of your dataframe, and run a function on several columns for each row in your dataframe. It’s a powerful tool for data cleaning and feature engineering and is very easy to learn. Here’s how it works.

Load the package

First, open a Jupyter notebook and import the pandas package. If you don’t have this installed you can install it with the following command: pip3 install pandas, or by entering !pip3 install pandas in a cell in the Jupyter notebook. We’ll also use the Pandas set_option() function to increase the maximum number of columns that Pandas will display when printing a dataframe.

import pandas as pd
pd.set_option('max_columns', 100)

Load the data

Next, we’ll import data into Pandas by loading a remote CSV of data from my datasets GitHub repository using the read_csv() function. We can then use the head() function to display the first few rows of the dataframe.

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/telco.csv')
df.head()
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

To get an idea of what columns and data types are present in the Pandas dataframe we will use the Pandas info() function. The info() function returns the names of the columns, their data types and the number of null values.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

We can then use the Pandas describe() function to get a summary of the data and print some descriptive statistics to give us a feel of what the data looks like.

df.describe()
SeniorCitizen tenure MonthlyCharges
count 7043.000000 7043.000000 7043.000000
mean 0.162147 32.371149 64.761692
std 0.368612 24.559481 30.090047
min 0.000000 0.000000 18.250000
25% 0.000000 9.000000 35.500000
50% 0.000000 29.000000 70.350000
75% 0.000000 55.000000 89.850000
max 1.000000 72.000000 118.750000

Apply a function to each value in a dataframe column

The Pandas apply() function can be used in two main ways: it can be used to run a function on a specific column or Series in a Pandas dataframe, or it can be used to run a function on several columns in each row of the dataframe.

To start off, we will use the apply() function to run a function on a specific column in the dataframe. The apply() function takes a function as an argument and applies that function to each value in a column.

We’ll create a column called annual_charges() that contains the annual charges for each customer by multiplying the monthly charges by 12.

def annual_charges(monthly_charges):
    return 12 * monthly_charges

To run the function, we’ll create a new dataframe column called AnnualCharges using df['AnnualCharges'] and then run the function using df['AnnualCharges'].apply(annual_charges). This take the MonthlyCharges column value in each row, pass it to the annual_charges() function and put the result in the AnnualCharges column.

df['AnnualCharges'] = df['MonthlyCharges'].apply(annual_charges)
df[['customerID', 'tenure', 'MonthlyCharges', 'AnnualCharges']].head()
customerID tenure MonthlyCharges AnnualCharges
0 7590-VHVEG 1 29.85 358.2
1 5575-GNVDE 34 56.95 683.4
2 3668-QPYBK 2 53.85 646.2
3 7795-CFOCW 45 42.30 507.6
4 9237-HQITU 2 70.70 848.4

Apply a function to each row in dataframe column

To run a function on each row in a column, we can use the apply() function. This time, our function will take the entire row as an argument, and we will need to define the column names we want to use.

We’ll create a function called average_monthly_charges() that will multiply the value in the MonthlyCharges column by the value in the tenure column for every row in the Pandas dataframe.

def average_monthly_charges(row):
    return row['MonthlyCharges'] * row['tenure']

To run the function we use a similar approach as we did when using apply() on a single column, but we additionally pass in the axis=1 argument to tell Pandas to run the function on each row.

df['ChargesPerAnnum'] = df.apply(average_monthly_charges, axis=1)

If you run the code, you’ll see that Pandas has now looped over all the rows in the dataframe and run the function on the two columns we specified, assigning the new column to the dataframe.

df[['customerID', 'tenure', 'MonthlyCharges', 'ChargesPerAnnum', 'AnnualCharges']].head()
customerID tenure MonthlyCharges ChargesPerAnnum AnnualCharges
0 7590-VHVEG 1 29.85 29.85 358.2
1 5575-GNVDE 34 56.95 1936.30 683.4
2 3668-QPYBK 2 53.85 107.70 646.2
3 7795-CFOCW 45 42.30 1903.50 507.6
4 9237-HQITU 2 70.70 141.40 848.4

Matt Clarke, Saturday, November 27, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.