How to use the Pandas apply function on dataframe rows and columns

Picture by Life of Pix, Pexels.

9 minutes to read

Data Science Pandas

The Pandas apply() function allows you to run custom functions on the values in a Series or column of your Pandas dataframe. The Pandas apply function can be used for a wide range of data science tasks including Exploratory Data Analysis (or EDA) and in the feature engineering process that precedes machine learning model training.

In this tutorial I’ll show you how you can use the Pandas apply function to run a function on a column of your dataframe, and run a function on several columns for each row in your dataframe. It’s a powerful tool for data cleaning and feature engineering and is very easy to learn. Here’s how it works.

Load the package

First, open a Jupyter notebook and import the pandas package. If you don’t have this installed you can install it with the following command: pip3 install pandas, or by entering !pip3 install pandas in a cell in the Jupyter notebook. We’ll also use the Pandas set_option() function to increase the maximum number of columns that Pandas will display when printing a dataframe.

import pandas as pd
pd.set_option('max_columns', 100)

Load the data

Next, we’ll import data into Pandas by loading a remote CSV of data from my datasets GitHub repository using the read_csv() function. We can then use the head() function to display the first few rows of the dataframe.

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/telco.csv')
df.head()

	customerID	gender	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.5	No
2	3668-QPYBK	Male	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
3	7795-CFOCW	Male	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
4	9237-HQITU	Female	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes

To get an idea of what columns and data types are present in the Pandas dataframe we will use the Pandas info() function. The info() function returns the names of the columns, their data types and the number of null values.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 customerID        7043 non-null   object 
 gender            7043 non-null   object 
 SeniorCitizen     7043 non-null   int64  
 Partner           7043 non-null   object 
 Dependents        7043 non-null   object 
 tenure            7043 non-null   int64  
 PhoneService      7043 non-null   object 
 MultipleLines     7043 non-null   object 
 InternetService   7043 non-null   object 
 OnlineSecurity    7043 non-null   object 
OnlineBackup      7043 non-null   object 
DeviceProtection  7043 non-null   object 
TechSupport       7043 non-null   object 
StreamingTV       7043 non-null   object 
StreamingMovies   7043 non-null   object 
Contract          7043 non-null   object 
PaperlessBilling  7043 non-null   object 
PaymentMethod     7043 non-null   object 
MonthlyCharges    7043 non-null   float64
TotalCharges      7043 non-null   object 
Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

We can then use the Pandas describe() function to get a summary of the data and print some descriptive statistics to give us a feel of what the data looks like.

df.describe()

	SeniorCitizen	tenure	MonthlyCharges
count	7043.000000	7043.000000	7043.000000
mean	0.162147	32.371149	64.761692
std	0.368612	24.559481	30.090047
min	0.000000	0.000000	18.250000
25%	0.000000	9.000000	35.500000
50%	0.000000	29.000000	70.350000
75%	0.000000	55.000000	89.850000
max	1.000000	72.000000	118.750000

Apply a function to each value in a dataframe column

The Pandas apply() function can be used in two main ways: it can be used to run a function on a specific column or Series in a Pandas dataframe, or it can be used to run a function on several columns in each row of the dataframe.

To start off, we will use the apply() function to run a function on a specific column in the dataframe. The apply() function takes a function as an argument and applies that function to each value in a column.

We’ll create a column called annual_charges() that contains the annual charges for each customer by multiplying the monthly charges by 12.

def annual_charges(monthly_charges):
    return 12 * monthly_charges

To run the function, we’ll create a new dataframe column called AnnualCharges using df['AnnualCharges'] and then run the function using df['AnnualCharges'].apply(annual_charges). This take the MonthlyCharges column value in each row, pass it to the annual_charges() function and put the result in the AnnualCharges column.

df['AnnualCharges'] = df['MonthlyCharges'].apply(annual_charges)

df[['customerID', 'tenure', 'MonthlyCharges', 'AnnualCharges']].head()

	customerID	tenure	MonthlyCharges	AnnualCharges
0	7590-VHVEG	1	29.85	358.2
1	5575-GNVDE	34	56.95	683.4
2	3668-QPYBK	2	53.85	646.2
3	7795-CFOCW	45	42.30	507.6
4	9237-HQITU	2	70.70	848.4

Apply a function to each row in dataframe column

To run a function on each row in a column, we can use the apply() function. This time, our function will take the entire row as an argument, and we will need to define the column names we want to use.

We’ll create a function called average_monthly_charges() that will multiply the value in the MonthlyCharges column by the value in the tenure column for every row in the Pandas dataframe.

def average_monthly_charges(row):
    return row['MonthlyCharges'] * row['tenure']

To run the function we use a similar approach as we did when using apply() on a single column, but we additionally pass in the axis=1 argument to tell Pandas to run the function on each row.

df['ChargesPerAnnum'] = df.apply(average_monthly_charges, axis=1)

If you run the code, you’ll see that Pandas has now looped over all the rows in the dataframe and run the function on the two columns we specified, assigning the new column to the dataframe.

df[['customerID', 'tenure', 'MonthlyCharges', 'ChargesPerAnnum', 'AnnualCharges']].head()

	customerID	tenure	MonthlyCharges	ChargesPerAnnum	AnnualCharges
0	7590-VHVEG	1	29.85	29.85	358.2
1	5575-GNVDE	34	56.95	1936.30	683.4
2	3668-QPYBK	2	53.85	107.70	646.2
3	7795-CFOCW	45	42.30	1903.50	507.6
4	9237-HQITU	2	70.70	141.40	848.4

Matt Clarke, Saturday, November 27, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.