The Pandas apply()
function allows you to run custom functions on the values in a Series or column of your Pandas dataframe. The Pandas apply function can be used for a wide range of data science tasks including Exploratory Data Analysis (or EDA) and in the feature engineering process that precedes machine learning model training.
In this tutorial I’ll show you how you can use the Pandas apply function to run a function on a column of your dataframe, and run a function on several columns for each row in your dataframe. It’s a powerful tool for data cleaning and feature engineering and is very easy to learn. Here’s how it works.
First, open a Jupyter notebook and import the pandas
package. If you don’t have this installed you can install it with the following command: pip3 install pandas
, or by entering !pip3 install pandas
in a cell in the Jupyter notebook. We’ll also use the Pandas set_option()
function to increase the maximum number of columns that Pandas will display when printing a dataframe.
import pandas as pd
pd.set_option('max_columns', 100)
Next, we’ll import data into Pandas by loading a remote CSV of data from my datasets GitHub repository using the read_csv()
function. We can then use the head()
function to display the first few rows of the dataframe.
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/telco.csv')
df.head()
customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
To get an idea of what columns and data types are present in the Pandas dataframe we will use the Pandas info()
function. The info()
function returns the names of the columns, their data types and the number of null values.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7043 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
We can then use the Pandas describe()
function to get a summary of the data and print some descriptive statistics to give us a feel of what the data looks like.
df.describe()
SeniorCitizen | tenure | MonthlyCharges | |
---|---|---|---|
count | 7043.000000 | 7043.000000 | 7043.000000 |
mean | 0.162147 | 32.371149 | 64.761692 |
std | 0.368612 | 24.559481 | 30.090047 |
min | 0.000000 | 0.000000 | 18.250000 |
25% | 0.000000 | 9.000000 | 35.500000 |
50% | 0.000000 | 29.000000 | 70.350000 |
75% | 0.000000 | 55.000000 | 89.850000 |
max | 1.000000 | 72.000000 | 118.750000 |
The Pandas apply()
function can be used in two main ways: it can be used to run a function on a specific column or Series in a Pandas dataframe, or it can be used to run a function on several columns in each row of the dataframe.
To start off, we will use the apply()
function to run a function on a specific column in the dataframe. The apply()
function takes a function as an argument and applies that function to each value in a column.
We’ll create a column called annual_charges()
that contains the annual charges for each customer by multiplying the monthly charges by 12.
def annual_charges(monthly_charges):
return 12 * monthly_charges
To run the function, we’ll create a new dataframe column called AnnualCharges
using df['AnnualCharges']
and then run the function using df['AnnualCharges'].apply(annual_charges)
. This take the MonthlyCharges
column value in each row, pass it to the annual_charges()
function and put the result in the AnnualCharges
column.
df['AnnualCharges'] = df['MonthlyCharges'].apply(annual_charges)
df[['customerID', 'tenure', 'MonthlyCharges', 'AnnualCharges']].head()
customerID | tenure | MonthlyCharges | AnnualCharges | |
---|---|---|---|---|
0 | 7590-VHVEG | 1 | 29.85 | 358.2 |
1 | 5575-GNVDE | 34 | 56.95 | 683.4 |
2 | 3668-QPYBK | 2 | 53.85 | 646.2 |
3 | 7795-CFOCW | 45 | 42.30 | 507.6 |
4 | 9237-HQITU | 2 | 70.70 | 848.4 |
To run a function on each row in a column, we can use the apply()
function. This time, our function will take the entire row as an argument, and we will need to define the column names we want to use.
We’ll create a function called average_monthly_charges()
that will multiply the value in the MonthlyCharges
column by the value in the tenure
column for every row in the Pandas dataframe.
def average_monthly_charges(row):
return row['MonthlyCharges'] * row['tenure']
To run the function we use a similar approach as we did when using apply()
on a single column, but we additionally pass in the axis=1
argument to tell Pandas to run the function on each row.
df['ChargesPerAnnum'] = df.apply(average_monthly_charges, axis=1)
If you run the code, you’ll see that Pandas has now looped over all the rows in the dataframe and run the function on the two columns we specified, assigning the new column to the dataframe.
df[['customerID', 'tenure', 'MonthlyCharges', 'ChargesPerAnnum', 'AnnualCharges']].head()
customerID | tenure | MonthlyCharges | ChargesPerAnnum | AnnualCharges | |
---|---|---|---|---|---|
0 | 7590-VHVEG | 1 | 29.85 | 29.85 | 358.2 |
1 | 5575-GNVDE | 34 | 56.95 | 1936.30 | 683.4 |
2 | 3668-QPYBK | 2 | 53.85 | 107.70 | 646.2 |
3 | 7795-CFOCW | 45 | 42.30 | 1903.50 | 507.6 |
4 | 9237-HQITU | 2 | 70.70 | 141.40 | 848.4 |
Matt Clarke, Saturday, November 27, 2021