How to add feature engineering to a scikit-learn pipeline

Picture by Andrea Piacquadio, Pexels.

19 minutes to read

When building a machine learning model, feature engineering is one of the most important steps. Feature engineering is the process of creating new features from existing data and can often be the difference between an average model and an outstanding one. Models are often excellent at identifying patterns in data, but they can only do this if the data is in a format that they can understand and often benefit from domain knowledge to help them identify the most important patterns.

Feature engineering is typically undertaken on the Pandas dataframe from which you load your training data. However, there’s a more elegant way to do this. You can use the FunctionTransformer class from scikit-learn to create a scikit-learn pipeline that includes feature engineering steps.

The benefit of using FunctionTransformer is that you can use the same pipeline for training and testing. This is important because you don’t want to accidentally leak information from the test set into the training set. In this tutorial, you will learn how to use FunctionTransformer to create a scikit-learn pipeline that includes feature engineering steps for an XGBoost contractual churn model.

Load the packages

To get started, open a new Jupyter notebook and load the packages that you will need. To keep things simple, we’ll be using a single XGBoost classification model, and we’ll skip model selection and feature selection to keep the code easier to understand.

As well as Pandas and XGBoost we’ll be using a variety of scikit-learn classes. These include the FunctionTransformer class, the Pipeline class, the StandardScaler class, and the OneHotEncoder class, among others. We’ll also be using the Optuna package to tune the model. If you don’t have Optuna installed, you can install it via Pip using pip install optuna.

import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import roc_auc_score
import optuna

import warnings
warnings.filterwarnings("ignore")

Load the data

The dataset we’re using is a customer churn dataset from Kaggle. This includes a variety of customer information, such as the number of customer service calls they’ve made, the number of voicemail messages they’ve left, and the number of international calls they’ve made.

It also includes a target variable, churn, which indicates whether the customer has churned or not. Our classification model is going to predict whether a customer will churn or not by using the other variables as features.

Before we start, we’ll tidy the data by converting the churn column to a numeric value, so it can be used in the model. We’ll also change the column names to lowercase to make it easier to remember capitalisation using the Pandas rename function.

df = pd.read_csv('train.csv')
df = df.rename(columns=str.lower)
df['churn'] = df['churn'].replace(('yes', 'no'), (1, 0))
df.head(3).T

	0	1	2
state	OH	NJ	OH
account_length	107	137	84
area_code	area_code_415	area_code_415	area_code_408
international_plan	no	no	yes
voice_mail_plan	yes	no	no
number_vmail_messages	26	0	0
total_day_minutes	161.6	243.4	299.4
total_day_calls	123	114	71
total_day_charge	27.47	41.38	50.9
total_eve_minutes	195.5	121.2	61.9
total_eve_calls	103	110	88
total_eve_charge	16.62	10.3	5.26
total_night_minutes	254.4	162.6	196.9
total_night_calls	103	104	89
total_night_charge	11.45	7.32	8.86
total_intl_minutes	13.7	12.2	6.6
total_intl_calls	3	5	7
total_intl_charge	3.7	3.29	1.78
number_customer_service_calls	1	0	2
churn	0	0	0

Create the test and train datasets

Next, we’ll assign all the columns apart from the churn column to our X feature set and the churn column to our y target variable. We’ll then split the data into a training set and a test set using the train_test_split function from scikit-learn. We’ll use 30% of the data for testing and 70% for training.

X = df.drop(['churn'], axis=1)
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=0)

Create our feature engineering functions

We’ll create a number of functions to create new features from the existing features. These functions will be used in the FunctionTransformer class to create new features from the existing features. These functions are very basic and simply add various Pandas column values together to create new features that might help improve our customer churn model’s performance.

They calculate the total minutes spent on calls, the total number of calls, the total charge and the total number of customer service calls. Each function takes the dataframe as an input and returns a new dataframe with the new features in a new column. The feature engineering functions you choose to create could do almost anything.

def get_total_net_minutes(df):
    df['total_net_minutes'] = df['total_day_minutes'] + df['total_eve_minutes'] + df['total_night_minutes']
    return df

def get_total_net_calls(df):
    df['total_net_calls'] = df['total_day_calls'] + df['total_eve_calls'] + df['total_night_calls']
    return df

def get_total_net_charge(df):
    df['total_net_charge'] = df['total_day_charge'] + df['total_eve_charge'] + df['total_night_charge']
    return df

def cs_calls_per_month(df):
    df['cs_calls_per_month'] = (df['number_customer_service_calls'] + df['number_vmail_messages']) / df['account_length']
    return df

Create a feature engineering pipeline

Next, we’ll use the ColumnTransformer class to run each of our feature engineering functions via the FunctionTransformer class. For each one, we’re definining a name, calling FunctionTransformer(), passing in the function name and then defining the columns that the function should be applied to. Everything is stored in a variable called feature_engineering so we can call it later from the pipeline.

feature_engineering = ColumnTransformer([
    ('total_net_minutes', FunctionTransformer(get_total_net_minutes, validate=False),
     ['total_day_minutes', 'total_eve_minutes', 'total_night_minutes']),
    ('total_net_calls', FunctionTransformer(get_total_net_calls, validate=False),
     ['total_day_calls', 'total_eve_calls', 'total_night_calls']),
    ('total_net_charge', FunctionTransformer(get_total_net_charge, validate=False),
     ['total_day_charge', 'total_eve_charge', 'total_night_charge']),
    ('cs_calls_per_month', FunctionTransformer(cs_calls_per_month, validate=False),
     ['account_length', 'number_customer_service_calls', 'number_vmail_messages']),
])

Define the categorical and numerical columns

There are often other things you might want to do to your data before you pass it to the model. To save the hassle of doing this in Pandas on every column, we’ll instead use the select_dtypes() function to select columns based on their Pandas dtypes. We’ll identify the numeric columns and the categorical columns and store them in variables called numeric_columns and categorical_columns respectively.

categorical_columns = list(X_train.select_dtypes(include=['object']).columns.values.tolist())
numeric_columns = list(X_train.select_dtypes(exclude=['object']).columns.values.tolist())

Create pipelines for numeric and categorical data

For the numerical data, we’ll use the SimpleImputer class to fill in any missing values with the mean of the column. For the categorical columns we’ll use the SimpleImputer class to fill in any missing values with the most frequent value in the column. We’ll then use the OneHotEncoder class to one-hot encode the categorical columns.

numeric_transformer = SimpleImputer(strategy='constant')

categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ])

Create a preprocessor

Now we’ve created our feature engineering pipeline and our pipelines for the numerical and categorical data, we’ll use the ColumnTransformer class to combine them into a single preprocessor. We’ll use the numeric_columns and categorical_columns variables we created earlier to define which columns should be passed to which pipeline.

preprocessor = ColumnTransformer(
    transformers=[
        ('feature_engineering', feature_engineering, numeric_columns),
        ('numeric_transformers', numeric_transformer, numeric_columns),
        ('categorical_transformers', categorical_transformer, categorical_columns),
    ])

Create the contractual churn model

Finally, we’ll create our contractual churn churn model. We’ll use the XGBClassifier class from XGBoost to create our churn model. As I have a powerful GPU in my data science workstation, I’m passing in the optional tree_method='gpu_hist' parameter to use the GPU to train the model. If you don’t have a GPU, you can remove this parameter. We’ll then use the Pipeline class to combine our preprocessor and our model into a single pipeline. We’ll then fit the pipeline to our training data and use it to make predictions on our test data.

model = XGBClassifier(random_state=0, eval_metric='mlogloss', tree_method='gpu_hist')
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', model)])
pipeline.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('feature_engineering',
                                                  ColumnTransformer(transformers=[('total_net_minutes',
                                                                                   FunctionTransformer(func=<function get_total_net_minutes at 0x7f2db43c5680>),
                                                                                   ['total_day_minutes',
                                                                                    'total_eve_minutes',
                                                                                    'total_night_minutes']),
                                                                                  ('total_net_calls',
                                                                                   FunctionTransformer(func=<function get_total_net_call...
                               gamma=0, gpu_id=0, importance_type='gain',
                               interaction_constraints='',
                               learning_rate=0.300000012, max_delta_step=0,
                               max_depth=6, min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=100,
                               n_jobs=0, num_parallel_tree=1, random_state=0,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                               subsample=1, tree_method='gpu_hist',
                               validate_parameters=1, verbosity=None))])

Evaluate the base model

Now that we’ve created our model, we can evaluate it. We’ll first generate some predictions from the test data using the predict() function. We’ll then use the accuracy_score() function to calculate the accuracy of the model and the roc_auc_score() function to calculate the AUC of the model. XGBClassifier is extremely effective, so the base customer churn model scores very well indeed for a first attempt.

predictions = pipeline.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, predictions))
print('AUC: ', roc_auc_score(y_test, predictions))

Accuracy:  0.9764705882352941
AUC:  0.9157312505900991

Use Optuna to tune the model

We could leave things there, but we can probably generate some easy improvements by using Optuna to tune the XGBoost hyperparameters. Optuna is a powerful hyperparameter tuning library that can be used to tune the hyperparameters of any machine learning model.

To get started, we first need to create an objective function. The objective function is the function that Optuna will try to minimize. In this case, we’ll use the accuracy_score(). We’ll then create a study object and pass in the objective function. We’ll then use the optimize() function to start the optimization process.

```python
def objective(trial):
    params = {
        'model__n_estimators': trial.suggest_int('model__n_estimators', 100, 1000),
        'model__learning_rate': trial.suggest_float('model__learning_rate', 0.01, 0.1),
        'model__max_depth': trial.suggest_int('model__max_depth', 3, 10),
        'model__min_child_weight': trial.suggest_int('model__min_child_weight', 1, 10),
        'model__gamma': trial.suggest_float('model__gamma', 0.01, 0.1),
        'model__subsample': trial.suggest_float('model__subsample', 0.01, 1.0),
        'model__colsample_bytree': trial.suggest_float('model__colsample_bytree', 0.01, 1.0),
        'model__reg_alpha': trial.suggest_float('model__reg_alpha', 1e-5, 10.0),
        'model__reg_lambda': trial.suggest_float('model__reg_lambda', 1e-5, 10.0),
        'model__scale_pos_weight': trial.suggest_float('model__scale_pos_weight', 1e-5, 10.0),
        'model__n_jobs': 4,
        'model__eval_metric': 'mlogloss',
        'model__tree_method': 'gpu_hist',
    }
    pipeline.set_params(**params)
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
    return accuracy_score(y_test, predictions)

Run the optimization

Next, we’ll use Optuna to run the optimization and create a study that sets out to maximise the accuracy of the model. We’ll then use the optimize() function to start the optimization process and we’ll run 100 trials, showing a progress bar as the tasks run. To avoid clogging up my notebook with data, I’ve also optionally disable verbose output from Optuna.

optuna.logging.set_verbosity(optuna.logging.WARNING)
study = optuna.create_study(study_name='churn model', 
                            direction='maximize')
study.optimize(objective, n_trials=100, show_progress_bar=True)

Print the best parameters

Depending on the speed of your computer that should take a few minutes to run. Once it’s finished, we can print the best parameters that Optuna found and the maximum score achieved. It looks like Optuna was able to find a model that scores almost 98% accuracy, which is a nice little improvement on the base model.

print('Best parameters', study.best_params)

Best parameters {'model__n_estimators': 389, 'model__learning_rate': 0.09650401509403127, 'model__max_depth': 8, 'model__min_child_weight': 1, 'model__gamma': 0.09735495146805667, 'model__subsample': 0.8390259995692267, 'model__colsample_bytree': 0.9689657157655308, 'model__reg_alpha': 9.068919816172016, 'model__reg_lambda': 5.966582881537109, 'model__scale_pos_weight': 5.53771951672144}

print('Best score', study.best_value)

Best score 0.9780392156862745

print('Best model', study.best_trial)

Best model FrozenTrial(number=75, values=[0.9780392156862745], datetime_start=datetime.datetime(2022, 10, 16, 9, 27, 26, 328358), datetime_complete=datetime.datetime(2022, 10, 16, 9, 27, 27, 248163), params={'model__n_estimators': 389, 'model__learning_rate': 0.09650401509403127, 'model__max_depth': 8, 'model__min_child_weight': 1, 'model__gamma': 0.09735495146805667, 'model__subsample': 0.8390259995692267, 'model__colsample_bytree': 0.9689657157655308, 'model__reg_alpha': 9.068919816172016, 'model__reg_lambda': 5.966582881537109, 'model__scale_pos_weight': 5.53771951672144}, distributions={'model__n_estimators': IntDistribution(high=1000, log=False, low=100, step=1), 'model__learning_rate': FloatDistribution(high=0.1, log=False, low=0.01, step=None), 'model__max_depth': IntDistribution(high=10, log=False, low=3, step=1), 'model__min_child_weight': IntDistribution(high=10, log=False, low=1, step=1), 'model__gamma': FloatDistribution(high=0.1, log=False, low=0.01, step=None), 'model__subsample': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'model__colsample_bytree': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'model__reg_alpha': FloatDistribution(high=10.0, log=False, low=1e-05, step=None), 'model__reg_lambda': FloatDistribution(high=10.0, log=False, low=1e-05, step=None), 'model__scale_pos_weight': FloatDistribution(high=10.0, log=False, low=1e-05, step=None)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=75, state=TrialState.COMPLETE, value=None)

Re-fit the model using the best parameters

Finally, we can re-fit the model using the best parameters found by Optuna and then use it to make predictions on the test set. We can then print the accuracy score to see how well the model performs on unseen data.

pipeline.set_params(**study.best_params)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

The tuned model achieves an accuracy score of 97.8%, which is a nice improvement on the base model, as well as a ROC/AUC score of 92.6%, so it shows why it makes sense to spend an extra five minutes tuning hyperparameters.

print('Accuracy: ', accuracy_score(y_test, predictions))
print('AUC: ', roc_auc_score(y_test, predictions))

Accuracy:  0.9780392156862745
AUC:  0.9263845032153835

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1102
           1       0.98      0.86      0.91       173

    accuracy                           0.98      1275
   macro avg       0.98      0.93      0.95      1275
weighted avg       0.98      0.98      0.98      1275

Matt Clarke, Sunday, October 16, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.