When building a machine learning model, feature engineering is one of the most important steps. Feature engineering is the process of creating new features from existing data, and it can often be the difference between an average model and an outstanding one. Models are often excellent at identifying patterns in data, but only if the data is in a format they can use, and they often benefit from domain knowledge that highlights the most important patterns.
Feature engineering is typically undertaken on the Pandas dataframe from which you load your training data. However, there’s a more elegant way to do this. You can use the FunctionTransformer class from scikit-learn to create a scikit-learn pipeline that includes feature engineering steps.
The benefit of using FunctionTransformer is that you can use the same pipeline for training and testing. This is important because you don’t want to accidentally leak information from the test set into the training set. In this tutorial, you will learn how to use FunctionTransformer to create a scikit-learn pipeline that includes feature engineering steps for an XGBoost contractual churn model.
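To make the idea concrete before we build the real thing, here is a minimal sketch of a feature function wrapped in FunctionTransformer inside a Pipeline. The column names, the add_total_minutes helper, the toy data, and the LogisticRegression stand-in model are all invented for illustration and are not part of the churn project.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_total_minutes(df):
    """Hypothetical feature function: return a copy with a total_minutes column."""
    return df.assign(total_minutes=df["day_minutes"] + df["eve_minutes"])


# Toy data, invented purely for illustration.
X_toy = pd.DataFrame({
    "day_minutes": [10.0, 200.0, 15.0, 250.0],
    "eve_minutes": [5.0, 150.0, 10.0, 180.0],
})
y_toy = [0, 1, 0, 1]

toy_pipeline = Pipeline(steps=[
    ("features", FunctionTransformer(add_total_minutes)),
    ("model", LogisticRegression()),  # stand-in model, not XGBoost
])

# fit() and predict() both run add_total_minutes() before the model sees the data,
# so the engineered column is created the same way at training and prediction time.
toy_pipeline.fit(X_toy, y_toy)
engineered = toy_pipeline.named_steps["features"].transform(X_toy)
toy_predictions = toy_pipeline.predict(X_toy)
```

Because the feature function lives inside the pipeline, there is no separate preprocessing script to keep in sync when new data arrives.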
To get started, open a new Jupyter notebook and load the packages that you will need. To keep things simple, we’ll be using a single XGBoost classification model, and we’ll skip model selection and feature selection to keep the code easier to understand.
As well as Pandas and XGBoost, we’ll be using a variety of scikit-learn classes. These include the FunctionTransformer class, the Pipeline class, the ColumnTransformer class, the SimpleImputer class, and the OneHotEncoder class, among others. We’ll also be using the Optuna package to tune the model. If you don’t have Optuna installed, you can install it via Pip using pip install optuna.
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import roc_auc_score
import optuna
import warnings
warnings.filterwarnings("ignore")
The dataset we’re using is a customer churn dataset from Kaggle. This includes a variety of customer information, such as the number of customer service calls they’ve made, the number of voicemail messages they’ve left, and the number of international calls they’ve made.
It also includes a target variable, churn, which indicates whether the customer has churned or not. Our classification model is going to predict whether a customer will churn or not by using the other variables as features.
Before we start, we’ll tidy the data by converting the churn column to a numeric value, so it can be used in the model. We’ll also use the Pandas rename function to change the column names to lowercase, so we don’t need to remember their capitalisation.
df = pd.read_csv('train.csv')
df = df.rename(columns=str.lower)
df['churn'] = df['churn'].replace(('yes', 'no'), (1, 0))
df.head(3).T
| | 0 | 1 | 2 |
|---|---|---|---|
| state | OH | NJ | OH |
| account_length | 107 | 137 | 84 |
| area_code | area_code_415 | area_code_415 | area_code_408 |
| international_plan | no | no | yes |
| voice_mail_plan | yes | no | no |
| number_vmail_messages | 26 | 0 | 0 |
| total_day_minutes | 161.6 | 243.4 | 299.4 |
| total_day_calls | 123 | 114 | 71 |
| total_day_charge | 27.47 | 41.38 | 50.9 |
| total_eve_minutes | 195.5 | 121.2 | 61.9 |
| total_eve_calls | 103 | 110 | 88 |
| total_eve_charge | 16.62 | 10.3 | 5.26 |
| total_night_minutes | 254.4 | 162.6 | 196.9 |
| total_night_calls | 103 | 104 | 89 |
| total_night_charge | 11.45 | 7.32 | 8.86 |
| total_intl_minutes | 13.7 | 12.2 | 6.6 |
| total_intl_calls | 3 | 5 | 7 |
| total_intl_charge | 3.7 | 3.29 | 1.78 |
| number_customer_service_calls | 1 | 0 | 2 |
| churn | 0 | 0 | 0 |
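The tidying step can also be written with an explicit mapping, which some readers find easier to audit. This is a small sketch on an invented three-row dataframe (the mixed-case column names are an assumption about what the raw file might look like); any label missing from the mapping becomes NaN, so unexpected values are easy to spot.

```python
import pandas as pd

# Invented sample rows; the capitalised column names are hypothetical.
raw = pd.DataFrame({
    "State": ["OH", "NJ", "OH"],
    "Account_Length": [107, 137, 84],
    "Churn": ["no", "no", "yes"],
})

# Lowercase every column name by passing str.lower as the renaming function.
tidy = raw.rename(columns=str.lower)

# Map the text labels to integers; labels outside the mapping would become NaN.
tidy["churn"] = tidy["churn"].map({"yes": 1, "no": 0})
unmapped = int(tidy["churn"].isna().sum())
```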
Next, we’ll assign all the columns apart from the churn column to our X feature set and the churn column to our y target variable. We’ll then split the data into a training set and a test set using the train_test_split function from scikit-learn. We’ll use 30% of the data for testing and 70% for training.
X = df.drop(['churn'], axis=1)
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=0)
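Churn data is usually imbalanced, so it can be worth checking that both splits contain a similar share of churners. The sketch below uses the stratify argument of train_test_split on invented data with 20% positives; stratification is an optional extra and was not used for the results reported in this tutorial.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented imbalanced data: 20 churners and 80 non-churners.
toy = pd.DataFrame({"feature": range(100), "churn": [1] * 20 + [0] * 80})
X_demo = toy.drop(columns=["churn"])
y_demo = toy["churn"]

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

train_rate = y_tr.mean()  # share of churners in the training split
test_rate = y_te.mean()   # share of churners in the test split
```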
We’ll now write a handful of functions that create new features from the existing ones. Each will be wrapped in the FunctionTransformer class so it can run inside the pipeline. The functions are deliberately basic: they combine Pandas column values to create new features that might help improve our customer churn model’s performance.
They calculate the total minutes spent on calls, the total number of calls, the total charge, and a rough rate of customer contact (customer service calls plus voicemail messages, divided by the account length). Each function takes a dataframe as input and returns a dataframe with the new feature in an additional column. The feature engineering functions you choose to create could do almost anything.
def get_total_net_minutes(df):
    # assign() returns a new dataframe, so the input is not modified in place.
    return df.assign(total_net_minutes=df['total_day_minutes'] + df['total_eve_minutes'] + df['total_night_minutes'])

def get_total_net_calls(df):
    return df.assign(total_net_calls=df['total_day_calls'] + df['total_eve_calls'] + df['total_night_calls'])

def get_total_net_charge(df):
    return df.assign(total_net_charge=df['total_day_charge'] + df['total_eve_charge'] + df['total_night_charge'])

def cs_calls_per_month(df):
    # Customer service calls plus voicemail messages, per unit of account length.
    return df.assign(cs_calls_per_month=(df['number_customer_service_calls'] + df['number_vmail_messages']) / df['account_length'])
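Before wiring a feature function into a pipeline, it helps to run it on a couple of rows and check the result by hand. This sketch uses a non-mutating variant of get_total_net_minutes (written with assign, which is my choice here) and the first two rows of the sample table shown earlier.

```python
import pandas as pd


def get_total_net_minutes(df):
    """Return a copy of df with day, evening and night minutes summed."""
    return df.assign(
        total_net_minutes=df["total_day_minutes"]
        + df["total_eve_minutes"]
        + df["total_night_minutes"]
    )


# First two rows of the sample table shown earlier.
sample = pd.DataFrame({
    "total_day_minutes": [161.6, 243.4],
    "total_eve_minutes": [195.5, 121.2],
    "total_night_minutes": [254.4, 162.6],
})

result = get_total_net_minutes(sample)
first_total = result["total_net_minutes"].iloc[0]   # 161.6 + 195.5 + 254.4
second_total = result["total_net_minutes"].iloc[1]  # 243.4 + 121.2 + 162.6
```

Returning a copy rather than modifying the input keeps the original dataframe unchanged, which avoids surprises when the same data is reused elsewhere in a notebook.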
Next, we’ll use the ColumnTransformer class to run each of our feature engineering functions via the FunctionTransformer class. For each one, we define a name, wrap the function in FunctionTransformer(), and list the columns the function should receive. Because each function returns its input columns along with the new column, each transformer outputs both. Everything is stored in a variable called feature_engineering so we can call it later from the pipeline.
feature_engineering = ColumnTransformer([
    ('total_net_minutes', FunctionTransformer(get_total_net_minutes, validate=False),
     ['total_day_minutes', 'total_eve_minutes', 'total_night_minutes']),
    ('total_net_calls', FunctionTransformer(get_total_net_calls, validate=False),
     ['total_day_calls', 'total_eve_calls', 'total_night_calls']),
    ('total_net_charge', FunctionTransformer(get_total_net_charge, validate=False),
     ['total_day_charge', 'total_eve_charge', 'total_night_charge']),
    ('cs_calls_per_month', FunctionTransformer(cs_calls_per_month, validate=False),
     ['account_length', 'number_customer_service_calls', 'number_vmail_messages']),
])
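It is worth seeing what a ColumnTransformer built this way actually emits. In this sketch (the add_row_sum helper and the a to d columns are invented for illustration), each transformer receives only its listed columns and returns them plus one new column, and the results are stacked side by side.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer


def add_row_sum(df):
    """Hypothetical helper: append the row-wise sum of the columns it receives."""
    return df.assign(row_sum=df.sum(axis=1))


toy = pd.DataFrame({
    "a": [1.0, 2.0], "b": [3.0, 4.0],
    "c": [5.0, 6.0], "d": [7.0, 8.0],
})

ct = ColumnTransformer([
    ("ab_sum", FunctionTransformer(add_row_sum), ["a", "b"]),
    ("cd_sum", FunctionTransformer(add_row_sum), ["c", "d"]),
])

# Output columns are: a, b, a+b, c, d, c+d (stacked in transformer order).
out = np.asarray(ct.fit_transform(toy))
```

Each transformer contributes its two input columns plus the new sum, giving six output columns in total. Any column not listed in a transformer is dropped, since the default is remainder='drop'.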
There are often other things you might want to do to your data before you pass it to the model. To save the hassle of doing this in Pandas on every column, we’ll instead use the select_dtypes() method to select columns based on their Pandas dtypes. We’ll identify the numeric columns and the categorical columns and store them in variables called numeric_columns and categorical_columns respectively.
categorical_columns = X_train.select_dtypes(include=['object']).columns.tolist()
numeric_columns = X_train.select_dtypes(exclude=['object']).columns.tolist()
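As a quick illustration of how the dtype split behaves, here is a sketch on an invented two-row dataframe. The text columns are created with an explicit object dtype, which is an assumption made so the example behaves the same regardless of how a given Pandas version infers string columns.

```python
import pandas as pd

# Invented rows; text columns are given an explicit object dtype.
toy = pd.DataFrame({
    "state": pd.Series(["OH", "NJ"], dtype=object),
    "account_length": [107, 137],
    "total_day_minutes": [161.6, 243.4],
    "international_plan": pd.Series(["no", "no"], dtype=object),
})

# Object-typed columns are treated as categorical, everything else as numeric.
categorical = toy.select_dtypes(include=["object"]).columns.tolist()
numeric = toy.select_dtypes(exclude=["object"]).columns.tolist()
```

Printing both lists after loading real data is a cheap way to catch a numeric column that was read in as text.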
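For the numerical data, we’ll use the SimpleImputer class with strategy='constant', which fills in any missing values with a constant (0 by default for numeric data). For the categorical columns, we’ll use the SimpleImputer class to fill in any missing values with the placeholder string 'missing'. We’ll then use the OneHotEncoder class to one-hot encode the categorical columns. If you would rather impute with the column mean or the most frequent value, pass strategy='mean' or strategy='most_frequent' instead.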
numeric_transformer = SimpleImputer(strategy='constant')
categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ])
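The behaviour of these two classes is easy to check on tiny arrays. This sketch, using invented values, compares the constant and mean imputation strategies and shows what handle_unknown='ignore' does when the encoder meets a category it did not see during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

numeric = np.array([[1.0], [np.nan], [3.0]])

# strategy='constant' with no fill_value replaces numeric NaN with 0.
filled_constant = SimpleImputer(strategy="constant").fit_transform(numeric)

# strategy='mean' replaces NaN with the column mean, here (1 + 3) / 2 = 2.
filled_mean = SimpleImputer(strategy="mean").fit_transform(numeric)

# The encoder learns the categories 'no' and 'yes' (stored in sorted order).
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["yes"], ["no"]], dtype=object))

# An unseen category is encoded as all zeros instead of raising an error.
encoded_unknown = encoder.transform(np.array([["maybe"]], dtype=object)).toarray()
```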
Now we’ve created our feature engineering pipeline and our pipelines for the numerical and categorical data, we’ll use the ColumnTransformer class to combine them into a single preprocessor. We’ll use the numeric_columns and categorical_columns variables we created earlier to define which columns should be passed to which pipeline.
preprocessor = ColumnTransformer(
    transformers=[
        ('feature_engineering', feature_engineering, numeric_columns),
        ('numeric_transformers', numeric_transformer, numeric_columns),
        ('categorical_transformers', categorical_transformer, categorical_columns),
    ])
Finally, we’ll create our contractual churn model using the XGBClassifier class from XGBoost. As I have a powerful GPU in my data science workstation, I’m passing in the optional tree_method='gpu_hist' parameter to use the GPU to train the model. If you don’t have a GPU, you can remove this parameter. Note that XGBoost 2.0 and later deprecate gpu_hist in favour of tree_method='hist' combined with device='cuda', so you may need to adjust this for your installed version. We’ll then use the Pipeline class to combine our preprocessor and our model into a single pipeline, fit the pipeline to our training data, and use it to make predictions on our test data.
# 'logloss' is the binary classification metric; 'mlogloss' is its multiclass counterpart.
model = XGBClassifier(random_state=0, eval_metric='logloss', tree_method='gpu_hist')
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', model)])
pipeline.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('feature_engineering',
ColumnTransformer(transformers=[('total_net_minutes',
FunctionTransformer(func=<function get_total_net_minutes at 0x7f2db43c5680>),
['total_day_minutes',
'total_eve_minutes',
'total_night_minutes']),
('total_net_calls',
FunctionTransformer(func=<function get_total_net_call...
gamma=0, gpu_id=0, importance_type='gain',
interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100,
n_jobs=0, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
subsample=1, tree_method='gpu_hist',
validate_parameters=1, verbosity=None))])
Now that we’ve created our model, we can evaluate it. We’ll first generate some predictions from the test data using the predict() method. We’ll then use the accuracy_score() function to calculate the accuracy of the model and the roc_auc_score() function to calculate the AUC of the model. Note that the AUC here is computed from hard 0/1 predictions rather than predicted probabilities, so it reflects a single decision threshold rather than the full ROC curve. XGBClassifier is extremely effective, so the base customer churn model scores very well indeed for a first attempt.
predictions = pipeline.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, predictions))
print('AUC: ', roc_auc_score(y_test, predictions))
Accuracy: 0.9764705882352941
AUC: 0.9157312505900991
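The AUC above is calculated from hard class labels. roc_auc_score() is more commonly given scores or probabilities, which in a pipeline like this one would come from something like pipeline.predict_proba(X_test)[:, 1]. The sketch below uses four invented observations, with no model involved, to show how the two inputs can give different values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and predicted churn probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1])
churn_probability = np.array([0.1, 0.6, 0.4, 0.9])

# Thresholding at 0.5 turns the probabilities into hard labels: [0, 1, 0, 1].
hard_labels = (churn_probability >= 0.5).astype(int)

# Probabilities rank 3 of the 4 positive/negative pairs correctly.
auc_from_probabilities = roc_auc_score(y_true, churn_probability)

# Hard labels discard the ranking information within each predicted class.
auc_from_labels = roc_auc_score(y_true, hard_labels)
```

Here the probability-based AUC is 0.75 while the label-based AUC is 0.5, so the two figures should not be compared with each other directly.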
We could leave things there, but we can probably generate some easy improvements by using Optuna to tune the XGBoost hyperparameters. Optuna is a powerful hyperparameter tuning library that can be used to tune the hyperparameters of any machine learning model.
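To get started, we first need to create an objective function. The objective function returns the value that Optuna will try to optimise. In this case, that’s the accuracy_score(), which we want to maximise. We’ll then create a study object, pass in the objective function, and use the optimize() method to start the optimization process. Bear in mind that this objective scores each trial on the test set, so the tuned score will be somewhat optimistic; scoring on a separate validation split or with cross-validation would avoid this.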
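One way to keep the test set out of the tuning loop is to score each candidate with cross-validation on the training data only. The sketch below shows that scoring step in isolation, using synthetic data and a LogisticRegression stand-in rather than the churn pipeline; the cv_accuracy helper and the model__C parameter are assumptions for illustration. Inside an Optuna objective, the params dictionary would be built from trial.suggest_* calls and the function's return value would be returned to Optuna.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real tutorial uses the churn dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),  # stand-in for XGBClassifier
])


def cv_accuracy(params):
    """Hypothetical helper: mean 5-fold accuracy on the training data only."""
    demo_pipeline.set_params(**params)
    scores = cross_val_score(demo_pipeline, X_tr, y_tr, cv=5, scoring="accuracy")
    return scores.mean()


# The test split (X_te, y_te) is never touched while comparing candidates.
score = cv_accuracy({"model__C": 0.5})
```

This costs roughly five fits per trial instead of one, but the test set then remains a fair check of the final model.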
```python
def objective(trial):
    params = {
        'model__n_estimators': trial.suggest_int('model__n_estimators', 100, 1000),
        'model__learning_rate': trial.suggest_float('model__learning_rate', 0.01, 0.1),
        'model__max_depth': trial.suggest_int('model__max_depth', 3, 10),
        'model__min_child_weight': trial.suggest_int('model__min_child_weight', 1, 10),
        'model__gamma': trial.suggest_float('model__gamma', 0.01, 0.1),
        'model__subsample': trial.suggest_float('model__subsample', 0.01, 1.0),
        'model__colsample_bytree': trial.suggest_float('model__colsample_bytree', 0.01, 1.0),
        'model__reg_alpha': trial.suggest_float('model__reg_alpha', 1e-5, 10.0),
        'model__reg_lambda': trial.suggest_float('model__reg_lambda', 1e-5, 10.0),
        'model__scale_pos_weight': trial.suggest_float('model__scale_pos_weight', 1e-5, 10.0),
        'model__n_jobs': 4,
        'model__eval_metric': 'logloss',  # binary metric; 'mlogloss' is for multiclass
        'model__tree_method': 'gpu_hist',
    }
    # The model__ prefix routes each parameter to the pipeline's 'model' step.
    pipeline.set_params(**params)
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
    return accuracy_score(y_test, predictions)
```
Next, we’ll create an Optuna study that sets out to maximise the accuracy of the model. We’ll then use the optimize() method to run 100 trials, showing a progress bar as the tasks run. To avoid clogging up my notebook with output, I’ve also disabled verbose output from Optuna.
optuna.logging.set_verbosity(optuna.logging.WARNING)
study = optuna.create_study(study_name='churn model',
                            direction='maximize')
study.optimize(objective, n_trials=100, show_progress_bar=True)
Depending on the speed of your computer, that should take a few minutes to run. Once it’s finished, we can print the best parameters that Optuna found and the maximum score achieved. It looks like Optuna was able to find a model that scores almost 98% accuracy (97.8% versus 97.6%), which is a small improvement on the base model.
print('Best parameters', study.best_params)
Best parameters {'model__n_estimators': 389, 'model__learning_rate': 0.09650401509403127, 'model__max_depth': 8, 'model__min_child_weight': 1, 'model__gamma': 0.09735495146805667, 'model__subsample': 0.8390259995692267, 'model__colsample_bytree': 0.9689657157655308, 'model__reg_alpha': 9.068919816172016, 'model__reg_lambda': 5.966582881537109, 'model__scale_pos_weight': 5.53771951672144}
print('Best score', study.best_value)
Best score 0.9780392156862745
print('Best model', study.best_trial)
Best model FrozenTrial(number=75, values=[0.9780392156862745], datetime_start=datetime.datetime(2022, 10, 16, 9, 27, 26, 328358), datetime_complete=datetime.datetime(2022, 10, 16, 9, 27, 27, 248163), params={'model__n_estimators': 389, 'model__learning_rate': 0.09650401509403127, 'model__max_depth': 8, 'model__min_child_weight': 1, 'model__gamma': 0.09735495146805667, 'model__subsample': 0.8390259995692267, 'model__colsample_bytree': 0.9689657157655308, 'model__reg_alpha': 9.068919816172016, 'model__reg_lambda': 5.966582881537109, 'model__scale_pos_weight': 5.53771951672144}, distributions={'model__n_estimators': IntDistribution(high=1000, log=False, low=100, step=1), 'model__learning_rate': FloatDistribution(high=0.1, log=False, low=0.01, step=None), 'model__max_depth': IntDistribution(high=10, log=False, low=3, step=1), 'model__min_child_weight': IntDistribution(high=10, log=False, low=1, step=1), 'model__gamma': FloatDistribution(high=0.1, log=False, low=0.01, step=None), 'model__subsample': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'model__colsample_bytree': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'model__reg_alpha': FloatDistribution(high=10.0, log=False, low=1e-05, step=None), 'model__reg_lambda': FloatDistribution(high=10.0, log=False, low=1e-05, step=None), 'model__scale_pos_weight': FloatDistribution(high=10.0, log=False, low=1e-05, step=None)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=75, state=TrialState.COMPLETE, value=None)
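Finally, we can re-fit the model using the best parameters found by Optuna and then use it to make predictions on the test set. We can then print the accuracy score to see how well the model performs. Because the test set was also used to choose the hyperparameters, treat this as an optimistic estimate rather than a true measure of performance on unseen data.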
pipeline.set_params(**study.best_params)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
The tuned model achieves an accuracy score of 97.8%, which is a nice improvement on the base model, as well as a ROC/AUC score of 92.6%, so it shows why it makes sense to spend an extra five minutes tuning hyperparameters.
print('Accuracy: ', accuracy_score(y_test, predictions))
print('AUC: ', roc_auc_score(y_test, predictions))
Accuracy: 0.9780392156862745
AUC: 0.9263845032153835
print(classification_report(y_test, predictions))
precision recall f1-score support
0 0.98 1.00 0.99 1102
1 0.98 0.86 0.91 173
accuracy 0.98 1275
macro avg 0.98 0.93 0.95 1275
weighted avg 0.98 0.98 0.98 1275
Matt Clarke, Sunday, October 16, 2022