Although scikit-learn’s machine learning estimator models can be used out-of-the-box with no tuning, you can usually generate further improvements with a little tweaking. Each estimator class accepts arguments called hyper-parameters that allow you to make modifications to the way the model runs. Find the right combination of hyper-parameters, and you can get some nice improvements. These gains are rarely massive (for that you’ll usually need better features, ensemble models and stacking) but every little helps.
However, although there is a science to the configuration of these model hyper-parameters, a brute force technique called grid search is generally applied to help you find the combination, from the values you supply, that brings you the best result. While you could get close to this by adjusting the hyper-parameters by hand, re-running the model and checking your ROC/AUC score, it would be extremely laborious and time-consuming.
Let’s take a look at applying the grid search technique to a real dataset and see what improvements we can generate from the baseline result with an unoptimised default model. For this, we’re going to use the Wisconsin Breast Cancer dataset, which is supplied with scikit-learn. First, we’ll load up the Python packages we need, and then load the data into a Pandas dataframe so we can see what work is required prior to modeling.
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
data = load_breast_cancer()
df = pd.DataFrame(np.c_[data['data'], data['target']],
columns= np.append(data['feature_names'], ['target']))
df.head()
 | mean radius | mean texture | mean perimeter | ... | worst symmetry | worst fractal dimension | target |
---|---|---|---|---|---|---|---|
0 | 17.99 | 10.38 | 122.80 | ... | 0.4601 | 0.11890 | 0.0 |
1 | 20.57 | 17.77 | 132.90 | ... | 0.2750 | 0.08902 | 0.0 |
2 | 19.69 | 21.25 | 130.00 | ... | 0.3613 | 0.08758 | 0.0 |
3 | 11.42 | 20.38 | 77.58 | ... | 0.6638 | 0.17300 | 0.0 |
4 | 20.29 | 14.34 | 135.10 | ... | 0.2364 | 0.07678 | 0.0 |
5 rows × 31 columns
To understand how imbalanced the dataset is, we next use the Pandas value_counts() function to return the number of each value found in the target column. The data aren’t too imbalanced (357 of the 569 rows, about 63%, belong to class 1), so that makes things a little easier.
df['target'].value_counts()
1.0 357
0.0 212
Name: target, dtype: int64
As the data are fine for basic use as they are, we can now reload the data into an X and y dataset by passing the return_X_y=True parameter to the load_breast_cancer() function (as_frame=True keeps them as Pandas objects). Once we have the X and y, we then use the train_test_split() function to divide the data up into X_train and y_train, and X_test and y_test, holding back 30% of the rows for testing.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.30,
random_state=1)
Next, we’ll create a basic model. This can be any of scikit-learn’s models, or any estimator with a scikit-learn compatible API, but I’ve gone for the XGBoost XGBClassifier as it can run on a GPU, so hyper-parameter tuning flies by much faster than it normally would. As you’ll see below, I’ve not passed any hyper-parameters to XGBClassifier() at all, leaving it to run in its default state. If you print model you can see all of the hyper-parameter settings the given model uses.
model = XGBClassifier()
model
XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
colsample_bynode=None, colsample_bytree=None, gamma=None,
gpu_id=None, importance_type='gain', interaction_constraints=None,
learning_rate=None, max_delta_step=None, max_depth=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=None, num_parallel_tree=None,
random_state=None, reg_alpha=None, reg_lambda=None,
scale_pos_weight=None, subsample=None, tree_method=None,
validate_parameters=None, verbosity=None)
To determine the effectiveness of XGBoost on our dataset, we can use repeated stratified k-fold cross-validation. I’ve set this to use 10 folds (or splits) and to repeat the test three times, giving 30 scores in total. The random_state parameter simply makes the results reproducible if you re-run them later, which can save lots of confusion. Then, we use the cross_val_score() function to return the ROC/AUC score for each run. A score of 1.0 is a perfect prediction, so all of the scores we generate are pretty decent.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)
for score in scores:
    print(score)
0.9893333333333334
1.0
1.0
0.9946666666666667
0.9893333333333334
0.96
0.9946666666666667
1.0
0.9857142857142858
1.0
0.992
0.9946666666666667
0.992
0.9973333333333334
1.0
1.0
1.0
0.9840000000000001
0.9971428571428571
0.9472222222222223
0.9893333333333334
0.9946666666666667
0.9946666666666667
0.984
0.9466666666666668
0.9893333333333334
1.0
1.0
1.0
1.0
The mean ROC/AUC score across all the folds was 0.99055, which is not bad for an unoptimised model. Next, we’ll tune the model and see what improvements we can generate.
print('Mean ROC/AUC = ', scores.mean())
Mean ROC/AUC = 0.9905582010582011
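Because the gains from tuning tend to show up in the third decimal place, it can help to look at the spread of the fold scores as well as their mean. The following is a minimal sketch using only the standard library; it retypes the 30 scores printed above (shortened to 10 decimal places), and the fold_scores name is my own.

```python
from statistics import mean, stdev

# The 30 ROC/AUC scores printed above (10 folds x 3 repeats)
fold_scores = [
    0.9893333333, 1.0, 1.0, 0.9946666667, 0.9893333333,
    0.96, 0.9946666667, 1.0, 0.9857142857, 1.0,
    0.992, 0.9946666667, 0.992, 0.9973333333, 1.0,
    1.0, 1.0, 0.984, 0.9971428571, 0.9472222222,
    0.9893333333, 0.9946666667, 0.9946666667, 0.984, 0.9466666667,
    0.9893333333, 1.0, 1.0, 1.0, 1.0,
]

print(f"folds evaluated: {len(fold_scores)}")
print(f"mean ROC/AUC:    {mean(fold_scores):.4f}")
print(f"std deviation:   {stdev(fold_scores):.4f}")
print(f"worst / best:    {min(fold_scores):.4f} / {max(fold_scores):.4f}")
```

If the standard deviation turns out to be of a similar size to the improvement you get from tuning, it is probably worth treating that improvement with some caution.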
There are two main approaches to tuning your model’s hyper-parameters in scikit-learn, GridSearchCV and RandomizedSearchCV, of which GridSearchCV is the most commonly seen. When you provide a grid of parameter values, GridSearchCV exhaustively tests every combination of parameters and reports the one which yields the best score.
Obviously, this means that the more parameters you provide, the longer it takes to complete. With XGBoost, one useful trick is to configure the model to run on your GPU instead of your CPU as this reduces the run time dramatically. On large datasets with lots of parameters, it could easily take hours or days for a GridSearchCV task to run.
Every scikit-learn model has a load of different hyper-parameters you can tweak to improve your model’s performance. While there are some standard parameters across models, many are specific to the model you’re using. To find which hyper-parameters your model supports, and how they are currently configured, we can print model.get_params() to see a full list.
model.get_params()
{'objective': 'binary:logistic',
'base_score': None,
'booster': None,
'colsample_bylevel': None,
'colsample_bynode': None,
'colsample_bytree': None,
'gamma': None,
'gpu_id': None,
'importance_type': 'gain',
'interaction_constraints': None,
'learning_rate': None,
'max_delta_step': None,
'max_depth': None,
'min_child_weight': None,
'missing': nan,
'monotone_constraints': None,
'n_estimators': 100,
'n_jobs': None,
'num_parallel_tree': None,
'random_state': None,
'reg_alpha': None,
'reg_lambda': None,
'scale_pos_weight': None,
'subsample': None,
'tree_method': None,
'validate_parameters': None,
'verbosity': None}
While you could blindly pass a load of different values and brute force the right combination, it is probably going to save you time if you check the documentation for the model you’re using to find out which hyper-parameters it has and what they do. Then you can simply configure the ones that matter to your dataset.
We’re going to configure a few of these - at random, just to show how this works - with a range of values to see what works best. Simply create a variable for each hyper-parameter and assign it a Python list of values, then create a param_grid Python dictionary containing each labelled hyper-parameter.
There are two techniques you can use here: you can do each hyper-parameter on its own, and then save the value, or you can do the whole lot at once. If you have a massive dataset then doing it in steps may be your only option, but I find doing all of the parameters together seems to give better results - and it’s easy if you leave it running overnight.
The other thing you can do to get incremental improvements is find the value that represents an improvement then bracket either side. For example, if you find that a 2 gives a good score on a given hyper-parameter, try again with [1, 2, 3] and see if you get an improvement.
colsample_bytree = [0.3, 0.5, 1.0]
gamma = [0.1, 1, 1.5]
learning_rate = [0.001, 0.01]
min_child_weight = [1, 5, 10]
scale_pos_weight = [1, 2, 4]
subsample = [0.8, 0.9, 1.0]
n_estimators = [50, 100, 150]
max_depth = [5, 10]
param_grid = dict(
colsample_bytree=colsample_bytree,
gamma=gamma,
learning_rate=learning_rate,
min_child_weight=min_child_weight,
scale_pos_weight=scale_pos_weight,
subsample=subsample,
n_estimators=n_estimators,
max_depth=max_depth,
)
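Before starting a search, it can be worth counting how many models the grid implies, since the total is the product of the list lengths. Here is a small sketch using only the standard library; the grid is retyped from above, and the five folds are an assumption based on scikit-learn’s default when no cv argument is passed to GridSearchCV.

```python
from math import prod

# Same values as the param_grid above, written as a dict literal
param_grid = {
    "colsample_bytree": [0.3, 0.5, 1.0],
    "gamma": [0.1, 1, 1.5],
    "learning_rate": [0.001, 0.01],
    "min_child_weight": [1, 5, 10],
    "scale_pos_weight": [1, 2, 4],
    "subsample": [0.8, 0.9, 1.0],
    "n_estimators": [50, 100, 150],
    "max_depth": [5, 10],
}

# Every value of every hyper-parameter is combined with every other
n_combinations = prod(len(values) for values in param_grid.values())

# Assumption: 5-fold cross-validation, scikit-learn's default for GridSearchCV
cv_folds = 5
n_fits = n_combinations * cv_folds

print(n_combinations)  # 2916
print(n_fits)          # 14580
```

Adding just one more hyper-parameter with three values would triple both numbers, which is why exhaustive grids get slow so quickly.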
Once you have your param_grid the next step is to run GridSearchCV() on your model, pass in the parameters to test and define how you’ll determine what is “best”. We’re going to use ROC/AUC again. If you run this, GridSearchCV will now test all of the combinations in your param_grid and return the details of the combination which yields the highest ROC/AUC score.
model = XGBClassifier(random_state=1, verbosity=1)
grid_search = GridSearchCV(estimator=model,
param_grid=param_grid,
scoring='roc_auc',
)
best_model = grid_search.fit(X_train, y_train)
print('Optimum parameters', best_model.best_params_)
Optimum parameters {'colsample_bytree': 0.3, 'gamma': 0.1, 'learning_rate': 0.01, 'max_depth': 5, 'min_child_weight': 1, 'n_estimators': 150, 'scale_pos_weight': 1, 'subsample': 0.8}
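Rather than retyping the printed values, the best_params_ dictionary can be unpacked straight into a new estimator, which avoids transcription slips. This is a minimal sketch, assuming scikit-learn is installed. It uses DecisionTreeClassifier and a deliberately tiny grid (small_grid is a made-up example) as a stand-in so that it runs in seconds without XGBoost, but the same pattern should work with XGBClassifier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Hypothetical small grid, chosen only to keep the example fast
small_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid=small_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

# Unpack the winning combination instead of copying values by hand
best_params = search.best_params_
tuned = DecisionTreeClassifier(random_state=1, **best_params)

print(best_params)
print(round(search.best_score_, 4))
```

With the default refit=True, search.best_estimator_ is also already refitted on the whole training set with these parameters, so you can use that directly if you don’t need a fresh, unfitted model.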
Now we have some tuned hyper-parameters, we can pass them to a model and re-train it, and then compare the k-fold cross-validation score with the one we generated with the default parameters. Note that the values in the code below were typed in by hand and differ from the printed optimum for gamma, n_estimators, scale_pos_weight and subsample, so the score is for the model as written rather than for the exact optimum. Our very quick and dirty tune up has given us a bit of an extra boost, with the ROC/AUC score increasing from 0.9905 to 0.9928. This might not look like much, but we already had a good score. To get further improvement from this method you can bracket around the values again, try new parameters and keep tweaking until you get a further improvement.
tuned_model = XGBClassifier(random_state=1,
colsample_bytree=0.3,
gamma=1,
learning_rate=0.01,
max_depth=5,
min_child_weight=1,
n_estimators=100,
scale_pos_weight=2,
subsample=0.9)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(tuned_model, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC/AUC = ', scores.mean())
Mean ROC/AUC = 0.9928222222222224
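The bracketing idea mentioned earlier can be sketched as a tiny helper that builds the next list of values to try around the current best one. The bracket function and its minimum argument are hypothetical names of my own, not part of scikit-learn or XGBoost.

```python
def bracket(best_value, step, minimum=None):
    """Return [best - step, best, best + step], dropping values below `minimum`."""
    # Round to 6 decimal places to tidy up floating point noise such as 0.19999999999999998
    candidates = [
        round(best_value - step, 6),
        round(best_value, 6),
        round(best_value + step, 6),
    ]
    if minimum is not None:
        candidates = [value for value in candidates if value >= minimum]
    return candidates


print(bracket(2, 1))                    # [1, 2, 3]
print(bracket(5, 1, minimum=1))         # [4, 5, 6]
print(bracket(1, 1, minimum=1))         # [1, 2]
print(bracket(0.3, 0.1, minimum=0.0))   # [0.2, 0.3, 0.4]
```

You could then drop the returned list into the param_grid in place of the original values for that hyper-parameter and run the search again.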
The other way of searching is to use RandomizedSearchCV instead of GridSearchCV. The main issue with GridSearchCV is that it tries every combination, so it can be really, really slow when you provide lots of parameters to test and have a large dataset. RandomizedSearchCV works in a different way. Instead of trying every option, it tries a fixed number of randomly sampled combinations (10 by default, set with n_iter) from a param_distributions dictionary instead of a param_grid. This is way faster and runs in a split second on this dataset.
colsample_bytree = [0.1, 0.3, 0.5, 1.0]
gamma = [0, 0.1, 1]
learning_rate = [0.001, 0.05, 0.08, 0.1]
min_child_weight = [1, 5, 10, 20]
scale_pos_weight = [0.5, 1, 2, 4, 6]
subsample = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
n_estimators = [25, 50, 100, 150]
max_depth = [3, 5, 10, 20, 40, 100]
param_distributions = dict(
colsample_bytree=colsample_bytree,
gamma=gamma,
learning_rate=learning_rate,
min_child_weight=min_child_weight,
scale_pos_weight=scale_pos_weight,
subsample=subsample,
n_estimators=n_estimators,
max_depth=max_depth,
)
model = XGBClassifier(random_state=1, verbosity=1)
grid_search = RandomizedSearchCV(estimator=model,
param_distributions=param_distributions,
scoring='roc_auc',
)
best_model = grid_search.fit(X_train, y_train)
print('Optimum parameters', best_model.best_params_)
Optimum parameters {'subsample': 0.6, 'scale_pos_weight': 4, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 20, 'learning_rate': 0.05, 'gamma': 1, 'colsample_bytree': 1.0}
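To get a feel for how small the sample is compared with the full grid, you can draw the candidates yourself. This sketch assumes scikit-learn is installed and uses its ParameterGrid and ParameterSampler helpers; the dictionary is retyped from above, and n_iter=10 and random_state=1 are my own choices for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same values as the param_distributions above, written as a dict literal
param_distributions = {
    "colsample_bytree": [0.1, 0.3, 0.5, 1.0],
    "gamma": [0, 0.1, 1],
    "learning_rate": [0.001, 0.05, 0.08, 0.1],
    "min_child_weight": [1, 5, 10, 20],
    "scale_pos_weight": [0.5, 1, 2, 4, 6],
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "n_estimators": [25, 50, 100, 150],
    "max_depth": [3, 5, 10, 20, 40, 100],
}

# Number of combinations an exhaustive grid search would have to try
total = len(ParameterGrid(param_distributions))

# Ten randomly drawn combinations; random_state makes the draw repeatable
sampled = list(ParameterSampler(param_distributions, n_iter=10, random_state=1))

print(total)         # 138240
print(len(sampled))  # 10
print(sampled[0])
```

Since the search above doesn’t pass a random_state to RandomizedSearchCV itself (only to the model), a re-run may sample different candidates and print a different optimum.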
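Finally, we will take the values from RandomizedSearchCV and re-run our cross-fold validation to see what improvements the new parameters bring (the code below leaves gamma and scale_pos_weight at their defaults rather than the sampled values of 1 and 4). This gives us an extra boost and increases the score to 0.9931.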
tuned_model = XGBClassifier(random_state=1,
colsample_bytree=1,
learning_rate=0.05,
max_depth=20,
min_child_weight=1,
n_estimators=100,
subsample=0.6)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(tuned_model, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC/AUC = ', scores.mean())
Mean ROC/AUC = 0.9931507936507937
If you use a specific model frequently, you might want to create a param_distributions dictionary containing a wide range of all possible hyper-parameter values and then wrap up the code above in a function. This will allow you to quickly tune a model and find out which hyper-parameters need further tweaking. You can then do your fine-tuning with GridSearchCV.
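One way such a wrapper might look is sketched below, assuming scikit-learn is installed. The random_search_tune function, its arguments and the tree_distributions dictionary are hypothetical names of my own, and DecisionTreeClassifier stands in for XGBClassifier so the example runs quickly without XGBoost.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier


def random_search_tune(model, param_distributions, X, y,
                       n_iter=10, scoring="roc_auc", random_state=1):
    """Run RandomizedSearchCV and return (best_params, best_score)."""
    search = RandomizedSearchCV(
        estimator=model,
        param_distributions=param_distributions,
        n_iter=n_iter,              # number of sampled combinations to try
        scoring=scoring,
        random_state=random_state,  # makes the sampled candidates repeatable
    )
    search.fit(X, y)
    return search.best_params_, search.best_score_


X, y = load_breast_cancer(return_X_y=True)

# Made-up distributions for the stand-in model
tree_distributions = {
    "max_depth": [2, 3, 5, 10],
    "min_samples_leaf": [1, 5, 10, 20],
    "criterion": ["gini", "entropy"],
}

best_params, best_score = random_search_tune(
    DecisionTreeClassifier(random_state=1), tree_distributions, X, y
)
print(best_params, round(best_score, 4))
```

Swapping in XGBClassifier and the param_distributions dictionary from earlier should be all that’s needed to reuse it for XGBoost.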
One common parameter you may need to tune is the scale_pos_weight in XGBoost. This is of particular use on imbalanced datasets and can be calculated by dividing the number of negative examples by the number of positive examples. Oddly, it’s not always the exact value which gives the best result, so “bracketing” and using a value either side can be of use. Here’s a little function I knocked up to do this.
import math


def get_scale_pos_weight(target, square_root=False, gridsearch=False):
    """Return the scale_pos_weight parameter for the XGBoost model when data are imbalanced.

    The scale_pos_weight parameter is calculated from the ratio of the negative class over
    the positive class. The exact scale_pos_weight sometimes does not give the best result,
    so by passing the gridsearch=True parameter you can return a list of values to test with
    GridSearchCV. In addition, passing square_root=True changes the scale_pos_weight to the
    square root value, which can sometimes be beneficial on extremely imbalanced data.

    :param target: Pandas dataframe column containing the binary target
    :param square_root: Optional boolean parameter to convert to square root on extremely unbalanced data
    :param gridsearch: Optional boolean parameter to return a bracketed list for use in GridSearchCV

    Usage:
        scale_pos_weight = get_scale_pos_weight(df['target'], square_root=False, gridsearch=True)
    """
    # Negative (0) examples divided by positive (1) examples, rounded to the nearest integer
    positives = sum(target)
    scale_pos_weight = round((len(target) - positives) / positives)
    if square_root:
        scale_pos_weight = round(math.sqrt(scale_pos_weight))
    if gridsearch:
        # Bracket two steps either side, dropping zero and negative candidates
        candidates = range(scale_pos_weight - 2, scale_pos_weight + 3)
        scale_pos_weight = [value for value in candidates if value > 0]
    return scale_pos_weight
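To make the arithmetic concrete, here is the ratio worked through with the class counts from value_counts() earlier (212 negatives, 357 positives) and with a made-up, much more imbalanced example. The negative_to_positive_ratio helper is a hypothetical name used only for this sketch.

```python
import math


def negative_to_positive_ratio(n_negative, n_positive):
    """Number of negative examples divided by the number of positive examples."""
    return n_negative / n_positive


# Breast cancer data: 212 rows with target 0, 357 rows with target 1
ratio = negative_to_positive_ratio(212, 357)
print(round(ratio, 3))  # 0.594
print(round(ratio))     # 1

# Made-up counts for a heavily imbalanced dataset
imbalanced = negative_to_positive_ratio(900, 100)
print(imbalanced)                    # 9.0
print(round(math.sqrt(imbalanced)))  # 3
```

On this dataset the rounded weight is 1, which is XGBoost’s default, so it makes little difference here. It also shows why a bracketed list needs care: subtracting 2 from a weight of 1 gives values of zero or below, which aren’t meaningful weights.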
If you’re looking for a faster solution to hyper-parameter tuning, do check out Optuna. It’s a hyperparameter optimization framework that can tune XGBoost and other models in just a few lines of code.
Matt Clarke, Tuesday, March 02, 2021