How to tune model hyper-parameters with grid search

Although scikit-learn’s machine learning estimators can be used out of the box with no tuning, you can usually squeeze out further improvements with a little tweaking. Each estimator class accepts arguments called hyper-parameters that let you modify the way the model runs. Find the right combination of hyper-parameters, and you can get some nice improvements. These gains are rarely massive (for that you’ll usually need better features, ensemble models and stacking) but every little helps.

However, although there is a science to configuring these hyper-parameters, a brute force technique called grid search is generally used to find the combination that gives the best result from the values you supply. While you could get close to this by adjusting the hyper-parameters by hand, re-running the model and checking your ROC/AUC score each time, it would be extremely laborious and time consuming.

A practical example

Let’s take a look at applying the grid search technique to a real dataset and see what improvements we can generate over the baseline result from an unoptimised default model. For this, we’re going to use the Wisconsin Breast Cancer dataset, which is supplied with scikit-learn. First, we’ll load up the Python packages we need, and then load the data into a Pandas dataframe so we can see what work is required prior to modeling.

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

data = load_breast_cancer()
df = pd.DataFrame(np.c_[data['data'], data['target']],
                  columns= np.append(data['feature_names'], ['target']))
df.head()

	mean radius	mean texture	mean perimeter	...	worst symmetry	worst fractal dimension
0	17.99	10.38	122.80	...	0.4601	0.11890
1	20.57	17.77	132.90	...	0.2750	0.08902
2	19.69	21.25	130.00	...	0.3613	0.08758
3	11.42	20.38	77.58	...	0.6638	0.17300
4	20.29	14.34	135.10	...	0.2364	0.07678

5 rows × 31 columns

To understand how imbalanced the dataset is, we next use the Pandas value_counts() function to return the number of each value found in the target column. The data aren’t too imbalanced, so that makes things a little easier.

df['target'].value_counts()

1.0    357
0.0    212
Name: target, dtype: int64
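As a rough check on how imbalanced the classes are, the counts above can be turned into proportions. This is only a minimal sketch: it hard-codes the two counts printed above rather than re-reading the dataframe, and the variable names are my own.

```python
# Class counts copied from the value_counts() output above
class_counts = {1.0: 357, 0.0: 212}

total = sum(class_counts.values())

# Share of each class in the whole dataset
proportions = {label: count / total for label, count in class_counts.items()}

for label, share in proportions.items():
    print(f"class {label}: {share:.1%}")
```

On the dataframe itself, df['target'].value_counts(normalize=True) should give the same proportions directly.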

Prepare the data for modeling

As the data are fine for basic use as they are, we can now reload them as separate X and y objects by passing return_X_y=True to the load_breast_cancer() function. Adding as_frame=True returns them as a Pandas dataframe and series rather than NumPy arrays. Once we have X and y, we use the train_test_split() function to divide the data into X_train and y_train for training and X_test and y_test for testing, holding back 30% of the rows for the test set.

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.30, 
                                                    random_state=1)
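Because the classes are not perfectly balanced, one option (not used in the code above) is to pass stratify=y so that both splits keep roughly the same class ratio. A minimal sketch, assuming scikit-learn and Pandas are installed. Note that it produces a different split from the one used in the rest of this article, so scores would not match exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# stratify=y keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

print(len(X_train), len(X_test))

# The mean of a 0/1 target is the share of the positive class
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```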

Create a basic model

Next, we’ll create a basic model. This can be any scikit-learn compatible model, but I’ve gone for XGBClassifier from the separate XGBoost package, as it can run on a GPU so hyper-parameter tuning flies by much faster than it normally would. As you’ll see below, I’ve not passed any hyper-parameters to XGBClassifier() at all, leaving it to run in its default state. If you print model you can see all of the hyper-parameter settings the given model uses.

model = XGBClassifier()
model

XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None, gamma=None,
              gpu_id=None, importance_type='gain', interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              random_state=None, reg_alpha=None, reg_lambda=None,
              scale_pos_weight=None, subsample=None, tree_method=None,
              validate_parameters=None, verbosity=None)

To determine the effectiveness of XGBoost on our dataset, we can use K fold cross validation. I’ve set this to use 10 folds (or splits) and to repeat the test three times. The random_state parameter simply makes the results reproducible if you re-run them later, which can save lots of confusion. Then, we use the cross_val_score() function to return the ROC/AUC score for each run. A score of 1.0 is a perfect prediction, so all of the scores we generate are pretty decent.

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)

for score in scores:
    print(score)

0.9893333333333334
1.0
1.0
0.9946666666666667
0.9893333333333334
0.96
0.9946666666666667
1.0
0.9857142857142858
1.0
0.992
0.9946666666666667
0.992
0.9973333333333334
1.0
1.0
1.0
0.9840000000000001
0.9971428571428571
0.9472222222222223
0.9893333333333334
0.9946666666666667
0.9946666666666667
0.984
0.9466666666666668
0.9893333333333334
1.0
1.0
1.0
1.0

The mean ROC/AUC score across all the folds was 0.9906, which is not bad for an unoptimised model. Next, we’ll tune the model and see what improvements we can generate.

print('Mean ROC/AUC = ', scores.mean())

Mean ROC/AUC =  0.9905582010582011
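The mean on its own hides how much the folds vary, so it can help to look at the spread as well. The sketch below is just an illustration using the 30 fold scores from the run above, rounded to four decimal places and typed in by hand, with Python's statistics module standing in for the NumPy array returned by cross_val_score().

```python
from statistics import mean, stdev

# The 30 fold scores from the run above, rounded to four decimal places
scores = [
    0.9893, 1.0, 1.0, 0.9947, 0.9893, 0.96, 0.9947, 1.0, 0.9857, 1.0,
    0.992, 0.9947, 0.992, 0.9973, 1.0, 1.0, 1.0, 0.984, 0.9971, 0.9472,
    0.9893, 0.9947, 0.9947, 0.984, 0.9467, 0.9893, 1.0, 1.0, 1.0, 1.0,
]

print(f"Mean ROC/AUC  = {mean(scores):.4f}")
print(f"Std deviation = {stdev(scores):.4f}")
print(f"Lowest fold   = {min(scores):.4f}")
```

A small gain from tuning is only convincing if it is not swamped by this fold-to-fold variation.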

Tuning your model hyper-parameters

There are two main approaches to tuning your model’s hyper-parameters with a search in scikit-learn, GridSearchCV and RandomizedSearchCV, and GridSearchCV is the most commonly seen. When you provide a grid of parameter values, GridSearchCV exhaustively tests every combination and reports the one that yields the best score.
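To make the idea of an exhaustive search concrete, here is a toy version of the loop in plain Python. It is a sketch of the general technique rather than how scikit-learn implements it, and toy_score() is a made-up stand-in for a cross-validated score.

```python
from itertools import product

# Toy grid: two hyper-parameters with a handful of candidate values each
param_grid = {"max_depth": [3, 5, 10], "learning_rate": [0.01, 0.1]}


def toy_score(params):
    """Hypothetical stand-in for a cross-validated score (higher is better)."""
    return -abs(params["max_depth"] - 5) - abs(params["learning_rate"] - 0.1)


best_params, best_score = None, float("-inf")

# Try every combination of values, keeping the best one seen so far
names = list(param_grid)
for values in product(*param_grid.values()):
    params = dict(zip(names, values))
    score = toy_score(params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params)
```

In the real thing, each call to the scoring function means fitting and evaluating the model once per fold, which is where the time goes.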

Obviously, this means that the more parameters you provide, the longer it takes to complete. With XGBoost, one useful trick is to configure the model to run on your GPU instead of your CPU as this reduces the run time dramatically. On large datasets with lots of parameters, it could easily take hours or days for a GridSearchCV task to run.

Identifying hyper-parameters

Every scikit-learn model has a load of different hyper-parameters you can tweak to improve your model’s performance. While there are some standard parameters across models, many are specific to the model you’re using. To find which hyper-parameters your model supports, and how they are currently configured, we can print model.get_params() to see a full list.

model.get_params()

{'objective': 'binary:logistic',
 'base_score': None,
 'booster': None,
 'colsample_bylevel': None,
 'colsample_bynode': None,
 'colsample_bytree': None,
 'gamma': None,
 'gpu_id': None,
 'importance_type': 'gain',
 'interaction_constraints': None,
 'learning_rate': None,
 'max_delta_step': None,
 'max_depth': None,
 'min_child_weight': None,
 'missing': nan,
 'monotone_constraints': None,
 'n_estimators': 100,
 'n_jobs': None,
 'num_parallel_tree': None,
 'random_state': None,
 'reg_alpha': None,
 'reg_lambda': None,
 'scale_pos_weight': None,
 'subsample': None,
 'tree_method': None,
 'validate_parameters': None,
 'verbosity': None}
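The names returned by get_params() are the names a search will accept, so a quick check against them can catch typos before a long run. A minimal sketch, using a scikit-learn decision tree as a stand-in for XGBClassifier so it runs without XGBoost installed. The misspelled grid is invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# A scikit-learn decision tree stands in for XGBClassifier here
model = DecisionTreeClassifier()
valid_names = set(model.get_params())

# Hypothetical grid containing one misspelled name
param_grid = {"max_depth": [3, 5], "min_sample_leaf": [1, 5]}

# Any name not reported by get_params() would be rejected by the estimator
unknown = sorted(name for name in param_grid if name not in valid_names)
print("Unknown hyper-parameters:", unknown)
```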

Creating a param grid

While you could blindly pass a load of different values and brute force the right combination, it is probably going to save you time if you check the documentation for the model you’re using to find out what hyper-parameters there are and what they do. Then you can simply configure the ones that matter to your dataset.

We’re going to configure a few of these - at random, just to show how this works - with a range of values to see what works best. Simply create a variable for each hyper-parameter and assign it a Python list of values, then create a param_grid Python dictionary containing each labelled hyper-parameter.

There are two techniques you can use here: you can do each hyper-parameter on its own, and then save the value, or you can do the whole lot at once. If you have a massive dataset then doing it in steps may be your only option, but I find doing all of the parameters together seems to give better results - and it’s easy if you leave it running overnight.

The other thing you can do to get incremental improvements is find the value that represents an improvement then bracket either side. For example, if you find that a 2 gives a good score on a given hyper-parameter, try again with [1, 2, 3] and see if you get an improvement.
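The bracketing idea can be written as a tiny helper. This is a sketch only: bracket() and its step and minimum arguments are names I have made up, not part of scikit-learn or XGBoost.

```python
def bracket(best_value, step=1, minimum=None):
    """Hypothetical helper: return candidate values on either side of best_value."""
    candidates = [best_value - step, best_value, best_value + step]
    if minimum is not None:
        # Drop values below the smallest setting the hyper-parameter allows
        candidates = [value for value in candidates if value >= minimum]
    return candidates


print(bracket(2))
print(bracket(1, minimum=1))
```

The returned list can go straight into a param_grid for the next, narrower search.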

colsample_bytree = [0.3, 0.5, 1.0]
gamma = [0.1, 1, 1.5]
learning_rate = [0.001, 0.01]
min_child_weight = [1, 5, 10]
scale_pos_weight = [1, 2, 4]
subsample = [0.8, 0.9, 1.0]
n_estimators = [50, 100, 150]
max_depth = [5, 10]

param_grid = dict(
    colsample_bytree=colsample_bytree,
    gamma=gamma,
    learning_rate=learning_rate,
    min_child_weight=min_child_weight,
    scale_pos_weight=scale_pos_weight,
    subsample=subsample,
    n_estimators=n_estimators,
    max_depth=max_depth,
)
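Before launching the search, it is worth working out how many model fits the grid implies. The sketch below repeats the grid above as a plain dictionary and multiplies out the list lengths. The only assumption is that GridSearchCV is left on its default of 5-fold cross-validation.

```python
from math import prod

param_grid = {
    "colsample_bytree": [0.3, 0.5, 1.0],
    "gamma": [0.1, 1, 1.5],
    "learning_rate": [0.001, 0.01],
    "min_child_weight": [1, 5, 10],
    "scale_pos_weight": [1, 2, 4],
    "subsample": [0.8, 0.9, 1.0],
    "n_estimators": [50, 100, 150],
    "max_depth": [5, 10],
}

# Number of distinct hyper-parameter combinations in the grid
n_combinations = prod(len(values) for values in param_grid.values())

# GridSearchCV uses 5-fold cross-validation when cv is not specified
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)
```

That works out at 2,916 combinations and 14,580 fits, which shows how quickly a modest-looking grid grows.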

Once you have your param_grid, the next step is to pass your model and the parameters to GridSearchCV() and define how you’ll determine what is “best” via the scoring argument. We’re going to use ROC/AUC again. If you run this, GridSearchCV will test every combination in your param_grid, using 5-fold cross-validation by default as no cv argument is passed, and return the details of the combination that yields the highest ROC/AUC score.

model = XGBClassifier(random_state=1, verbosity=1)

grid_search = GridSearchCV(estimator=model,
                           param_grid=param_grid,
                           scoring='roc_auc',
                           )
best_model = grid_search.fit(X_train, y_train)
print('Optimum parameters', best_model.best_params_)

Optimum parameters {'colsample_bytree': 0.3, 'gamma': 0.1, 'learning_rate': 0.01, 'max_depth': 5, 'min_child_weight': 1, 'n_estimators': 150, 'scale_pos_weight': 1, 'subsample': 0.8}
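Besides best_params_, a fitted search also exposes the best mean score and an already refitted model. A minimal sketch, again using a small scikit-learn decision tree in place of XGBClassifier and a deliberately tiny grid, purely so that it runs quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A small decision tree stands in for XGBClassifier to keep the sketch quick
grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [2, 4], "min_samples_leaf": [1, 5]},
    scoring="roc_auc",
)
grid_search.fit(X, y)

print("Best parameters:", grid_search.best_params_)
print("Best mean CV score:", grid_search.best_score_)

# With the default refit=True, best_estimator_ has already been refitted on
# all of the data passed to fit() using the best parameters
tuned_model = grid_search.best_estimator_
```

Using best_estimator_, or unpacking best_params_ into a new model, avoids copying values across by hand.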

Test the tuned model

Now that we have some tuned hyper-parameters, we can pass them to a model and re-train it, and then compare the K fold cross validation score with the one we generated with the default parameters. Our very quick and dirty tune up has given us a bit of an extra boost, with the ROC/AUC score increasing from 0.9906 to 0.9928. This might not look like much, but we already had a good score. To get further improvement from this method you can bracket around the values again, try new parameters and keep tweaking until you get a further improvement.

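# Note: gamma, n_estimators, scale_pos_weight and subsample below differ from
# the optimum parameters printed above
tuned_model = XGBClassifier(random_state=1, 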
                            colsample_bytree=0.3, 
                            gamma=1, 
                            learning_rate=0.01, 
                            max_depth=5, 
                            min_child_weight=1,
                            n_estimators=100,
                            scale_pos_weight=2, 
                            subsample=0.9)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(tuned_model, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC/AUC = ', scores.mean())

Mean ROC/AUC =  0.9928222222222224

Randomized search

The other way of performing a hyper-parameter search is to use RandomizedSearchCV instead of GridSearchCV. The main issue with GridSearchCV is that it tries every combination, so it can be really, really slow when you provide lots of parameters to test and have a large dataset. RandomizedSearchCV works in a different way. Instead of trying every option, it tries only a random sample of combinations (10 by default, set with n_iter) drawn from a param_distributions dictionary instead of a param_grid. This is way faster and runs in a split second on this dataset. Because the sample is random, the result can change between runs unless you pass random_state.

colsample_bytree = [0.1, 0.3, 0.5, 1.0]
gamma = [0, 0.1, 1]
learning_rate = [0.001, 0.05, 0.08, 0.1]
min_child_weight = [1, 5, 10, 20]
scale_pos_weight = [0.5, 1, 2, 4, 6]
subsample = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
n_estimators = [25, 50, 100, 150]
max_depth = [3, 5, 10, 20, 40, 100]

param_distributions = dict(
    colsample_bytree=colsample_bytree,
    gamma=gamma,
    learning_rate=learning_rate,
    min_child_weight=min_child_weight,
    scale_pos_weight=scale_pos_weight,
    subsample=subsample,
    n_estimators=n_estimators,
    max_depth=max_depth,
)

model = XGBClassifier(random_state=1, verbosity=1)

grid_search = RandomizedSearchCV(estimator=model,
                                 param_distributions=param_distributions,
                                 scoring='roc_auc',
                                )
best_model = grid_search.fit(X_train, y_train)
print('Optimum parameters', best_model.best_params_)

Optimum parameters {'subsample': 0.6, 'scale_pos_weight': 4, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 20, 'learning_rate': 0.05, 'gamma': 1, 'colsample_bytree': 1.0}
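To see why this is so much quicker, compare the size of the full grid with the handful of candidates actually tried. The sketch below mimics the sampling step in plain Python. It illustrates the idea rather than scikit-learn's own code, and combination_at() is a helper I have made up.

```python
import random
from math import prod

param_distributions = {
    "colsample_bytree": [0.1, 0.3, 0.5, 1.0],
    "gamma": [0, 0.1, 1],
    "learning_rate": [0.001, 0.05, 0.08, 0.1],
    "min_child_weight": [1, 5, 10, 20],
    "scale_pos_weight": [0.5, 1, 2, 4, 6],
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "n_estimators": [25, 50, 100, 150],
    "max_depth": [3, 5, 10, 20, 40, 100],
}

names = list(param_distributions)
total_combinations = prod(len(param_distributions[name]) for name in names)


def combination_at(index):
    """Hypothetical helper: decode a flat index into one combination of values."""
    params = {}
    for name in names:
        values = param_distributions[name]
        index, position = divmod(index, len(values))
        params[name] = values[position]
    return params


n_iter = 10  # RandomizedSearchCV's default number of sampled candidates
rng = random.Random(1)

# Pick n_iter distinct combinations instead of evaluating all of them
sampled = [combination_at(i) for i in rng.sample(range(total_combinations), n_iter)]

print(f"{n_iter} of {total_combinations} combinations sampled")
```

With these lists there are 138,240 possible combinations, so the default search looks at only a tiny fraction of them. Raising n_iter trades speed for coverage.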

Finally, we will take the values from RandomizedSearchCV and re-run our K fold cross validation to see what improvements the new parameters bring. This gives us an extra boost and increases the score to 0.9931.

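# Note: gamma and scale_pos_weight from the optimum printed above are not
# passed here, so they stay at their defaults
tuned_model = XGBClassifier(random_state=1, 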
                            colsample_bytree=1, 
                            learning_rate=0.05, 
                            max_depth=20, 
                            min_child_weight=1,
                            n_estimators=100,
                            subsample=0.6)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(tuned_model, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC/AUC = ', scores.mean())

Mean ROC/AUC =  0.9931507936507937

If you use a specific model frequently, you might want to create a param_distributions dictionary containing a wide range of plausible hyper-parameter values and then wrap up the code above in a function. This will allow you to quickly tune a model and find out which hyper-parameters need further tweaking. You can then do your fine-tuning with GridSearchCV.
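One way such a wrapper might look is sketched below. quick_tune() is a hypothetical name, and a scikit-learn decision tree with a made-up parameter dictionary stands in for XGBClassifier so the example runs quickly. Any scikit-learn compatible classifier should work in its place.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier


def quick_tune(model, param_distributions, X, y, n_iter=10, random_state=1):
    """Hypothetical helper: run a randomized search and return the fitted search."""
    search = RandomizedSearchCV(
        estimator=model,
        param_distributions=param_distributions,
        n_iter=n_iter,
        scoring="roc_auc",
        random_state=random_state,  # makes the sampled candidates repeatable
    )
    return search.fit(X, y)


X, y = load_breast_cancer(return_X_y=True)

search = quick_tune(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [2, 4, 8, 16], "min_samples_leaf": [1, 2, 5, 10]},
    X,
    y,
    n_iter=5,
)

print(search.best_params_, search.best_score_)
```

The values in best_params_ can then be bracketed and handed to GridSearchCV for the fine-tuning pass.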

Tuning the XGBoost scale_pos_weight parameter

One common parameter you may need to tune is scale_pos_weight in XGBoost. This is of particular use on imbalanced datasets and can be calculated as the number of negative-class rows divided by the number of positive-class rows. Oddly, it’s not always the exact value which gives the best result, so “bracketing” and trying a value either side can be of use. Here’s a little function I knocked up to do this.

def get_scale_pos_weight(target, square_root=False, gridsearch=False):
    """Return the scale_pos_weight parameter for the XGBoost model when data are imbalanced.
    The scale_pos_weight parameter is calculated from the ratio of the negative class over
    the positive class. The exact scale_pos_weight sometimes does not give the best result,
    so by passing the gridsearch=True parameter you can return a list of values to test with
    GridSearchCV. In addition, passing square_root=True changes the scale_pos_weight to the
    square root value, which can sometimes be beneficial on extremely imbalanced data.
    
    :param target: Pandas dataframe column containing the binary target
    :param square_root: Optional boolean parameter to convert to square root on extremely unbalanced data
    :param gridsearch: Optional boolean parameter to return a bracketed list for use in GridSearchCV
    
    Usage:
        scale_pos_weight = get_scale_pos_weight(df['target'], square_root=False, gridsearch=True)
        
    """
    
    import math
    
    scale_pos_weight = round((len(target) - sum(target)) / sum(target))
    
    if square_root:
        scale_pos_weight = round(math.sqrt(scale_pos_weight))
    
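    if gridsearch:
        # Bracket the value, keeping only positive weights so that small ratios
        # don't produce zero or negative candidates
        candidates = range(scale_pos_weight - 2, scale_pos_weight + 3)
        scale_pos_weight = [value for value in candidates if value > 0]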
    
    return scale_pos_weight
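To make the arithmetic behind the function concrete, here is the ratio worked through by hand. The first pair of counts comes from the breast cancer data used earlier. The second pair is an invented, heavily imbalanced example added purely for comparison.

```python
import math

# Class counts from the breast cancer dataset shown earlier
negative, positive = 212, 357
ratio = negative / positive
print(round(ratio, 3), round(ratio))

# A hypothetical, heavily imbalanced dataset for comparison
negative, positive = 9900, 100
imbalanced_ratio = negative / positive
sqrt_weight = round(math.sqrt(imbalanced_ratio))
print(imbalanced_ratio, sqrt_weight)
```

On the breast cancer data the ratio is below 1 and rounds to 1, so there is little to gain from the parameter there. On the invented imbalanced data the raw ratio is 99 and the square root version is about 10.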

If you’re looking for a faster solution to hyperparameter tuning, do check out Optuna. It’s a hyperparameter optimization framework that can be used to tune XGBoost and other models. It’s also very easy to use and can be used to tune XGBoost in just a few lines of code.

Matt Clarke, Tuesday, March 02, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.