Although scikit-learn’s machine learning estimators can be used out of the box with no tuning, you can usually generate further improvements with a little tweaking. Each estimator class accepts arguments called hyper-parameters that allow you to modify the way the model runs. Find the right combination of hyper-parameters, and you can get some nice improvements. These gains are rarely massive (for that you’ll usually need better features, ensemble models and stacking) but every little helps.

However, although there is a science to configuring these model hyper-parameters, a brute force technique called grid search is generally applied to help you find the combination that gives the best result. While you could get close to this by adjusting the hyper-parameters by hand, re-running the model and checking your ROC/AUC score each time, it would be extremely laborious and time consuming.

Let’s take a look at applying the grid search technique to a real dataset and see what improvements we can generate over the baseline result from an unoptimised default model. For this, we’re going to use the Wisconsin Breast Cancer dataset, which is supplied with scikit-learn. First, we’ll load up the Python packages we need, and then load the data into a Pandas dataframe so we can see what work is required prior to modeling.

```
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
data = load_breast_cancer()
df = pd.DataFrame(np.c_[data['data'], data['target']],
                  columns=np.append(data['feature_names'], ['target']))
df.head()
```

| | mean radius | mean texture | mean perimeter | ... | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | ... | 0.4601 | 0.11890 | 0.0 |
| 1 | 20.57 | 17.77 | 132.90 | ... | 0.2750 | 0.08902 | 0.0 |
| 2 | 19.69 | 21.25 | 130.00 | ... | 0.3613 | 0.08758 | 0.0 |
| 3 | 11.42 | 20.38 | 77.58 | ... | 0.6638 | 0.17300 | 0.0 |
| 4 | 20.29 | 14.34 | 135.10 | ... | 0.2364 | 0.07678 | 0.0 |

5 rows × 31 columns

To understand how imbalanced the dataset is, we next use the Pandas `value_counts()` function to return the number of each value found in the target column. The data aren’t too imbalanced, so that makes things a little easier.

```
df['target'].value_counts()
```

```
1.0 357
0.0 212
Name: target, dtype: int64
```
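As a rough check on how mild the imbalance is, the counts above can be turned into class shares. This is a small sketch in plain Python, with the counts copied from the output above rather than recomputed from the dataframe.

```python
# Class counts copied from the value_counts() output above
counts = {1.0: 357, 0.0: 212}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")

# Ratio of the larger class to the smaller class
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

In Pandas itself, `df['target'].value_counts(normalize=True)` should give the same shares directly.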

As the data are fine for basic use as they are, we can now reload the data into `X` and `y` by passing the `return_X_y=True` parameter to the `load_breast_cancer()` function. Once we have `X` and `y`, we use the `train_test_split()` function to divide the data into `X_train` and `y_train` for training, and `X_test` and `y_test` for testing, holding back 30% of the rows for the test set.

```
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.30,
                                                    random_state=1)
```
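If you would like the train and test sets to keep the same class balance as the full dataset, one optional variation is to pass `stratify=y` to `train_test_split()`. This is not what the split above does, so treat it as a sketch of an alternative rather than part of the original workflow. The results later in this article come from the unstratified split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the share of each class roughly equal in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

print("rows:", len(X_train), "train /", len(X_test), "test")
print("positive share, full data:", round(y.mean(), 3))
print("positive share, train:", round(y_train.mean(), 3))
print("positive share, test:", round(y_test.mean(), 3))
```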

Next, we’ll create a basic model. This can be any scikit-learn compatible model, but I’ve gone for the XGBoost `XGBClassifier`, as it can run on a GPU, so hyper-parameter tuning flies by much faster than it normally would. As you’ll see below, I’ve not passed any hyper-parameters to `XGBClassifier()` at all, leaving it to run in its default state. If you print `model` you can see all of the hyper-parameter settings the given model uses.

```
model = XGBClassifier()
model
```

```
XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None, gamma=None,
              gpu_id=None, importance_type='gain', interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              random_state=None, reg_alpha=None, reg_lambda=None,
              scale_pos_weight=None, subsample=None, tree_method=None,
              validate_parameters=None, verbosity=None)
```

To determine the effectiveness of XGBoost on our dataset, we can use repeated stratified K fold cross validation. I’ve set this to use 10 folds (or splits) and to repeat the test three times, which gives 30 scores in total. The `random_state` parameter simply makes the results reproducible if you re-run them later, which can save lots of confusion. Then, we use the `cross_val_score()` function to return the ROC/AUC score for each run. A score of 1.0 means the model ranks the two classes perfectly, and 0.5 is no better than chance, so all of the scores we generate are pretty decent.

```
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)
for score in scores:
    print(score)
```

```
0.9893333333333334
1.0
1.0
0.9946666666666667
0.9893333333333334
0.96
0.9946666666666667
1.0
0.9857142857142858
1.0
0.992
0.9946666666666667
0.992
0.9973333333333334
1.0
1.0
1.0
0.9840000000000001
0.9971428571428571
0.9472222222222223
0.9893333333333334
0.9946666666666667
0.9946666666666667
0.984
0.9466666666666668
0.9893333333333334
1.0
1.0
1.0
1.0
```

The mean ROC/AUC score across all the folds was 0.99055, which is not bad for an unoptimised model. Next, we’ll tune the model and see what improvements we can generate.

```
print('Mean ROC/AUC = ', scores.mean())
```

```
Mean ROC/AUC = 0.9905582010582011
```
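The mean on its own hides how much the individual folds vary, which matters when judging whether a later gain of a few thousandths is meaningful. As a rough sketch, the 30 scores printed above can be summarised with Python's standard library. The scores are copied from the output above, and `statistics.stdev` is the sample standard deviation, so it will differ slightly from NumPy's default `std()`.

```python
from statistics import mean, stdev

# The 30 per-fold ROC/AUC scores printed above (10 folds x 3 repeats)
scores = [
    0.9893333333333334, 1.0, 1.0, 0.9946666666666667, 0.9893333333333334,
    0.96, 0.9946666666666667, 1.0, 0.9857142857142858, 1.0,
    0.992, 0.9946666666666667, 0.992, 0.9973333333333334, 1.0,
    1.0, 1.0, 0.9840000000000001, 0.9971428571428571, 0.9472222222222223,
    0.9893333333333334, 0.9946666666666667, 0.9946666666666667, 0.984,
    0.9466666666666668, 0.9893333333333334, 1.0, 1.0, 1.0, 1.0,
]

print(f"mean = {mean(scores):.4f}")
print(f"std  = {stdev(scores):.4f}")
print(f"min  = {min(scores):.4f}, max = {max(scores):.4f}")
```

If the spread across folds is larger than the difference between two models, it is worth being cautious about declaring one of them the winner.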

There are two main approaches to searching for your model’s best hyper-parameters in scikit-learn, `GridSearchCV` and `RandomizedSearchCV`, of which `GridSearchCV` is the most commonly seen. When you provide a grid of parameter values, `GridSearchCV` exhaustively tests every combination of parameters and reports the one which yields the best score.

Obviously, this means that the more parameters you provide, the longer it takes to complete. With XGBoost, one useful trick is to configure the model to run on your GPU instead of your CPU as this reduces the run time dramatically. On large datasets with lots of parameters, it could easily take hours or days for a GridSearchCV task to run.

Every scikit-learn model has a load of different hyper-parameters you can tweak to improve your model’s performance. While there are some standard parameters across models, many are specific to the model you’re using. To find which hyper-parameters your model supports, and how they are currently configured, we can print `model.get_params()` to see a full list.

```
model.get_params()
```

```
{'objective': 'binary:logistic',
'base_score': None,
'booster': None,
'colsample_bylevel': None,
'colsample_bynode': None,
'colsample_bytree': None,
'gamma': None,
'gpu_id': None,
'importance_type': 'gain',
'interaction_constraints': None,
'learning_rate': None,
'max_delta_step': None,
'max_depth': None,
'min_child_weight': None,
'missing': nan,
'monotone_constraints': None,
'n_estimators': 100,
'n_jobs': None,
'num_parallel_tree': None,
'random_state': None,
'reg_alpha': None,
'reg_lambda': None,
'scale_pos_weight': None,
'subsample': None,
'tree_method': None,
'validate_parameters': None,
'verbosity': None}
```

While you could blindly pass a load of different values and brute force the right combination, it is probably going to save you time if you check the documentation for the model you’re using to find out which hyper-parameters it has and what they do. Then you can simply configure the ones that matter to your dataset.

We’re going to configure a few of these - at random, just to show how this works - with a range of values to see what works best. Simply create a variable for each hyper-parameter and assign it a Python list of values, then create a `param_grid` Python dictionary containing each labelled hyper-parameter.

There are two techniques you can use here: you can do each hyper-parameter on its own, and then save the value, or you can do the whole lot at once. If you have a massive dataset then doing it in steps may be your only option, but I find doing all of the parameters together seems to give better results - and it’s easy if you leave it running overnight.

The other thing you can do to get incremental improvements is find a value that represents an improvement, then bracket either side of it. For example, if you find that a value of 2 gives a good score on a given hyper-parameter, try again with `[1, 2, 3]` and see if you get an improvement.
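The bracketing idea can be sketched as a tiny helper. The `bracket()` function below is hypothetical (it is not part of scikit-learn or XGBoost) and is shown with integer values only. For fractional hyper-parameters you would probably want to round the results and check they stay within the valid range.

```python
def bracket(value, step=1, width=1):
    """Return candidate values on either side of `value` (hypothetical helper)."""
    return [value + i * step for i in range(-width, width + 1)]


print(bracket(2))                       # [1, 2, 3]
print(bracket(100, step=25))            # [75, 100, 125]
print(bracket(100, step=25, width=2))   # [50, 75, 100, 125, 150]
```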

```
colsample_bytree = [0.3, 0.5, 1.0]
gamma = [0.1, 1, 1.5]
learning_rate = [0.001, 0.01]
min_child_weight = [1, 5, 10]
scale_pos_weight = [1, 2, 4]
subsample = [0.8, 0.9, 1.0]
n_estimators = [50, 100, 150]
max_depth = [5, 10]
param_grid = dict(
    colsample_bytree=colsample_bytree,
    gamma=gamma,
    learning_rate=learning_rate,
    min_child_weight=min_child_weight,
    scale_pos_weight=scale_pos_weight,
    subsample=subsample,
    n_estimators=n_estimators,
    max_depth=max_depth,
)
```
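Before launching a search it can help to work out how many model fits the grid implies. The sketch below counts the combinations in the grid above using only the standard library. The figure of 5 folds is an assumption based on scikit-learn's default when no `cv` argument is passed to `GridSearchCV`.

```python
from math import prod

# Same values as the param_grid defined above
param_grid = {
    "colsample_bytree": [0.3, 0.5, 1.0],
    "gamma": [0.1, 1, 1.5],
    "learning_rate": [0.001, 0.01],
    "min_child_weight": [1, 5, 10],
    "scale_pos_weight": [1, 2, 4],
    "subsample": [0.8, 0.9, 1.0],
    "n_estimators": [50, 100, 150],
    "max_depth": [5, 10],
}

# Every list multiplies the number of combinations to test
n_combinations = prod(len(values) for values in param_grid.values())

n_folds = 5  # assumption: GridSearchCV's default when cv is not set
n_fits = n_combinations * n_folds

print("combinations:", n_combinations)  # 2916
print("model fits:", n_fits)            # 14580
```

Adding just one more three-value hyper-parameter would triple both numbers, which is why large grids get slow so quickly.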

Once you have your `param_grid`, the next step is to run `GridSearchCV()` on your model, pass in the parameters to test and define how you’ll determine what is “best”. We’re going to use ROC/AUC again. If you run this, `GridSearchCV` will test every combination of the parameters in your `param_grid` and return the details of the combination which yields the highest ROC/AUC score.

```
model = XGBClassifier(random_state=1, verbosity=1)
grid_search = GridSearchCV(estimator=model,
                           param_grid=param_grid,
                           scoring='roc_auc',
                           )
best_model = grid_search.fit(X_train, y_train)
print('Optimum parameters', best_model.best_params_)
```

```
Optimum parameters {'colsample_bytree': 0.3, 'gamma': 0.1, 'learning_rate': 0.01, 'max_depth': 5, 'min_child_weight': 1, 'n_estimators': 150, 'scale_pos_weight': 1, 'subsample': 0.8}
```
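A fitted search object exposes more than `best_params_`. The sketch below shows `best_score_`, `best_estimator_` and dictionary unpacking of `best_params_`, which avoids copying values across by hand. To keep it quick and free of the XGBoost dependency, it swaps in scikit-learn's `DecisionTreeClassifier` as a stand-in, with a deliberately tiny and arbitrary grid. Those choices are assumptions for illustration, not the setup used in this article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Stand-in estimator and a tiny illustrative grid (not the article's grid)
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [2, 4], "min_samples_leaf": [1, 5]},
    scoring="roc_auc",
)
search.fit(X_train, y_train)

print("Best cross-validated ROC/AUC:", search.best_score_)
print("Optimum parameters:", search.best_params_)

# Option 1: reuse the model GridSearchCV refitted on all of X_train
# (available because refit=True is the default)
best_estimator = search.best_estimator_

# Option 2: unpack the winning parameters into a fresh, unfitted model
tuned_model = DecisionTreeClassifier(random_state=1, **search.best_params_)
```

The same pattern should carry over to `XGBClassifier`, since `best_params_` is a plain dictionary keyed by parameter name.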

Now that we have some tuned hyper-parameters, we can pass them to a model and re-train it, and then compare the K fold cross validation score with the one we generated with the default parameters. Our very quick and dirty tune up has given us a bit of an extra boost, with the ROC/AUC score increasing from 0.9905 to 0.9928. This might not look like much, but we already had a good score. To get further improvement from this method you can bracket around the values again, try new parameters and keep tweaking until you get a further improvement.

```
tuned_model = XGBClassifier(random_state=1,
                            colsample_bytree=0.3,
                            gamma=1,
                            learning_rate=0.01,
                            max_depth=5,
                            min_child_weight=1,
                            n_estimators=100,
                            scale_pos_weight=2,
                            subsample=0.9)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(tuned_model, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC/AUC = ', scores.mean())
```

```
Mean ROC/AUC = 0.9928222222222224
```

The other way of searching for good hyper-parameters is to use `RandomizedSearchCV` instead of `GridSearchCV`. The main issue with `GridSearchCV` is that it tries every combination, so it can be really, really slow when you provide lots of parameters to test and have a large dataset. `RandomizedSearchCV` works in a different way. Instead of trying every option, it tries only a random sample of combinations (10 by default, controlled by the `n_iter` parameter) drawn from a `param_distributions` dictionary, which takes the place of the `param_grid`. This is way faster and runs in a split second on this dataset.

```
colsample_bytree = [0.1, 0.3, 0.5, 1.0]
gamma = [0, 0.1, 1]
learning_rate = [0.001, 0.05, 0.08, 0.1]
min_child_weight = [1, 5, 10, 20]
scale_pos_weight = [0.5, 1, 2, 4, 6]
subsample = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
n_estimators = [25, 50, 100, 150]
max_depth = [3, 5, 10, 20, 40, 100]
param_distributions = dict(
    colsample_bytree=colsample_bytree,
    gamma=gamma,
    learning_rate=learning_rate,
    min_child_weight=min_child_weight,
    scale_pos_weight=scale_pos_weight,
    subsample=subsample,
    n_estimators=n_estimators,
    max_depth=max_depth,
)
model = XGBClassifier(random_state=1, verbosity=1)
grid_search = RandomizedSearchCV(estimator=model,
                                 param_distributions=param_distributions,
                                 scoring='roc_auc',
                                 )
best_model = grid_search.fit(X_train, y_train)
print('Optimum parameters', best_model.best_params_)
```

```
Optimum parameters {'subsample': 0.6, 'scale_pos_weight': 4, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 20, 'learning_rate': 0.05, 'gamma': 1, 'colsample_bytree': 1.0}
```
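To see why the randomized search is so much cheaper, it helps to compare the size of the full search space with the handful of combinations actually tried. The sketch below illustrates the idea with the standard library only. It is not scikit-learn's internal implementation, and the seed is arbitrary. The value of 10 draws mirrors the default `n_iter` of `RandomizedSearchCV`.

```python
import random
from itertools import product

# Same values as the param_distributions defined above
param_distributions = {
    "colsample_bytree": [0.1, 0.3, 0.5, 1.0],
    "gamma": [0, 0.1, 1],
    "learning_rate": [0.001, 0.05, 0.08, 0.1],
    "min_child_weight": [1, 5, 10, 20],
    "scale_pos_weight": [0.5, 1, 2, 4, 6],
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "n_estimators": [25, 50, 100, 150],
    "max_depth": [3, 5, 10, 20, 40, 100],
}

# Every combination an exhaustive grid search would have to test
all_combinations = list(product(*param_distributions.values()))

n_iter = 10             # default number of draws in RandomizedSearchCV
rng = random.Random(1)  # arbitrary seed so the sketch is repeatable
sampled = rng.sample(all_combinations, n_iter)

print("full grid:", len(all_combinations))  # 138240
print("sampled:", len(sampled))             # 10
for combo in sampled[:3]:
    print(dict(zip(param_distributions, combo)))
```

Because only a tiny fraction of the space is visited, the result depends on which combinations happen to be drawn. Passing `random_state` to `RandomizedSearchCV` makes its draws repeatable between runs.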

Finally, we will take the values from `RandomizedSearchCV` and re-run our K fold cross validation to see what improvements the new parameters bring. This gives us an extra boost and increases the score to 0.9931.

```
tuned_model = XGBClassifier(random_state=1,
                            colsample_bytree=1,
                            learning_rate=0.05,
                            max_depth=20,
                            min_child_weight=1,
                            n_estimators=100,
                            subsample=0.6)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(tuned_model, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC/AUC = ', scores.mean())
```

```
Mean ROC/AUC = 0.9931507936507937
```

If you use a specific model frequently, you might want to create a `param_distributions` dictionary containing a wide range of possible hyper-parameter values and then wrap up the code above in a function. This will allow you to quickly tune a model and find out which hyper-parameters need further tweaking. You can then do your fine-tuning with `GridSearchCV`.

One common parameter you may need to tune is the `scale_pos_weight` in XGBoost. This is of particular use on imbalanced datasets and can be calculated from the ratio of the negative class over the positive class. Oddly, it’s not always the exact value which gives the best result, so “bracketing” and using a value either side can be of use. Here’s a little function I knocked up to do this.

```
def get_scale_pos_weight(target, square_root=False, gridsearch=False):
    """Return the scale_pos_weight parameter for the XGBoost model when data are imbalanced.

    The scale_pos_weight parameter is calculated from the ratio of the negative class over
    the positive class. The exact scale_pos_weight sometimes does not give the best result,
    so by passing the gridsearch=True parameter you can return a list of values to test with
    GridSearchCV. In addition, passing square_root=True changes the scale_pos_weight to the
    square root value, which can sometimes be beneficial on extremely imbalanced data.

    :param target: Pandas dataframe column containing the binary target
    :param square_root: Optional boolean parameter to convert to square root on extremely unbalanced data
    :param gridsearch: Optional boolean parameter to return a bracketed list for use in GridSearchCV

    Usage:
        scale_pos_weight = get_scale_pos_weight(df['target'], square_root=False, gridsearch=True)
    """
    import math

    # Ratio of negatives (0) to positives (1), rounded to the nearest whole number
    scale_pos_weight = round((len(target) - sum(target)) / sum(target))
    if square_root:
        scale_pos_weight = round(math.sqrt(scale_pos_weight))
    if gridsearch:
        # Bracket the value on either side, dropping candidates that are not positive
        candidates = [scale_pos_weight + offset for offset in (-2, -1, 0, 1, 2)]
        scale_pos_weight = [weight for weight in candidates if weight > 0]
    return scale_pos_weight
```
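The arithmetic behind the function can be checked by hand. The sketch below works through it on a hypothetical, heavily imbalanced target of 950 negatives and 50 positives. These figures are made up for illustration and do not come from the breast cancer data.

```python
import math

# Hypothetical imbalanced target: 950 negatives (0) and 50 positives (1)
target = [0] * 950 + [1] * 50

positives = sum(target)
negatives = len(target) - positives

# Ratio of the negative class over the positive class: 950 / 50 = 19
scale_pos_weight = round(negatives / positives)

# Square-root variant: sqrt(19) is roughly 4.36, which rounds to 4
sqrt_weight = round(math.sqrt(scale_pos_weight))

# Bracketed list of candidates for a grid search
bracketed = [scale_pos_weight + offset for offset in range(-2, 3)]

print(scale_pos_weight)  # 19
print(sqrt_weight)       # 4
print(bracketed)         # [17, 18, 19, 20, 21]
```

On the breast cancer data used in this article the ratio is 212 / 357, or about 0.59, so the rounded weight would be 1 and bracketing adds little.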

If you’re looking for a faster solution to hyperparameter tuning, do check out Optuna. It’s a hyperparameter optimization framework that can be used to tune XGBoost and other models. It’s also very easy to use and can be used to tune XGBoost in just a few lines of code.

Matt Clarke, Tuesday, March 02, 2021