How to tune an XGBRegressor model with Optuna

Learn how to tune the hyperparameters of an XGBRegressor regression model with Optuna and improve the model's performance.


XGBRegressor is XGBoost's model for regression tasks and one of the most effective regression models used in machine learning. As with the other XGBoost models, it's a gradient boosting model that uses a tree-based learning algorithm to make predictions. It's very powerful out of the box, but it also has a lot of hyperparameters that can be tuned to improve its performance beyond what the default settings give you.

In this tutorial, you'll learn how to tune the hyperparameters of an XGBRegressor model with Optuna. Optuna is rapidly taking over from GridSearchCV and RandomizedSearchCV as the preferred method for hyperparameter tuning. Rather than exhaustively trying every combination of values, it searches the parameter space more intelligently, so it's far more efficient, and it can be used to tune any model, not just XGBoost models. The trade-off is that it takes a little more code to set up, but it usually finds good hyperparameters in far fewer trials.

Install the packages

To get started, you’ll need to install Optuna and XGBoost. Open a Jupyter notebook and run the following commands to install the packages using the Pip package manager:

!pip3 install optuna
!pip3 install xgboost

Load the packages

Next, you'll need to import the packages you'll be using in this tutorial. You'll use Pandas to load the dataset, Numpy to perform some operations on it, StandardScaler to scale the features, train_test_split to create the training and test sets, and XGBoost to train the model. You'll also import the mean_squared_error function from the scikit-learn metrics module to evaluate the model, and Pickle to save the trained model at the end. Finally, you'll need Optuna to tune the hyperparameters of the model.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost
import optuna
import pickle

Load the data

To keep things simple, we'll create a regression model that predicts the price of a house based on a few features. You can load the remote dataset directly from my GitHub repository using the Pandas read_csv() function. If you print the head of the dataset you'll see the features and the target variable, and if you use the info() method you'll see the columns in the dataset and their data types, or dtypes.

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/housing.csv')
df.head().T
                           0         1         2         3         4
longitude            -122.23   -122.22   -122.24   -122.25   -122.25
latitude               37.88     37.86     37.85     37.85     37.85
housing_median_age        41        21        52        52        52
total_rooms              880      7099      1467      1274      1627
total_bedrooms           129      1106       190       235       280
population               322      2401       496       558       565
households               126      1138       177       219       259
median_income         8.3252    8.3014    7.2574    5.6431    3.8462
median_house_value    452600    358500    352100    341300    342200
ocean_proximity     NEAR BAY  NEAR BAY  NEAR BAY  NEAR BAY  NEAR BAY
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

Engineer additional features

In this particular dataset there's one useful categorical feature we need to prepare: ocean_proximity. It describes how close each property is to the ocean, which matters because properties near the coast tend to be worth more. We'll tidy up the category labels and then use the get_dummies() function to one-hot encode them so they can be used by the model.

df['ocean_proximity'] = df['ocean_proximity'].str.lower().replace('[^0-9a-zA-Z]+', '_', regex=True)
df['ocean_proximity'].value_counts()
_1h_ocean     9136
inland        6551
near_ocean    2658
near_bay      2290
island           5
Name: ocean_proximity, dtype: int64
encodings = pd.get_dummies(df['ocean_proximity'], prefix='proximity')
df = pd.concat([df, encodings], axis=1)
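
If you want to sanity-check the result, you can take a quick look at the new dummy columns. The filter() call below is just an optional check and isn't part of the original workflow:

# Optional check: inspect the new one-hot encoded proximity columns
df.filter(like='proximity_').head()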

Impute missing values

Next, we'll impute any missing values in the dataset. The info() output above showed that only the total_bedrooms column contains missing values, so we'll use simple zero imputation and fill them in with zeros.

df = df.fillna(0)
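
Zero imputation is quick, but it does treat missing bedroom counts as genuinely zero. If you'd rather not do that, a common alternative, shown here as an optional sketch rather than part of the main workflow, is to fill the missing total_bedrooms values with the column's median instead:

# Alternative approach: fill missing total_bedrooms values with the column median
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())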

Define the target variable

Now we need to define the target variable. In this case, we're trying to predict the median house value, so we'll assign that column to y. We'll also drop the target variable and the original ocean_proximity column (we've already one-hot encoded it) from the dataset and assign the remaining columns to X.

X = df.drop(['median_house_value', 'ocean_proximity'], axis=1)
y = df['median_house_value']

Split the data into training and test sets

We’ll split the data into training and test sets next. We’ll use 70% of the data for training and 30% for testing. We’ll also set the random_state parameter to 1 so that the results are reproducible.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Scale the data

We’ll use the StandardScaler class from scikit-learn to scale the data. We’ll fit the scaler to the training data and then use it to transform both the training and test sets.

# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Train the model

Now that we've prepared our data we can train the base model. We won't pass in any optional hyperparameters, so the model will use the defaults, and then we'll fit it to the training data.

# Define the model
model = xgboost.XGBRegressor()

# Fit the model
model.fit(X_train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

Make predictions and evaluate the model

Now we can make predictions on the test set and evaluate the model. We'll use the mean_squared_error function from scikit-learn to calculate the mean squared error, and we'll calculate the root mean squared error by taking the square root of that value. We get back an RMSE of around 48,130. That's a reasonable baseline, but we can probably do a bit better by tuning the hyperparameters.

# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
print('MSE: ', mean_squared_error(y_test, y_pred))
print('RMSE: ', np.sqrt(mean_squared_error(y_test, y_pred)))
MSE:  2316583030.746331
RMSE:  48130.894763616554
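
Depending on your scikit-learn version, you may also be able to get the RMSE straight from mean_squared_error() without the np.sqrt() step (newer releases expose a separate root_mean_squared_error() function instead):

# In many scikit-learn versions, passing squared=False returns the RMSE directly
print('RMSE: ', mean_squared_error(y_test, y_pred, squared=False))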

Use Optuna to tune the XGBRegressor model

Next we’ll use Optuna to tune the hyperparameters of the XGBRegressor model. Optuna lets you tune the hyperparameters of any model, not just XGBoost models. The first step in the process is to define an objective function. The objective function is the function that Optuna will try to optimize. In our case, we’re trying to minimize the mean squared error.

At the top of the function we’ll define a dictionary called param that contains the hyperparameters we want to tune. Depending on the model you’re using, you’ll need to tune different hyperparameters. You can find the hyperparameters for the XGBRegressor model in the XGBoost documentation. For each hyperparameter, we’ll use either a suggest_int() or a suggest_float() function to define the range of values we want to try.

Next, we’ll use the XGBRegressor() function to create a model with the hyperparameters we want to tune. We’ll then use the fit() function to train the model and the predict() function to make predictions. Finally, we’ll use the mean_squared_error() function to evaluate the model and return the result.

def objective(trial):
    param = {
        'max_depth': trial.suggest_int('max_depth', 1, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 1.0),
        'n_estimators': trial.suggest_int('n_estimators', 50, 1000),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'gamma': trial.suggest_float('gamma', 0.01, 1.0),
        'subsample': trial.suggest_float('subsample', 0.01, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.01, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.01, 1.0),
        'random_state': trial.suggest_int('random_state', 1, 1000)
    }
    model = xgboost.XGBRegressor(**param)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return mean_squared_error(y_test, y_pred)
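
One thing worth knowing: because this objective scores every trial on the same test set we use for the final evaluation, the tuned hyperparameters are partly fitted to that test set, so the final RMSE can look slightly optimistic. A variant, sketched below with an abbreviated search space and illustrative names (X_tr, X_val, y_tr, y_val), carves a validation split out of the training data inside the objective and scores trials on that instead:

# A variant of the objective that scores trials on a validation split instead
# of the test set. The search space is abbreviated here for brevity.
def objective_with_validation(trial):
    param = {
        'max_depth': trial.suggest_int('max_depth', 1, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 1.0),
        'n_estimators': trial.suggest_int('n_estimators', 50, 1000),
    }
    # Carve a validation set out of the training data for this trial
    X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
    model = xgboost.XGBRegressor(**param)
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_val, model.predict(X_val))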

Create the Optuna study

Now that we've defined the objective function, we can create the Optuna study. The study is the object that Optuna uses to run the optimization process. We'll use the create_study() function to create the study and the optimize() function to run the optimization process.

The optimize() function takes the objective function as an argument and the number of trials to run as a keyword argument. The number of trials is the number of times Optuna will try to optimize the objective function. The more trials you run, the more likely you are to find the optimal hyperparameters, but the longer it will take to run the optimization process.

# Create the study
study = optuna.create_study(direction='minimize', study_name='regression')
study.optimize(objective, n_trials=100)
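
By default Optuna logs a line for every trial, which gets noisy with 100 trials. If you'd prefer quieter output, you can lower the logging verbosity before calling optimize(); this is an optional tweak and doesn't affect the tuning itself:

# Optional: only show warnings and errors from Optuna while the study runs
optuna.logging.set_verbosity(optuna.logging.WARNING)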

After the optimization process is complete, you can print the results of the optimization process. The best_params attribute of the study object contains the optimal hyperparameters. The best_value attribute contains the value of the objective function for the optimal hyperparameters. The best_trial attribute contains the results of the trial that produced the optimal hyperparameters.

# Print the best parameters
print('Best parameters', study.best_params)
Best parameters {'max_depth': 9, 'learning_rate': 0.05948019648043046, 'n_estimators': 601, 'min_child_weight': 1, 'gamma': 0.1784858704993009, 'subsample': 0.7898711401609732, 'colsample_bytree': 0.5222222699977733, 'reg_alpha': 0.18106701148336968, 'reg_lambda': 0.825342759728178, 'random_state': 181}
# Print the best value
print('Best value', study.best_value)
Best value 1989472979.3372276
# Print the best trial
print('Best trial', study.best_trial)
Best trial FrozenTrial(number=62, values=[1989472979.3372276], datetime_start=datetime.datetime(2022, 10, 13, 15, 37, 11, 547173), datetime_complete=datetime.datetime(2022, 10, 13, 15, 37, 14, 445737), params={'max_depth': 9, 'learning_rate': 0.05948019648043046, 'n_estimators': 601, 'min_child_weight': 1, 'gamma': 0.1784858704993009, 'subsample': 0.7898711401609732, 'colsample_bytree': 0.5222222699977733, 'reg_alpha': 0.18106701148336968, 'reg_lambda': 0.825342759728178, 'random_state': 181}, distributions={'max_depth': IntDistribution(high=10, log=False, low=1, step=1), 'learning_rate': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'n_estimators': IntDistribution(high=1000, log=False, low=50, step=1), 'min_child_weight': IntDistribution(high=10, log=False, low=1, step=1), 'gamma': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'subsample': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'colsample_bytree': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'reg_alpha': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'reg_lambda': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'random_state': IntDistribution(high=1000, log=False, low=1, step=1)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=62, state=TrialState.COMPLETE, value=None)
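
If you want to dig into the full optimization history rather than just the best trial, the study object can also export every trial to a Pandas DataFrame:

# Export all trials to a DataFrame and show the best-scoring ones first
trials_df = study.trials_dataframe()
trials_df.sort_values('value').head()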

Use the optimal hyperparameters to train the model

Now that we have the optimal hyperparameters, we can use them to train the model. We’ll use the XGBRegressor() function to create a model with the optimal hyperparameters by passing in **study.best_params.

We'll then use the fit() function to train the model and the predict() function to make predictions. Finally, we'll use the mean_squared_error() function to evaluate the model and print the results. We get back an RMSE of around 44,603, which is a significant improvement on the 48,130 we got from the base model.

model = xgboost.XGBRegressor(**study.best_params)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('MSE: ', mean_squared_error(y_test, y_pred))
print('RMSE: ', np.sqrt(mean_squared_error(y_test, y_pred)))
MSE:  1989472979.3372276
RMSE:  44603.50859895696

Save the model using Pickle

Finally, we’ll use Pickle to save our machine learning model. This will allow us to use the model in the future without having to retrain it.

# Save the trained model to disk so it can be reloaded later without retraining
filename = "xgbregressor.pkl"
pickle.dump(model, open(filename, "wb"))
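
When you need the model again, you can load it straight back from the file and use it for predictions without retraining:

# Reload the saved model and check it still makes predictions
loaded_model = pickle.load(open(filename, "rb"))
loaded_model.predict(X_test)[:5]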

Matt Clarke, Thursday, October 13, 2022
