The XGBRegressor regression model in XGBoost is one of the most effective regression models used in machine learning. As with the other XGBoost models, XGBRegressor is a gradient boosting model that uses a tree-based learning algorithm to make predictions. It’s really powerful but also has a lot of hyperparameters that can be tuned to improve the model’s performance even more than with the default settings.
In this tutorial, you’ll learn how to tune the hyperparameters of an XGBRegressor model with Optuna. Optuna is rapidly taking over from GridSearchCV and RandomizedSearchCV as the preferred method for hyperparameter tuning. It’s a lot more efficient and can be used to tune any model, not just XGBoost models. The downside is that it’s a lot more complex to use but can quickly yield faster results.
To get started, you’ll need to install Optuna and XGBoost. Open a Jupyter notebook and run the following commands to install the packages using the Pip package manager:
!pip3 install optuna
!pip3 install xgboost
Next, you’ll need to import the packages you’ll be using in this tutorial. You’ll use Pandas to load the dataset, Numpy to perform some operations on the dataset, the StandardScaler to scale the dataset, and XGBoost to train the model. You’ll also import the mean_squared_error function from the Scikit-Learn metrics module to evaluate the model. Finally, you’ll need Optuna to tune the hyperparameters of the model.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost
import optuna
To keep things simple we’ll create a regression model that predicts the price of a house based
on a few features. You can load the remote dataset directly from my GitHub repository using the Pandas read_csv()
function. If you print the head of the dataset you’ll see the features and the target variable. If you use the info()
method you’ll see the columns in the dataset and their data types or dtypes.
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/housing.csv')
df.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
longitude | -122.23 | -122.22 | -122.24 | -122.25 | -122.25 |
latitude | 37.88 | 37.86 | 37.85 | 37.85 | 37.85 |
housing_median_age | 41 | 21 | 52 | 52 | 52 |
total_rooms | 880 | 7099 | 1467 | 1274 | 1627 |
total_bedrooms | 129 | 1106 | 190 | 235 | 280 |
population | 322 | 2401 | 496 | 558 | 565 |
households | 126 | 1138 | 177 | 219 | 259 |
median_income | 8.3252 | 8.3014 | 7.2574 | 5.6431 | 3.8462 |
median_house_value | 452600 | 358500 | 352100 | 341300 | 342200 |
ocean_proximity | NEAR BAY | NEAR BAY | NEAR BAY | NEAR BAY | NEAR BAY |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
In this particular dataset there’s a useful feature we need to engineer. This measures the distance from the property to the ocean, since beachside properties have a higher value. We’ll tidy the data and then use the get_dummies()
function to one-hot encode the data so they can be used by the model.
df['ocean_proximity'].value_counts()
_1h_ocean 9136
inland 6551
near_ocean 2658
near_bay 2290
island 5
Name: ocean_proximity, dtype: int64
df['ocean_proximity'] = df['ocean_proximity'].str.lower().replace('[^0-9a-zA-Z]+', '_', regex=True)
encodings = pd.get_dummies(df['ocean_proximity'], prefix='proximity')
df = pd.concat([df, encodings], axis=1)
Next, we’ll impute any missing values in the dataset. We’ll use zero imputation for this and will simply fill in the missing values with zeros.
df = df.fillna(0)
Now we need to define the target variable. In this case, we’re trying to predict the median house value, so we’ll set that as the target variable and assign it to y
. We’ll also drop the target variable and a couple of categorical variables from the dataset and assign the remaining columns to X
.
X = df.drop(['median_house_value', 'ocean_proximity'], axis=1)
y = df['median_house_value']
We’ll split the data into training and test sets next. We’ll use 70% of the data for training and 30% for testing. We’ll also set the random_state
parameter to 1 so that the results are reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
We’ll use the StandardScaler
class from scikit-learn to scale the data. We’ll fit the scaler to the training data and then use it to transform both the training and test sets.
# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Now have prepared our data we can train the base model. We’ll just use the defaults and won’t pass in any optional hyperparameters, then we’ll fit the model to the training data.
# Define the model
model = xgboost.XGBRegressor()
# Fit the model
model.fit(X_train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
Now we can make predictions on the test set and evaluate the model. We’ll use the mean_squared_error
function from scikit-learn to calculate the mean squared error and we’ll calculate the root mean squared error by taking the square root of the mean squared error. We get back an RMSE of 48,130. This is a good score, but we can probably do a bit better by tuning the hyperparameters.
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
print('MSE: ', mean_squared_error(y_test, y_pred))
print('RMSE: ', np.sqrt(mean_squared_error(y_test, y_pred)))
MSE: 2316583030.746331
RMSE: 48130.894763616554
Next we’ll use Optuna to tune the hyperparameters of the XGBRegressor model. Optuna lets you tune the hyperparameters of any model, not just XGBoost models. The first step in the process is to define an objective function. The objective function is the function that Optuna will try to optimize. In our case, we’re trying to minimize the mean squared error.
At the top of the function we’ll define a dictionary called param
that contains the hyperparameters we want to tune. Depending on the model you’re using, you’ll need to tune different hyperparameters. You can find the hyperparameters for the XGBRegressor model in the XGBoost documentation. For each hyperparameter, we’ll use either a suggest_int()
or
a suggest_float()
function to define the range of values we want to try.
Next, we’ll use the XGBRegressor()
function to create a model with the hyperparameters we want to tune. We’ll then use the fit()
function to train the model and the predict()
function to make predictions. Finally, we’ll use the mean_squared_error()
function to evaluate the model and return the result.
def objective(trial):
param = {
'max_depth': trial.suggest_int('max_depth', 1, 10),
'learning_rate': trial.suggest_float('learning_rate', 0.01, 1.0),
'n_estimators': trial.suggest_int('n_estimators', 50, 1000),
'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
'gamma': trial.suggest_float('gamma', 0.01, 1.0),
'subsample': trial.suggest_float('subsample', 0.01, 1.0),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.01, 1.0),
'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 1.0),
'reg_lambda': trial.suggest_float('reg_lambda', 0.01, 1.0),
'random_state': trial.suggest_int('random_state', 1, 1000)
}
model = xgboost.XGBRegressor(**param)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
return mean_squared_error(y_test, y_pred)
Now that we’ve defined the objective function, we can create the Optuna study. The study is the object that Optuna uses to run the optimization process. We’ll use the create_study()
function to create the study and the optimize()
function to run the optimization process.
The optimize()
function takes the objective function as an argument and
the number of trials to run as a keyword argument. The number of trials is the number of times Optuna will try to optimize the objective function. The more trials you run, the more likely you are to find the optimal hyperparameters, but the longer it will take to run the optimization process.
# Create the study
study = optuna.create_study(direction='minimize', study_name='regression')
study.optimize(objective, n_trials=100)
After the optimization process is complete, you can print the results of the optimization process. The best_params
attribute of the study object contains the optimal hyperparameters. The best_value
attribute contains the value of the objective function for the optimal hyperparameters. The best_trial
attribute contains the results of the trial
that produced the optimal hyperparameters.
# Print the best parameters
print('Best parameters', study.best_params)
Best parameters {'max_depth': 9, 'learning_rate': 0.05948019648043046, 'n_estimators': 601, 'min_child_weight': 1, 'gamma': 0.1784858704993009, 'subsample': 0.7898711401609732, 'colsample_bytree': 0.5222222699977733, 'reg_alpha': 0.18106701148336968, 'reg_lambda': 0.825342759728178, 'random_state': 181}
# Print the best value
print('Best value', study.best_value)
Best value 1989472979.3372276
# Print the best trial
print('Best trial', study.best_trial)
Best trial FrozenTrial(number=62, values=[1989472979.3372276], datetime_start=datetime.datetime(2022, 10, 13, 15, 37, 11, 547173), datetime_complete=datetime.datetime(2022, 10, 13, 15, 37, 14, 445737), params={'max_depth': 9, 'learning_rate': 0.05948019648043046, 'n_estimators': 601, 'min_child_weight': 1, 'gamma': 0.1784858704993009, 'subsample': 0.7898711401609732, 'colsample_bytree': 0.5222222699977733, 'reg_alpha': 0.18106701148336968, 'reg_lambda': 0.825342759728178, 'random_state': 181}, distributions={'max_depth': IntDistribution(high=10, log=False, low=1, step=1), 'learning_rate': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'n_estimators': IntDistribution(high=1000, log=False, low=50, step=1), 'min_child_weight': IntDistribution(high=10, log=False, low=1, step=1), 'gamma': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'subsample': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'colsample_bytree': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'reg_alpha': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'reg_lambda': FloatDistribution(high=1.0, log=False, low=0.01, step=None), 'random_state': IntDistribution(high=1000, log=False, low=1, step=1)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=62, state=TrialState.COMPLETE, value=None)
Now that we have the optimal hyperparameters, we can use them to train the model. We’ll use the XGBRegressor()
function to create a model with the optimal hyperparameters by passing in **study.best_params
.
We’ll then use the fit()
function to train the model and the predict()
function to make predictions. Finally, we’ll use the mean_squared_error()
function to evaluate the model and print the results. We get back an RMSE of 44,603 which is a significant improvement over the RMSE of the base model.
model = xgboost.XGBRegressor(**study.best_params)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('MSE: ', mean_squared_error(y_test, y_pred))
print('RMSE: ', np.sqrt(mean_squared_error(y_test, y_pred)))
MSE: 1989472979.3372276
RMSE: 44603.50859895696
Finally, we’ll use Pickle to save our machine learning model. This will allow us to use the model in the future without having to retrain it.
filename = "xgbregressor.pkl"
pickle.dump(model, open(filename, "wb"))
Matt Clarke, Thursday, October 13, 2022