AdaBoost is a boosting algorithm that combines multiple weak learners into a strong learner. It is a sequential technique that works by fitting a classifier on the original dataset and then fitting additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
Weak learners are classifiers that perform only slightly better than random guessing, such as a decision stump (a decision tree with a maximum depth of one), which is the default base estimator in scikit-learn's AdaBoostClassifier. The idea behind boosting is to train predictors sequentially, each one trying to correct its predecessor. There are many boosting methods available, including the gradient boosting implementations XGBoost, LightGBM, and CatBoost. AdaBoost was one of the first boosting methods and is still widely used.
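To make the reweighting idea concrete, here's a minimal sketch of a binary AdaBoost-style training loop built from decision stumps. It's an illustration rather than scikit-learn's exact implementation: it assumes labels coded as -1 and +1, and the adaboost_sketch name is ours.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_sketch(X, y, n_rounds=5):
    # Illustration only: assumes binary labels coded as -1 and +1
    n = len(y)
    weights = np.full(n, 1.0 / n)  # start with uniform instance weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)  # the weak learner
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = weights[pred != y].sum()  # weighted training error
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # this learner's vote
        weights *= np.exp(-alpha * y * pred)  # up-weight misclassified rows
        weights /= weights.sum()  # renormalise so the weights sum to one
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

The ensemble's prediction is then the sign of the alpha-weighted sum of the stumps' votes, so later stumps, trained on the re-weighted data, specialise in the cases earlier stumps got wrong.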
For this project we’ll be using scikit-learn and the Optuna hyperparameter optimization library. Open a Jupyter notebook and install the scikit-learn and Optuna packages using the following commands:
!pip3 install scikit-learn
!pip3 install optuna
Next, we’ll load the packages that we’ll be using in this project. We’ll be using the wine dataset from scikit-learn as this avoids the need to do any data preprocessing. We’ll also be using the train_test_split function to split the data into training and testing sets. We’ll use the AdaBoostClassifier class to create the AdaBoost model. Finally, we’ll use the accuracy_score function to evaluate the model and Pickle to save the model.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.datasets import load_wine
from sklearn.ensemble import AdaBoostClassifier
import optuna
from optuna.samplers import TPESampler
import pickle
We’ll load the wine dataset from scikit-learn and split the data into training and testing sets. The return_X_y parameter is set to True so that we get back a tuple containing the data and the target labels. We’ll also set the as_frame parameter to True so that we get the data as a Pandas DataFrame.
X, y = load_wine(return_X_y=True, as_frame=True)
X.sample(5)
| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 162 | 12.85 | 3.27 | 2.58 | 22.0 | 106.0 | 1.65 | 0.60 | 0.60 | 0.96 | 5.58 | 0.87 | 2.11 | 570.0 |
| 102 | 12.34 | 2.45 | 2.46 | 21.0 | 98.0 | 2.56 | 2.11 | 0.34 | 1.31 | 2.80 | 0.80 | 3.38 | 438.0 |
| 108 | 12.22 | 1.29 | 1.94 | 19.0 | 92.0 | 2.36 | 2.04 | 0.39 | 2.08 | 2.70 | 0.86 | 3.02 | 312.0 |
| 14 | 14.38 | 1.87 | 2.38 | 12.0 | 102.0 | 3.30 | 3.64 | 0.29 | 2.96 | 7.50 | 1.20 | 3.00 | 1547.0 |
| 126 | 12.43 | 1.53 | 2.29 | 21.5 | 86.0 | 2.74 | 3.15 | 0.39 | 1.77 | 3.94 | 0.69 | 2.84 | 352.0 |
The y variable contains the target labels. We can use the value_counts method to examine the distribution of the target labels. This will tell us whether the dataset is balanced or not.
y.value_counts()
1 71
0 59
2 48
Name: target, dtype: int64
Next we need to use the X and y data to create training and testing sets. We’ll use 30% of the data for testing by setting the test_size parameter to 0.3. We’ll also set the random_state parameter to 1 so that we get the same split each time we run the code.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
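We’ll stick with this simple split so the outputs below are reproducible as shown. As an aside, because the class counts we saw earlier are slightly uneven, you could pass the stratify parameter so that both splits preserve the class proportions:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y  # preserve class proportions in both splits
)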
Now we can create our AdaBoost classification model using the AdaBoostClassifier class. We’ll set the n_estimators parameter to 100 and the random_state parameter to 1 so that we get the same results each time we run the code. We’ll then run the fit method to train the model on the training data.
model = AdaBoostClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)
AdaBoostClassifier(n_estimators=100, random_state=1)
We can now use the predict method to make predictions on the test data by passing the X_test data to the method. We will assign the predictions to the y_pred variable. These get returned as a NumPy array.
y_pred = model.predict(X_test)
y_pred
array([2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2,
2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 1])
We can use the accuracy_score function to evaluate the model. This compares the predicted values with the actual values and returns the proportion of correct predictions as a value between 0 and 1. We’ll also use the classification_report function to get a more detailed breakdown of the model’s performance. The initial results aren’t particularly great, suggesting that we might be able to get a big improvement by tuning AdaBoost’s hyperparameters.
print('Accuracy score', accuracy_score(y_test, y_pred))
Accuracy score 0.5370370370370371
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.00 0.00 0.00 23
1 0.42 0.89 0.58 19
2 0.86 1.00 0.92 12
accuracy 0.54 54
macro avg 0.43 0.63 0.50 54
weighted avg 0.34 0.54 0.41 54
Like other models, AdaBoost has hyperparameters that can be tuned to improve the model’s performance. In the past, we have used a grid search to tune the hyperparameters. However, a grid search exhaustively tests every combination of values, which can be a time-consuming process. Optuna is a hyperparameter optimization library that uses informed sampling to concentrate on promising regions of the search space, so it typically finds good values in far fewer trials.
To use Optuna, we need to define an objective function that tests a range of model hyperparameter values, fits the model to the training data, and returns a score for Optuna to optimize. As our objective is to maximize the model’s accuracy, we’ll use the accuracy_score function to evaluate the model. The AdaBoost hyperparameters we’ll tune are n_estimators and learning_rate.
def objective(trial):
    n_estimators = trial.suggest_int("n_estimators", 50, 500)
    learning_rate = trial.suggest_float("learning_rate", 0.001, 1.0, log=True)
    model = AdaBoostClassifier(n_estimators=n_estimators, learning_rate=learning_rate, random_state=1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred)
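One caveat with this objective: each trial is scored on the test set, so the chosen hyperparameters are indirectly fitted to it and the final score can be optimistic. A common alternative, sketched below assuming 5-fold cross-validation, is to score trials on the training data only using scikit-learn’s cross_val_score. We’ll keep the simpler objective above so the outputs below match.

from sklearn.model_selection import cross_val_score

def objective_cv(trial):
    n_estimators = trial.suggest_int("n_estimators", 50, 500)
    learning_rate = trial.suggest_float("learning_rate", 0.001, 1.0, log=True)
    model = AdaBoostClassifier(n_estimators=n_estimators, learning_rate=learning_rate, random_state=1)
    # Average accuracy across folds of the training data; the test set stays untouched
    return cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()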
Next we’ll create an Optuna study and run it using the TPESampler sampler. We’ll set the direction parameter to maximize as we want to maximize the model’s accuracy. We’ll also set the n_trials parameter to 100 so that Optuna will test 100 different combinations of hyperparameter values. Once the study has run, we can print the number of finished trials, the best accuracy score, and the best hyperparameter values.
sampler = TPESampler(seed=1)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=100)
print("Number of finished trials: ", len(study.trials))
print("Best trial:")
trial = study.best_trial
print(" Value: ", trial.value)
print(" Params: ")
for key, value in trial.params.items():
    print(" {}: {}".format(key, value))
Number of finished trials: 100
Best trial:
Value: 0.9814814814814815
Params:
n_estimators: 117
learning_rate: 0.47285383296767425
We can now re-fit the model using the best hyperparameter values, which we can retrieve from the params attribute of the best trial. We’ll then use the fit method to re-fit the model on the training data, use the predict method to make predictions on the test data, and use the accuracy_score function to evaluate the model. As you can see, the accuracy score has improved significantly, from just 0.537 to 0.981.
model = AdaBoostClassifier(n_estimators=trial.params["n_estimators"], learning_rate=trial.params["learning_rate"], random_state=1)
model.fit(X_train, y_train)
AdaBoostClassifier(learning_rate=0.47285383296767425, n_estimators=117,
random_state=1)
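Since the study’s best_params attribute returns the same dictionary as trial.params for the best trial, an equivalent way to rebuild the model is with keyword unpacking:

model = AdaBoostClassifier(**study.best_params, random_state=1)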
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
0.9814814814814815
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.96 1.00 0.98 23
1 1.00 0.95 0.97 19
2 1.00 1.00 1.00 12
accuracy 0.98 54
macro avg 0.99 0.98 0.98 54
weighted avg 0.98 0.98 0.98 54
Finally, we can save the AdaBoost model using the pickle module. We’ll define the filename for our model as adaboost_model.pkl and then open the file in write bytes mode. We’ll then use the dump function to save the model to the file. This means we can load the model at a later date without having to re-train it.
filename = "adaboost_model.pkl"
with open(filename, "wb") as f:
    pickle.dump(model, f)  # the file is closed automatically when the block exits
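To use the saved model later, you can load it back with pickle.load and call predict as before; loaded_model is just an illustrative name.

with open("adaboost_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)  # restores the trained AdaBoostClassifier
loaded_model.predict(X_test)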
Matt Clarke, Thursday, October 13, 2022