AdaBoost is a boosting algorithm that combines multiple weak learners into a strong learner. It is a sequential technique that works by fitting a classifier on the original dataset and then fitting additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
Weak learners are classifiers that perform only slightly better than random guessing, such as a decision stump (a decision tree with a maximum depth of one), which is the default base estimator in scikit-learn's AdaBoostClassifier. The idea behind boosting is to train predictors sequentially, each one trying to correct its predecessor. There are many boosting methods available, including the gradient boosting implementations XGBoost, LightGBM, and CatBoost. AdaBoost was one of the first boosting methods and is still widely used.
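To make the reweighting idea concrete, here's a minimal sketch of a binary AdaBoost-style training loop built from decision stumps. It's an illustration rather than scikit-learn's exact implementation: it assumes labels coded as -1 and +1, and the adaboost_sketch name is ours.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_sketch(X, y, n_rounds=5):
    # Illustration only: assumes binary labels coded as -1 and +1
    n = len(y)
    weights = np.full(n, 1.0 / n)  # start with uniform instance weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)  # the weak learner
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = weights[pred != y].sum()  # weighted training error
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # this learner's vote
        weights *= np.exp(-alpha * y * pred)  # up-weight misclassified rows
        weights /= weights.sum()  # renormalise so the weights sum to one
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

The ensemble's prediction is then the sign of the alpha-weighted sum of the stumps' votes, so later stumps, trained on the re-weighted data, specialise in the cases earlier stumps got wrong.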
For this project we’ll be using scikit-learn and the Optuna hyperparameter optimization library. Open a Jupyter notebook and install the scikit-learn and Optuna packages using the following commands:
!pip3 install scikit-learn
!pip3 install optuna
Next, we’ll load the packages that we’ll be using in this project. We’ll be using the wine dataset from scikit-learn as this avoids the need to do any data preprocessing. We’ll also be using the train_test_split function to split the data into training and testing sets. We’ll use the AdaBoostClassifier class to create the AdaBoost model. Finally, we’ll use the accuracy_score function to evaluate the model and Pickle to save the model.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.datasets import load_wine
from sklearn.ensemble import AdaBoostClassifier
import optuna
from optuna.samplers import TPESampler
import pickle
We’ll load the wine dataset from scikit-learn and split the data into training and testing sets. The return_X_y parameter is set to True so that we get back a tuple containing the data and the target labels. We’ll also set the as_frame parameter to True so that we get the data as a Pandas DataFrame.
X, y = load_wine(return_X_y=True, as_frame=True)
X.sample(5)
| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 162 | 12.85 | 3.27 | 2.58 | 22.0 | 106.0 | 1.65 | 0.60 | 0.60 | 0.96 | 5.58 | 0.87 | 2.11 | 570.0 |
| 102 | 12.34 | 2.45 | 2.46 | 21.0 | 98.0 | 2.56 | 2.11 | 0.34 | 1.31 | 2.80 | 0.80 | 3.38 | 438.0 |
| 108 | 12.22 | 1.29 | 1.94 | 19.0 | 92.0 | 2.36 | 2.04 | 0.39 | 2.08 | 2.70 | 0.86 | 3.02 | 312.0 |
| 14 | 14.38 | 1.87 | 2.38 | 12.0 | 102.0 | 3.30 | 3.64 | 0.29 | 2.96 | 7.50 | 1.20 | 3.00 | 1547.0 |
| 126 | 12.43 | 1.53 | 2.29 | 21.5 | 86.0 | 2.74 | 3.15 | 0.39 | 1.77 | 3.94 | 0.69 | 2.84 | 352.0 |
The y variable contains the target labels. We can use the value_counts method to examine the distribution of the target labels. This will tell us whether the dataset is balanced or not.
y.value_counts()
1 71
0 59
2 48
Name: target, dtype: int64
Next we need to use the X and y data to create training and testing sets. We’ll use 30% of the data for testing by setting the test_size parameter to 0.3. We’ll also set the random_state parameter to 1 so that we get the same split each time we run the code.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
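We’ll stick with this simple split so the outputs below are reproducible as shown. As an aside, because the class counts we saw earlier are slightly uneven, you could pass the stratify parameter so that both splits preserve the class proportions:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y  # preserve class proportions in both splits
)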
Now we can create our AdaBoost classification model using the AdaBoostClassifier class. We’ll set the n_estimators parameter to 100 and the random_state parameter to 1 so that we get the same results each time we run the code. We’ll then run the fit method to train the model on the training data.
model = AdaBoostClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)
AdaBoostClassifier(n_estimators=100, random_state=1)
We can now use the predict method to make predictions on the test data by passing the X_test data to the method. We will assign the predictions to the y_pred variable. These get returned as a NumPy array.
y_pred = model.predict(X_test)
y_pred
array([2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2,
2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 1])
We can use the accuracy_score function to evaluate the model. This compares the predicted values with the actual values and returns the proportion of correct predictions as a value between 0 and 1. We’ll also use the classification_report function to get a more detailed breakdown of the model’s performance. The initial results aren’t particularly great, suggesting that we might be able to get a big improvement by tuning AdaBoost’s hyperparameters.
print('Accuracy score', accuracy_score(y_test, y_pred))
Accuracy score 0.5370370370370371
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.00 0.00 0.00 23
1 0.42 0.89 0.58 19
2 0.86 1.00 0.92 12
accuracy 0.54 54
macro avg 0.43 0.63 0.50 54
weighted avg 0.34 0.54 0.41 54
Like other models, AdaBoost has hyperparameters that can be tuned to improve the model’s performance. In the past, we have used a grid search to tune the hyperparameters. However, a grid search exhaustively tests every combination of values, which can be a time-consuming process. Optuna is a hyperparameter optimization library that uses informed sampling to concentrate on promising regions of the search space, so it typically finds good values in far fewer trials.
To use Optuna, we need to define an objective function that tests a range of model hyperparameter values, fits the model to the training data, and returns a score for Optuna to optimize. As our objective is to maximize the model’s accuracy, we’ll use the accuracy_score function to evaluate the model. The AdaBoost hyperparameters we’ll tune are n_estimators and learning_rate.
def objective(trial):
    n_estimators = trial.suggest_int("n_estimators", 50, 500)
    learning_rate = trial.suggest_float("learning_rate", 0.001, 1.0, log=True)
    model = AdaBoostClassifier(n_estimators=n_estimators, learning_rate=learning_rate, random_state=1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred)
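One caveat with this objective: each trial is scored on the test set, so the chosen hyperparameters are indirectly fitted to it and the final score can be optimistic. A common alternative, sketched below assuming 5-fold cross-validation, is to score trials on the training data only using scikit-learn’s cross_val_score. We’ll keep the simpler objective above so the outputs below match.

from sklearn.model_selection import cross_val_score

def objective_cv(trial):
    n_estimators = trial.suggest_int("n_estimators", 50, 500)
    learning_rate = trial.suggest_float("learning_rate", 0.001, 1.0, log=True)
    model = AdaBoostClassifier(n_estimators=n_estimators, learning_rate=learning_rate, random_state=1)
    # Average accuracy across folds of the training data; the test set stays untouched
    return cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()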
Next we’ll create an Optuna study and run it using the TPESampler sampler. We’ll set the direction parameter to maximize as we want to maximize the model’s accuracy. We’ll also set the n_trials parameter to 100 so that Optuna will test 100 different combinations of hyperparameter values. Once the study has run, we can print the number of finished trials, the best accuracy score, and the best hyperparameter values.
sampler = TPESampler(seed=1)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=100)
print("Number of finished trials: ", len(study.trials))
print("Best trial:")
trial = study.best_trial
print(" Value: ", trial.value)
print(" Params: ")
for key, value in trial.params.items():
    print(" {}: {}".format(key, value))
Number of finished trials: 100
Best trial:
Value: 0.9814814814814815
Params:
n_estimators: 117
learning_rate: 0.47285383296767425
We can now re-fit the model using the best hyperparameter values, which we can retrieve from the params attribute of the best trial. We’ll then use the fit method to re-fit the model on the training data, use the predict method to make predictions on the test data, and use the accuracy_score function to evaluate the model. As you can see, the accuracy score has improved significantly, from just 0.537 to 0.981.
model = AdaBoostClassifier(n_estimators=trial.params["n_estimators"], learning_rate=trial.params["learning_rate"], random_state=1)
model.fit(X_train, y_train)
AdaBoostClassifier(learning_rate=0.47285383296767425, n_estimators=117,
random_state=1)
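Since the study’s best_params attribute returns the same dictionary as trial.params for the best trial, an equivalent way to rebuild the model is with keyword unpacking:

model = AdaBoostClassifier(**study.best_params, random_state=1)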
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
0.9814814814814815
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.96 1.00 0.98 23
1 1.00 0.95 0.97 19
2 1.00 1.00 1.00 12
accuracy 0.98 54
macro avg 0.99 0.98 0.98 54
weighted avg 0.98 0.98 0.98 54
Finally, we can save the AdaBoost model using the pickle module. We’ll define the filename for our model as adaboost_model.pkl and then open the file in write bytes mode. We’ll then use the dump function to save the model to the file. This means we can load the model at a later date without having to re-train it.
filename = "adaboost_model.pkl"
with open(filename, "wb") as f:
    pickle.dump(model, f)  # the file is closed automatically when the block exits
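To use the saved model later, you can load it back with pickle.load and call predict as before; loaded_model is just an illustrative name.

with open("adaboost_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)  # restores the trained AdaBoostClassifier
loaded_model.predict(X_test)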
Matt Clarke, Thursday, October 13, 2022