How to use bagging, boosting, and stacking in ensembles

Stacking... Iva Rajović, Unsplash.


Ensemble models combine the predictions of several different models to produce a single prediction, often with better results than any single model achieves alone. There are several different methods for creating ensemble models, but they fall into three main categories: bagging, boosting, and stacking (or voting).

Here, we’re going to use scikit-learn to examine these three ensemble modeling approaches to show how they work. Although this will be a fairly simple introduction, you should then be able to apply the same methodologies to other machine learning classification models you build.

Load the packages

For this project we’ll be using Pandas and a range of scikit-learn modules. These include helper functions from the model_selection module, models from the tree, svm, linear_model and ensemble modules, plus the popular XGBClassifier from XGBoost for good measure. Any packages you don’t have can be installed by entering pip3 install package-name in your terminal.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier

# Limit how many columns Pandas shows when displaying a dataframe
pd.set_option('display.max_columns', 6)

Load the data

You can use any classification dataset you like for this project. I’m using the breast cancer dataset here, as it’s built into scikit-learn and doesn’t require any special feature engineering or cleaning before it can be used, so we can focus instead on the models themselves. Once the X and y data have been loaded, I’ve passed them into the scikit-learn train_test_split() function to create our training and test datasets.

from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.30, 
                                                    random_state=1)

Bagging

The first ensemble modeling technique we’ll take a look at is called bagging. This is short for “bootstrap aggregation”. It uses a parallel set of estimators, each of which overfits the bootstrap sample of the data it is trained on, and then averages their results to obtain a better overall prediction.
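To make the two halves of the name concrete, here is a minimal sketch in plain Python. It is not how scikit-learn implements bagging internally, and the helper names bootstrap_indices and majority_vote are hypothetical, introduced only for this illustration.

```python
import random
from collections import Counter


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (the "bootstrap" step)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def majority_vote(predictions):
    """Return the most common predicted label (the "aggregation" step)."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(1)
sample = bootstrap_indices(1000, rng)

# Sampling with replacement repeats some rows and leaves others out,
# so each estimator in the ensemble sees a slightly different dataset.
unique_share = len(set(sample)) / len(sample)
print(f"Share of distinct rows in one bootstrap sample: {unique_share:.2f}")

# Hypothetical class predictions from five estimators for a single row.
print(majority_vote([1, 0, 1, 1, 0]))
```

On average a bootstrap sample contains roughly 63% of the distinct rows in the original data, which is where the variety between the individual estimators comes from.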

Random Forest classifier

The Random Forest is one of the most widely used bagging classifiers. This creates an ensemble of randomised decision trees - hence Random Forest model. Random Forest models are useful because they are fast both to train and to generate predictions, partly because their trees can be built in parallel, rather than iteratively. The multiple trees they use internally can also provide probabilistic classifications (via the predict_proba() function), so you get a score indicating the strength of the prediction.

cv = KFold(n_splits=10)
model = RandomForestClassifier(n_estimators=100, max_features=3)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()

0.9548717948717949
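The probabilistic output mentioned above can be inspected directly. This is a small sketch that reloads the same data and split so it runs on its own; the random_state on the model and the choice to look at the first five test rows are assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# random_state is an addition here so that repeated runs build the same forest.
model = RandomForestClassifier(n_estimators=100, max_features=3, random_state=1)
model.fit(X_train, y_train)

# One row per sample and one column per class, in the order of model.classes_.
probabilities = model.predict_proba(X_test.iloc[:5])
print(model.classes_)
print(probabilities.round(2))
```

Each row sums to one, and a value close to 0 or 1 suggests that most of the trees agreed on the class.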

Bagged Decision Tree classifier

The BaggingClassifier is what’s known as a meta-estimator. It allows you to create an ensemble model using any scikit-learn compatible classifier, simply by passing an instantiated classifier to the estimator argument (named base_estimator before scikit-learn 1.2). This could be anything - DecisionTreeClassifier(), Perceptron(), or XGBClassifier(). Alongside this, the n_estimators value defines the number of estimators to create.

cv = KFold(n_splits=10)
estimator = DecisionTreeClassifier()
# The estimator argument was named base_estimator before scikit-learn 1.2
model = BaggingClassifier(estimator=estimator, n_estimators=100)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()

0.9548076923076924
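To show that the meta-estimator really is agnostic about what it wraps, here is a hedged sketch that bags a different kind of model. The scaled logistic regression, the smaller n_estimators, the five folds and the random_state are all arbitrary choices for illustration, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Scale the features first, as logistic regression is sensitive to feature scale.
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# The base estimator is passed positionally here because its keyword name
# differs between scikit-learn versions.
model = BaggingClassifier(estimator, n_estimators=10, random_state=1)

cv = KFold(n_splits=5)
score = cross_val_score(model, X_train, y_train, cv=cv)
print(score.mean())
```

Bagging tends to help most with high-variance models such as deep decision trees, so a stable model like this one may gain little from it.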

Extra Trees Classifier

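Like the Random Forest classifier, the Extra Trees (extremely randomised trees) classifier considers a random subset of features at each split, with the max_features argument setting the size of that subset. However, instead of searching for the best threshold for each candidate feature, it draws the thresholds at random and keeps the best of those. This extra randomness reduces variance but slightly increases bias.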

cv = KFold(n_splits=10)
model = ExtraTreesClassifier(n_estimators=100, max_features=3)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()

0.9623717948717948

Boosting

Next, we’ll check out the popular boosting methodology. Boosting is similar to bagging in that it combines individual classifiers of the same type and produces its final prediction through a majority vote or an averaged numeric prediction. However, where bagging classifiers get trained in parallel, boosting classifiers are trained iteratively.

Boosting models use the data misclassified in previous training iterations to influence the next one, passing weightings on to the next model so that it concentrates on the examples its predecessors got wrong. It’s a clever technique and can often be very effective.
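As a rough illustration of how those weightings change, here is a single round of the classic binary AdaBoost update in plain Python. The labels and predictions are made up for the example, and scikit-learn’s own implementation may differ in detail.

```python
import math

# Hypothetical true labels and one weak learner's predictions for ten samples.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 1, 1, 0]  # the samples at index 2 and 7 are wrong

n_samples = len(y_true)
weights = [1 / n_samples] * n_samples  # every sample starts with equal weight

missed = [truth != guess for truth, guess in zip(y_true, y_pred)]

# Weighted error rate of this weak learner.
error = sum(w for w, miss in zip(weights, missed) if miss)

# The learner's say in the final vote: a lower error gives a larger alpha.
alpha = 0.5 * math.log((1 - error) / error)

# Increase the weights of misclassified samples and decrease the rest,
# then rescale so that the weights sum to one again.
updated = [w * math.exp(alpha if miss else -alpha) for w, miss in zip(weights, missed)]
total = sum(updated)
weights = [w / total for w in updated]

for index, (weight, miss) in enumerate(zip(weights, missed)):
    print(index, "missed" if miss else "ok", round(weight, 4))
```

After the update the two misclassified samples carry half of the total weight between them, so the next weak learner is pushed towards getting them right.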

AdaBoost classifier

The AdaBoost classifier is the main boosting classifier in the scikit-learn package. This algorithm first initialises equal weights on the training samples, then updates them after each round according to the errors the latest estimator made. The n_estimators argument sets the maximum number of rounds, with performance generally improving as it goes. The learning_rate argument (a positive float, 1.0 by default) scales the contribution of each estimator, and so how aggressively the weights get updated between iterations. In the example below, moving this up from 0.1 to 1.0 gives us an extra increase in our score.

cv = KFold(n_splits=10)
model = AdaBoostClassifier(n_estimators=150, learning_rate=0.1)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()

0.9599358974358975

cv = KFold(n_splits=10)
model = AdaBoostClassifier(n_estimators=150, learning_rate=1.0)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()

0.9649358974358974

Gradient Boosting Machines

The Gradient Boosting Machine, or Gradient Boosting Classifier, is implemented in scikit-learn as the GradientBoostingClassifier class. This method can handle many data types and is fairly robust to outliers. Like the other boosting models here, it’s sequential rather than parallel, so can slow down on larger datasets - Random Forests are often faster.

Internally, gradient boosting usually uses decision trees and creates a prediction model based on an ensemble of what are known as “weak learners”. With each iteration it adds a new tree fitted to the gradient of its internal cost function, so that each tree corrects some of the error left by the trees before it. Like AdaBoostClassifier, it can be tweaked using the learning_rate and some other parameters to improve performance.

cv = KFold(n_splits=10)
model = GradientBoostingClassifier(n_estimators=150, learning_rate=0.01)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()

0.9471794871794872

cv = KFold(n_splits=10)
model = GradientBoostingClassifier(n_estimators=150, learning_rate=1.0)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()

0.9548717948717949
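The idea of each weak learner correcting the errors of those before it can be sketched in a few lines of plain Python. Note the assumptions: this toy version does regression with squared error on made-up one-feature data, rather than classification as in GradientBoostingClassifier, and fit_stump is a hypothetical helper that stands in for a real decision tree.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a "weak learner") to the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mean_squared_error(targets, predictions):
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical one-feature regression data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

learning_rate = 0.5
predictions = [sum(ys) / len(ys)] * len(ys)  # start from the mean of the targets
errors = [mean_squared_error(ys, predictions)]

for _ in range(20):
    # For squared error, the negative gradient is proportional to the residual.
    residuals = [y - p for y, p in zip(ys, predictions)]
    stump = fit_stump(xs, residuals)
    # Add a shrunken version of the new learner to the running prediction.
    predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    errors.append(mean_squared_error(ys, predictions))

print(round(errors[0], 4), round(errors[-1], 4))
```

A smaller learning_rate makes each step more cautious, which is why low values usually need a larger n_estimators to reach the same training error.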

XGBoost classifier

Although it’s not officially part of scikit-learn yet, XGBClassifier from the XGBoost package is arguably one of the most popular and powerful ensemble models. XGBClassifier is an implementation of the Extreme Gradient Boosting algorithm, which is a favourite of most competitive machine learning enthusiasts. It seems to be very popular in ecommerce and marketing research and I use it heavily in my own work.

XGBoost uses a technique called “gradient descent” to optimise performance. Rather than re-weighting the training samples, like AdaBoost, XGBoost examines the errors (known as “residuals”) on each training iteration and fits an internal regression model to them. Each iteration adds a new model built using the gradients of the loss, which helps reduce the remaining errors.

Importantly, as XGBoost can run on multiple CPU cores, and on GPUs, it can run much quicker than other boosting algorithms. It has loads of tuning parameters, or hyperparameters, that can be adjusted using GridSearchCV to help further optimise its performance. This again can be sped up considerably if you utilise GPU acceleration.

cv = KFold(n_splits=10)
model = XGBClassifier(n_estimators=150)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()

0.959871794871795
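For a feel of how a grid search over hyperparameters looks, here is a minimal sketch. To keep it dependent on scikit-learn alone, it tunes GradientBoostingClassifier instead of XGBClassifier; since XGBClassifier follows the scikit-learn estimator interface, the same pattern should carry over. The grid values and the three folds are hypothetical and deliberately tiny.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# A deliberately small, hypothetical grid; real searches usually cover more values.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
}

# Every combination in the grid is cross-validated, so the cost grows quickly.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)
```

GridSearchCV also accepts an n_jobs argument to run the fits in parallel, which helps once the grid gets larger.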

Stacking or voting

Finally, we’ll look at the stacking or voting classifier, which is available in scikit-learn via the VotingClassifier class. The voting classifier shares some similarities with the models above, but gives you far more control, since you can use the classifiers of your choice in a special ensemble model known as a stack.

Like the BaggingClassifier we saw earlier, the VotingClassifier is a meta-estimator. As the name suggests, the voting classifier fits the separately defined models in your stack of classifiers and returns a final prediction based on the majority vote of all of the models combined (or, with voting='soft', on the average of their predicted probabilities). Strictly speaking, a “stacked generalisation” goes one step further and trains a final “combiner” model on the predictions of the individual learners; scikit-learn provides this separately as StackingClassifier.
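VotingClassifier supports two ways of combining predictions: voting='hard' (the default) counts predicted labels, while voting='soft' averages predicted probabilities. The plain Python sketch below uses made-up probabilities from three models to show that the two can disagree; hard_vote and soft_vote are hypothetical helpers, not part of scikit-learn.

```python
from collections import Counter


def hard_vote(labels):
    """Pick the label predicted by the most models."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average the class probabilities and pick the most probable class."""
    n_classes = len(probabilities[0])
    averaged = [
        sum(row[c] for row in probabilities) / len(probabilities)
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=averaged.__getitem__)


# Hypothetical [P(class 0), P(class 1)] from three models for a single sample.
probabilities = [[0.45, 0.55], [0.40, 0.60], [0.95, 0.05]]

# Each model's own label is its most probable class.
labels = [max(range(len(row)), key=row.__getitem__) for row in probabilities]

hard = hard_vote(labels)
soft = soft_vote(probabilities)
print(labels, hard, soft)
```

Two lukewarm votes for class 1 win the hard vote, but one very confident vote for class 0 wins the soft vote. Soft voting needs every model in the ensemble to provide predict_proba().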

There’s an important trick to using voting or stacking classifiers, which lies in the selection of the internal models used. Rather than using several models that work in a similar manner and produce similar results, the best results are often achieved by using models that work differently, since their errors are less likely to coincide.

cv = KFold(n_splits=10)

models = []
models.append(('xgb', XGBClassifier(n_estimators=150)))
models.append(('svc', SVC()))
models.append(('extra', ExtraTreesClassifier(n_estimators=100, max_features=3)))

model = VotingClassifier(models)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()

0.9698076923076921
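For comparison, scikit-learn also offers a StackingClassifier, which trains a final estimator on the outputs of the base models rather than counting votes. The sketch below is a minimal example under some assumptions: the choice of base models is illustrative (XGBoost is left out so that only scikit-learn is needed), the SVC inputs are scaled, and logistic regression is used as the final estimator. It also scores the model on the X_test data held out at the start, which the cross-validation above does not touch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Base models that work in different ways (an illustrative selection).
models = [
    ("extra", ExtraTreesClassifier(n_estimators=100, max_features=3, random_state=1)),
    ("svc", make_pipeline(StandardScaler(), SVC())),
    ("tree", DecisionTreeClassifier(random_state=1)),
]

# The final estimator learns how to combine the base models' predictions,
# using out-of-fold predictions produced with 5-fold cross-validation.
model = StackingClassifier(
    estimators=models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
model.fit(X_train, y_train)

# Accuracy on the held-out test data.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Because the final estimator is trained on out-of-fold predictions, it gets a fairer picture of how much to trust each base model than it would from predictions on data those models had already seen.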

Matt Clarke, Saturday, March 13, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.