Ensemble models combine the predictions of several different models to produce a single prediction, often with better results than can be achieved with a single model alone. There are several different methods for creating ensemble models, but they fall into three main categories: bagging, boosting, and stacking (or voting).
Here, we’re going to use scikit-learn to examine these three ensemble modeling approaches to show how they work. Although this will be a fairly simple introduction, you should then be able to apply the same methodologies to other machine learning classification models you build.
For this project we’ll be using Pandas and a range of scikit-learn tools. These include several utilities from the model_selection module, models from the tree, svm, linear_model, and ensemble modules, and the popular XGBClassifier from XGBoost for good measure. Any packages you don’t have can be installed by entering pip3 install package-name in your terminal.
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier
You can use any classification dataset you like for this project. I’m using the breast cancer dataset here, as it’s built into scikit-learn and doesn’t require any special feature engineering or cleaning before it can be used, so we can focus instead on the models themselves. Once the X and y data have been loaded, I’ve passed them into the scikit-learn train_test_split() function to create our training and test datasets.
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
The first ensemble modeling technique we’ll take a look at is called bagging. This is short for “bootstrap aggregation”. It uses a parallel set of estimators, each of which overfits the data, and then averages their results to obtain a better overall prediction.
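To make the idea concrete, here is a minimal hand-rolled sketch of bootstrap aggregation. It is an illustration under stated assumptions rather than how scikit-learn implements it internally: the number of trees (25), the random seeds, and the variable names are all illustrative choices of mine.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

rng = np.random.default_rng(1)  # seed chosen only for repeatability
n_estimators = 25               # illustrative value (odd, so votes cannot tie)
trees = []

for _ in range(n_estimators):
    # Bootstrap: sample row indices with replacement, same size as the training set
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # An unpruned tree will happily overfit its own bootstrap sample
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Aggregate: average the 0/1 predictions and threshold at 0.5 (a majority vote)
votes = np.stack([tree.predict(X_test) for tree in trees])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

single_acc = (trees[0].predict(X_test) == y_test).mean()
bagged_acc = (bagged_pred == y_test).mean()
print(f"single tree: {single_acc:.3f}, bagged vote: {bagged_acc:.3f}")
```

Each tree sees a slightly different resampling of the training data, so their individual errors tend to differ, and the vote smooths them out.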
The Random Forest is one of the most widely used bagging classifiers. This creates an ensemble of randomised decision trees, hence the name Random Forest. Random Forest models are useful because they are fast both to train and to generate predictions, partly because the trees can be built in parallel rather than iteratively. The multiple trees they use internally can also provide probabilistic classifications (via the predict_proba() method), so you get a score indicating the strength of the prediction.
cv = KFold(n_splits=10)
model = RandomForestClassifier(n_estimators=100, max_features=3)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()
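The probabilistic output from predict_proba() mentioned above can be inspected with a short sketch like the one below. The random_state value is an assumption added only so the run is repeatable; the other settings mirror the example above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# random_state is an illustrative addition for repeatability
model = RandomForestClassifier(n_estimators=100, max_features=3, random_state=1)
model.fit(X_train, y_train)

# One row per sample, one column per class (ordered as in model.classes_)
proba = model.predict_proba(X_test)
labels = model.predict(X_test)

print(model.classes_)
print(np.round(proba[:5], 2))
print(labels[:5])
```

Each row sums to 1, and the predicted label is the class with the highest averaged probability across the trees.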
BaggingClassifier is what’s known as a meta-estimator. It allows you to create an ensemble model using any scikit-learn compatible classifier, simply by passing an instantiated classifier to the estimator argument (named base_estimator in scikit-learn versions before 1.2). This could be almost anything, for example DecisionTreeClassifier or XGBClassifier. Alongside this, you pass the n_estimators value to define the number of estimators to create.
cv = KFold(n_splits=10)
estimator = DecisionTreeClassifier()
# The estimator= argument was named base_estimator= before scikit-learn 1.2
model = BaggingClassifier(estimator=estimator, n_estimators=100)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()
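Like the Random Forest classifier, the Extra Trees classifier considers a random subset of features at each split, with the max_features argument setting how many. However, rather than searching for the best threshold on each candidate feature, it draws the thresholds at random and keeps the best of those random splits. This extra randomness reduces variance but slightly increases bias.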
cv = KFold(n_splits=10)
model = ExtraTreesClassifier(n_estimators=100, max_features=3)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()
Next, we’ll check out the popular boosting methodology. Boosting is similar to bagging in that it combines individual classifiers of the same type through a majority vote or an averaged numeric prediction. However, where bagging classifiers are trained in parallel, boosting classifiers are trained iteratively.
Boosting models use the misclassified data from previous iterations of training steps to influence the next training steps by passing weightings to the next model, which helps guide them in the right direction by learning from previous mistakes. It’s a clever technique and can often be very effective.
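The re-weighting idea described above can be sketched by hand. This is a simplified version of the classic discrete AdaBoost update using decision stumps, not scikit-learn's own implementation; the number of rounds, the {-1, +1} label recoding, and the variable names are my assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Recode labels from {0, 1} to {-1, +1}, which keeps the update formula compact
y_tr = np.where(y_train == 1, 1, -1)
y_te = np.where(y_test == 1, 1, -1)

n_rounds = 20  # illustrative value
weights = np.full(len(X_train), 1.0 / len(X_train))  # start with equal weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a one-split tree trained on the currently weighted samples
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_train, y_tr, sample_weight=weights)
    pred = stump.predict(X_train)

    # Weighted error rate, clipped to avoid division by zero or log(0)
    err = np.clip(weights[pred != y_tr].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this stump's say in the final vote

    # Increase the weights of misclassified samples, decrease the rest
    weights *= np.exp(-alpha * y_tr * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of the stump votes
scores = sum(a * s.predict(X_test) for a, s in zip(alphas, stumps))
boosted_pred = np.where(scores >= 0, 1, -1)
boosted_acc = (boosted_pred == y_te).mean()
print(f"boosted stumps accuracy: {boosted_acc:.3f}")
```

Samples that keep being misclassified accumulate weight, so later stumps are pushed to focus on them.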
The AdaBoost classifier is one of the main boosting classifiers in the scikit-learn package. This algorithm first initialises equal weights on the training samples, then after each round it increases the weights of the samples that were misclassified. The n_estimators argument sets the maximum number of boosting rounds, with performance generally improving as it goes. The learning_rate argument (a positive float, 1.0 by default) scales how much each new estimator contributes, and so how aggressively the weights get updated between iterations. In the example below, moving this up from 0.1 to 1.0 gives us an extra increase in our score.
cv = KFold(n_splits=10)
model = AdaBoostClassifier(n_estimators=150, learning_rate=0.1)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()
cv = KFold(n_splits=10)
model = AdaBoostClassifier(n_estimators=150, learning_rate=1.0)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()
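To see how the score changes as boosting rounds are added, one option is the staged_score() method, which yields the accuracy after each iteration. The sketch below scores on the held-out test split rather than with cross-validation, and the random_state is an assumption added for repeatability.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# random_state is an illustrative addition for repeatability
model = AdaBoostClassifier(n_estimators=150, learning_rate=1.0, random_state=1)
model.fit(X_train, y_train)

# staged_score yields the test accuracy after 1, 2, 3, ... boosting rounds
staged = list(model.staged_score(X_test, y_test))

# Boosting can stop early, so only print checkpoints that were actually reached
for n in (1, 10, 50, 150):
    if n <= len(staged):
        print(f"after {n:>3} rounds: {staged[n - 1]:.3f}")
```

This makes it easier to judge whether extra estimators are still helping or whether the score has already levelled off.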
The Gradient Boosting Machine, or Gradient Boosting Classifier, is implemented in scikit-learn as the GradientBoostingClassifier class. This method can handle many data types and is fairly robust to outliers. Like the other boosting models here, it’s sequential rather than parallel, so it can slow down on larger datasets; Random Forests are often faster.
Internally, gradient boosting usually uses decision trees and creates a prediction model based on an ensemble of what are known as “weak learners”. With each iteration it fits a new tree to the gradient of its internal cost function, so that each tree corrects the errors left by the ones before it. Like AdaBoostClassifier, it can be tweaked using learning_rate and some other parameters to improve performance.
cv = KFold(n_splits=10)
model = GradientBoostingClassifier(n_estimators=150, learning_rate=0.01)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()
cv = KFold(n_splits=10)
model = GradientBoostingClassifier(n_estimators=150, learning_rate=1.0)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()
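The role of the weak learners and the learning rate can be illustrated with a deliberately simplified, hand-written gradient boosting loop for binary classification. This is a sketch under my own assumptions (log loss, shallow regression trees fitted directly to the residuals, 50 rounds, a learning rate of 0.5); the real GradientBoostingClassifier refines the leaf values further, so results will not match it exactly.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

def sigmoid(z):
    """Map raw scores (log-odds) to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.5  # illustrative value
n_rounds = 50        # illustrative value

# Start from a constant prediction: the log-odds of the positive class
p0 = y_train.mean()
f0 = np.log(p0 / (1 - p0))
raw_train = np.full(len(X_train), f0)
trees = []

for _ in range(n_rounds):
    # Residuals = negative gradient of the log loss w.r.t. the raw score
    residuals = y_train - sigmoid(raw_train)
    # Weak learner: a shallow regression tree fitted to those residuals
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_train, residuals)
    # Take a small step in the direction the tree suggests
    raw_train += learning_rate * tree.predict(X_train)
    trees.append(tree)

# Predict by summing the initial score and every tree's scaled contribution
raw_test = f0 + learning_rate * sum(tree.predict(X_test) for tree in trees)
gb_pred = (sigmoid(raw_test) >= 0.5).astype(int)
gb_acc = (gb_pred == y_test).mean()
print(f"hand-rolled gradient boosting accuracy: {gb_acc:.3f}")
```

A smaller learning_rate takes more cautious steps and usually needs more rounds to reach the same fit, which is the trade-off the two examples above explore.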
Although it’s a separate library rather than part of scikit-learn, XGBClassifier from the XGBoost package is arguably one of the most popular and powerful ensemble models.
XGBClassifier is an implementation of the Extreme Gradient Boosting algorithm, which is a favourite of most competitive machine learning enthusiasts. It seems to be very popular in ecommerce and marketing research and I use it heavily in my own work.
XGBoost uses a technique called “gradient descent” to optimise performance. Rather than re-weighting the training samples, like AdaBoost, XGBoost examines the errors (known as “residuals”) on each training iteration and fits an internal regression model to them. Each iteration adds a new model built from the gradients of the loss, which helps reduce the remaining errors.
Importantly, as XGBoost can run on multiple CPU cores, and on GPUs, it can often run much quicker than other boosting implementations. It has loads of tuning parameters, or hyperparameters, that can be adjusted using GridSearchCV to help further optimise its performance. This again can be sped up considerably if you utilise GPU acceleration.
cv = KFold(n_splits=10)
model = XGBClassifier(n_estimators=150)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()
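As a rough sketch of the GridSearchCV tuning mentioned above, the example below searches a small grid. To keep it runnable without XGBoost installed, it uses scikit-learn's GradientBoostingClassifier as a stand-in; since XGBClassifier follows the scikit-learn estimator interface, the same pattern should apply to it. The grid values and the five-fold split are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Illustrative grid: every combination below is cross-validated
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.1, 0.5],
}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=1),  # stand-in model
    param_grid=param_grid,
    cv=KFold(n_splits=5),
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The number of fits grows multiplicatively with every parameter added to the grid, which is why faster training matters so much during tuning.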
Finally, we’ll look at the stacking or voting classifier, which is available in scikit-learn via the VotingClassifier() class. The voting classifier shares some similarities with the models above, but gives you far more control, since you can use the classifiers of your choice in a special ensemble model known as a stack.
Like the BaggingClassifier() we saw earlier, the VotingClassifier() is a meta model. As the name suggests, the voting classifier runs the separately defined models in your stack of classifiers and returns a final prediction based on the majority vote of all of the models combined (or, with voting='soft', on their averaged predicted probabilities). A closely related technique, known as “stacked generalisation”, goes one step further and trains a final “combiner” algorithm on the predictions of the individual learners to generate the end prediction; scikit-learn provides this separately as StackingClassifier.
There’s an important trick to using voting or stacking classifiers, which lies in the selection of the internal models used. Rather than using several models that work in similar manners and are producing similar results, the best results are often achieved by using models that work differently.
cv = KFold(n_splits=10)

# Build the list of (name, estimator) pairs that make up the ensemble
models = []
models.append(('xgb', XGBClassifier(n_estimators=150)))
models.append(('svc', SVC()))
models.append(('extra', ExtraTreesClassifier(n_estimators=100, max_features=3)))

model = VotingClassifier(estimators=models)
score = cross_val_score(model, X_train, y_train, cv=cv)
score.mean()
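To show what the majority vote amounts to, the sketch below fits three deliberately different model types, counts their votes by hand, and compares the result with VotingClassifier in hard-voting mode. The choice of models here is my own (LogisticRegression replaces XGBClassifier so the example runs with scikit-learn alone), and the random_state and max_iter values are assumptions for repeatability and convergence.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Three models that work in different ways: linear, kernel-based, tree-based
models = [
    ("logreg", LogisticRegression(max_iter=5000)),
    ("svc", SVC()),
    ("extra", ExtraTreesClassifier(n_estimators=100, max_features=3, random_state=1)),
]

# Manual hard vote: fit a copy of each model, then take the majority label
fitted = [clone(estimator).fit(X_train, y_train) for _, estimator in models]
votes = np.stack([estimator.predict(X_test) for estimator in fitted])
manual_pred = (votes.sum(axis=0) >= 2).astype(int)  # at least 2 of 3 say class 1

# The same thing via the meta model
voter = VotingClassifier(estimators=models, voting="hard")
voter.fit(X_train, y_train)
voting_pred = voter.predict(X_test)

agreement = (manual_pred == voting_pred).mean()
voting_acc = (voting_pred == y_test).mean()
print(f"agreement with manual vote: {agreement:.3f}, accuracy: {voting_acc:.3f}")
```

Because the three models make different kinds of mistakes, a sample misclassified by one of them can still be outvoted by the other two.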
Matt Clarke, Saturday, March 13, 2021