XGBoost, or Extreme Gradient Boosting, is a decision tree based machine learning algorithm that uses a process called boosting to improve performance. Since its introduction, it has become one of the most effective machine learning algorithms and regularly produces results that outperform most other algorithms, such as logistic regression, random forests, and regular decision trees.
XGBoost has packages for various languages, including Python, and it integrates nicely with scikit-learn, the machine learning framework commonly used by Python data scientists. It can be used to solve classification and regression problems, so it is suitable for the vast majority of common data science challenges.
In this tutorial, I’ll show you how you can create a really basic XGBoost model to solve a classification problem, including all the Python code required. Let’s get started.
Classification algorithms, or classifiers as they’re also known, fall into the supervised learning branch of machine learning. As the name suggests, these predictive models are designed to determine the class to which a given subject belongs. Since they use supervised learning, they require labeled training data that includes a column containing the class of each record.
The basic classification modeling process involves obtaining a dataset, creating features from the independent variables, and using them to predict a dependent variable or target class. Most classification datasets require some preparation before they can be used by classifiers, and usually also require the creation of additional features through a process called feature engineering. However, in this project we’ll use an example dataset from the Python sklearn package that is ready to use as it is.
After the data has been imported, the model is trained on a subset of the full dataset. The model will learn to identify which of the independent variables or features are correlated with the target variable or class, and will iterate over the data, progressively becoming more accurate at making predictions. Once trained, the classification model can be evaluated to assess its accuracy and used to make predictions on unlabeled data.
First, open a Jupyter notebook and import the packages below. If you don’t have XGBoost installed, you can install it from the PyPI package repository by entering the command pip3 install xgboost in your terminal or !pip3 install xgboost in a cell in your Jupyter notebook. We’ll need the Pandas package (it is used behind the scenes when the dataset is loaded as a dataframe), plus the train_test_split and accuracy_score components from sklearn, as well as the wine dataset.
As we’re building a classification model, it’s the XGBClassifier class we need to load from xgboost. XGBClassifier is one of the most effective classification algorithms; it often produces state-of-the-art predictions and frequently wins machine learning competitions.
!pip3 install xgboost
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine
Next, we’ll import the dataset into a Pandas dataframe. Ordinarily, you’d load your own data and undertake some feature engineering and data cleansing to prepare it for your model. However, the datasets included with sklearn are designed for rapid model testing, so they don’t need any preprocessing. If you print a sample() of the X and y data, you’ll be able to check out the features and labels included.
X, y = load_wine(return_X_y=True, as_frame=True)
The X dataframe contains the features we’ll be using to train our XGBoost model and is normally referred to with a capital X. This “feature set” includes a range of chemical characteristics of various types of wine. We want our model to examine these characteristics and learn how they are associated with the target variable, which is referred to with a lowercase y. It’s the y column that contains the labels we will use to train our classifier.
The aim of our classifier will be to predict, from the chemical characteristics of each wine, which of three possible classes it belongs to: 0, 1, or 2. In this dataset, the y data is stored separately from X, but usually these would be merged in a single dataframe.
This is a multiclass classification problem, which is a slightly different task to binary classification. Binary classification predicts one of two possible outcomes, while multiclass classification predicts one of three or more classes.
123    1
16     0
20     0
119    1
41     0
Name: target, dtype: int64
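If you’d like to see how the three classes are distributed before training, a quick sketch like the one below should do it. It assumes nothing beyond the same load_wine() call used above; the class_counts name is just my own label.

```python
from sklearn.datasets import load_wine

# Load the features (X) and labels (y) as Pandas objects, as in the tutorial
X, y = load_wine(return_X_y=True, as_frame=True)

# Number of rows and feature columns
print(X.shape)

# Count how many wines fall into each class (class_counts is an illustrative name)
class_counts = y.value_counts().sort_index()
print(class_counts)
```

The wine dataset has 178 rows and 13 feature columns, and the three classes are of slightly different sizes, which is worth knowing before you interpret an accuracy score.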
The next step is to take our X and y datasets and split them randomly into a training dataset and a test (or validation) dataset to train and test the classifier. We can do this via the train_test_split() function from scikit-learn. Using the test_size argument we can assign 30% of the data to be used for validation, with the other 70% used for training. The random_state argument takes any integer value and means we get the same split, and therefore reproducible results, each time we run the code.
Following the split, our training data is stored in X_train and y_train, and our test data is stored in X_test and y_test. As the names suggest, it’s the X_train and y_train data that will be used to train the model. The X_test and y_test data is not used during training (it is “held out”); instead it is used after training to evaluate the model and assess its accuracy using performance evaluation metrics.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
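To check that the split did what we expect, you can compare the sizes of the pieces. The sketch below also shows the optional stratify argument of train_test_split(), which keeps the class proportions similar in both parts. The split in this tutorial does not use it, so treat it as an optional variation; the names with an _s suffix are my own.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)

# Same split as the tutorial: 30% held out, fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
print(len(X_train), len(X_test))

# Optional variation (not used in the tutorial): stratify on y so each class
# keeps a similar share in the training and test parts
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(y_test_s.value_counts(normalize=True).sort_index())
```

Stratifying tends to matter more on small or imbalanced datasets, where a purely random split can leave one class under-represented in the test data.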
Now that the data has been prepared, we can define the configuration of our XGBClassifier model. There are loads of options you can pass to models, which can be tweaked or “tuned” to help generate more accurate results, a process called hyperparameter tuning.
For now, we’ll fit a so-called “base model”, which has barely any configuration options. This will allow us to see what performance is like straight out of the box. After defining the model parameters, we assign the output to an object called model.
model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
Next, we’ll use the fit() function of our model object to train the model on our training data. The model is being given a randomly selected 70% portion of the whole dataset we loaded above, with the X and y data separated. When the fit() function is run, the XGBoost algorithm builds a sequence of decision trees (100 by default, as the n_estimators value in the output below shows), with each new tree trying to correct the errors made by the trees before it, so the predictions become progressively more accurate.
model.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='mlogloss',
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=16,
              num_parallel_tree=1, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)
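To make the idea of boosting a little more concrete, here is a toy sketch in plain Python. It is not how XGBoost is implemented (XGBoost uses regularised trees and a more sophisticated objective); it simply fits a series of one-split “stumps” to a made-up regression dataset, with each stump trained on the errors left by the ones before it. All names and data values here are my own inventions.

```python
def fit_stump(xs, residuals):
    """Pick the threshold that best splits the residuals into two constant values."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Return the mean squared error on the training data after each round."""
    predictions = [sum(ys) / len(ys)] * len(xs)  # start from the mean
    history = []
    for _ in range(n_rounds):
        # Each new stump is fitted to the errors the current ensemble still makes
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return history


# Made-up data with a rough step pattern
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

history = boost(xs, ys)
print(f"MSE after round 1: {history[0]:.3f}, after round 20: {history[-1]:.3f}")
```

The training error shrinks as rounds are added, which is the behaviour the n_estimators and learning_rate settings control in the real model.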
Finally, we can use our model trained on the training data to make predictions on the test or validation dataset using the predict() function. This takes only the X data. We’re not providing (or “holding out”) the y data containing the answer as we want to assess how well our model makes predictions on data it has not previously seen.
The convention when generating predictions is to assign the array or matrix returned to a variable called y_pred. Printing this shows the predictions themselves. With a little bit of extra Python code, you can join the y_pred predictions back to the X data to see the features the model used and the predictions made, as well as the actual classes.
y_pred = model.predict(X_test)
array([2, 1, 0, 1, 0, 2, 1, 0, 2, 1, 0, 0, 1, 0, 1, 1, 2, 0, 1, 0, 0, 1, 2, 0, 0, 2, 0, 0, 0, 2, 1, 2, 2, 0, 1, 1, 1, 0, 1, 0, 0, 1, 2, 0, 0, 0, 1, 0, 0, 0, 1, 2, 2, 0])
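The “little bit of extra Python code” for joining predictions back to the features could look something like the sketch below. To keep it self-contained, I’ve used tiny made-up stand-ins for X_test, y_test, and y_pred (the values are hypothetical, though alcohol and hue are real columns in the wine dataset); the results, actual, predicted, and correct names are my own.

```python
import pandas as pd

# Hypothetical stand-ins for the tutorial's X_test, y_test and y_pred
X_test = pd.DataFrame(
    {"alcohol": [13.2, 12.4, 14.1], "hue": [1.05, 0.98, 0.61]},
    index=[16, 119, 160],
)
y_test = pd.Series([0, 1, 2], index=X_test.index, name="target")
y_pred = [0, 1, 1]  # predict() returns a plain array with no index

# Copy the features so the original test data is left untouched
results = X_test.copy()
results["actual"] = y_test        # aligned on the shared index
results["predicted"] = y_pred     # assigned by position, row by row
results["correct"] = results["actual"] == results["predicted"]
print(results)
```

Filtering on the correct column is a handy way to pull out the rows the model got wrong and look for patterns in them.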
To evaluate the performance of our model on predicting the class of wines it has not previously seen, we can use the accuracy_score() function. This takes two values: the original y_test data containing the actual result and the y_pred predictions array containing the predicted result.
The accuracy score of the model is calculated by dividing the number of correct predictions by the total number of predictions. We get back an accuracy score of 0.96 or 96%, which is pretty impressive for an untuned model. If you want to, you can also save your model using Python’s pickle module to allow it to be re-used without the need for further training.
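The accuracy calculation is simple enough to reproduce by hand. Here is a minimal sketch using made-up labels (y_true and y_hat are hypothetical stand-ins for y_test and y_pred):

```python
# Hypothetical actual classes and predictions
y_true = [0, 1, 2, 1, 0]
y_hat = [0, 1, 1, 1, 0]

# Count the positions where the prediction matches the actual class
correct = sum(t == p for t, p in zip(y_true, y_hat))

# Accuracy is correct predictions divided by total predictions
accuracy = correct / len(y_true)
print(accuracy)  # 4 correct out of 5 gives 0.8
```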
accuracy = accuracy_score(y_test, y_pred)
accuracy
Matt Clarke, Saturday, May 29, 2021