How to create a classification model using XGBoost in Python

Picture by Andrea Piacquadio, Pexels.


XGBoost, or Extreme Gradient Boosting, is a decision-tree-based machine learning algorithm that uses a process called boosting to improve performance. Since its introduction, it has become one of the most effective machine learning algorithms and often produces results that outperform other algorithms, such as logistic regression, random forests and regular decision trees.

XGBoost is available for various languages, including Python, and it integrates nicely with scikit-learn, the machine learning framework commonly used by Python data scientists. It can be used to solve both classification and regression problems, so it is suitable for the vast majority of common data science challenges.

In this tutorial, I’ll show you how you can create a really basic XGBoost model to solve a classification problem, including all the Python code required. Let’s get started.

Understanding classification models

Classification algorithms, or classifiers as they’re also known, fall into the supervised learning branch of machine learning. As the name suggests, these predictive models are designed to determine the class to which a given subject belongs. Since they use supervised learning, they require labeled training data that includes a column containing their class.

The basic classification modeling process involves obtaining a dataset, creating features of independent variables, and using them to predict a dependent variable or target class. Most classification datasets require some preparation before they can be used by classifiers, and also usually require the creation of additional features through a process called feature engineering. However, in this project we’ll use an example dataset from the Python sklearn package that is ready to use as it is.

Once the data has been loaded, the model is trained on a subset of the full dataset. The model learns which of the independent variables or features are associated with the target variable or class, and iterates over the data, progressively becoming more accurate at making predictions. Once trained, the classification model can be evaluated to assess its accuracy and used to make predictions on unlabeled data.

Load the packages

First, open a Jupyter notebook and import the packages below. If you don’t have XGBoost installed, you can install it from the PyPI package repository by entering the command pip3 install xgboost in your terminal or !pip3 install xgboost in a cell in your Jupyter notebook. We’ll need to use the Pandas package, plus the train_test_split and accuracy_score components from sklearn, as well as the wine dataset.

As we’re building a classification model, it’s the XGBClassifier class we need to load from xgboost. XGBClassifier is one of the most effective classification algorithms. It often produces state-of-the-art predictions and has featured in many winning entries in machine learning competitions.

!pip3 install xgboost

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine

Load the data

Next, we’ll import the dataset into a Pandas dataframe. Ordinarily, you’d load your own data and undertake some feature engineering and data cleansing to prepare it for your model. However, the datasets included within sklearn are designed for rapid model testing, so they don’t need any preprocessing. If you print a sample() of the X and y data, you’ll be able to check out the features included.

X, y = load_wine(return_X_y=True, as_frame=True)

The X dataframe contains the features we’ll be using to train our XGBoost model and is normally referred to with a capital X. This “feature set” includes a range of chemical characteristics of various types of wine. We want our model to examine these characteristics and learn how they are associated with the target variable, which is referred to with a lowercase y. It’s the y column that contains the labels we will use to train our classifier.

X.sample(5)

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline
27	13.30	1.72	2.14	17.0	94.0	2.40	2.19	0.27	1.35	3.95	1.02	2.77	1285.0
65	12.37	1.21	2.56	18.1	98.0	2.42	2.65	0.37	2.08	4.60	1.19	2.30	678.0
97	12.29	1.41	1.98	16.0	85.0	2.55	2.50	0.29	1.77	2.90	1.23	2.74	428.0
79	12.70	3.87	2.40	23.0	101.0	2.83	2.55	0.43	1.95	2.57	1.19	3.13	463.0
161	13.69	3.26	2.54	20.0	107.0	1.83	0.56	0.50	0.80	5.88	0.96	1.82	680.0

The aim of our classifier is to predict, from the chemical characteristics of each wine, which of three possible classes it belongs to: 0, 1, or 2. In this dataset, the y data is stored separately from X, but usually these would be merged in a single dataframe.

This is a multiclass classification problem, which differs slightly from binary classification. Binary classification predicts one of two possible outcomes, while multiclass classification predicts one of three or more classes.

y.sample(5)

  1
   0
   0
  1
   0
Name: target, dtype: int64
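Before modelling, it can help to confirm how many classes the target contains and how the rows are spread across them. The snippet below is a minimal sketch of that check, using only the load_wine() call from this tutorial plus standard Pandas methods. The variable names class_counts and n_classes are my own.

```python
from sklearn.datasets import load_wine

# Load the features and target exactly as in the tutorial.
X, y = load_wine(return_X_y=True, as_frame=True)

# Number of rows and feature columns in the feature set.
print(X.shape)

# Count how many wines fall into each class, ordered by class label.
class_counts = y.value_counts().sort_index()
print(class_counts)

# More than two distinct labels means this is a multiclass problem.
n_classes = y.nunique()
print(n_classes)
```

If one class turned out to be much rarer than the others, accuracy alone would be a less reliable measure of performance, so this is a useful thing to check early.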

Split into training and test datasets

The next step is to take our X and y datasets and split them up randomly into a training dataset and a test (or validation) dataset to train and test the classifier. We can do this via the train_test_split() function from scikit-learn. Using the test_size argument we can assign 30% of the data to be used for validation, with the other 70% used for training. The random_state argument takes any integer value and means we get reproducible results each time we run the model.

Following the split, our training data is stored in X_train and y_train, and our test data is stored in X_test and y_test. As the name suggests, it’s the X_train data that will be used to train the model. The X_test data is not used during training (it is “held out”), and is instead used after training to evaluate the model and assess its accuracy using performance evaluation metrics.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
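To see what the split actually produced, you can check the size of each subset and confirm that random_state makes it repeatable. This is a small sketch under the same arguments used above. The stratify=y call at the end is an optional variation that is not part of this tutorial, and the names same_split, y_train_strat and y_test_strat are my own.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)

# Same arguments as the tutorial: 30% held out, fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
print(len(X_train), len(X_test))

# Re-running with the same random_state selects exactly the same rows.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=1
)
same_split = X_test.index.equals(X_test_again.index)
print(same_split)

# Optional variation (not used in this tutorial): stratify=y keeps the
# class proportions similar in the training and test subsets.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(y_test_strat.value_counts().sort_index())
```

Stratifying is worth considering on small datasets like this one, where a purely random split can leave one class under-represented in the test data.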

Fit the model

Now that the data has been prepared, we can define the configuration of our XGBClassifier model. There are many options you can pass to the model, which can be tweaked or “tuned” to help generate more accurate results - a process called hyperparameter tuning.

For now, we’ll fit a so-called “base model”, which has barely any configuration options. This will allow us to see what performance is like straight out of the box. After defining the model parameters, we assign the output to an object called model.

model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')

Next, we’ll use the fit() method of our model object to train the model on our training data. The model is given the randomly selected 70% portion of the dataset we split off above, with the X and y data passed separately. When fit() runs, XGBoost builds a sequence of decision trees, with each new tree trained to correct the errors made by the trees before it, so the combined predictions become progressively more accurate. By default it runs 100 boosting rounds, which is the n_estimators value shown in the output below.

model.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='mlogloss',
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=16,
              num_parallel_tree=1, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)
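If the idea of boosting feels abstract, the toy example below may help. It is a deliberately simplified sketch in plain Python, using a made-up one-dimensional regression problem and one-split “stumps” instead of full trees. It is not XGBoost’s actual algorithm, and every name and value in it is my own. It only shows the core loop: fit a small model to the current errors, add a damped version of it to the ensemble, and repeat.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best predicts the residuals."""
    best = None
    # Skip the smallest x so both sides of the split are never empty.
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def mean_squared_error(targets, predictions):
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]  # made-up target values

learning_rate = 0.3  # plays the same role as learning_rate in the output above
predictions = [0.0] * len(xs)
errors = [mean_squared_error(ys, predictions)]

for _ in range(20):  # 20 boosting rounds for this toy example
    # Each round targets what the ensemble still gets wrong.
    residuals = [y - p for y, p in zip(ys, predictions)]
    threshold, left_value, right_value = fit_stump(xs, residuals)
    predictions = [
        p + learning_rate * (left_value if x < threshold else right_value)
        for x, p in zip(xs, predictions)
    ]
    errors.append(mean_squared_error(ys, predictions))

print(round(errors[0], 3), round(errors[-1], 3))
```

The training error shrinks with every round because each new stump is fitted to whatever the earlier ones left unexplained. XGBoost applies the same principle with deeper trees, regularisation and a loss function suited to classification.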

Generate predictions

Finally, we can use our model trained on the training data to make predictions on the test or validation dataset using the predict() function. This takes only the X data. We’re not providing (or “holding out”) the y data containing the answer as we want to assess how well our model makes predictions on data it has not previously seen.

The convention when generating predictions is to assign the array or matrix returned to a variable called y_pred. Printing this shows the predictions themselves. With a little bit of extra Python code, you can join the y_pred predictions back to the X data to see the features the model used and the predictions made, as well as the actual y value.

y_pred = model.predict(X_test)

y_pred

array([2, 1, 0, 1, 0, 2, 1, 0, 2, 1, 0, 0, 1, 0, 1, 1, 2, 0, 1, 0, 0, 1,
       2, 0, 0, 2, 0, 0, 0, 2, 1, 2, 2, 0, 1, 1, 1, 0, 1, 0, 0, 1, 2, 0,
       0, 0, 1, 0, 0, 0, 1, 2, 2, 0])
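The “little bit of extra Python code” for joining predictions back to the features could look something like this. It is only a sketch: join_predictions() is a hypothetical helper of my own, and the small dataframe below uses made-up values standing in for X_test, y_test and y_pred so the example runs on its own.

```python
import pandas as pd


def join_predictions(X_test, y_test, y_pred):
    """Return a copy of X_test with actual and predicted classes appended."""
    results = X_test.copy()
    results["actual"] = y_test        # a Series, aligned on the shared index
    results["predicted"] = y_pred     # an array or list, assigned by position
    results["correct"] = results["actual"] == results["predicted"]
    return results


# Tiny made-up stand-ins for X_test, y_test and y_pred.
X_demo = pd.DataFrame(
    {"alcohol": [13.3, 12.4, 13.7], "hue": [1.02, 1.19, 0.96]},
    index=[27, 65, 161],
)
y_demo = pd.Series([0, 1, 2], index=[27, 65, 161], name="target")
pred_demo = [0, 1, 1]

results = join_predictions(X_demo, y_demo, pred_demo)
print(results)

# Show only the rows the model got wrong.
print(results[~results["correct"]])
```

With the real objects from this tutorial you would call join_predictions(X_test, y_test, y_pred). Filtering on the correct column is a quick way to inspect the misclassified wines.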

Evaluate model performance

To evaluate the performance of our model on predicting the class of wines it has not previously seen, we can use the accuracy_score() function. This takes two values: the original y_test data containing the actual result and the y_pred predictions array containing the predicted result.

The accuracy score of the model is calculated by dividing the number of correct predictions by the number of total predictions. We get back an accuracy score of 0.96 or 96%, which is pretty impressive for an un-tuned model. If you want to, you can also save your model using Pickle to allow it to be re-used without the need for further training.

accuracy = accuracy_score(y_test, y_pred)
accuracy

0.9629629629629629
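To make the calculation concrete, here is a small plain-Python sketch of what an accuracy score works out. The accuracy() function and the short label lists are made up for illustration and are not the scikit-learn implementation. Since y_pred above contains 54 predictions, a score of 0.9629... corresponds to 52 correct predictions out of 54.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Share of predictions that match the actual class."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Made-up labels: five of the six predictions are correct.
actual_demo = [0, 1, 2, 1, 0, 2]
predicted_demo = [0, 1, 2, 2, 0, 2]
print(accuracy(actual_demo, predicted_demo))

# Count each (actual, predicted) pair to see where the mistakes fall.
confusion = Counter(zip(actual_demo, predicted_demo))
print(confusion)

# The tutorial's score: 52 correct predictions out of 54 test rows.
print(52 / 54)
```

Counting the (actual, predicted) pairs is the idea behind a confusion matrix, which shows which classes are being mistaken for one another rather than just how often the model is right.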

Matt Clarke, Saturday, May 29, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.