The Decision Tree or DT is one of the most well known and most widely used supervised machine learning algorithms and can be applied to both regression and classification. As the name suggests, a decision tree model aims to predict the value of a target variable using decision rules it infers from the training data.
Since the decision tree is a tree-based model, you can also export the decision tree itself to help you explain the model’s output. Model explainability or model interpretability is often a useful feature to have, especially in business, where some stakeholders might be untrusting of a black box algorithm where nobody really understands what it’s doing or how it works.
In this simple Python example, we’ll build a basic decision tree classification model using scikit-learn. We’ll be using the built-in DecisionTreeClassifier to create a decision tree model for multi-class classification. We’ll fit it to our data, evaluate the model performance, and then do some basic model hyperparameter tuning. Let’s get started.
To get started open a Jupyter notebook and import the packages below from scikit-learn. We’ll be using the sklearn
DecisionTreeClassifier to create our decision tree classification model, the
train_test_split function to split the data into our training and test datasets, and the
classification_report functions to evaluate our model’s performance. To save the hassle of data cleansing and feature engineering we’ll use one of scikit-learn’s test datasets on the chemistry of different wines.
from sklearn.tree import DecisionTreeClassifier from sklearn.model_selection import KFold from sklearn.model_selection import cross_val_score from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from sklearn.metrics import classification_report from sklearn.datasets import load_wine from sklearn import tree import matplotlib.pyplot as plt
Next, we’ll use the
load_wine() function to load the wine dataset and return the
X training data and the
y target variable. Passing
as_frame=True will return the data as a Pandas dataframe so we can inspect it more easily. Load up a
sample() of five rows to see what we’re working with. The
y data is a series containing an index and an integer value denoting the class of each sample in the dataset.
If you’re using your own data, you’ll need to first prepare your data for use in the model by preprocessing it. As you can see from the below data, all values in the training data need to be converted to a numeric format - either a float or an integer.
You will need to encode categorical variables them to make them numeric. You might also want to apply some feature engineering techniques to guide the model by creating potentially useful values that can be used to help predict the target variable.
X, y = load_wine(return_X_y=True, as_frame=True)
119 1 134 2 78 1 168 2 52 0 Name: target, dtype: int64
Now we have our
y data we can pass it to the scikit-learn
train_test_split() function to create the training and test data we need to train and evaluate our classification model. We’ll set the
test_size value to 0.3 so that 30% of our data is used for testing and the rest for training. Setting the
random_state value ensures results are more reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
Next, we’ll create our decision tree model using
DecisionTreeClassifier. To ensure reproducible results between runs, we’ll pass a value to the
random_state argument. If you don’t do this, you’ll get a slightly different result each time you run the model. To keep things simple, we’re first going to fit a base model with no additional optional arguments passed, apart from the
model = DecisionTreeClassifier(random_state=1)
Now we’ll “fit” the data to our model by passing in the
X_train data, which includes only the features for the training portion of 70% of the overall dataset, and the separate
y_train target variables for the same dataset. The decision tree algorithm will use the training data to work out what decision rules are required to get the best results at predicting the target variable.
Now we’ve built our model by fitting it on the training data, we can pass it the
X_test data. It’s never seen this data before, as it’s only been trained using the 70% portion of training data, so we’ll now see how well the model “generalises” and is capable of predicting the target variable on unseen data. Here’s the
X_test data we’re passing to the model.
To generate predictions from the model, we’ll append the
predict() function to our
DecisionTreeClassifier, which we stored in a variable called
model, along with the
X_test dataframe. We’ll assign the output to a variable called
y_pred, to denote that these are the prediction
y target variables. If you print
y_pred you’ll see that a
numpy array is returned that contains 0, 1, or 2, which maps to the class for each predicted value.
y_pred = model.predict(X_test)
array([2, 1, 0, 1, 0, 2, 1, 0, 2, 1, 0, 1, 1, 0, 1, 1, 2, 0, 1, 0, 0, 1, 2, 0, 0, 2, 0, 0, 0, 2, 1, 2, 2, 0, 1, 1, 1, 1, 1, 0, 0, 2, 2, 0, 0, 0, 1, 0, 0, 0, 1, 2, 2, 0])
Now we’ve generated our predictions, we need to evaluate the model’s performance to see if it was any good. There are many different evaluation metrics that you can use to assess the performance of a machine learning model. We’ll first use the
accuracy_score() metric. To use this you need to pass it the actual values stored in
y_test and the predicted values stored in
accuracy_score() metric will then return a percentage score showing the proportion of results that we predicted correctly. Our base model decision tree classifier scores 94.4%.
accuracy = accuracy_score(y_test, y_pred) accuracy
Another useful metric for evaluating a classification model is the classification report, which can be accessed via the
classification_report() function. This works in the same way as the other scikit-learn model evaluation metric functions, so you just need to pass it the same
y_pred data. To make it print the results neatly, you’ll need to wrap the output in a
print() statement, even inside a Jupyter notebook.
precision recall f1-score support 0 0.96 0.96 0.96 23 1 0.94 0.89 0.92 19 2 0.92 1.00 0.96 12 accuracy 0.94 54 macro avg 0.94 0.95 0.95 54 weighted avg 0.94 0.94 0.94 54
The classification report can seem a bit confusing the first time you use it. The first column, which contains the 0, 1, and 2 values, shows the target variables predicted along with their evaluation metric scores. It’s fairly normal for a model to be good at predicting some target variables but less good at predicting others, and that’s what we’re seeing here. Four metrics are returned - precision, recall, F1 score, and support. Here’s what they mean.
|Precision||The precision model evaluation metric is the ratio of true positives over true positives plus false positives, i.e.
|Recall||The recall model evaluation metric is the ratio of true positives over true positives plus false negatives, i.e.
|F1 score||The F1 score (or F-beta score, as it's also known) is a weighted harmonic mean of the precision and recal scores, where an F-beta score of 1 is best and 0 is worst.|
|Support||The support value shows the number of occurrences of each class in the `y_true` (or `y_test`) data.|
One neat feature of the decision tree model is that the results are interpretable, since it’s just a tree-based model of decision rules. To generate the tree used within the decision tree model, you can pass the
model object to
tree.plot_model(). To increase the size of the image
plot_tree() generates, I’d recommend using Matplotlib to force the size of the figure to something more readable using
[Text(251.10000000000002, 605.7257142857143, 'X <= 0.975\ngini = 0.656\nsamples = 124\nvalue = [36, 52, 36]'), Text(167.4, 512.537142857143, 'gini = 0.0\nsamples = 30\nvalue = [0, 0, 30]'), Text(334.8, 512.537142857143, 'X <= 760.0\ngini = 0.543\nsamples = 94\nvalue = [36, 52, 6]'), Text(167.4, 419.34857142857146, 'X <= 1.505\ngini = 0.223\nsamples = 56\nvalue = [1, 49, 6]'), Text(83.7, 326.16, 'gini = 0.0\nsamples = 5\nvalue = [0, 0, 5]'), Text(251.10000000000002, 326.16, 'X <= 9.5\ngini = 0.076\nsamples = 51\nvalue = [1, 49, 1]'), Text(167.4, 232.9714285714286, 'X <= 3.925\ngini = 0.039\nsamples = 50\nvalue = [1, 49, 0]'), Text(83.7, 139.7828571428571, 'gini = 0.0\nsamples = 45\nvalue = [0, 45, 0]'), Text(251.10000000000002, 139.7828571428571, 'X <= 4.14\ngini = 0.32\nsamples = 5\nvalue = [1, 4, 0]'), Text(167.4, 46.594285714285775, 'gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]'), Text(334.8, 46.594285714285775, 'gini = 0.0\nsamples = 4\nvalue = [0, 4, 0]'), Text(334.8, 232.9714285714286, 'gini = 0.0\nsamples = 1\nvalue = [0, 0, 1]'), Text(502.20000000000005, 419.34857142857146, 'X <= 12.66\ngini = 0.145\nsamples = 38\nvalue = [35, 3, 0]'), Text(418.5, 326.16, 'gini = 0.0\nsamples = 3\nvalue = [0, 3, 0]'), Text(585.9, 326.16, 'gini = 0.0\nsamples = 35\nvalue = [35, 0, 0]')]
In this very simple example we’ve used Python’s scikit-learn package to create a decision tree classifier. While this approach works, it’s a simplification of what you’d do when creating a more robust model. In a real world scenario you’ll get better results from testing a range of different classification algorithms and then using an approach called model selection with cross validation.
This approach to model selection will help you identify the model best suited to your problem and which provides the highest level of accuracy (or whatever metric you choose to evaluate using). Applying model selection can often give you significant boosts in performance.
Once you’ve selected your classification algorithm, the final step is to tune your model using a technique called hyperparameter tuning. This tweaks the model’s specific settings to fine tune the performance. While it rarely leads to big improvements, it can give you a little extra boost and is usually worthwhile, especially if you set it up and leave it to run overnight.
Matt Clarke, Sunday, May 01, 2022