The random forest model or random decision forest model is a supervised machine learning algorithm that can be used for classification or regression problems. It’s what’s known as an ensemble learning method and works by creating many decision trees and then taking a consensus vote for classification models, or a mean of the predictions for regression models.
Along with the decision tree model itself, random forests are among the most widely used classification and regression models in data science. Ensemble methods, such as random forests, often give better results than individual tree-based machine learning models, but they share the same drawback: they can still overfit to the training data and fail to generalise when presented with data they’ve never seen before.
Performance-wise, random forests usually outperform individual decision trees, but rarely trump gradient boosted tree algorithms, such as XGBoost. In this simple example I’ll show you how you can create a basic random forest classification model using scikit-learn in Python via the `RandomForestClassifier` class. It should be plenty to get you started building a model using your own data.
First, open a Jupyter notebook and import the packages below. We’re using the `RandomForestClassifier` class from the `sklearn.ensemble` module to create the random forest classifier model. We’re loading some test data on wine chemistry from the `sklearn.datasets` module, which we’re splitting into training and test data using `train_test_split`. Finally, we’re using the `accuracy_score` and `classification_report` functions from the `sklearn.metrics` module to evaluate the performance of the model we create.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.datasets import load_wine
You can use any data you want. For speed we’ll use the wine dataset from scikit-learn as it doesn’t require any data cleansing or feature engineering. If you’re using your own dataset, you’ll need to encode categorical variables to convert them to the numeric form required for modelling.
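If your own data does contain text categories, one common way to encode them is one-hot encoding. The snippet below is only a minimal sketch: the dataframe and its `grape` column are made up for illustration, and `pd.get_dummies()` is just one of several ways to do this.

```python
import pandas as pd

# Hypothetical data: 'grape' is a text category, so it needs encoding
df = pd.DataFrame({
    "alcohol": [13.2, 12.4, 14.1, 12.9],
    "grape": ["merlot", "syrah", "merlot", "pinot"],
})

# One-hot encode: each category becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["grape"], dtype=int)
print(encoded)
```

If you’re building a scikit-learn pipeline, `OneHotEncoder` from `sklearn.preprocessing` does a similar job and remembers the categories it saw during training.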
The `X` dataframe contains our training and test data, minus the target variable we’re aiming to predict, which is stored in `y`. If you run `y.value_counts()` you’ll see that the target variable contains three classes: 0, 1, and 2. Therefore, our model is going to examine the data on wine chemistry and try to predict to which class each wine belongs.
X, y = load_wine(return_X_y=True, as_frame=True)
X.sample(5)
 | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
164 | 13.78 | 2.76 | 2.30 | 22.0 | 90.0 | 1.35 | 0.68 | 0.41 | 1.03 | 9.58 | 0.70 | 1.68 | 615.0 |
42 | 13.88 | 1.89 | 2.59 | 15.0 | 101.0 | 3.25 | 3.56 | 0.17 | 1.70 | 5.43 | 0.88 | 3.56 | 1095.0 |
112 | 11.76 | 2.68 | 2.92 | 20.0 | 103.0 | 1.75 | 2.03 | 0.60 | 1.05 | 3.80 | 1.23 | 2.50 | 607.0 |
153 | 13.23 | 3.30 | 2.28 | 18.5 | 98.0 | 1.80 | 0.83 | 0.61 | 1.87 | 10.52 | 0.56 | 1.51 | 675.0 |
176 | 13.17 | 2.59 | 2.37 | 20.0 | 120.0 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840.0 |
y.sample(5)
33 0
74 1
96 1
29 0
53 0
Name: target, dtype: int64
y.value_counts()
1 71
0 59
2 48
Name: target, dtype: int64
To prepare our data we now need to create four datasets - two for training and two for testing. We can do this by passing the `X` data and `y` data to `train_test_split()`. The `test_size` argument is set to 0.3, which puts a randomly assigned 30% of the overall data in the test data (`X_test` and `y_test`) and the rest in the training data (`X_train` and `y_train`). The `random_state` argument is set to 1 to ensure reproducible results between model runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
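It can be worth checking what the split actually produced. The sketch below repeats the split with one optional extra, `stratify=y`, which asks `train_test_split()` to keep the class proportions similar in both sets. That argument isn’t used elsewhere in this article, and the `_s` variable names are only there so the sketch doesn’t overwrite the split above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)

# Same settings as above, plus stratify=y (an optional extra)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Number of rows and columns in each set
print(X_train_s.shape, X_test_s.shape)

# Share of each class in the training and test targets
print(y_train_s.value_counts(normalize=True).round(2))
print(y_test_s.value_counts(normalize=True).round(2))
```

With only 178 rows and three unevenly sized classes, a stratified split is a cheap way to make sure no class ends up under-represented in the test data.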
Next we’ll fit a very simple base random forest model using `RandomForestClassifier`. Like other scikit-learn models, this has lots of arguments you can pass in and tune, but we’ll only add one - the `n_estimators` argument, which we’ll set to 100 for demonstration purposes. As the name suggests, this will create a random forest containing 100 decision trees. We’ll then fit that model to our training data and assign the output to `model`.
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
RandomForestClassifier()
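One thing to be aware of is that the split is seeded but the forest itself is not, so re-running the notebook can give slightly different trees and scores. As a minimal sketch, passing `random_state` to `RandomForestClassifier` as well should make the model repeatable. The seed value of 42 and the `model_a` and `model_b` names are arbitrary choices for this illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# random_state fixes the bootstrap samples and feature choices;
# the value 42 is arbitrary
model_a = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
model_b = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Two forests built with the same seed make identical predictions
same = (model_a.predict(X_test) == model_b.predict(X_test)).all()
print(same)
```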
Now we have the model trained, we’ll pass it the `X_test` data and get it to make some predictions. Since we trained the model using the `X_train` data, the model has never seen these values. We’ll generate predictions using `predict()`, store them in `y_pred`, and print the NumPy array to inspect the predictions.
y_pred = model.predict(X_test)
y_pred
array([2, 1, 0, 1, 0, 2, 1, 0, 2, 1, 0, 0, 1, 0, 1, 1, 2, 0, 1, 0, 0, 1,
2, 0, 0, 2, 0, 0, 0, 2, 1, 2, 2, 0, 1, 1, 1, 1, 1, 0, 0, 1, 2, 0,
0, 0, 1, 0, 0, 0, 1, 2, 2, 0])
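To see how the forest arrives at these predictions, we can rebuild them from the individual trees. In scikit-learn’s implementation the classifier averages each tree’s predicted class probabilities and picks the most probable class, rather than counting hard votes. The sketch below refits the model with an arbitrary `random_state` so it’s self-contained, and names such as `tree_probs` and `manual_pred` are just for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

# Class probabilities from each individual tree (plain arrays avoid
# feature-name warnings, as the sub-trees are fitted on arrays)
X_test_arr = X_test.to_numpy()
tree_probs = np.array([tree.predict_proba(X_test_arr) for tree in model.estimators_])

# Average over the 100 trees, then pick the most probable class
mean_probs = tree_probs.mean(axis=0)
manual_pred = model.classes_[mean_probs.argmax(axis=1)]

# Compare the hand-built results with the forest's own methods
print(np.allclose(mean_probs, model.predict_proba(X_test)))
print((manual_pred == model.predict(X_test)).all())
```

If you want the probabilities rather than the class labels, `model.predict_proba(X_test)` returns them directly.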
There are various ways to evaluate the performance of a classification model. To keep things simple we’ll use accuracy, via the `accuracy_score()` metric. The model scores 98.15%, which is pretty good, and is a significant improvement over the low 90% figures typically obtained from a basic decision tree model.
accuracy = accuracy_score(y_test, y_pred)
accuracy
0.9814814814814815
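If you’d like to check the comparison with a single decision tree for yourself, a rough sketch is to fit both models on the same split and compare their accuracy. The seeds here are arbitrary and the exact scores will depend on them and on your scikit-learn version, so treat the output as indicative only.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit a single tree and a forest on the same split (seed 1 is arbitrary)
candidates = {
    "decision tree": DecisionTreeClassifier(random_state=1),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {scores[name]:.4f}")
```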
To get a better understanding of the model’s performance, we can use the `classification_report`. Four metrics are returned - precision, recall, F1 score, and support. They’re explained in the table beneath.
print(classification_report(y_test, y_pred))
 | precision | recall | f1-score | support |
---|---|---|---|---|
0 | 0.96 | 1.00 | 0.98 | 23 |
1 | 1.00 | 0.95 | 0.97 | 19 |
2 | 1.00 | 1.00 | 1.00 | 12 |
accuracy | | | 0.98 | 54 |
macro avg | 0.99 | 0.98 | 0.98 | 54 |
weighted avg | 0.98 | 0.98 | 0.98 | 54 |
Metric | Definition |
---|---|
Precision | The precision model evaluation metric is the ratio of true positives over true positives plus false positives, i.e. precision = tp / (tp + fp) . Precision shows the model's ability not to label a negative sample as positive. |
Recall | The recall model evaluation metric is the ratio of true positives over true positives plus false negatives, i.e. recall = tp / (tp + fn) . Recall shows the model's ability to detect the positive samples. |
F1 score | The F1 score is the harmonic mean of the precision and recall scores (it is the F-beta score with beta set to 1, which weights the two equally), where a score of 1 is best and 0 is worst. |
Support | The support value shows the number of occurrences of each class in the `y_true` (or `y_test`) data. |
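The definitions in the table can be checked by hand from a confusion matrix. The sketch below uses a small set of made-up labels (`y_true` and `y_hat` are not from the wine model) and compares the manual calculation with scikit-learn’s own `precision_score()`, `recall_score()`, and `f1_score()` functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels purely for illustration
y_true = [0, 0, 0, 1, 1, 1, 2, 2]
y_hat = [0, 0, 1, 1, 1, 0, 2, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)

tp = np.diag(cm)           # correctly predicted, per class
fp = cm.sum(axis=0) - tp   # predicted as the class but actually another
fn = cm.sum(axis=1) - tp   # belong to the class but predicted as another

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)   # number of true samples per class

# Each comparison should agree with scikit-learn's per-class metrics
print(np.allclose(precision, precision_score(y_true, y_hat, average=None)))
print(np.allclose(recall, recall_score(y_true, y_hat, average=None)))
print(np.allclose(f1, f1_score(y_true, y_hat, average=None)))
```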
Although a random forest will typically outperform a decision tree when it comes to accuracy, the downside is that random forests are much less interpretable. With a regular decision tree model, you can print the decision tree itself to see what decisions the model used to reach its predictions. However, with a random forest, you’ll have numerous individual decision trees that are used to make predictions from which the eventual final prediction is made either via a consensus or average.
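One partial remedy for the loss of interpretability is to look at which features the forest leaned on most. As a minimal sketch, the fitted model exposes a `feature_importances_` attribute, which we can pair with the column names. The model is refitted here with an arbitrary `random_state` so the snippet is self-contained.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

# Impurity-based importances, one per feature, summing to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(5))
```

These impurity-based scores can be biased towards features with many distinct values, so `permutation_importance` from `sklearn.inspection` is worth a look if you need something more robust.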
To examine a specific tree from the random forest simply change the value in the square brackets after `estimators_` to any integer from 0 up to one less than the number of trees in your model. The `n_estimators` argument is set to 100 by default, so a base random forest comprises 100 decision trees with indices 0 to 99.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
fig = plt.figure(figsize=(12, 12))
plot_tree(model.estimators_[0],
          filled=True,
          impurity=True,
          rounded=True)
[Text(263.0571428571429, 605.7257142857143, 'X[10] <= 0.855\ngini = 0.652\nsamples = 74\nvalue = [29, 49, 46]'),
Text(95.65714285714286, 512.537142857143, 'X[6] <= 1.235\ngini = 0.168\nsamples = 27\nvalue = [1, 3, 40]'),
Text(47.82857142857143, 419.34857142857146, 'gini = 0.0\nsamples = 23\nvalue = [0, 0, 39]'),
Text(143.4857142857143, 419.34857142857146, 'X[5] <= 1.52\ngini = 0.56\nsamples = 4\nvalue = [1, 3, 1]'),
Text(95.65714285714286, 326.16, 'gini = 0.0\nsamples = 1\nvalue = [0, 0, 1]'),
Text(191.31428571428572, 326.16, 'X[2] <= 2.34\ngini = 0.375\nsamples = 3\nvalue = [1, 3, 0]'),
Text(143.4857142857143, 232.9714285714286, 'X[3] <= 16.75\ngini = 0.5\nsamples = 2\nvalue = [1, 1, 0]'),
Text(95.65714285714286, 139.7828571428571, 'gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]'),
Text(191.31428571428572, 139.7828571428571, 'gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]'),
Text(239.14285714285714, 232.9714285714286, 'gini = 0.0\nsamples = 1\nvalue = [0, 2, 0]'),
Text(430.45714285714286, 512.537142857143, 'X[1] <= 1.62\ngini = 0.541\nsamples = 47\nvalue = [28, 46, 6]'),
Text(334.8, 419.34857142857146, 'X[6] <= 3.125\ngini = 0.077\nsamples = 16\nvalue = [1, 24, 0]'),
Text(286.9714285714286, 326.16, 'gini = 0.0\nsamples = 14\nvalue = [0, 23, 0]'),
Text(382.62857142857143, 326.16, 'X[10] <= 1.065\ngini = 0.5\nsamples = 2\nvalue = [1, 1, 0]'),
Text(334.8, 232.9714285714286, 'gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]'),
Text(430.45714285714286, 232.9714285714286, 'gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]'),
Text(526.1142857142858, 419.34857142857146, 'X[0] <= 12.875\ngini = 0.587\nsamples = 31\nvalue = [27, 22, 6]'),
Text(478.2857142857143, 326.16, 'gini = 0.0\nsamples = 11\nvalue = [0, 21, 0]'),
Text(573.9428571428572, 326.16, 'X[10] <= 0.995\ngini = 0.337\nsamples = 20\nvalue = [27, 1, 6]'),
Text(526.1142857142858, 232.9714285714286, 'X[6] <= 1.32\ngini = 0.54\nsamples = 5\nvalue = [3, 1, 6]'),
Text(478.2857142857143, 139.7828571428571, 'gini = 0.0\nsamples = 2\nvalue = [0, 0, 6]'),
Text(573.9428571428572, 139.7828571428571, 'X[12] <= 753.5\ngini = 0.375\nsamples = 3\nvalue = [3, 1, 0]'),
Text(526.1142857142858, 46.594285714285775, 'gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]'),
Text(621.7714285714286, 46.594285714285775, 'gini = 0.0\nsamples = 2\nvalue = [3, 0, 0]'),
Text(621.7714285714286, 232.9714285714286, 'gini = 0.0\nsamples = 15\nvalue = [24, 0, 0]')]
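The node labels above refer to features by position, such as `X[10]`, which isn’t easy to read. As a sketch of one alternative, `export_text()` from `sklearn.tree` prints the same tree as indented rules and accepts a `feature_names` argument, so the column names appear instead. The model is refitted here with an arbitrary `random_state`, so the tree won’t match the one plotted above.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import export_text

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

# Text dump of the first tree, using column names instead of X[10] etc.
rules = export_text(model.estimators_[0], feature_names=list(X.columns))
print(rules)
```

`plot_tree()` takes a `feature_names` argument too, if you’d rather keep the diagram.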
That’s how to create a really simple random forest model in Python using scikit-learn, but there are a number of other things you can do to make your model more robust and improve its performance. Firstly, you’ll probably want to apply the model selection process and use cross validation to identify the model best suited to the task, rather than simply selecting one over another based on a hunch. Not all models perform equally on the same dataset, so you may get a significant performance boost by trying a range of them to see which one is best.
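As a rough sketch of what that comparison might look like, the snippet below runs 5-fold cross validation on three tree-based classifiers using `cross_val_score()`. The choice of models, the number of folds, and the seeds are all arbitrary here, and you’d normally include whichever candidates suit your problem.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True, as_frame=True)

# Candidate models; the selection and seeds are arbitrary choices
models = {
    "decision tree": DecisionTreeClassifier(random_state=1),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "gradient boosting": GradientBoostingClassifier(random_state=1),
}

cv_results = {}
for name, clf in models.items():
    # 5-fold cross validation: five train/test splits, five accuracy scores
    fold_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    cv_results[name] = fold_scores.mean()
    print(f"{name}: {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})")
```

Because every model is scored on the same five folds, the mean accuracy is a fairer basis for comparison than a single train/test split.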
Secondly, you’ll want to conduct hyperparameter tuning after you’ve selected your chosen model. Hyperparameter tuning is typically a brute force search in which many combinations of model settings are tried in order to find the one that generates the best performance for the model.
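One way to run such a search in scikit-learn is `GridSearchCV`, which tries every combination in a grid and scores each with cross validation. The sketch below uses a deliberately tiny grid so it runs quickly; the parameter values are illustrative rather than recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# A deliberately tiny grid; the values are illustrative, not recommendations
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "max_features": ["sqrt", "log2"],
}

# 2 x 2 x 2 = 8 combinations, each scored with 3-fold cross validation
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

For larger grids, `RandomizedSearchCV` samples a fixed number of combinations instead of trying them all, which usually cuts the run time considerably.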
It rarely brings massive gains, but it should give you a little extra performance for not a lot of effort. However, it’s time-consuming and processor intensive, so you’ll likely want to run hyperparameter tuning overnight and then use Python’s `pickle` module to save your machine learning model.
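Saving and reloading a fitted model with `pickle` only takes a few lines. This is a minimal sketch: the `random_forest.pkl` filename is hypothetical, and it’s written to a temporary directory here purely so the example cleans up after itself.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Hypothetical filename, written to a temporary directory for this demo
path = os.path.join(tempfile.mkdtemp(), "random_forest.pkl")

# Save the fitted model to disk
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back in, as you would in a later session
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model makes the same predictions as the original
print((restored.predict(X) == model.predict(X)).all())
```

Only unpickle files you trust, and try to load the model with the same scikit-learn version you used to save it.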
Matt Clarke, Sunday, May 01, 2022