How to create a random forest classification model using scikit-learn

Learn how to create a random forest classification model using scikit-learn in Python with the sklearn RandomForestClassifier in this basic tutorial with example code.


The random forest model or random decision forest model is a supervised machine learning algorithm that can be used for classification or regression problems. It’s what’s known as an ensemble learning method and works by creating many decision trees and then taking a consensus vote for classification models, or a mean of the predictions for regression models.
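To make the ensemble idea concrete, here’s a minimal sketch, separate from the tutorial’s main code, that fits a small forest on the wine data and reproduces its output by hand. In scikit-learn the forest averages the class probabilities predicted by its individual trees and picks the class with the highest mean, which is a soft version of the consensus vote. The forest size of 25 and the random_state value are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Load the wine data as plain NumPy arrays
X, y = load_wine(return_X_y=True)

# A small forest; 25 trees and random_state=1 are arbitrary illustrative values
forest = RandomForestClassifier(n_estimators=25, random_state=1)
forest.fit(X, y)

# Ask every fitted tree for its class probabilities, then average them
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# The class with the highest averaged probability is the forest's prediction
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

# Both checks compare the hand-rolled result with the forest's own output
print(np.allclose(mean_proba, forest.predict_proba(X)))
print((manual_pred == forest.predict(X)).all())
```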

Along with the decision tree model itself, random forests are among the most widely used classification and regression models in data science. Ensemble methods, such as random forests, often give better results than individual tree-based machine learning models, but they share the same drawback: they can also overfit to the training data and fail to generalise when presented with data they’ve never seen before.

Performance-wise, random forests usually outperform individual decision trees, but rarely trump gradient boosted tree algorithms, such as XGBoost. In this simple example I’ll show you how you can create a basic random forest classification model using scikit-learn in Python via the RandomForestClassifier class. It should be plenty to get you started building a model using your own data.

Load the packages

First, open a Jupyter notebook and import the packages below. We’re using the RandomForestClassifier class from the sklearn.ensemble module to create the random forest classifier model. We’re loading some test data on wine chemistry from the sklearn.datasets module, which we’ll split into training and test data using train_test_split. Finally, we’re using the accuracy_score and classification_report functions from the sklearn.metrics module to evaluate the performance of the model we create.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.datasets import load_wine

Load the data

You can use any data you want. For speed we’ll use the wine dataset from scikit-learn as it doesn’t require any data cleansing or feature engineering. If you’re using your own dataset, you’ll need to encode categorical variables to convert them to the numeric form required for modelling.

The X dataframe contains our training and test data, minus the target variable we’re aiming to predict, which is stored in y. If you run y.value_counts() you’ll see that the target variable contains three classes: 0, 1, and 2. Therefore, our model is going to examine the data on wine chemistry and try to predict to which class each wine belongs.

X, y = load_wine(return_X_y=True, as_frame=True)
X.sample(5)
     alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  od280/od315_of_diluted_wines  proline
164    13.78        2.76  2.30               22.0       90.0           1.35        0.68                  0.41             1.03             9.58  0.70                          1.68    615.0
 42    13.88        1.89  2.59               15.0      101.0           3.25        3.56                  0.17             1.70             5.43  0.88                          3.56   1095.0
112    11.76        2.68  2.92               20.0      103.0           1.75        2.03                  0.60             1.05             3.80  1.23                          2.50    607.0
153    13.23        3.30  2.28               18.5       98.0           1.80        0.83                  0.61             1.87            10.52  0.56                          1.51    675.0
176    13.17        2.59  2.37               20.0      120.0           1.65        0.68                  0.53             1.46             9.30  0.60                          1.62    840.0
y.sample(5)
33    0
74    1
96    1
29    0
53    0
Name: target, dtype: int64
y.value_counts()
1    71
0    59
2    48
Name: target, dtype: int64
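
The wine data is already numeric, but your own dataset may not be. As a rough sketch of the categorical encoding step mentioned above, the example below one-hot encodes a tiny dataframe using pandas. The column names and values are invented purely for illustration, and get_dummies() is just one of several ways to do this.

```python
import pandas as pd

# Hypothetical data: the column names and values are made up for illustration
df = pd.DataFrame({
    "grape": ["merlot", "syrah", "merlot", "pinot"],
    "region": ["north", "south", "south", "north"],
    "alcohol": [13.2, 12.8, 13.9, 12.1],
})

# One-hot encode the categorical columns; the numeric column passes through unchanged
X_encoded = pd.get_dummies(df, columns=["grape", "region"])

print(X_encoded.columns.tolist())
```

Each category becomes its own indicator column, so the model receives only numeric inputs.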

Split the data into the train and test datasets

To prepare our data we now need to create four datasets - two for training and two for testing. We can do this by passing the X data and y data to train_test_split(). The test_size argument is set to 0.3, which puts a randomly assigned 30% of the overall data in the test data (X_test and y_test) and the rest in the training data (X_train and y_train). The random_state argument is set to 1 to ensure reproducible results between model runs.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
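
If you want to check what the split produced, a quick sketch like the one below prints the shapes of the resulting datasets. It also shows an optional variant that isn’t used in this tutorial: passing stratify=y asks train_test_split to keep the class proportions roughly the same in the training and test data, which can help when classes are imbalanced.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)

# The same split as in the tutorial: 30% test data and a fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
print(X_train.shape, X_test.shape)

# Optional variant (an addition, not used in the tutorial):
# stratify=y preserves the class balance in both parts of the split
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Compare the class proportions in the two test sets
print(y_test.value_counts(normalize=True).sort_index())
print(y_test_s.value_counts(normalize=True).sort_index())
```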

Create and fit the random forest model

Next we’ll fit a very simple base random forest model using RandomForestClassifier. Like other scikit-learn models, this has lots of arguments you can pass in and tune, but we’ll only add one - the n_estimators argument, which we’ll set to 100 for demonstration purposes. As the name suggests, this will create a random forest containing 100 decision trees. We’ll then fit that model to our training data and assign the output to model.

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
RandomForestClassifier()
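
One thing to be aware of is that the model above is created without a random_state, so the trees it builds can differ from run to run. As a small sketch of how you might make the model itself reproducible, the example below fits two forests with the same seed and checks that they agree. The seed value of 1 is an arbitrary choice and isn’t part of the original code.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# random_state=1 is an arbitrary seed added for illustration
model_a = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
model_b = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

# With the same seed and the same data, the two forests should agree
print((model_a.predict(X_test) == model_b.predict(X_test)).all())
```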

Generate predictions from the model

Now that we have the model trained, we’ll pass it the X_test data and get it to make some predictions. Since we trained the model using only the X_train data, the model has never seen these values. We’ll generate predictions using predict(), store them in y_pred, and print the NumPy array to inspect them.

y_pred = model.predict(X_test)
y_pred
array([2, 1, 0, 1, 0, 2, 1, 0, 2, 1, 0, 0, 1, 0, 1, 1, 2, 0, 1, 0, 0, 1,
       2, 0, 0, 2, 0, 0, 0, 2, 1, 2, 2, 0, 1, 1, 1, 1, 1, 0, 0, 1, 2, 0,
       0, 0, 1, 0, 0, 0, 1, 2, 2, 0])
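
If you’d like to see how confident the model is, rather than just the predicted class, predict_proba() returns one probability per class for each row. The sketch below repeats the earlier steps so that it runs on its own. The random_state on the model is an assumption added so the output is repeatable.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# random_state=1 on the model is an illustrative addition
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# One row per test sample, one column per class (in the order of model.classes_)
proba = model.predict_proba(X_test)
print(model.classes_)
print(proba[:5].round(2))
```

Rows where the highest probability is well below 1 are the predictions the forest is least sure about.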

Evaluate the model’s performance

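There are various ways to evaluate the performance of a classification model. To keep things simple we’ll use accuracy, via the accuracy_score() function. The model scores 98.15% (53 of the 54 test samples classified correctly), which is pretty good, and is a significant improvement over the low 90% figures typically obtained from a basic decision tree model. Because the model was created without a random_state, your exact score may differ slightly between runs.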

accuracy = accuracy_score(y_test, y_pred)
accuracy
0.9814814814814815
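
Accuracy is simply the share of predictions that match the true labels, so you can reproduce it by hand. The sketch below does that and also prints a confusion matrix, which shows which classes get mixed up. The confusion_matrix import and the random_state on the model are additions for illustration, so the numbers you see may not match the output above exactly.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# random_state=1 on the model is an illustrative addition
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy by hand: the fraction of predictions equal to the true label
manual_accuracy = (y_pred == y_test).mean()
print(manual_accuracy, accuracy_score(y_test, y_pred))

# Rows are the true classes, columns are the predicted classes
print(confusion_matrix(y_test, y_pred))
```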

To get a better understanding of the model’s performance, we can use the classification_report. Four metrics are returned - precision, recall, F1 score, and support. They’re explained in the table beneath.

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.96      1.00      0.98        23
           1       1.00      0.95      0.97        19
           2       1.00      1.00      1.00        12

    accuracy                           0.98        54
   macro avg       0.99      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54

| Metric | Definition |
| --- | --- |
| Precision | The precision model evaluation metric is the ratio of true positives over true positives plus false positives, i.e. precision = tp / (tp + fp). Precision shows the model's ability not to label a negative sample as positive. |
| Recall | The recall model evaluation metric is the ratio of true positives over true positives plus false negatives, i.e. recall = tp / (tp + fn). Recall shows the model's ability to detect the positive samples. |
| F1 score | The F1 score is the harmonic mean of the precision and recall scores (it is the F-beta score with beta set to 1), where a score of 1 is best and 0 is worst. |
| Support | The support value shows the number of occurrences of each class in the `y_true` (here, `y_test`) data. |
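
To see how these definitions work in practice, here’s a small sketch that computes each metric by hand from a confusion matrix. The labels are made up for illustration and aren’t output from the wine model. The variable name y_hat is just a placeholder for predicted labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Small made-up labels for illustration, not output from the wine model
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_hat = [0, 0, 0, 1, 1, 1, 0, 2, 2, 2]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_hat)

tp = np.diag(cm)            # correctly predicted samples per class
fp = cm.sum(axis=0) - tp    # predicted as the class, but actually another class
fn = cm.sum(axis=1) - tp    # actually the class, but predicted as another class

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
support = cm.sum(axis=1)    # number of true samples per class

print(precision, recall, f1, support)
```

These per-class values should match what precision_score(), recall_score(), and f1_score() return with average=None.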

Plot an individual decision tree from the random forest model

Although a random forest will typically outperform a decision tree when it comes to accuracy, the downside is that random forests are much less interpretable. With a regular decision tree model, you can print the decision tree itself to see what decisions the model used to reach its predictions. However, with a random forest, you’ll have numerous individual decision trees that are used to make predictions from which the eventual final prediction is made either via a consensus or average.
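
One common way to recover some interpretability is to look at the feature importances that scikit-learn stores on the fitted forest. The sketch below is an addition to the tutorial and repeats the earlier steps so it runs on its own. The random_state on the model is an assumption made for repeatability.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# random_state=1 on the model is an illustrative addition
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# Impurity-based importances, one per feature, normalised to sum to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print(importances.head(5))
```

Higher values indicate features the trees relied on more heavily when splitting, though these scores describe the whole forest rather than any single prediction.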

To examine a specific tree from the random forest, change the value in the square brackets after estimators_ in the code below to any integer from 0 to one less than the number of trees in your model. The n_estimators argument is set to 100 by default, so a base random forest comprises 100 decision trees, with valid indices running from 0 to 99.

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
fig = plt.figure(figsize=(12, 12))
plot_tree(model.estimators_[0], 
          filled=True, 
          impurity=True, 
          rounded=True)
[Text(263.0571428571429, 605.7257142857143, 'X[10] <= 0.855\ngini = 0.652\nsamples = 74\nvalue = [29, 49, 46]'),
 Text(95.65714285714286, 512.537142857143, 'X[6] <= 1.235\ngini = 0.168\nsamples = 27\nvalue = [1, 3, 40]'),
 Text(47.82857142857143, 419.34857142857146, 'gini = 0.0\nsamples = 23\nvalue = [0, 0, 39]'),
 Text(143.4857142857143, 419.34857142857146, 'X[5] <= 1.52\ngini = 0.56\nsamples = 4\nvalue = [1, 3, 1]'),
 Text(95.65714285714286, 326.16, 'gini = 0.0\nsamples = 1\nvalue = [0, 0, 1]'),
 Text(191.31428571428572, 326.16, 'X[2] <= 2.34\ngini = 0.375\nsamples = 3\nvalue = [1, 3, 0]'),
 Text(143.4857142857143, 232.9714285714286, 'X[3] <= 16.75\ngini = 0.5\nsamples = 2\nvalue = [1, 1, 0]'),
 Text(95.65714285714286, 139.7828571428571, 'gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]'),
 Text(191.31428571428572, 139.7828571428571, 'gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]'),
 Text(239.14285714285714, 232.9714285714286, 'gini = 0.0\nsamples = 1\nvalue = [0, 2, 0]'),
 Text(430.45714285714286, 512.537142857143, 'X[1] <= 1.62\ngini = 0.541\nsamples = 47\nvalue = [28, 46, 6]'),
 Text(334.8, 419.34857142857146, 'X[6] <= 3.125\ngini = 0.077\nsamples = 16\nvalue = [1, 24, 0]'),
 Text(286.9714285714286, 326.16, 'gini = 0.0\nsamples = 14\nvalue = [0, 23, 0]'),
 Text(382.62857142857143, 326.16, 'X[10] <= 1.065\ngini = 0.5\nsamples = 2\nvalue = [1, 1, 0]'),
 Text(334.8, 232.9714285714286, 'gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]'),
 Text(430.45714285714286, 232.9714285714286, 'gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]'),
 Text(526.1142857142858, 419.34857142857146, 'X[0] <= 12.875\ngini = 0.587\nsamples = 31\nvalue = [27, 22, 6]'),
 Text(478.2857142857143, 326.16, 'gini = 0.0\nsamples = 11\nvalue = [0, 21, 0]'),
 Text(573.9428571428572, 326.16, 'X[10] <= 0.995\ngini = 0.337\nsamples = 20\nvalue = [27, 1, 6]'),
 Text(526.1142857142858, 232.9714285714286, 'X[6] <= 1.32\ngini = 0.54\nsamples = 5\nvalue = [3, 1, 6]'),
 Text(478.2857142857143, 139.7828571428571, 'gini = 0.0\nsamples = 2\nvalue = [0, 0, 6]'),
 Text(573.9428571428572, 139.7828571428571, 'X[12] <= 753.5\ngini = 0.375\nsamples = 3\nvalue = [3, 1, 0]'),
 Text(526.1142857142858, 46.594285714285775, 'gini = 0.0\nsamples = 1\nvalue = [0, 1, 0]'),
 Text(621.7714285714286, 46.594285714285775, 'gini = 0.0\nsamples = 2\nvalue = [3, 0, 0]'),
 Text(621.7714285714286, 232.9714285714286, 'gini = 0.0\nsamples = 15\nvalue = [24, 0, 0]')]

Random Forest decision tree
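
If you’d rather read a tree as text, or want the real feature names instead of X[10], a sketch along these lines may help. It uses export_text() from sklearn.tree, which is an addition to the tutorial, and sets a random_state on the model for repeatability.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import export_text

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# random_state=1 on the model is an illustrative addition
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# The fitted trees live in a list, so valid indices run from 0 to len - 1
print(len(model.estimators_))
first_tree = model.estimators_[0]

# Render the tree's decision rules as indented text with readable feature names
tree_rules = export_text(first_tree, feature_names=list(X.columns))
print(tree_rules)
```

The same feature_names argument can also be passed to plot_tree() to label the nodes in the plot.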

Next steps

That’s how to create a really simple random forest model in Python using scikit-learn, but there are a number of other things you can do to make your model more robust and improve its performance. Firstly, you’ll probably want to apply the model selection process and use cross validation to identify the model best suited to the task, rather than simply selecting one over another based on a hunch. Not all models perform equally on the same dataset, so you may get a significant performance boost by trying a range of them to see which one is best.
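
As a rough sketch of what that model selection step could look like, the example below uses cross_val_score() to compare three tree-based models with 5-fold cross validation. The choice of candidate models, the number of folds, and the random_state values are all assumptions made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True, as_frame=True)

# Candidate models to compare; this shortlist is an illustrative choice
candidates = {
    "decision tree": DecisionTreeClassifier(random_state=1),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "gradient boosting": GradientBoostingClassifier(random_state=1),
}

# Score each model on 5 folds and keep the mean accuracy
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Because every model is scored on several different splits of the data, the comparison is less dependent on one lucky or unlucky train and test split.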

Secondly, you’ll want to conduct hyperparameter tuning once you’ve selected your model. In its simplest form, a grid search, hyperparameter tuning is a brute force process in which lots of different combinations of settings are tried in order to find the one that generates the best performance for the model.
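
Here’s a minimal sketch of a grid search using GridSearchCV. The parameter grid is deliberately tiny and the values in it are arbitrary examples rather than recommendations. In practice you’d search a wider range.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# A deliberately small grid; these values are illustrative, not recommendations
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "max_features": ["sqrt", "log2"],
}

# Try every combination with 3-fold cross validation on the training data
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)

# The best model is refitted on all the training data and can be scored directly
print(search.score(X_test, y_test))
```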

Hyperparameter tuning rarely brings massive gains, but it should give you a little extra performance for not a lot of effort. However, it’s time-consuming and processor intensive, so you’ll likely want to run it overnight and then use pickle to save your machine learning model.
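
Saving and reloading a fitted model with pickle might look something like the sketch below. The file name is a hypothetical choice, and a temporary directory is used only to keep the example self-contained. You’d normally write to a path of your own.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# random_state=1 on the model is an illustrative addition
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# The file name is hypothetical; the temporary directory is removed afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "random_forest.pkl")

    # Write the fitted model to disk
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Read it back in, as you would in a later session
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# The reloaded model should make the same predictions as the original
print((loaded_model.predict(X_test) == model.predict(X_test)).all())
```

Only unpickle files you trust, and try to load the model with the same scikit-learn version you used to save it.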

Matt Clarke, Sunday, May 01, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.