How to interpret the confusion matrix

The confusion matrix can tell you far more about your model than the accuracy score alone. To show how, we'll build a classification model on the breast cancer dataset and examine its confusion matrix.


As a practical demonstration of how the confusion matrix works, let's load the Wisconsin Breast Cancer dataset, create a classification model, and examine the confusion matrix to see what it reveals.

The Wisconsin Breast Cancer dataset is one of the standard datasets provided with scikit-learn, so it's very easy to access. It contains medical data on breast tissues and includes a diagnostic column (y) that states whether the breast tissue indicated a benign (harmless) or malignant (cancerous) growth. We'll use the data to train a model and make predictions.

1. Load your data

First, load up the required packages, then obtain the data object from load_breast_cancer() into the data variable. As the data (X) and target (y) are stored separately within the scikit-learn Bunch object, you'll need to grab them both and join them together, using the feature_names as the column headers.

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

data = load_breast_cancer()
df = pd.DataFrame(np.c_[data['data'], data['target']],
                  columns= np.append(data['feature_names'], ['target']))
df.head()
mean radius mean texture mean perimeter ... worst symmetry worst fractal dimension target
0 17.99 10.38 122.80 ... 0.4601 0.11890 0.0
1 20.57 17.77 132.90 ... 0.2750 0.08902 0.0
2 19.69 21.25 130.00 ... 0.3613 0.08758 0.0
3 11.42 20.38 77.58 ... 0.6638 0.17300 0.0
4 20.29 14.34 135.10 ... 0.2364 0.07678 0.0

5 rows × 31 columns
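Before going any further, it's worth confirming how the target is encoded: in this dataset 0 indicates a malignant growth and 1 a benign one. You can check this by printing target_names from the Bunch object we loaded above.

# Confirm how the diagnosis is encoded: index 0 is malignant, index 1 is benign
print(data['target_names'])
['malignant' 'benign']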

2. Create your training and test datasets

Ordinarily, you’d do some exploratory data analysis (EDA), correct and cleanse any issues with the data and then spend a good chunk of time on feature engineering. However, we’ll skip this for simplicity and jump straight to the model building, using a quick and dirty approach without cross-validation or tuning to keep things easy to follow. First, we need to define the model features we want to use and set the target variable. The quick way to do this with scikit-learn’s built-in datasets is to load the data with the optional return_X_y=True parameter (plus as_frame=True to get pandas objects back).

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

Next, we’ll use scikit-learn’s train_test_split() function to create a training and test dataset. We’ll assign 30% of our data to the test dataset and use the remaining 70% for training. Setting random_state=1 means we get reproducible results if we re-run the code later.

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.30, 
                                                    random_state=1)

3. Train your classification model

Now we’ll use XGBoost’s XGBClassifier to create a classification model and fit it using the data stored in X_train and y_train. We’ll then make predictions on the X_test data, which the model has not seen, and measure the model’s accuracy with the accuracy_score() function from sklearn.metrics.

This gives us a score of 0.941 (or 94.1%) which looks pretty good. However, to really understand this we need to dig a little deeper.

classifier = XGBClassifier()
model = classifier.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9415204678362573
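One reason to treat that headline figure with a little caution is the class balance of the test set: benign cases outnumber malignant ones, so a model that simply predicted benign every time would already score reasonably well. Because we loaded the data with as_frame=True, y_test is a pandas Series and you can check the balance like this.

# Proportion of each class in the test set; always predicting the majority class
# would score roughly this proportion as "accuracy"
# (roughly 63% benign (1) vs 37% malignant (0) in this particular split)
print(y_test.value_counts(normalize=True))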

4. Examine the confusion matrix

The confusion matrix gives you the background behind your model’s accuracy score. It can tell you what the model got right and where it went wrong, and understanding it can really help you make further improvements. To obtain the confusion matrix data, run the code below.

For our data, which has two classes, the confusion matrix returns a 2 × 2 array of four values. Rather unhelpfully, these aren’t labelled, which is probably why so many people struggle to comprehend the confusion matrix at first. The four values are the counts of true positives, false positives, true negatives and false negatives.

cm_data = confusion_matrix(y_test, y_pred, labels=np.unique(y_test))
cm_data
array([[ 56,   7],
       [  3, 105]])

A true positive means we predicted the tissue was malignant and were correct; a true negative means we predicted the tissue was benign and were correct. More worryingly, a false positive means we may have told a patient they had a malignant growth when they didn’t, and a false negative means we told them their growth was benign when it was actually cancerous. So, despite the healthy overall accuracy score, we really don’t want potentially life-threatening false negatives, and we’d also like to avoid false positives.

The confusion matrix is often shown in a table like the one below, which you’d think might make it easier to understand. However, there’s little consistency between authors (or even libraries) over which side the TP and TN fall on: scikit-learn’s documentation, for example, labels the top-left cell as true negatives (it treats the class labelled 1 as the positive class), whereas the layout below puts true positives top-left. It’s therefore not uncommon for the output of the confusion matrix to be misinterpreted and the values jumbled up.

                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)

To make the data less confusing and easier to interpret, we’ll write a little function to output the confusion matrix data to a Pandas DataFrame and add some descriptive labels to help us understand it. Treating malignant (label 0) as the positive class, and comparing the labelled matrix with the raw output above, had we used our model to make a diagnosis we’d have had 105 true negatives, 56 true positives, 7 false negatives (meaning we failed to diagnose seven patients with a malignant growth) and 3 false positives (patients we mistakenly told they had a malignant growth when they didn’t).

def deconfusion_matrix(y_test, y_pred):
    # scikit-learn orders the matrix by label, so row/column 0 is malignant here.
    # We treat malignant (label 0) as the positive class.
    cm = confusion_matrix(y_test, y_pred)

    tp = cm[0][0]  # actual malignant, predicted malignant
    fn = cm[0][1]  # actual malignant, predicted benign (missed diagnosis)
    fp = cm[1][0]  # actual benign, predicted malignant (false alarm)
    tn = cm[1][1]  # actual benign, predicted benign

    data = {
        'Predicted (Positive)': [tp, fp],
        'Predicted (Negative)': [fn, tn],
    }

    return pd.DataFrame(data, index=['Actual (Positive)', 'Actual (Negative)'])
    
dcm = deconfusion_matrix(y_test, y_pred)
dcm
Predicted (Positive) Predicted (Negative)
Actual (Positive) 56 7
Actual (Negative) 3 105
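As an aside, if you’re using a recent version of scikit-learn (1.0 or later), the library also includes a ConfusionMatrixDisplay class that will plot a labelled version of the same matrix for you. A minimal sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix with human-readable class labels
# (class 0 is malignant, class 1 is benign in this dataset)
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=['malignant', 'benign'])
plt.show()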

Another useful trick is to append ravel() to the confusion matrix output and unpack the four values into tn, fp, fn, tp variables so you can check your results. Be aware that this unpacking order follows scikit-learn’s convention, under which the class labelled 1 (benign here) is the positive class; to keep malignant as the positive class, as in the table above, pass labels=[1, 0].

# labels=[1, 0] keeps malignant (label 0) as the positive class
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[1, 0]).ravel()

5. Other metrics to use

The confusion matrix is extremely useful and shows why accuracy alone isn’t the ideal metric for judging your model’s overall performance. Two further metrics you should examine when assessing your models are precision and recall.

Precision is calculated as the number of true positives divided by (true positives + false positives), and recall is calculated as the number of true positives divided by (true positives + false negatives). A third metric, the F1 score, combines precision and recall; it’s calculated as 2 × ((Precision × Recall) / (Precision + Recall)). Note that scikit-learn’s scoring functions treat class 1 (benign in this dataset) as the positive class by default; pass pos_label=0 if you want to measure how well the model detects malignant growths instead. You can access these metrics from scikit-learn like this.

from sklearn.metrics import precision_score, recall_score, f1_score 

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
Precision: 0.9375
Recall: 0.9722222222222222
F1 score: 0.9545454545454546
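If you want to see how these figures tie back to the confusion matrix, you can reproduce them by hand from the ravelled counts. The sketch below uses scikit-learn’s default unpacking, under which class 1 (benign) is the positive class, which is why the numbers match the scores above rather than the malignant-as-positive figures from the previous section.

# Default label order, so tp/fp/fn here count the benign (class 1) predictions
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

precision_check = tp / (tp + fp)   # 105 / (105 + 7) = 0.9375
recall_check = tp / (tp + fn)      # 105 / (105 + 3) ≈ 0.9722
f1_check = 2 * (precision_check * recall_check) / (precision_check + recall_check)

print("Precision:", precision_check)
print("Recall:", recall_check)
print("F1 score:", f1_check)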

By keeping a close eye on precision, recall and the F1 score, you can adjust your model so that it makes the kinds of errors you can best afford. Accuracy alone isn’t always ideal: you need to consider where the model goes wrong and where the impact of those mistakes will cause the greatest harm. Obviously, when it comes to medical data you really don’t want to be misdiagnosing patients.
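One practical way to shift that balance, not covered in detail in this post, is to adjust the probability threshold the model uses rather than relying on the default cut-off of 0.5. The sketch below uses XGBClassifier’s predict_proba() with an arbitrary, hypothetical threshold of 0.3 to make the model more willing to flag a growth as malignant, trading some extra false positives for fewer false negatives.

# Column 0 of predict_proba() is the probability of class 0 (malignant)
probs = model.predict_proba(X_test)

# Hypothetical threshold: flag as malignant whenever P(malignant) exceeds 0.3
threshold = 0.3
y_pred_tuned = np.where(probs[:, 0] > threshold, 0, 1)

# Compare the tuned predictions against the originals
print(confusion_matrix(y_test, y_pred_tuned))
print(recall_score(y_test, y_pred_tuned, pos_label=0))  # recall for the malignant class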

Matt Clarke, Monday, March 01, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.