As models require numeric data and don’t like NaN
, null
, or inf
values, if you find these within your dataset you’ll need to deal with them before passing the data to your model. The approach that you take to dealing with missing data depends on how much is missing and how you might be able to fill in any gaps.
For example, if you only have a small percentage of missing values in a column of your Pandas DataFrame then you might choose to drop the affected rows or fill in the gaps with another number. However, if the vast majority of the data are missing then it might be better just to drop the entire column.
“There are two types of people in the world: 1. Those who can impute missing values.”
Pretty much every dataset you’ll handle in the real world will contain some missing values, so your first step is going to be giving it a quick assessment to find out how it looks. There are a couple of useful tricks for doing this in Pandas which make this fairly quick and easy. First, we need to find a dataset to use which contains some missing values. We’ll use the wine recognition dataset which is built into the scikit-learn package and load the data and target into a Pandas DataFrame.
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X.head()
mean radius | mean texture | mean perimeter | ... | worst concave points | worst symmetry | worst fractal dimension | |
---|---|---|---|---|---|---|---|
0 | 17.99 | 10.38 | 122.80 | ... | 0.2654 | 0.4601 | 0.11890 |
1 | 20.57 | 17.77 | 132.90 | ... | 0.1860 | 0.2750 | 0.08902 |
2 | 19.69 | 21.25 | 130.00 | ... | 0.2430 | 0.3613 | 0.08758 |
3 | 11.42 | 20.38 | 77.58 | ... | 0.2575 | 0.6638 | 0.17300 |
4 | 20.29 | 14.34 | 135.10 | ... | 0.1625 | 0.2364 | 0.07678 |
5 rows × 30 columns
Unlike a real world dataset, this one doesn’t actually contain any missing values. To workaround this unusual problem, we’ll add some missing values so we can look at some of the techniques for identifying their presence. To do this I’ll pick some random columns and rows within the DataFrame and add some NaN
with Numpy and iloc
.
X.isnull().sum(axis = 0)
mean radius 0
mean texture 0
mean perimeter 0
mean area 0
mean smoothness 0
mean compactness 0
mean concavity 0
mean concave points 0
mean symmetry 0
mean fractal dimension 0
radius error 0
texture error 0
perimeter error 0
area error 0
smoothness error 0
compactness error 0
concavity error 0
concave points error 0
symmetry error 0
fractal dimension error 0
worst radius 0
worst texture 0
worst perimeter 0
worst area 0
worst smoothness 0
worst compactness 0
worst concavity 0
worst concave points 0
worst symmetry 0
worst fractal dimension 0
dtype: int64
X['mean radius'].iloc[0:5] = np.nan
X['mean concavity'].iloc[4:16] = np.nan
X['texture error'].iloc[60:65] = np.nan
X['area error'].iloc[10:15] = np.nan
X['worst radius'].iloc[30:73] = np.nan
X['worst compactness'].iloc[9:19] = np.nan
X['worst symmetry'].iloc[25:35] = np.nan
X['mean area'].iloc[9:15] = np.nan
X['mean radius'].iloc[34:43] = np.nan
Now we can run a couple of quick one-liners to assess what’s now missing from our dataset. The command df.isnull().sum(axis = 0)
will tell us how many missing values are found in each column within the DataFrame, while df.isna().mean()
will give us the percentage of values that are missing.
# Count the total number of missing values by column
X.isnull().sum(axis = 0)
mean radius 14
mean texture 0
mean perimeter 0
mean area 6
mean smoothness 0
mean compactness 0
mean concavity 12
mean concave points 0
mean symmetry 0
mean fractal dimension 0
radius error 0
texture error 5
perimeter error 0
area error 5
smoothness error 0
compactness error 0
concavity error 0
concave points error 0
symmetry error 0
fractal dimension error 0
worst radius 43
worst texture 0
worst perimeter 0
worst area 0
worst smoothness 0
worst compactness 10
worst concavity 0
worst concave points 0
worst symmetry 10
worst fractal dimension 0
dtype: int64
# Calculate the percentage of missing values by column
X.isna().mean()
mean radius 0.024605
mean texture 0.000000
mean perimeter 0.000000
mean area 0.010545
mean smoothness 0.000000
mean compactness 0.000000
mean concavity 0.021090
mean concave points 0.000000
mean symmetry 0.000000
mean fractal dimension 0.000000
radius error 0.000000
texture error 0.008787
perimeter error 0.000000
area error 0.008787
smoothness error 0.000000
compactness error 0.000000
concavity error 0.000000
concave points error 0.000000
symmetry error 0.000000
fractal dimension error 0.000000
worst radius 0.075571
worst texture 0.000000
worst perimeter 0.000000
worst area 0.000000
worst smoothness 0.000000
worst compactness 0.017575
worst concavity 0.000000
worst concave points 0.000000
worst symmetry 0.017575
worst fractal dimension 0.000000
dtype: float64
# Show the rows with missing values
X[X.isna().any(axis=1)]
mean radius | mean texture | mean perimeter | ... | worst concave points | worst symmetry | worst fractal dimension | |
---|---|---|---|---|---|---|---|
0 | NaN | 10.38 | 122.80 | ... | 0.26540 | 0.4601 | 0.11890 |
1 | NaN | 17.77 | 132.90 | ... | 0.18600 | 0.2750 | 0.08902 |
2 | NaN | 21.25 | 130.00 | ... | 0.24300 | 0.3613 | 0.08758 |
3 | NaN | 20.38 | 77.58 | ... | 0.25750 | 0.6638 | 0.17300 |
4 | NaN | 14.34 | 135.10 | ... | 0.16250 | 0.2364 | 0.07678 |
... | ... | ... | ... | ... | ... | ... | ... |
68 | 9.029 | 17.33 | 58.79 | ... | 0.17500 | 0.4228 | 0.11750 |
69 | 12.780 | 16.49 | 81.37 | ... | 0.05882 | 0.2383 | 0.06410 |
70 | 18.940 | 21.31 | 123.60 | ... | 0.17890 | 0.2551 | 0.06589 |
71 | 8.888 | 14.64 | 58.79 | ... | 0.04786 | 0.2254 | 0.10840 |
72 | 17.200 | 24.52 | 114.20 | ... | 0.18990 | 0.3313 | 0.13390 |
67 rows × 30 columns
Now we can identify and measure the proportion of missing values in our dataset we next need to look at how we handle them. There are various approaches to doing this, but they broadly fall into two main ones: deleting them or filling them in with a random value (usually 0 or -1), or calculating a number to add (which is known as imputation or imputing).
If you’re lazy, you might just choose to drop all of the columns with missing values, which you can do using df.dropna(axis='columns')
. However, a smarter approach is to only drop the affected rows or only drop those columns where the percentage of missing values exceeds a particular threshold. The command df.dropna(thresh=len(df)*0.9, axis='columns')
tells Pandas to drop the columns where more than 10% of the values are missing. The downside of both approaches is that they throw away potentially useful data. You could also use df_drop_missing_rows = X.dropna(axis='rows')
to drop rows with missing values.
# Drop columns with missing data
df_drop_missing_cols = X.dropna(axis='columns')
To fill the values in with zeros or -1 you can use the fillna()
function within Pandas. Using df.fillna(value=0, inplace=True)
will fill with a zero, while df.fillna(value=-1, inplace=True)
will fill any missing values in the whole DataFrame with -1. This can often give better results than using zero, so is well worth trying.
# Fill missing values with zero
df_zero_imputation = X.fillna(value=0)
# Fill missing values with -1
df_minus_one_imputation = X.fillna(value=-1)
Before we move on to a more sophisticated type of imputation, lets create a simple function which trains an XGBoost classifier model to assess the impact of these approaches so we can compare them and the ones that follow. We’ll use this on each dataframe and return an F1 score that we can use to compare them.
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier
def check_performance(technique, X_train, X_test, y_train, y_test):
classifier = XGBClassifier()
model = classifier.fit(X_train, y_train)
y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred)
print(technique,' = ', f1)
By running our test classifier model with no imputation we get a baseline result of 0.958. Dropping columns which contain missing values gives us an improvement to 0.9633, but filling with both zero and minus one gives us the same score of 0.968, so this popular technique has proven the best so far, but all three methods worked better than doing nothing.
# No imputation
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.30,
random_state=1)
check_performance('No imputation', X_train, X_test, y_train, y_test)
No imputation = 0.9589041095890412
# Drop missing columns
X_train, X_test, y_train, y_test = train_test_split(df_drop_missing_cols,
y,
test_size=0.30,
random_state=1)
check_performance('Drop missing cols', X_train, X_test, y_train, y_test)
Drop missing cols = 0.963302752293578
# Zero imputation
X_train, X_test, y_train, y_test = train_test_split(df_zero_imputation,
y,
test_size=0.30,
random_state=1)
check_performance('Zero imputation', X_train, X_test, y_train, y_test)
Zero imputation = 0.9680365296803655
# Minus one imputation
X_train, X_test, y_train, y_test = train_test_split(df_minus_one_imputation,
y,
test_size=0.30,
random_state=1)
check_performance('Minus one imputation', X_train, X_test, y_train, y_test)
Minus one imputation = 0.9680365296803655
While filling in NaN
values with 0 or -1 is quick and easy and works fine, you may be able to gain a performance improvement if you use your skills to fill in (or impute) a more realistic value. Making stuff up to fill in the gaps might seem a bit weird at first, but when you think about it, it does actually make sense.
Rather than just sticking in any old value, the usual approach to imputation is to add one of three values based on the rest of the DataFrame’s column: the mean (or average), the median (the middle value) or the mode (the most frequently encountered value). Mean imputation proved to be the least effective, with an F1 of 0.9545, with both median and mode imputation scoring 0.968 - the same score as the zero imputation we used above.
# Mean imputation
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.30,
random_state=1)
imputer = SimpleImputer(strategy='mean')
X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train))
X_test_imputed = pd.DataFrame(imputer.transform(X_test))
check_performance('Mean imputation', X_train_imputed, X_test_imputed, y_train, y_test)
SimpleImputer mean = 0.9545454545454546
# Median imputation
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.30,
random_state=1)
imputer = SimpleImputer(strategy='median')
X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train))
X_test_imputed = pd.DataFrame(imputer.transform(X_test))
check_performance('Median imputation', X_train_imputed, X_test_imputed, y_train, y_test)
SimpleImputer median = 0.9680365296803655
# Mode imputation
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.30,
random_state=1)
imputer = SimpleImputer(strategy='most_frequent')
X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train))
X_test_imputed = pd.DataFrame(imputer.transform(X_test))
check_performance('Mode imputation', X_train_imputed, X_test_imputed, y_train, y_test)
SimpleImputer most_frequent = 0.9680365296803655
It’s probable that you’ll get slightly different results with different datasets and different models, but it’s a good way of showing you how missing values affect data and what sort of performance improvements you can gain, simply by deleting, zero filling or imputing missing data.
Imputation strategy | F1 score |
---|---|
No imputaton | 0.9589 |
Drop missing columns | 0.9633 |
Zero imputation | 0.968 |
Minus one imputation | 0.968 |
Mean imputation | 0.9545 |
Median imputation | 0.968 |
Mode imputation | 0.968 |
Matt Clarke, Monday, March 01, 2021