How to impute missing numeric values in your dataset

17 minutes to read

As models require numeric data and don’t like NaN, null, or inf values, if you find these within your dataset you’ll need to deal with them before passing the data to your model. The approach that you take to dealing with missing data depends on how much is missing and how you might be able to fill in any gaps.

For example, if you only have a small percentage of missing values in a column of your Pandas DataFrame then you might choose to drop the affected rows or fill in the gaps with another number. However, if the vast majority of the data are missing then it might be better just to drop the entire column.

“There are two types of people in the world: 1. Those who can impute missing values.”

Identifying where data are missing

Pretty much every dataset you’ll handle in the real world will contain some missing values, so your first step is going to be giving it a quick assessment to find out how it looks. There are a couple of useful tricks for doing this in Pandas which make this fairly quick and easy. First, we need to find a dataset to use which contains some missing values. We’ll use the wine recognition dataset which is built into the scikit-learn package and load the data and target into a Pandas DataFrame.

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X.head()

	mean radius	mean texture	mean perimeter	...	worst concave points	worst symmetry	worst fractal dimension
0	17.99	10.38	122.80	...	0.2654	0.4601	0.11890
1	20.57	17.77	132.90	...	0.1860	0.2750	0.08902
2	19.69	21.25	130.00	...	0.2430	0.3613	0.08758
3	11.42	20.38	77.58	...	0.2575	0.6638	0.17300
4	20.29	14.34	135.10	...	0.1625	0.2364	0.07678

5 rows × 30 columns

Unlike a real world dataset, this one doesn’t actually contain any missing values. To workaround this unusual problem, we’ll add some missing values so we can look at some of the techniques for identifying their presence. To do this I’ll pick some random columns and rows within the DataFrame and add some NaN with Numpy and iloc.

X.isnull().sum(axis = 0)

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64

X['mean radius'].iloc[0:5] = np.nan
X['mean concavity'].iloc[4:16] = np.nan
X['texture error'].iloc[60:65] = np.nan
X['area error'].iloc[10:15] = np.nan
X['worst radius'].iloc[30:73] = np.nan
X['worst compactness'].iloc[9:19] = np.nan
X['worst symmetry'].iloc[25:35] = np.nan
X['mean area'].iloc[9:15] = np.nan
X['mean radius'].iloc[34:43] = np.nan

Now we can run a couple of quick one-liners to assess what’s now missing from our dataset. The command df.isnull().sum(axis = 0) will tell us how many missing values are found in each column within the DataFrame, while df.isna().mean() will give us the percentage of values that are missing.

# Count the total number of missing values by column
X.isnull().sum(axis = 0)

mean radius                14
mean texture                0
mean perimeter              0
mean area                   6
mean smoothness             0
mean compactness            0
mean concavity             12
mean concave points         0
mean symmetry               0
mean fractal dimension      0
radius error                0
texture error               5
perimeter error             0
area error                  5
smoothness error            0
compactness error           0
concavity error             0
concave points error        0
symmetry error              0
fractal dimension error     0
worst radius               43
worst texture               0
worst perimeter             0
worst area                  0
worst smoothness            0
worst compactness          10
worst concavity             0
worst concave points        0
worst symmetry             10
worst fractal dimension     0
dtype: int64

# Calculate the percentage of missing values by column
X.isna().mean()

mean radius                0.024605
mean texture               0.000000
mean perimeter             0.000000
mean area                  0.010545
mean smoothness            0.000000
mean compactness           0.000000
mean concavity             0.021090
mean concave points        0.000000
mean symmetry              0.000000
mean fractal dimension     0.000000
radius error               0.000000
texture error              0.008787
perimeter error            0.000000
area error                 0.008787
smoothness error           0.000000
compactness error          0.000000
concavity error            0.000000
concave points error       0.000000
symmetry error             0.000000
fractal dimension error    0.000000
worst radius               0.075571
worst texture              0.000000
worst perimeter            0.000000
worst area                 0.000000
worst smoothness           0.000000
worst compactness          0.017575
worst concavity            0.000000
worst concave points       0.000000
worst symmetry             0.017575
worst fractal dimension    0.000000
dtype: float64

# Show the rows with missing values
X[X.isna().any(axis=1)]

	mean radius	mean texture	mean perimeter	...	worst concave points	worst symmetry	worst fractal dimension
0	NaN	10.38	122.80	...	0.26540	0.4601	0.11890
1	NaN	17.77	132.90	...	0.18600	0.2750	0.08902
2	NaN	21.25	130.00	...	0.24300	0.3613	0.08758
3	NaN	20.38	77.58	...	0.25750	0.6638	0.17300
4	NaN	14.34	135.10	...	0.16250	0.2364	0.07678
...	...	...	...	...	...	...	...
68	9.029	17.33	58.79	...	0.17500	0.4228	0.11750
69	12.780	16.49	81.37	...	0.05882	0.2383	0.06410
70	18.940	21.31	123.60	...	0.17890	0.2551	0.06589
71	8.888	14.64	58.79	...	0.04786	0.2254	0.10840
72	17.200	24.52	114.20	...	0.18990	0.3313	0.13390

67 rows × 30 columns

Dropping missing values

Now we can identify and measure the proportion of missing values in our dataset we next need to look at how we handle them. There are various approaches to doing this, but they broadly fall into two main ones: deleting them or filling them in with a random value (usually 0 or -1), or calculating a number to add (which is known as imputation or imputing).

If you’re lazy, you might just choose to drop all of the columns with missing values, which you can do using df.dropna(axis='columns'). However, a smarter approach is to only drop the affected rows or only drop those columns where the percentage of missing values exceeds a particular threshold. The command df.dropna(thresh=len(df)*0.9, axis='columns') tells Pandas to drop the columns where more than 10% of the values are missing. The downside of both approaches is that they throw away potentially useful data. You could also use df_drop_missing_rows = X.dropna(axis='rows') to drop rows with missing values.

# Drop columns with missing data
df_drop_missing_cols = X.dropna(axis='columns')

To fill the values in with zeros or -1 you can use the fillna() function within Pandas. Using df.fillna(value=0, inplace=True) will fill with a zero, while df.fillna(value=-1, inplace=True) will fill any missing values in the whole DataFrame with -1. This can often give better results than using zero, so is well worth trying.

# Fill missing values with zero
df_zero_imputation = X.fillna(value=0)

# Fill missing values with -1
df_minus_one_imputation = X.fillna(value=-1)

Before we move on to a more sophisticated type of imputation, lets create a simple function which trains an XGBoost classifier model to assess the impact of these approaches so we can compare them and the ones that follow. We’ll use this on each dataframe and return an F1 score that we can use to compare them.

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

def check_performance(technique, X_train, X_test, y_train, y_test):
    classifier = XGBClassifier()
    model = classifier.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    
    print(technique,' = ', f1)

By running our test classifier model with no imputation we get a baseline result of 0.958. Dropping columns which contain missing values gives us an improvement to 0.9633, but filling with both zero and minus one gives us the same score of 0.968, so this popular technique has proven the best so far, but all three methods worked better than doing nothing.

# No imputation
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.30, 
                                                    random_state=1)
check_performance('No imputation', X_train, X_test, y_train, y_test)

No imputation  =  0.9589041095890412

# Drop missing columns
X_train, X_test, y_train, y_test = train_test_split(df_drop_missing_cols, 
                                                    y, 
                                                    test_size=0.30, 
                                                    random_state=1)
check_performance('Drop missing cols', X_train, X_test, y_train, y_test)

Drop missing cols  =  0.963302752293578

# Zero imputation
X_train, X_test, y_train, y_test = train_test_split(df_zero_imputation, 
                                                    y, 
                                                    test_size=0.30, 
                                                    random_state=1)
check_performance('Zero imputation', X_train, X_test, y_train, y_test)

Zero imputation  =  0.9680365296803655

# Minus one imputation
X_train, X_test, y_train, y_test = train_test_split(df_minus_one_imputation, 
                                                    y, 
                                                    test_size=0.30, 
                                                    random_state=1)
check_performance('Minus one imputation', X_train, X_test, y_train, y_test)

Minus one imputation  =  0.9680365296803655

Filling in the gaps with imputation

While filling in NaN values with 0 or -1 is quick and easy and works fine, you may be able to gain a performance improvement if you use your skills to fill in (or impute) a more realistic value. Making stuff up to fill in the gaps might seem a bit weird at first, but when you think about it, it does actually make sense.

Rather than just sticking in any old value, the usual approach to imputation is to add one of three values based on the rest of the DataFrame’s column: the mean (or average), the median (the middle value) or the mode (the most frequently encountered value). Mean imputation proved to be the least effective, with an F1 of 0.9545, with both median and mode imputation scoring 0.968 - the same score as the zero imputation we used above.

# Mean imputation
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.30, 
                                                    random_state=1)

imputer = SimpleImputer(strategy='mean')
X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train))
X_test_imputed = pd.DataFrame(imputer.transform(X_test))

check_performance('Mean imputation', X_train_imputed, X_test_imputed, y_train, y_test)

SimpleImputer mean  =  0.9545454545454546

# Median imputation
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.30, 
                                                    random_state=1)

imputer = SimpleImputer(strategy='median')
X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train))
X_test_imputed = pd.DataFrame(imputer.transform(X_test))

check_performance('Median imputation', X_train_imputed, X_test_imputed, y_train, y_test)

SimpleImputer median  =  0.9680365296803655

# Mode imputation
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.30, 
                                                    random_state=1)

imputer = SimpleImputer(strategy='most_frequent')
X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train))
X_test_imputed = pd.DataFrame(imputer.transform(X_test))

check_performance('Mode imputation', X_train_imputed, X_test_imputed, y_train, y_test)

SimpleImputer most_frequent  =  0.9680365296803655

It’s probable that you’ll get slightly different results with different datasets and different models, but it’s a good way of showing you how missing values affect data and what sort of performance improvements you can gain, simply by deleting, zero filling or imputing missing data.

Imputation strategy	F1 score
No imputaton	0.9589
Drop missing columns	0.9633
Zero imputation	0.968
Minus one imputation	0.968
Mean imputation	0.9545
Median imputation	0.968
Mode imputation	0.968

Matt Clarke, Monday, March 01, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.