Real world data is rarely clean, and you’ll often encounter missing values when working with Pandas dataframes. Missing values can lead to errors in your code, and can cause models to fail or be inaccurate, so it’s vital that you check for them.
In this project, I’ll show you how you can use the Pandas isna()
function to check for missing values in your dataframe, and in specific columns, and how to create a helper function that determines the percentage of missing values overall, so you can decide on the best approach to dealing with them.
To get started, open a Jupyter notebook and import the Pandas package. You’ll likely already have Pandas installed, but if you don’t you can install it by entering !pip3 install pandas
in a Jupyter cell or pip3 install pandas
in your terminal.
import pandas as pd
Next, we’ll load a dataset into Pandas that includes some missing values. I’ve used a dataset on food consumption from OpenMV, but you can use any data you like.
If you run df.info()
on the dataframe you’ll see that it contains an object
column containing the country name, and then a load of numeric data columns containing information on food consumption.
df = pd.read_csv('food-consumption.csv')
df.head()
Country | Real coffee | Instant coffee | Tea | Sweetener | Biscuits | Powder soup | Tin soup | Potatoes | Frozen fish | ... | Apples | Oranges | Tinned fruit | Jam | Garlic | Butter | Margarine | Olive oil | Yoghurt | Crisp bread | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Germany | 90 | 49 | 88 | 19.0 | 57.0 | 51 | 19 | 21 | 27 | ... | 81 | 75 | 44 | 71 | 22 | 91 | 85 | 74 | 30.0 | 26 |
1 | Italy | 82 | 10 | 60 | 2.0 | 55.0 | 41 | 3 | 2 | 4 | ... | 67 | 71 | 9 | 46 | 80 | 66 | 24 | 94 | 5.0 | 18 |
2 | France | 88 | 42 | 63 | 4.0 | 76.0 | 53 | 11 | 23 | 11 | ... | 87 | 84 | 40 | 45 | 88 | 94 | 47 | 36 | 57.0 | 3 |
3 | Holland | 96 | 62 | 98 | 32.0 | 62.0 | 67 | 43 | 7 | 14 | ... | 83 | 89 | 61 | 81 | 15 | 31 | 97 | 13 | 53.0 | 15 |
4 | Belgium | 94 | 38 | 48 | 11.0 | 74.0 | 37 | 23 | 9 | 13 | ... | 76 | 76 | 42 | 57 | 29 | 84 | 80 | 83 | 20.0 | 5 |
5 rows × 21 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 16 non-null object
1 Real coffee 16 non-null int64
2 Instant coffee 16 non-null int64
3 Tea 16 non-null int64
4 Sweetener 15 non-null float64
5 Biscuits 15 non-null float64
6 Powder soup 16 non-null int64
7 Tin soup 16 non-null int64
8 Potatoes 16 non-null int64
9 Frozen fish 16 non-null int64
10 Frozen veggies 16 non-null int64
11 Apples 16 non-null int64
12 Oranges 16 non-null int64
13 Tinned fruit 16 non-null int64
14 Jam 16 non-null int64
15 Garlic 16 non-null int64
16 Butter 16 non-null int64
17 Margarine 16 non-null int64
18 Olive oil 16 non-null int64
19 Yoghurt 15 non-null float64
20 Crisp bread 16 non-null int64
dtypes: float64(3), int64(17), object(1)
memory usage: 2.8+ KB
The easiest way to check for missing values in a Pandas dataframe is via the isna()
function. The isna()
function returns a boolean (True or False) value if the Pandas column value is missing, so if you run df.isna()
you’ll get back a dataframe showing you a load of boolean values.
df.isna().head()
Country | Real coffee | Instant coffee | Tea | Sweetener | Biscuits | Powder soup | Tin soup | Potatoes | Frozen fish | ... | Apples | Oranges | Tinned fruit | Jam | Garlic | Butter | Margarine | Olive oil | Yoghurt | Crisp bread | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
5 rows × 21 columns
That’s not usually very useful, so instead we’ll calculate the sum()
of missing values by running df.isna().sum()
. This returns the columns in our Pandas dataframe along with the number of missing values detected in each one, so 0 means there are no missing values, and 1 means there is a single missing value.
df.isna().sum()
Country 0
Real coffee 0
Instant coffee 0
Tea 0
Sweetener 1
Biscuits 1
Powder soup 0
Tin soup 0
Potatoes 0
Frozen fish 0
Frozen veggies 0
Apples 0
Oranges 0
Tinned fruit 0
Jam 0
Garlic 0
Butter 0
Margarine 0
Olive oil 0
Yoghurt 1
Crisp bread 0
dtype: int64
You can also run this code on a single Pandas column, or series, or a subset of columns, if you don’t need to check the entire dataframe.
df['Sweetener'].isna().sum()
1
df[['Sweetener', 'Frozen fish']].isna().sum()
Sweetener 1
Frozen fish 0
dtype: int64
To take this a step further, we can create a reusable helper function you can use in your projects. We’ll call this show_missing()
and we’ll loop through each column in the dataframe and return some useful data on each one.
We’ll detect the Pandas dtypes, count the number of values, count the number of unique values, count the number of missing values, and calculate the percentage of values that are missing overall.
def show_missing(df):
"""Return a Pandas dataframe describing the contents of a source dataframe including missing values."""
variables = []
dtypes = []
count = []
unique = []
missing = []
pc_missing = []
for item in df.columns:
variables.append(item)
dtypes.append(df[item].dtype)
count.append(len(df[item]))
unique.append(len(df[item].unique()))
missing.append(df[item].isna().sum())
pc_missing.append(round((df[item].isna().sum() / len(df[item])) * 100, 2))
output = pd.DataFrame({
'variable': variables,
'dtype': dtypes,
'count': count,
'unique': unique,
'missing': missing,
'pc_missing': pc_missing
})
return output
show_missing(df)
variable | dtype | count | unique | missing | pc_missing | |
---|---|---|---|---|---|---|
0 | Country | object | 16 | 16 | 0 | 0.00 |
1 | Real coffee | int64 | 16 | 14 | 0 | 0.00 |
2 | Instant coffee | int64 | 16 | 15 | 0 | 0.00 |
3 | Tea | int64 | 16 | 15 | 0 | 0.00 |
4 | Sweetener | float64 | 16 | 14 | 1 | 6.25 |
5 | Biscuits | float64 | 16 | 14 | 1 | 6.25 |
6 | Powder soup | int64 | 16 | 14 | 0 | 0.00 |
7 | Tin soup | int64 | 16 | 13 | 0 | 0.00 |
8 | Potatoes | int64 | 16 | 11 | 0 | 0.00 |
9 | Frozen fish | int64 | 16 | 15 | 0 | 0.00 |
10 | Frozen veggies | int64 | 16 | 13 | 0 | 0.00 |
11 | Apples | int64 | 16 | 14 | 0 | 0.00 |
12 | Oranges | int64 | 16 | 15 | 0 | 0.00 |
13 | Tinned fruit | int64 | 16 | 15 | 0 | 0.00 |
14 | Jam | int64 | 16 | 16 | 0 | 0.00 |
15 | Garlic | int64 | 16 | 13 | 0 | 0.00 |
16 | Butter | int64 | 16 | 15 | 0 | 0.00 |
17 | Margarine | int64 | 16 | 13 | 0 | 0.00 |
18 | Olive oil | int64 | 16 | 15 | 0 | 0.00 |
19 | Yoghurt | float64 | 16 | 14 | 1 | 6.25 |
20 | Crisp bread | int64 | 16 | 15 | 0 | 0.00 |
Dealing with missing values is a subject in its own right. There are various approaches you can use which range from imputing missing values to dropping them entirely.
Matt Clarke, Friday, September 02, 2022