How to use isna() to check for missing values in a Pandas dataframe

Picture by Pixabay, Pexels.

10 minutes to read

Data Science Pandas

Real world data is rarely clean, and you’ll often encounter missing values when working with Pandas dataframes. Missing values can lead to errors in your code, and can cause models to fail or be inaccurate, so it’s vital that you check for them.

In this project, I’ll show you how you can use the Pandas isna() function to check for missing values in your dataframe, and in specific columns, and how to create a helper function that determines the percentage of missing values overall, so you can decide on the best approach to dealing with them.

Import the packages

To get started, open a Jupyter notebook and import the Pandas package. You’ll likely already have Pandas installed, but if you don’t you can install it by entering !pip3 install pandas in a Jupyter cell or pip3 install pandas in your terminal.

import pandas as pd

Load the data

Next, we’ll load a dataset into Pandas that includes some missing values. I’ve used a dataset on food consumption from OpenMV, but you can use any data you like.

If you run df.info() on the dataframe you’ll see that it contains an object column containing the country name, and then a load of numeric data columns containing information on food consumption.

df = pd.read_csv('food-consumption.csv')
df.head()

	Country	Real coffee	Instant coffee	Tea	Sweetener	Biscuits	Powder soup	Tin soup	Potatoes	Frozen fish	...	Apples	Oranges	Tinned fruit	Jam	Garlic	Butter	Margarine	Olive oil	Yoghurt	Crisp bread
0	Germany	90	49	88	19.0	57.0	51	19	21	27	...	81	75	44	71	22	91	85	74	30.0	26
1	Italy	82	10	60	2.0	55.0	41	3	2	4	...	67	71	9	46	80	66	24	94	5.0	18
2	France	88	42	63	4.0	76.0	53	11	23	11	...	87	84	40	45	88	94	47	36	57.0	3
3	Holland	96	62	98	32.0	62.0	67	43	7	14	...	83	89	61	81	15	31	97	13	53.0	15
4	Belgium	94	38	48	11.0	74.0	37	23	9	13	...	76	76	42	57	29	84	80	83	20.0	5

5 rows × 21 columns

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 Country         16 non-null     object 
 Real coffee     16 non-null     int64  
 Instant coffee  16 non-null     int64  
 Tea             16 non-null     int64  
 Sweetener       15 non-null     float64
 Biscuits        15 non-null     float64
 Powder soup     16 non-null     int64  
 Tin soup        16 non-null     int64  
 Potatoes        16 non-null     int64  
 Frozen fish     16 non-null     int64  
Frozen veggies  16 non-null     int64  
Apples          16 non-null     int64  
Oranges         16 non-null     int64  
Tinned fruit    16 non-null     int64  
Jam             16 non-null     int64  
Garlic          16 non-null     int64  
Butter          16 non-null     int64  
Margarine       16 non-null     int64  
Olive oil       16 non-null     int64  
Yoghurt         15 non-null     float64
Crisp bread     16 non-null     int64  
dtypes: float64(3), int64(17), object(1)
memory usage: 2.8+ KB

Check for missing values

The easiest way to check for missing values in a Pandas dataframe is via the isna() function. The isna() function returns a boolean (True or False) value if the Pandas column value is missing, so if you run df.isna() you’ll get back a dataframe showing you a load of boolean values.

df.isna().head()

	Country	Real coffee	Instant coffee	Tea	Sweetener	Biscuits	Powder soup	Tin soup	Potatoes	Frozen fish	...	Apples	Oranges	Tinned fruit	Jam	Garlic	Butter	Margarine	Olive oil	Yoghurt	Crisp bread
0	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False

5 rows × 21 columns

That’s not usually very useful, so instead we’ll calculate the sum() of missing values by running df.isna().sum(). This returns the columns in our Pandas dataframe along with the number of missing values detected in each one, so 0 means there are no missing values, and 1 means there is a single missing value.

df.isna().sum()

Country           0
Real coffee       0
Instant coffee    0
Tea               0
Sweetener         1
Biscuits          1
Powder soup       0
Tin soup          0
Potatoes          0
Frozen fish       0
Frozen veggies    0
Apples            0
Oranges           0
Tinned fruit      0
Jam               0
Garlic            0
Butter            0
Margarine         0
Olive oil         0
Yoghurt           1
Crisp bread       0
dtype: int64

You can also run this code on a single Pandas column, or series, or a subset of columns, if you don’t need to check the entire dataframe.

df['Sweetener'].isna().sum()

df[['Sweetener', 'Frozen fish']].isna().sum()

Sweetener      1
Frozen fish    0
dtype: int64

Create a helper function to summarise missing data

To take this a step further, we can create a reusable helper function you can use in your projects. We’ll call this show_missing() and we’ll loop through each column in the dataframe and return some useful data on each one.

We’ll detect the Pandas dtypes, count the number of values, count the number of unique values, count the number of missing values, and calculate the percentage of values that are missing overall.

def show_missing(df):
    """Return a Pandas dataframe describing the contents of a source dataframe including missing values."""
    
    variables = []
    dtypes = []
    count = []
    unique = []
    missing = []
    pc_missing = []
    
    for item in df.columns:
        variables.append(item)
        dtypes.append(df[item].dtype)
        count.append(len(df[item]))
        unique.append(len(df[item].unique()))
        missing.append(df[item].isna().sum())
        pc_missing.append(round((df[item].isna().sum() / len(df[item])) * 100, 2))

    output = pd.DataFrame({
        'variable': variables, 
        'dtype': dtypes,
        'count': count,
        'unique': unique,
        'missing': missing, 
        'pc_missing': pc_missing
    })    
        
    return output

show_missing(df)

	variable	dtype	count	unique	missing	pc_missing
0	Country	object	16	16	0	0.00
1	Real coffee	int64	16	14	0	0.00
2	Instant coffee	int64	16	15	0	0.00
3	Tea	int64	16	15	0	0.00
4	Sweetener	float64	16	14	1	6.25
5	Biscuits	float64	16	14	1	6.25
6	Powder soup	int64	16	14	0	0.00
7	Tin soup	int64	16	13	0	0.00
8	Potatoes	int64	16	11	0	0.00
9	Frozen fish	int64	16	15	0	0.00
10	Frozen veggies	int64	16	13	0	0.00
11	Apples	int64	16	14	0	0.00
12	Oranges	int64	16	15	0	0.00
13	Tinned fruit	int64	16	15	0	0.00
14	Jam	int64	16	16	0	0.00
15	Garlic	int64	16	13	0	0.00
16	Butter	int64	16	15	0	0.00
17	Margarine	int64	16	13	0	0.00
18	Olive oil	int64	16	15	0	0.00
19	Yoghurt	float64	16	14	1	6.25
20	Crisp bread	int64	16	15	0	0.00

Dealing with missing values is a subject in its own right. There are various approaches you can use which range from imputing missing values to dropping them entirely.

Matt Clarke, Friday, September 02, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.