How to use the Pandas filter() function

Learn how to use the Pandas filter() function to filter or subset a dataframe based on the column names using their names, substring matches, or regular expressions.

How to use the Pandas filter() function
Picture by Mart Production, Pexels.
7 minutes to read

The Pandas filter() function is used to filter a dataframe based on the column names, rather than the column values, and is useful in creating a subset dataframe containing only those columns in which you are interested.

While the Pandas query() function might look similar, query() filters a dataframe based on the column values, not the column names, so it’s quite different really.

In this tutorial, we’ll go over the basic features of the Pandas filter() method so you can see how easy it is to use. Since filter() takes a dataframe as its input and returns a transformed dataframe as its output, it can be used with the modern method chaining technique.

Using the Pandas filter() method

The Pandas filter() method is very simple. It’s just another way of sub-setting a dataframe to include only those columns whose names match your search criteria. It takes four arguments:

Argument Description
items The items argument takes a list containing one or more column names. When provided, the function will return a new dataframe containing only those columns specified in the list.
like The like argument takes a string value and will return a dataframe containing only those columns that include the substring in the column name.
regex The regex argument is similar to like, but instead of returning column names that contain a substring, it returns those columns whose names match a Python regular expression.
axis The axis argument allows you to specify whether you want to filter the dataframe based on the column names or the index names.

Create a Pandas dataframe

To get started working through these simple examples, open a Jupyter notebook and import the Pandas library using the import pandas as pd naming convention. Then, either import your own data to filter, or create a Pandas dataframe containing some dummy data.

import pandas as pd
df = pd.DataFrame(
    [('Pterophyllum altum', 'Cichlidae', 15, 12.3),
        ('Pterophyllum scalare', 'Cichlidae', 10, 9.2),
        ('Pterophyllum leopoldi', 'Cichlidae', 8, 6.5), 
        ('Corydoras paleatus', 'Callichthyidae', 5, 0.5),
        ('Corydoras aeneus', 'Callichthyidae', 7, 2.5),
        ('Corydoras trilineatus', 'Callichthyidae', 5, 1.5),
        ('Corydoras habrosus', 'Callichthyidae', 3, 0.5),
    ],
    columns=['species', 'family', 'length', 'weight']
)
df
species family length weight
0 Pterophyllum altum Cichlidae 15 12.3
1 Pterophyllum scalare Cichlidae 10 9.2
2 Pterophyllum leopoldi Cichlidae 8 6.5
3 Corydoras paleatus Callichthyidae 5 0.5
4 Corydoras aeneus Callichthyidae 7 2.5
5 Corydoras trilineatus Callichthyidae 5 1.5
6 Corydoras habrosus Callichthyidae 3 0.5

Use filter() to select specific columns

At its most basic, the filter() method takes an argument called items which accepts a list of Pandas column names you want to select. Calling df.filter(items=['species', 'family']) does exactly the same as df[['species', 'family']].

When you call filter() it returns a transformed dataframe and doesn’t change the original, so if you want to retain the transformed dataframe you’ll need to save it to a new variable or overwrite the original. For example, df_subset = df.filter(items=['species', 'family']).

df.filter(items=['species', 'family'])
species family
0 Pterophyllum altum Cichlidae
1 Pterophyllum scalare Cichlidae
2 Pterophyllum leopoldi Cichlidae
3 Corydoras paleatus Callichthyidae
4 Corydoras aeneus Callichthyidae
5 Corydoras trilineatus Callichthyidae
6 Corydoras habrosus Callichthyidae

Use filter() to select column names containing a string

The like argument is used to select only those columns that contain a specific value. For example, to return a subset dataframe containing only the column names that contain an e we can pass in the argument like='e' and set axis=1.

df.filter(like='e', axis=1)
species length weight
0 Pterophyllum altum 15 12.3
1 Pterophyllum scalare 10 9.2
2 Pterophyllum leopoldi 8 6.5
3 Corydoras paleatus 5 0.5
4 Corydoras aeneus 7 2.5
5 Corydoras trilineatus 5 1.5
6 Corydoras habrosus 3 0.5

Use filter() to select column names based on a regex

The like argument returns column names that match a substring value. However, the regex argument allows you to pass in a Python regular expression. For example, to return only the columns that start with len we can use the regex ^len.

df.filter(regex='^len', axis=1)
length
0 15
1 10
2 8
3 5
4 7
5 5
6 3

Use the filter() function with method chaining

As mentioned above, since the filter() function is appended to a Pandas dataframe object to obtain its input data and returns a dataframe as its output, it lends itself to being used in method chaining. Each step in the chain transforms the dataframe and passes its output to the next step, creating a list of operations that are performed sequentially.

df_subset = (
    df
    .filter(items=['species', 'family', 'weight'])
    .query('family == "Cichlidae"')
)

df_subset
species family weight
0 Pterophyllum altum Cichlidae 12.3
1 Pterophyllum scalare Cichlidae 9.2
2 Pterophyllum leopoldi Cichlidae 6.5

Matt Clarke, Monday, January 16, 2023

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.