How to use the Pandas filter() function

Picture by Mart Production, Pexels.

7 minutes to read

Data Science Pandas

The Pandas filter() function is used to filter a dataframe based on the column names, rather than the column values, and is useful in creating a subset dataframe containing only those columns in which you are interested.

While the Pandas query() function might look similar, query() filters a dataframe based on the column values, not the column names, so it’s quite different really.

In this tutorial, we’ll go over the basic features of the Pandas filter() method so you can see how easy it is to use. Since filter() takes a dataframe as its input and returns a transformed dataframe as its output, it can be used with the modern method chaining technique.

Using the Pandas filter() method

The Pandas filter() method is very simple. It’s just another way of sub-setting a dataframe to include only those columns whose names match your search criteria. It takes four arguments:

Argument	Description
`items`	The `items` argument takes a list containing one or more column names. When provided, the function will return a new dataframe containing only those columns specified in the list.
`like`	The `like` argument takes a string value and will return a dataframe containing only those columns that include the substring in the column name.
`regex`	The `regex` argument is similar to `like`, but instead of returning column names that contain a substring, it returns those columns whose names match a Python regular expression.
`axis`	The `axis` argument allows you to specify whether you want to filter the dataframe based on the column names or the index names.

Create a Pandas dataframe

To get started working through these simple examples, open a Jupyter notebook and import the Pandas library using the import pandas as pd naming convention. Then, either import your own data to filter, or create a Pandas dataframe containing some dummy data.

import pandas as pd

df = pd.DataFrame(
    [('Pterophyllum altum', 'Cichlidae', 15, 12.3),
        ('Pterophyllum scalare', 'Cichlidae', 10, 9.2),
        ('Pterophyllum leopoldi', 'Cichlidae', 8, 6.5), 
        ('Corydoras paleatus', 'Callichthyidae', 5, 0.5),
        ('Corydoras aeneus', 'Callichthyidae', 7, 2.5),
        ('Corydoras trilineatus', 'Callichthyidae', 5, 1.5),
        ('Corydoras habrosus', 'Callichthyidae', 3, 0.5),
    ],
    columns=['species', 'family', 'length', 'weight']
)
df

	species	family	length	weight
0	Pterophyllum altum	Cichlidae	15	12.3
1	Pterophyllum scalare	Cichlidae	10	9.2
2	Pterophyllum leopoldi	Cichlidae	8	6.5
3	Corydoras paleatus	Callichthyidae	5	0.5
4	Corydoras aeneus	Callichthyidae	7	2.5
5	Corydoras trilineatus	Callichthyidae	5	1.5
6	Corydoras habrosus	Callichthyidae	3	0.5

Use filter() to select specific columns

At its most basic, the filter() method takes an argument called items which accepts a list of Pandas column names you want to select. Calling df.filter(items=['species', 'family']) does exactly the same as df[['species', 'family']].

When you call filter() it returns a transformed dataframe and doesn’t change the original, so if you want to retain the transformed dataframe you’ll need to save it to a new variable or overwrite the original. For example, df_subset = df.filter(items=['species', 'family']).

df.filter(items=['species', 'family'])

	species	family
0	Pterophyllum altum	Cichlidae
1	Pterophyllum scalare	Cichlidae
2	Pterophyllum leopoldi	Cichlidae
3	Corydoras paleatus	Callichthyidae
4	Corydoras aeneus	Callichthyidae
5	Corydoras trilineatus	Callichthyidae
6	Corydoras habrosus	Callichthyidae

Use filter() to select column names containing a string

The like argument is used to select only those columns that contain a specific value. For example, to return a subset dataframe containing only the column names that contain an e we can pass in the argument like='e' and set axis=1.

df.filter(like='e', axis=1)

	species	length	weight
0	Pterophyllum altum	15	12.3
1	Pterophyllum scalare	10	9.2
2	Pterophyllum leopoldi	8	6.5
3	Corydoras paleatus	5	0.5
4	Corydoras aeneus	7	2.5
5	Corydoras trilineatus	5	1.5
6	Corydoras habrosus	3	0.5

Use filter() to select column names based on a regex

The like argument returns column names that match a substring value. However, the regex argument allows you to pass in a Python regular expression. For example, to return only the columns that start with len we can use the regex ^len.

df.filter(regex='^len', axis=1)

	length
0	15
1	10
2	8
3	5
4	7
5	5
6	3

Use the filter() function with method chaining

As mentioned above, since the filter() function is appended to a Pandas dataframe object to obtain its input data and returns a dataframe as its output, it lends itself to being used in method chaining. Each step in the chain transforms the dataframe and passes its output to the next step, creating a list of operations that are performed sequentially.

df_subset = (
    df
    .filter(items=['species', 'family', 'weight'])
    .query('family == "Cichlidae"')
)

df_subset

	species	family	weight
0	Pterophyllum altum	Cichlidae	12.3
1	Pterophyllum scalare	Cichlidae	9.2
2	Pterophyllum leopoldi	Cichlidae	6.5

Matt Clarke, Monday, January 16, 2023

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.