The Pandas filter()
function is used to filter a dataframe based on the column names, rather than the column values, and is useful in creating a subset dataframe containing only those columns in which you are interested.
While the Pandas query()
function might look similar, query()
filters a dataframe based on the column values, not the column names, so it’s quite different really.
In this tutorial, we’ll go over the basic features of the Pandas filter()
method so you can see how easy it is to use. Since filter()
takes a dataframe as its input and returns a transformed dataframe as its output, it can be used with the modern method chaining technique.
The Pandas filter()
method is very simple. It’s just another way of sub-setting a dataframe to include only those columns whose names match your search criteria. It takes four arguments:
Argument | Description |
---|---|
items |
The items argument takes a list containing one or more column names. When provided, the function will return a new dataframe containing only those columns specified in the list. |
like |
The like argument takes a string value and will return a dataframe containing only those columns that include the substring in the column name. |
regex |
The regex argument is similar to like , but instead of returning column names that contain a substring, it returns those columns whose names match a Python regular expression. |
axis |
The axis argument allows you to specify whether you want to filter the dataframe based on the column names or the index names. |
To get started working through these simple examples, open a Jupyter notebook and import the Pandas library using the import pandas as pd
naming convention. Then, either import your own data to filter, or create a Pandas dataframe containing some dummy data.
import pandas as pd
df = pd.DataFrame(
[('Pterophyllum altum', 'Cichlidae', 15, 12.3),
('Pterophyllum scalare', 'Cichlidae', 10, 9.2),
('Pterophyllum leopoldi', 'Cichlidae', 8, 6.5),
('Corydoras paleatus', 'Callichthyidae', 5, 0.5),
('Corydoras aeneus', 'Callichthyidae', 7, 2.5),
('Corydoras trilineatus', 'Callichthyidae', 5, 1.5),
('Corydoras habrosus', 'Callichthyidae', 3, 0.5),
],
columns=['species', 'family', 'length', 'weight']
)
df
species | family | length | weight | |
---|---|---|---|---|
0 | Pterophyllum altum | Cichlidae | 15 | 12.3 |
1 | Pterophyllum scalare | Cichlidae | 10 | 9.2 |
2 | Pterophyllum leopoldi | Cichlidae | 8 | 6.5 |
3 | Corydoras paleatus | Callichthyidae | 5 | 0.5 |
4 | Corydoras aeneus | Callichthyidae | 7 | 2.5 |
5 | Corydoras trilineatus | Callichthyidae | 5 | 1.5 |
6 | Corydoras habrosus | Callichthyidae | 3 | 0.5 |
At its most basic, the filter()
method takes an argument called items
which accepts a list of Pandas column names you want to select. Calling df.filter(items=['species', 'family'])
does exactly the same as df[['species', 'family']]
.
When you call filter()
it returns a transformed dataframe and doesn’t change the original, so if you want to retain the transformed dataframe you’ll need to save it to a new variable or overwrite the original. For example, df_subset = df.filter(items=['species', 'family'])
.
df.filter(items=['species', 'family'])
species | family | |
---|---|---|
0 | Pterophyllum altum | Cichlidae |
1 | Pterophyllum scalare | Cichlidae |
2 | Pterophyllum leopoldi | Cichlidae |
3 | Corydoras paleatus | Callichthyidae |
4 | Corydoras aeneus | Callichthyidae |
5 | Corydoras trilineatus | Callichthyidae |
6 | Corydoras habrosus | Callichthyidae |
The like
argument is used to select only those columns that contain a specific value. For example, to return a subset dataframe containing only the column names that contain an e
we can pass in the argument like='e'
and set axis=1
.
df.filter(like='e', axis=1)
species | length | weight | |
---|---|---|---|
0 | Pterophyllum altum | 15 | 12.3 |
1 | Pterophyllum scalare | 10 | 9.2 |
2 | Pterophyllum leopoldi | 8 | 6.5 |
3 | Corydoras paleatus | 5 | 0.5 |
4 | Corydoras aeneus | 7 | 2.5 |
5 | Corydoras trilineatus | 5 | 1.5 |
6 | Corydoras habrosus | 3 | 0.5 |
The like
argument returns column names that match a substring value. However, the regex
argument allows you to pass in a Python regular expression. For example, to return only the columns that start with len
we can use the regex ^len
.
df.filter(regex='^len', axis=1)
length | |
---|---|
0 | 15 |
1 | 10 |
2 | 8 |
3 | 5 |
4 | 7 |
5 | 5 |
6 | 3 |
As mentioned above, since the filter()
function is appended to a Pandas dataframe object to obtain its input data and returns a dataframe as its output, it lends itself to being used in method chaining. Each step in the chain transforms the dataframe and passes its output to the next step, creating a list of operations that are performed sequentially.
df_subset = (
df
.filter(items=['species', 'family', 'weight'])
.query('family == "Cichlidae"')
)
df_subset
species | family | weight | |
---|---|---|---|
0 | Pterophyllum altum | Cichlidae | 12.3 |
1 | Pterophyllum scalare | Cichlidae | 9.2 |
2 | Pterophyllum leopoldi | Cichlidae | 6.5 |
Matt Clarke, Monday, January 16, 2023