The Pandas query()
function is an awesome tool for filtering Pandas dataframes. It takes simple string arguments on column names and uses standard Pandas operators that allow you to easily search, filter, or subset a dataframe, so it only shows rows that match your search criteria.
In this simple tutorial I’ll show you how easy it is to use the query()
function to filter a dataframe by working through some basic example codes highlighting its key features.
To get started, open a Jupyter notebook and import the Pandas library, then either import your own dataset, or create a Pandas dataframe to practice querying. We’ll make a simple dataframe that contains a mixture of categorical and numerical variables.
import pandas as pd
df = pd.DataFrame(
[('Pterophyllum altum', 'Cichlidae', 15, 12.3),
('Pterophyllum scalare', 'Cichlidae', 10, 9.2),
('Pterophyllum leopoldi', 'Cichlidae', 8, 6.5),
('Corydoras paleatus', 'Callichthyidae', 5, 0.5),
('Corydoras aeneus', 'Callichthyidae', 7, 2.5),
('Corydoras trilineatus', 'Callichthyidae', 5, 1.5),
('Corydoras habrosus', 'Callichthyidae', 3, 0.5),
],
columns=['species', 'family', 'length', 'weight']
)
df
species | family | length | weight | |
---|---|---|---|---|
0 | Pterophyllum altum | Cichlidae | 15 | 12.3 |
1 | Pterophyllum scalare | Cichlidae | 10 | 9.2 |
2 | Pterophyllum leopoldi | Cichlidae | 8 | 6.5 |
3 | Corydoras paleatus | Callichthyidae | 5 | 0.5 |
4 | Corydoras aeneus | Callichthyidae | 7 | 2.5 |
5 | Corydoras trilineatus | Callichthyidae | 5 | 1.5 |
6 | Corydoras habrosus | Callichthyidae | 3 | 0.5 |
First, we’ll use query()
to filter the dataframe to return only the rows where the species
column contains a string that exactly matches ==
the string “Corydoras aeneus”. This returns a new dataframe, so you can assign it to a new variable if you need to, or chain other commands to it using method chaining.
df.query('species == "Corydoras aeneus"')
species | family | length | weight | |
---|---|---|---|---|
4 | Corydoras aeneus | Callichthyidae | 7 | 2.5 |
To use query()
to filter the dataframe so it shows only those rows where the species
column does not match a string we use the !=
or “not equals” operator.
df.query('family != "Cichlidae"')
species | family | length | weight | |
---|---|---|---|---|
3 | Corydoras paleatus | Callichthyidae | 5 | 0.5 |
4 | Corydoras aeneus | Callichthyidae | 7 | 2.5 |
5 | Corydoras trilineatus | Callichthyidae | 5 | 1.5 |
6 | Corydoras habrosus | Callichthyidae | 3 | 0.5 |
Next, we’ll use query()
to filter the dataframe to return only those rows where the species
column contains a value that matches a value from a list. We can do this in two ways. For shorter lists, you can pass the list directly into the query()
command, but for longer lists you can bring in an externally defined list by prefixing the variable name with @
.
df.query('species in ["Corydoras paleatus", "Corydoras aeneus"]')
species | family | length | weight | |
---|---|---|---|---|
3 | Corydoras paleatus | Callichthyidae | 5 | 0.5 |
4 | Corydoras aeneus | Callichthyidae | 7 | 2.5 |
targets = ['Corydoras trilineatus', 'Corydoras aeneus']
df.query('species in @targets')
species | family | length | weight | |
---|---|---|---|---|
4 | Corydoras aeneus | Callichthyidae | 7 | 2.5 |
5 | Corydoras trilineatus | Callichthyidae | 5 | 1.5 |
To return only those rows where a column value is not in a list we need to use the not in
operator instead of the in
operator. As with before, you can either define the list within the query()
function or refer to the variable using an @
prefix on the list variable name.
df.query('species not in ["Corydoras paleatus", "Corydoras aeneus"]')
species | family | length | weight | |
---|---|---|---|---|
0 | Pterophyllum altum | Cichlidae | 15 | 12.3 |
1 | Pterophyllum scalare | Cichlidae | 10 | 9.2 |
2 | Pterophyllum leopoldi | Cichlidae | 8 | 6.5 |
5 | Corydoras trilineatus | Callichthyidae | 5 | 1.5 |
6 | Corydoras habrosus | Callichthyidae | 3 | 0.5 |
targets = ['Corydoras trilineatus', 'Corydoras aeneus']
df.query('species not in @targets')
species | family | length | weight | |
---|---|---|---|---|
0 | Pterophyllum altum | Cichlidae | 15 | 12.3 |
1 | Pterophyllum scalare | Cichlidae | 10 | 9.2 |
2 | Pterophyllum leopoldi | Cichlidae | 8 | 6.5 |
3 | Corydoras paleatus | Callichthyidae | 5 | 0.5 |
6 | Corydoras habrosus | Callichthyidae | 3 | 0.5 |
The query()
function also supports the Pandas operators that work with numeric data. For example, to filter the dataframe based on column values that are less than x we can use df.query('weight > 9')
.
df.query('weight > 9')
species | family | length | weight | |
---|---|---|---|---|
0 | Pterophyllum altum | Cichlidae | 15 | 12.3 |
1 | Pterophyllum scalare | Cichlidae | 10 | 9.2 |
Similarly, we can filter the dataframe to show only those rows where a value is less than x by using the less than operator.
df.query('weight < 9')
species | family | length | weight | |
---|---|---|---|---|
2 | Pterophyllum leopoldi | Cichlidae | 8 | 6.5 |
3 | Corydoras paleatus | Callichthyidae | 5 | 0.5 |
4 | Corydoras aeneus | Callichthyidae | 7 | 2.5 |
5 | Corydoras trilineatus | Callichthyidae | 5 | 1.5 |
6 | Corydoras habrosus | Callichthyidae | 3 | 0.5 |
To return only those rows where values in the weight
column are greater than or equal to a given value we need to use the >=
operator.
df.query('weight >= 9.3')
species | family | length | weight | |
---|---|---|---|---|
0 | Pterophyllum altum | Cichlidae | 15 | 12.3 |
To return the rows that contain a value less than or equal to a given value we need to use the <=
operator.
df.query('weight <= 1.5')
species | family | length | weight | |
---|---|---|---|---|
3 | Corydoras paleatus | Callichthyidae | 5 | 0.5 |
5 | Corydoras trilineatus | Callichthyidae | 5 | 1.5 |
6 | Corydoras habrosus | Callichthyidae | 3 | 0.5 |
To return rows that have a value that is between two values, such as between 5 and 12, we can use the and
operator to separate two separate queries.
df.query('weight > 5 and weight < 12')
species | family | length | weight | |
---|---|---|---|---|
1 | Pterophyllum scalare | Cichlidae | 10 | 9.2 |
2 | Pterophyllum leopoldi | Cichlidae | 8 | 6.5 |
Finally, to find rows that match multiple conditions, you can mix and match the various queries to construct a more complex search pattern using one or more and
operators to separate the individual queries.
df.query('family == "Cichlidae" and length > 3')
species | family | length | weight | |
---|---|---|---|---|
0 | Pterophyllum altum | Cichlidae | 15 | 12.3 |
1 | Pterophyllum scalare | Cichlidae | 10 | 9.2 |
2 | Pterophyllum leopoldi | Cichlidae | 8 | 6.5 |
df.query('family == "Cichlidae" and length > 3 and weight < 10')
species | family | length | weight | |
---|---|---|---|---|
1 | Pterophyllum scalare | Cichlidae | 10 | 9.2 |
2 | Pterophyllum leopoldi | Cichlidae | 8 | 6.5 |
Matt Clarke, Monday, January 09, 2023