How to select, filter, and subset data in Pandas dataframes

Learn a range of useful techniques to select, filter, and subset data stored in Pandas dataframes to extract the exact information you need.


Selecting, filtering and subsetting data is probably the most common task you’ll undertake if you work with data. It allows you to extract subsets of data where row or column values match specific parameters, or specific combinations of parameters, so it’s a key requirement for all forms of exploratory data analysis, data visualisation, and preprocessing for machine learning and modeling.

If, like me, you’ve come to Pandas from another platform, such as SQL, where you could write complex queries to select the data you wanted off the top of your head, getting to grips with the way you do this in Pandas can seem a bit intimidating. However, when you break it down, it’s actually quite straightforward. Here are the main things you need to know to select and subset data from Pandas dataframes.

Load the data

For this project I’m using the Bank Marketing dataset, which you can download from the UCI Machine Learning Repository. This is in a text format, but is separated with semicolons rather than commas. To load the data into a Pandas dataframe we’ll import the Pandas package with import pandas as pd and then use the pd.read_csv() function to load the data into a dataframe called df. We’ll need to pass in the argument sep=";" to tell Pandas to split the data correctly. As we only need a small number of columns, we can specifically define the ones to import with usecols.

import pandas as pd
import numpy as np
df = pd.read_csv('bank-full.csv', sep=";", usecols=['age','job','marital','education','balance','duration'])

Viewing the head, tail, and a sample

Pandas includes three functions to allow you to quickly view the dataframe: head(), tail(), and sample(). By default head() and tail() return the first five and last five rows of the dataframe respectively, while sample() returns a single random row. Calling the function on df will print the output. The first column in bold, which lacks a header, is the index.

df.head()
age job marital education balance duration
0 58 management married tertiary 2143 261
1 44 technician single secondary 29 151
2 33 entrepreneur married secondary 2 76
3 47 blue-collar married unknown 1506 92
4 33 unknown single unknown 1 198
df.head(1)
age job marital education balance duration
0 58 management married tertiary 2143 261
df.tail()
age job marital education balance duration
45206 51 technician married tertiary 825 977
45207 71 retired divorced primary 1729 456
45208 72 retired married secondary 5715 1127
45209 57 blue-collar married secondary 668 508
45210 37 entrepreneur married secondary 2971 361
df.sample()
age job marital education balance duration
43187 37 management single tertiary 1027 313

With all three of these functions, you can pass in an integer value to show a specific number of rows. For example, df.head(3) shows the first three rows and df.sample(5) returns five random rows.

df.head(3)
age job marital education balance duration
0 58 management married tertiary 2143 261
1 44 technician single secondary 29 151
2 33 entrepreneur married secondary 2 76
df.sample(5)
age job marital education balance duration
43069 33 management married tertiary 2213 240
35532 33 technician single secondary 5214 293
23572 37 management married tertiary 6743 13
14327 45 blue-collar single secondary 91 55
24373 57 housemaid married tertiary 3760 1070
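As an aside, sample() also accepts a random_state argument that makes the "random" selection repeatable, which is handy for sharing reproducible results. Here’s a minimal sketch using a tiny inline dataframe as a stand-in, since the full CSV isn’t reproduced here:

```python
import pandas as pd

# Small stand-in dataframe with a few of the bank dataset's columns
# (an assumption for illustration; the real data comes from the CSV).
df = pd.DataFrame({
    'age': [58, 44, 33, 47, 33],
    'job': ['management', 'technician', 'entrepreneur', 'blue-collar', 'unknown'],
    'balance': [2143, 29, 2, 1506, 1],
})

# Passing the same random_state twice yields the same "random" rows.
sample_a = df.sample(3, random_state=42)
sample_b = df.sample(3, random_state=42)
print(sample_a.equals(sample_b))
```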

Transposing the data

If you’ve got lots of columns and find them difficult to view, you can flip the orientation using transpose. To do this simply append .T to the end of your Pandas command.

df.head().T
0 1 2 3 4
age 58 44 33 47 33
job management technician entrepreneur blue-collar unknown
marital married single married married single
education tertiary secondary secondary unknown unknown
balance 2143 29 2 1506 1
duration 261 151 76 92 198

Selecting an individual column or series

Each column within a Pandas dataframe is called a series. Depending on the way you select data from the dataframe, Pandas will either return the data as a series or a subset of the original dataframe. There are several ways to select an individual series or column. For example, to select the marital column we can either use df.marital or df['marital'], both of which return the same results.

df.marital
0         married
1          single
2         married
3         married
4          single
           ...   
45206     married
45207    divorced
45208     married
45209     married
45210     married
Name: marital, Length: 45211, dtype: object
df['marital']
0         married
1          single
2         married
3         married
4          single
           ...   
45206     married
45207    divorced
45208     married
45209     married
45210     married
Name: marital, Length: 45211, dtype: object

To return a single column in a neater dataframe format, you can pass in the column name inside two sets of square brackets, so df[['marital']] will return a dataframe containing only this column. You can also do the same thing by using single brackets and appending the to_frame() function to convert the series to a dataframe.

df[['marital']].head()
marital
0 married
1 single
2 married
3 married
4 single
df['marital'].to_frame().head()
marital
0 married
1 single
2 married
3 married
4 single

Finally, you can use the iloc method to select the series using its column index. So, df.iloc[:,2] will select the marital column based on its index, while appending to_frame() turns it into a dataframe.

df.iloc[:,2].to_frame().head()
marital
0 married
1 single
2 married
3 married
4 single
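If you’d rather not count columns by hand to find the position for iloc, the columns.get_loc() lookup does it for you. A small sketch with an inline stand-in dataframe:

```python
import pandas as pd

# Stand-in dataframe mirroring the first few columns of the dataset.
df = pd.DataFrame({
    'age': [58, 44],
    'job': ['management', 'technician'],
    'marital': ['married', 'single'],
})

# get_loc() returns the integer position of a named column,
# which can then be passed straight to iloc.
pos = df.columns.get_loc('marital')
series = df.iloc[:, pos]
```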

Selecting multiple specific columns

To select a subset of multiple specific columns from a dataframe we can use the double square brackets approach again, but define a list of column names instead of a single one. Here are the last five rows with the age and job columns. The loc method can be used to achieve the same result.

df[['age','job']].tail()
age job
45206 51 technician
45207 71 retired
45208 72 retired
45209 57 blue-collar
45210 37 entrepreneur
df.loc[:, ['age', 'job']].tail()
age job
45206 51 technician
45207 71 retired
45208 72 retired
45209 57 blue-collar
45210 37 entrepreneur

Selecting rows where a column matches a value

To select rows where a column value exactly matches a given value, we can use =='value'. To return rows that do not match a value, we can use !='value' instead.

df_retired = df[df['job']=='retired']
df_retired.head()
age job marital education balance duration
8 58 retired married primary 121 50
15 51 retired married primary 229 353
18 60 retired married primary 60 219
24 40 retired married primary 0 181
43 54 retired married secondary 529 1492
df_not_retired = df[df['job']!='retired']
df_not_retired.head()
age job marital education balance duration
0 58 management married tertiary 2143 261
1 44 technician single secondary 29 151
2 33 entrepreneur married secondary 2 76
3 47 blue-collar married unknown 1506 92
4 33 unknown single unknown 1 198
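When you need to match several values at once, the isin() function is tidier than chaining multiple == comparisons together. A brief sketch, again using a small inline dataframe in place of the full dataset:

```python
import pandas as pd

# Stand-in dataframe for illustration.
df = pd.DataFrame({
    'age': [58, 71, 33, 47],
    'job': ['management', 'retired', 'entrepreneur', 'blue-collar'],
})

# isin() matches against a list of values in one pass.
df_subset = df[df['job'].isin(['retired', 'management'])]

# Prepending ~ inverts the mask, just as != does for a single value.
df_rest = df[~df['job'].isin(['retired', 'management'])]
```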

Selecting rows where a column value starts with a value

Next, we’ll try filtering the dataframe based on the values within the columns. For categorical data you can use Pandas string functions to filter the data. The str.startswith() function returns rows where a given column’s values start with a certain string, while str.endswith() returns rows whose values end with one.

df_technician = df[df['job'].str.startswith('tech')]
df_technician.head()
age job marital education balance duration
1 44 technician single secondary 29 151
9 43 technician single secondary 593 55
12 53 technician married secondary 6 517
13 58 technician married unknown 71 71
29 36 technician single secondary 265 348
df_ar = df[df['job'].str.endswith('ar')]
df_ar.head()
age job marital education balance duration
3 47 blue-collar married unknown 1506 92
17 57 blue-collar married primary 52 38
20 28 blue-collar married secondary 723 262
22 32 blue-collar single primary 23 160
33 59 blue-collar married secondary 0 226

Selecting rows where a column value contains a value

To select rows where a column value contains a given value, we can use the str.contains() Pandas function. Here we’ll select all of the rows where the education value contains a y. To select the rows that do not contain a value we can place a tilde (the squiggle ~ character) immediately before the df['education'].str.contains('y').

df_education = df[df['education'].str.contains('y')]
df_education.head()
age job marital education balance duration
0 58 management married tertiary 2143 261
1 44 technician single secondary 29 151
2 33 entrepreneur married secondary 2 76
5 35 management married tertiary 231 139
6 28 management single tertiary 447 217
df_education = df[~df['education'].str.contains('y')]
df_education.head()
age job marital education balance duration
3 47 blue-collar married unknown 1506 92
4 33 unknown single unknown 1 198
13 58 technician married unknown 71 71
16 45 admin. single unknown 13 98
42 60 blue-collar married unknown 104 22
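Two str.contains() arguments worth knowing about: case=False makes the match case-insensitive, and na=False treats missing values as non-matches rather than producing NaN in the mask. A short sketch with a made-up column containing mixed case and a missing value:

```python
import pandas as pd

# Stand-in column with mixed case and a missing value (for illustration).
df = pd.DataFrame({
    'education': ['Tertiary', 'secondary', None, 'unknown'],
})

# case=False ignores capitalisation; na=False means rows with missing
# values simply don't match instead of breaking the boolean filter.
mask = df['education'].str.contains('y', case=False, na=False)
df_matched = df[mask]
```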

Select rows where values are less than a value

For numeric columns we can use the regular numeric Python operators to select the subsets of our dataframe. For example, df[df['balance'] < 10] returns all rows where the balance column value is less than 10, while df[df['balance'] <= 1] returns all rows where the balance column value is less than or equal to 1.

df_low_balance = df[df['balance'] < 10]
df_low_balance.head()
age job marital education balance duration
2 33 entrepreneur married secondary 2 76
4 33 unknown single unknown 1 198
7 42 entrepreneur divorced tertiary 2 380
12 53 technician married secondary 6 517
19 33 services married secondary 0 54
df_low_balance = df[df['balance'] <= 1]
df_low_balance.head()
age job marital education balance duration
4 33 unknown single unknown 1 198
19 33 services married secondary 0 54
24 40 retired married primary 0 181
25 44 admin. married secondary -372 172
28 46 management single secondary -246 255

Rows where column values are greater than a value

The same approach can be used to select rows with higher values. For example, df[df['balance'] > 1000] returns all rows where the balance column value is greater than 1000, and df[df['balance'] >= 500] returns all rows where the balance column value is greater than or equal to 500.

df_high_balance = df[df['balance'] > 1000]
df_high_balance.head()
age job marital education balance duration
0 58 management married tertiary 2143 261
3 47 blue-collar married unknown 1506 92
34 51 management married tertiary 10635 336
40 44 services divorced secondary 2586 160
50 54 blue-collar married secondary 1291 266
df_high_balance = df[df['balance'] >= 500]
df_high_balance.head()
age job marital education balance duration
0 58 management married tertiary 2143 261
3 47 blue-collar married unknown 1506 92
9 43 technician single secondary 593 55
20 28 blue-collar married secondary 723 262
21 56 management married tertiary 779 164

Rows where column values are between two numeric values

To select rows where a value lies between two values you can use parentheses and the ampersand & operator (note that Pandas requires & for element-wise logic, not Python’s and keyword). For example, df[ (df['balance'] >= 100) & (df['balance'] <= 200) ] returns all customers who had a balance value of 100 to 200.

df_100_200 = df[ (df['balance'] >= 100) & (df['balance'] <= 200) ]
df_100_200.head()
age job marital education balance duration
8 58 retired married primary 121 50
14 57 services married secondary 162 174
27 52 entrepreneur married secondary 113 127
42 60 blue-collar married unknown 104 22
59 46 services married primary 179 1778
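The between() function offers a shorthand for the same pair of chained comparisons; by default both endpoints are inclusive. A minimal sketch with an inline stand-in column:

```python
import pandas as pd

# Stand-in balance column for illustration.
df = pd.DataFrame({'balance': [50, 121, 162, 250, 104]})

# between(100, 200) is equivalent to (balance >= 100) & (balance <= 200).
df_100_200 = df[df['balance'].between(100, 200)]
```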

Rows with columns matching multiple criteria

You can use parentheses to create compound selection criteria that mix categorical and numeric data. Here we’re using df[ (df['marital'] == 'married') & ( df['balance'] >= 2000 ) ] to select all of the customers who are married and have a balance of 2000 or more.

df_multiple = df[ (df['marital'] == 'married') & ( df['balance'] >= 2000 ) ]
df_multiple.head()
age job marital education balance duration
0 58 management married tertiary 2143 261
34 51 management married tertiary 10635 336
65 51 management married tertiary 6530 91
70 57 blue-collar married secondary 5935 258
83 59 admin. married secondary 2343 1042

You can make these statements as complex as you wish. For example, the below example selects all rows where customers are either married or divorced and have a balance of 2000 or more.

df_multiple = df[ ((df['marital'] == 'married') | (df['marital'] == 'divorced'))
                 & ( df['balance'] >= 2000 ) ]
df_multiple.head()
age job marital education balance duration
0 58 management married tertiary 2143 261
34 51 management married tertiary 10635 336
40 44 services divorced secondary 2586 160
65 51 management married tertiary 6530 91
70 57 blue-collar married secondary 5935 258
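If you find nested boolean masks hard to read, the query() function expresses the same compound filter as a string, which can feel closer to SQL. A sketch of the equivalent filter, using a small inline dataframe as a stand-in:

```python
import pandas as pd

# Stand-in dataframe for illustration.
df = pd.DataFrame({
    'marital': ['married', 'single', 'divorced', 'married'],
    'balance': [2143, 5000, 2586, 100],
})

# query() accepts a string expression; 'or' and 'and' work here
# because the expression is parsed by Pandas, not plain Python.
df_multiple = df.query(
    "(marital == 'married' or marital == 'divorced') and balance >= 2000"
)
```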

Selecting specific rows with iloc

If you want to select specific rows from a dataframe based on their index value (the customer ID in our dataframe), you can pass the index values to iloc as a list. Strictly speaking, iloc selects by integer position rather than by index label, but the two coincide here because the dataframe uses the default RangeIndex. So, df.iloc[[70, 65, 40]] returns the rows for customers 70, 65, and 40.

df.iloc[[70, 65, 40]]
age job marital education balance duration
70 57 blue-collar married secondary 5935 258
65 51 management married tertiary 6530 91
40 44 services divorced secondary 2586 160

Using slice notation to select a range of rows

The iloc technique is particularly useful because it supports slice notation. This means you can create subsets based on their index value, which can be useful for splitting up massive datasets into smaller chunks. For example, df.iloc[5:10] returns rows 5 to 9 based on their index values.

df.iloc[5:10]
age job marital education balance duration
5 35 management married tertiary 231 139
6 28 management single tertiary 447 217
7 42 entrepreneur divorced tertiary 2 380
8 58 retired married primary 121 50
9 43 technician single secondary 593 55
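One way to put those slices to work is to step through the dataframe in fixed-size pieces. This sketch assumes a small inline dataframe and an arbitrary chunk size of 4:

```python
import pandas as pd

# Stand-in dataframe of ten rows for illustration.
df = pd.DataFrame({'age': list(range(10))})

# Step through the dataframe in fixed-size iloc slices; the final
# chunk is simply shorter when the row count doesn't divide evenly.
chunk_size = 4
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
```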

Selecting specific rows and columns with loc

The loc method can also combine row and column selection in a single step. For example, here we’ll select the rows where the index value is either 0 or 34 and return only the age, job, and education columns.

rows = [0, 34]
cols = ['age', 'job', 'education']
df.loc[rows, cols]
age job education
0 58 management tertiary
34 51 management tertiary

Select columns by data type

The other common thing you’ll need to do, especially if you work with machine learning models, is select data based on the data type. You can view the data types by entering df.info(). The data type appears in the Dtype column. As you can see below, we have three numeric columns containing int64 data and three non-numeric columns containing object or string data.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   balance    45211 non-null  int64 
 5   duration   45211 non-null  int64 
dtypes: int64(3), object(3)
memory usage: 2.1+ MB

To obtain a list of the column names matching a specific data type you can use the select_dtypes() function and pass in a list of data types to return. Appending .columns returns these as the column names.

df.select_dtypes(['int64']).columns
Index(['age', 'balance', 'duration'], dtype='object')
df.select_dtypes(['object']).columns
Index(['job', 'marital', 'education'], dtype='object')

The select_dtypes() function includes a useful exclude argument that allows you to use a similar approach to return the columns that do not match a particular data type. For example, exclude=['object'] returns all columns that are not an object data type. Numpy can be used to do similar things, so df.select_dtypes([np.number]).columns returns all numeric columns, whether they’re floats or integers, while df.select_dtypes(exclude=[np.number]).columns excludes them.

df.select_dtypes(exclude=['object']).columns
Index(['age', 'balance', 'duration'], dtype='object')
df.select_dtypes([np.number]).columns
Index(['age', 'balance', 'duration'], dtype='object')
df.select_dtypes(exclude=[np.number]).columns
Index(['job', 'marital', 'education'], dtype='object')

To view the columns based on their data type, we simply drop the .columns suffix and assign the output to a dataframe. This approach allows us to print a dataframe containing only the numeric columns and a dataframe containing only the non-numeric columns.

df_numeric = df.select_dtypes([np.number])
df_numeric.head()
age balance duration
0 58 2143 261
1 44 29 151
2 33 2 76
3 47 1506 92
4 33 1 198
df_non_numeric = df.select_dtypes(exclude=[np.number])
df_non_numeric.head()
job marital education
0 management married tertiary
1 technician single secondary
2 entrepreneur married secondary
3 blue-collar married unknown
4 unknown single unknown

Sorting a dataframe by a specific column

Pandas includes the sort_values() function to let you sort the dataframe. This requires either the name of a single column by which to sort the data, or a list of multiple columns. For example, the by='balance' argument will sort the dataframe in ascending order of the account balance. To change this from the default, and show the customers with the highest balance first, we can use the by='balance', ascending=False arguments.

df.sort_values(by='balance').head()
age job marital education balance duration
12909 26 blue-collar single secondary -8019 299
15682 49 management married tertiary -6847 206
38736 60 management divorced tertiary -4057 242
7413 43 management married tertiary -3372 386
1896 57 self-employed married tertiary -3313 153
df.sort_values(by='balance', ascending=False).head()
age job marital education balance duration
39989 51 management single tertiary 102127 90
26227 59 management married tertiary 98417 145
43393 84 retired married secondary 81204 390
42558 84 retired married secondary 81204 679
41693 60 retired married primary 71188 205

Sorting by multiple columns

Similarly, if you want to sort a dataframe based on the values of multiple columns, you can pass in a list of values. Here we’ll sort by job, then marital, then education, then balance.

df.sort_values(by=['job','marital','education','balance'], ascending=False).head()
age job marital education balance duration
42394 44 unknown single unknown 18881 187
25748 30 unknown single unknown 6929 341
27158 39 unknown single unknown 2822 573
39803 47 unknown single unknown 1819 180
44651 48 unknown single unknown 1819 1440
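The ascending argument can also take a list with one flag per sort column, so each column gets its own direction. A brief sketch with a small inline dataframe, sorting job A to Z but balance high to low within each job:

```python
import pandas as pd

# Stand-in dataframe for illustration.
df = pd.DataFrame({
    'job': ['retired', 'retired', 'management'],
    'balance': [500, 1500, 900],
})

# One ascending flag per column in by: job ascending, balance descending.
df_sorted = df.sort_values(by=['job', 'balance'], ascending=[True, False])
```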

Matt Clarke, Saturday, March 06, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.