Selecting, filtering and subsetting data is probably the most common task you’ll undertake if you work with data. It allows you to extract subsets of data where row or column values match specific parameters, or specific combinations of parameters, so it’s a key requirement for all forms of exploratory data analysis, data visualisation, and preprocessing for machine learning and modeling.
If, like me, you’ve come to Pandas from another platform, such as SQL, where you could write complex queries to select the data you wanted off the top of your head, getting to grips with the way you do this in Pandas can seem a bit intimidating. However, when you break it down, it’s actually quite straightforward. Here are the main things you need to know to select and subset data from Pandas dataframes.
Load the data
For this project I’m using the Bank Marketing dataset, which you can download from the UCI Machine Learning Repository. This is in a text format, but is separated with semicolons rather than commas. To load the data into a Pandas dataframe we’ll import the Pandas package with import pandas as pd and then use the pd.read_csv() function to load the data into a dataframe called df. We’ll need to pass in the argument sep=";" to tell Pandas to split the data correctly. As we only need a small number of columns, we can specifically define the ones to import with usecols.
import pandas as pd
import numpy as np
df = pd.read_csv('bank-full.csv', sep=";", usecols=['age','job','marital','education','balance','duration'])
Viewing the head, tail, and a sample
Pandas includes three functions that allow you to quickly view the dataframe: head(), tail(), and sample(). By default head() and tail() return the first and last five rows of the dataframe respectively, while sample() returns a single random row. Calling the function on df will print the output. The first column in bold, which lacks a header, is the index.
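To make this concrete, here's a sketch of the three calls. It runs on a small stand-in frame built from the five rows shown below, rather than the full 45,211-row dataset loaded via read_csv().

```python
import pandas as pd

# Small stand-in for the full Bank Marketing dataframe,
# built from the five rows shown in the article's head() output.
df = pd.DataFrame({
    'age': [58, 44, 33, 47, 33],
    'job': ['management', 'technician', 'entrepreneur', 'blue-collar', 'unknown'],
    'marital': ['married', 'single', 'married', 'married', 'single'],
    'education': ['tertiary', 'secondary', 'secondary', 'unknown', 'unknown'],
    'balance': [2143, 29, 2, 1506, 1],
    'duration': [261, 151, 76, 92, 198],
})

print(df.head(3))    # first three rows
print(df.tail(2))    # last two rows
print(df.sample())   # one random row
```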
|   | age | job | marital | education | balance | duration |
|---|-----|-----|---------|-----------|---------|----------|
| 0 | 58 | management | married | tertiary | 2143 | 261 |
| 1 | 44 | technician | single | secondary | 29 | 151 |
| 2 | 33 | entrepreneur | married | secondary | 2 | 76 |
| 3 | 47 | blue-collar | married | unknown | 1506 | 92 |
| 4 | 33 | unknown | single | unknown | 1 | 198 |
|       | age | job | marital | education | balance | duration |
|-------|-----|-----|---------|-----------|---------|----------|
| 45206 | 51 | technician | married | tertiary | 825 | 977 |
| 45207 | 71 | retired | divorced | primary | 1729 | 456 |
| 45208 | 72 | retired | married | secondary | 5715 | 1127 |
| 45209 | 57 | blue-collar | married | secondary | 668 | 508 |
| 45210 | 37 | entrepreneur | married | secondary | 2971 | 361 |
|       | age | job | marital | education | balance | duration |
|-------|-----|-----|---------|-----------|---------|----------|
| 43187 | 37 | management | single | tertiary | 1027 | 313 |
With all three of these functions, you can pass in an integer value to show a specific number of rows. For example, df.head(3) shows the first three rows and df.sample(5) returns five random rows.
|   | age | job | marital | education | balance | duration |
|---|-----|-----|---------|-----------|---------|----------|
| 0 | 58 | management | married | tertiary | 2143 | 261 |
| 1 | 44 | technician | single | secondary | 29 | 151 |
| 2 | 33 | entrepreneur | married | secondary | 2 | 76 |
|       | age | job | marital | education | balance | duration |
|-------|-----|-----|---------|-----------|---------|----------|
| 43069 | 33 | management | married | tertiary | 2213 | 240 |
| 35532 | 33 | technician | single | secondary | 5214 | 293 |
| 23572 | 37 | management | married | tertiary | 6743 | 13 |
| 14327 | 45 | blue-collar | single | secondary | 91 | 55 |
| 24373 | 57 | housemaid | married | tertiary | 3760 | 1070 |
Transposing the data
If you’ve got lots of columns and find them difficult to view, you can flip the orientation using transpose. To do this, simply append .T to the end of your Pandas command.
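A quick sketch of the transpose, again on a small stand-in frame rather than the full dataset:

```python
import pandas as pd

# Stand-in frame with a few columns from the dataset
df = pd.DataFrame({
    'age': [58, 44],
    'job': ['management', 'technician'],
    'balance': [2143, 29],
})

# .T flips the frame: columns become rows and rows become columns
transposed = df.head().T
print(transposed)
```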
|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| age | 58 | 44 | 33 | 47 | 33 |
| job | management | technician | entrepreneur | blue-collar | unknown |
| marital | married | single | married | married | single |
| education | tertiary | secondary | secondary | unknown | unknown |
| balance | 2143 | 29 | 2 | 1506 | 1 |
| duration | 261 | 151 | 76 | 92 | 198 |
Selecting an individual column or series
Each column within a Pandas dataframe is called a series. Depending on the way you select data from the dataframe, Pandas will either return the data as a series or as a subset of the original dataframe. There are several ways to select an individual series or column. For example, to select the marital column we can use either df.marital or df['marital'], both of which return the same result.
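The two forms are equivalent, as this sketch on a small stand-in frame shows:

```python
import pandas as pd

# Stand-in frame with a couple of the dataset's columns
df = pd.DataFrame({
    'age': [58, 44, 33],
    'marital': ['married', 'single', 'married'],
})

# Attribute access and bracket access return the same pandas Series
s1 = df.marital
s2 = df['marital']
print(s1.equals(s2))
```

Note that attribute access (df.marital) only works when the column name is a valid Python identifier and doesn't clash with a DataFrame method, so the bracket form is the safer habit.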
0 married
1 single
2 married
3 married
4 single
...
45206 married
45207 divorced
45208 married
45209 married
45210 married
Name: marital, Length: 45211, dtype: object
0 married
1 single
2 married
3 married
4 single
...
45206 married
45207 divorced
45208 married
45209 married
45210 married
Name: marital, Length: 45211, dtype: object
To return a single column in a neater dataframe format, you can pass the column name inside two sets of square brackets, so df[['marital']] will return a dataframe containing only this column. You can also do the same thing by using single brackets and appending the to_frame() function to convert the series to a dataframe.
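A sketch of both routes, on a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [58, 44, 33],
    'marital': ['married', 'single', 'married'],
})

# Double brackets return a one-column DataFrame rather than a Series
subset = df[['marital']]

# Single brackets plus to_frame() produce the same DataFrame
same = df['marital'].to_frame()
print(subset.equals(same))
```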
|   | marital |
|---|---------|
| 0 | married |
| 1 | single |
| 2 | married |
| 3 | married |
| 4 | single |
df['marital'].to_frame().head()
|   | marital |
|---|---------|
| 0 | married |
| 1 | single |
| 2 | married |
| 3 | married |
| 4 | single |
Finally, you can use the iloc method to select the series using its column index. Since column indices are zero-based, df.iloc[:,2] selects the third column, marital, while appending to_frame() turns it into a dataframe.
df.iloc[:,2].to_frame().head()
|   | marital |
|---|---------|
| 0 | married |
| 1 | single |
| 2 | married |
| 3 | married |
| 4 | single |
Selecting multiple specific columns
To select a subset of multiple specific columns from a dataframe, we can use the double square brackets approach again, but define a list of column names instead of a single one. Here are the last five rows with the age and job columns. The loc method can be used to achieve the same result.
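Both forms look like this; the sketch uses a small stand-in frame in place of the full dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [58, 44, 33],
    'job': ['management', 'technician', 'entrepreneur'],
    'balance': [2143, 29, 2],
})

# Pass a list of column names inside the brackets...
cols_a = df[['age', 'job']].tail()

# ...or use loc with a row slice and the same column list
cols_b = df.loc[:, ['age', 'job']].tail()
print(cols_a.equals(cols_b))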
|       | age | job |
|-------|-----|-----|
| 45206 | 51 | technician |
| 45207 | 71 | retired |
| 45208 | 72 | retired |
| 45209 | 57 | blue-collar |
| 45210 | 37 | entrepreneur |
df.loc[:, ['age', 'job']].tail()
|       | age | job |
|-------|-----|-----|
| 45206 | 51 | technician |
| 45207 | 71 | retired |
| 45208 | 72 | retired |
| 45209 | 57 | blue-collar |
| 45210 | 37 | entrepreneur |
Selecting rows where a column value matches a value
To select rows where a column value exactly matches a given value, we can use =='value' to return all matching rows. To return the rows that do not match a value, we can use !='value' to denote not equal.
df_retired = df[df['job']=='retired']
df_retired.head()
|    | age | job | marital | education | balance | duration |
|----|-----|-----|---------|-----------|---------|----------|
| 8  | 58 | retired | married | primary | 121 | 50 |
| 15 | 51 | retired | married | primary | 229 | 353 |
| 18 | 60 | retired | married | primary | 60 | 219 |
| 24 | 40 | retired | married | primary | 0 | 181 |
| 43 | 54 | retired | married | secondary | 529 | 1492 |
df_not_retired = df[df['job']!='retired']
df_not_retired.head()
|   | age | job | marital | education | balance | duration |
|---|-----|-----|---------|-----------|---------|----------|
| 0 | 58 | management | married | tertiary | 2143 | 261 |
| 1 | 44 | technician | single | secondary | 29 | 151 |
| 2 | 33 | entrepreneur | married | secondary | 2 | 76 |
| 3 | 47 | blue-collar | married | unknown | 1506 | 92 |
| 4 | 33 | unknown | single | unknown | 1 | 198 |
Selecting rows where a column value starts with a value
Next, we’ll try filtering the dataframe based on the values within the columns. For categorical data you can use Pandas string functions to filter the data. The startswith() function returns rows where a given column’s values start with a certain string, while endswith() returns rows whose values end with a certain string.
df_technician = df[df['job'].str.startswith('tech')]
df_technician.head()
|    | age | job | marital | education | balance | duration |
|----|-----|-----|---------|-----------|---------|----------|
| 1  | 44 | technician | single | secondary | 29 | 151 |
| 9  | 43 | technician | single | secondary | 593 | 55 |
| 12 | 53 | technician | married | secondary | 6 | 517 |
| 13 | 58 | technician | married | unknown | 71 | 71 |
| 29 | 36 | technician | single | secondary | 265 | 348 |
df_ar = df[df['job'].str.endswith('ar')]
df_ar.head()
|    | age | job | marital | education | balance | duration |
|----|-----|-----|---------|-----------|---------|----------|
| 3  | 47 | blue-collar | married | unknown | 1506 | 92 |
| 17 | 57 | blue-collar | married | primary | 52 | 38 |
| 20 | 28 | blue-collar | married | secondary | 723 | 262 |
| 22 | 32 | blue-collar | single | primary | 23 | 160 |
| 33 | 59 | blue-collar | married | secondary | 0 | 226 |
Selecting rows where a column value contains a value
To select rows where a column value contains a given value, we can use the str.contains() Pandas function. Here we’ll select all of the rows where the education value contains a y. To select the rows that do not contain the value, we place a tilde (the squiggle ~ character) immediately before df['education'].str.contains('y').
df_education = df[df['education'].str.contains('y')]
df_education.head()
|   | age | job | marital | education | balance | duration |
|---|-----|-----|---------|-----------|---------|----------|
| 0 | 58 | management | married | tertiary | 2143 | 261 |
| 1 | 44 | technician | single | secondary | 29 | 151 |
| 2 | 33 | entrepreneur | married | secondary | 2 | 76 |
| 5 | 35 | management | married | tertiary | 231 | 139 |
| 6 | 28 | management | single | tertiary | 447 | 217 |
df_education = df[~df['education'].str.contains('y')]
df_education.head()
|    | age | job | marital | education | balance | duration |
|----|-----|-----|---------|-----------|---------|----------|
| 3  | 47 | blue-collar | married | unknown | 1506 | 92 |
| 4  | 33 | unknown | single | unknown | 1 | 198 |
| 13 | 58 | technician | married | unknown | 71 | 71 |
| 16 | 45 | admin. | single | unknown | 13 | 98 |
| 42 | 60 | blue-collar | married | unknown | 104 | 22 |
Select rows where values are less than a value
For numeric columns we can use the regular numeric Python operators to select subsets of our dataframe. For example, df[df['balance'] < 10] returns all rows where the balance column value is less than 10, while df[df['balance'] <= 1] returns all rows where the balance column value is less than or equal to 1.
df_low_balance = df[df['balance'] < 10]
df_low_balance.head()
|    | age | job | marital | education | balance | duration |
|----|-----|-----|---------|-----------|---------|----------|
| 2  | 33 | entrepreneur | married | secondary | 2 | 76 |
| 4  | 33 | unknown | single | unknown | 1 | 198 |
| 7  | 42 | entrepreneur | divorced | tertiary | 2 | 380 |
| 12 | 53 | technician | married | secondary | 6 | 517 |
| 19 | 33 | services | married | secondary | 0 | 54 |
df_low_balance = df[df['balance'] <= 1]
df_low_balance.head()
|    | age | job | marital | education | balance | duration |
|----|-----|-----|---------|-----------|---------|----------|
| 4  | 33 | unknown | single | unknown | 1 | 198 |
| 19 | 33 | services | married | secondary | 0 | 54 |
| 24 | 40 | retired | married | primary | 0 | 181 |
| 25 | 44 | admin. | married | secondary | -372 | 172 |
| 28 | 46 | management | single | secondary | -246 | 255 |
Rows where column values are greater than a value
The same approach can be used to select rows with higher values. For example, df[df['balance'] > 1000] returns all rows where the balance column value is greater than 1000, and df[df['balance'] >= 500] returns all rows where the balance column value is greater than or equal to 500.
df_high_balance = df[df['balance'] > 1000]
df_high_balance.head()
|    | age | job | marital | education | balance | duration |
|----|-----|-----|---------|-----------|---------|----------|
| 0  | 58 | management | married | tertiary | 2143 | 261 |
| 3  | 47 | blue-collar | married | unknown | 1506 | 92 |
| 34 | 51 | management | married | tertiary | 10635 | 336 |
| 40 | 44 | services | divorced | secondary | 2586 | 160 |
| 50 | 54 | blue-collar | married | secondary | 1291 | 266 |
df_high_balance = df[df['balance'] >= 500]
df_high_balance.head()
|    | age | job | marital | education | balance | duration |
|----|-----|-----|---------|-----------|---------|----------|
| 0  | 58 | management | married | tertiary | 2143 | 261 |
| 3  | 47 | blue-collar | married | unknown | 1506 | 92 |
| 9  | 43 | technician | single | secondary | 593 | 55 |
| 20 | 28 | blue-collar | married | secondary | 723 | 262 |
| 21 | 56 | management | married | tertiary | 779 | 164 |
Rows where column values are between two numeric values
To select rows where a value lies between two values you can combine two conditions using parentheses and the ampersand (&) operator. Note that Pandas boolean indexing requires &, not Python's and keyword. For example, df[ (df['balance'] >= 100) & (df['balance'] <= 200) ] returns all customers who had a balance value of 100 to 200.
df_100_200 = df[ (df['balance'] >= 100) & (df['balance'] <= 200) ]
df_100_200.head()
|    | age | job | marital | education | balance | duration |
|----|-----|-----|---------|-----------|---------|----------|
| 8  | 58 | retired | married | primary | 121 | 50 |
| 14 | 57 | services | married | secondary | 162 | 174 |
| 27 | 52 | entrepreneur | married | secondary | 113 | 127 |
| 42 | 60 | blue-collar | married | unknown | 104 | 22 |
| 59 | 46 | services | married | primary | 179 | 1778 |
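As an aside, the same range filter can be written more compactly with the built-in between() method, which is inclusive at both ends by default. A sketch on a small stand-in column:

```python
import pandas as pd

# Stand-in frame with just a balance column
df = pd.DataFrame({'balance': [50, 121, 162, 250]})

# between(100, 200) keeps values from 100 to 200 inclusive,
# equivalent to (df['balance'] >= 100) & (df['balance'] <= 200)
df_100_200 = df[df['balance'].between(100, 200)]
print(df_100_200)
```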
Rows with columns matching multiple criteria
You can use parentheses to create compound selection criteria that mix categorical and numeric data. Here we’re using df[ (df['marital'] == 'married') & ( df['balance'] >= 2000 ) ] to select all of the customers who are married and have a balance of 2000 or more.
df_multiple = df[ (df['marital'] == 'married') & ( df['balance'] >= 2000 ) ]
df_multiple.head()
|    | age | job | marital | education | balance | duration |
|----|-----|-----|---------|-----------|---------|----------|
| 0  | 58 | management | married | tertiary | 2143 | 261 |
| 34 | 51 | management | married | tertiary | 10635 | 336 |
| 65 | 51 | management | married | tertiary | 6530 | 91 |
| 70 | 57 | blue-collar | married | secondary | 5935 | 258 |
| 83 | 59 | admin. | married | secondary | 2343 | 1042 |
You can make these statements as complex as you wish. For example, the below example selects all rows where customers are either married or divorced and have a balance of 2000 or more.
df_multiple = df[ ((df['marital'] == 'married') | (df['marital'] == 'divorced'))
& ( df['balance'] >= 2000 ) ]
df_multiple.head()
|    | age | job | marital | education | balance | duration |
|----|-----|-----|---------|-----------|---------|----------|
| 0  | 58 | management | married | tertiary | 2143 | 261 |
| 34 | 51 | management | married | tertiary | 10635 | 336 |
| 40 | 44 | services | divorced | secondary | 2586 | 160 |
| 65 | 51 | management | married | tertiary | 6530 | 91 |
| 70 | 57 | blue-collar | married | secondary | 5935 | 258 |
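A tidier way to express the married-or-divorced part of that condition is the isin() method, which replaces chained | comparisons against the same column. A sketch on a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    'marital': ['married', 'single', 'divorced', 'married'],
    'balance': [2143, 5000, 2586, 100],
})

# isin() tests membership in a list of values,
# equivalent to (marital == 'married') | (marital == 'divorced')
df_multiple = df[df['marital'].isin(['married', 'divorced']) & (df['balance'] >= 2000)]
print(df_multiple)
```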
Selecting specific rows with iloc
If you want to select specific rows from a dataframe based on their index value (the customer ID in our dataframe), you can pass the index values to iloc as a nested list. So, df.iloc[[70, 65, 40]] returns the rows for customers 70, 65, and 40. Strictly speaking, iloc selects by position, but the two coincide here because the dataframe has a default RangeIndex.
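The call itself, sketched on a stand-in frame with a plain RangeIndex (so positions and index labels coincide, as in the article's dataframe):

```python
import pandas as pd

# Stand-in frame with a RangeIndex of 0..99
df = pd.DataFrame({'age': range(100)})

# Rows at positions 70, 65 and 40, returned in that order
picked = df.iloc[[70, 65, 40]]
print(picked)
```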
|    | age | job | marital | education | balance | duration |
|----|-----|-----|---------|-----------|---------|----------|
| 70 | 57 | blue-collar | married | secondary | 5935 | 258 |
| 65 | 51 | management | married | tertiary | 6530 | 91 |
| 40 | 44 | services | divorced | secondary | 2586 | 160 |
Using slice notation to select a range of rows
The iloc technique is particularly useful because it supports slice notation. This means you can create subsets based on their index position, which can be useful for splitting up massive datasets into smaller chunks. For example, df.iloc[5:10] returns rows 5 to 9; as with Python slices, the stop value is excluded.
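A sketch of the slice on a small stand-in frame:

```python
import pandas as pd

# Stand-in frame with a RangeIndex of 0..19
df = pd.DataFrame({'age': range(20)})

# Positions 5 through 9; the stop position (10) is excluded
chunk = df.iloc[5:10]
print(chunk)
```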
|   | age | job | marital | education | balance | duration |
|---|-----|-----|---------|-----------|---------|----------|
| 5 | 35 | management | married | tertiary | 231 | 139 |
| 6 | 28 | management | single | tertiary | 447 | 217 |
| 7 | 42 | entrepreneur | divorced | tertiary | 2 | 380 |
| 8 | 58 | retired | married | primary | 121 | 50 |
| 9 | 43 | technician | single | secondary | 593 | 55 |
Selecting specific rows and columns with loc
The loc method can be used to mix these approaches and select subsets. For example, here we’ll select the rows where the index value is either 0 or 34 and return only the age, job, and education columns.
rows = [0, 34]
cols = ['age', 'job', 'education']
df.loc[rows, cols]
|    | age | job | education |
|----|-----|-----|-----------|
| 0  | 58 | management | tertiary |
| 34 | 51 | management | tertiary |
Select columns by data type
The other common thing you’ll need to do, especially if you work with machine learning models, is select data based on its data type. You can view the data types by entering df.info(). The data type appears in the Dtype column. As you can see below, we have three numeric columns containing int64 data and three non-numeric columns containing object or string data.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 45211 non-null int64
1 job 45211 non-null object
2 marital 45211 non-null object
3 education 45211 non-null object
4 balance 45211 non-null int64
5 duration 45211 non-null int64
dtypes: int64(3), object(3)
memory usage: 2.1+ MB
To obtain a list of the column names matching a specific data type you can use the select_dtypes() function and pass in a list of data types to return. Appending .columns returns these as the column names.
df.select_dtypes(['int64']).columns
Index(['age', 'balance', 'duration'], dtype='object')
df.select_dtypes(['object']).columns
Index(['job', 'marital', 'education'], dtype='object')
The select_dtypes() function includes a useful exclude argument that allows you to use a similar approach to return the columns that do not match a particular data type. For example, exclude=['object'] returns all columns that are not of the object data type. Numpy can be used to do similar things, so df.select_dtypes([np.number]).columns returns all numeric columns, whether they’re floats or integers, while df.select_dtypes(exclude=[np.number]).columns excludes them.
df.select_dtypes(exclude=['object']).columns
Index(['age', 'balance', 'duration'], dtype='object')
df.select_dtypes([np.number]).columns
Index(['age', 'balance', 'duration'], dtype='object')
df.select_dtypes(exclude=[np.number]).columns
Index(['job', 'marital', 'education'], dtype='object')
To view the columns based on their data type, we simply drop the .columns suffix and assign the output to a dataframe. This approach allows us to print a dataframe containing only the numeric columns and a dataframe containing only the non-numeric columns.
df_numeric = df.select_dtypes([np.number])
df_numeric.head()
|   | age | balance | duration |
|---|-----|---------|----------|
| 0 | 58 | 2143 | 261 |
| 1 | 44 | 29 | 151 |
| 2 | 33 | 2 | 76 |
| 3 | 47 | 1506 | 92 |
| 4 | 33 | 1 | 198 |
df_non_numeric = df.select_dtypes(exclude=[np.number])
df_non_numeric.head()
|   | job | marital | education |
|---|-----|---------|-----------|
| 0 | management | married | tertiary |
| 1 | technician | single | secondary |
| 2 | entrepreneur | married | secondary |
| 3 | blue-collar | married | unknown |
| 4 | unknown | single | unknown |
Sorting a dataframe by a specific column
Pandas includes the sort_values() function to let you sort the dataframe. This requires either the name of a single column by which to sort the data, or a list of multiple columns. For example, the by='balance' argument will sort the dataframe in ascending order of the account balance. To change this from the default, and show the customers with the highest balance first, we can use the by='balance', ascending=False arguments.
df.sort_values(by='balance').head()
|       | age | job | marital | education | balance | duration |
|-------|-----|-----|---------|-----------|---------|----------|
| 12909 | 26 | blue-collar | single | secondary | -8019 | 299 |
| 15682 | 49 | management | married | tertiary | -6847 | 206 |
| 38736 | 60 | management | divorced | tertiary | -4057 | 242 |
| 7413  | 43 | management | married | tertiary | -3372 | 386 |
| 1896  | 57 | self-employed | married | tertiary | -3313 | 153 |
df.sort_values(by='balance', ascending=False).head()
|       | age | job | marital | education | balance | duration |
|-------|-----|-----|---------|-----------|---------|----------|
| 39989 | 51 | management | single | tertiary | 102127 | 90 |
| 26227 | 59 | management | married | tertiary | 98417 | 145 |
| 43393 | 84 | retired | married | secondary | 81204 | 390 |
| 42558 | 84 | retired | married | secondary | 81204 | 679 |
| 41693 | 60 | retired | married | primary | 71188 | 205 |
Sorting by multiple columns
Similarly, if you want to sort a dataframe based on the values of multiple columns, you can pass in a list of column names. Here we’ll sort by job, then marital, then education, then balance.
df.sort_values(by=['job','marital','education','balance'], ascending=False).head()
|       | age | job | marital | education | balance | duration |
|-------|-----|-----|---------|-----------|---------|----------|
| 42394 | 44 | unknown | single | unknown | 18881 | 187 |
| 25748 | 30 | unknown | single | unknown | 6929 | 341 |
| 27158 | 39 | unknown | single | unknown | 2822 | 573 |
| 39803 | 47 | unknown | single | unknown | 1819 | 180 |
| 44651 | 48 | unknown | single | unknown | 1819 | 1440 |
Matt Clarke, Saturday, March 06, 2021