When working with Pandas, you’ll often need to sort a dataframe by one or more columns. While the Pandas sort_values()
method makes it easy to sort categorical data in alphabetical order or reverse alphabetical order, and numeric values in ascending or descending order, sometimes values need their own custom sort order because they’re not in alphabetical or numeric order.
For example, weekdays and month names are not in alphabetical order, and sizes are not in numeric order. So, if you want to sort a dataframe by weekdays, month names, or sizes, you need to create a custom sort order. In the past, I’ve done this using map()
, but there is a far better way - the Pandas CategoricalDtype()
.
The CategoricalDtype is a Pandas data type that specifies categorical data, but it also allows you to specify a custom sort order. This means you can use the CategoricalDtype to sort categorical data in any order you want, not just alphabetical or numeric order. In this post, you’ll learn how to use the Pandas CategoricalDtype
to create custom sort orders for weekdays, month names, and sizes, so you can sort them correctly.
To get started, open a Jupyter notebook and import the Pandas package and the CategoricalDtype
from the Pandas types
module.
import pandas as pd
from pandas.api.types import CategoricalDtype
First, we’ll create a custom sort order for weekdays. To show what happens without a custom sort order, let’s create a dataframe with a column of weekdays and use sort_values()
to sort the dataframe by the weekdays column. As you can see, the weekdays are sorted in alphabetical order, which is not the correct order in which days occur…
df = pd.DataFrame.from_dict({
"day": ['Monday', 'Sunday', 'Wednesday', 'Friday',
'Tuesday', 'Thursday', 'Saturday', ],
})
df.sort_values(by='day')
day | |
---|---|
3 | Friday |
0 | Monday |
6 | Saturday |
1 | Sunday |
5 | Thursday |
4 | Tuesday |
2 | Wednesday |
To solve the problem we’ll pass a list of weekday names to CategoricalDtype()
in the order we want them to be sorted along with the ordered=True
argument. Then, we’ll use the astype()
method to convert the day
column to the CategoricalDtype
we created. Finally, we’ll use sort_values()
to sort the dataframe by the day
column, and we’ll see that the weekdays are now sorted in the correct order.
weekday = CategoricalDtype(['Monday', 'Tuesday', 'Wednesday',
'Thursday', 'Friday', 'Saturday', 'Sunday'],
ordered=True)
df['day'] = df['day'].astype(weekday)
df.sort_values(by='day')
day | |
---|---|
0 | Monday |
4 | Tuesday |
2 | Wednesday |
5 | Thursday |
3 | Friday |
6 | Saturday |
1 | Sunday |
The other very common categorical data type that requires a custom sort order are month names. Let’s create a Pandas dataframe of month names in random order and then see again what happens with the default sort order of sort_values()
, which places them in alphabetical order.
df = pd.DataFrame.from_dict({
"month": ['January', 'October', 'November', 'February',
'May', 'June','March', 'April',
'July', 'August', 'September', 'December'],
})
df.sort_values(by='month')
month | |
---|---|
7 | April |
9 | August |
11 | December |
3 | February |
0 | January |
8 | July |
5 | June |
6 | March |
4 | May |
2 | November |
1 | October |
10 | September |
To create a custom sort order for month names, we can simply pass our list of month names to CategoricalDtype()
in the order we want them to be sorted along with the ordered=True
argument. Then, we’ll use the astype()
method to convert the month
column to the CategoricalDtype
we created. Finally, we’ll use sort_values()
to sort the dataframe by the month
column, and we’ll see that the months are now sorted in the correct order.
month = CategoricalDtype(['January', 'February', 'March', 'April',
'May', 'June', 'July', 'August',
'September', 'October', 'November', 'December'],
ordered=True)
df['month'] = df['month'].astype(month)
df.sort_values(by='month')
month | |
---|---|
0 | January |
3 | February |
6 | March |
7 | April |
4 | May |
5 | June |
8 | July |
9 | August |
10 | September |
1 | October |
2 | November |
11 | December |
Product, garment or clothing sizes are another common categorical data type that requires a custom sort order. Let’s create a Pandas dataframe of sizes in random order and then see again what happens with the default sort order of sort_values()
, which places them in alphabetical order.
df = pd.DataFrame.from_dict({
"size": ['L', 'XL', 'XS', 'S', 'M', 'XXL']
})
df.sort_values(by='size')
size | |
---|---|
0 | L |
4 | M |
3 | S |
1 | XL |
2 | XS |
5 | XXL |
To create a custom sort order for product sizes, we simply pass a list of sizes placed in the correct order to CategoricalDtype()
along with the ordered=True
argument. Then, we’ll use the astype()
method to convert the size
column to the CategoricalDtype
we created. Finally, we’ll use sort_values()
to sort the dataframe by the size
column, and we’ll see that the sizes are now sorted in the correct order.
size = CategoricalDtype(['XS', 'S', 'M', 'L', 'XL', 'XXL'], ordered=True)
df[''] = df['size'].astype(size)
df.sort_values(by='size')
size | |
---|---|
0 | XS |
4 | S |
3 | M |
1 | L |
2 | XL |
5 | XXL |
Matt Clarke, Sunday, November 06, 2022