When working with Pandas dataframes you’ll commonly need to sort the data in some way. This is easy to do with the sort_values()
and sort_index()
methods. These two methods allow you to sort in ascending or descending order, alphabetical or reverse alphabetical order, and using times, datetimes, or even custom sort orders of your own creation. In this post, we’ll go through the various features of the sort_values()
function, so you can use it to sort your data in the way you want.
The sort_values()
is the most commonly used way to sort data in Pandas dataframes. At its most basic, it can be used to control when data are ordered in ascending or descending order. For example, if you wanted to sort the sales
column in ascending order, however, the function also includes some more advanced parameters that allow you to do some much more sophisticated sorting. Here’s a summary of the parameters you can use with the sort_values()
method and what they do. We’ll work through each one in the next steps.
Parameter | Description |
---|---|
by | The by parameter takes a string or list of strings that denote the columns to sort the dataframe by. For example, df.sort_values(by='PageViews') will sort a dataframe by the PageViews column, while df.sort_values(by=['PageViews', 'Events']) will sort by both PageViews and Events . |
ascending | The ascending parameter is optional and defaults to True so when you don't pass the parameter the dataframe will be sorted in ascending order of the defined column. To sort the dataframe values in descending order you need to pass the argument ascending=False . |
kind | The optional kind parameter defines the sorting algorithm Pandas uses to sort the column. By default, this is set to quicksort , but you can also set kind to mergesort , heapsort , or stable if you prefer. The numpy.sort() documentation explains how these sorting algorithms work. |
inplace | The optional inplace parameter is set to False by default and allows you to sort a dataframe without saving the changes back to the original. If you set this to inplace=True , the sort order you set will automatically be saved back to the source dataframe without you needing to redeclare it back to the dataframe's variable name. |
na_position | The optional na_position parameter is set to last by default and puts NaN values at the bottom of the sort order. If you want NaN values to appear at the top of the sort you can specify na_position='first' . |
ignore_index | The optional ignore_index parameter is set to False by default. When you sort the dataframe this will preserve the original index numbers. If you want to renumber your index so the values represent the new sort order you can pass in the argument ignore_index=True . |
key | The optional key parameter lets you apply a lambda function to a column before you sort it. It's really useful for sorting monetary values, since you can remove currency formatting and cast the dtype to a numeric value to obtain an accurate sort. |
To get started, open a Jupyter notebook and import the Pandas and Numpy packages, then either load up a dataframe of your own data, or create a dataframe containing different Pandas dtypes and including some missing values. These will let us work through the various methods of sorting in Pandas to see how they handle data of different kinds.
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({
"page": [1, 2, 3, 4, 5, 6, 7],
"day": ['Monday', 'Wednesday', 'Tuesday', 'Sunday', 'Thursday', 'Saturday', 'Friday'],
"users": [34, 34, 45, 46, np.nan, 35, 42],
"sessions": [493, 433, np.nan, 478, 427, np.nan, 233],
"pageviews": [837, 35, 2342, 3423, 4322, 332, 300],
"events": [2347, 24663, 36323, 24463, 374, 322, 2563],
"revenue": ['£342.34', '£357.35', '£753.35', '£4732.34', '£584.35', '£2824.23', '£1008.98'],
"score": [1, 2, 3, -3, 4, -2, -1],
"dates": pd.date_range('2022-01-01', periods=7, freq='D'),
"datetimes": pd.date_range('2022-01-01 09:00', periods=7, freq='H'),
})
df
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
The most commonly used way of sorting a dataframe is to sort by a single column using the by
parameter. This will sort the dataframe in ascending order by default. That means numeric data will be sorted from smallest to largest and string data will be sorted alphabetically. If you want to sort in descending order, you can use the ascending
parameter and set it to False
.
df.sort_values(by='pageviews')
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
You can also sort a Pandas dataframe by multiple columns. For example, you might want to order by pageviews
first and then by events. To do this you simply pass a list of column names to the
by parameter. By default, Pandas will sort in ascending order. If you want to sort in descending order, you can use the
ascending parameter and set it to
False` for each value.
df.sort_values(by=['pageviews', 'events'])
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
As we saw above, by default Pandas will sort in ascending order. You don’t need to specify the ascending
parameter if you want to sort in ascending order, but it can make your code easier for others to read.
df.sort_values(by='pageviews', ascending=True)
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
To sort a Pandas dataframe column in descending order, you can set the ascending
parameter to False
.
df.sort_values(by='pageviews', ascending=False)
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
If the column contains non-numeric or categorical data with an object
or categorical
data type, you can sort the column in alphabetical order by setting the ascending
parameter to True
.
df.sort_values(by='day', ascending=True)
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
If you want to sort the column in reverse alphabetical order, set the ascending
parameter to False
.
df.sort_values(by='day', ascending=False)
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
If you want to sort multiple columns in their own orders, set the ascending
parameter to a list of True
and False
values.
df.sort_values(by=['pageviews', 'events'],
ascending=[False, True])
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
By default sort_values()
uses the quicksort
algorithm. You can change this by setting the kind
parameter to mergesort
or heapsort
. The Numpy documentation has more information on the differences between the algorithms.
df.sort_values(by='sessions', kind='quicksort')
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
df.sort_values(by='sessions', kind='mergesort')
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
df.sort_values(by='sessions', kind='heapsort')
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
df.sort_values(by='sessions', kind='stable')
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
If you want to save the sort order to the source dataframe, you can use the inplace
parameter. This will modify the source dataframe and return None
. It’s the same as using df = df.sort_values(by='sessions')
but it’s more efficient.
df
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
df.sort_values(by='sessions', inplace=True)
df
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
If you’re dealing with a dataset that contains missing values or NaNs, you can use the na_position
parameter to set how these values are sorted. By default, NaN values are sorted to the end of the DataFrame.
df.sort_values(by='sessions')
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
The above code should give you the same results as manually setting na_position='last'
.
df.sort_values(by='sessions', na_position='last')
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
If you want to bring the NaNs to the top, you can use na_position='first'
.
df.sort_values(by='sessions', na_position='first')
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
The ignore_index
parameter can be used to create a new index based on the sort order. This is useful if you want to reset the index after sorting. The default value is False
and will leave the original index intact.
```python
df.sort_values(by='sessions', ignore_index=False)
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
If you want to sort the data by a given sort order and then create a new index that reflects the sort order, you can do this by setting ignore_index
to True
.
df.sort_values(by='sessions', ignore_index=True)
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
1 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
2 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
4 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
5 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
6 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
The new key
argument added to the sort_values()
function in Pandas 1.1 allows you to apply a lambda
function to the column before sorting the data. This can be very useful when you want to sort object data that represent currency or monetary values, where the presence of a pound sign or dollar sign aids formatting but turns a float
into an object
, which can’t be sorted as effectively. For example, let’s see what happens when we attempt to sort the Revenue
column in descending order. The values, as you can see below, don’t get sorted as you would expect.
df.sort_values(by='revenue', ascending=False)
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
By using the key
parameter we can define a lambda
function that gets applied to the column before the data gets sorted, but which retains the original formatting. The example below uses the replace()
function to replace the pound sign with a blank space, then uses astype()
to cast the object
dtype to a float
dtype, allowing it to be sorted correctly. The sort_values()
returns the sorted values, but leaves the pound sign in the data.
df.sort_values(by='revenue',
ascending=False,
key=lambda val: val.str.replace('£', '').astype(float)
)
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
Sometimes, you might want to sort the data in a dataframe based on the index itself. You can do this using the sort_index()
method. The sort_index()
method has the same parameters as sort_values()
, but it only has one parameter, axis
, which is set to 0
by default. The axis
parameter is used to specify whether you want to sort the index or the columns. The axis
parameter can be set to 0
to sort the index, or 1
to sort the columns.
df.sort_index(ascending=True)
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
You can also use the sort_index()
function to sort a Pandas dataframe by column name. By default, sort_index()
sets ascending
to True, so columns will appear in alphabetical order, however, you can easily change this by using df.sort_index(axis=1, ascending=False)
instead of df.sort_index(axis=1)
.
df.sort_index(axis=1)
dates | datetimes | day | events | page | pageviews | revenue | score | sessions | users | |
---|---|---|---|---|---|---|---|---|---|---|
6 | 2022-01-07 | 2022-01-01 15:00:00 | Friday | 2563 | 7 | 300 | £1008.98 | -1 | 233.0 | 42.0 |
4 | 2022-01-05 | 2022-01-01 13:00:00 | Thursday | 374 | 5 | 4322 | £584.35 | 4 | 427.0 | NaN |
1 | 2022-01-02 | 2022-01-01 10:00:00 | Wednesday | 24663 | 2 | 35 | £357.35 | 2 | 433.0 | 34.0 |
3 | 2022-01-04 | 2022-01-01 12:00:00 | Sunday | 24463 | 4 | 3423 | £4732.34 | -3 | 478.0 | 46.0 |
0 | 2022-01-01 | 2022-01-01 09:00:00 | Monday | 2347 | 1 | 837 | £342.34 | 1 | 493.0 | 34.0 |
2 | 2022-01-03 | 2022-01-01 11:00:00 | Tuesday | 36323 | 3 | 2342 | £753.35 | 3 | NaN | 45.0 |
5 | 2022-01-06 | 2022-01-01 14:00:00 | Saturday | 322 | 6 | 332 | £2824.23 | -2 | NaN | 35.0 |
By default, when you sort a numeric column it will appear in ascending order, so if you’ve got a mixture of positive and negative values, the negative values will appear at the top of the column. You can flip the orientation so that the values appear in descending order by using df.sort_values(by='column_name', ascending=False)
.
df.sort_values(by='score')
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
In rare cases, you might want to consider negative and positive values equally via their absolute value. You can also do this using the key
argument and by passing the Pandas abs
function to it.
df.sort_values(by='score', key=pd.Series.abs)
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
Dates are also handled by sort_values()
by default. You can sort by date by passing the column name to the by
argument. The default sorting order is ascending, so the earliest date will be at the top of the dataframe. You can change this by passing ascending=False
to the sort_values()
function.
df.sort_values(by='dates', ascending=False)
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
You can also sort by datetime by passing the column name to the by
argument. The default sorting order is ascending, so the earliest datetime will be at the top of the dataframe. You can change this by passing ascending=False
to the sort_values()
function.
df.sort_values(by='datetimes', ascending=True)
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
Finally, you can even create custom sort orders of your own using the Pandas CategoricalDtype class. While the ascending
parameter allows you to put categorical variables or object
dtypes into alphabetical order, sometimes this won’t make sense. For example, as we saw above, weekdays in the Day
column of our test dataset are placed in alphabetical order, but this doesn’t represent the order in which they should be sorted, i.e. Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. Similarly, month names and product sizes, such as S, XS, M, L, XL, XXL etc, require special treatment via the creation of a custom sort order.
A really neat way to do this is via the Pandas CategoricalDtype
class of pandas.api.types
. To use this, you’ll first need to load the class, then create a list of values to which you want to set the CategoricalDtype
, and finally pass in the ordered=True
argument which defines the custom sort order. Once that’s done, you can use the astype()
function to cast the object
data to a new category
dtype, then you can pass the column name to by
and sort the data using your custom sort order.
from pandas.api.types import CategoricalDtype
weekday = CategoricalDtype(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], ordered=True)
df['day'] = df['day'].astype(weekday)
df.sort_values(by='day')
page | day | users | sessions | pageviews | events | revenue | score | dates | datetimes | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Monday | 34.0 | 493.0 | 837 | 2347 | £342.34 | 1 | 2022-01-01 | 2022-01-01 09:00:00 |
2 | 3 | Tuesday | 45.0 | NaN | 2342 | 36323 | £753.35 | 3 | 2022-01-03 | 2022-01-01 11:00:00 |
1 | 2 | Wednesday | 34.0 | 433.0 | 35 | 24663 | £357.35 | 2 | 2022-01-02 | 2022-01-01 10:00:00 |
4 | 5 | Thursday | NaN | 427.0 | 4322 | 374 | £584.35 | 4 | 2022-01-05 | 2022-01-01 13:00:00 |
6 | 7 | Friday | 42.0 | 233.0 | 300 | 2563 | £1008.98 | -1 | 2022-01-07 | 2022-01-01 15:00:00 |
5 | 6 | Saturday | 35.0 | NaN | 332 | 322 | £2824.23 | -2 | 2022-01-06 | 2022-01-01 14:00:00 |
3 | 4 | Sunday | 46.0 | 478.0 | 3423 | 24463 | £4732.34 | -3 | 2022-01-04 | 2022-01-01 12:00:00 |
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 6 to 5
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 page 7 non-null int64
1 day 7 non-null category
2 users 6 non-null float64
3 sessions 5 non-null float64
4 pageviews 7 non-null int64
5 events 7 non-null int64
6 revenue 7 non-null object
7 score 7 non-null int64
8 dates 7 non-null datetime64[ns]
9 datetimes 7 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](2), float64(2), int64(4), object(1)
memory usage: 923.0+ bytes
Matt Clarke, Sunday, November 06, 2022