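When working with categorical data in Pandas dataframes, it can help to understand how many times each value appears, in other words the frequency distribution of the column. This is related to, but distinct from, “cardinality,” which is the number of unique values a column contains. The Pandas value_counts() function is ideal for examining frequencies.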
The value_counts() function returns a Series containing each unique value in the column and the number of times it occurs. However, what if you want to extract the most common value itself, rather than the full frequency distribution?
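To make the difference between counting distinct values and counting occurrences concrete, here is a minimal sketch. The browsers Series is made-up sample data for illustration, not the Google Analytics dataset used later in this tutorial.

```python
import pandas as pd

# Hypothetical sample data standing in for a categorical column
browsers = pd.Series(["Chrome", "Safari", "Chrome", "Edge", "Chrome", "Safari"])

# Number of distinct values in the column
distinct_count = browsers.nunique()

# Occurrences of each distinct value, sorted from most to least common
counts = browsers.value_counts()

print(distinct_count)
print(counts)
```

Here nunique() gives a single number, while value_counts() gives one count per distinct value.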
As you might expect, Pandas includes a variety of functions that can be used to determine the most common value in a dataframe column. In this quick tutorial, we’ll go over some code examples that show you how to find the most common value using value_counts(), mode(), idxmax(), and nlargest().
To get started, open a Jupyter notebook and import some data into a Pandas dataframe. I’ve created a CSV file of Google Analytics data that you can use if you don’t have a suitable dataset of your own.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/google-analytics.csv')
df.head()
First, we’ll use the Pandas value_counts() function to examine the distribution of the data in the Browser column. This is a categorical variable holding the name of the web browser used to access the site.
To use value_counts(), you simply call it on the dataframe column. Optionally, you can also append the to_frame() method to convert the result to a dataframe. Running the code shows us that Chrome is the most widely used browser, with 6869 occurrences in the Browser column.
df['Browser'].value_counts().to_frame()
| | Browser |
|---|---|
| Chrome | 6869 |
| Safari | 1379 |
| Edge | 817 |
| Samsung Internet | 321 |
| Amazon Silk | 216 |
| Firefox | 177 |
| Internet Explorer | 130 |
| Android Webview | 45 |
| Android Browser | 16 |
| Opera | 10 |
| Safari (in-app) | 10 |
| Opera Mini | 3 |
| Playstation 4 | 2 |
| awin.com - site screen shotter | 2 |
| UC Browser | 1 |
| Mozilla Compatible Agent | 1 |
| Iron | 1 |
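If you would rather see each value's share of the column than its raw count, value_counts() accepts a normalize=True argument. The sketch below uses a small made-up Series rather than the dataset above, so the figures are illustrative only.

```python
import pandas as pd

# Hypothetical sample data, not the Google Analytics dataset
browsers = pd.Series(
    ["Chrome", "Chrome", "Chrome", "Safari", "Edge"], name="Browser"
)

# normalize=True returns relative frequencies (fractions that sum to 1)
shares = browsers.value_counts(normalize=True)

print(shares)
```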
Next, we’ll extract only the most common value in the column by using value_counts() and idxmax() together. The value_counts() function counts the number of times each value appears, and the idxmax() function returns the index label of the highest count, which here is the browser name.
df['Browser'].value_counts().idxmax()
'Chrome'
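If you also need the count that goes with the most common value, one option is to keep the value_counts() result in a variable and call both idxmax() and max() on it. This is a minimal sketch on made-up sample data; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample data standing in for df['Browser']
browsers = pd.Series(["Chrome", "Safari", "Chrome", "Edge", "Chrome", "Safari"])

counts = browsers.value_counts()

# Index label (the value itself) of the largest count
top_value = counts.idxmax()

# The largest count
top_count = counts.max()

print(top_value, top_count)
```

Because value_counts() sorts its result in descending order by default, counts.index[0] gives the same label here.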
We can also find the most common value in a Pandas dataframe column using the mode() function. You can run the mode() function on an entire dataframe using df.mode() and it will return the most common value in each column. If any column has several equally common values, more than one row is returned.
df.mode()
| | User Type | Source | Medium | Browser | Device Category | Date | Pageviews |
|---|---|---|---|---|---|---|---|
| 0 | New Visitor | organic | Chrome | desktop | 2020-08-03 | 1 |
You can also use the mode() function to get the most common value in a single column. For example, df['Browser'].mode() will return the most common browser. Since this returns a Series rather than the actual value (a column can have more than one mode), you need to append [0] to the end to extract the first value.
df['Browser'].mode()
0 Chrome
dtype: object
df['Browser'].mode()[0]
'Chrome'
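It is worth seeing what mode() does when two values are tied, since that is why it returns a Series. The sketch below uses made-up sample data in which Chrome and Safari both appear twice.

```python
import pandas as pd

# Hypothetical sample data with a tie between Chrome and Safari
browsers = pd.Series(["Chrome", "Safari", "Chrome", "Safari", "Edge"])

# mode() returns every value that shares the highest frequency, in sorted order
modes = browsers.mode()
print(modes)

# [0] picks only the first of the tied values
first_mode = modes[0]
print(first_mode)
```

If ties matter for your analysis, check len(modes) before taking the first element.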
Finally, there’s nlargest(). The nlargest() function, as the name suggests, returns the n largest values in a Series, where n is a number you choose. When you use it with value_counts(), it returns the most common values ranked by their number of occurrences.
For example, df['Browser'].value_counts().nlargest(3) will return the three most commonly seen browsers and the number of occurrences.
df['Browser'].value_counts().nlargest(3)
Chrome 6869
Safari 1379
Edge 817
Name: Browser, dtype: int64
Since the browser name is stored in index[0] and the number of occurrences is stored in values[0], we can use df['Browser'].value_counts().nlargest(1).index[0] to get the most common browser name.
df['Browser'].value_counts().nlargest(1).index[0]
'Chrome'
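To compare the approaches side by side, the sketch below wraps them in a small helper. The most_common() function and the sample data are hypothetical and written only for illustration. With a single clear winner, all of the approaches should agree.

```python
import pandas as pd


def most_common(series: pd.Series) -> dict:
    """Return the most common value found by each approach (illustrative helper)."""
    counts = series.value_counts()
    return {
        "value_counts": counts.index[0],          # first label of the sorted counts
        "idxmax": counts.idxmax(),                # label of the largest count
        "mode": series.mode()[0],                 # first mode of the raw column
        "nlargest": counts.nlargest(1).index[0],  # label of the single largest count
    }


# Hypothetical sample data with one clear winner
browsers = pd.Series(["Chrome", "Safari", "Chrome", "Edge", "Chrome", "Safari"])

results = most_common(browsers)
print(results)
```

When values are tied, the approaches are not guaranteed to pick the same one, so it is safest to choose one method and use it consistently.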
Matt Clarke, Saturday, November 26, 2022