How to create descriptive statistics using the Pandas describe function

The Pandas describe function lets you quickly create descriptive statistics from a dataframe and is the first step in Exploratory Data Analysis or EDA.


The Pandas describe() function generates descriptive statistics on the contents of a Pandas dataframe to show the central tendency, shape, distribution, and dispersion of variables. Examining descriptive statistics is the first task in any quantitative data analysis, and they’re very quick and easy to generate using Pandas.

The describe() function is commonly used during the Exploratory Data Analysis or EDA step immediately after data is loaded into the dataframe and allows the data scientist to quickly understand the data within. Here’s a quick tutorial explaining how to use the describe() function.

Load the packages

First, open a Jupyter notebook and import the packages you’ll need. We’ll only need Pandas, which you’ll probably already have installed. If you need to install it you can enter the following command in a terminal: pip3 install pandas.

import pandas as pd

Import the data

Next, we’ll import data into Pandas from a CSV file using the read_csv() function to load a remote file from my GitHub repository of datasets. We’ll assign the data to a variable called df and print the first few rows of the dataframe using the head() function. Alternatively, you can load your own data, or create a Pandas dataframe from scratch.

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/google-analytics.csv')
df.head()
     User Type    Source  Medium      Browser  Device Category        Date  Pageviews
0  New Visitor  (direct)  (none)  Amazon Silk           mobile  2020-07-31          3
1  New Visitor  (direct)  (none)  Amazon Silk           mobile  2020-07-14          1
2  New Visitor  (direct)  (none)  Amazon Silk           tablet  2020-07-14          1
3  New Visitor  (direct)  (none)  Amazon Silk           tablet  2020-08-07          1
4  New Visitor  (direct)  (none)  Amazon Silk           tablet  2020-08-12          1
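
If the remote CSV isn’t reachable, a small stand-in dataframe can be built from scratch instead. The following is a minimal sketch: the column names match the file above, but the rows are made-up illustrative values rather than data from the original dataset, and the variable name sample is an assumption.

```python
import pandas as pd

# Hypothetical rows that mimic the structure of the Google Analytics CSV.
sample = pd.DataFrame({
    'User Type': ['New Visitor', 'New Visitor', 'New Visitor', 'New Visitor'],
    'Source': ['(direct)', 'google', 'google', 'bing'],
    'Medium': ['(none)', 'organic', 'organic', 'organic'],
    'Browser': ['Amazon Silk', 'Chrome', 'Chrome', 'Edge'],
    'Device Category': ['mobile', 'desktop', 'tablet', 'desktop'],
    'Date': ['2020-07-31', '2020-08-03', '2020-08-03', '2020-08-07'],
    'Pageviews': [3, 1, 2, 1],
})

# Preview the first rows, as we did for the CSV-backed dataframe.
print(sample.head())
```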

Examine the data types in the dataframe

To understand what the dataframe contains, we can use df.info(). The info() function prints the data type of each column and the number of non-null values it holds, so any column with fewer non-null values than the total number of entries contains missing data.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User Type        10000 non-null  object
 1   Source           10000 non-null  object
 2   Medium           10000 non-null  object
 3   Browser          10000 non-null  object
 4   Device Category  10000 non-null  object
 5   Date             10000 non-null  object
 6   Pageviews        10000 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 547.0+ KB
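
Since info() reports non-null counts rather than missing counts, it can help to count the missing values directly. This is a small sketch using a hypothetical dataframe called demo with one gap in each column; the example dataset above has no missing values, so it would return zero everywhere.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
demo = pd.DataFrame({
    'Source': ['google', None, 'bing'],
    'Pageviews': [1.0, 2.0, np.nan],
})

# isna() flags missing cells and sum() counts them per column.
missing = demo.isna().sum()
print(missing)

# The same figure follows from the non-null counts that info() shows.
non_null = demo.notna().sum()
print(len(demo) - non_null)
```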

Use the describe() function to generate descriptive statistics

Next, we’ll use the Pandas describe() function to generate descriptive statistics for each column in the dataframe. By default, the describe() function will return descriptive statistics on the numeric columns in your dataframe only.

In the example dataframe above, only the Pageviews column contains numeric data. For each numeric column found in the dataframe, the describe() function will return the count, mean, standard deviation, minimum value, maximum value, and the 25th, 50th and 75th percentiles (the 50th percentile being the median).

df.describe()
          Pageviews
count  10000.000000
mean       1.447600
std        0.972393
min        1.000000
25%        1.000000
50%        1.000000
75%        2.000000
max       14.000000
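
Each row of the describe() output corresponds to a method you could call on the column yourself. The sketch below uses a tiny hypothetical Pageviews column (not the real data) to show that relationship; note that std is the sample standard deviation, which is what Series.std() computes by default.

```python
import pandas as pd

# Hypothetical values, chosen so the statistics are easy to check by hand.
pageviews = pd.DataFrame({'Pageviews': [1, 1, 1, 2, 5]})
col = pageviews['Pageviews']

stats = pageviews.describe()
print(stats)

# The same numbers, computed one at a time.
manual = {
    'count': col.count(),
    'mean': col.mean(),
    'std': col.std(),          # sample standard deviation (ddof=1)
    'min': col.min(),
    '25%': col.quantile(0.25),
    '50%': col.median(),
    '75%': col.quantile(0.75),
    'max': col.max(),
}
print(manual)
```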

Generate descriptive statistics of the categorical columns

What many data scientists don’t realise is that you can also use describe() on dataframes that include categorical or non-numeric variables, such as object data types. To do this, pass the additional include='all' argument so that every column in the input dataframe is included in the output descriptive statistics.

In addition to the values returned by the default describe() function with no arguments, df.describe(include='all') will also return some additional statistics for categorical data columns. These are unique, showing the number of unique values in the column; top, showing the most common value in the column; and freq, showing how often the top value occurs within the column. Statistics that don’t apply to a column’s data type are shown as NaN.

df.describe(include='all')
          User Type  Source   Medium  Browser  Device Category        Date     Pageviews
count         10000   10000    10000    10000            10000       10000  10000.000000
unique            1      19        4       17                3          30           NaN
top     New Visitor  google  organic   Chrome          desktop  2020-08-03           NaN
freq          10000    6225     7509     6869             4882         395           NaN
mean            NaN     NaN      NaN      NaN              NaN         NaN      1.447600
std             NaN     NaN      NaN      NaN              NaN         NaN      0.972393
min             NaN     NaN      NaN      NaN              NaN         NaN      1.000000
25%             NaN     NaN      NaN      NaN              NaN         NaN      1.000000
50%             NaN     NaN      NaN      NaN              NaN         NaN      1.000000
75%             NaN     NaN      NaN      NaN              NaN         NaN      2.000000
max             NaN     NaN      NaN      NaN              NaN         NaN     14.000000
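
To see where unique, top, and freq come from, it can help to compare them with nunique() and value_counts() on a single column. This is a minimal sketch on a hypothetical Browser column with made-up values.

```python
import pandas as pd

# Hypothetical categorical column.
browsers = pd.Series(['Chrome', 'Chrome', 'Safari', 'Chrome', 'Edge'], name='Browser')

summary = browsers.describe()
print(summary)

# unique is the number of distinct values.
print(browsers.nunique())

# top and freq are the first entry of value_counts(), which sorts by frequency.
counts = browsers.value_counts()
print(counts.index[0], counts.iloc[0])
```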

Transpose the dataframe to improve readability

If you are examining the descriptive statistics on a dataframe with many columns, you may want to transpose the output by appending .T after the call to describe(). This swaps the rows and columns so that each variable gets its own row, which can make wide dataframes much easier to read.

df.describe(include='all').T
                   count  unique          top   freq    mean       std  min  25%  50%  75%   max
User Type          10000       1  New Visitor  10000     NaN       NaN  NaN  NaN  NaN  NaN   NaN
Source             10000      19       google   6225     NaN       NaN  NaN  NaN  NaN  NaN   NaN
Medium             10000       4      organic   7509     NaN       NaN  NaN  NaN  NaN  NaN   NaN
Browser            10000      17       Chrome   6869     NaN       NaN  NaN  NaN  NaN  NaN   NaN
Device Category    10000       3      desktop   4882     NaN       NaN  NaN  NaN  NaN  NaN   NaN
Date               10000      30   2020-08-03    395     NaN       NaN  NaN  NaN  NaN  NaN   NaN
Pageviews        10000.0     NaN          NaN    NaN  1.4476  0.972393  1.0  1.0  1.0  2.0  14.0
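
The .T attribute is shorthand for the transpose() method, so either spelling gives the same result. Here is a brief sketch on a hypothetical two-column dataframe showing how the shape flips.

```python
import pandas as pd

# Hypothetical two-column frame, used only to illustrate the transpose.
frame = pd.DataFrame({
    'Medium': ['organic', 'organic', 'cpc'],
    'Pageviews': [1, 2, 6],
})

summary = frame.describe(include='all')
flipped = summary.T  # equivalent to summary.transpose()

print(summary.shape)  # statistics as rows, original columns as columns
print(flipped.shape)  # one row per original column
print(flipped.index.tolist())
```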

Generate descriptive statistics for specific columns

If you only want to generate descriptive statistics for specific columns in your Pandas dataframe, you can pass a list of column names inside the indexing brackets, which gives the double square bracket syntax. For example, df[['Source', 'Medium']] will filter the Pandas dataframe to show only the two specified columns, and you can then append .describe() to view their summary statistics.

df[['Source', 'Medium']].describe()
        Source   Medium
count    10000    10000
unique      19        4
top     google  organic
freq      6225     7509
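
The same approach works for a single column, with one difference worth knowing: double brackets return a dataframe, while single brackets return a Series, and describe() mirrors whichever you give it. The sketch below assumes a small hypothetical dataframe called frame.

```python
import pandas as pd

# Hypothetical data with two text columns and one numeric column.
frame = pd.DataFrame({
    'Source': ['google', 'google', 'bing'],
    'Medium': ['organic', 'cpc', 'organic'],
    'Pageviews': [1, 2, 4],
})

# A list of names selects a DataFrame, so describe() returns a DataFrame.
as_frame = frame[['Source', 'Medium']].describe()
print(type(as_frame).__name__)

# A single name selects a Series, so describe() returns a Series.
as_series = frame['Pageviews'].describe()
print(type(as_series).__name__)
```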

Generate descriptive statistics for specific data types

Another useful technique is to use the describe() function to show only data of a specific data type. To do this you can pass a list of data types to the include argument when calling describe(). If you run df.info() first it will tell you what data types are present within the dataframe. You can then specify the list and pass it like this: df.describe(include=['object']).

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User Type        10000 non-null  object
 1   Source           10000 non-null  object
 2   Medium           10000 non-null  object
 3   Browser          10000 non-null  object
 4   Device Category  10000 non-null  object
 5   Date             10000 non-null  object
 6   Pageviews        10000 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 547.0+ KB
df.describe(include=['object'])
          User Type  Source   Medium  Browser  Device Category        Date
count         10000   10000    10000    10000            10000       10000
unique            1      19        4       17                3          30
top     New Visitor  google  organic   Chrome          desktop  2020-08-03
freq          10000    6225     7509     6869             4882         395
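
The describe() function also accepts an exclude argument, which works the other way round, and NumPy’s np.number can be used to refer to all numeric types at once. The following is a sketch on a hypothetical dataframe; the Sessions column is an assumption added for illustration and is not in the example dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and two numeric columns.
frame = pd.DataFrame({
    'Source': ['google', 'google', 'bing'],
    'Pageviews': [1, 2, 4],
    'Sessions': [1.0, 1.5, 2.0],
})

# Keep only numeric columns (integers and floats).
numeric_only = frame.describe(include=[np.number])
print(numeric_only)

# Drop numeric columns and summarise whatever is left.
non_numeric = frame.describe(exclude=[np.number])
print(non_numeric)
```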

Changing the percentile values to use

If we go back to our describe(include='all') output, we’ll see that it includes count, unique, top, freq, mean, std, min, and max, plus the 25%, 50%, and 75% percentile values.

df.describe(include='all').T
                   count  unique          top   freq    mean       std  min  25%  50%  75%   max
User Type          10000       1  New Visitor  10000     NaN       NaN  NaN  NaN  NaN  NaN   NaN
Source             10000      19       google   6225     NaN       NaN  NaN  NaN  NaN  NaN   NaN
Medium             10000       4      organic   7509     NaN       NaN  NaN  NaN  NaN  NaN   NaN
Browser            10000      17       Chrome   6869     NaN       NaN  NaN  NaN  NaN  NaN   NaN
Device Category    10000       3      desktop   4882     NaN       NaN  NaN  NaN  NaN  NaN   NaN
Date               10000      30   2020-08-03    395     NaN       NaN  NaN  NaN  NaN  NaN   NaN
Pageviews        10000.0     NaN          NaN    NaN  1.4476  0.972393  1.0  1.0  1.0  2.0  14.0

Add additional custom percentiles

However, it’s also possible to define your own percentiles, such as 10%, 20%, and 80%, if you want to get an idea of the spread of data in different areas. To do this, pass a list of values to the percentiles argument when calling the describe() function, using a decimal between 0 and 1 to indicate the percentage, for example .10 for 10%. The list replaces the default 25% and 75% values, while the 50% value (the median) still appears in the output below.

df.describe(include='all', percentiles=[.10, .20, .80]).T
                   count  unique          top   freq    mean       std  min  10%  20%  50%  80%   max
User Type          10000       1  New Visitor  10000     NaN       NaN  NaN  NaN  NaN  NaN  NaN   NaN
Source             10000      19       google   6225     NaN       NaN  NaN  NaN  NaN  NaN  NaN   NaN
Medium             10000       4      organic   7509     NaN       NaN  NaN  NaN  NaN  NaN  NaN   NaN
Browser            10000      17       Chrome   6869     NaN       NaN  NaN  NaN  NaN  NaN  NaN   NaN
Device Category    10000       3      desktop   4882     NaN       NaN  NaN  NaN  NaN  NaN  NaN   NaN
Date               10000      30   2020-08-03    395     NaN       NaN  NaN  NaN  NaN  NaN  NaN   NaN
Pageviews        10000.0     NaN          NaN    NaN  1.4476  0.972393  1.0  1.0  1.0  1.0  2.0  14.0
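
The custom percentile rows are the same values you would get from calling quantile() on the column directly. This is a short sketch on a hypothetical Pageviews series with made-up values; whether the 50% row is added automatically may depend on your Pandas version, so the example only relies on the percentiles that were explicitly requested.

```python
import pandas as pd

# Hypothetical page view counts with a long right tail.
views = pd.Series([1, 1, 1, 1, 2, 2, 3, 4, 6, 14], name='Pageviews')

custom = views.describe(percentiles=[.10, .20, .80])
print(custom)

# Each requested percentile matches the equivalent quantile() call.
for label, q in [('10%', 0.10), ('20%', 0.20), ('80%', 0.80)]:
    print(label, custom[label], views.quantile(q))
```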

Matt Clarke, Saturday, November 27, 2021

Matt Clarke Matt is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
