How to create descriptive statistics using the Pandas describe function

The Pandas describe() function generates descriptive statistics for a Pandas dataframe, summarising the central tendency, dispersion, and shape of the distribution of each variable. Examining descriptive statistics is usually one of the first tasks in a quantitative data analysis, and they’re very quick and easy to generate using Pandas.

The describe() function is commonly used during the Exploratory Data Analysis (EDA) step, immediately after data is loaded into the dataframe, and lets the data scientist quickly understand the data within. Here’s a quick tutorial explaining how to use the describe() function.

Load the packages

First, open a Jupyter notebook and import the packages you’ll need. We’ll only need Pandas, which you’ll probably already have installed. If you need to install it you can enter the following command in a terminal: pip3 install pandas.

import pandas as pd

Import the data

Next, we’ll import data into Pandas from a CSV file using the read_csv() function to load a remote file from my GitHub repository of datasets. We’ll assign the data to a variable called df and print the first few rows of the dataframe using the head() function. Alternatively, you can load your own data, or create a Pandas dataframe from scratch.

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/google-analytics.csv')
df.head()

	User Type	Source	Medium	Browser	Device Category	Date	Pageviews
0	New Visitor	(direct)	(none)	Amazon Silk	mobile	2020-07-31	3
1	New Visitor	(direct)	(none)	Amazon Silk	mobile	2020-07-14	1
2	New Visitor	(direct)	(none)	Amazon Silk	tablet	2020-07-14	1
3	New Visitor	(direct)	(none)	Amazon Silk	tablet	2020-08-07	1
4	New Visitor	(direct)	(none)	Amazon Silk	tablet	2020-08-12	1
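
If you’d rather not rely on a remote file, the "create a dataframe from scratch" route might look something like the sketch below. The sample_df name and every value in it are invented for illustration; only the column names are borrowed from the dataset above.

```python
import pandas as pd

# Hypothetical rows that mimic the structure of the Google Analytics dataset.
# The values are made up and do not come from the real CSV file.
sample_df = pd.DataFrame({
    'Source': ['google', 'google', '(direct)', 'bing'],
    'Medium': ['organic', 'organic', '(none)', 'organic'],
    'Device Category': ['desktop', 'mobile', 'tablet', 'desktop'],
    'Pageviews': [3, 1, 1, 2],
})

# Preview the first few rows, just as we did for the loaded data.
print(sample_df.head())
```

A small dataframe like this is handy for trying out describe() options before running them on a larger dataset.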

Examine the data types in the dataframe

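To understand what the dataframe contains, we can use df.info(). The info() function prints out the data type of each column in the dataframe, along with the number of non-null values in each column. Comparing that count to the total number of entries shows whether any values are missing; here every column has 10000 non-null values out of 10000 entries, so nothing is missing.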

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User Type        10000 non-null  object
 1   Source           10000 non-null  object
 2   Medium           10000 non-null  object
 3   Browser          10000 non-null  object
 4   Device Category  10000 non-null  object
 5   Date             10000 non-null  object
 6   Pageviews        10000 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 547.0+ KB
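
The Non-Null Count column in the info() output can be turned into an explicit count of missing values with isna().sum(). The sketch below uses a tiny made-up dataframe (the demo name and its values are assumptions, not part of the dataset) with one gap in each column.

```python
import pandas as pd

# Hypothetical data with one missing value per column.
demo = pd.DataFrame({
    'Source': ['google', None, 'bing'],
    'Pageviews': [3, 1, None],
})

# Count the missing values in each column.
missing_counts = demo.isna().sum()

# Count the non-null values in each column, as reported by info().
non_null_counts = demo.notna().sum()

print(missing_counts)
print(non_null_counts)
```

On the Google Analytics dataframe used in this tutorial, the same call would be df.isna().sum().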

Use the describe() function to generate descriptive statistics

Next, we’ll use the Pandas describe() function to generate descriptive statistics for each column in the dataframe. By default, the describe() function will return descriptive statistics on the numeric columns in your dataframe only.

In the example dataframe above, only the Pageviews column contains numeric data. For each numeric column found in the dataframe, the describe() function will return the count, mean, standard deviation, minimum value, the 25th, 50th, and 75th percentile values (the 50th percentile is the median), and the maximum value.

df.describe()

	Pageviews
count	10000.000000
mean	1.447600
std	0.972393
min	1.000000
25%	1.000000
50%	1.000000
75%	2.000000
max	14.000000
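
To make the meaning of each row concrete, the sketch below rebuilds the describe() output for a single numeric column from the individual Series methods. The pageviews values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical pageview counts, invented for this example.
pageviews = pd.Series([1, 1, 1, 2, 3, 14], name='Pageviews')

# The summary produced by describe().
summary = pageviews.describe()

# The same statistics calculated one at a time.
manual = pd.Series({
    'count': pageviews.count(),
    'mean': pageviews.mean(),
    'std': pageviews.std(),            # sample standard deviation (ddof=1)
    'min': pageviews.min(),
    '25%': pageviews.quantile(0.25),
    '50%': pageviews.quantile(0.50),   # the median
    '75%': pageviews.quantile(0.75),
    'max': pageviews.max(),
})

print(summary)
print(manual)
```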

Generate descriptive statistics of the categorical columns

What many data scientists don’t realise is that you can also use describe() on dataframes that include categorical or non-numeric variables, such as object data types. To do this, you pass in the additional include='all' argument to ensure that all columns in the input dataframe are included in the output descriptive statistics.

In addition to the values returned by the default describe() function with no arguments, df.describe(include='all') will also return some additional statistics for categorical data columns. These include unique showing the number of unique values in the column, top showing the most common value in the column, and freq showing the frequency of the top value within the column.

df.describe(include='all')

	User Type	Source	Medium	Browser	Device Category	Date	Pageviews
count	10000	10000	10000	10000	10000	10000	10000.000000
unique	1	19	4	17	3	30	NaN
top	New Visitor	google	organic	Chrome	desktop	2020-08-03	NaN
freq	10000	6225	7509	6869	4882	395	NaN
mean	NaN	NaN	NaN	NaN	NaN	NaN	1.447600
std	NaN	NaN	NaN	NaN	NaN	NaN	0.972393
min	NaN	NaN	NaN	NaN	NaN	NaN	1.000000
25%	NaN	NaN	NaN	NaN	NaN	NaN	1.000000
50%	NaN	NaN	NaN	NaN	NaN	NaN	1.000000
75%	NaN	NaN	NaN	NaN	NaN	NaN	2.000000
max	NaN	NaN	NaN	NaN	NaN	NaN	14.000000
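
The unique, top, and freq rows can be cross-checked against nunique() and value_counts(). The sketch below does this on a small made-up column; the source name and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical traffic sources, invented for this example.
source = pd.Series(['google', 'google', 'bing', '(direct)', 'google'], name='Source')

# describe() on a non-numeric column returns count, unique, top and freq.
summary = source.describe()

# The same figures calculated individually.
counts = source.value_counts()     # sorted with the most common value first
unique_values = source.nunique()   # number of distinct values
top_value = counts.index[0]        # most common value
top_freq = counts.iloc[0]          # how often the most common value occurs

print(summary)
print(unique_values, top_value, top_freq)
```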

Transpose the dataframe to improve readability

If you are examining the descriptive statistics on a very large dataframe, you may want to use the Pandas transpose option by appending .T after the function is called. This flips the orientation of the rows and columns and can make larger dataframes much easier to read.

df.describe(include='all').T

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
User Type	10000	1	New Visitor	10000	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Source	10000	19	google	6225	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Medium	10000	4	organic	7509	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Browser	10000	17	Chrome	6869	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Device Category	10000	3	desktop	4882	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Date	10000	30	2020-08-03	395	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Pageviews	10000.0	NaN	NaN	NaN	1.4476	0.972393	1.0	1.0	1.0	2.0	14.0
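
As a small illustration of what .T does, the sketch below compares the two orientations on a made-up dataframe. The demo name, the Sessions column, and the values are all assumptions for the example.

```python
import pandas as pd

# Hypothetical numeric data, invented for this example.
demo = pd.DataFrame({'Pageviews': [1, 2, 3], 'Sessions': [1, 1, 2]})

# Default orientation: one column per variable, one row per statistic.
wide = demo.describe()

# Transposed: one row per variable, one column per statistic.
tall = demo.describe().T

print(wide)
print(tall)
```

The .T attribute is shorthand for the transpose() method, so demo.describe().transpose() gives the same result.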

Generate descriptive statistics for specific columns

If you only want to generate descriptive statistics for specific columns in your Pandas dataframe, you can select them by passing a list of column names inside square brackets, which gives the double square bracket syntax. For example, df[['Source', 'Medium']] will filter the Pandas dataframe to show only the two specified columns, and you can then append .describe() to view their summary statistics.

df[['Source', 'Medium']].describe()

	Source	Medium
count	10000	10000
unique	19	4
top	google	organic
freq	6225	7509
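
The number of brackets matters. As a rough sketch on a made-up dataframe (demo and its values are assumptions), double brackets keep the selection as a dataframe, while single brackets select one column as a Series, so describe() returns a Series too.

```python
import pandas as pd

# Hypothetical data, invented for this example.
demo = pd.DataFrame({
    'Source': ['google', 'bing', 'google'],
    'Medium': ['organic', 'organic', 'cpc'],
    'Pageviews': [3, 1, 2],
})

# A list of column names (double brackets) keeps the result as a DataFrame.
two_columns = demo[['Source', 'Medium']].describe()

# A single column name (single brackets) selects a Series,
# so describe() returns a Series as well.
one_column = demo['Pageviews'].describe()

print(two_columns)
print(one_column)
```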

Generate descriptive statistics for specific data types

Another useful technique is to use the describe() function to summarise only the columns of a specific data type. To do this you can pass a list of data types to the include argument when calling describe(). If you run df.info() first it will tell you what data types are present within the dataframe. You can then specify the list and pass it like this: df.describe(include=['object']).

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User Type        10000 non-null  object
 1   Source           10000 non-null  object
 2   Medium           10000 non-null  object
 3   Browser          10000 non-null  object
 4   Device Category  10000 non-null  object
 5   Date             10000 non-null  object
 6   Pageviews        10000 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 547.0+ KB

df.describe(include=['object'])

	User Type	Source	Medium	Browser	Device Category	Date
count	10000	10000	10000	10000	10000	10000
unique	1	19	4	17	3	30
top	New Visitor	google	organic	Chrome	desktop	2020-08-03
freq	10000	6225	7509	6869	4882	395
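
The describe() function also accepts an exclude argument, which works the other way round. The sketch below, on a made-up dataframe (demo and its values are assumptions), uses NumPy’s np.number type to split the summary into numeric and non-numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for this example.
demo = pd.DataFrame({
    'Source': ['google', 'bing', 'google'],
    'Pageviews': [3, 1, 2],
})

# Summarise only the numeric columns.
numeric_summary = demo.describe(include=[np.number])

# Summarise everything except the numeric columns.
non_numeric_summary = demo.describe(exclude=[np.number])

print(numeric_summary)
print(non_numeric_summary)
```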

Changing the percentile values to use

If we go back to our describe(include='all') output, we’ll see that by default it includes count, unique, top, freq, mean, std, min, and max, plus the percentile values for 25%, 50%, and 75%.

df.describe(include='all').T

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
User Type	10000	1	New Visitor	10000	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Source	10000	19	google	6225	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Medium	10000	4	organic	7509	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Browser	10000	17	Chrome	6869	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Device Category	10000	3	desktop	4882	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Date	10000	30	2020-08-03	395	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Pageviews	10000.0	NaN	NaN	NaN	1.4476	0.972393	1.0	1.0	1.0	2.0	14.0

Add additional custom percentiles

However, it’s also possible to define your own percentiles, such as 10%, 20%, and 80%, if you want to get an idea of the spread of data in different areas. To do this, pass a list of values to the percentiles argument when calling the describe() function, using a decimal between 0 and 1 to indicate the percentage, for example .10 for 10%. Note that the list replaces the default 25% and 75% values rather than adding to them, while the output below still includes the 50% value.

df.describe(include='all', percentiles=[.10, .20, .80]).T

	count	unique	top	freq	mean	std	min	10%	20%	50%	80%	max
User Type	10000	1	New Visitor	10000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Source	10000	19	google	6225	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Medium	10000	4	organic	7509	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Browser	10000	17	Chrome	6869	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Device Category	10000	3	desktop	4882	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Date	10000	30	2020-08-03	395	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Pageviews	10000.0	NaN	NaN	NaN	1.4476	0.972393	1.0	1.0	1.0	1.0	2.0	14.0
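
To check what the custom percentile rows contain, the sketch below compares them with the quantile() method on a made-up Series. The pageviews values are assumptions for illustration, and the sketch deliberately makes no assumption about the 50% row.

```python
import pandas as pd

# Hypothetical pageview counts from 1 to 100, invented for this example.
pageviews = pd.Series(range(1, 101), name='Pageviews')

# Request the 10th, 20th and 80th percentiles instead of the defaults.
summary = pageviews.describe(percentiles=[.10, .20, .80])

# The same values calculated directly with quantile().
p10 = pageviews.quantile(0.10)
p20 = pageviews.quantile(0.20)
p80 = pageviews.quantile(0.80)

print(summary)
print(p10, p20, p80)
```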

Matt Clarke, Saturday, November 27, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.