The Pandas describe() function generates descriptive statistics on the contents of a Pandas dataframe to show the central tendency, shape, distribution, and dispersion of variables. Examining descriptive statistics is usually one of the first tasks in a quantitative data analysis, and they’re very quick and easy to generate using Pandas.
The describe() function is commonly used during the Exploratory Data Analysis (EDA) step, immediately after data is loaded into the dataframe, and lets the data scientist quickly understand the data within. Here’s a quick tutorial explaining how to use the describe() function.
First, open a Jupyter notebook and import the packages you’ll need. We’ll only need Pandas, which you’ll probably already have installed. If you need to install it you can enter the following command in a terminal: pip3 install pandas
import pandas as pd
Next, we’ll import data into Pandas from a CSV file using the read_csv() function to load a remote file from my GitHub repository of datasets. We’ll assign the data to a variable called df and print the first few rows of the dataframe using the head() function. Alternatively, you can load your own data, or create a Pandas dataframe from scratch.
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/google-analytics.csv')
df.head()
| | User Type | Source | Medium | Browser | Device Category | Date | Pageviews |
---|---|---|---|---|---|---|---|
0 | New Visitor | (direct) | (none) | Amazon Silk | mobile | 2020-07-31 | 3 |
1 | New Visitor | (direct) | (none) | Amazon Silk | mobile | 2020-07-14 | 1 |
2 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-07-14 | 1 |
3 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-08-07 | 1 |
4 | New Visitor | (direct) | (none) | Amazon Silk | tablet | 2020-08-12 | 1 |
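If you would rather not load a remote file, a dataframe can be built from scratch. The sketch below is only an illustration: the rows are made-up sample values that mimic the columns of the dataset above, and the name df_sample is hypothetical.

```python
import pandas as pd

# Hypothetical sample rows that mimic the columns of the Google Analytics dataset.
data = {
    'User Type': ['New Visitor', 'New Visitor', 'New Visitor', 'New Visitor'],
    'Source': ['(direct)', 'google', 'google', 'bing'],
    'Medium': ['(none)', 'organic', 'organic', 'organic'],
    'Browser': ['Chrome', 'Chrome', 'Safari', 'Edge'],
    'Device Category': ['desktop', 'mobile', 'desktop', 'tablet'],
    'Date': ['2020-07-31', '2020-07-14', '2020-08-07', '2020-08-12'],
    'Pageviews': [3, 1, 1, 2],
}

# Each dictionary key becomes a column and each list holds that column's values.
df_sample = pd.DataFrame(data)
print(df_sample.head())
```

Every example that follows in this style uses small invented data like this, so the numbers will not match the real dataset.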
To understand what the dataframe contains, we can use df.info(). The info() function prints out the data type of each column in the dataframe and the number of non-null values in each column. Comparing the non-null count to the total number of entries tells you how many values are missing.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User Type 10000 non-null object
1 Source 10000 non-null object
2 Medium 10000 non-null object
3 Browser 10000 non-null object
4 Device Category 10000 non-null object
5 Date 10000 non-null object
6 Pageviews 10000 non-null int64
dtypes: int64(1), object(6)
memory usage: 547.0+ KB
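Since info() reports non-null counts rather than missing counts, it can help to compute both directly. This is a minimal sketch on a small made-up dataframe (df_sample is hypothetical and deliberately contains gaps, unlike the dataset above, which has none).

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
df_sample = pd.DataFrame({
    'Source': ['google', None, 'bing', 'google'],
    'Pageviews': [3, 1, None, 2],
})

# notna() flags present values and isna() flags missing ones;
# summing the boolean flags gives a per-column count.
non_null = df_sample.notna().sum()
missing = df_sample.isna().sum()

print(non_null)
print(missing)
```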
Next, we’ll use the Pandas describe() function to generate descriptive statistics for each column in the dataframe. By default, the describe() function will return descriptive statistics on the numeric columns in your dataframe only.
In the example dataframe above, only the Pageviews column contains numeric data. For each numeric column found in the dataframe, the describe() function will return the count, mean, standard deviation, minimum value, maximum value, and the 25th, 50th, and 75th percentile values.
df.describe()
| | Pageviews |
---|---|
count | 10000.000000 |
mean | 1.447600 |
std | 0.972393 |
min | 1.000000 |
25% | 1.000000 |
50% | 1.000000 |
75% | 2.000000 |
max | 14.000000 |
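To see where each of these numbers comes from, the sketch below compares describe() with the individual Pandas methods that produce the same values. The pageviews Series is invented for illustration.

```python
import pandas as pd

# Hypothetical pageview counts.
pageviews = pd.Series([1, 1, 1, 2, 3, 14], name='Pageviews')

summary = pageviews.describe()

# The same statistics computed one at a time.
manual = pd.Series({
    'count': pageviews.count(),        # number of non-null values
    'mean': pageviews.mean(),
    'std': pageviews.std(),            # sample standard deviation (ddof=1)
    'min': pageviews.min(),
    '25%': pageviews.quantile(0.25),
    '50%': pageviews.quantile(0.50),   # the median
    '75%': pageviews.quantile(0.75),
    'max': pageviews.max(),
})

# Show both side by side for comparison.
print(pd.DataFrame({'describe': summary, 'manual': manual}))
```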
What many data scientists don’t realise is that you can also use describe() on dataframes that include categorical or non-numeric variables, such as object data types. To do this you pass in the additional include='all' argument to ensure that all columns in the input dataframe are included in the output descriptive statistics.
In addition to the values returned by the default describe() function with no arguments, df.describe(include='all') will also return some additional statistics for categorical data columns. These are unique, showing the number of unique values in the column; top, showing the most common value in the column; and freq, showing the frequency of the top value within the column. Statistics that don’t apply to a column’s data type are shown as NaN.
df.describe(include='all')
| | User Type | Source | Medium | Browser | Device Category | Date | Pageviews |
---|---|---|---|---|---|---|---|
count | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000.000000 |
unique | 1 | 19 | 4 | 17 | 3 | 30 | NaN |
top | New Visitor | | organic | Chrome | desktop | 2020-08-03 | NaN |
freq | 10000 | 6225 | 7509 | 6869 | 4882 | 395 | NaN |
mean | NaN | NaN | NaN | NaN | NaN | NaN | 1.447600 |
std | NaN | NaN | NaN | NaN | NaN | NaN | 0.972393 |
min | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
25% | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
50% | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 |
75% | NaN | NaN | NaN | NaN | NaN | NaN | 2.000000 |
max | NaN | NaN | NaN | NaN | NaN | NaN | 14.000000 |
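The categorical statistics can also be reproduced by hand, which makes their meaning clearer. This is a rough sketch using an invented medium Series, not the real dataset.

```python
import pandas as pd

# Hypothetical traffic mediums.
medium = pd.Series(
    ['organic', 'organic', 'referral', '(none)', 'organic'],
    name='Medium',
)

summary = medium.describe()

# value_counts() sorts values from most to least frequent.
counts = medium.value_counts()

print(summary)
print(medium.nunique())   # corresponds to 'unique'
print(counts.index[0])    # corresponds to 'top'
print(counts.iloc[0])     # corresponds to 'freq'
```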
If you are examining the descriptive statistics on a dataframe with many columns, you may want to transpose the output by appending .T after the function is called. This swaps the rows and columns, so each original column becomes a row and each statistic becomes a column, which can make wide dataframes much easier to read.
df.describe(include='all').T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|---|---|---|
User Type | 10000 | 1 | New Visitor | 10000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Source | 10000 | 19 | | 6225 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Medium | 10000 | 4 | organic | 7509 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Browser | 10000 | 17 | Chrome | 6869 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Device Category | 10000 | 3 | desktop | 4882 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Date | 10000 | 30 | 2020-08-03 | 395 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Pageviews | 10000.0 | NaN | NaN | NaN | 1.4476 | 0.972393 | 1.0 | 1.0 | 1.0 | 2.0 | 14.0 |
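As a small illustration of what transposing does to the shape of the output, the sketch below uses an invented three-column dataframe (df_sample is hypothetical).

```python
import pandas as pd

# Hypothetical data with two text columns and one numeric column.
df_sample = pd.DataFrame({
    'Source': ['google', 'google', 'bing', '(direct)'],
    'Medium': ['organic', 'organic', 'organic', '(none)'],
    'Pageviews': [3, 1, 1, 2],
})

summary = df_sample.describe(include='all')

# .T is shorthand for .transpose(): statistics become columns
# and the original column names become the row index.
tall = summary.T

print(summary.shape)   # (statistics, columns)
print(tall.shape)      # (columns, statistics)
print(tall)
```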
If you only want to generate descriptive statistics for specific columns in your Pandas dataframe, you can pass a list of column names inside the square brackets used to index the dataframe, which gives the double square bracket syntax. For example, df[['Source', 'Medium']] will filter the Pandas dataframe to show only the two specified columns, and you can then append .describe() to view their summary statistics.
df[['Source', 'Medium']].describe()
| | Source | Medium |
---|---|---|
count | 10000 | 10000 |
unique | 19 | 4 |
top | | organic |
freq | 6225 | 7509 |
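The difference between double and single square brackets is worth a quick sketch: a list of names returns a dataframe, while a single name returns a Series, and both support describe(). The data here is invented and df_sample is a hypothetical name.

```python
import pandas as pd

# Hypothetical data.
df_sample = pd.DataFrame({
    'Source': ['google', 'google', 'bing', '(direct)'],
    'Medium': ['organic', 'organic', 'organic', '(none)'],
    'Pageviews': [3, 1, 1, 2],
})

# A list of column names returns a dataframe, so describe() returns a dataframe.
two_cols = df_sample[['Source', 'Medium']].describe()

# A single column name returns a Series, so describe() returns a Series.
one_col = df_sample['Pageviews'].describe()

print(two_cols)
print(one_col)
```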
Another useful technique is to use the describe() function to show only data of a specific data type. To do this you can pass a list of data types to the include argument when calling describe(). If you run df.info() first it will tell you what data types are present within the dataframe. You can then specify the list and pass it like this: df.describe(include=['object'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User Type 10000 non-null object
1 Source 10000 non-null object
2 Medium 10000 non-null object
3 Browser 10000 non-null object
4 Device Category 10000 non-null object
5 Date 10000 non-null object
6 Pageviews 10000 non-null int64
dtypes: int64(1), object(6)
memory usage: 547.0+ KB
df.describe(include=['object'])
| | User Type | Source | Medium | Browser | Device Category | Date |
---|---|---|---|---|---|---|
count | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
unique | 1 | 19 | 4 | 17 | 3 | 30 |
top | New Visitor | | organic | Chrome | desktop | 2020-08-03 |
freq | 10000 | 6225 | 7509 | 6869 | 4882 | 395 |
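The describe() function also accepts an exclude argument that works the opposite way to include. The sketch below uses the generic 'number' selector rather than 'object', on the assumption that it behaves the same regardless of how your Pandas version stores text columns. The data and the df_sample name are invented.

```python
import pandas as pd

# Hypothetical data with two text columns and one numeric column.
df_sample = pd.DataFrame({
    'Source': ['google', 'google', 'bing', '(direct)'],
    'Medium': ['organic', 'organic', 'organic', '(none)'],
    'Pageviews': [3, 1, 1, 2],
})

# Keep only the numeric columns.
numeric_only = df_sample.describe(include=['number'])

# Drop the numeric columns, leaving only the text columns.
non_numeric = df_sample.describe(exclude=['number'])

print(numeric_only)
print(non_numeric)
```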
If we go back to our describe(include='all') output we’ll see that it includes count, unique, top, freq, mean, std, min, max, and percentile values for 25%, 50%, and 75%.
df.describe(include='all').T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|---|---|---|
User Type | 10000 | 1 | New Visitor | 10000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Source | 10000 | 19 | | 6225 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Medium | 10000 | 4 | organic | 7509 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Browser | 10000 | 17 | Chrome | 6869 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Device Category | 10000 | 3 | desktop | 4882 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Date | 10000 | 30 | 2020-08-03 | 395 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Pageviews | 10000.0 | NaN | NaN | NaN | 1.4476 | 0.972393 | 1.0 | 1.0 | 1.0 | 2.0 | 14.0 |
However, it’s also possible to define other percentiles, such as 10%, 20%, and 80%, if you want to get an idea of the spread of data in different areas. To do this you pass a list of values to the percentiles argument when calling the describe() function, using a decimal value between 0 and 1 to indicate the percentage, for example .10 for 10%. The list replaces the default 25% and 75% values. Note that in the output shown here the 50% (median) value is still included alongside the requested percentiles.
df.describe(include='all', percentiles=[.10, .20, .80]).T
| | count | unique | top | freq | mean | std | min | 10% | 20% | 50% | 80% | max |
---|---|---|---|---|---|---|---|---|---|---|---|---|
User Type | 10000 | 1 | New Visitor | 10000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Source | 10000 | 19 | | 6225 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Medium | 10000 | 4 | organic | 7509 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Browser | 10000 | 17 | Chrome | 6869 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Device Category | 10000 | 3 | desktop | 4882 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Date | 10000 | 30 | 2020-08-03 | 395 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Pageviews | 10000.0 | NaN | NaN | NaN | 1.4476 | 0.972393 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 14.0 |
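Custom percentiles are easier to check on data where the answer is obvious. This sketch uses an invented Series of the integers 1 to 100 and confirms that each requested percentile matches what quantile() returns for the same fraction.

```python
import pandas as pd

# Hypothetical data: the integers 1 to 100.
pageviews = pd.Series(range(1, 101), name='Pageviews')

# Percentiles are given as fractions between 0 and 1.
requested = [.10, .20, .80]
summary = pageviews.describe(percentiles=requested)

print(summary)

# Each labelled percentile matches quantile() for the same fraction.
for fraction, label in zip(requested, ['10%', '20%', '80%']):
    print(label, summary[label], pageviews.quantile(fraction))
```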
Matt Clarke, Saturday, November 27, 2021