Standard deviation, STD or STDEV, is a descriptive statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance.
Pandas comes with a built-in method called std() that calculates the standard deviation of a DataFrame or Series, so it’s very easy to use. The standard deviation is also returned when you use the Pandas describe() method to get an overview of the descriptive statistics of a dataframe.
In this quick tutorial, we’ll go over some simple code samples that show you how to calculate the standard deviation of all dataframe columns, specific dataframe columns, or a single dataframe column. I’ll also show you how to set the delta degrees of freedom (ddof), which controls whether you get the sample or the population standard deviation.
To get started, open a Jupyter notebook and import the Pandas and NumPy libraries. Then, either import some data into a Pandas dataframe from an existing dataset, or execute the code below to create a dummy dataset to use.
import pandas as pd
import numpy as np
df = pd.DataFrame({'age': [18, 19, 20, 18, 20, 21, 18, 19, 20, 21],
'height': [192, 189, 157, 178, 189, 201, 210, 189, 198, np.nan],
'weight': [69, 72, 73, 87, 89, 100, 98, 89, 72, np.nan]})
df
| | age | height | weight |
---|---|---|---|
0 | 18 | 192.0 | 69.0 |
1 | 19 | 189.0 | 72.0 |
2 | 20 | 157.0 | 73.0 |
3 | 18 | 178.0 | 87.0 |
4 | 20 | 189.0 | 89.0 |
5 | 21 | 201.0 | 100.0 |
6 | 18 | 210.0 | 98.0 |
7 | 19 | 189.0 | 89.0 |
8 | 20 | 198.0 | 72.0 |
9 | 21 | NaN | NaN |
First, we’ll use the std() method to calculate the standard deviation of all columns in the DataFrame. To do this, you simply append the std() method to the DataFrame object. It returns a Series object with the standard deviation of each column.
df.std()
age 1.173788
height 15.081261
weight 11.935009
dtype: float64
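As a quick sanity check on the output above, the sketch below rebuilds the same dummy dataframe and compares df.std() with the square root of df.var() and with the std row of df.describe(). The variable names (std_direct, std_from_var, std_from_describe) are only illustrative.

```python
import numpy as np
import pandas as pd

# Same dummy data as in the tutorial
df = pd.DataFrame({'age': [18, 19, 20, 18, 20, 21, 18, 19, 20, 21],
                   'height': [192, 189, 157, 178, 189, 201, 210, 189, 198, np.nan],
                   'weight': [69, 72, 73, 87, 89, 100, 98, 89, 72, np.nan]})

std_direct = df.std()                          # standard deviation per column
std_from_var = np.sqrt(df.var())               # square root of the variance
std_from_describe = df.describe().loc['std']   # the "std" row of describe()

print(np.allclose(std_direct, std_from_var))       # True
print(np.allclose(std_direct, std_from_describe))  # True
```

All three routes should agree, because describe() and var() use the same defaults as std().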
To calculate the standard deviation of a single column, you can use the std() method on the column itself. Running this on the age column shows we’ve got a low standard deviation of 1.17 years, so the dispersion of the data is fairly low and the subjects are of roughly similar ages.
df['age'].std()
1.1737877907772671
To calculate the standard deviation of specific columns, select them by passing a list of column names to the DataFrame, then call the std() method on the result.
df[['age', 'height']].std()
age 1.173788
height 15.081261
dtype: float64
By default, the std() method skips NaN values because its skipna parameter defaults to True. Passing skipna=True explicitly is therefore not necessary and returns the same result as calling std() with no arguments.
df['height'].std(skipna=True)
15.081261367818158
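To see what skipna actually changes, here is a minimal sketch that uses the height values from the dummy dataset as a standalone Series. It compares the default behaviour with skipna=False and with dropping the missing value by hand. The variable names are just for illustration.

```python
import numpy as np
import pandas as pd

# Height column from the dummy dataset, including the missing value
height = pd.Series([192, 189, 157, 178, 189, 201, 210, 189, 198, np.nan],
                   name='height')

with_skip = height.std()                  # default: NaN values are ignored
without_skip = height.std(skipna=False)   # NaN is kept, so the result is NaN
after_dropna = height.dropna().std()      # dropping NaN first matches the default

print(with_skip)
print(without_skip)   # nan
print(np.isclose(with_skip, after_dropna))  # True
```

Setting skipna=False can be handy when you would rather have a missing value surface in the result than be silently ignored.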
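The std() method also accepts a ddof argument, which stands for delta degrees of freedom. The divisor used in the calculation is N - ddof, where N is the number of non-missing values. The Pandas default is ddof=1, which gives the sample standard deviation. Setting ddof=0 divides by N instead and gives the population standard deviation, which is appropriate when the column contains every member of the population rather than a sample.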
df['age'].std(ddof=0)
1.1135528725660042
If you set ddof to 1, which is the Pandas default, the sum of squared deviations is divided by N - 1 rather than N (Bessel’s correction). This gives the sample standard deviation, which is the appropriate choice when your data is a sample drawn from a larger population, and it matches the result of calling std() with no arguments.
df['age'].std(ddof=1)
1.1737877907772671
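The effect of ddof can be checked by hand. The sketch below assumes the usual definition, where the sum of squared deviations from the mean is divided by N - ddof before taking the square root, and uses the age values from the dummy dataset. It also compares against NumPy’s np.std(), whose default is ddof=0. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Age column from the dummy dataset
age = pd.Series([18, 19, 20, 18, 20, 21, 18, 19, 20, 21], name='age')

n = age.count()                                 # number of non-missing values
sum_sq_dev = ((age - age.mean()) ** 2).sum()    # sum of squared deviations

population_std = np.sqrt(sum_sq_dev / n)        # divisor N     (ddof=0)
sample_std = np.sqrt(sum_sq_dev / (n - 1))      # divisor N - 1 (ddof=1)

print(np.isclose(population_std, age.std(ddof=0)))              # True
print(np.isclose(sample_std, age.std()))                        # True, Pandas defaults to ddof=1
print(np.isclose(np.std(age.to_numpy()), age.std(ddof=0)))      # True, NumPy defaults to ddof=0
```

The differing defaults are worth keeping in mind: the same data gives 1.1738 from Pandas and 1.1136 from NumPy unless you pass ddof explicitly.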
Matt Clarke, Sunday, November 27, 2022