The Pearson correlation coefficient, or PCC, is the standard statistical method for computing pairwise (bivariate) correlation in Pandas. It’s so commonly used in statistics that it is often referred to simply as “the correlation coefficient.” The Pearson correlation coefficient, Pearson’s r, and the Pearson product-moment correlation coefficient (PPMCC) are all names for exactly the same thing.
Pearson correlation measures the strength and direction of the linear relationship between two variables in a dataset. It’s the ratio between the covariance of the variables and the product of their standard deviations, which gives a normalised measure of covariance with a value between -1 and 1. A value of 1 indicates a perfect positive linear relationship, a value of -1 indicates a perfect negative linear relationship, and a value of 0 indicates no linear relationship.
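The ratio is easy to check by hand. Here’s a minimal sketch, assuming two small made-up series (the values are purely illustrative and not from any dataset used in this tutorial), that divides the covariance by the product of the standard deviations and compares the result with what Pandas returns.

```python
import pandas as pd

# Made-up illustrative values, not taken from the housing dataset
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.1, 5.9, 8.2, 9.8])

# Pearson's r = cov(x, y) / (std(x) * std(y))
# Series.cov() and Series.std() both use the sample (n - 1) form by default
r_manual = x.cov(y) / (x.std() * y.std())

# The built-in calculation, which defaults to Pearson
r_pandas = x.corr(y)

print(round(r_manual, 6), round(r_pandas, 6))
```

The two numbers should agree to within floating point error, which is a handy sanity check that the coefficient really is just a normalised covariance.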
By interpreting the result of the Pearson correlation coefficient you can tell whether a change in one variable is associated with a change in the other. For example, the price of new cars might be positively correlated with top speed, since most very expensive new cars are supercars or hypercars.
In this tutorial, I’ll explain Pearson correlation and show how you can use it to spot correlations between variables in your dataset using the Pandas corr() function.
The Pearson correlation coefficient can be easily calculated in Pandas using the corr() function. The corr() function is used to compute pairwise correlation coefficients on Pandas dataframe values, and can either calculate them for an individual pair of columns (e.g. top speed and price), or for every pair of columns across an entire dataframe.
Since the Pearson correlation coefficient is so widely used by statisticians and data scientists, the corr() function is pre-configured with default values to return the Pearson correlation coefficient. However, you can change this to use the similar Spearman rank correlation (or Spearman’s rho), or the Kendall Tau correlation coefficient, if you think they better suit your data.
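To see how the choice of method changes the result, here’s a small sketch on a made-up dataframe (df_demo and its values are my own illustration, not part of the housing data). The y column rises with x but not in a straight line, so the rank-based Spearman coefficient is perfect while Pearson is slightly lower.

```python
import pandas as pd

# Made-up illustrative data: y = x squared, so monotonic but not linear
df_demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [1, 4, 9, 16, 25, 36],
})

pearson = df_demo.corr()                    # method='pearson' is the default
spearman = df_demo.corr(method="spearman")  # rank-based alternative
# method="kendall" follows the same pattern (Pandas relies on SciPy for it)

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
```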
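The corr() function takes three arguments, all of which are optional: method selects the coefficient ('pearson', 'kendall', 'spearman', or a callable), min_periods sets the minimum number of observations required per pair of columns to get a valid result, and numeric_only restricts the calculation to numeric columns.
If you call df.corr(method='pearson'), or just df.corr() since method='pearson' is the default, the resulting correlation matrix will have a value for every pair of columns in the original DataFrame.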
The value will be a number between -1 and 1, where 1 is a perfect positive linear relationship, 0 is no linear relationship, and -1 is a perfect negative linear relationship. Different authors use slightly different interpretations of the coefficients, but they’re generally very similar to the ones below.
|Pearson's r value||Strength of relationship|
|0||No linear relationship|
|0.1 to 0.3||Weak positive linear relationship|
|0.3 to 0.5||Moderate positive linear relationship|
|0.5 to 0.7||Strong positive linear relationship|
|0.7 to 1||Very strong positive linear relationship|
|-0.1 to -0.3||Weak negative linear relationship|
|-0.3 to -0.5||Moderate negative linear relationship|
|-0.5 to -0.7||Strong negative linear relationship|
|-0.7 to -1||Very strong negative linear relationship|
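If you’d rather not look the values up by eye, the table can be turned into a small helper. This is only a sketch: interpret_r() is a hypothetical function name of my own, and treating anything with an absolute value below 0.1 as “no or negligible” is an assumption, since the table only lists exactly 0.

```python
def interpret_r(r: float) -> str:
    """Map a Pearson's r value to a rough strength label (hypothetical helper)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")

    size = abs(r)
    if size < 0.1:
        # Assumption: the lookup table only lists exactly 0 for this band
        return "No or negligible linear relationship"
    elif size < 0.3:
        strength = "Weak"
    elif size < 0.5:
        strength = "Moderate"
    elif size < 0.7:
        strength = "Strong"
    else:
        strength = "Very strong"

    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear relationship"


print(interpret_r(0.688))
print(interpret_r(-0.8))
```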
To get started, open a new Jupyter notebook and import the Pandas library. Pandas is all you need to calculate correlation coefficients, but to cover some more advanced topics we’ll also import the Matplotlib and Seaborn data visualisation packages, and the SciPy statistical analysis package.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
Then, either create a dummy dataset with linear relationships between the variables, or import a dataset into Pandas from a CSV file. I’m using a house price dataset, as this contains several correlated variables.
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/housing.csv')
df.head().T
|ocean_proximity||NEAR BAY||NEAR BAY||NEAR BAY||NEAR BAY||NEAR BAY|
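If you’d prefer the dummy dataset route, something like the sketch below would do. The column names, coefficients and noise levels are all made up for illustration (they’re not the housing data); the idea is simply to build columns as a linear function of one variable plus random noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the example is repeatable
n = 500

# Hypothetical base variable
income = rng.normal(loc=4.0, scale=1.5, size=n)

df_dummy = pd.DataFrame({
    "income": income,
    # Positive linear relationship with a little noise
    "house_value": 2.0 * income + rng.normal(0.0, 1.0, size=n),
    # Negative linear relationship with more noise
    "distance": -1.0 * income + rng.normal(0.0, 2.0, size=n),
    # Independent noise, so no real relationship with income
    "unrelated": rng.normal(0.0, 1.0, size=n),
})

dummy_corr = df_dummy.corr()
print(dummy_corr.round(2))
```

More noise relative to the slope gives a weaker correlation, so you can tune the scale values to produce whatever strength of relationship you want to experiment with.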
To calculate the Pearson correlation for a pair of columns, you can append the .corr() method to the first column and pass the second column as an argument. If we do this for the median_income and median_house_value columns we get back a PCC or r of 0.6880, which according to our interpretation in the lookup table denotes a strong linear correlation. That means samples with a higher median_income generally had a higher median_house_value.
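As a self-contained sketch of the pairwise call, here’s the same pattern on a tiny made-up dataframe. It borrows the column names from the housing data, but the values are invented, so the coefficient it prints won’t match the 0.6880 from the real dataset.

```python
import pandas as pd

# Invented values that only reuse the housing column names
df_small = pd.DataFrame({
    "median_income": [2.0, 3.5, 4.0, 5.5, 7.0, 8.5],
    "median_house_value": [150000, 180000, 260000, 240000, 390000, 410000],
})

# Append .corr() to the first column and pass the second as the argument
r = df_small["median_income"].corr(df_small["median_house_value"])

# Pearson's r is symmetric, so swapping the columns gives the same value
r_swapped = df_small["median_house_value"].corr(df_small["median_income"])

print(round(r, 4))
```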
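To calculate the Pearson correlation coefficient for every pair of columns in the dataframe, you can simply append the corr() method to the end of the dataframe object. The resulting dataframe, or matrix, will have the correlation coefficient for every pair of numeric columns in the dataframe. To avoid getting a FutureWarning, you’ll want to set numeric_only=True in the corr() call. In Pandas 2.0 and later, numeric_only defaults to False, so leaving it out raises an error when the dataframe contains non-numeric columns such as ocean_proximity.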
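Here’s a minimal sketch of the whole-dataframe call, assuming Pandas 1.5 or later (where the numeric_only argument is available on corr()). The dataframe is made up: it reuses three of the housing column names with invented values, including a text column to show that numeric_only=True leaves it out of the matrix.

```python
import pandas as pd

# Invented values; ocean_proximity is a text column on purpose
df_mixed = pd.DataFrame({
    "median_income": [2.0, 3.5, 4.0, 5.5, 7.0, 8.5],
    "median_house_value": [150000, 180000, 260000, 240000, 390000, 410000],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND",
                        "INLAND", "NEAR OCEAN", "NEAR OCEAN"],
})

# numeric_only=True skips the text column rather than trying to convert it
corr_matrix = df_mixed.corr(numeric_only=True)

print(corr_matrix)
```

The diagonal is always 1, since every column is perfectly correlated with itself, and the matrix is symmetric, so each pair appears twice.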
When examining a dataset it’s often easier to visualise the correlations using Seaborn, or a similar data visualisation library. Visualising the correlations via a heatmap is my preferred technique and is easily done with a single Seaborn function, plus a Matplotlib function to adjust the figure size.
plt.figure(figsize=(14,8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='Blues')
There are various ways to visualise correlations between pairs of variables. A quick way to view the correlation of a single pair is to use the Seaborn regplot() function. Here it takes three arguments: the x and y columns you want to plot, and the data from the dataframe.
plt.figure(figsize=(14,8))
sns.regplot(x='median_income', y='median_house_value', data=df)
plt.title('Median Income vs Median House Value')
Text(0.5, 1.0, 'Median Income vs Median House Value')
To determine whether the correlation coefficient is statistically significant or not you can use the pearsonr() function from scipy.stats. This does much the same as the Pandas corr() function, in that it also returns the correlation coefficient, but crucially, it provides the p-value that Pandas does not.
The p-value is the probability of seeing a correlation at least as strong as the one observed if there were really no linear relationship between the two variables. If the p-value is less than 0.05, the correlation coefficient is conventionally considered statistically significant. We get a very low p-value, so the correlation coefficient is statistically significant.
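A minimal sketch of the pearsonr() call is shown below. Again, the dataframe is made up (housing column names, invented values), so the numbers it prints are illustrative only and won’t match the real dataset.

```python
import pandas as pd
from scipy.stats import pearsonr

# Invented values that only reuse the housing column names
df_small = pd.DataFrame({
    "median_income": [2.0, 3.5, 4.0, 5.5, 7.0, 8.5],
    "median_house_value": [150000, 180000, 260000, 240000, 390000, 410000],
})

# pearsonr() returns the coefficient and the two-sided p-value
r, p_value = pearsonr(df_small["median_income"], df_small["median_house_value"])

print(f"r = {r:.4f}, p-value = {p_value:.4f}")
print("Significant at 0.05" if p_value < 0.05 else "Not significant at 0.05")
```

The coefficient is the same one Pandas gives you; the only extra is the p-value.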
By contrast, if you use SciPy to calculate the Pearson correlation coefficient and p-value of the … and median_house_value columns you get back a higher p-value, indicating that the result is not statistically significant.
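To show what a non-significant result looks like, here’s a sketch using two tiny made-up lists (not columns from the housing data) that I’ve chosen so they have no linear relationship at all.

```python
from scipy.stats import pearsonr

# Made-up values chosen so the deviations from the mean cancel out
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 1, 2]

r_none, p_none = pearsonr(x, y)

print(f"r = {r_none:.4f}, p-value = {p_none:.4f}")
print("Significant at 0.05" if p_none < 0.05 else "Not significant at 0.05")
```

With a coefficient this close to zero, the p-value is far above 0.05, so there’s no evidence of a linear relationship between the two.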
Matt Clarke, Sunday, November 20, 2022