Spearman’s rank correlation coefficient, sometimes called Spearman’s rho, is a nonparametric statistic used to measure rank correlation: the statistical dependence between the rankings of two variables. It describes how well the relationship between two variables can be captured by a monotonic function.
While Pearson’s correlation coefficient measures the strength of a linear relationship between a pair of variables, Spearman’s rank correlation coefficient assesses monotonic relationships, whether they’re linear or not. It’s a nonparametric measure of correlation, meaning that it doesn’t assume the data is normally distributed.
As the name suggests, it’s a measure of rank correlation, meaning that it’s based on the ranks of the data rather than the data itself. This is important because it means that it’s not affected by outliers, which can skew Pearson’s correlation coefficient. Spearman’s rank correlation is equal to the Pearson correlation coefficient of the ranks of the data and, like Pearson’s correlation coefficient, it ranges from -1 to 1.
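That equivalence is easy to check directly. The sketch below (toy numbers, purely for illustration, assuming NumPy and SciPy are installed) ranks both variables and confirms that the Pearson correlation of the ranks matches what `scipy.stats.spearmanr` returns:

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Toy data for illustration only
x = np.array([10, 20, 35, 50, 90])
y = np.array([3, 1, 9, 16, 8])

# Spearman's rho is the Pearson correlation of the ranks
rho_via_ranks, _ = pearsonr(rankdata(x), rankdata(y))
rho_direct, _ = spearmanr(x, y)

print(rho_via_ranks, rho_direct)  # both 0.6
```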
It’s often used as an alternative to the Pearson correlation in the presence of outliers, and it can be applied to both continuous and ordinal variables. Because the calculation is based on ranks, the data must be at least ordinal, and the scores on one variable need to be related monotonically to the scores on the other; it isn’t suitable for nominal variables, which have no natural ordering.
The Pandas `corr()` function can be used to compute pairwise correlation coefficients, including Spearman’s rho, on dataframe values, either for an individual pair of columns (e.g. top speed and price) or for every pair across an entire dataframe. By default, `corr()` calculates the more commonly used Pearson’s correlation coefficient, but it can easily be made to return Spearman’s rank correlation coefficient by passing the optional argument `method='spearman'`.
When you pass `method='spearman'`, the `corr()` function returns Spearman’s rank correlation coefficient for each pair of columns in the dataframe. The coefficient returned will be a value between -1 and +1. Here’s how you can interpret what these coefficients mean:
| Spearman's rho value | Strength of relationship |
|----------------------|--------------------------|
| -1 | Perfect negative monotonic relationship |
| -0.7 | Strong negative monotonic relationship |
| -0.5 | Moderate negative monotonic relationship |
| -0.3 | Weak negative monotonic relationship |
| 0 | No monotonic relationship |
| 0.3 | Weak positive monotonic relationship |
| 0.5 | Moderate positive monotonic relationship |
| 0.7 | Strong positive monotonic relationship |
| 1 | Perfect positive monotonic relationship |
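To see why Spearman’s rho measures monotonic rather than linear association, consider a relationship that is perfectly monotonic but clearly non-linear, such as y = x³. A rough sketch with toy data (assuming pandas and NumPy are available):

```python
import numpy as np
import pandas as pd

# y = x**3: perfectly monotonic, but not linear (toy data)
df = pd.DataFrame({'x': np.arange(1, 11)})
df['y'] = df['x'] ** 3

pearson = df['x'].corr(df['y'])                      # below 1: not linear
spearman = df['x'].corr(df['y'], method='spearman')  # 1: perfectly monotonic

print(pearson, spearman)
```

Pearson’s coefficient falls short of 1 because the points don’t sit on a straight line, while Spearman’s rho is exactly 1 because the ranks of x and y agree perfectly.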
To work through some simple examples showing how to compute Spearman’s rank correlation in Python we’ll be using Pandas. To get started, import Pandas in a Jupyter notebook. We’ll also use the Matplotlib and Seaborn data visualisation libraries to visualise the Spearman rank correlation, and we’ll use Scipy Stats to test the statistical significance.
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import spearmanr
```
Next, import a dataset into Pandas that contains data with linear relationships. I’m using a house price dataset.
```python
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/housing.csv')
df.head().T
```
To calculate Spearman’s rank correlation on a single pair of Pandas dataframe columns we’ll use the Pandas `corr()` function with the `method` parameter set to `'spearman'`. The column the function is called on is compared against the column passed in:

```python
df['median_income'].corr(df['median_house_value'], method='spearman')
```
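Since that call depends on downloading the housing dataset, here’s a self-contained sketch of the same single-pair usage on a tiny hypothetical dataframe (the column names and values are made up):

```python
import pandas as pd

# Hypothetical data: price rises strictly with top speed
cars = pd.DataFrame({
    'top_speed': [120, 150, 140, 180, 200],
    'price': [20000, 35000, 30000, 60000, 90000],
})

rho = cars['top_speed'].corr(cars['price'], method='spearman')
print(rho)  # 1.0: the two columns have identical rank orderings
```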
To calculate Spearman’s rank correlation for all columns in a Pandas dataframe we use the same `corr()` method with `method='spearman'`, but call it on the whole dataframe `df` rather than a single column, and we don’t pass in another column to compare against.
At present, you’ll also need to set `numeric_only=True` to avoid getting a FutureWarning about the default value changing to `False` in a future version of Pandas.
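As a self-contained sketch of the whole-dataframe call (hypothetical columns loosely modelled on the housing data), note that `numeric_only=True` simply drops non-numeric columns such as `ocean_proximity` before the matrix is computed:

```python
import pandas as pd

# Hypothetical frame mixing numeric and string columns
df = pd.DataFrame({
    'median_income': [2.5, 3.8, 5.1, 7.2],
    'median_house_value': [150000, 210000, 320000, 450000],
    'ocean_proximity': ['NEAR BAY', 'INLAND', 'NEAR BAY', 'INLAND'],
})

# The string column is excluded, so only a 2x2 matrix comes back
corr_matrix = df.corr(method='spearman', numeric_only=True)
print(corr_matrix)
```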
Using Seaborn, we can visualise the Spearman’s rank correlation coefficients as a heatmap. This can be an effective way to quickly understand the relationships in your data, as you can simply look up the cells where the correlation is strongly negative or strongly positive.
```python
plt.figure(figsize=(14, 8))
sns.heatmap(df.corr(method='spearman', numeric_only=True), annot=True, cmap='Blues')
```
You can also use Seaborn to examine the relationship between a specific pair of variables that appear to have a strong correlation. The Seaborn `regplot()` function is great for this.

```python
plt.figure(figsize=(14, 8))
sns.regplot(x='median_income', y='median_house_value', data=df)
plt.title('Median Income vs Median House Value')
```
Finally, you can measure the statistical significance of Spearman’s rho using Scipy Stats. Scipy Stats calculates both the correlation coefficient of a pair of variables and its statistical significance, but it’s clunkier to use than Pandas, so it tends to be used mostly to validate whether a correlation is significant or not.
To use it, you simply call `spearmanr()` and pass in the two Pandas columns you want to compare. Comparing the `median_income` and `median_house_value` columns, we get back a moderately strong correlation and a statistically significant p value.
By contrast, if you use Scipy to calculate Spearman’s rank correlation coefficient and p value for `median_house_value` against a column with which it has little relationship, you get back a higher p value, indicating that the result is not statistically significant.
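As a self-contained illustration of both cases (synthetic data, so the exact figures won’t match the housing dataset), `spearmanr()` returns a tiny p value for a strongly related pair and rho near zero with a much larger p value for an unrelated pair:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
x = rng.normal(size=200)

# Strongly related pair: large rho, tiny p value
rho_strong, p_strong = spearmanr(x, x + rng.normal(scale=0.5, size=200))

# Unrelated pair: rho near zero, so a much weaker significance result
rho_none, p_none = spearmanr(x, rng.normal(size=200))

print(rho_strong, p_strong)
print(rho_none, p_none)
```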
Matt Clarke, Friday, December 02, 2022