Python Correlation Real-World Examples

This tutorial shows how to apply correlation in the real world. I will use the Peloton and Covid example to illustrate the concept.

In early 2021, many consumers wanted to buy Peloton bikes, but Peloton had difficulty meeting the demand. One year later, in January 2022, Forbes reported that Peloton faced challenges because demand had started to fall, and that Peloton had even reduced its production. This suggests a possible positive correlation between consumers’ concerns about Covid and their interest in Peloton bikes.

To test this potential correlation, I used Google Trends data from early 2020 to early 2022.

The following is the complete Python code, followed by its output. The correlation matrix contains only the correlation coefficient (about 0.48), which is positive, suggesting a positive correlation between search interest in the keyword Covid and the keyword Peloton on Google Trends.

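import pandas as pd
# Load the weekly Google Trends data for the two keywords
data_Covid_Peloton = pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Covid_and_Peloton.csv")
print(data_Covid_Peloton)
# Use the date column as the index so that only the numeric columns are correlated
data_Covid_Peloton = data_Covid_Peloton.set_index('Week')
# Pairwise Pearson correlation (the default method of DataFrame.corr)
correlation_result = data_Covid_Peloton.corr()
print(correlation_result)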
          Week  Peloton  Covid
0    2/23/2020       26      2
1     3/1/2020       24      5
2     3/8/2020       24     20
3    3/15/2020       67     53
4    3/22/2020       69     62
..         ...      ...    ...
106   3/6/2022       32     16
107  3/13/2022       29     16
108  3/20/2022       27     14
109  3/27/2022       25     14
110   4/3/2022       25     14

[111 rows x 3 columns]
          Peloton     Covid
Peloton  1.000000  0.480158
Covid    0.480158  1.000000

We can also use a scatter plot to visualize the relationship between these two variables. Below is the figure; for the complete Python code used to plot it, you can refer to my post on scatter plots. However, the output above does not include a p-value, so we cannot tell whether the relationship is statistically significant. To get one, we can use the scipy.stats module of the SciPy package.
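For readers who want a quick starting point, a scatter plot of this kind could be sketched roughly as follows. This is a minimal sketch, not the code from the scatter plot post: it assumes matplotlib is installed, uses only the ten rows visible in the printed output above rather than the full dataset, and the axis labels and output file name are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Only the ten rows visible in the printed output above, NOT the full dataset
sample = pd.DataFrame({
    "Peloton": [26, 24, 24, 67, 69, 32, 29, 27, 25, 25],
    "Covid": [2, 5, 20, 53, 62, 16, 16, 14, 14, 14],
})

fig, ax = plt.subplots()
ax.scatter(sample["Covid"], sample["Peloton"])
ax.set_xlabel("Covid (Google Trends interest)")
ax.set_ylabel("Peloton (Google Trends interest)")
ax.set_title("Peloton vs. Covid search interest (sample rows)")

# Hypothetical output file name, chosen for this sketch
fig.savefig("covid_peloton_scatter.png")
```

To plot the full dataset instead, the same calls can be applied to the DataFrame loaded with pd.read_csv() earlier.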


Use scipy.stats.pearsonr() to Calculate Pearson Correlation

The function scipy.stats.pearsonr(x, y) calculates the Pearson correlation between two variables. It returns two values: the correlation coefficient and the p-value.

import pandas as pd
import scipy.stats
# Load the data and use the date column as the index
data_Covid_Peloton = pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Covid_and_Peloton.csv")
data_Covid_Peloton = data_Covid_Peloton.set_index('Week')
# pearsonr returns the correlation coefficient and the p-value
correlation_result = scipy.stats.pearsonr(data_Covid_Peloton.Covid, data_Covid_Peloton.Peloton)
print(correlation_result)

The following is the output. The first value is the correlation coefficient, which is the same as the one from pandas. The second value is the p-value (i.e., 9.669931403711145e-08), which is smaller than 0.05, suggesting that the correlation is statistically significant.

(0.4801576989020214, 9.669931403711145e-08)
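To make the two returned values more concrete, the sketch below computes the Pearson coefficient by hand and compares it with scipy.stats.pearsonr(). It assumes NumPy is available and uses only the ten rows visible in the printed output earlier, so its numbers will differ from the full-dataset result shown above. The variable names and the 0.05 threshold variable are illustrative.

```python
import numpy as np
import scipy.stats

# Ten rows visible in the printed output earlier, NOT the full dataset
covid = np.array([2, 5, 20, 53, 62, 16, 16, 14, 14, 14], dtype=float)
peloton = np.array([26, 24, 24, 67, 69, 32, 29, 27, 25, 25], dtype=float)

# Pearson r: sum of products of centered values, divided by the
# square root of the product of the two sums of squared centered values
covid_centered = covid - covid.mean()
peloton_centered = peloton - peloton.mean()
r_manual = (covid_centered * peloton_centered).sum() / np.sqrt(
    (covid_centered ** 2).sum() * (peloton_centered ** 2).sum()
)

# The result of pearsonr can be unpacked into the coefficient and the p-value
r_scipy, p_value = scipy.stats.pearsonr(covid, peloton)

alpha = 0.05  # significance level used in the text
print(f"manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}, p = {p_value:.4g}")
print("significant at 0.05" if p_value < alpha else "not significant at 0.05")
```

Unpacking the result this way avoids having to read the coefficient and the p-value off the printed tuple.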

Use scipy.stats.spearmanr() to Calculate Spearman Rank-order Correlation

As its name indicates, scipy.stats.pearsonr() calculates a Pearson correlation. Note that the p-value it reports relies on the assumption that X and Y are drawn from independent normal distributions. In some situations, this assumption cannot be met. If that is the case, we can use a nonparametric method such as the Spearman rank-order correlation.
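One informal way to judge whether the normality assumption is plausible is the Shapiro-Wilk test, available as scipy.stats.shapiro(). The sketch below is not part of the original analysis; it is only an illustration, run on the ten rows visible in the printed output earlier rather than on the full dataset.

```python
import scipy.stats

# Ten rows visible in the printed output earlier, NOT the full dataset
covid = [2, 5, 20, 53, 62, 16, 16, 14, 14, 14]
peloton = [26, 24, 24, 67, 69, 32, 29, 27, 25, 25]

shapiro_p_values = {}
for name, values in [("Covid", covid), ("Peloton", peloton)]:
    # shapiro returns the test statistic W and a p-value;
    # a small p-value is evidence against normality
    statistic, p_value = scipy.stats.shapiro(values)
    shapiro_p_values[name] = p_value
    print(f"{name}: W = {statistic:.3f}, p = {p_value:.4f}")
```

If either variable looks clearly non-normal, that is one reason to prefer a rank-based measure over relying on the Pearson p-value.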

The Spearman correlation assesses monotonic relationships between two variables. What does that mean? In the following chart (credit: Wikipedia), the Pearson correlation is 0.88, while the Spearman correlation is 1. This is because the Spearman correlation measures whether larger X values always correspond to larger Y values, regardless of whether the relationship is linear. Thus, the following chart shows that X and Y have a perfect positive monotonic relationship, since X and Y always move in the same direction.
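The same idea can be reproduced with a small synthetic example. The data below are made up purely for illustration (they are not the data behind the Wikipedia chart): Y always increases with X, but not along a straight line, so the Spearman correlation is 1 while the Pearson correlation is lower.

```python
import numpy as np
import scipy.stats

# Synthetic data, made up for illustration only
x = np.arange(1, 11)
y = np.exp(x)  # strictly increasing in x, but far from a straight line

# Both functions return a coefficient and a p-value; only the coefficient is used here
pearson_r, _ = scipy.stats.pearsonr(x, y)
spearman_rho, _ = scipy.stats.spearmanr(x, y)

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
```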

Like the Pearson correlation, the Spearman rank-order correlation ranges from -1 to +1. A value of -1 indicates a perfect negative correlation, whereas +1 indicates a perfect positive correlation. The code below uses the function scipy.stats.spearmanr().

import pandas as pd
import scipy.stats
# Load the data and use the date column as the index
data_Covid_Peloton = pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Covid_and_Peloton.csv")
data_Covid_Peloton = data_Covid_Peloton.set_index('Week')
# spearmanr returns the rank-order correlation coefficient and the p-value
correlation_result = scipy.stats.spearmanr(data_Covid_Peloton.Covid, data_Covid_Peloton.Peloton)
print(correlation_result)

The following is the output. As we can see, the correlation coefficient (about 0.55) is different from the Pearson correlation coefficient (about 0.48).

SpearmanrResult(correlation=0.550233184505772, pvalue=3.9586134701706696e-10)
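The difference arises because the Spearman correlation is, in effect, a Pearson correlation computed on the ranks of the values rather than on the values themselves. The sketch below illustrates this on the ten rows visible in the printed output earlier (not the full dataset, so the numbers will not match the result above). It also shows that pandas can compute the same coefficient via the method argument of DataFrame.corr(). The variable names are illustrative.

```python
import pandas as pd
import scipy.stats

# Ten rows visible in the printed output earlier, NOT the full dataset
covid = [2, 5, 20, 53, 62, 16, 16, 14, 14, 14]
peloton = [26, 24, 24, 67, 69, 32, 29, 27, 25, 25]

# Replace each value by its rank; tied values get the average rank by default
covid_ranks = scipy.stats.rankdata(covid)
peloton_ranks = scipy.stats.rankdata(peloton)

# Pearson correlation of the ranks
rho_from_ranks, _ = scipy.stats.pearsonr(covid_ranks, peloton_ranks)

# Spearman correlation computed directly by SciPy
rho_scipy, _ = scipy.stats.spearmanr(covid, peloton)

# Spearman correlation computed by pandas
sample = pd.DataFrame({"Covid": covid, "Peloton": peloton})
rho_pandas = sample.corr(method="spearman").loc["Covid", "Peloton"]

print(rho_from_ranks, rho_scipy, rho_pandas)
```

All three values should agree up to floating-point rounding.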