This tutorial shows how to apply correlation to a real-world example. I will use Peloton and Covid to illustrate the concept.
In early 2021, many consumers wanted to buy Peloton bikes, but Peloton had difficulty meeting the demand. A year later, in January 2022, Forbes reported that Peloton was facing challenges because demand had started falling, and that Peloton had even reduced its production. Thus, it seems there is a positive correlation between consumers’ concerns about Covid and their interest in Peloton bikes.
To test this potential correlation, I used Google Trends data from early 2020 to early 2022.
The following is the complete Python code, as well as the output. The code first prints the data and then the correlation matrix. The matrix contains only the correlation coefficient (0.48), which is positive, suggesting a positive correlation between the search keywords Covid and Peloton on Google Trends.
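import pandas as pd

# Load the weekly Google Trends scores for both keywords
data_Covid_Peloton = pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Covid_and_Peloton.csv")
print(data_Covid_Peloton)

# Use the week as the index so only the two numeric columns are correlated
data_Covid_Peloton = data_Covid_Peloton.set_index('Week')

# Pairwise Pearson correlation matrix
correlation_result = data_Covid_Peloton.corr()
print(correlation_result)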
          Week  Peloton  Covid
0    2/23/2020       26      2
1     3/1/2020       24      5
2     3/8/2020       24     20
3    3/15/2020       67     53
4    3/22/2020       69     62
..         ...      ...    ...
106   3/6/2022       32     16
107  3/13/2022       29     16
108  3/20/2022       27     14
109  3/27/2022       25     14
110   4/3/2022       25     14

[111 rows x 3 columns]
          Peloton     Covid
Peloton  1.000000  0.480158
Covid    0.480158  1.000000
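To make it clearer what the number in the correlation matrix means, here is a minimal sketch that computes the Pearson coefficient by hand and compares it with Pandas. It uses only the first five weeks copied from the printed data above, so the value will not match the full-data result of 0.48. The helper name pearson_r is my own and is not part of any library.

```python
import numpy as np
import pandas as pd

# First five weeks copied from the printed data above (not the full file).
sample = pd.DataFrame({
    "Peloton": [26, 24, 24, 67, 69],
    "Covid": [2, 5, 20, 53, 62],
})

def pearson_r(x, y):
    """Hypothetical helper: Pearson r from the textbook definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    # Sum of cross-products divided by the product of the root sums of squares
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

manual_r = pearson_r(sample["Peloton"], sample["Covid"])
pandas_r = sample.corr().loc["Peloton", "Covid"]

# The two values should agree up to floating-point error.
print(manual_r, pandas_r)
```

Because the coefficient divides by the spread of each variable, it always falls between -1 and +1, regardless of the scale of the Google Trends scores.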
We can also use a scatter plot to show the relationship between these two variables. Below is the figure; for the complete Python code used to plot it, you can refer to my post on scatter plots. However, the Pandas output does not include a p-value, so we cannot tell whether the relationship is statistically significant. To get one, we can use the statistics functions in the Python package SciPy.
scipy.stats.pearsonr() to Calculate Pearson Correlation
The function scipy.stats.pearsonr() can be used to calculate Pearson correlations. It returns two values: the correlation coefficient and the p-value.
import pandas as pd
import scipy.stats

# Load the weekly Google Trends scores and index them by week
data_Covid_Peloton = pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Covid_and_Peloton.csv")
data_Covid_Peloton = data_Covid_Peloton.set_index('Week')

# pearsonr() returns the correlation coefficient and the p-value
correlation_result = scipy.stats.pearsonr(data_Covid_Peloton.Covid, data_Covid_Peloton.Peloton)
print(correlation_result)
The following is the output. The first value is the correlation coefficient, which is the same as the one from Pandas. The second value is the p-value (i.e., 9.669931403711145e-08), which is smaller than 0.05, suggesting the correlation is statistically significant.
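To see where a p-value like this comes from, here is a rough sketch that rebuilds a two-sided p-value from the coefficient and the sample size, using the standard t-statistic form of the Pearson test. It assumes all 111 weekly rows were used (no missing values) and takes the rounded coefficient 0.480158 from the matrix above, so the result should be of the same order of magnitude as the reported p-value rather than identical to the last digit. SciPy's internal calculation may be organized differently.

```python
import math
from scipy import stats

n = 111        # number of weekly rows in the printed data (assumes none are missing)
r = 0.480158   # rounded coefficient from the correlation matrix above

# Under the null hypothesis of no correlation, this statistic follows
# a t distribution with n - 2 degrees of freedom.
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# Two-sided p-value: probability of a |t| at least this large under the null
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

alpha = 0.05
print(t_stat, p_value, p_value < alpha)
```

The same coefficient with a much smaller sample would give a larger p-value, which is why the coefficient alone is not enough to judge significance.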
scipy.stats.spearmanr() to Calculate Spearman Rank-order Correlation
The name scipy.stats.pearsonr() indicates that the function computes a Pearson correlation. Note that the Pearson coefficient measures linear association, and the p-value from pearsonr() is calculated under the assumption that X and Y are drawn from independent normal distributions. In some situations, such assumptions cannot be met. If that is the case, we can use nonparametric methods such as the Spearman rank-order correlation.
The Spearman correlation assesses monotonic relationships between two variables. What does that mean? In the following chart (credit: Wikipedia), the Pearson correlation is 0.88, while the Spearman correlation is 1. This is because the Spearman correlation measures whether larger X values always correspond to larger Y values, whether or not the relationship is linear. Thus, the following chart shows that X and Y have a perfect positive monotonic relationship, since X and Y always move in the same direction.
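The difference is easy to reproduce with a small synthetic example. The sketch below uses made-up data (not the data behind the Wikipedia chart): Y is an exponential function of X, so it always increases with X but not along a straight line. It also shows that the Spearman coefficient is the Pearson coefficient computed on the ranks of the data.

```python
import numpy as np
from scipy import stats

# Made-up data: y always increases with x, but the relationship is not linear
x = np.arange(1, 11)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

# Spearman is Pearson applied to the ranks of each variable
rank_r, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

print(pearson_r)     # below 1, because the points do not lie on a line
print(spearman_rho)  # 1 up to floating-point error, because the order is preserved
print(rank_r)        # matches spearman_rho
```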
Like the Pearson correlation, the Spearman rank-order correlation ranges between -1 and +1. A value of -1 suggests a perfect negative correlation, whereas +1 implies a perfect positive relationship. In the code below, we use the function scipy.stats.spearmanr().
import pandas as pd
import scipy.stats

# Load the weekly Google Trends scores and index them by week
data_Covid_Peloton = pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Covid_and_Peloton.csv")
#print(data_Covid_Peloton)
data_Covid_Peloton = data_Covid_Peloton.set_index('Week')

# spearmanr() returns the rank-order correlation coefficient and the p-value
correlation_result = scipy.stats.spearmanr(data_Covid_Peloton.Covid, data_Covid_Peloton.Peloton)
print(correlation_result)
The following is the output. As we can see, the correlation coefficient is different from the one from the Pearson correlation.
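If only the coefficient is needed, Pandas can compute the Spearman correlation as well, through the method argument of corr(). The sketch below checks that it agrees with scipy.stats.spearmanr() on the first five weeks copied from the printed data earlier in this tutorial, so the numbers will differ from the full-data result.

```python
import pandas as pd
from scipy import stats

# First five weeks copied from the printed data earlier (not the full file).
sample = pd.DataFrame({
    "Peloton": [26, 24, 24, 67, 69],
    "Covid": [2, 5, 20, 53, 62],
})

# Pandas: Spearman correlation matrix, coefficient only (no p-value)
pandas_rho = sample.corr(method="spearman").loc["Peloton", "Covid"]

# SciPy: coefficient plus p-value
scipy_rho, scipy_p = stats.spearmanr(sample["Covid"], sample["Peloton"])

# The two coefficients should agree up to floating-point error.
print(pandas_rho, scipy_rho, scipy_p)
```

With only five rows the p-value is not very informative; the point here is only that both libraries produce the same coefficient.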