Correlation is a statistical measure of the relationship between two variables, X and Y.
This tutorial shows how to use SciPy, NumPy, and pandas to do Pearson correlation analysis. It also shows how to plot the relationship between two variables in Python using seaborn.
Method 1: Use scipy to calculate correlation in Python
scipy.stats.pearsonr(x, y)
Method 2: Use numpy to calculate correlation in Python
np.corrcoef(x, y)
Method 3: Use pandas to calculate correlation in Python
df.corr()
Sample Data
The following is the sample data for correlation.
Temperature | Iced coffee sales |
---|---|
34 | 41 |
36 | 40 |
40 | 50 |
60 | 150 |
40 | 100 |
75 | 200 |
Example 1: Use scipy.stats.pearsonr() to calculate correlation
scipy.stats.pearsonr(x, y) can be used to calculate the Pearson correlation. The function returns two values: the correlation coefficient and the p-value.
import scipy.stats
Temperature=[34,36,40,60,40,75]
Iced_coffee_sales=[41,40,50,150,100,200]
correlation_result=scipy.stats.pearsonr(Temperature, Iced_coffee_sales)
print(correlation_result)
Output (recent SciPy versions print the same two values wrapped in a PearsonRResult object):
(0.9641683562073264, 0.0019028578016146566)
The correlation coefficient is 0.96, which indicates a strong positive linear relationship. The second value is the p-value (0.0019), which is smaller than 0.05, suggesting the correlation is statistically significant.
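To make the coefficient less of a black box, here is a minimal sketch that computes Pearson's r directly from its definition (the sum of products of deviations divided by the square root of the product of the summed squared deviations) and compares it with the SciPy result. The lowercase variable names are chosen for this sketch only.

```python
import numpy as np
from scipy import stats

temperature = np.array([34, 36, 40, 60, 40, 75], dtype=float)
iced_coffee_sales = np.array([41, 40, 50, 150, 100, 200], dtype=float)

# Deviation of each observation from its sample mean
dx = temperature - temperature.mean()
dy = iced_coffee_sales - iced_coffee_sales.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The SciPy result can be unpacked into the coefficient and the p-value
r_scipy, p_value = stats.pearsonr(temperature, iced_coffee_sales)

print(round(float(r_manual), 4), round(float(r_scipy), 4), round(float(p_value), 4))
```

Both coefficients should agree with the 0.96 reported above, and unpacking the result into two names is often more convenient than printing the whole return value.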
Example 2: Use np.corrcoef() to Calculate Pearson Correlation
We can also use NumPy to calculate the Pearson correlation. The following is the Python code.
import numpy as np
np.corrcoef(Temperature, Iced_coffee_sales)
Output:
array([[1.        , 0.96416836],
       [0.96416836, 1.        ]])
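np.corrcoef() returns a full correlation matrix rather than a single number. As a small sketch, the coefficient between the two variables can be read from an off-diagonal entry; the variable names here are illustrative.

```python
import numpy as np

temperature = [34, 36, 40, 60, 40, 75]
iced_coffee_sales = [41, 40, 50, 150, 100, 200]

corr_matrix = np.corrcoef(temperature, iced_coffee_sales)

# Diagonal entries are each variable's correlation with itself (1.0).
# The off-diagonal entries [0, 1] and [1, 0] both hold the correlation
# between the two variables, because the matrix is symmetric.
r = corr_matrix[0, 1]
print(round(float(r), 4))
```

Unlike scipy.stats.pearsonr(), this gives only the coefficient and no p-value.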
Example 3: Use DataFrame.corr() to Calculate Pearson Correlation
pandas also provides a method for Pearson correlation, DataFrame.corr(). Note that the data must be in a DataFrame before this method can be called.
Below, we build a DataFrame and then call the method.
import pandas as pd
Temperature=[34,36,40,60,40,75]
Iced_coffee_sales=[41,40,50,150,100,200]
data_iced_coffee={'Temperature':Temperature,'Iced_coffee_sales':Iced_coffee_sales}
data_iced_coffee=pd.DataFrame(data=data_iced_coffee)
print(data_iced_coffee.corr())
Output:
                   Temperature  Iced_coffee_sales
Temperature           1.000000           0.964168
Iced_coffee_sales     0.964168           1.000000
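When only one pair of columns is of interest, a single number can be easier to work with than the full matrix. The sketch below uses Series.corr() for that and also spells out the method argument of DataFrame.corr(), whose default is "pearson".

```python
import pandas as pd

data_iced_coffee = pd.DataFrame({
    "Temperature": [34, 36, 40, 60, 40, 75],
    "Iced_coffee_sales": [41, 40, 50, 150, 100, 200],
})

# Series.corr() returns one float for a pair of columns (Pearson by default)
r = data_iced_coffee["Temperature"].corr(data_iced_coffee["Iced_coffee_sales"])

# Passing method="pearson" explicitly gives the same matrix as calling corr() with no arguments
corr_matrix = data_iced_coffee.corr(method="pearson")

print(round(float(r), 4))
```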
Example 4: Temperature and Ice Cream (real data)
The Ice Cream and Frozen Dessert production index is from stlouisfed.org, and the average monthly temperatures in the United States are from statista.com. The following is the data.
         DATE  Ice_Cream_Index  Temperature
0    1/1/2020          84.1969        34.57
1    2/1/2020          99.7767        36.18
2    3/1/2020         108.0301        46.08
3    4/1/2020         102.7954        50.88
4    5/1/2020         112.5288        60.91
5    6/1/2020         122.0301        70.29
6    7/1/2020         116.9799        75.65
7    8/1/2020         120.7120        74.71
8    9/1/2020         111.8634        65.91
9   10/1/2020         100.7911        54.28
10  11/1/2020          93.1239        46.31
11  12/1/2020          83.7717        35.71
12   1/1/2021          92.7025        30.97
13   2/1/2021         101.2105        30.60
14   3/1/2021         111.4117        45.54
15   4/1/2021         110.6197        51.87
16   5/1/2021         109.4956        60.34
17   6/1/2021         113.7700        72.59
18   7/1/2021         108.8079        75.40
19   8/1/2021         106.8920        73.98
20   9/1/2021          95.8193        67.80
21  10/1/2021          93.0939        56.95
22  11/1/2021          82.6352        45.09
23  12/1/2020          76.7508        39.34
The following Python code uses DataFrame.corr() in pandas to calculate the correlation.
import pandas as pd
data_Ice_cream=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Ice_cream_data.csv")
data_Ice_cream=data_Ice_cream.set_index('DATE')
correlation_result=data_Ice_cream.corr()
print(correlation_result)
Output:
                 Ice_Cream_Index  Temperature
Ice_Cream_Index         1.000000     0.689277
Temperature             0.689277     1.000000
The Pearson correlation coefficient is 0.689. The positive coefficient suggests that temperature and ice cream production are positively correlated. However, DataFrame.corr() does not return a p-value, so this output alone does not tell us whether the correlation is statistically significant.
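One way to obtain the missing p-value is to pass the two columns to scipy.stats.pearsonr(), as in Example 1. The sketch below assumes the values listed above match the CSV file and types them in directly so that it runs without a network connection; the 0.05 threshold is the same conventional cutoff used earlier.

```python
import pandas as pd
from scipy import stats

# Values copied from the data listing above (assumed to match the CSV file)
data_ice_cream = pd.DataFrame({
    "Ice_Cream_Index": [
        84.1969, 99.7767, 108.0301, 102.7954, 112.5288, 122.0301,
        116.9799, 120.7120, 111.8634, 100.7911, 93.1239, 83.7717,
        92.7025, 101.2105, 111.4117, 110.6197, 109.4956, 113.7700,
        108.8079, 106.8920, 95.8193, 93.0939, 82.6352, 76.7508,
    ],
    "Temperature": [
        34.57, 36.18, 46.08, 50.88, 60.91, 70.29,
        75.65, 74.71, 65.91, 54.28, 46.31, 35.71,
        30.97, 30.60, 45.54, 51.87, 60.34, 72.59,
        75.40, 73.98, 67.80, 56.95, 45.09, 39.34,
    ],
})

# pearsonr() returns the coefficient together with a two-sided p-value
r, p_value = stats.pearsonr(
    data_ice_cream["Temperature"], data_ice_cream["Ice_Cream_Index"]
)

print(f"r = {r:.3f}, p-value = {p_value:.4g}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```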
The following code shows how to use seaborn to draw a scatter plot with a fitted regression line.
import seaborn as sns
import pandas as pd
data_Ice_cream=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Ice_cream_data.csv")
sns.lmplot(x='Temperature',y='Ice_Cream_Index',data=data_Ice_cream,fit_reg=True)
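A scatter plot shows the raw relationship, while a heatmap of the correlation matrix shows the coefficients themselves. The following is a rough sketch of that second view using sns.heatmap() on the small iced coffee sample from the earlier examples. The output file name is hypothetical, and the Agg backend is an assumption made so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

data_iced_coffee = pd.DataFrame({
    "Temperature": [34, 36, 40, 60, 40, 75],
    "Iced_coffee_sales": [41, 40, 50, 150, 100, 200],
})

# Pearson correlation matrix of all numeric columns
corr = data_iced_coffee.corr()

# annot=True writes each coefficient into its cell;
# vmin/vmax pin the color scale to the full range of r
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Pearson correlation matrix")

plt.tight_layout()
plt.savefig("correlation_heatmap.png")  # hypothetical file name
```

In an interactive session, the backend line can be dropped and plt.show() used instead of plt.savefig().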
