How to Conduct Correlation Analysis in Python

Correlation is a statistical measure of the relationship between two variables, X and Y.

This tutorial shows how to use SciPy, NumPy, and pandas to run a Pearson correlation analysis. It then shows how to plot the relationship in Python using seaborn.

Method 1: Use scipy to calculate correlation in Python

scipy.stats.pearsonr(x, y)

Method 2: Use numpy to calculate correlation in Python

np.corrcoef(x, y)

Method 3: Use pandas to calculate correlation in Python

df.corr()  # df is a pandas DataFrame; corr() is a DataFrame method, not a top-level pd function

Sample Data

The following is the sample data for correlation.

Temperature  Iced coffee sales
34           41
36           40
40           50
60           150
40           100
75           200

Correlation Example

Example 1: Use scipy.stats.pearsonr() to calculate correlation

scipy.stats.pearsonr(x, y) calculates the Pearson correlation between x and y. The function returns two values: the correlation coefficient and the p-value.

import scipy.stats
Temperature=[34,36,40,60,40,75]
Iced_coffee_sales=[41,40,50,150,100,200]
correlation_result=scipy.stats.pearsonr(Temperature, Iced_coffee_sales)
print(correlation_result)

Output:

(0.9641683562073264, 0.0019028578016146566)

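The first value is the correlation coefficient (0.96), which indicates a strong positive relationship between temperature and iced coffee sales. The second value is the p-value (0.0019), which is smaller than 0.05, suggesting the correlation is statistically significant. Note that recent SciPy versions print the result as a PearsonRResult object with statistic and pvalue fields rather than as a plain tuple; the numbers are the same.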
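If you need the two numbers separately, the result can be unpacked into two variables. The following is a minimal sketch of that; the variable names r and p_value and the 0.05 threshold are illustrative choices, not something the function requires.

```python
import scipy.stats

temperature = [34, 36, 40, 60, 40, 75]
iced_coffee_sales = [41, 40, 50, 150, 100, 200]

# pearsonr returns (correlation coefficient, p-value); unpack them directly
r, p_value = scipy.stats.pearsonr(temperature, iced_coffee_sales)

alpha = 0.05  # conventional significance level (an assumption, adjust as needed)
print(f"r = {r:.4f}, p-value = {p_value:.4f}")
print("significant at alpha" if p_value < alpha else "not significant at alpha")
```

Unpacking this way works whether SciPy returns a plain tuple or a result object.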

Example 2: Use np.corrcoef() to Calculate Pearson Correlation

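We can also use NumPy to calculate the Pearson correlation. np.corrcoef() returns a correlation matrix: the diagonal entries are each variable's correlation with itself (always 1), and the off-diagonal entries are the correlation between the two variables. The following is the Python code.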

import numpy as np
np.corrcoef(Temperature, Iced_coffee_sales)

Output:

array([[1.        , 0.96416836],
       [0.96416836, 1.        ]])
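To get the coefficient as a single number, you can index the off-diagonal entry of the matrix. The sketch below also recomputes the same value from the definition of Pearson's r (the sum of products of deviations divided by the square root of the product of the summed squared deviations) as a cross-check. The names r_matrix and r_manual are illustrative.

```python
import numpy as np

temperature = np.array([34, 36, 40, 60, 40, 75], dtype=float)
iced_coffee_sales = np.array([41, 40, 50, 150, 100, 200], dtype=float)

# Off-diagonal entry [0, 1] is the correlation between the two inputs
r_matrix = np.corrcoef(temperature, iced_coffee_sales)[0, 1]

# Same quantity from the definition of Pearson's r
dx = temperature - temperature.mean()              # deviations from the mean of x
dy = iced_coffee_sales - iced_coffee_sales.mean()  # deviations from the mean of y
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_matrix, r_manual)
```

Both values should agree with the 0.96416836 shown in the matrix above, up to floating-point rounding.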

Example 3: Use DataFrame.corr() to Calculate Pearson Correlation

Pandas also provides a method for Pearson correlation, corr(). Note that it is a method of a DataFrame rather than a top-level pandas function, so the data must be in a DataFrame before we can call it.

Below, we build a DataFrame from the two lists and then call corr() on it.

import pandas as pd
Temperature=[34,36,40,60,40,75]
Iced_coffee_sales=[41,40,50,150,100,200]
data_iced_coffee={'Temperature':Temperature,'Iced_coffee_sales':Iced_coffee_sales}
data_iced_coffee=pd.DataFrame(data=data_iced_coffee)
print(data_iced_coffee.corr())

Output:

                   Temperature  Iced_coffee_sales
Temperature           1.000000           0.964168
Iced_coffee_sales     0.964168           1.000000
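When only one pair of columns is of interest, a single coefficient can be easier to work with than the full matrix. A minimal sketch using Series.corr(), which computes the Pearson correlation by default; the DataFrame name df is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [34, 36, 40, 60, 40, 75],
    "Iced_coffee_sales": [41, 40, 50, 150, 100, 200],
})

# Series.corr() returns a single float for one pair of columns
r = df["Temperature"].corr(df["Iced_coffee_sales"])
print(round(r, 6))
```

The result should match the off-diagonal value in the matrix above.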

Example 4: Temperature and Ice Cream (real data)

The Ice Cream and Frozen Dessert production index is from stlouisfed.org, and the average monthly temperatures in the United States are from statista.com. The following is the data.

        DATE  Ice_Cream_Index  Temperature
0    1/1/2020          84.1969        34.57
1    2/1/2020          99.7767        36.18
2    3/1/2020         108.0301        46.08
3    4/1/2020         102.7954        50.88
4    5/1/2020         112.5288        60.91
5    6/1/2020         122.0301        70.29
6    7/1/2020         116.9799        75.65
7    8/1/2020         120.7120        74.71
8    9/1/2020         111.8634        65.91
9   10/1/2020         100.7911        54.28
10  11/1/2020          93.1239        46.31
11  12/1/2020          83.7717        35.71
12   1/1/2021          92.7025        30.97
13   2/1/2021         101.2105        30.60
14   3/1/2021         111.4117        45.54
15   4/1/2021         110.6197        51.87
16   5/1/2021         109.4956        60.34
17   6/1/2021         113.7700        72.59
18   7/1/2021         108.8079        75.40
19   8/1/2021         106.8920        73.98
20   9/1/2021          95.8193        67.80
21  10/1/2021          93.0939        56.95
22  11/1/2021          82.6352        45.09
23  12/1/2020          76.7508        39.34

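The following Python code reads the data into a DataFrame and calls its corr() method to calculate the correlation. DATE is set as the index so that only the two numeric columns enter the calculation.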

import pandas as pd
data_Ice_cream=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Ice_cream_data.csv")
data_Ice_cream=data_Ice_cream.set_index('DATE')
correlation_result=data_Ice_cream.corr()
print(correlation_result)

Output:

                 Ice_Cream_Index  Temperature
Ice_Cream_Index         1.000000     0.689277
Temperature             0.689277     1.000000

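The Pearson correlation coefficient is 0.689. The positive coefficient suggests that temperature and ice cream production are positively correlated. However, DataFrame.corr() does not return a p-value, so from this output alone we do not know whether the correlation is statistically significant. To test significance, the two columns can be passed to scipy.stats.pearsonr(), as in Example 1.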
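A sketch of how a p-value could be obtained for this data with scipy.stats.pearsonr(). To keep the example self-contained, the values are typed in from the table above rather than downloaded, so any transcription slip would change the result slightly. The list names are illustrative.

```python
import scipy.stats

# Values copied from the table above (rows 0 to 23)
ice_cream_index = [
    84.1969, 99.7767, 108.0301, 102.7954, 112.5288, 122.0301,
    116.9799, 120.7120, 111.8634, 100.7911, 93.1239, 83.7717,
    92.7025, 101.2105, 111.4117, 110.6197, 109.4956, 113.7700,
    108.8079, 106.8920, 95.8193, 93.0939, 82.6352, 76.7508,
]
temperature = [
    34.57, 36.18, 46.08, 50.88, 60.91, 70.29,
    75.65, 74.71, 65.91, 54.28, 46.31, 35.71,
    30.97, 30.60, 45.54, 51.87, 60.34, 72.59,
    75.40, 73.98, 67.80, 56.95, 45.09, 39.34,
]

# pearsonr returns both the coefficient and its p-value
r, p_value = scipy.stats.pearsonr(temperature, ice_cream_index)
print(f"r = {r:.3f}, p-value = {p_value:.5f}")
```

The coefficient should reproduce the 0.689 reported by corr(), and the p-value can then be compared with the chosen significance level.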

The following code uses seaborn to draw a scatter plot of the two variables with a fitted regression line.

import seaborn as sns
import pandas as pd
data_Ice_cream=pd.read_csv("https://raw.githubusercontent.com/TidyPython/data_visualization/main/Ice_cream_data.csv")
sns.lmplot(x='Temperature',y='Ice_Cream_Index',data=data_Ice_cream,fit_reg=True)

[Figure: Plot Correlation Analysis in Python (Scatter Plot and fitted Regression Line)]
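It can be handy to show the coefficient on the plot itself. The following is a minimal sketch under a few assumptions: it uses the small iced coffee sample from Example 1 so that it runs without a download, it uses sns.regplot() (the axes-level counterpart of lmplot) so that a title can be set on the axes, and the output file name correlation_scatter.png is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display (assumption)
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "Temperature": [34, 36, 40, 60, 40, 75],
    "Iced_coffee_sales": [41, 40, 50, 150, 100, 200],
})

# Pearson correlation between the two columns
r = df["Temperature"].corr(df["Iced_coffee_sales"])

# Scatter plot with a fitted regression line, titled with the coefficient
fig, ax = plt.subplots()
sns.regplot(x="Temperature", y="Iced_coffee_sales", data=df, ax=ax)
ax.set_title(f"Pearson r = {r:.2f}")
fig.savefig("correlation_scatter.png")  # hypothetical output file name
```

In an interactive session, drop the matplotlib.use("Agg") line and call plt.show() instead of saving to a file.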

Further Reading