Correlation: Definition, Formula, and Examples

What is Correlation?

Correlation is a statistical measure of the strength and direction of the relationship between two variables, X and Y. For instance, you can measure to what extent temperature (X) is related to the production of ice cream (Y). You would probably expect higher temperatures to correspond with higher ice cream production. On the plot shown below, the upward trend line also suggests a positive correlation between X and Y.

Conceptual Framework of Correlation

Conceptually, there are three types of correlation: negative, zero, and positive. A negative correlation means that as X goes up, Y tends to go down (and vice versa). In contrast, a positive correlation means that as X goes up, Y tends to go up as well. A zero correlation means that X and Y do not move together in any consistent way.
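To make the three cases concrete, here is a minimal Python sketch. The helper `pearson_r` and the small datasets are made up for illustration and are not part of the original example. They are chosen so that the coefficient comes out positive, negative, and zero respectively.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of deviations from the means
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y_positive = [2, 4, 6, 8, 10]   # rises as x rises
y_negative = [10, 8, 6, 4, 2]   # falls as x rises
y_zero = [3, 1, 2, 1, 3]        # no consistent upward or downward trend

for label, y in [("positive", y_positive), ("negative", y_negative), ("zero", y_zero)]:
    print(label, round(pearson_r(x, y), 2))
```

The sign of the result tells you the direction of the relationship, while the size of its absolute value tells you the strength.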

Meaning of Correlation Coefficients

Correlation coefficients range from -1 to +1. The following table summarizes what different values of the correlation coefficient imply. Note that the strength of a correlation is judged by the absolute value of the coefficient, not by its positive or negative sign. Explained variance, which is the square of the coefficient, measures the proportion of the variation (dispersion) in Y that X accounts for. For instance, a coefficient of +0.6 implies that X explains 36% of the variance of Y.

Direction          | Negative | Negative | Negative | Nonexistent | Positive | Positive | Positive
Strength           | Strong   | Medium   | Weak     | Absent      | Weak     | Medium   | Strong
Explained Variance | 100%     | 36%      | 9%       | 0%          | 9%       | 36%      | 100%
Meaning and Strength of Correlation Coefficients
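The link between a coefficient and its explained variance can be sketched in a few lines of Python. The function name `explained_variance` is hypothetical. The coefficients in the loop are an assumption: they are the square roots of the percentages in the table (for example, 36% corresponds to ±0.6).

```python
def explained_variance(r):
    """Return the share of variance explained (r squared) for a coefficient r."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    return r ** 2


# Assumed coefficients, derived from the explained-variance row of the table
for r in [-1.0, -0.6, -0.3, 0.0, 0.3, 0.6, 1.0]:
    print(f"r = {r:+.1f} -> explained variance = {explained_variance(r):.0%}")
```

Because the coefficient is squared, -0.6 and +0.6 explain the same share of variance, which is why strength is read from the absolute value.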

Correlation Formula and Manual Calculation

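The following is the formula for correlation, where $\bar{x}$ and $\bar{y}$ are the means of X and Y and n is the number of observations. Let’s use a hypothetical example and dataset to show how to use this formula manually. For how to do correlation analysis in Python, please refer to my other tutorial.

$$r_{xy}=\frac{\sum_{i=1}^{n}((x_i-\bar{x})(y_i-\bar{y}))}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$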
Temperature | Iced coffee sales
34          | 41
36          | 40
40          | 50
60          | 150
40          | 100
75          | 200
Correlation Example

We first calculate the means of both temperature and iced coffee sales.

$$\bar{x}=\frac{34+36+40+60+40+75}{6}=47.5$$
$$\bar{y}=\frac{41+40+50+150+100+200}{6}=96.83$$
Then, we can use the means to write down the specific calculation based on the correlation formula.

$$\sum_{i=1}^{n}((x_i-\bar{x})(y_i-\bar{y})) = (34-47.5)\times (41-96.83) + (36-47.5) \times (40-96.83) \\ + (40-47.5) \times (50-96.83) + (60-47.5) \times (150-96.83) \\ + (40-47.5) \times (100-96.83) + (75-47.5) \times (200-96.83) = 5236.5 $$
$$\sum_{i=1}^{n}(x_i-\bar{x})^2=(34-47.5)^2+(36-47.5)^2+(40-47.5)^2 \\+(60-47.5)^2+(40-47.5)^2+(75-47.5)^2=1339.5$$
$$\sum_{i=1}^{n}(y_i-\bar{y})^2=(41-96.83)^2+(40-96.83)^2+(50-96.83)^2 \\+(150-96.83)^2+(100-96.83)^2+(200-96.83)^2=22020.83$$
$$r_{xy}=\frac{\sum_{i=1}^{n}((x_i-\bar{x})(y_i-\bar{y}))}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}=\frac{5236.5}{\sqrt{1339.5 \times 22020.83}}=0.96 $$
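As a quick check of the manual calculation, the same steps can be sketched in plain Python. The variable names are illustrative, and the data values are read off the sums above. The third sales value is taken as 50, which is what the mean of 96.83 and the sum of squared deviations for sales imply.

```python
import math

# Data from the worked example (third sales value assumed to be 50)
temperature = [34, 36, 40, 60, 40, 75]
iced_coffee_sales = [41, 40, 50, 150, 100, 200]

n = len(temperature)

# Step 1: means of both variables
mean_x = sum(temperature) / n
mean_y = sum(iced_coffee_sales) / n

# Step 2: the three sums used in the correlation formula
sum_xy = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(temperature, iced_coffee_sales))
sum_xx = sum((x - mean_x) ** 2 for x in temperature)
sum_yy = sum((y - mean_y) ** 2 for y in iced_coffee_sales)

# Step 3: the correlation coefficient
r_xy = sum_xy / (math.sqrt(sum_xx) * math.sqrt(sum_yy))

print(round(mean_x, 2), round(mean_y, 2))                      # 47.5 96.83
print(round(sum_xy, 2), round(sum_xx, 2), round(sum_yy, 2))    # 5236.5 1339.5 22020.83
print(round(r_xy, 2))                                          # 0.96
```

A coefficient of 0.96 indicates a strong positive correlation between temperature and iced coffee sales in this hypothetical dataset.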