How to Conduct Simple Linear Regression in Python

This tutorial explains how to use NumPy and SciPy to do simple linear regression in Python. I will first briefly explain the concept of linear regression and the math behind simple linear regression, and then walk through the actual Python code.

Simple Linear Regression vs. Multiple Linear Regression

There are two categories of linear regression, namely simple and multiple. The only difference between them is the number of X variables. Simple linear regression has only one X, whereas multiple linear regression has more than one X.
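For example, a multiple linear regression with two X variables would look like the following, with each X getting its own coefficient.

\[y=\beta_0 +\beta_1x_1+\beta_2x_2+\epsilon\]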

Simple Linear Regression

Below is the basic simple regression equation, in which 𝛽₀ is the intercept, 𝛽₁ is the regression coefficient (slope), and 𝜀 is the random error term.

\[y=\beta_0 +\beta_1x+\epsilon\]

We need to estimate 𝛽₀ and 𝛽₁ from the data we have, and we denote the estimated parameters as the estimated regression coefficients 𝑏₀ and 𝑏₁. Thus, the estimated regression function is as follows.

\[f(x)=b_0 +b_1x\]

You might wonder what criterion is used to find 𝑏₀ and 𝑏₁. The idea is to make the residuals as small as possible, where the residual for each observation is

\[y_i -f(x_i)\]

If we square each residual and sum them up, we get the Sum of Squared Residuals (SSR), shown below. The method of choosing 𝑏₀ and 𝑏₁ to minimize the SSR is called Ordinary Least Squares (OLS).

\[\sum(y_i-f(x_i))^2\]

The following are the exact formulas to calculate 𝑏₁ and 𝑏₀. Note that we can just use Python to calculate all of these; the formulas are here to help us understand the underlying process.

\[b_1=\frac{\sum (X_i-\bar{X})(Y_i-\bar{Y})}{\sum(X_i-\bar{X})^2} =\frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}\]

\[b_0=\bar{Y}-b_1\bar{X}\]

The formulas above are equivalent to the following matrix expression, where X is the design matrix (a column of 1s for the intercept alongside the column of X values) and X′ is its transpose.

\[b=\begin{bmatrix} b_0\\ b_1 \end{bmatrix}=(X^{\prime}X)^{-1}X^{\prime}Y\]


Hypothetical Question and Data

I would like to use some simple data to show what doing simple linear regression actually means. Suppose we would like to look at the impact of price on consumer purchase intention (1 = Not at all, 7 = Very likely). The basic idea is that we would expect a higher price to lead to lower purchase intention.

We can use the following hypothetical data and print it out to see what it looks like.

import pandas as pd

# Hypothetical data: price and purchase intention for six consumers
Prices = (5, 6, 7, 8, 9, 10)
Purchase_Intention = (7, 6, 5, 5, 3, 4)

df = pd.DataFrame(
    {'Prices': Prices,
     'Purchase_Intention': Purchase_Intention})
print(df)
   Prices  Purchase_Intention
0       5                   7
1       6                   6
2       7                   5
3       8                   5
4       9                   3
5      10                   4
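
Before moving to the examples, we can verify the deviation formulas for 𝑏₁ and 𝑏₀ directly with NumPy. This is a minimal sketch that simply translates the two formulas into code; it is not a dedicated regression routine.

import numpy as np

x = np.array([5, 6, 7, 8, 9, 10])
y = np.array([7, 6, 5, 5, 3, 4])

# b1 = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
# b0 = Ybar - b1 * Xbar
b0 = y.mean() - b1 * x.mean()
print(b0, b1)
10.142857142857142 -0.6857142857142857

These values match the results from the matrix method and scipy.stats.linregress() below.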

Example 1: From Scratch (Using the Matrix Method)

We can use matrix operations to do the calculation, without any dedicated regression function.

import numpy as np
import pandas as pd

# The design matrix needs a column of 1s for the intercept
Intercept = (1, 1, 1, 1, 1, 1)
Prices = (5, 6, 7, 8, 9, 10)
Purchase_Intention = (7, 6, 5, 5, 3, 4)

x_df = pd.DataFrame(
    {'Intercept': Intercept,
     'Prices': Prices}
)
print(x_df)
y_df = pd.DataFrame(
    {'Purchase_Intention': Purchase_Intention}
)
print(y_df)

# Convert the data frames to NumPy arrays for matrix algebra
x_matrix = np.array(x_df)
print(x_matrix)
y_matrix = np.array(y_df)
print(y_matrix)

# b = (X'X)^(-1) X'Y
x_matrix_T = x_matrix.transpose()
print(x_matrix_T)
x_T_x = np.matmul(x_matrix_T, x_matrix)
print(x_T_x)
b = np.matmul(np.matmul(np.linalg.inv(x_T_x), x_matrix_T), y_matrix)
print(b)

The following is the output. The slope is 𝑏₁ (-0.69), and the intercept is 𝑏₀ (10.14).

   Intercept  Prices
0          1       5
1          1       6
2          1       7
3          1       8
4          1       9
5          1      10
   Purchase_Intention
0                   7
1                   6
2                   5
3                   5
4                   3
5                   4
[[ 1  5]
 [ 1  6]
 [ 1  7]
 [ 1  8]
 [ 1  9]
 [ 1 10]]
[[7]
 [6]
 [5]
 [5]
 [3]
 [4]]
[[ 1  1  1  1  1  1]
 [ 5  6  7  8  9 10]]
[[  6  45]
 [ 45 355]]
[[10.14285714]
 [-0.68571429]]
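
As a side note, instead of inverting X′X by hand, NumPy provides a built-in least-squares solver that produces the same coefficients and is numerically more robust. A minimal sketch, reusing x_matrix and y_matrix from above:

# Solve the least-squares problem directly rather than inverting X'X
b_lstsq = np.linalg.lstsq(x_matrix, y_matrix, rcond=None)[0]
print(b_lstsq)
[[10.14285714]
 [-0.68571429]]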

Example 2: Using scipy.stats.linregress() for Simple Linear Regression

We can use the scipy.stats.linregress() function to conduct simple linear regression.

import scipy.stats
Prices=(5,6,7,8,9,10)
Purchase_Intention=(7,6,5,5,3,4)
res = scipy.stats.linregress(Prices, Purchase_Intention)
print(res)
LinregressResult(slope=-0.6857142857142857, intercept=10.142857142857142, rvalue=-0.9071147352221454, pvalue=0.012540816801036057, stderr=0.1590789817951435)
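
Each statistic is also available as an attribute of the result object, which is handy for further calculation. For example, squaring rvalue gives the coefficient of determination R², about 0.82 here. A minimal sketch reusing res from above:

print(res.slope)      # b1
print(res.intercept)  # b0
print(res.rvalue**2)  # R-squared, about 0.82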

The output above shows that the slope 𝑏₁ is -0.69 and the intercept 𝑏₀ is 10.14. The p-value is about 0.01, which suggests that the slope is statistically significantly different from zero. Thus, we can write the estimated regression function as follows.

\[f(x)=b_0 +b_1x = 10.14 -0.69x\]
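
For example, plugging in a price of 7 gives a predicted purchase intention of about 5.34 (using the unrounded coefficients).

\[f(7)=10.142857-0.685714\times 7\approx 5.34\]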

We can also plot it to show the estimated regression line and the original data points. The following is the complete Python code to produce the chart. The slope of the red line is 𝑏₁ (-0.69), and the intercept is 𝑏₀ (10.14).

import scipy.stats
import matplotlib.pyplot as plt
import numpy as np

Prices = (5, 6, 7, 8, 9, 10)
Purchase_Intention = (7, 6, 5, 5, 3, 4)

res = scipy.stats.linregress(Prices, Purchase_Intention)

# Scatter plot of the original data points
plt.plot(Prices, Purchase_Intention, 'o', label='original data')
# Fitted line: b0 + b1*x (Prices is converted to an array for elementwise math)
plt.plot(Prices, res.intercept + res.slope*np.array(Prices), 'r', label='fitted line')
plt.xlabel('Price', fontsize=25)
plt.ylabel('Purchase Intention', fontsize=25)
plt.legend(fontsize=25)
plt.show()
[Figure: simple linear regression in Python, showing the original data points and the fitted regression line]

Additional Remark

Looking at the chart, you can see where the residuals are measured visually. That is, the vertical distance between each blue point (original data point) and the corresponding red point on the fitted line is that observation's residual. Square each of the residuals (6 of them in this case) and sum them up, and you get the Sum of Squared Residuals (SSR).

Thus, you can imagine drawing many different red lines that try to mimic the overall trend of the blue points; among all such lines, the fitted line shown here has the lowest Sum of Squared Residuals (SSR). That is why this method is called Ordinary Least Squares (OLS).
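
To make this concrete, here is a minimal sketch that compares the SSR of the OLS line against an arbitrary alternative line; the alternative intercept and slope (10.0 and -0.6) are made-up values for illustration.

import numpy as np
import scipy.stats

x = np.array([5, 6, 7, 8, 9, 10])
y = np.array([7, 6, 5, 5, 3, 4])
res = scipy.stats.linregress(x, y)

def ssr(intercept, slope):
    # Sum of Squared Residuals for a candidate line
    return np.sum((y - (intercept + slope * x))**2)

print(ssr(res.intercept, res.slope))  # OLS line: about 1.77
print(ssr(10.0, -0.6))                # alternative line: 3.4, clearly larger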

