Linear Regression: Python Numpy Implementation from Scratch

This tutorial shows how you can conduct linear regression with Python Numpy from scratch.

1. Math and Matrix of Linear Regression

We can use pure matrix calculation to estimate the regression coefficients in a linear regression model. Below is the process.

\[ Y= \begin{bmatrix} y_{11} \\ y_{12} \\ y_{13} \\ \vdots \\ y_{1n} \end{bmatrix} = \begin{bmatrix} b_0+b_1 x_{11} + b_2 x_{21} \\ b_0+b_1 x_{12}+b_2 x_{22} \\ b_0+b_1 x_{13}+ b_2 x_{23} \\ \vdots \\ b_0+b_1 x_{1n} + b_2 x_{2n} \end{bmatrix} = \begin{bmatrix} 1& x_{11} & x_{21} \\ 1 & x_{12} & x_{22} \\ 1 & x_{13} & x_{23} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{bmatrix} \begin{bmatrix} b_0\\ b_1\\ b_2\end{bmatrix} = X B \]

Thus, we can simplify the equation above to the compact form below.

\[ Y = XB \]

We can multiply both sides on the left by the transpose of X and get the following.

\[ X^TY = X^TXB \]

Since X^T X is a square matrix, we can calculate its inverse and multiply both sides by it on the left. (This assumes X^T X is invertible, which holds when X has full column rank.)

\[ (X^T X)^{-1} X^TY =(X^T X)^{-1} X^T X B\]

Since (X^T X)^(-1) X^T X is the identity matrix, we can write the equation as follows.

\[ (X^T X)^{-1} X^TY = B\]

Swapping the left-hand and right-hand sides gives the equation below. Using this formula, we can calculate the regression coefficients of the linear model.

\[B =(X^TX)^{-1}X^TY\]

Where,

\[ B = \begin{bmatrix} b_0\\ b_1\\ b_2\end{bmatrix} \]

\[ X= \begin{bmatrix} 1& x_{11} & x_{21} \\ 1 & x_{12} & x_{22} \\ 1 & x_{13} & x_{23} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{bmatrix} \]

\[ Y= \begin{bmatrix} y_{11} \\ y_{12} \\ y_{13} \\ \vdots \\ y_{1n} \end{bmatrix} \]
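
Before turning to a concrete example, the formula above can also be written directly as a short Numpy sketch. This is only a compact illustration, assuming X already contains a leading column of ones; the helper name estimate_coefficients is illustrative, and the same calculation is broken into individual steps in Section 3.

import numpy as np

def estimate_coefficients(X, Y):
    # Normal equation: B = (X^T X)^(-1) X^T Y
    return np.linalg.inv(X.T @ X) @ X.T @ Y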

2. Sample Data for Linear regression

The following is a linear regression model, with price and household income as the independent variables (IVs) and purchase intention as the dependent variable (DV).

\[f(x)=b_0 +b_1 \times Price+b_2 \times Household \ Income \]

The following is the hypothetical data, with purchase intention as the DV and price and household income as the IVs.

Prices   Household Income   Purchase Intention
5        7                  7
6        5                  6
7        4                  5
8        6                  5
9        3                  3
10       3                  4

Data for Linear Regression Model

3. Steps of Doing Linear Regression with Python Numpy

Below are the six steps of using Numpy to estimate the regression coefficients in a linear regression model.

Step 1: Prepare the X matrix and Y vector

# Generate the X matrix: a column of ones (intercept term), prices, and household income
import numpy as np
X_rawdata = np.array([np.ones(6),[5,6,7,8,9,10], [7,5,4,6,3,3]])
X_matrix=X_rawdata.T
print("X Matrix:\n", X_matrix)

Output:

X Matrix:
 [[ 1.  5.  7.]
 [ 1.  6.  5.]
 [ 1.  7.  4.]
 [ 1.  8.  6.]
 [ 1.  9.  3.]
 [ 1. 10.  3.]]

# Generate the Y vector: purchase intention
Y_rawdata = np.array([[7,6,5,5,3,4]])
Y_vector=Y_rawdata.T
print("Y Vector:\n",Y_vector)

Output:

Y Vector:
 [[7]
 [6]
 [5]
 [5]
 [3]
 [4]]
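
As a quick optional check, not part of the original steps, we can confirm that the shapes match the matrices defined in Section 1: X should be 6 x 3 and Y should be 6 x 1.

# Optional sanity check on dimensions
print(X_matrix.shape)   # (6, 3)
print(Y_vector.shape)   # (6, 1)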

Step 2: Calculate X^T and X^T X

The following Python code calculates X^T.

# calculates X^T 
X_matrix_T=X_matrix.transpose()
print("X Matrix Transpose:\n",X_matrix_T)

Output:

X Matrix Transpose:
 [[ 1.  1.  1.  1.  1.  1.]
 [ 5.  6.  7.  8.  9. 10.]
 [ 7.  5.  4.  6.  3.  3.]]

The following Python code calculates X^T X.

# calculates X^T X 
X_T_X=np.matmul(X_matrix_T,X_matrix)
print(X_T_X)

Output:

[[  6.  45.  28.]
 [ 45. 355. 198.]
 [ 28. 198. 144.]]

Step 3: Calculate (X^T X)^(-1)

The following Python code calculates (X^T X)^(-1).

# calculates (X^T X)^(-1) 
X_T_X_Inv=np.linalg.inv(X_T_X) 
print(X_T_X_Inv)

Output:

[[22.23134328 -1.74626866 -1.92164179]
 [-1.74626866  0.14925373  0.13432836]
 [-1.92164179  0.13432836  0.19589552]]
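
As an optional check that is not in the original tutorial flow, we can multiply the inverse back against X^T X and confirm that the product is (numerically) the identity matrix.

# Optional check: (X^T X)^(-1) (X^T X) should be (numerically) the identity matrix
print(np.allclose(X_T_X_Inv @ X_T_X, np.eye(3)))   # expected: True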

Step 4: Calculate (X^T X)^(-1) X^T Y

The following code calculates (X^T X)^(-1) X^T Y.

# calculates (X^T X)^(-1) X^T Y
X_T_X_Inv @ X_matrix_T @ Y_vector

Output:

array([[ 6.73880597],
       [-0.44776119],
       [ 0.34701493]])
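
Going one step beyond the calculation itself, the coefficient vector can be stored in a variable and used to compute fitted values and residuals. This is a small optional sketch; B_vector, Y_fitted, and residuals are illustrative names, not part of the original steps.

# Store the coefficients and compute fitted values and residuals
B_vector = X_T_X_Inv @ X_matrix_T @ Y_vector
Y_fitted = X_matrix @ B_vector
residuals = Y_vector - Y_fitted
print("Fitted values:\n", Y_fitted)
print("Residuals:\n", residuals)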

Step 5: Write out the linear regression model

We can see b₀ = 6.74, b₁ = -0.45, and b₂ = 0.35. We can write the estimated regression function below.

\[ f(x)=b_0 +b_1 x_1+b_2 x_2=6.74-0.45 \times Price+0.35 \times Household \ Income \]
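
As a quick worked example with the rounded coefficients, the third observation (Price = 7, Household Income = 4) has a predicted purchase intention of

\[ f(7, 4) = 6.74 - 0.45 \times 7 + 0.35 \times 4 = 4.99, \]

which is close to the observed value of 5.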

Step 6: Use numpy.linalg.lstsq to verify

We can use the Numpy function numpy.linalg.lstsq to verify our calculation above. Below is the Python code for the linear regression model.

# Use numpy.linalg.lstsq to verify 
results=np.linalg.lstsq(X_matrix, Y_vector, rcond=None)[0]
print(results)

Output:

[[ 6.73880597]
 [-0.44776119]
 [ 0.34701493]]

As we can see, the result is exactly the same as the one from the matrix calculation method shown above. Thus, we know that the matrix method was applied correctly.
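
As a side note, explicitly forming (X^T X)^(-1) is fine for a small example like this one, but solving the normal equations directly is generally preferred for numerical stability. Below is a minimal sketch of that alternative using np.linalg.solve; B_solve is an illustrative name.

# Alternative: solve (X^T X) B = X^T Y without explicitly forming the inverse
B_solve = np.linalg.solve(X_T_X, X_matrix_T @ Y_vector)
print(B_solve)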

4. Conclusion

This tutorial shows how you can conduct linear regression using Python Numpy from scratch, without relying on any built-in regression function. Further, we used numpy.linalg.lstsq to verify our matrix calculation, and the result confirmed that our Numpy code was correct.


Further Reading