How to Conduct Multiple Linear Regression in Python

Multiple linear regression is a linear model assessing the relationship between a dependent variable (DV, or Y) and multiple independent variables (IVs, or X). This tutorial shows how to use NumPy, sklearn, and statsmodels in Python to conduct multiple linear regression.

Method 1: NumPy

np.linalg.lstsq(x_matrix, y_matrix, rcond=None)[0]

Method 2: Sklearn

lm.fit(IVs, DV)

Method 3: statsmodels

sm.OLS(y_matrix, x_matrix).fit()


Sample Data

The following is the linear regression model, with price and household income as the IVs and purchase intention as the DV.

f(x) = b0 + b1 × Price + b2 × Household Income

The following is the hypothetical data, with purchase intention as the DV and price and household income as the IVs.

Purchase Intention    Price    Household Income
7                     5        7
6                     6        5
5                     7        4
5                     8        6
3                     9        3
4                     10       3

Data for linear regression in Python
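
If you prefer to inspect the data as a table in Python, the same values can be placed in a pandas DataFrame first; a minimal sketch, assuming pandas is installed (the column names are illustrative):

import pandas as pd

# Hypothetical data from the table above
data = pd.DataFrame({
    "purchase_intention": [7, 6, 5, 5, 3, 4],
    "price": [5, 6, 7, 8, 9, 10],
    "household_income": [7, 5, 4, 6, 3, 3],
})
print(data)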

The following Python code generates the hypothetical data and then converts it into the appropriate format.

import numpy as np

# X matrix with a leading column of 1s for the intercept term
X_rawdata = np.array([np.ones(6), [5, 6, 7, 8, 9, 10], [7, 5, 4, 6, 3, 3]])
X_matrix = X_rawdata.T
print("X Matrix:\n", X_matrix)

# X matrix without the 1s column (sklearn adds the intercept itself)
X_rawdata_no_1 = np.array([[5, 6, 7, 8, 9, 10], [7, 5, 4, 6, 3, 3]])
X_matrix_no_1 = X_rawdata_no_1.T
print("X Matrix without 1s:\n", X_matrix_no_1)

# Y vector of purchase intention scores
Y_rawdata = np.array([[7, 6, 5, 5, 3, 4]])
Y_vector = Y_rawdata.T
print("Y Vector:\n", Y_vector)

Output:

X Matrix:
 [[ 1.  5.  7.]
 [ 1.  6.  5.]
 [ 1.  7.  4.]
 [ 1.  8.  6.]
 [ 1.  9.  3.]
 [ 1. 10.  3.]]

X Matrix without 1s:
 [[ 5  7]
 [ 6  5]
 [ 7  4]
 [ 8  6]
 [ 9  3]
 [10  3]]

Y Vector:
 [[7]
 [6]
 [5]
 [5]
 [3]
 [4]]

Method 1 – Use numpy.linalg.lstsq

NumPy provides the function numpy.linalg.lstsq, which returns the least-squares solution to a linear matrix equation. The code and its output are shown below; the first value is the intercept, followed by the coefficients for price and household income.

results_1 = np.linalg.lstsq(X_matrix, Y_vector, rcond=None)[0]
print(results_1)

Output:

[[ 6.73880597]
 [-0.44776119]
 [ 0.34701493]]
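
As a quick cross-check (not one of the three methods in this tutorial), the same coefficients can be computed directly from the normal equations b = (XᵀX)⁻¹Xᵀy; a minimal sketch:

# Ordinary least squares via the normal equations: b = (X'X)^(-1) X'y
b = np.linalg.inv(X_matrix.T @ X_matrix) @ X_matrix.T @ Y_vector
print(b)  # should match results_1 above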

Method 2 – Use sklearn.linear_model

We can also use sklearn.linear_model to conduct multiple linear regression analysis. Note that sklearn does not need the column of 1s in the X matrix, as it estimates the intercept by default.

from sklearn.linear_model import LinearRegression

lm = LinearRegression()
results_2 = lm.fit(X_matrix_no_1, Y_vector)
print("results_2 is as follows:")
print(results_2.intercept_)
print(results_2.coef_)

Output:

results_2 is as follows:
[6.73880597]
[[-0.44776119  0.34701493]]
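
The fitted sklearn model can also produce predictions and an R-squared; a short sketch (the R-squared should match the value statsmodels reports in Method 3):

# Predicted purchase intention for the six observations
print(lm.predict(X_matrix_no_1))

# R-squared of the fit (about 0.884, matching the statsmodels output below)
print(lm.score(X_matrix_no_1, Y_vector))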

Method 3 – Use statsmodels.api

The problem with NumPy and sklearn is that they do not report p-values (see the Stack Overflow discussion here). Thus, if we want p-values, we can use statsmodels, which includes them in its output.

Note that we need to include a column of 1s (the intercept column) in the X matrix, as this package does not add it by default. In this respect, statsmodels is the same as NumPy in requiring the intercept column; a short sketch using sm.add_constant follows.
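
If you start from the matrix without 1s, statsmodels provides sm.add_constant to prepend the intercept column; a minimal sketch, which produces a matrix equivalent to X_matrix built above:

import statsmodels.api as sm

# Prepend a column of 1s (the intercept) to the predictor matrix
X_with_const = sm.add_constant(X_matrix_no_1)
print(X_with_const)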

import statsmodels.api as sm

results_3 = sm.OLS(Y_vector, X_matrix)
results_3_values = results_3.fit()
print(results_3_values.summary())

Output:
OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.884
Model:                            OLS   Adj. R-squared:                  0.807
Method:                 Least Squares   F-statistic:                     11.47
Date:                Tue, 07 Jun 2022   Prob (F-statistic):             0.0393
Time:                        18:45:47   Log-Likelihood:                -3.5751
No. Observations:                   6   AIC:                             13.15
Df Residuals:                       3   BIC:                             12.53
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.7388      2.928      2.302      0.105      -2.579      16.056
x1            -0.4478      0.240     -1.867      0.159      -1.211       0.316
x2             0.3470      0.275      1.263      0.296      -0.528       1.222
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.144
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.090
Skew:                          -0.169   Prob(JB):                        0.956
Kurtosis:                       2.506   Cond. No.                         104.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

As we can see, the regression coefficients are the same as those from the first two methods. Note that the p-values for b1 = -0.45 (p = 0.159) and b2 = 0.35 (p = 0.296) do not reach significance, as both are greater than 0.05.
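
If you want the coefficients and p-values as arrays rather than a printed summary, the fitted statsmodels results object exposes them directly; a minimal sketch:

# Regression coefficients (const, x1, x2) and their p-values
print(results_3_values.params)
print(results_3_values.pvalues)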


Further Reading