Multiple linear regression is a linear model assessing the relationship between a dependent variable (DV, or Y) and multiple independent variables (IVs, or X). This tutorial shows how to use NumPy, Sklearn, and Statsmodels in Python to do multiple linear regression.
Method 1: Numpy
np.linalg.lstsq(x_matrix, y_matrix, rcond=None)[0]
Method 2: Sklearn
lm.fit(IVs, DV)
Method 3: statsmodels
sm.OLS(y_matrix, x_matrix)
Sample Data
The following is the linear regression model, including price and household income as IVs and purchase intention as the DV.
f(x) = b₀ + b₁ × Price + b₂ × Household Income
The following is the hypothetical data, including purchase intention as the DV and prices and household income as IVs.
| Purchase Intention | Prices | Household Income |
|---|---|---|
| 7 | 5 | 7 |
| 6 | 6 | 5 |
| 5 | 7 | 4 |
| 5 | 8 | 6 |
| 3 | 9 | 3 |
| 4 | 10 | 3 |
The following Python code generates the hypothetical data and then changes it into the appropriate format.
import numpy as np

# X matrix with a leading column of 1s for the intercept
X_rawdata = np.array([np.ones(6), [5, 6, 7, 8, 9, 10], [7, 5, 4, 6, 3, 3]])
X_matrix = X_rawdata.T
print("X Matrix:\n", X_matrix)

# X matrix without the column of 1s (for sklearn)
X_rawdata_no_1 = np.array([[5, 6, 7, 8, 9, 10], [7, 5, 4, 6, 3, 3]])
X_matrix_no_1 = X_rawdata_no_1.T
print("X Matrix without 1s:\n", X_matrix_no_1)

# Y as a column vector
Y_rawdata = np.array([[7, 6, 5, 5, 3, 4]])
Y_vector = Y_rawdata.T
print("Y Vector:\n", Y_vector)
Output:
X Matrix:
 [[ 1.  5.  7.]
 [ 1.  6.  5.]
 [ 1.  7.  4.]
 [ 1.  8.  6.]
 [ 1.  9.  3.]
 [ 1. 10.  3.]]
X Matrix without 1s:
 [[ 5  7]
 [ 6  5]
 [ 7  4]
 [ 8  6]
 [ 9  3]
 [10  3]]
Y Vector:
 [[7]
 [6]
 [5]
 [5]
 [3]
 [4]]
Method 1 – Use numpy.linalg.lstsq
NumPy has a function numpy.linalg.lstsq
, which returns the least-squares solution to a linear matrix equation. The code and its output are below; the returned values are the intercept (b₀) followed by the two regression coefficients (b₁ and b₂), and they match the results from the other two methods shown later.
results_1=np.linalg.lstsq(X_matrix, Y_vector, rcond=None)[0]
print(results_1)
[[ 6.73880597]
 [-0.44776119]
 [ 0.34701493]]
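As a side note, for a full-rank X matrix, the least-squares solution is the same as solving the normal equations (XᵀX)b = XᵀY. The following is a minimal cross-check sketch, reusing X_matrix and Y_vector from above:
# Sketch: solve the normal equations directly.
# For full-rank X, (X'X)b = X'Y gives the same b as np.linalg.lstsq.
b = np.linalg.solve(X_matrix.T @ X_matrix, X_matrix.T @ Y_vector)
print(b)  # should match results_1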
Method 2 – Use sklearn.linear_model
We can also use sklearn.linear_model
to conduct multiple linear regression analysis. Note that sklearn does not need 1s in the X matrix, since LinearRegression fits the intercept by default.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
results_2 = lm.fit(X_matrix_no_1, Y_vector)
print("results_2 is as follows:")
print(results_2.intercept_)
print(results_2.coef_)
results_2 is as follows:
[6.73880597]
[[-0.44776119  0.34701493]]
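As a quick usage sketch for the fitted sklearn model (reusing X_matrix_no_1 and Y_vector from above), predict() generates fitted values and score() returns R²:
# Fitted values for the six observations
print(results_2.predict(X_matrix_no_1))
# R-squared of the fit (should be about 0.884, matching statsmodels below)
print(results_2.score(X_matrix_no_1, Y_vector))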
Method 3 – Use statsmodels.api
The problem with NumPy and sklearn is that they do not report p-values (see the Stack Overflow discussion here). Thus, if we want p-values, we can use statsmodels, which provides them in the output.
Note that we need to include a column of 1s for the intercept in the X matrix, as this package does not include it by default. Thus, statsmodels is the same as NumPy in requiring the intercept column (i.e., 1s).
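As a side note, if your X matrix does not already contain the 1s, statsmodels provides add_constant() to prepend the intercept column. A minimal sketch, reusing X_matrix_no_1 from above (X_with_const is just an illustrative name):
import statsmodels.api as sm
# add_constant() prepends a column of 1s, producing the same layout as X_matrix
X_with_const = sm.add_constant(X_matrix_no_1)
print(X_with_const)
The main analysis below uses the X_matrix that already includes the 1s.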
import statsmodels.api as sm

# Note the argument order: Y comes first in sm.OLS(), and X must already contain the 1s
results_3 = sm.OLS(Y_vector, X_matrix)
results_3_values = results_3.fit()
print(results_3_values.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.884
Model:                            OLS   Adj. R-squared:                  0.807
Method:                 Least Squares   F-statistic:                     11.47
Date:                Tue, 07 Jun 2022   Prob (F-statistic):             0.0393
Time:                        18:45:47   Log-Likelihood:                -3.5751
No. Observations:                   6   AIC:                             13.15
Df Residuals:                       3   BIC:                             12.53
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.7388      2.928      2.302      0.105      -2.579      16.056
x1            -0.4478      0.240     -1.867      0.159      -1.211       0.316
x2             0.3470      0.275      1.263      0.296      -0.528       1.222
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.144
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.090
Skew:                          -0.169   Prob(JB):                        0.956
Kurtosis:                       2.506   Cond. No.                         104.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
As we can see, the regression coefficients are the same as above. Note that the coefficients b₁ = -0.45 (p = .159) and b₂ = 0.35 (p = .296) do not reach significance, as both p-values are greater than 0.05.
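If you prefer to work with the numbers directly rather than reading the summary table, the fitted statsmodels results object exposes them as attributes; a minimal sketch:
# Coefficients (b0, b1, b2) and their p-values as arrays
print(results_3_values.params)   # approximately [ 6.7388, -0.4478, 0.3470]
print(results_3_values.pvalues)  # approximately [ 0.105,  0.159,  0.296]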