How to Simulate Data For Linear Regression in R

This tutorial shows how to simulate data for a linear regression model in R, including the independent variable (IV) and the dependent variable (DV).

Suppose we want to simulate data from the following linear model, in which both X and the error term are assumed to be normally distributed.

\[ Y=2+3X+ \epsilon \]

\[ X \sim N(2,1^2) \]

\[ \epsilon \sim N(0,5^2) \]
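
Before simulating, it can help to work out what this model implies for Y, since those values give a rough target for the simulated data. The short derivation below assumes that X and the error term are independent, which the model statement above does not say explicitly but which is the usual assumption for this kind of simulation.

\[ E[Y] = 2 + 3E[X] + E[\epsilon] = 2 + 3(2) + 0 = 8 \]

\[ Var(Y) = 3^2 Var(X) + Var(\epsilon) = 9(1^2) + 5^2 = 34 \]

\[ R^2_{population} = \frac{3^2 Var(X)}{Var(Y)} = \frac{9}{34} \approx 0.265 \]

Under that assumption, X explains only about a quarter of the variance of Y, so a fitted model with a fairly low R-squared would not be a sign that the simulation went wrong.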

Simulate Data For Linear Regression in R

Step 1: Simulate Independent Variables (X)

To simulate data for linear regression, we first simulate the independent variable (X) using rnorm() to generate a sample of 200 data points with mean = 2 and sd = 1.

\[ X \sim N(2,1^2) \]

# Set seed 
set.seed(1)

# sample size = 200, mean = 2, sd = 1 
X <- rnorm(200, 2, 1)

# print out data
print(X)

# plot the histogram of X
hist(X)

Output:

  [1]  1.37354619  2.18364332  1.16437139
  [4]  3.59528080  2.32950777  1.17953162
  [7]  2.48742905  2.73832471  2.57578135
 [10]  1.69461161  3.51178117  2.38984324
 [13]  1.37875942 -0.21469989  3.12493092
...
[190]  1.07389050  1.82289604  2.40201178
[193]  1.26825183  2.83037317  0.79191721
[196]  0.95201559  3.44115771  0.98415253
[199]  2.41197471  1.61892395
Figure: Histogram of X (simulated independent variable)
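
Besides looking at the histogram, a quick numeric check is to compare the sample mean and standard deviation of X with the parameters passed to rnorm(). The sketch below regenerates X with the same seed so that it runs on its own; the names x_mean and x_sd are illustrative and not part of the original tutorial.

```r
# Regenerate X exactly as in Step 1 so this snippet is self-contained
set.seed(1)
X <- rnorm(200, mean = 2, sd = 1)

# Sample statistics should be near the population values (mean 2, sd 1),
# but will not match them exactly because the sample is finite
x_mean <- mean(X)
x_sd <- sd(X)

print(c(mean = x_mean, sd = x_sd))
```

With only 200 draws, some deviation from 2 and 1 is expected; a larger sample size would generally bring the sample statistics closer to the population values.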

Step 2: Simulate error term in linear model

\[ \epsilon \sim N(0,5^2) \]

We use rnorm() to simulate a sample of 200 data points, with mean = 0 and sd = 5, for the error term.

# Set seed for error term
set.seed(2)

# simulate error term
error_term <- rnorm(200, 0, 5)

# print it out
print(error_term)

# plot the histogram of error_term
hist(error_term)

Output:

> print(error_term)
  [1]  -4.48457273   0.92424592   7.93922666
  [4]  -5.65187837  -0.40125878   0.66210142
  [7]   3.53977365  -1.19849012   9.92236968
 [10]  -0.69393506   2.08825375   4.90876389
 [13]  -1.96347678  -5.19834488   8.91114480
...
[196]   3.50283898  -2.21379757  -3.94259983
[199]  -4.28387855  -3.73209507
Figure: Histogram of the simulated error term
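
The same kind of numeric check can be applied to the error term. The sketch below also looks at the sample correlation between X and the error term, on the assumption (standard for linear regression, though not stated in this tutorial) that the two should be unrelated. The names err_mean, err_sd, and x_err_cor are illustrative.

```r
# Regenerate X and the error term with the same seeds used in Steps 1 and 2
set.seed(1)
X <- rnorm(200, mean = 2, sd = 1)
set.seed(2)
error_term <- rnorm(200, mean = 0, sd = 5)

# The error term should have a sample mean near 0 and a sample sd near 5
err_mean <- mean(error_term)
err_sd <- sd(error_term)

# X and the error term were drawn separately, so their sample
# correlation should be close to 0
x_err_cor <- cor(X, error_term)

print(c(mean = err_mean, sd = err_sd, cor_with_X = x_err_cor))
```

A sample correlation far from 0 would suggest that the two vectors were not generated independently, which would bias the slope estimate later on.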

Step 3: Simulate dependent variable Y

We can then combine X and the error term to simulate dependent variable Y for linear models.

\[ Y=2+3X+ \epsilon \]

# simulate Y by combining X and error term
Y <- 2 + 3 * X + error_term

# print out Y
print(Y)

Output:

  [1]  1.63606583  9.47517590 13.43234082
  [4]  7.13396404  8.58726453  6.20069627
  [7] 13.00206080  9.01648399 19.64971374
 [10]  6.38989978 14.62359726 14.07829360
 [13]  4.17280148 -3.84244455 20.28593756
...
[196]  8.35888574 10.10967555  1.00985777
[199]  4.95204559  3.12467677
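
For later analysis it is often convenient to keep X and Y together in a data frame and to look at them in a scatter plot. The sketch below does both and overlays the true line Y = 2 + 3X; the data frame name sim_data is an illustrative choice, not something the tutorial defines.

```r
# Regenerate the simulated data from Steps 1 to 3
set.seed(1)
X <- rnorm(200, mean = 2, sd = 1)
set.seed(2)
error_term <- rnorm(200, mean = 0, sd = 5)
Y <- 2 + 3 * X + error_term

# Collect the variables in one data frame (sim_data is a hypothetical name)
sim_data <- data.frame(X = X, Y = Y)
print(head(sim_data))

# Scatter plot of Y against X, with the true regression line in red
plot(Y ~ X, data = sim_data, main = "Simulated Y against X")
abline(a = 2, b = 3, col = "red", lwd = 2)
```

Because the error standard deviation (5) is large relative to the spread of 3X, the points should scatter widely around the red line rather than hug it.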

Step 4: Verify the simulated data

We can check whether the data were simulated correctly by fitting a linear regression in R. The estimated intercept is 1.94 and the estimated regression coefficient for X is 3.03.

Both estimates are close to the true values of 2 and 3 in the model, and the residual standard error (5.384) is close to the true error standard deviation of 5, so the simulated data are consistent with the model we specified.

# fit the linear model of Y on X
estimated_coefficients <- lm(Y~X)

# summarize the output
summary(estimated_coefficients)

Output:

> summary(estimated_coefficients)

Call:
lm(formula = Y ~ X)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.2522  -4.1781  -0.1083   3.8856  10.5735 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.9388     0.9187   2.110   0.0361 *  
X             3.0283     0.4108   7.372 4.47e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.384 on 198 degrees of freedom
Multiple R-squared:  0.2154,	Adjusted R-squared:  0.2114 
F-statistic: 54.35 on 1 and 198 DF,  p-value: 4.469e-12
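
"Close to the true values" can be made a little more concrete by checking whether the 95% confidence intervals from the fitted model contain the true intercept (2) and slope (3). The sketch below is one way to do that with confint(); the names fit, ci, covers_intercept, and covers_slope are illustrative.

```r
# Regenerate the simulated data from Steps 1 to 3
set.seed(1)
X <- rnorm(200, mean = 2, sd = 1)
set.seed(2)
error_term <- rnorm(200, mean = 0, sd = 5)
Y <- 2 + 3 * X + error_term

# Fit the same model as in Step 4
fit <- lm(Y ~ X)

# 95% confidence intervals for the intercept and the slope
ci <- confint(fit, level = 0.95)
print(ci)

# Check whether each interval contains the true parameter value
covers_intercept <- ci["(Intercept)", 1] <= 2 && 2 <= ci["(Intercept)", 2]
covers_slope <- ci["X", 1] <= 3 && 3 <= ci["X", 2]
print(c(intercept = covers_intercept, slope = covers_slope))
```

Note that a 95% interval is expected to miss the true value in roughly 1 out of 20 simulated data sets, so an occasional miss with a different seed would not by itself mean the simulation is wrong.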

Further Reading