This tutorial shows how you can simulate data for linear regression models in R, including the independent variable (IV) and the dependent variable (DV).
Suppose the following is the linear model we are going to simulate, with X and the error term each assumed to follow a normal distribution.
\[ Y=2+3X+ \epsilon \]
\[ X \sim N(2,1^2) \]
\[ \epsilon \sim N(0,5^2) \]
Simulate Data For Linear Regression in R
Step 1: Simulate the Independent Variable (X)
To simulate data for linear regression, we first simulate the independent variable (X) using rnorm() to generate a sample of 200 data points. The parameters are mean = 2 and sd = 1.
\[ X \sim N(2,1^2) \]
# Set seed
set.seed(1)
# sample size = 200, mean = 2, sd = 1
X <- rnorm(200, 2, 1)
# print out data
print(X)
# plot the histogram of X
hist(X)
Output:
  [1]  1.37354619  2.18364332  1.16437139
  [4]  3.59528080  2.32950777  1.17953162
  [7]  2.48742905  2.73832471  2.57578135
 [10]  1.69461161  3.51178117  2.38984324
 [13]  1.37875942 -0.21469989  3.12493092
 ...
[190]  1.07389050  1.82289604  2.40201178
[193]  1.26825183  2.83037317  0.79191721
[196]  0.95201559  3.44115771  0.98415253
[199]  2.41197471  1.61892395
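Before moving on, it can help to sanity-check the simulated X against the distribution we specified. A minimal sketch (re-using the same seed so the data match the output above): the sample mean and standard deviation should be close to the parameters 2 and 1, but not exactly equal, since they are estimates from 200 draws.

```r
# re-simulate X with the same seed as above
set.seed(1)
X <- rnorm(200, 2, 1)

# sample statistics should approximate the specified parameters
mean(X)  # close to the specified mean of 2
sd(X)    # close to the specified sd of 1
```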

Step 2: Simulate the Error Term
\[ \epsilon \sim N(0,5^2) \]
We use rnorm() to simulate a sample of 200 data points, with mean = 0 and sd = 5, for the error term.
# Set seed for error term
set.seed(2)
# simulate error term
error_term <- rnorm(200, 0, 5)
# print it out
print(error_term)
# plot the histogram of error_term
hist(error_term)
Output:
> print(error_term)
  [1] -4.48457273  0.92424592  7.93922666
  [4] -5.65187837 -0.40125878  0.66210142
  [7]  3.53977365 -1.19849012  9.92236968
 [10] -0.69393506  2.08825375  4.90876389
 [13] -1.96347678 -5.19834488  8.91114480
 ...
[196]  3.50283898 -2.21379757 -3.94259983
[199] -4.28387855 -3.73209507

Step 3: Simulate the Dependent Variable (Y)
We can then combine X and the error term to simulate the dependent variable Y.
\[ Y=2+3X+ \epsilon \]
# simulate Y by combining X and error term
Y <- 2 + 3*X + error_term
# print out Y
print(Y)
Output:
  [1]  1.63606583  9.47517590 13.43234082
  [4]  7.13396404  8.58726453  6.20069627
  [7] 13.00206080  9.01648399 19.64971374
 [10]  6.38989978 14.62359726 14.07829360
 [13]  4.17280148 -3.84244455 20.28593756
 ...
[196]  8.35888574 10.10967555  1.00985777
[199]  4.95204559  3.12467677
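Because X and the error term are drawn independently, the model implies that Y has mean 2 + 3 × 2 = 8 and variance 3² × 1² + 5² = 34. As a quick sketch (re-using the same seeds so the data match the output above), the sample mean and variance of the simulated Y should be close to these theoretical values:

```r
# re-simulate the same data with the same seeds as above
set.seed(1)
X <- rnorm(200, 2, 1)
set.seed(2)
error_term <- rnorm(200, 0, 5)
Y <- 2 + 3*X + error_term

mean(Y)  # theoretical mean: 2 + 3*2 = 8
var(Y)   # theoretical variance: 3^2 * 1^2 + 5^2 = 34
```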
Step 4: Verify the Simulated Data
We can test whether we simulated the data correctly by running a linear regression in R. The estimated intercept is 1.94 and the estimated regression coefficient is 3.03.
Since both estimates are close to the true values stated earlier (2 and 3, respectively), we can conclude that the data were simulated correctly.
# fit a linear regression of Y on X
estimated_coefficients <- lm(Y ~ X)
# summarize the output
summary(estimated_coefficients)
Output:
> summary(estimated_coefficients)

Call:
lm(formula = Y ~ X)

Residuals:
     Min       1Q   Median       3Q      Max
-12.2522  -4.1781  -0.1083   3.8856  10.5735

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.9388     0.9187   2.110   0.0361 *
X             3.0283     0.4108   7.372 4.47e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.384 on 198 degrees of freedom
Multiple R-squared:  0.2154,	Adjusted R-squared:  0.2114
F-statistic: 54.35 on 1 and 198 DF,  p-value: 4.469e-12
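Beyond reading the summary table, the check can also be done programmatically by extracting the coefficients and their 95% confidence intervals with coef() and confint(). The sketch below re-simulates the same data (same seeds as above) so the fitted values match the summary output; the names fit and true_values are introduced here for illustration.

```r
# re-simulate the same data with the same seeds as above
set.seed(1)
X <- rnorm(200, 2, 1)
set.seed(2)
error_term <- rnorm(200, 0, 5)
Y <- 2 + 3*X + error_term

# fit the model and extract estimates and 95% confidence intervals
fit <- lm(Y ~ X)
estimates <- coef(fit)
ci <- confint(fit, level = 0.95)
print(estimates)
print(ci)

# check that the true values (intercept = 2, slope = 3) fall inside the intervals
true_values <- c(2, 3)
all(ci[, 1] < true_values & true_values < ci[, 2])
```

With these seeds, both true values fall inside their intervals, which is a slightly stronger check than eyeballing the point estimates.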