This tutorial shows how to simulate a dataset for logistic regression in R. Logistic regression models the probability that Y = 1 with the logistic function, which is the inverse of the logit link function.

\[ \Pr\{Y=1 \mid X\} = \frac{1}{1+e^{-X\beta}} \]
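As a quick illustration of the formula above, the sketch below writes the logistic function out by hand and compares it with base R's `plogis()`, which computes the same quantity. The helper name `inv_logit` and the example values in `eta` are assumptions made for this illustration only.

```r
# Hypothetical helper: the logistic (inverse logit) function from the formula
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Example values of the linear predictor (chosen only for illustration)
eta <- c(-2, 0, 2)

inv_logit(eta)           # probabilities from the hand-written formula
plogis(eta)              # base R equivalent of the same function
qlogis(inv_logit(eta))   # the logit maps the probabilities back to eta
```

A linear predictor of 0 corresponds to a probability of 0.5, and larger values push the probability toward 1.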

The steps below simulate such a dataset in R.

## Step 1: Generate Xs

Suppose the logistic regression has two predictors, `X_1` and `X_2`. The following R code generates them.

```
# set the sample size
n <- 200
# set the seed for reproducibility
set.seed(123)
# simulate X_1 (normal distribution; mean = 1, SD = 1)
X_1 <- rnorm(n, mean = 1, sd = 1)
# print the first 6 values of X_1
head(X_1)
# simulate X_2 (normal distribution; mean = 2, SD = 1)
X_2 <- rnorm(n, mean = 2, sd = 1)
# print the first 6 values of X_2
head(X_2)
```

    > head(X_1)
    [1] 0.4395244 0.7698225 2.5587083
    [4] 1.0705084 1.1292877 2.7150650
    > head(X_2)
    [1] 4.198810 3.312413 1.734855 2.543194
    [5] 1.585660 1.523753

## Step 2: Generate *Xβ*

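Next, compute the linear predictor *Xβ*. The true coefficient values used in the simulation are an intercept of 0.5, a slope of 0.3 for `X_1`, and a slope of 0.8 for `X_2`:

\[ X\beta = 0.5 + 0.3 X_1 + 0.8 X_2 \]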

```
# linear predictor with the true coefficients
xb <- 0.5 + 0.3 * X_1 + 0.8 * X_2
head(xb)
```

    > head(xb)
    [1] 3.990906 3.380877 2.655496 2.855708
    [5] 2.107314 2.533522
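The notation *Xβ* is a matrix product, so the same linear predictor can also be computed from a design matrix and a coefficient vector. The following is a minimal sketch of that idea; the names `X`, `beta`, `xb_matrix`, and `xb_manual` are assumptions introduced here, and the predictors are regenerated so that the block runs on its own.

```r
# Regenerate the predictors exactly as in Step 1
set.seed(123)
n <- 200
X_1 <- rnorm(n, 1, 1)
X_2 <- rnorm(n, 2, 1)

# Design matrix with a leading column of ones for the intercept
X <- cbind(Intercept = 1, X_1 = X_1, X_2 = X_2)

# True coefficients: intercept, slope of X_1, slope of X_2
beta <- c(0.5, 0.3, 0.8)

# Matrix product versus the term-by-term formula
xb_matrix <- as.vector(X %*% beta)
xb_manual <- 0.5 + 0.3 * X_1 + 0.8 * X_2
all.equal(xb_matrix, xb_manual)
```

The matrix form makes it easier to add more predictors later, since only `X` and `beta` need to change.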

## Step 3: Generate p

Apply the logistic function to the linear predictor to obtain p, the probability that Y = 1 for each observation.

```
p <- 1/(1 + exp(-xb))
head(p)
```

    > head(p)
    [1] 0.9818525 0.9671015 0.9343490
    [4] 0.9456130 0.8916121 0.9264587

## Step 4: Generate binary Y and combine into a data frame

Draw each binary Y from a Bernoulli distribution with success probability p, then combine `X_1`, `X_2`, and `Y` into a data frame.

```
# generate Y: one Bernoulli draw per observation with probability p
Y <- rbinom(n, size = 1, prob = p)
head(Y)
# combine X_1, X_2, and Y into a data frame
df <- data.frame(X_1 = X_1, X_2 = X_2, Y = Y)
head(df)
```

    > head(Y)
    [1] 1 1 1 1 1 0
    > head(df)
            X_1      X_2 Y
    1 0.4395244 4.198810 1
    2 0.7698225 3.312413 1
    3 2.5587083 1.734855 1
    4 1.0705084 2.543194 1
    5 1.1292877 1.585660 1
    6 2.7150650 1.523753 0
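Before fitting a model, it can help to check the balance between zeros and ones in the simulated outcome. The probabilities printed in Step 3 are mostly close to 1, so Y is expected to be dominated by ones. The sketch below repeats Steps 1 to 4 so that it runs on its own, then compares the observed share of ones with the average of p; the names `share_ones` and `mean_p` are assumptions introduced here.

```r
# Repeat Steps 1 to 4 with the same seed
set.seed(123)
n <- 200
X_1 <- rnorm(n, 1, 1)
X_2 <- rnorm(n, 2, 1)
xb <- 0.5 + 0.3 * X_1 + 0.8 * X_2
p <- 1 / (1 + exp(-xb))
Y <- rbinom(n, size = 1, prob = p)

# Counts of zeros and ones in the simulated outcome
table(Y)

# Observed share of ones versus the average true probability
share_ones <- mean(Y)
mean_p <- mean(p)
c(share_ones = share_ones, mean_p = mean_p)
```

If a more balanced outcome is wanted, one option is to lower the intercept used in Step 2, which shifts all probabilities downward.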

## Step 5 (optional): Run a logistic regression using simulated data

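This optional step fits a logistic regression to the simulated data. Based on the output, the estimated linear predictor is as follows. The estimates differ from the true values in Step 2 because of sampling variability: with n = 200 the standard errors are large, and each true coefficient lies within two standard errors of its estimate.

\[ X\hat{\beta} = 0.95 + 0.51 X_1 + 0.44 X_2 \]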

```
# run logistic regression using the simulated data
results <- glm(Y ~ X_1 + X_2, data = df, family = "binomial")
summary(results)
```

    > summary(results)

    Call:
    glm(formula = Y ~ X_1 + X_2, family = "binomial", data = df)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -2.4665   0.3153   0.4078   0.4879   0.7932

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   0.9491     0.5517   1.720   0.0854 .
    X_1           0.5063     0.2816   1.798   0.0721 .
    X_2           0.4391     0.2513   1.747   0.0806 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 130.03  on 199  degrees of freedom
    Residual deviance: 123.80  on 197  degrees of freedom
    AIC: 129.8

    Number of Fisher Scoring iterations: 5
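One way to check that the simulation behaves as intended is to repeat it with a much larger sample, where the estimates should land close to the true coefficients. The sketch below wraps Steps 1 to 4 in a function and refits the model; the function name `simulate_logit`, its arguments, and the sample size of 100,000 are assumptions chosen for this illustration, not part of the original steps.

```r
# Hypothetical helper that wraps Steps 1 to 4
simulate_logit <- function(n, beta = c(0.5, 0.3, 0.8)) {
  X_1 <- rnorm(n, 1, 1)                            # Step 1: predictors
  X_2 <- rnorm(n, 2, 1)
  xb <- beta[1] + beta[2] * X_1 + beta[3] * X_2    # Step 2: linear predictor
  p <- 1 / (1 + exp(-xb))                          # Step 3: probabilities
  Y <- rbinom(n, size = 1, prob = p)               # Step 4: binary outcome
  data.frame(X_1 = X_1, X_2 = X_2, Y = Y)
}

# Simulate a large dataset and refit the model
set.seed(123)
big_df <- simulate_logit(100000)
big_fit <- glm(Y ~ X_1 + X_2, data = big_df, family = "binomial")

# Estimates next to the true values used in the simulation
cbind(estimate = coef(big_fit), truth = c(0.5, 0.3, 0.8))
```

With a sample this large, the standard errors shrink considerably, so the gap between estimates and true values seen with n = 200 should mostly disappear.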