# ANOVA Assumptions

There are three assumptions for ANOVA:

1. Normality – The responses for each factor level have a normal population distribution.
2. Equal variances (Homogeneity of Variance) – These distributions have the same variance.
3. Independence – The data are independent.

You can use R to test the assumptions of normality and equality of variances (the following are the two tutorials). In contrast, independence is judged based on the study design.
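The tutorials use R; as a rough sketch of the same two checks in Python with SciPy (the group data below are simulated for illustration, not from any real study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Three hypothetical groups of 30 observations each.
groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (5.0, 5.5, 6.0)]

# Normality: Shapiro-Wilk on the residuals (x_ij minus the group mean),
# not on the raw pooled data.
residuals = np.concatenate([g - g.mean() for g in groups])
w_stat, p_normal = stats.shapiro(residuals)

# Equal variances: Levene's test across the groups.
lev_stat, p_var = stats.levene(*groups)

print(f"Shapiro-Wilk p = {p_normal:.3f}, Levene p = {p_var:.3f}")
```

Large p-values in both tests give no evidence against normality or against equal variances; independence still has to be argued from the design.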

Below, I provide a theoretical discussion of the ANOVA assumptions, especially why the normality assumption is needed for ANOVA.

## Short Discussion: Why ANOVA Needs the Normality Assumption

The following is the test statistic that you will often see for one-way ANOVA.

$\frac{MSB}{MSE}=\frac{\frac{SSB}{k-1}}{\frac{SSE}{n-k}}=\frac{\frac{\sum_{i=1}^kn_i(\bar{x_i}-\bar{x})^2}{k-1}}{\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\bar{x_i})^2}{n-k}} \sim F(k-1,n-k)$

It can be rewritten as follows.

$\frac{ \frac{Q_B}{k-1}}{ \frac{Q_E}{n-k}} \sim F(k-1,n-k)$

where $$Q_B$$ and $$Q_E$$ follow scaled chi-square distributions, $$\sigma^2\chi^2(k-1)$$ and $$\sigma^2\chi^2(n-k)$$ respectively (the $$\sigma^2$$ factors cancel in the ratio).

Finally, the chi-square distribution with $$d$$ degrees of freedom is the distribution of a sum of $$d$$ squared independent standard normal random variables, and that is why there is a connection between the normality assumption and the ANOVA test.
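This connection is easy to check numerically. The sketch below (an illustrative simulation only) confirms that a sum of $$d$$ squared independent standard normals has the mean $$d$$ and variance $$2d$$ of a $$\chi^2(d)$$ distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # degrees of freedom, chosen arbitrarily
# 100,000 draws of a sum of d squared standard normals.
samples = (rng.standard_normal((100_000, d)) ** 2).sum(axis=1)

print(samples.mean(), samples.var())  # approximately d and 2d
```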

## Theoretical Background: Cochran’s Theorem

However, to fully understand why the normality assumption is needed in ANOVA, we need a basic idea of Cochran’s Theorem.

Let $$X_1, X_2, \dots, X_n$$ be independent $$N(0, \sigma^2)$$-distributed random variables, and suppose that:

$\sum_{i=1}^n X_i^2 = Q_1 + Q_2 + \dots + Q_k$

where $$Q_1, Q_2, \dots, Q_k$$ are positive semi-definite quadratic forms in the random variables $$X_1, X_2, \dots, X_n$$, that is,

$Q_i = X^{\prime} A_i X, \quad i = 1, 2, \dots, k$

where $$X = (X_1, X_2, \dots, X_n)^{\prime}$$. Set $$\operatorname{rank}(A_i) = r_i$$ for $$i = 1, 2, \dots, k$$.

If

$r_1 + r_2 + \dots + r_k = n$

then,

1. $$Q_1, Q_2, \dots, Q_k$$ are independent.
2. $$Q_i \sim \sigma^2 \chi^2(r_i)$$ for $$i = 1, 2, \dots, k$$.

## Theoretical Background: Assumptions for One-way ANOVA

With Cochran’s Theorem in hand, we can further discuss the theoretical background of the assumptions for one-way ANOVA.

Let’s assume that we have a data sample $$x_{ij}$$, where

- $$i$$ runs from 1 to $$k$$; that is, we have $$k$$ groups.
- $$j$$ runs from 1 to $$n_i$$; that is, group $$i$$ has $$n_i$$ observations.

Thus, we can get the total sample size as follows.

$n=\sum_{i=1}^k n_i$

Assume that the $$i^{th}$$ group follows the normal distribution $$N(b_i, \sigma^2)$$. That is, we assume that all $$k$$ groups are mutually independent and share the same variance $$\sigma^2$$.

Side Note:
Here, you can see where the assumptions of equal variances and independence enter ANOVA.

Thus, the null hypothesis for one-way ANOVA is:

$H_0 : b_1 = b_2 = \dots = b_k$

We can rewrite $$b_i$$ as follows.

$b_i = \mu +a_i$

where,

$\mu = \frac{1}{n} \sum_{i=1}^k n_i b_i$

$a_i = b_i - \mu$

$$a_i$$ is the difference between the $$i^{th}$$ group mean and the overall mean $$\mu$$. We can also get the following.

$\sum_{i=1}^k n_i a_i =0$

With the notation above, we can write the following.

$x_{ij} = \mu +a_i + \epsilon_{ij}$

Where,

$\epsilon_{ij} \sim N(0, \sigma^2)$

Side Note:
Here, you see why there is a normality assumption for ANOVA. Further, you should also be aware that the normality test is performed on the residuals $$\hat{\epsilon}_{ij} = x_{ij}-\bar{x_i}$$ rather than on the original data sample $$x_{ij}$$.
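As a sketch, the model above can be simulated directly (all constants below are made up for illustration), including the constraint $$\sum_{i=1}^k n_i a_i = 0$$:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 10.0, 2.0                # hypothetical overall mean and sd
n_i = np.array([20, 30, 25])         # hypothetical group sizes
a = np.array([1.0, -0.5, 0.0])       # raw group effects
# Center the effects so that sum_i n_i * a_i = 0.
a = a - (n_i * a).sum() / n_i.sum()

# x_ij = mu + a_i + eps_ij with eps_ij ~ N(0, sigma^2)
x = [mu + a_i + rng.normal(0.0, sigma, size=n)
     for a_i, n in zip(a, n_i)]
print([round(g.mean(), 2) for g in x])
```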

We can then use least squares to estimate the parameters, minimizing the following sum of squared errors.

$\sum_{i=1}^k \sum_{j=1}^{n_i} \epsilon_{ij} ^2 = \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij}-\mu-a_i)^2$

Minimizing this criterion gives the least-squares estimates

$\hat{\mu} =\bar{x}$

$\hat{a_i} =\bar{x_i}-\bar{x}$

where,

$\bar{x} =\frac{1}{n}\sum_{i=1}^k \sum_{j=1}^{n_i} x_{ij}$

$\bar{x_i} =\frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}$

Plugging these estimates back in, we can get the following.

$\sum_{i=1}^k \sum_{j=1}^{n_i} \hat{\epsilon}_{ij}^2 = \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij}-\hat{\mu}-\hat{a_i})^2=\sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij}-\bar{x_i})^2$

Thus, we can get SSE as follows.

$SSE=\sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij}-\bar{x_i})^2$
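A quick numeric sanity check (simulated data, in Python rather than R) that the grand mean and the group-mean deviations are the least-squares solution, and that plugging them in yields the SSE formula above:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical group means and sizes, for illustration only.
groups = [rng.normal(m, 1.0, size=n)
          for m, n in ((4.0, 15), (5.0, 20), (6.5, 10))]
x_all = np.concatenate(groups)

mu_hat = x_all.mean()                                  # grand mean (note the 1/n)
a_hat = np.array([g.mean() for g in groups]) - mu_hat  # group-mean deviations

def sse(mu, a):
    # Sum of squared errors for candidate parameters (mu, a_1..a_k).
    return sum(((g - mu - ai) ** 2).sum() for g, ai in zip(groups, a))

sse_hat = sse(mu_hat, a_hat)
sse_formula = sum(((g - g.mean()) ** 2).sum() for g in groups)
print(np.isclose(sse_hat, sse_formula))  # True
```

Perturbing either estimate away from these values can only increase the criterion, which is what "least squares" means here.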

We now need to go back to the original idea of decomposing variance, starting with the total sum of squares $$Q_T$$.

$Q_T=\sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij}-\bar{x})^2$

We can decompose $$Q_T$$ into two components, namely between-groups (SSB) and within-groups (SSE). Note that the cross term in the expansion vanishes because $$\sum_{j=1}^{n_i}(x_{ij}-\bar{x_i})=0$$ for each group $$i$$.

\begin{aligned} Q_T &=\sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij}-\bar{x})^2 \\ &= \sum_{i=1}^k \sum_{j=1}^{n_i} [(x_{ij}-\bar{x_i})+(\bar{x_i}-\bar{x})]^2 \\ &=\sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij}-\bar{x_i})^2 +\sum_{i=1}^k \sum_{j=1}^{n_i} (\bar{x_i}-\bar{x})^2 + 2\sum_{i=1}^k (\bar{x_i}-\bar{x})\sum_{j=1}^{n_i} (x_{ij}-\bar{x_i}) \\ &= \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij}-\bar{x_i})^2 + \sum_{i=1}^k n_i (\bar{x_i}-\bar{x})^2 \\ &= SSE + SSB \\ &= Q_E +Q_B \end{aligned}
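The decomposition $$Q_T = SSE + SSB$$ is an algebraic identity, so it can be verified on any simulated data set (the numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
groups = [rng.normal(m, 1.5, size=n)
          for m, n in ((3.0, 12), (3.5, 18), (5.0, 9))]
x_all = np.concatenate(groups)
grand_mean = x_all.mean()

sst = ((x_all - grand_mean) ** 2).sum()                          # Q_T
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)           # Q_E
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) # Q_B

print(np.isclose(sst, sse + ssb))  # True: the cross term vanishes
```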

Based on Cochran’s theorem (also called the Fisher-Cochran theorem), we can get the following; note that the result for $$Q_B$$ holds under the null hypothesis.

$Q_E \sim \sigma^2 \chi ^2 (n-k)$

$Q_B \sim \sigma^2 \chi ^2 (k-1)$

Given that these two are independent and follow scaled chi-square distributions, we can write the following.

$\frac{MSB}{MSE}=\frac{\frac{SSB}{k-1}}{\frac{SSE}{n-k}} = \frac{\frac{Q_B}{k-1}}{\frac{Q_E}{n-k}} \sim F(k-1, n-k)$

Side Note:
(1) Here, you can see the connection between one-way ANOVA and Cochran’s theorem. That is, under the null hypothesis (all group means are equal), $$x_{ij}-\bar{x} \sim N(0, \sigma^2)$$. (However, you can NOT test whether $$x_{ij}-\bar{x} \sim N(0, \sigma^2)$$, as it is based on the null hypothesis. If your group means are actually not equal, $$x_{ij}-\bar{x}$$ does not follow the $$N(0, \sigma^2)$$ distribution.)

(2) Further, if $$x_{ij}-\bar{x_i} \sim N(0, \sigma^2)$$, then $$\bar{x_i}-\bar{x}$$ is also normally distributed. Thus, you only need to test whether the residuals $$\epsilon_{ij} =x_{ij}-\bar{x_i}$$ are normal. There is a discussion on Stack Exchange about this, and I will add the link below.
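Finally, a sketch (again with simulated, illustrative data) checking that the hand-computed F ratio above matches `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
groups = [rng.normal(m, 1.0, size=n)
          for m, n in ((5.0, 10), (5.2, 12), (6.0, 11))]

k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ssb / (k - 1)) / (sse / (n - k))  # MSB / MSE

f_scipy, p_value = stats.f_oneway(*groups)
print(np.isclose(f_manual, f_scipy))  # True
```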