## 1. Introduction

One-way ANOVA compares the means of several groups to test whether the differences among them are statistically significant. This tutorial shows two ways to run a one-way ANOVA in Python.

**Method 1: use scipy.stats for one-way ANOVA**

`f_oneway(level_1, level_2, level_3, …)`

**Method 2: use statsmodels for one-way ANOVA**

`sm.stats.anova_lm(model, typ=...)`, where `typ` is 1, 2, or 3

## 2. Sample Data for one-way ANOVA

Suppose we want to test whether 3 cities differ in mean household size. We sample 5 households from each city. The null and alternative hypotheses for the one-way ANOVA are as follows.

$H_{0}: \mu_{City1} = \mu_{City2} = \mu_{City3}$

$H_{1}$: At least one pair of $\mu_{City1}$, $\mu_{City2}$, and $\mu_{City3}$ is not equal.

City1 | City2 | City3 |
---|---|---|
6 | 2 | 4 |
2 | 1 | 1 |
3 | 3 | 2 |
4 | 4 | 4 |
5 | 5 | 5 |
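Since the hypotheses are statements about group means, it can help to look at the sample means before running the test. The following is a minimal sketch using pandas; the column names `City1`, `City2`, and `City3` follow the hypotheses above, and the variable names `group_means` and `grand_mean` are chosen here for illustration.

```python
import pandas as pd

# the sample data from the table above, one column per city
city_data = pd.DataFrame({
    "City1": [6, 2, 3, 4, 5],
    "City2": [2, 1, 3, 4, 5],
    "City3": [4, 1, 2, 4, 5],
})

# sample mean of each city (one value per column)
group_means = city_data.mean()

# mean over all 15 households, ignoring city membership
grand_mean = city_data.to_numpy().mean()

print(group_means)
print("grand mean:", grand_mean)
```

The sample means are 4.0, 3.0, and 3.2, with a grand mean of 3.4. One-way ANOVA asks whether differences of this size are larger than what the spread within each city would produce by chance.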

## 3. Method 1: Use scipy.stats for One-way ANOVA

We can use `scipy.stats` to conduct the one-way ANOVA. The following is a Python code example and its output. The key line of code is `f_oneway(City1, City2, City3)`.

```
# import the one-way ANOVA function from scipy
from scipy.stats import f_oneway
# enter data from scratch
City1 = [6, 2, 3, 4, 5]
City2 = [2, 1, 3, 4, 5]
City3 = [4, 1, 2, 4, 5]
# conduct one-way ANOVA analysis and print the result
print(f_oneway(City1, City2, City3))
```

The following is the output of one-way ANOVA.

`F_onewayResult(statistic=0.5454545454545454, pvalue=0.5932921944658782)`

We can see the F statistic is 0.55 and the p-value is 0.59. Since the p-value is greater than 0.05, we fail to reject the null hypothesis. Thus, the data do not provide evidence that mean household size differs among these 3 cities.
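To see where the F statistic comes from, the calculation can be reproduced by hand. The following is a minimal sketch, assuming the standard one-way ANOVA decomposition into between-group and within-group sums of squares. The variable names (`ss_between`, `ss_within`, and so on) are chosen here for illustration and are not part of scipy.

```python
import numpy as np
from scipy.stats import f

# same sample data as above
groups = [
    np.array([6, 2, 3, 4, 5], dtype=float),  # City1
    np.array([2, 1, 3, 4, 5], dtype=float),  # City2
    np.array([4, 1, 2, 4, 5], dtype=float),  # City3
]

k = len(groups)                         # number of groups
n_total = sum(len(g) for g in groups)   # total number of observations
grand_mean = np.concatenate(groups).mean()

# between-group sum of squares: group means vs. the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# within-group sum of squares: observations vs. their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1
df_within = n_total - k

# F is the ratio of the two mean squares
f_stat = (ss_between / df_between) / (ss_within / df_within)
# p-value: upper tail of the F distribution with (df_between, df_within)
p_value = f.sf(f_stat, df_between, df_within)

print("SS between:", ss_between, "SS within:", ss_within)
print("F:", f_stat, "p-value:", p_value)
```

The sums of squares come out to 2.8 (between) and 30.8 (within) with 2 and 12 degrees of freedom, which gives the same F statistic and p-value that `f_oneway` reports.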

## 4. Method 2: Use statsmodels for One-way ANOVA

We can also use statsmodels for one-way ANOVA. Note that the data structure used in statsmodels is slightly different from the one used in scipy.stats.

That is, instead of a separate column for each group, statsmodels expects the independent variable (X) in one column and the dependent variable (Y) in another. Thus, we need to use the `melt()` function to transform the data before we use statsmodels.

### Step 1 – Change the data format from wide to long

```
import pandas as pd
# enter data from scratch
City1 = [6, 2, 3, 4, 5]
City2 = [2, 1, 3, 4, 5]
City3 = [4, 1, 2, 4, 5]
# combine lists into a dataframe
city_data = pd.DataFrame(
    {'City1': City1,
     'City2': City2,
     'City3': City3
     })
# print out the wide format of the dataframe
print('wide format of dataframe: \n', city_data)
# reshape the dataframe into long format
city_data = city_data.melt(var_name="Cities",
                           value_name="Household_size")
# print out the long format of the dataframe
print('long format of dataframe: \n', city_data)
```

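wide format of dataframe:

| | City1 | City2 | City3 |
|---|---|---|---|
| 0 | 6 | 2 | 4 |
| 1 | 2 | 1 | 1 |
| 2 | 3 | 3 | 2 |
| 3 | 4 | 4 | 4 |
| 4 | 5 | 5 | 5 |

long format of dataframe:

| | Cities | Household_size |
|---|---|---|
| 0 | City1 | 6 |
| 1 | City1 | 2 |
| 2 | City1 | 3 |
| 3 | City1 | 4 |
| 4 | City1 | 5 |
| 5 | City2 | 2 |
| 6 | City2 | 1 |
| 7 | City2 | 3 |
| 8 | City2 | 4 |
| 9 | City2 | 5 |
| 10 | City3 | 4 |
| 11 | City3 | 1 |
| 12 | City3 | 2 |
| 13 | City3 | 4 |
| 14 | City3 | 5 |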
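If your data already arrives in long format, it can still be passed to `scipy.stats`. The following is a minimal sketch, assuming the long-format column names `Cities` and `Household_size` used in this tutorial; the variable name `samples` is chosen here for illustration. It splits the long dataframe back into one array per city with `groupby()` and hands those arrays to `f_oneway`.

```python
import pandas as pd
from scipy.stats import f_oneway

# build the long-format dataframe used in this tutorial
city_data = pd.DataFrame({
    "City1": [6, 2, 3, 4, 5],
    "City2": [2, 1, 3, 4, 5],
    "City3": [4, 1, 2, 4, 5],
}).melt(var_name="Cities", value_name="Household_size")

# one array of household sizes per city
samples = [
    grp["Household_size"].to_numpy()
    for _, grp in city_data.groupby("Cities")
]

# f_oneway takes each group as a separate positional argument
result = f_oneway(*samples)
print(result.statistic, result.pvalue)
```

This gives the same F statistic and p-value as Method 1, since only the layout of the data has changed.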

### Step 2 – Use sm.stats.anova_lm() for one-way ANOVA

We set `typ=2`, which specifies how the sums of squares are calculated (see another tutorial about Type 1, Type 2, and Type 3 ANOVA in Python). With a single factor, as here, all three types give the same F test for that factor.

In the code below, notice that we mark `Cities` as a categorical variable by using the `C()` operator.

```
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('Household_size ~ C(Cities)', data=city_data).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print(aov_table)
```

The following is the output. We can see that the F-value and p-value are the same as in Method 1.

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(Cities) | 2.8 | 2.0 | 0.545455 | 0.593292 |
| Residual | 30.8 | 12.0 | NaN | NaN |