How to Do One-way ANOVA in Python (2 Examples)

1. Introduction

One-way ANOVA compares the means of two or more groups to test whether the differences among them are statistically significant. This tutorial shows two ways to run a one-way ANOVA in Python.

Method 1: use scipy.stats for one-way ANOVA

f_oneway(level_1,level_2,level_3,…)

Method 2: use statsmodels for one-way ANOVA

sm.stats.anova_lm(model, typ=2)  # typ can be 1, 2, or 3

2. Sample Data for one-way ANOVA

Suppose we would like to see whether 3 cities differ in terms of household size. We sample 5 households from each city. The null hypothesis and alternative hypothesis for one-way ANOVA are as follows.

H0: μCity1 = μCity2 = μCity3

H1: At least one pair of μCity1, μCity2, and μCity3 is not equal.

Group 1   Group 2   Group 3
6         2         4
2         1         1
3         3         2
4         4         4
5         5         5
Data for One-Way ANOVA in Python
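
Before running the test, it can help to look at the group means that the ANOVA compares. The following is a minimal sketch using only the Python standard library. The names groups and group_means are illustrative choices, not part of the original example.

```python
from statistics import mean

# household sizes sampled from each city (same data as the table above)
groups = {
    "City1": [6, 2, 3, 4, 5],
    "City2": [2, 1, 3, 4, 5],
    "City3": [4, 1, 2, 4, 5],
}

# mean household size of each group
group_means = {city: mean(sizes) for city, sizes in groups.items()}

# print the sample size and mean of each group
for city, sizes in groups.items():
    print(city, "n =", len(sizes), "mean =", group_means[city])
```

The sample means are 4, 3, and 3.2. The ANOVA asks whether differences of this size are larger than what sampling variation alone would be expected to produce.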

3. Method 1: Use scipy.stats for One-way ANOVA

We can use Python scipy.stats to conduct the one-way ANOVA. The following is the Python code example and output. The key line of code is f_oneway(City1,City2,City3).

# import f_oneway from scipy.stats
from scipy.stats import f_oneway

# enter data from scratch
City1= [6,2,3,4,5]
City2= [2,1,3,4,5]
City3= [4,1,2,4,5]

# conduct one-way ANOVA analysis
f_oneway(City1,City2,City3)

The following is the output of one-way ANOVA.

F_onewayResult(statistic=0.5454545454545454, pvalue=0.5932921944658782)

The F statistic is 0.55 and the p-value is 0.59. Since the p-value is greater than 0.05, we fail to reject the null hypothesis. That is, the data provide no evidence that mean household size differs among the 3 cities.
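
To see where the F statistic comes from, the calculation can be sketched by hand in plain Python. This is an illustration under stated assumptions, not how scipy computes the result internally. The variable names are illustrative, and the closed-form p-value on the last line of the calculation is a shortcut that holds only when the numerator degrees of freedom equal 2, that is, with exactly three groups.

```python
# same data as in the example above
groups = [
    [6, 2, 3, 4, 5],  # City1
    [2, 1, 3, 4, 5],  # City2
    [4, 1, 2, 4, 5],  # City3
]

k = len(groups)                         # number of groups
n_total = sum(len(g) for g in groups)   # total number of observations
grand_mean = sum(sum(g) for g in groups) / n_total

# between-group sum of squares: spread of group means around the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# within-group sum of squares: spread of observations around their group mean
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

df_between = k - 1
df_within = n_total - k

# F is the ratio of the between-group to the within-group mean square
f_stat = (ss_between / df_between) / (ss_within / df_within)

# p-value: closed form valid ONLY when df_between == 2 (three groups)
p_value = (df_within / (df_within + 2 * f_stat)) ** (df_within / 2)

print(round(ss_between, 4), round(ss_within, 4))
print(round(f_stat, 4), round(p_value, 4))
```

The rounded F statistic and p-value agree with the f_oneway output above (0.5455 and 0.5933). For any other number of groups, the p-value should come from the F distribution, which f_oneway handles automatically.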

4. Method 2: Use statsmodels for One-way ANOVA

We can also use statsmodels for one-way ANOVA. Note that the data structure statsmodels expects is slightly different from the one used with scipy.stats.

That is, instead of a separate column for each group, statsmodels expects the independent variable (X) in one column and the dependent variable (Y) in another. Thus, we need to use the melt() function to transform the data before we use statsmodels.

Step 1 – Change the data format from wide to long

import pandas as pd

# enter data from scratch
City1= [6,2,3,4,5]
City2= [2,1,3,4,5]
City3= [4,1,2,4,5]

# combine lists into a dataframe
city_data = pd.DataFrame(
    {'City1': City1,
     'City2': City2,
     'City3': City3
    })

# print out the wide format of the dataframe
print('wide format of dataframe: \n', city_data)

# reshape the dataframe into long format
city_data = city_data.melt(var_name="Cities",
                           value_name="Household_size")

# print out the long format of the dataframe
print('long format of dataframe: \n', city_data)

The following is the output.

wide format of dataframe:
    City1  City2  City3
0      6      2      4
1      2      1      1
2      3      3      2
3      4      4      4
4      5      5      5

long format of dataframe: 
    Cities  Household_size
0   City1               6
1   City1               2
2   City1               3
3   City1               4
4   City1               5
5   City2               2
6   City2               1
7   City2               3
8   City2               4
9   City2               5
10  City3               4
11  City3               1
12  City3               2
13  City3               4
14  City3               5
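
To make the reshape concrete, here is a rough sketch of what the wide-to-long step does, written in plain Python without pandas. It is not how melt() is implemented. The names wide and long_rows are illustrative.

```python
# wide format: one list of household sizes per city
wide = {
    "City1": [6, 2, 3, 4, 5],
    "City2": [2, 1, 3, 4, 5],
    "City3": [4, 1, 2, 4, 5],
}

# long format: one (Cities, Household_size) row per observation
long_rows = [(city, size) for city, sizes in wide.items() for size in sizes]

# show the first few rows and the total row count
for row in long_rows[:3]:
    print(row)
print(len(long_rows), "rows")
```

Each of the 15 observations becomes its own row, with the city label repeated alongside it. This is the layout the formula interface in statsmodels works with.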

Step 2 – Use sm.stats.anova_lm() for one-way ANOVA

We set typ=2, which specifies how the sums of squares are calculated (see another tutorial about Type 1, Type 2, and Type 3 ANOVA in Python). With a single factor, as in one-way ANOVA, the three types give the same sum of squares and F test for that factor.

In the code below, we mark Cities as a categorical variable by using the C() operator.

import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols('Household_size ~ C(Cities)', data=city_data).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
print(aov_table)

The following is the output. We can see that the F-value and p-value are the same as in Method 1.

           sum_sq    df         F    PR(>F)
C(Cities)     2.8   2.0  0.545455  0.593292
Residual     30.8  12.0       NaN       NaN
