How to Perform Two-Way ANOVA in Python

Introduction

A two-way ANOVA tests whether the mean of a continuous dependent variable differs across the levels of two categorical variables (factors), and whether the two factors interact.

We can use ols() from statsmodels.formula.api together with sm.stats.anova_lm() to do a two-way ANOVA. The following is the core syntax.

model = ols('DV ~ C(factor_1) + C(factor_2) + C(factor_1):C(factor_2)', data=df_x).fit()

sm.stats.anova_lm(model, typ=2)  # typ can be 1, 2, or 3 (the type of sums of squares)

Hypothetical Data

There are two categorical variables, namely city (city 1 and city 2) and store (store 1 and store 2). With these two variables, there are 4 cells.

Suppose that we want to know whether mean sales differ significantly across these 4 cells. We can do a two-way ANOVA, which tests the effect of city, the effect of store, and their interaction.

Cities  Stores  Sales
City1   store1  10
City1   store2  20
City1   store1  20
City1   store2  50
City1   store1  30
City2   store2  10
City2   store1  5
City2   store2  4
City2   store1  12
City2   store2  4
Hypothetical Data for two-way ANOVA

Step 1: Prepare the data for two-way ANOVA

import numpy as np
import pandas as pd

# generate two arrays of factor labels: 5 rows per city, stores alternating
x_1 = np.repeat(['City1', 'City2'], 5)
x_2 = np.tile(['store1', 'store2'], 5)

# generate a list of sales
sales = [10, 20, 20, 50, 30, 10, 5, 4, 12, 4]

# put the arrays and list into a dataframe
df_x = pd.DataFrame({'cities': x_1, 'stores': x_2, 'sales': sales})
print(df_x)

The code prints the following data frame, which is the data used in this tutorial.

  cities  stores  sales
0  City1  store1     10
1  City1  store2     20
2  City1  store1     20
3  City1  store2     50
4  City1  store1     30
5  City2  store2     10
6  City2  store1      5
7  City2  store2      4
8  City2  store1     12
9  City2  store2      4
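Before fitting the model, it can help to look at the 4 cells directly. The following is a minimal sketch, assuming the same data as above, that counts the observations and computes the mean sales in each city-by-store cell. The name cell_summary is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's data frame so this block runs on its own
x_1 = np.repeat(['City1', 'City2'], 5)
x_2 = np.tile(['store1', 'store2'], 5)
sales = [10, 20, 20, 50, 30, 10, 5, 4, 12, 4]
df_x = pd.DataFrame({'cities': x_1, 'stores': x_2, 'sales': sales})

# Number of observations and mean sales in each of the 4 cells
cell_summary = df_x.groupby(['cities', 'stores'])['sales'].agg(['count', 'mean'])
print(cell_summary)
```

The counts show that the cells are not all the same size (some hold 3 observations and some hold 2), so the design is unbalanced.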

Step 2: Conduct two-way ANOVA in Python

We are going to use ols() from statsmodels.formula.api to fit the model and sm.stats.anova_lm() from statsmodels.api to build the ANOVA table. The following is the Python code and the output of two-way ANOVA in Python.

import statsmodels.api as sm
from statsmodels.formula.api import ols

# model statement
model = ols('sales ~ C(cities) + C(stores) + C(cities):C(stores)', data=df_x).fit()
# ANOVA Table
aov_table = sm.stats.anova_lm(model, typ=2)
print(aov_table)

The following is the output.

                     sum_sq   df         F    PR(>F)
C(cities)            984.15  1.0  8.453686  0.027068
C(stores)             93.75  1.0  0.805297  0.404083
C(cities):C(stores)  183.75  1.0  1.578382  0.255694
Residual             698.50  6.0       NaN       NaN
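To see where the F column comes from, each F value can be recomputed by hand as the mean square of the term divided by the residual mean square. The sketch below uses the numbers copied from the table above. The names sum_sq, ms_resid, and f_values are illustrative only.

```python
# Values copied from the ANOVA table above
sum_sq = {
    'C(cities)': 984.15,
    'C(stores)': 93.75,
    'C(cities):C(stores)': 183.75,
}
df_effect = 1.0                    # each term has 1 degree of freedom
resid_ss, resid_df = 698.50, 6.0   # the Residual row

# Residual mean square: the denominator of every F ratio
ms_resid = resid_ss / resid_df

# F = (sum_sq / df) / residual mean square
f_values = {term: (ss / df_effect) / ms_resid for term, ss in sum_sq.items()}

for term, f in f_values.items():
    print(f"{term}: F = {f:.4f}")
```

The recomputed values agree with the F column of the table up to rounding.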

Step 3: Interpret the results

We need to focus on the p-values for the 3 terms in the output table. First, look at the interaction term, C(cities):C(stores), whose p-value is 0.256. Because this is greater than 0.05, there is no significant interaction effect in the model.

Next, we look at the other two p-values, which test the main effects. The p-value for cities is 0.027, which is smaller than 0.05. Thus, we conclude that sales differ significantly between city 1 and city 2. The p-value for stores is 0.404, which is greater than 0.05, suggesting that sales do not differ significantly between store 1 and store 2.
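The decision rule used above can be written out as a short sketch. The p-values are copied from the output table, and the 0.05 threshold is the one used in this tutorial. The names alpha, p_values, and decisions are illustrative only.

```python
# Significance level used in this tutorial
alpha = 0.05

# p-values copied from the PR(>F) column of the ANOVA table
p_values = {
    'C(cities):C(stores)': 0.255694,  # interaction, checked first
    'C(cities)': 0.027068,
    'C(stores)': 0.404083,
}

# A term is flagged as significant when its p-value is below alpha
decisions = {term: p < alpha for term, p in p_values.items()}

for term, significant in decisions.items():
    label = 'significant' if significant else 'not significant'
    print(f"{term}: p = {p_values[term]:.3f} -> {label}")
```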

