A two-way ANOVA tests whether the mean of a continuous outcome differs across the levels of two categorical variables (factors), and whether the effect of one factor depends on the level of the other (an interaction).
We can use sm.stats.anova_lm() from statsmodels, applied to a model fitted with ols(), to do a two-way ANOVA. The following is the core syntax.
model = ols('DV ~ C(factor_1) + C(factor_2) + C(factor_1):C(factor_2)', data=df_x).fit()
sm.stats.anova_lm(model, typ=2)  # typ can be 1, 2, or 3
Hypothetical Data
There are two categorical variables, namely city (city 1 and city 2) and store (store 1 and store 2). Crossing these two variables gives 4 cells, one for each city-store combination.
Suppose we want to test whether mean sales differ significantly across these cells. We can do so with a two-way ANOVA.
Cities | Stores | Sales |
City1 | store1 | 10 |
City1 | store2 | 20 |
City1 | store1 | 20 |
City1 | store2 | 50 |
City1 | store1 | 30 |
City2 | store2 | 10 |
City2 | store1 | 5 |
City2 | store2 | 4 |
City2 | store1 | 12 |
City2 | store2 | 4 |
Step 1: Prepare the data for two-way ANOVA
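import numpy as np
import pandas as pd
# generate two arrays of x_1 and x_2
x_1 = np.repeat(['City1', 'City2'], 5)
x_2 = np.tile(['store1', 'store2'], 5)
# generate a list of sales (values taken from the table above)
sales = [10, 20, 20, 50, 30, 10, 5, 4, 12, 4]
# put the arrays and list into a dataframe
df_x = pd.DataFrame({'cities': x_1, 'stores': x_2, 'sales': sales})
print(df_x)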
The following is the data being used in this tutorial.
  cities  stores  sales
0  City1  store1     10
1  City1  store2     20
2  City1  store1     20
3  City1  store2     50
4  City1  store1     30
5  City2  store2     10
6  City2  store1      5
7  City2  store2      4
8  City2  store1     12
9  City2  store2      4
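Before fitting the model, it can help to look at how many observations fall into each of the 4 cells and what the mean sales are in each. The following is a minimal sketch that rebuilds the same data so that it runs on its own; the name cell_summary is introduced here for illustration and is not part of the original tutorial.

```python
import pandas as pd

# Rebuild the tutorial's data so this snippet is self-contained
df_x = pd.DataFrame({
    'cities': ['City1'] * 5 + ['City2'] * 5,
    'stores': ['store1', 'store2'] * 5,
    'sales': [10, 20, 20, 50, 30, 10, 5, 4, 12, 4],
})

# Number of observations and mean sales in each city-by-store cell
# (cell_summary is a hypothetical name used only in this sketch)
cell_summary = df_x.groupby(['cities', 'stores'])['sales'].agg(['count', 'mean'])
print(cell_summary)
```

The counts show that the cells are not all the same size (3, 2, 2, and 3 observations), so the design is unbalanced.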
Step 2: Conduct two-way ANOVA in Python
We are going to use statsmodels to do the two-way ANOVA: ols() fits the model, and sm.stats.anova_lm() produces the ANOVA table. The following is the Python code and its output.
import statsmodels.api as sm
from statsmodels.formula.api import ols
# model statement
model = ols('sales ~ C(cities) + C(stores) + C(cities):C(stores)', data=df_x).fit()
# ANOVA Table
aov_table = sm.stats.anova_lm(model, typ=2)
print(aov_table)
                     sum_sq   df         F    PR(>F)
C(cities)            984.15  1.0  8.453686  0.027068
C(stores)             93.75  1.0  0.805297  0.404083
C(cities):C(stores)  183.75  1.0  1.578382  0.255694
Residual             698.50  6.0       NaN       NaN
Step 3: Interpret the results
We need to focus on the p-values for the 3 terms in the output table. First, look at the interaction term C(cities):C(stores), whose p-value is 0.256. Because this is greater than 0.05, there is no significant interaction effect in the model, so the two main effects can be interpreted on their own.
Next, we look at the other two p-values. The p-value for cities is 0.027, which is smaller than 0.05. Thus, we conclude that mean sales differ significantly between city 1 and city 2. The p-value for stores is 0.404, which is greater than 0.05, suggesting that mean sales do not differ significantly between store 1 and store 2.