This tutorial shows how to plot the interaction of 2 categorical independent variables in Python. It walks through both the ANOVA and linear regression outputs.
You will see that ANOVA is also a linear regression model.
Thus, it does not matter whether you use ANOVA or linear regression: you can use the same method (i.e., the same Python code) to plot the interaction of 2 categorical variables.
Steps of plotting figure for 2 Categorical Variables Interaction in Python
When the two independent variables are categorical (e.g., 2 cities and 2 store brands) and the DV is a continuous variable, the easiest way to do the analysis is a 2-way ANOVA.
In the following, Step 2 uses both a 2-way ANOVA and linear regression to print out the results. You can see that the results are exactly the same.
Step 1: Preparing the data
# import NumPy
import numpy as np
# import pandas
import pandas as pd
# generate the two IV arrays of City and Brand
# IV of city: 10 observations per city
City = np.repeat(['City1', 'City2'], 10)
# IV of store brand: brand1 and brand2 alternating
Brand = np.tile(['brand1', 'brand2'], 10)
# DV of sales
Sales = [70, 10, 100, 2, 30, 2, 20, 10, 20, 10,
         9, 10, 5, 4, 4, 4, 5, 4, 12, 11]
# put the IVs and DV into a DataFrame
df = pd.DataFrame({'City': City, 'Brand': Brand, 'sales': Sales})
print(df)
Output:
     City   Brand  sales
0   City1  brand1     70
1   City1  brand2     10
2   City1  brand1    100
3   City1  brand2      2
4   City1  brand1     30
5   City1  brand2      2
6   City1  brand1     20
7   City1  brand2     10
8   City1  brand1     20
9   City1  brand2     10
10  City2  brand1      9
11  City2  brand2     10
12  City2  brand1      5
13  City2  brand2      4
14  City2  brand1      4
15  City2  brand2      4
16  City2  brand1      5
17  City2  brand2      4
18  City2  brand1     12
19  City2  brand2     11
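As a quick sanity check on the data above, you can compute the mean sales in each City × Brand cell with pandas (`groupby`/`unstack` are standard pandas calls; the variable name `cell_means` is just illustrative):

```python
import numpy as np
import pandas as pd

# same data setup as Step 1
City = np.repeat(['City1', 'City2'], 10)
Brand = np.tile(['brand1', 'brand2'], 10)
Sales = [70, 10, 100, 2, 30, 2, 20, 10, 20, 10,
         9, 10, 5, 4, 4, 4, 5, 4, 12, 11]
df = pd.DataFrame({'City': City, 'Brand': Brand, 'sales': Sales})

# mean sales per City x Brand cell, as a 2x2 table
cell_means = df.groupby(['City', 'Brand'])['sales'].mean().unstack()
print(cell_means)
```

The large brand1 mean in City1 compared with the other three cells is what the interaction plot in Step 4 will show.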
Step 2 (Version 1): Use ANOVA for interaction of 2 categorical variables
We use anova_lm() from statsmodels to calculate the ANOVA table. Note that we use a type 3 ANOVA in the analysis.
import statsmodels.api as sm
from statsmodels.formula.api import ols
# model statement
model = ols('sales ~ City + Brand + City:Brand', data=df).fit()
# ANOVA Table
aov_table = sm.stats.anova_lm(model, typ=3)
print(aov_table)
Output:
             sum_sq    df          F    PR(>F)
Intercept   11520.0   1.0  35.081842  0.000021
City         4202.5   1.0  12.797868  0.002516
Brand        4243.6   1.0  12.923030  0.002425
City:Brand   2080.8   1.0   6.336658  0.022865
Residual     5254.0  16.0        NaN       NaN
Step 2 (Version 2): Use linear regression for interaction of 2 categorical variables
We can also use linear regression to test the interaction of 2 categorical variables. We are going to use statsmodels again.
You can see that the model statement is exactly the same as the one in the ANOVA, since both are testing the same thing.
import statsmodels.api as sm
from statsmodels.formula.api import ols
# model statement is the same as the one in ANOVA
model = ols('sales ~ City + Brand + City:Brand', data=df).fit()
# print out the linear regression result
print(model.summary())
Output:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.548
Model:                            OLS   Adj. R-squared:                  0.463
Method:                 Least Squares   F-statistic:                     6.462
Date:                Wed, 15 Jun 2022   Prob (F-statistic):            0.00451
Time:                        15:53:16   Log-Likelihood:                -84.089
No. Observations:                  20   AIC:                             176.2
Df Residuals:                      16   BIC:                             180.2
Df Model:                           3
Covariance Type:            nonrobust
=================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept                       48.0000      8.104      5.923      0.000      30.820      65.180
City[T.City2]                  -41.0000     11.461     -3.577      0.003     -65.296     -16.704
Brand[T.brand2]                -41.2000     11.461     -3.595      0.002     -65.496     -16.904
City[T.City2]:Brand[T.brand2]   40.8000     16.208      2.517      0.023       6.441      75.159
==============================================================================
Omnibus:                       13.534   Durbin-Watson:                   1.877
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               14.579
Skew:                           1.193   Prob(JB):                     0.000683
Kurtosis:                       6.436   Cond. No.                         6.85
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
From the output above, we can see that the linear regression model reports t-values, whereas the ANOVA reports F-values. However, because each term here has 1 degree of freedom, squaring a t-value gives the corresponding F-value (up to rounding).
Importantly, the p-values are identical between the ANOVA and the linear regression model. Thus, we can conclude that the type 3 ANOVA and the linear regression model are testing the same thing.
Step 3: Summarize the result of interaction of 2 categorical independent variables
Since the p-value of the interaction (0.023) is smaller than 0.05, we can conclude that the interaction is significant. Given that it is significant, we can plot the interaction in the next step.
Side note: you can still draw an interaction plot even if the interaction is not significant. In general, however, the main purpose of an interaction plot is to show a significant interaction effect.
Step 4: Plot the interaction of 2-categorical independent variables
import matplotlib.pyplot as plt
# import interaction_plot from statsmodels
from statsmodels.graphics.factorplots import interaction_plot

# set the figure size; you can change the numbers if you prefer
fig, ax = plt.subplots(figsize=(6, 6))

# the key plot statement:
# IV of city on the x-axis
# IV of store brand as the trace (one line per brand)
# DV of sales ("response") on the y-axis
fig = interaction_plot(
    x=df['City'],
    trace=df['Brand'],
    response=df['sales'],
    colors=["red", "blue"],
    markers=["D", "^"],
    ms=10,
    ax=ax,
)
plt.show()
Output: an interaction plot with City on the x-axis, mean sales on the y-axis, and one line per brand. The non-parallel lines visualize the significant interaction.
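If you prefer not to depend on statsmodels for the figure, an equivalent plot can be built directly from the cell means with plain matplotlib (a sketch; the data setup repeats Step 1, and the output filename is just an example):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripting; optional
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# same data setup as Step 1
City = np.repeat(['City1', 'City2'], 10)
Brand = np.tile(['brand1', 'brand2'], 10)
Sales = [70, 10, 100, 2, 30, 2, 20, 10, 20, 10,
         9, 10, 5, 4, 4, 4, 5, 4, 12, 11]
df = pd.DataFrame({'City': City, 'Brand': Brand, 'sales': Sales})

# mean sales per City x Brand cell
cell_means = df.groupby(['City', 'Brand'])['sales'].mean().unstack()

# one line per brand, cities on the x-axis
fig, ax = plt.subplots(figsize=(6, 6))
for brand in cell_means.columns:
    ax.plot(cell_means.index, cell_means[brand], marker='o', label=brand)
ax.set_xlabel('City')
ax.set_ylabel('Mean sales')
ax.legend(title='Brand')
fig.savefig('interaction_plot.png')
```

This makes the definition of the interaction plot explicit: each point is a cell mean, and each trace connects the means of one brand across cities.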