Plot for Interactions of 2 Categorical Variables in Python (with example)

This tutorial shows how to plot interactions of 2 categorical independent variables in Python. The following shows both the ANOVA and linear regression outputs.

You will see that ANOVA is also a linear regression model.

Thus, it does not matter whether you use ANOVA or linear regression: you can use the same method (i.e., the same Python code) to plot the interaction of 2 categorical variables.

Steps of plotting figure for 2 Categorical Variables Interaction in Python

When both independent variables (IVs) are categorical (e.g., 2 cities and 2 store brands) and the dependent variable (DV) is continuous, the easiest way to do the analysis is a 2-way ANOVA.

In the following, Step 2 prints the results from both a 2-way ANOVA and a linear regression. You can see that the two approaches give the same p-values.

Step 1: Preparing the data

# import the module of Numpy
import numpy as np

# import needed module of pandas
import pandas as pd

# generate two arrays of City and Brand
# generate IV of city
City = np.repeat(['City1','City2'],10)


# generate IV of store brands
Brand = np.tile(['brand1','brand2'], 10)

# generate DV of sales
Sales=[70,10,100,2,30,2,20,10,20,10,9,10,5,4,4,4,5,4,12,11]

# put IV and DV into a dataframe
df=pd.DataFrame({'City':City, 'Brand':Brand,'sales':Sales})
print(df)

Output:

     City   Brand  sales
0   City1  brand1     70
1   City1  brand2     10
2   City1  brand1    100
3   City1  brand2      2
4   City1  brand1     30
5   City1  brand2      2
6   City1  brand1     20
7   City1  brand2     10
8   City1  brand1     20
9   City1  brand2     10
10  City2  brand1      9
11  City2  brand2     10
12  City2  brand1      5
13  City2  brand2      4
14  City2  brand1      4
15  City2  brand2      4
16  City2  brand1      5
17  City2  brand2      4
18  City2  brand1     12
19  City2  brand2     11
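
Before fitting any model, it can help to look at the mean of sales in each City-by-Brand cell, since these four means are what the interaction plot will eventually show. The following is a minimal sketch, assuming the same data as above; the name cell_means is only illustrative and is not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# rebuild the same data as in Step 1
City = np.repeat(['City1', 'City2'], 10)
Brand = np.tile(['brand1', 'brand2'], 10)
Sales = [70, 10, 100, 2, 30, 2, 20, 10, 20, 10, 9, 10, 5, 4, 4, 4, 5, 4, 12, 11]
df = pd.DataFrame({'City': City, 'Brand': Brand, 'sales': Sales})

# mean of sales for each City x Brand combination
# rows are cities, columns are brands
cell_means = df.groupby(['City', 'Brand'])['sales'].mean().unstack('Brand')
print(cell_means)
```

For this data, the means are 48.0 (City1, brand1), 6.8 (City1, brand2), 7.0 (City2, brand1), and 6.6 (City2, brand2). The gap between brands is large in City1 and almost zero in City2, which is the pattern an interaction describes.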

Step 2 (Version 1): Use ANOVA for interaction of 2 categorical variables

We use anova_lm() from statsmodels to calculate the ANOVA table. Note that we request Type 3 sums of squares (typ=3) in the analysis.

import statsmodels.api as sm
from statsmodels.formula.api import ols

# model statement
model = ols('sales ~ City + Brand + City:Brand', data=df).fit()

# ANOVA Table
aov_table = sm.stats.anova_lm(model, typ=3)
print(aov_table)

Output:

             sum_sq    df          F    PR(>F)
Intercept   11520.0   1.0  35.081842  0.000021
City         4202.5   1.0  12.797868  0.002516
Brand        4243.6   1.0  12.923030  0.002425
City:Brand   2080.8   1.0   6.336658  0.022865
Residual     5254.0  16.0        NaN       NaN

Step 2 (Version 2): Use linear regression for interaction of 2 categorical variables

We can also use linear regression for the interaction of 2 categorical variables. We are going to use statsmodels again.

You can see that the model statement is exactly the same as the one in ANOVA, since both are testing the same thing.

import statsmodels.api as sm
from statsmodels.formula.api import ols

# model statement is the same as the one in ANOVA
model = ols('sales ~ City + Brand + City:Brand', data=df).fit()

# print out the linear regression result
print(model.summary())

Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.548
Model:                            OLS   Adj. R-squared:                  0.463
Method:                 Least Squares   F-statistic:                     6.462
Date:                Wed, 15 Jun 2022   Prob (F-statistic):            0.00451
Time:                        15:53:16   Log-Likelihood:                -84.089
No. Observations:                  20   AIC:                             176.2
Df Residuals:                      16   BIC:                             180.2
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept                        48.0000      8.104      5.923      0.000      30.820      65.180
City[T.City2]                   -41.0000     11.461     -3.577      0.003     -65.296     -16.704
Brand[T.brand2]                 -41.2000     11.461     -3.595      0.002     -65.496     -16.904
City[T.City2]:Brand[T.brand2]    40.8000     16.208      2.517      0.023       6.441      75.159
==============================================================================
Omnibus:                       13.534   Durbin-Watson:                   1.877
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               14.579
Skew:                           1.193   Prob(JB):                     0.000683
Kurtosis:                       6.436   Cond. No.                         6.85
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

From the output above, we can see that the linear regression model reports t-values, whereas ANOVA reports F-values. However, because each term here has 1 degree of freedom, squaring a t-value gives the corresponding F-value, up to rounding (e.g., for City, (-3.577)^2 ≈ 12.80).

Importantly, the p-values are the same in the ANOVA table and the linear regression output. Thus, we can conclude that the Type 3 ANOVA and the linear regression model are testing the same thing.

Step 3: Summarize the result of interaction of 2 categorical independent variables

Since the p-value of the interaction term (City:Brand) is 0.023, which is smaller than 0.05, we can conclude that the interaction is significant. Given that it is significant, we can plot the interaction in the next step.

Side Note: you can still draw an interaction plot even if the interaction is not significant. However, in general, the main purpose of an interaction plot is to illustrate an interaction effect that is significant.
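
To see what the interaction coefficient measures, it can be read as a "difference of differences" between cell means: the brand gap in City2 minus the brand gap in City1. The following is a minimal sketch under that reading, assuming the same data and model as above; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# rebuild the same data as in Step 1
City = np.repeat(['City1', 'City2'], 10)
Brand = np.tile(['brand1', 'brand2'], 10)
Sales = [70, 10, 100, 2, 30, 2, 20, 10, 20, 10, 9, 10, 5, 4, 4, 4, 5, 4, 12, 11]
df = pd.DataFrame({'City': City, 'Brand': Brand, 'sales': Sales})

model = ols('sales ~ City + Brand + City:Brand', data=df).fit()

# mean of sales in each City x Brand cell
cell_means = df.groupby(['City', 'Brand'])['sales'].mean()

# brand2 minus brand1, computed separately within each city
brand_gap_city1 = cell_means.loc[('City1', 'brand2')] - cell_means.loc[('City1', 'brand1')]
brand_gap_city2 = cell_means.loc[('City2', 'brand2')] - cell_means.loc[('City2', 'brand1')]

# how much the brand gap changes from City1 to City2
diff_in_diff = brand_gap_city2 - brand_gap_city1

# the interaction coefficient and its p-value from the regression
term = 'City[T.City2]:Brand[T.brand2]'
interaction_coef = model.params[term]
interaction_p = model.pvalues[term]

print(diff_in_diff, interaction_coef, interaction_p)
```

For this data, the brand gap is about -41.2 in City1 and about -0.4 in City2, so the difference of differences is about 40.8, which is the interaction coefficient in the regression output above.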

Step 4: Plot the interaction of 2 categorical independent variables

import matplotlib.pyplot as plt

# import interaction_plot from statsmodels
from statsmodels.graphics.factorplots import interaction_plot

# set the figure size, and you can change the numbers if you prefer
fig, ax = plt.subplots(figsize=(6, 6))

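# the following is the key plot statement
# IV of city on the x-axis
# IV of store brand as "trace" (one line per brand)
# DV of sales as "response" (its mean is shown on the y-axis)
fig = interaction_plot(
    x=City,
    trace=Brand,
    response=Sales,
    colors=["red", "blue"],
    markers=["D", "^"],
    ms=10,
    ax=ax,
)

# display the figure (needed when running as a script)
plt.show()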

Output:

Plot Figure of Interaction of 2 Categorical Independent Variables
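
To make clear what the figure contains, the same kind of plot can be drawn by hand from the four cell means. The following is a minimal sketch using only pandas and matplotlib, not the statsmodels implementation; the colors, markers, and labels are chosen to resemble the call above, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# rebuild the same data as in Step 1
City = np.repeat(['City1', 'City2'], 10)
Brand = np.tile(['brand1', 'brand2'], 10)
Sales = [70, 10, 100, 2, 30, 2, 20, 10, 20, 10, 9, 10, 5, 4, 4, 4, 5, 4, 12, 11]
df = pd.DataFrame({'City': City, 'Brand': Brand, 'sales': Sales})

# mean of sales per cell: rows are cities, columns are brands
cell_means = df.groupby(['City', 'Brand'])['sales'].mean().unstack('Brand')

fig, ax = plt.subplots(figsize=(6, 6))

# one line per brand, with the cities along the x-axis
cities = list(cell_means.index)
for brand, color, marker in zip(cell_means.columns, ['red', 'blue'], ['D', '^']):
    ax.plot(cities, cell_means[brand].to_numpy(),
            color=color, marker=marker, ms=10, label=brand)

ax.set_xlabel('City')
ax.set_ylabel('mean of sales')
ax.legend(title='Brand')
plt.show()
```

Lines that are far from parallel, as they are here, are the visual sign of an interaction: the effect of Brand on sales depends on the City.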

Further Reading