Poisson Regression in Python

The function of Poission() from statsmodels can be used to do Poisson regression in Python. The key Python code is as follows.

import statsmodels.api as sm
sm.GLM(Y, X, family=sm.families.Poisson()).fit()

This tutorial uses a dataset posted by Paul Roback and Julie Legler. Two variables are the focus, namely age and total. The following are detailed explanations for these two variables.

  • age = the age of the head of household
  • total = the number of people in the household other than the head

We are going to see if age can be used to predict the number of people in a household (i.e., total).

import pandas as pd
# read the csv file from Github
df=pd.read_csv("https://raw.githubusercontent.com/proback/BeyondMLR/master/data/fHH1.csv",index_col=0)

# print out the first 6 row of the data frame
print(df.head(6))

The following shows sample rows of the data that we are going to use in this tutorial.

       location  age  total  numLT5                           roof
1  CentralLuzon   65      0       0  Predominantly Strong Material
2   MetroManila   75      3       0  Predominantly Strong Material
3   DavaoRegion   54      4       0  Predominantly Strong Material
4       Visayas   49      3       0  Predominantly Strong Material
5   MetroManila   74      3       0  Predominantly Strong Material
6       Visayas   59      6       0  Predominantly Strong Material

Step 1: Prepare data

The following Python code prepares X (age with constant) and Y (total, renamed with to household size).

# add constant to the column of age and save it as a new column
age = sm.add_constant(data_HH['age'])

# save column 'column' as a new column, and renamed as household_size
household_size=df['total']

# print out the first 6 rows for both age and household_size
print(age.head(6))
print(household_size.head(6))

The following is the output. The first part is age with constant, and the second part is household-size.

# age with constant 1s
  const  age
0    1.0   65
1    1.0   75
2    1.0   54
3    1.0   49
4    1.0   74
5    1.0   59

# household size
1    0
2    3
3    4
4    3
5    3
6    6
Name: total, dtype: int64

Step 2: Use statsmodels

The following uses statsmodels to do the Poisson regression and then print out the result.

import statsmodels.api as sm

# use GLM() function to do the Poisson regression
result_Poisson=sm.GLM(household_size, age, family=sm.families.Poisson()).fit()

# print out the result
print(result_Poisson.summary())

The following is the output of Poisson regression in Python.

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  total   No. Observations:                 1500
Model:                            GLM   Df Residuals:                     1498
Model Family:                 Poisson   Df Model:                            1
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -3355.0
Date:                Tue, 24 Jan 2023   Deviance:                       2337.1
Time:                        23:57:22   Pearson chi2:                 2.27e+03
No. Iterations:                     4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.5499      0.050     30.829      0.000       1.451       1.648
age           -0.0047      0.001     -5.026      0.000      -0.007      -0.003
==============================================================================

Step 3: Interpret the output

Based on the output above, we can write out the Poisson regression statement.

\[ log (\hat{\lambda}) =b_0+b_1 Age =1.55 -0.0047 Age \]

We can get the following.

\[ \frac{\lambda_{Age+1}}{\lambda_{Age}} =e^{-0.0047}=0.995 \]

We can further get the following.

\[ \lambda_{Age+1} – \lambda_{Age}=0.995 \lambda_{Age}- \lambda_{Age}=-0.005 \lambda_{Age}\]

If age increases by 1 unit, for instance, from 80 to 81, the change in household size is -0.005*80 =-0.4.