The function of Poission() from statsmodels can be used to do Poisson regression in Python. The key Python code is as follows.
import statsmodels.api as sm sm.GLM(Y, X, family=sm.families.Poisson()).fit()
This tutorial uses a dataset posted by Paul Roback and Julie Legler. Two variables are the focus, namely age and total. The following are detailed explanations for these two variables.
age
= the age of the head of householdtotal
= the number of people in the household other than the head
We are going to see if age can be used to predict the number of people in a household (i.e., total).
import pandas as pd
# read the csv file from Github
df=pd.read_csv("https://raw.githubusercontent.com/proback/BeyondMLR/master/data/fHH1.csv",index_col=0)
# print out the first 6 row of the data frame
print(df.head(6))
The following shows sample rows of the data that we are going to use in this tutorial.
location age total numLT5 roof 1 CentralLuzon 65 0 0 Predominantly Strong Material 2 MetroManila 75 3 0 Predominantly Strong Material 3 DavaoRegion 54 4 0 Predominantly Strong Material 4 Visayas 49 3 0 Predominantly Strong Material 5 MetroManila 74 3 0 Predominantly Strong Material 6 Visayas 59 6 0 Predominantly Strong Material
Step 1: Prepare data
The following Python code prepares X (age with constant) and Y (total, renamed with to household size).
# add constant to the column of age and save it as a new column
age = sm.add_constant(data_HH['age'])
# save column 'column' as a new column, and renamed as household_size
household_size=df['total']
# print out the first 6 rows for both age and household_size
print(age.head(6))
print(household_size.head(6))
The following is the output. The first part is age with constant, and the second part is household-size.
# age with constant 1s const age 0 1.0 65 1 1.0 75 2 1.0 54 3 1.0 49 4 1.0 74 5 1.0 59 # household size 1 0 2 3 3 4 4 3 5 3 6 6 Name: total, dtype: int64
Step 2: Use statsmodels
The following uses statsmodels to do the Poisson regression and then print out the result.
import statsmodels.api as sm
# use GLM() function to do the Poisson regression
result_Poisson=sm.GLM(household_size, age, family=sm.families.Poisson()).fit()
# print out the result
print(result_Poisson.summary())
The following is the output of Poisson regression in Python.
Generalized Linear Model Regression Results ============================================================================== Dep. Variable: total No. Observations: 1500 Model: GLM Df Residuals: 1498 Model Family: Poisson Df Model: 1 Link Function: log Scale: 1.0000 Method: IRLS Log-Likelihood: -3355.0 Date: Tue, 24 Jan 2023 Deviance: 2337.1 Time: 23:57:22 Pearson chi2: 2.27e+03 No. Iterations: 4 Covariance Type: nonrobust ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ const 1.5499 0.050 30.829 0.000 1.451 1.648 age -0.0047 0.001 -5.026 0.000 -0.007 -0.003 ==============================================================================
Step 3: Interpret the output
Based on the output above, we can write out the Poisson regression statement.
\[ log (\hat{\lambda}) =b_0+b_1 Age =1.55 -0.0047 Age \]
We can get the following.
\[ \frac{\lambda_{Age+1}}{\lambda_{Age}} =e^{-0.0047}=0.995 \]
We can further get the following.
\[ \lambda_{Age+1} – \lambda_{Age}=0.995 \lambda_{Age}- \lambda_{Age}=-0.005 \lambda_{Age}\]
If age increases by 1 unit, for instance, from 80 to 81, the change in household size is -0.005*80 =-0.4.