Introduction
Having covered the theoretical basics of the t-test (see the tutorial here), we can now apply it to a real-world dataset. Using the Financial well-being survey data, we will run an independent-samples t-test in Python to check whether subjective well-being differs by gender. You can download the CSV file using this link.
Data Explanation
We will combine the following three survey items into a single index of subjective well-being (SWB).
- SWB_1: I am satisfied with my life.
- SWB_2: I am optimistic about my future.
- SWB_3: If I work hard today, I will be more successful in the future.
Before proceeding, we need to clean the data, because some responses are missing. Specifically, some responses are coded as “Response not written to database” (-4) or “Refused” (-1), as the codebook entry for SWB_1 shows.
"SWB_1":{ -4: "Response not written to database", -1: "Refused", 1: "1 Strongly disagree", 2: "2", 3: "3", 4: "4", 5: "5", 6: "6", 7: "7 Strongly agree"},
Data Cleaning
The following code counts the values of each item to check whether such missing-value codes are present. If they are, we need to remove those rows before conducting the t-test.
import pandas as pd

df = pd.read_csv("./NFWBS_PUF_2016_data.csv")

SWB_1_count = df["SWB_1"].value_counts()
print(SWB_1_count)

SWB_2_count = df["SWB_2"].value_counts()
print(SWB_2_count)

SWB_3_count = df["SWB_3"].value_counts()
print(SWB_3_count)
The output below confirms that each item contains some missing-value codes (-1 and -4).
 6    1926
 7    1535
 5    1458
 4     803
 3     335
 1     154
 2     152
-1      30
-4       1
Name: SWB_1, dtype: int64
 6    1846
 7    1642
 5    1399
 4     839
 3     335
 2     144
 1     132
-1      56
-4       1
Name: SWB_2, dtype: int64
 7    1991
 6    1653
 5    1251
 4     862
 3     267
 1     167
 2     138
-1      64
-4       1
Name: SWB_3, dtype: int64
The following code removes those rows and prints the counts again to check the result. As the output shows, the negative codes are gone.
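print("after del")

# Keep only rows where all three items hold a valid response (1 to 7).
# Each filter uses the already-filtered frame, so the mask and the data share an index.
rslt_df = df.loc[df['SWB_1'] >= 1]
rslt_df = rslt_df.loc[rslt_df['SWB_2'] >= 1]
# .copy() lets us add a column later without a SettingWithCopyWarning.
rslt_df = rslt_df.loc[rslt_df['SWB_3'] >= 1].copy()

SWB_1_count = rslt_df["SWB_1"].value_counts()
print(SWB_1_count)

SWB_2_count = rslt_df["SWB_2"].value_counts()
print(SWB_2_count)

SWB_3_count = rslt_df["SWB_3"].value_counts()
print(SWB_3_count)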
after del
6    1912
7    1515
5    1450
4     799
3     333
1     153
2     152
Name: SWB_1, dtype: int64
6    1841
7    1632
5    1394
4     838
3     333
2     144
1     132
Name: SWB_2, dtype: int64
7    1984
6    1649
5    1250
4     860
3     266
1     167
2     138
Name: SWB_3, dtype: int64
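The same row filter can be written as a single boolean mask over all three items. The sketch below is only an illustration: the DataFrame `toy` and its values are invented rather than taken from the survey file, and it assumes, as in the codebook, that valid responses are coded 1 to 7 and missing ones are negative.

```python
import pandas as pd

# Toy data standing in for the survey file; the values are invented for illustration.
toy = pd.DataFrame({
    "SWB_1": [6, -1, 5, 7],
    "SWB_2": [7, 4, -4, 6],
    "SWB_3": [5, 6, 3, -1],
})

swb_items = ["SWB_1", "SWB_2", "SWB_3"]

# A row is kept only if all three items hold a valid (>= 1) response.
valid_rows = (toy[swb_items] >= 1).all(axis=1)
toy_clean = toy.loc[valid_rows].copy()

print(toy_clean)
```

Building the mask once keeps the condition in one place, which makes it easier to adjust if more items are added to the index.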
The following code creates a new column, “Combined_SWB”, as the sum of the three items, and prints it.
column_names = ['SWB_1', 'SWB_2', 'SWB_3']

# Row-wise sum of the three items gives the combined SWB index.
rslt_df["Combined_SWB"] = rslt_df[column_names].sum(axis=1)
print(rslt_df["Combined_SWB"])
0       16
1       18
2       11
3       18
4       12
        ..
6389    20
6390    21
6391    17
6392    15
6393    14
Name: Combined_SWB, Length: 6314, dtype: int64
We also need to check for missing values in the grouping variable, gender. The survey codes gender as follows.
"PPGENDER":{ 1: "Male", 2: "Female"}
gender_count = rslt_df["PPGENDER"].value_counts()
print(gender_count)
The following is the output, which shows that there are no missing values.
1    3328
2    2986
Name: PPGENDER, dtype: int64
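The numeric gender codes can be turned into readable labels using the codebook entry shown earlier. This is a minimal sketch on an invented column: the names `toy` and `gender_label` are assumptions made for illustration, and only the mapping itself comes from the codebook.

```python
import pandas as pd

# Codebook entry for PPGENDER, as given in the survey documentation.
gender_labels = {1: "Male", 2: "Female"}

# Toy codes standing in for the real column; the values are invented.
toy = pd.DataFrame({"PPGENDER": [1, 2, 2, 1, 2]})

# Map numeric codes to labels; any code missing from the codebook becomes NaN.
toy["gender_label"] = toy["PPGENDER"].map(gender_labels)

# dropna=False keeps unmapped codes visible in the tally.
print(toy["gender_label"].value_counts(dropna=False))
```

Because unmapped codes become NaN, the same tally doubles as a check that no unexpected gender codes are present.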
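The following is the key code for the t-test. Setting equal_var=False makes scipy.stats.ttest_ind run Welch's t-test, which does not assume that the two groups have equal variances.
import scipy.stats

# Split the cleaned data by gender code (1 = Male, 2 = Female).
data_men = rslt_df[rslt_df['PPGENDER'] == 1]
data_women = rslt_df[rslt_df['PPGENDER'] == 2]

print("Men's SWB:")
print(data_men["Combined_SWB"].mean())
print("\n")

print("Women's SWB:")
print(data_women["Combined_SWB"].mean())
print("\n")

print("t-test results:")
ttest_results = scipy.stats.ttest_ind(data_men["Combined_SWB"], data_women["Combined_SWB"], equal_var=False)
print(ttest_results)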
The following is the output.
Men's SWB:
16.341346153846153


Women's SWB:
16.248827863362358


t-test results:
Ttest_indResult(statistic=1.0023437227400076, pvalue=0.3162168765314645)
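To see what the reported test statistic measures, Welch's t can be computed by hand: the difference in group means divided by the combined standard error, with degrees of freedom from the Welch-Satterthwaite approximation. The sketch below uses only the Python standard library. The function name `welch_t` and the two score lists are made up for illustration and are not the survey data, and the sketch does not compute the p-value.

```python
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error per group: sample variance (n - 1 denominator) over n.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / (se2_a + se2_b) ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


# Invented scores for two small groups (not taken from the survey).
group_a = [16, 18, 11, 18, 12, 20]
group_b = [15, 17, 14, 21, 13]

t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

A t statistic close to zero means the gap between the means is small relative to the sampling noise, which is the situation in the survey output above.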
With p = 0.316, well above the conventional 0.05 threshold, the difference is not statistically significant. The means are also very close to each other (SWB men = 16.34 versus SWB women = 16.25), suggesting that men and women do not meaningfully differ in subjective well-being in this sample. We can also plot the means using a bar chart.
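import matplotlib.pyplot as plt
import seaborn as sns

# Bar heights are the group means of Combined_SWB (1 = Male, 2 = Female).
sns.barplot(x='PPGENDER', y="Combined_SWB", data=rslt_df)
plt.xlabel('Gender', fontsize=18)
plt.ylabel('SWB', fontsize=18)
plt.show()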