Since our goal is to explore what kinds of features might influence and best predict our outcome - ladder score. I am considering a multiple linear regression to see how variables relate to the outcome. Once important question that I am interested in is how much does GDP impacts the Ladder Score, and does anything interact with GDP. Here we mainly focus on our latest data - happiness report from 2021.
regression_df$country_name <- as.factor(regression_df$country_name)
regression_df$regional_indicator<- as.factor(regression_df$regional_indicator)
regression_df <- regression_df %>%
rename(Country = country_name, Region = regional_indicator, Ladder = ladder_score,
SD.Ladder = standard_error_of_ladder_score, GDP = logged_gdp_per_capita, Social.support = social_support,
Life.exp = healthy_life_expectancy, Freedom = freedom_to_make_life_choices,
Corruption = perceptions_of_corruption, Ladder.Dystopia = ladder_score_in_dystopia,
EXP.LOG.GPD = explained_by_log_gdp_per_capita, EXP.SS = explained_by_social_support,
EXP.HLE = explained_by_healthy_life_expectancy, EXP.FREE = explained_by_freedom_to_make_life_choices,
EXP.GEN = explained_by_generosity, EXP.CUR = explained_by_perceptions_of_corruption,
DYS.RES = dystopia_residual)
hap <- regression_df[,-c(4,5,6,13:20)]
colnames(hap)
## [1] "Country" "Region" "Ladder" "GDP"
## [5] "Social.support" "Life.exp" "Freedom" "generosity"
## [9] "Corruption"
hap <- hap %>%
dplyr::mutate(Ladder = row_number())
head(hap)
## # A tibble: 6 × 9
## Country Region Ladder GDP Socia…¹ Life.…² Freedom gener…³ Corru…⁴
## <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Finland Western Euro… 1 10.8 0.954 72 0.949 -0.098 0.186
## 2 Denmark Western Euro… 2 10.9 0.954 72.7 0.946 0.03 0.179
## 3 Switzerland Western Euro… 3 11.1 0.942 74.4 0.919 0.025 0.292
## 4 Iceland Western Euro… 4 10.9 0.983 73 0.955 0.16 0.673
## 5 Netherlands Western Euro… 5 10.9 0.942 72.4 0.913 0.175 0.338
## 6 Norway Western Euro… 6 11.1 0.954 73.3 0.96 0.093 0.27
## # … with abbreviated variable names ¹Social.support, ²Life.exp, ³generosity,
## # ⁴Corruption
sum(is.na(regression_df)) # there is no na
## [1] 0
We first rename the column name for further use and check if there is NA. It turns out that no NA is in our dataset. What’s worth mentioning is that we transform the ladder score according to their rank(largest score as rank 1) to have a better understanding when comparing them.
Now, we check the correlations between each of the variables, and then we print the correlation matrix and plot the correlation graph.
num.var <- select_if(hap, is.numeric)
M<-cor(num.var)
M
## Ladder GDP Social.support Life.exp Freedom
## Ladder 1.00000000 -0.7969875 -0.7510240 -0.7759416 -0.6064113
## GDP -0.79698747 1.0000000 0.7852987 0.8594606 0.4323235
## Social.support -0.75102401 0.7852987 1.0000000 0.7232561 0.4829298
## Life.exp -0.77594156 0.8594606 0.7232561 1.0000000 0.4614939
## Freedom -0.60641135 0.4323235 0.4829298 0.4614939 1.0000000
## generosity 0.03966234 -0.1992864 -0.1149459 -0.1617503 0.1694374
## Corruption 0.40065425 -0.3423374 -0.2032070 -0.3643735 -0.4013630
## generosity Corruption
## Ladder 0.03966234 0.4006543
## GDP -0.19928640 -0.3423374
## Social.support -0.11494585 -0.2032070
## Life.exp -0.16175028 -0.3643735
## Freedom 0.16943737 -0.4013630
## generosity 1.00000000 -0.1639617
## Corruption -0.16396173 1.0000000
corrplot(M)
First, we split the data into a training and validation set and star our analysis
Then, We choose to use North America and ANZ as a reference group for comparision
str(hap$Region)
## Factor w/ 10 levels "Central and Eastern Europe",..: 10 10 10 10 10 10 10 10 6 10 ...
levels(hap$Region)
## [1] "Central and Eastern Europe" "Commonwealth of Independent States"
## [3] "East Asia" "Latin America and Caribbean"
## [5] "Middle East and North Africa" "North America and ANZ"
## [7] "South Asia" "Southeast Asia"
## [9] "Sub-Saharan Africa" "Western Europe"
hap$Region<-relevel(hap$Region, ref="North America and ANZ")
m1 <- lm(Ladder ~ ., data = train[,-1])
print(m1)
##
## Call:
## lm(formula = Ladder ~ ., data = train[, -1])
##
## Coefficients:
## (Intercept)
## 297.8030
## RegionCommonwealth of Independent States
## 25.7677
## RegionEast Asia
## 11.8555
## RegionLatin America and Caribbean
## -9.0051
## RegionMiddle East and North Africa
## 24.5459
## RegionNorth America and ANZ
## -6.8523
## RegionSouth Asia
## 40.0499
## RegionSoutheast Asia
## 28.3348
## RegionSub-Saharan Africa
## 22.9253
## RegionWestern Europe
## -5.7477
## GDP
## -12.7875
## Social.support
## -75.4024
## Life.exp
## 0.0648
## Freedom
## -90.8450
## generosity
## -8.6442
## Corruption
## 22.5704
Looking at the null model, which is just the intercept (mean) of the outcome variable.
null <- lm(Ladder ~ 1, data = train[,-1])
print(null)
##
## Call:
## lm(formula = Ladder ~ 1, data = train[, -1])
##
## Coefficients:
## (Intercept)
## 79.96
step(m1)
## Start: AIC=682.56
## Ladder ~ Region + GDP + Social.support + Life.exp + Freedom +
## generosity + Corruption
##
## Df Sum of Sq RSS AIC
## - Life.exp 1 3.0 38968 680.57
## - generosity 1 138.0 39103 680.95
## <none> 38965 682.56
## - Corruption 1 876.2 39841 683.03
## - Social.support 1 2347.2 41312 687.05
## - GDP 1 4223.1 43188 691.98
## - Freedom 1 5744.3 44709 695.82
## - Region 9 17659.0 56624 706.05
##
## Step: AIC=680.57
## Ladder ~ Region + GDP + Social.support + Freedom + generosity +
## Corruption
##
## Df Sum of Sq RSS AIC
## - generosity 1 145.1 39113 678.98
## <none> 38968 680.57
## - Corruption 1 897.6 39866 681.10
## - Social.support 1 2352.7 41321 685.07
## - GDP 1 5008.9 43977 691.99
## - Freedom 1 5748.4 44717 693.84
## - Region 9 19488.8 58457 707.58
##
## Step: AIC=678.98
## Ladder ~ Region + GDP + Social.support + Freedom + Corruption
##
## Df Sum of Sq RSS AIC
## <none> 39113 678.98
## - Corruption 1 970.9 40084 679.70
## - Social.support 1 2302.5 41416 683.33
## - GDP 1 4889.5 44003 690.06
## - Freedom 1 6111.8 45225 693.10
## - Region 9 19392.6 58506 705.68
##
## Call:
## lm(formula = Ladder ~ Region + GDP + Social.support + Freedom +
## Corruption, data = train[, -1])
##
## Coefficients:
## (Intercept)
## 296.039
## RegionCommonwealth of Independent States
## 26.090
## RegionEast Asia
## 11.988
## RegionLatin America and Caribbean
## -8.385
## RegionMiddle East and North Africa
## 24.992
## RegionNorth America and ANZ
## -8.064
## RegionSouth Asia
## 39.925
## RegionSoutheast Asia
## 27.068
## RegionSub-Saharan Africa
## 23.161
## RegionWestern Europe
## -5.987
## GDP
## -12.128
## Social.support
## -74.255
## Freedom
## -92.682
## Corruption
## 23.077
m1.step <- lm(formula = Ladder ~ Region + GDP + Social.support + Freedom + Corruption, data = train[, -1])
Here we get the best performing model measuring by AIC. The predictor includes Region, GDP, Social.support, Freedom and Corruption.
plot(m1.step)
Overall the residuals seem normally distributed and most assumptions seem to be maintained. However, the Scale-Location plot does show some heteroskedasticity, where the variance increases a bit around the center.
Here are some plots to initiate our hypothesis
ggplot(train[1:10,], aes(x = reorder(Country, Ladder), y = Ladder, fill = GDP)) +
geom_bar(stat = "identity")+
theme(axis.text.x=element_text(angle=45, hjust=1))+
labs(title = "Top Ten Happiest Countries") +
ylab("Ladder Score")+
xlab("Coutries")
ggplot(train[], aes(Ladder,GDP)) +
geom_point()
ggplot(train[], aes(Ladder,Freedom)) +
geom_point()
ggplot(train[], aes(GDP,Freedom)) +
geom_point()
ggplot(data=train[1:30,], aes(x=GDP, y=Ladder))+
geom_line(size=2, aes(color=Freedom))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
sd(hap$Freedom)+mean(hap$Freedom)
## [1] 0.9049291
sd(hap$Freedom)-mean(hap$Freedom)
## [1] -0.6782655
high.Free <- which(hap[,"Freedom"] >=0.9)
low.Free <- which(hap[,"Freedom"] <=.7)
sd(hap$GDP)+mean(hap$GDP)
## [1] 10.59081
sd(hap$GDP)-mean(hap$GDP)
## [1] -8.273607
high.GDP <- which(hap[,"GDP"] >= 10.6)
low.GDP <- which(hap[,"GDP"] <= 8.3)
ggplot(hap[high.GDP,], aes(x=Freedom, y=Ladder))+
geom_line(size=2, aes(color=GDP))
ggplot(hap[low.GDP,], aes(x=Freedom, y=Ladder))+
geom_line(size=2, aes(color=GDP))
It seems like GDP and Freedom may have some interaction.
#interactions
int.mod <- lm(formula = Ladder ~ Region + GDP + Social.support + Freedom + Corruption + GDP*Freedom, data = train[, -1])
summary(int.mod)
##
## Call:
## lm(formula = Ladder ~ Region + GDP + Social.support + Freedom +
## Corruption + GDP * Freedom, data = train[, -1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.698 -14.833 -1.247 12.952 60.233
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.568 118.105 0.733 0.465360
## RegionCommonwealth of Independent States 24.255 9.820 2.470 0.015278
## RegionEast Asia 9.014 12.124 0.743 0.459011
## RegionLatin America and Caribbean -7.640 8.430 -0.906 0.367034
## RegionMiddle East and North Africa 25.497 8.540 2.986 0.003591
## RegionNorth America and ANZ -4.380 14.754 -0.297 0.767203
## RegionSouth Asia 39.686 11.333 3.502 0.000703
## RegionSoutheast Asia 27.556 9.766 2.822 0.005808
## RegionSub-Saharan Africa 24.989 9.000 2.776 0.006607
## RegionWestern Europe -3.752 8.908 -0.421 0.674584
## GDP 11.254 13.051 0.862 0.390680
## Social.support -71.552 30.724 -2.329 0.021962
## Freedom 178.889 148.100 1.208 0.230055
## Corruption 17.588 14.982 1.174 0.243319
## GDP:Freedom -29.933 16.117 -1.857 0.066341
##
## (Intercept)
## RegionCommonwealth of Independent States *
## RegionEast Asia
## RegionLatin America and Caribbean
## RegionMiddle East and North Africa **
## RegionNorth America and ANZ
## RegionSouth Asia ***
## RegionSoutheast Asia **
## RegionSub-Saharan Africa **
## RegionWestern Europe
## GDP
## Social.support *
## Freedom
## Corruption
## GDP:Freedom .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.83 on 96 degrees of freedom
## Multiple R-squared: 0.8178, Adjusted R-squared: 0.7912
## F-statistic: 30.77 on 14 and 96 DF, p-value: < 2.2e-16
We found that the slope of Freedom changes for every one increase in GDP, and as GDP increases, slope of freedom decreases. Looking at the p-value, the value of both GDP and Freedom indicate they’re insignificant after we adding the interaction term, but the interaction effect is significant now. At the same time, the p-value for Corruption indicates it’s insignificant.
final_model <- lm(formula = Ladder ~ Region + GDP + Social.support + Freedom + GDP*Freedom, data = train[, -1])
print(final_model)
##
## Call:
## lm(formula = Ladder ~ Region + GDP + Social.support + Freedom +
## GDP * Freedom, data = train[, -1])
##
## Coefficients:
## (Intercept)
## 81.853
## RegionCommonwealth of Independent States
## 20.237
## RegionEast Asia
## 5.442
## RegionLatin America and Caribbean
## -9.024
## RegionMiddle East and North Africa
## 23.026
## RegionNorth America and ANZ
## -11.429
## RegionSouth Asia
## 38.910
## RegionSoutheast Asia
## 25.842
## RegionSub-Saharan Africa
## 22.772
## RegionWestern Europe
## -7.585
## GDP
## 13.434
## Social.support
## -63.324
## Freedom
## 204.900
## GDP:Freedom
## -33.665
plot(final_model)
cor(hap$GDP, hap$Freedom)
## [1] 0.4323235
cor(hap$GDP, hap$Ladder)
## [1] -0.7969875
The correlations between GDP and Ladder Score seemed to be very strong, and the initial regression model would suggest that GDP is a powerful predictor of the Happiness Ladder Score, but after adapting interaction term and refiting model, we found that the relationship between Freedom and GDP couldn’t be ignored: these two seemed to influence each other and we couldn’t eliminate either of them.