Instead of relying on linear regression alone, we want to see whether other models can help us predict the happiness score. Here we apply principal component analysis (PCA) and principal component regression (PCR) to predict the 2021 ladder score, and then compare the PCR model with our regression model to see which performs best.
First we filter the data to 2021 and keep all the predictors we will use to compute the principal components.
happy_meta <- read_csv("./Data/happiness.csv")
## Rows: 1816 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country_name, regional_indicator
## dbl (8): year, ladder_score, logged_gdp_per_capita, social_support, healthy_...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
happy21 <- happy_meta %>%
filter(year == 2021) %>%
select(-year)
head(happy21) %>% knitr::kable()
country_name | regional_indicator | ladder_score | logged_gdp_per_capita | social_support | healthy_life_expectancy | freedom_to_make_life_choices | generosity | perceptions_of_corruption |
---|---|---|---|---|---|---|---|---|
Afghanistan | South Asia | 2.523 | 7.695 | 0.463 | 52.493 | 0.382 | -0.102 | 0.924 |
Albania | Central and Eastern Europe | 5.117 | 9.520 | 0.697 | 68.999 | 0.785 | -0.030 | 0.901 |
Algeria | Middle East and North Africa | 4.887 | 9.342 | 0.802 | 66.005 | 0.480 | -0.067 | 0.752 |
Argentina | Latin America and Caribbean | 5.929 | 9.962 | 0.898 | 69.000 | 0.828 | -0.182 | 0.834 |
Armenia | Commonwealth of Independent States | 5.283 | 9.487 | 0.799 | 67.055 | 0.825 | -0.168 | 0.629 |
Australia | North America and ANZ | 7.183 | 10.796 | 0.940 | 73.900 | 0.914 | 0.159 | 0.442 |
Second, we split the data into a training set (75% of the observations) and a testing set (25% of the observations).
set.seed(1)
train_id <- sample(seq_len(nrow(happy21)), size = floor(0.75*nrow(happy21)))
train_set <- happy21[train_id,]
test_set <- happy21[-train_id,]
Then we run principal component analysis on the training set.
X <- model.matrix(ladder_score ~ logged_gdp_per_capita + social_support +
                    healthy_life_expectancy + freedom_to_make_life_choices +
                    generosity + perceptions_of_corruption,
                  data = train_set)[, -1]
happy_PCA <- prcomp(X, center = TRUE, scale = TRUE)
summary(happy_PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.8067 1.1347 0.7811 0.70345 0.48008 0.33580
## Proportion of Variance 0.5441 0.2146 0.1017 0.08247 0.03841 0.01879
## Cumulative Proportion 0.5441 0.7586 0.8603 0.94279 0.98121 1.00000
plot(happy_PCA, main = "Scree Plot")
From the scree plot and the PCA summary, we see that the first three components explain most of the variance in the data (86.03% cumulatively). Let's look at the loadings of each principal component.
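As a quick sanity check, the proportions reported by `summary(prcomp)` can be recomputed from the component standard deviations that `prcomp` returns. This is a sketch on a small simulated matrix (`sim_X` and `sim_pca` are illustrative names, not objects from the analysis above):

```r
# Sketch: recovering the "Proportion of Variance" and "Cumulative Proportion"
# rows of summary(prcomp) from the component standard deviations.
set.seed(1)
sim_X <- matrix(rnorm(100 * 4), ncol = 4)     # simulated stand-in data
sim_pca <- prcomp(sim_X, center = TRUE, scale = TRUE)

pve <- sim_pca$sdev^2 / sum(sim_pca$sdev^2)   # proportion of variance per PC
round(cumsum(pve), 4)                         # cumulative proportion
```

Applied to `happy_PCA`, the same two lines would reproduce the 0.5441, 0.7586, 0.8603, … row shown above.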
happy_PCA$rotation
## PC1 PC2 PC3 PC4
## logged_gdp_per_capita 0.50557942 -0.1963624 0.03593761 -0.29956544
## social_support 0.48591366 -0.1257005 -0.33392130 -0.09087997
## healthy_life_expectancy 0.50017876 -0.1503781 0.02985716 -0.29725350
## freedom_to_make_life_choices 0.40245817 0.2799679 -0.31503571 0.77439435
## generosity -0.03310161 0.7821098 -0.41124119 -0.45908163
## perceptions_of_corruption -0.30826981 -0.4826488 -0.78609591 -0.05656135
## PC5 PC6
## logged_gdp_per_capita 0.11097337 -0.77619921
## social_support -0.74850206 0.26090102
## healthy_life_expectancy 0.56834485 0.56119575
## freedom_to_make_life_choices 0.22964237 -0.08930672
## generosity 0.06841811 -0.05149951
## perceptions_of_corruption 0.21678801 -0.06226456
Let’s visualize the results on a biplot.
biplot <- ggbiplot(happy_PCA,
obs.scale = 1,
var.scale = 1,
groups = train_set$regional_indicator,
ellipse = TRUE,
ellipse.prob = 0.60)
biplot <- biplot + scale_color_discrete(name = '')
biplot <- biplot + theme(legend.direction = 'horizontal',
legend.position = 'top')
biplot
Data points that sit close to each other in the plot have similar patterns across the predictors. Countries from the same region tend to cluster together, meaning their predictor values are similar. Furthermore, the predictors social support, healthy life expectancy, and logged GDP per capita are highly positively correlated with each other (their vectors form very small angles in the plot). Freedom to make life choices is also fairly correlated with these three. This collinearity may be a concern when we build our regression model.
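The reading of small angles as high correlation can be checked on simulated data. This is a sketch; `x1`, `x2`, and `x3` are made-up variables, not the happiness predictors:

```r
# Sketch: in a PCA of standardized variables, two variables whose loading
# vectors form a small angle in the biplot are highly correlated.
set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.3)   # strongly correlated with x1
x3 <- rnorm(200)                  # roughly independent of both
sim_pca <- prcomp(cbind(x1, x2, x3), scale = TRUE)
sim_pca$rotation[, 1:2]           # x1 and x2 load almost identically
cor(x1, x2)                       # high correlation matches the small angle
```

With the happiness data, running `cor()` on the predictor columns would confirm the same pattern for social support, healthy life expectancy, and logged GDP per capita directly.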
Now we fit a principal component regression (PCR) and compare it with the linear regression model to see which is better at predicting the 2021 ladder score.
happy_pcr <- pcr(ladder_score ~ logged_gdp_per_capita + social_support +
                   healthy_life_expectancy + freedom_to_make_life_choices +
                   generosity + perceptions_of_corruption,
                 data = train_set, scale = TRUE, validation = "CV")
summary(happy_pcr)
## Data: X dimension: 111 6
## Y dimension: 111 1
## Fit method: svdpc
## Number of components considered: 6
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 1.095 0.4844 0.4880 0.4938 0.4972 0.5008 0.4971
## adjCV 1.095 0.4838 0.4872 0.4927 0.4959 0.4994 0.4953
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 54.40 75.86 86.03 94.28 98.12 100.00
## ladder_score 80.68 80.85 80.97 81.03 81.06 81.68
validationplot(happy_pcr, main = "Test Error")
From the plot, the cross-validated error is lowest with a single principal component, so we use one component to predict the out-of-sample observations. We now generate predictions from the PCR model at the optimal value \(k = 1\) and compute the test MSE and train MSE.
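Rather than reading the minimum off the plot, the cross-validated RMSEP can also be extracted programmatically with `RMSEP()` from the pls package. A self-contained sketch with simulated data (`sim_df`, `fit`, and `best_k` are illustrative names):

```r
library(pls)

# Sketch: pick the number of components that minimizes cross-validated RMSEP.
set.seed(1)
sim_df <- data.frame(matrix(rnorm(100 * 5), ncol = 5))
sim_df$y <- sim_df$X1 + 0.5 * sim_df$X2 + rnorm(100)
fit <- pcr(y ~ ., data = sim_df, scale = TRUE, validation = "CV")

cv_rmsep <- RMSEP(fit, estimate = "CV")$val
best_k <- which.min(cv_rmsep[1, 1, -1])  # drop the intercept-only entry
best_k
```

The pls package also provides `selectNcomp()` for a one-standard-error style choice, which tends to favor fewer components.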
happy21_test <- predict(happy_pcr, newdata = test_set, ncomp = 1)
PCRTestMSE <- mean((happy21_test - test_set$ladder_score)^2)
PCRTestMSE
## [1] 0.4701414
The test MSE is about 0.47.
happy21_train <- predict(happy_pcr, newdata = train_set, ncomp = 1)
PCRTrainMSE <- mean((happy21_train - train_set$ladder_score)^2)
PCRTrainMSE
## [1] 0.2273465
The train MSE is about 0.23.
We compare these with the best linear regression model found previously by computing its test MSE and train MSE in the same way.
best_reg <- lm(ladder_score ~ regional_indicator + logged_gdp_per_capita + social_support + freedom_to_make_life_choices + logged_gdp_per_capita * freedom_to_make_life_choices, data = train_set)
happy21_regtest <- predict(best_reg, newdata = test_set)
RegTestMSE <- mean((happy21_regtest - test_set$ladder_score)^2)
RegTestMSE
## [1] 0.3817901
The test MSE is about 0.38.
happy21_regtrain <- predict(best_reg, newdata = train_set)
RegTrainMSE <- mean((happy21_regtrain - train_set$ladder_score)^2)
RegTrainMSE
## [1] 0.1802655
The train MSE is about 0.18.
As a result, we conclude that in this scenario the linear regression model found earlier does a better job of predicting the ladder score than the PCR model.