Descriptive Statistics

First Let’s take a peak of our dataset:

happiness_df %>% head() %>% knitr::kable()

country_name	regional_indicator	year	ladder_score	logged_gdp_per_capita	social_support	healthy_life_expectancy	freedom_to_make_life_choices	generosity	perceptions_of_corruption
Afghanistan	South Asia	2021	2.523	7.695	0.463	52.493	0.382	-0.102	0.924
Afghanistan	South Asia	2019	2.375	7.697	0.420	52.400	0.394	-0.108	0.924
Afghanistan	South Asia	2018	2.694	7.692	0.508	52.600	0.374	-0.094	0.928
Afghanistan	South Asia	2017	2.662	7.697	0.491	52.800	0.427	-0.121	0.954
Afghanistan	South Asia	2016	4.220	7.697	0.559	53.000	0.523	0.042	0.793
Afghanistan	South Asia	2015	3.983	7.702	0.529	53.200	0.389	0.080	0.881

The dataset has 1816 observations and 10 variables, Let’s take a deeper look into each variable:

EDA of each variable

Country name

happiness_df %>%
  group_by(country_name) %>%
  summarize(
    n_obs = n()
  ) %>%
  group_by(n_obs) %>%
  summarize(
    number_of_countries = n()
  ) %>%
  knitr::kable()

n_obs	number_of_countries
1	6
3	2
4	6
5	3
6	2
7	4
8	3
9	6
10	8
11	4
12	7
13	12
14	31
15	31
16	24

There are 149 unique countries in this dataset. We can see that 6 countries appeared only 1 time, while 24 countries appeared 16 times. The countries that appeared only 1 time are China, Hong Kong, Kosovo, Maldives, North Cyprus, and Turkmenistan.

Regional indicator

There are 10 unique regions in this dataset. The regional indicators are South Asia, Central and Eastern Europe, Middle East and North Africa, Latin America and Caribbean, Commonwealth of Independent States, North America and ANZ, Western Europe, Sub-Saharan Africa, Southeast Asia, East Asia.

happiness_df %>%
  group_by(regional_indicator) %>%
  summarize(
    n_obs = n()
  ) %>%
  mutate(regional_indicator = fct_reorder(regional_indicator, n_obs)) %>%
  ggplot(aes(x = regional_indicator, y = n_obs)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  labs(
    x = "regional indicator",
    y = "number of observations",
    title = "Number of observations for each region"
  )

East Asia has the lowest number of observations, while Sub-Saharan Africa has the highest.

Year

happiness_df %>%
  group_by(year) %>%
  summarize(
    n_obs = n()
  ) %>%
  ggplot(aes(x = factor(year), y = n_obs)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  labs(
    x = "year",
    y = "number of observations",
    title = "Number of observations for each year"
  )

Year ranges from 2005 to 2021 in this dataset. The year with the most observation is 2021, and the year with the least observation is 2005, which only has 1 observation.

Ladder score

happiness_df %>%
  ggplot(aes(x = ladder_score)) +
  geom_histogram(bins = 20) +
  labs(title = "Distribution of ladder score")

Ladder score ranges from 2.375 to 7.971. The distribution is approximately symmetric with mean at 5.470168.

Since ladder score is the main outcome of our project, we should dive a little deeper by looking at the mean ladder score for each region and year.

happiness_df %>%
  group_by(regional_indicator) %>%
  summarize(
    mean_ladder_score = mean(ladder_score)
  ) %>%
  arrange(mean_ladder_score) %>%
  knitr::kable()

regional_indicator	mean_ladder_score
Sub-Saharan Africa	4.317518
South Asia	4.512207
Commonwealth of Independent States	5.180830
Middle East and North Africa	5.271464
Southeast Asia	5.368957
Central and Eastern Europe	5.550598
East Asia	5.682980
Latin America and Caribbean	5.988366
Western Europe	6.871429
North America and ANZ	7.246328

Sub-Saharan Africa has the lowest mean ladder score across all years, while North America and ANZ has the highest.

happiness_df %>%
  group_by(year) %>%
  summarize(
    mean_ladder_score = mean(ladder_score)
  ) %>%
  knitr::kable()

year	mean_ladder_score
2005	7.418000
2006	5.183946
2007	5.367533
2008	5.437252
2009	5.480366
2010	5.546394
2011	5.411216
2012	5.446638
2013	5.379185
2014	5.396512
2015	5.404903
2016	5.418459
2017	5.505721
2018	5.542738
2019	5.559349
2020	5.873580
2021	5.532839

Year 2005 has significantly higher ladder score than the others. However, because year 2005 only has one observation, the difference should not be considered significant. Ladder scores for all other years are around 5.5.

Logged GDP per capita

happiness_df %>%
  ggplot(aes(x = logged_gdp_per_capita)) +
  geom_histogram(bins = 15) +
  labs(title = "Distribution of logged GDP per capita")

Logged GDP per capita ranges from 6.635 to 11.648. The distribution is approximately symmetric with mean at 9.3522203.

Social support

happiness_df %>%
  ggplot(aes(x = social_support)) +
  geom_histogram(bins = 15) +
  labs(title = "Distribution of social support")

Social support ranges from 0.291 to 0.987. Since the distribution is skewed right, we will look at the median instead of the mean. The median of social support is 0.836.

Healthy life expectancy

happiness_df %>%
  ggplot(aes(x = healthy_life_expectancy)) +
  geom_histogram(bins = 15) +
  labs(title = "Distribution of healthy life expectancy")

Healthy life expectancy ranges from 32.3 to 77.1. The distribution is skewed right. The median of healthy life expectancy is 65.3015.

Freedom to make life choices

happiness_df %>%
  ggplot(aes(x = freedom_to_make_life_choices)) +
  geom_histogram(bins = 15) +
  labs(title = "Distribution of freedom to make life choices")

Freedom to make life choices ranges from 0.258 to 0.985. The distribution is skewed right. The median of freedom to make life choices is 0.766.

Generosity

happiness_df %>%
  ggplot(aes(x = generosity)) +
  geom_histogram(bins = 15) +
  labs(title = "Distribution of generosity")

Generosity ranges from -0.335 to 0.689. The distribution is approximately symmetric with mean at -0.0031773.

Perceptions of corruption

happiness_df %>%
  ggplot(aes(x = perceptions_of_corruption)) +
  geom_histogram(bins = 15) +
  labs(title = "Distribution of perceptions of corruption")

Perceptions of corruption ranges from 0.035 to 0.983. The distribution is skewed right. The median of perceptions of corruption is 0.804.

Summary

Finally, let’s take a look at the summary tables of mean and standard deviation of each variable.

happiness_df %>%
  select(ladder_score, logged_gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption) %>%
  tbl_summary(missing_text = "(Missing)",
              statistic = list(all_continuous() ~ "{mean} ({sd})"),
              label = list(ladder_score ~ "Ladder score",
                           logged_gdp_per_capita ~ "Logged gdp per capita",
                           social_support ~ "Social support",
                           healthy_life_expectancy ~ "Healthy life exp",
                           freedom_to_make_life_choices ~ "Freedom to make life choices",
                           generosity ~ "Generosity",
                           perceptions_of_corruption ~ "Perceiptions of corruption")) %>% 
  bold_labels() %>%
  italicize_levels()

Characteristic	N = 1,816¹
Ladder score	5.47 (1.13)
Logged gdp per capita	9.35 (1.15)
Social support	0.81 (0.12)
Healthy life exp	64 (8)
Freedom to make life choices	0.75 (0.14)
Generosity	0.00 (0.16)
Perceiptions of corruption	0.75 (0.19)
¹ Mean (SD)

happiness_df %>%
  select(regional_indicator, ladder_score, logged_gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption) %>%
  tbl_summary(by = "regional_indicator",
              missing_text = "(Missing)",
              statistic = list(all_continuous() ~ "{mean} ({sd})"),
              label = list(ladder_score ~ "Ladder score",
                           logged_gdp_per_capita ~ "Logged gdp per capita",
                           social_support ~ "Social support",
                           healthy_life_expectancy ~ "Healthy life exp",
                           freedom_to_make_life_choices ~ "Freedom to make life choices",
                           generosity ~ "Generosity",
                           perceptions_of_corruption ~ "Perceiptions of corruption")) %>% 
  bold_labels() %>%
  italicize_levels()

Characteristic	Central and Eastern Europe, N = 224¹	Commonwealth of Independent States, N = 165¹	East Asia, N = 49¹	Latin America and Caribbean, N = 292¹	Middle East and North Africa, N = 138¹	North America and ANZ, N = 58¹	South Asia, N = 87¹	Southeast Asia, N = 116¹	Sub-Saharan Africa, N = 421¹	Western Europe, N = 266¹
Ladder score	5.55 (0.63)	5.18 (0.62)	5.68 (0.51)	5.99 (0.76)	5.27 (0.96)	7.25 (0.18)	4.51 (0.69)	5.37 (0.75)	4.32 (0.64)	6.87 (0.69)
Logged gdp per capita	10.03 (0.36)	9.28 (0.66)	10.17 (0.64)	9.33 (0.62)	9.61 (0.76)	10.76 (0.14)	8.33 (0.55)	9.26 (0.96)	7.99 (0.76)	10.79 (0.28)
Social support	0.86 (0.08)	0.82 (0.11)	0.87 (0.06)	0.85 (0.07)	0.77 (0.10)	0.94 (0.02)	0.65 (0.13)	0.81 (0.07)	0.71 (0.11)	0.92 (0.05)
Healthy life exp	67 (2)	63 (2)	70 (6)	66 (5)	65 (4)	72 (2)	60 (5)	64 (5)	53 (5)	72 (1)
Freedom to make life choices	0.69 (0.14)	0.70 (0.13)	0.71 (0.09)	0.78 (0.11)	0.64 (0.13)	0.90 (0.04)	0.69 (0.17)	0.87 (0.07)	0.70 (0.11)	0.84 (0.12)
Generosity	-0.10 (0.10)	-0.12 (0.14)	-0.04 (0.12)	-0.03 (0.12)	-0.06 (0.13)	0.23 (0.08)	0.08 (0.11)	0.20 (0.22)	-0.01 (0.11)	0.06 (0.17)
Perceiptions of corruption	0.87 (0.09)	0.78 (0.16)	0.79 (0.11)	0.79 (0.09)	0.77 (0.13)	0.45 (0.16)	0.81 (0.07)	0.73 (0.25)	0.79 (0.13)	0.56 (0.24)
¹ Mean (SD)

happiness_df %>%
  select(year, ladder_score, logged_gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption) %>%
  filter(year != 2005) %>%
  tbl_summary(by = "year",
              missing_text = "(Missing)",
              statistic = list(all_continuous() ~ "{mean} ({sd})"),
              label = list(ladder_score ~ "Ladder score",
                           logged_gdp_per_capita ~ "Logged gdp per capita",
                           social_support ~ "Social support",
                           healthy_life_expectancy ~ "Healthy life exp",
                           freedom_to_make_life_choices ~ "Freedom to make life choices",
                           generosity ~ "Generosity",
                           perceptions_of_corruption ~ "Perceiptions of corruption")) %>% 
  bold_labels() %>%
  italicize_levels()

Characteristic	2006, N = 74¹	2007, N = 92¹	2008, N = 99¹	2009, N = 101¹	2010, N = 109¹	2011, N = 125¹	2012, N = 116¹	2013, N = 119¹	2014, N = 123¹	2015, N = 124¹	2016, N = 122¹	2017, N = 129¹	2018, N = 126¹	2019, N = 126¹	2020, N = 81¹	2021, N = 149¹
Ladder score	5.18 (1.11)	5.37 (1.02)	5.44 (1.20)	5.48 (1.06)	5.55 (1.15)	5.41 (1.10)	5.45 (1.16)	5.38 (1.20)	5.40 (1.21)	5.40 (1.15)	5.42 (1.15)	5.51 (1.13)	5.54 (1.09)	5.56 (1.13)	5.87 (1.00)	5.53 (1.07)
Logged gdp per capita	9.05 (1.22)	9.17 (1.11)	9.16 (1.25)	9.29 (1.14)	9.42 (1.11)	9.25 (1.16)	9.34 (1.10)	9.34 (1.11)	9.36 (1.15)	9.36 (1.17)	9.40 (1.16)	9.39 (1.16)	9.40 (1.18)	9.42 (1.15)	9.75 (0.95)	9.43 (1.16)
Social support	0.83 (0.12)	0.81 (0.11)	0.78 (0.15)	0.81 (0.11)	0.83 (0.10)	0.81 (0.13)	0.81 (0.12)	0.81 (0.11)	0.81 (0.12)	0.80 (0.13)	0.82 (0.12)	0.81 (0.12)	0.81 (0.12)	0.81 (0.12)	0.84 (0.11)	0.81 (0.11)
Healthy life exp	60 (9)	61 (8)	61 (9)	63 (8)	63 (8)	62 (8)	63 (7)	63 (7)	64 (7)	64 (7)	64 (7)	64 (7)	65 (7)	65 (7)	67 (6)	65 (7)
Freedom to make life choices	0.73 (0.13)	0.69 (0.15)	0.68 (0.16)	0.68 (0.15)	0.71 (0.15)	0.73 (0.15)	0.71 (0.15)	0.73 (0.13)	0.74 (0.14)	0.75 (0.13)	0.76 (0.13)	0.78 (0.13)	0.78 (0.12)	0.79 (0.12)	0.82 (0.09)	0.79 (0.11)
Generosity	0.00 (0.17)	0.01 (0.17)	0.02 (0.17)	-0.02 (0.16)	0.00 (0.16)	-0.02 (0.17)	0.00 (0.17)	0.00 (0.17)	0.02 (0.16)	0.02 (0.17)	0.00 (0.16)	-0.01 (0.16)	-0.02 (0.16)	-0.02 (0.15)	0.00 (0.14)	-0.02 (0.15)
Perceiptions of corruption	0.76 (0.18)	0.80 (0.17)	0.77 (0.19)	0.78 (0.18)	0.77 (0.18)	0.76 (0.18)	0.77 (0.18)	0.76 (0.18)	0.74 (0.19)	0.74 (0.20)	0.75 (0.19)	0.72 (0.18)	0.73 (0.19)	0.72 (0.19)	0.71 (0.20)	0.73 (0.18)
¹ Mean (SD)

Note: Since year 2005 only has one observation, it is removed from this summary table.