View on GitHub

project-1-mr_palmer_s_penguins

STA/ISS 313 - Project 1

Beyond the Rings: Analyzing Olympic Athlete and Medal-Winning Country Characteristics ================ by Mr. Palmer’s Penguins

Introduction

The set, which includes 15 variables and 271116 observations, provides insight into the athletes that competed in and the results of the Olympic games from Athens 1896 to Rio 2016. Each observation includes information about an athlete on a per event basis. Therefore, if an athlete competes in multiple different events, there will be multiple different observations reflecting each event an individual participated in. Each observation details medal results (Gold, Silver, Bronze, or NA), the athlete’s background (age, sex, height, weight, team they’re competing for, etc.), and context about the event (where and when it was held, season, sport the athlete competes in, etc.). We chose this dataset because we think it can provide a lot of good insights into some of the trends at the Olympics over the past century.

This dataset was accessed on kaggle.com and includes data scraped in May 2018 from Sports Reference/OlympStats, sports statistics sites, by Randi Griffin, a Senior Data Scientist at Boston Consulting Group. Moreover, it was selected as part of the TidyTuesday challenge on 7/27/21.

Understanding Physical Characteristics of Olympians

Introduction

For this section, we hope to answer the question: do certain physical characteristics (such as body mass index with height and weight, age, and sex) of Olympic participants differ by sport and/or change with time (year)?

The Olympic Games are a unique opportunity for diverse and talented individuals to showcase their skills across a variety of disciplines. As a result, we are interested in better understanding how the diversity of physical characteristics in the set of competing athletes potentially differs by sport and/or has changed with time. Three physical characteristics of interest are body mass index (a factor that takes height and weight into account), sex, and age. It’s often the case that the media/press discuss these physical characteristics as key metrics when describing athletes. Therefore, we are curious about better understanding the extent to which these differ (if at all).

Firstly, we are hoping to understand if the optimal build of an Olympic athlete differs by sex and/or by sport. For example, we suspect that a high performing male Olympic wrestler is more likely to have a different build than a typical female gymnast. Therefore, we hypothesize that there may be different optimal heights and weights for successful athletes of different sexes and depending on the sport.

Secondly, we will assess if the mean age of Olympic participants has changed over time for each sex. In the 2020 Olympics, there was a lot of media attention on younger Olympians. Therefore, we are curious how the mean age of participants has changed over time. With age being a key physical characteristic, we think this will add another key dimension to gaining a better understanding of the athletes participating.

Approach

For our first plot, we plan to use boxplots to plot the distribution of a size characteristic by sex (since characteristics tend to differ among males and females) and faceting by sport. We are unsure which of the height, weight, height / weight, or body mass index ratio (a measure that integrates height and weight into a single metric) will be most telling when examined across sexes and sports. We will start by making 4 plots (each with one of these on the y-axis) and then select the plot where the trends are most insightful. Based on the results from these preliminary plots, in addition to narrowing in on a size characteristic, we might pick a few select sports to focus in on (i.e. most popular sports) or group the sports based on common characteristics (ex: contact vs. non-contact, etc.) since there are 66 sports in the dataset. We think a boxplot makes the most sense because they not only make it is easy to visualize distributions across a range of divisions but they also best control for outliers. Boxplots add outliers as points which is a better approach than ridge lines and violins which look relatively flat with a wide spread distribution.

For our second plot, we are going to make a scatter plot of average age vs. year color mapped by sex. To start, we will find the average age of the participants by sex for each year. This will allow us to make a scatter plot with only two observations (one for Male and one for Female) for each year. We choose a scatter plot because we believe it can best represent the relationship and highlight potential trends. We considered a line plot but ultimately decided that the discreteness of each of the points (rather than them being connected) would better allow us to compare observations year by year. Since there is unique context that surrounds each Olympics game, we feel as though a more year-specific (point based) approach could be more meaningful.

Analysis

In order to plot using BMI, we must first create the variable and select our font:

# Create new calculated variables to represent physical build
olympics <- olympics %>%
  mutate(
    height_weight_ratio = height / weight,
    BMI = 10000 * weight / (height * height)
  )

# load Olympic font
font_add_google(name = "Oswald")
showtext_auto()

After examining each of height, weight, height_weight_ratio, and BMI, we have chosen to proceed with BMI as our size characteristic for this plot, as the visualizations were the most accessible and intuitive to interpret. After examining the plot displaying the BMI for all sports, we are choosing a few select groups of sports to include based on common characteristics of each of them. The next step is to create separate dataframes for each of these categories. We also want to represent each athlete once on the plot, so we will have to use distinct().

# create variable to represent category of sport
# keep only one observation per athlete
olympics <- olympics %>%
  distinct(id, .keep_all = TRUE) %>%
  mutate(sex = factor(sex, labels = c("Female", "Male"))) %>%
  filter(sport %in% c("Boxing", "Judo", "Weightlifting", "Wrestling", "Gymnastics", "Trampolining", "Archery", "Shooting", "Athletics", "Rowing")) %>%
  mutate(
    category = case_when(
      sport %in% c("Boxing", "Judo", "Weightlifting", "Wrestling") ~ "weightclass",
      sport %in% c("Gymnastics", "Trampolining") ~ "acrobatic",
      sport %in% c("Archery", "Shooting") ~ "coordination",
      TRUE ~ "diverse"
    ),
    category = factor(category, levels = c("acrobatic", "diverse", "coordination", "weightclass"))
  )

# create separate dataframes for each of these categories
olympics_weightclass <- olympics %>%
  filter(category == "weightclass")

olympics_coordination <- olympics %>%
  filter(category == "coordination")

olympics_diverse <- olympics %>%
  filter(category == "diverse")

olympics_acrobatic <- olympics %>%
  filter(category == "acrobatic")

We are now ready to create a separate plot for each sport category. We will also silence messages which indicate that rows with NA have been dropped from the dataframe when creating the boxplots.

ggplot(olympics_weightclass, mapping = aes(y = fct_rev(sport), x = BMI, color = sex)) +
  geom_boxplot(position = position_dodge2(10)) +
  labs(
    x = "Body Mass Index (kg/m<sup>2</sup>)",
    y = NULL,
    color = "Sex",
    title = "Sex vs Body Mass Index Distribution",
    subtitle = "For athletes competing in weightclass-based Olympic sports", ,
    caption = "Source: Sports Reference & OlympStats\nCompiled by kaggle.com"
  ) +
  scale_color_manual(values = c("#7B38EC", "#5CC0AB")) +
  facet_grid(category ~ ., scales = "free_y", space = "free") +
  theme_minimal() +
  theme(
    strip.background = element_blank(),
    strip.text.y = element_blank(),
    panel.spacing.y = unit(0.4, "cm"),
    plot.title = element_text(family = "Oswald", color = "#092260", size = 20, hjust = 0.5),
    plot.caption = element_text(family = "Oswald", color = "#092260", size = 11),
    plot.subtitle = element_text(family = "Oswald", color = "#092260", size = 14, hjust = 0.5),
    axis.text.x = element_text(family = "Oswald", color = "#092260"),
    axis.title.x = element_markdown(family = "Oswald", color = "#092260", size = 14),
    axis.text.y = element_text(family = "Oswald", color = "#092260", size = 14),
    legend.text = element_text(family = "Oswald", color = "#092260", size = 14),
    legend.title = element_blank(),
    legend.position = c(0.9, 0.97),
    legend.background = element_rect(size = 0.3),
    legend.margin = margin(1, 5, 5, 5),
    legend.key.size = unit(0.5, "cm")
  )

ggplot(olympics_coordination, mapping = aes(y = fct_rev(sport), x = BMI, color = sex)) +
  geom_boxplot(position = position_dodge2(10)) +
  labs(
    x = "Body Mass Index (kg/m<sup>2</sup>)",
    y = NULL,
    color = "Sex",
    title = "Sex vs Body Mass Index Distribution",
    subtitle = "For athletes competing in coordination-based Olympic sports", ,
    caption = "Source: Sports Reference & OlympStats\nCompiled by kaggle.com"
  ) +
  scale_color_manual(values = c("#7B38EC", "#5CC0AB")) +
  facet_grid(category ~ ., scales = "free_y", space = "free") +
  theme_minimal() +
  theme(
    strip.background = element_blank(),
    strip.text.y = element_blank(),
    panel.spacing.y = unit(0.4, "cm"),
    plot.title = element_text(family = "Oswald", color = "#092260", size = 20, hjust = 0.5),
    plot.caption = element_text(family = "Oswald", color = "#092260", size = 11),
    plot.subtitle = element_text(family = "Oswald", color = "#092260", size = 14, hjust = 0.5),
    axis.text.x = element_text(family = "Oswald", color = "#092260"),
    axis.title.x = element_markdown(family = "Oswald", color = "#092260", size = 14),
    axis.text.y = element_text(family = "Oswald", color = "#092260", size = 14),
    legend.text = element_text(family = "Oswald", color = "#092260", size = 14),
    legend.title = element_blank(),
    legend.position = c(0.9, 0.97),
    legend.background = element_rect(size = 0.3),
    legend.margin = margin(1, 5, 5, 5),
    legend.key.size = unit(0.5, "cm")
  )

ggplot(olympics_diverse, mapping = aes(y = fct_rev(sport), x = BMI, color = sex)) +
  geom_boxplot(position = position_dodge2(10)) +
  labs(
    x = "Body Mass Index (kg/m<sup>2</sup>)",
    y = NULL,
    color = "Sex",
    title = "Sex vs Body Mass Index Distribution",
    subtitle = "For athletes competing in Olympic sports which encompass many skills", ,
    caption = "Source: Sports Reference & OlympStats\nCompiled by kaggle.com"
  ) +
  scale_color_manual(values = c("#7B38EC", "#5CC0AB")) +
  facet_grid(category ~ ., scales = "free_y", space = "free") +
  theme_minimal() +
  theme(
    strip.background = element_blank(),
    strip.text.y = element_blank(),
    panel.spacing.y = unit(0.4, "cm"),
    plot.title = element_text(family = "Oswald", color = "#092260", size = 20, hjust = 0.5),
    plot.caption = element_text(family = "Oswald", color = "#092260", size = 11),
    plot.subtitle = element_text(family = "Oswald", color = "#092260", size = 14, hjust = 0.5),
    axis.text.x = element_text(family = "Oswald", color = "#092260"),
    axis.title.x = element_markdown(family = "Oswald", color = "#092260", size = 14),
    axis.text.y = element_text(family = "Oswald", color = "#092260", size = 14),
    legend.text = element_text(family = "Oswald", color = "#092260", size = 14),
    legend.title = element_blank(),
    legend.position = c(0.9, 0.94),
    legend.background = element_rect(size = 0.3),
    legend.margin = margin(1, 5, 5, 5),
    legend.key.size = unit(0.5, "cm")
  )

ggplot(olympics_acrobatic, mapping = aes(y = fct_rev(sport), x = BMI, color = sex)) +
  geom_boxplot(position = position_dodge2(10)) +
  labs(
    x = "Body Mass Index (kg/m<sup>2</sup>)",
    y = NULL,
    color = "Sex",
    title = "Sex vs Body Mass Index Distribution",
    subtitle = "For athletes competing in acrobatic Olympic sports", ,
    caption = "Source: Sports Reference & OlympStats\nCompiled by kaggle.com"
  ) +
  scale_color_manual(values = c("#7B38EC", "#5CC0AB")) +
  facet_grid(category ~ ., scales = "free_y", space = "free") +
  theme_minimal() +
  theme(
    strip.background = element_blank(),
    strip.text.y = element_blank(),
    panel.spacing.y = unit(0.4, "cm"),
    plot.title = element_text(family = "Oswald", color = "#092260", size = 20, hjust = 0.5),
    plot.caption = element_text(family = "Oswald", color = "#092260", size = 11),
    plot.subtitle = element_text(family = "Oswald", color = "#092260", size = 14, hjust = 0.5),
    axis.text.x = element_text(family = "Oswald", color = "#092260"),
    axis.title.x = element_markdown(family = "Oswald", color = "#092260", size = 14),
    axis.text.y = element_text(family = "Oswald", color = "#092260", size = 14),
    legend.text = element_text(family = "Oswald", color = "#092260", size = 14),
    legend.title = element_blank(),
    legend.position = c(0.9, 0.97),
    legend.background = element_rect(size = 0.3),
    legend.margin = margin(1, 5, 5, 5),
    legend.key.size = unit(0.5, "cm")
  )

We found a citation for the Tokyo 2020 Olympics logo font here and were pointed to an open-source alternative here. We got the hex code used in plot text manually from that logo source. We chose the hex codes for our favorite gender color mapping from Telegraph 2018 here.

We can now visualize these groups on the same plot, to get a better sense of broader variability in size characteristics among an array of Olympic sports.

ggplot(olympics, mapping = aes(y = sport, x = BMI, color = sex)) +
  geom_boxplot(position = position_dodge2(10)) +
  labs(
    x = "Body Mass Index (kg/m<sup>2</sup>)",
    y = NULL,
    color = "Sex",
    title = "Sex vs Body Mass Index Distribution",
    subtitle = "For athletes competing in selected Olympic sports from 1912-2020", ,
    caption = "Source: Sports Reference & OlympStats\nCompiled by kaggle.com"
  ) +
  scale_color_manual(values = c("#7B38EC", "#5CC0AB")) +
  facet_grid(category ~ ., scales = "free_y", space = "free") +
  theme_minimal() +
  theme(
    strip.background = element_blank(),
    strip.text.y = element_blank(),
    panel.spacing.y = unit(0.4, "cm"),
    plot.title = element_text(family = "Oswald", color = "#092260", size = 20, hjust = 0.5),
    plot.caption = element_text(family = "Oswald", color = "#092260", size = 11),
    plot.subtitle = element_text(family = "Oswald", color = "#092260", size = 14, hjust = 0.5),
    axis.text.x = element_text(family = "Oswald", color = "#092260"),
    axis.title.x = element_markdown(family = "Oswald", color = "#092260", size = 14),
    axis.text.y = element_text(family = "Oswald", color = "#092260"),
    legend.text = element_text(family = "Oswald", color = "#092260", size = 14),
    legend.title = element_blank(),
    legend.position = c(0.9, 0.97),
    legend.background = element_rect(size = 0.3),
    legend.margin = margin(1, 5, 5, 5),
    legend.key.size = unit(0.5, "cm")
  )

For our second plot, we will look at how age of Olympic participants by sex has changed with time.

We will re-load the CSV file to clear any manipulations done on the Olympics data frame to make the first plot.

Since the observations in the data are done on a per event competed in basis (rather than a per athlete basis), we will use distinct() to filter the data so we only have unique athletes represented each year. In other words, each year will contain a set of observations representing one athlete each. This will remove cases where the age of an athlete who competed in multiple events for a given Olympics is repeated multiple times.

# keep only one observation per athlete
olympics <- olympics %>%
  distinct(id, year, .keep_all = TRUE) %>%
  mutate(sex = factor(sex, labels = c("Female", "Male")))

Finally, we will make the plot showing the average yearly age of athletes by sex. We will drop observations in which the age is unknown.

olympics %>%
  drop_na(age) %>%
  group_by(year, sex) %>%
  dplyr::summarise(avg_age = mean(age), .groups = "drop_last") %>%
  arrange(desc(sex)) %>%
  ggplot(mapping = aes(y = avg_age, x = year)) +
  geom_point(aes(color = sex)) +
  labs(
    x = "Year",
    y = "Age",
    color = "Sex",
    title = "Average Yearly Age of Athletes By Sex",
    subtitle = "For athletes competing in selected Olympic sports from 1912-2020",
    caption = "Source: Sports Reference & OlympStats\nCompiled by kaggle.com"
  ) +
  scale_color_manual(values = c("#7B38EC", "#5CC0AB"), guide = guide_legend(reverse = TRUE)) +
  theme_minimal() +
  theme(
    strip.background = element_blank(),
    strip.text.y = element_blank(),
    panel.spacing.y = unit(0.4, "cm"),
    plot.title = element_text(family = "Oswald", color = "#092260", size = 20, hjust = 0.5),
    plot.caption = element_text(family = "Oswald", color = "#092260", size = 11),
    plot.subtitle = element_text(family = "Oswald", color = "#092260", size = 14, hjust = 0.5),
    axis.text.x = element_text(family = "Oswald", color = "#092260"),
    axis.title.x = element_text(family = "Oswald", color = "#092260", size = 14),
    axis.title.y = element_text(family = "Oswald", color = "#092260", size = 14),
    axis.text.y = element_text(family = "Oswald", color = "#092260"),
    legend.text = element_text(family = "Oswald", color = "#092260", size = 14),
    legend.title = element_blank(),
    legend.position = c(0.8, 0.68),
    legend.background = element_rect(size = 0.3),
    legend.margin = margin(1, 5, 5, 5),
    legend.key.size = unit(0.5, "cm")
  )

The mean was used, rather than the median, of age as this better captures the range of ages of athletes for a given year.

Discussion

In the “Sex vs Body Mass Index Distribution” plots, men typically have a higher-centered distribution of BMI than women in the same sport, but other trends are more apparent by category. In the first plot, which includes wrestling, weightlifting, judo, and boxing, we see a strong right skew in most distributions. We interpret this as a reflection of the weight classes present in those sports, with many participants entering in lower weight classes, giving a lower center, but a non-trivial number of athletes with BMIs higher than 30. However, there are fewer at these high BMIs, possibly due to the difficulty in maintaining athletic competitiveness at that proportion. Shooting and archery also have generally higher BMIs than other sports, and again a right-skew is observed in the distributions. This reflects those sports’ focus on hand-eye coordination, and thus athlete body type or height or weight are more variable.

The next group of sports, Athletics and Rowing, feature distributions with large variability on both sides. We interpret this as illustrating the diversity of body types considered desirable in this sports: from slender, lithe coxswains and sprinters to heavier, more powerful rowers and throwers. Finally, the largest disparities between sex in a particular sport are observed in Gymnastics and Trampoline. Here, we speculate that the female gymnastics disciplines favor petite, nimble athletes, as opposed to the more upper-body strength oriented male apparatuses, which contribute to a higher average BMI for male Olympic gymnasts.

Finally, for the “Average Yearly Age of Athletes By Sex” plot, it was found that the mean age of athletes by sex has changed pretty significantly over the years. For instance, the range of mean ages participating has converged pretty dramatically. Prior to the 1970s, the mean ages fluctuated in the 20-30 years old range with even some outliers hitting 35 and 50 years old in some cases. However, since the 1970s, it appears that the mean ages have remained steady in the 25-27 years old range. This makes sense because, according to a Britannica article found here, the 1970s were a turning point for the games moving from allowing professional athletes to compete rather than just amateurs. The article also notes that it wasn’t until the 1980s that fully-fledged professional participation was permitted. With this information, we can speculate that there might be an association between the allowing of professional athletes and the convergence of mean ages. Perhaps since most professional athletes are around 25 years old (a point of peak physical ability for many), the mean age reflects the demographic shift (amateur to professional) of participants.

In addition to age converging, it is interesting to note the difference between Males and Females. For all years but 3, the mean age of males was higher than females. Perhaps since males take longer to physically mature (puberty is at a later age as described here), that could explain why their ability to compete at “peak perform” in a professional context is at a later age. Moreover, it’s possible that the extreme variability between mean ages of females prior to 1950 is the result of fewer females competing. With fewer competitors, it’s more likely that outlier events (extremely old participants) can skew the mean. For example, prior to 1920, there were 210 female participants versus 9,790 male participants as depicted below. This is a major limitation for this plot.

olympics %>%
  filter(year <= 1920) %>%
  dplyr::count(sex)
## # A tibble: 2 × 2
##   sex        n
##   <fct>  <int>
## 1 Female   210
## 2 Male    9790

These numbers are the total number of unique participants participating in each Olympics game prior to 1920.

Examining Medal-Winning Countries

Introduction

The Olympics are a historic sports event that have been run for over a 125 years. To really grasp the full story of success at the Olympics, we want to take a broader view and look at which countries have been most successful throughout all the Olympic games and in which sports. Nations take pride in their overall medal counts–in modern times, Google tracks and publishes these medal counts. Despite popular interest in these metrics, fewer people are aware of their nations’ performance extending back over a century. Thus, we are intrigued by these cumulative wins and how countries have earned them in the long view.

We also hypothesize that an association exists between a country’s Olympic success and its economic output because of the investment often required to adequately train in many sports. Testing this theory, that economic power corresponds to greater success at the Olympics, will help us understand the relationship between national resources and overall athletes’ success. A nation’s resources can serve as both a boon and a challenge to its athletes. For example, American wealth helps train some of the best gymnasts in the world, but it also supports corruption and abuse in the organization that is USA Gymnastics. To investigate this, we will merge in external data from the Maddison Project Database, a compilation of researchers’ work estimating economic growth for individual countries (Bolt and van Zanden, 2020).

We will be using a variable we create, total_winners, to visualize total Olympic success for different countries. This variable represents the number of medals won by each country in each Olympics. We will then zoom in and look at one specific country’s success, using the sport, noc, and year variables along with total_winners. For the second plot, we will load in external historical GDP per capita data and use this external data along with total_winners to visualize the relationship between a country’s wins and its economic indicators.

Thus, the question we will examine is the following: which countries are most successful at the games and is this success potentially influenced by high performance in particular sports and/or by the economic status of a country (as measured by GDP)?

Approach

In our first visualization, we will create a bar chart to represent the winningest countries in the history of the Olympics. Country codes will be on the x-axis and total medal count, a new variable we will create, will be on the y-axis. To avoid having an extremely left skewed bar chart with many countries who have won very few medals, we will represent the top 10 winningest countries of all time on our bar chart. This plot will help us gauge the amount of medals the top ten most winningest countries have won, and how they compare to each other. After creating this plot, we will select one country of interest and examine which sports have contributed most heavily to this country’s success. We will then plot out the results of these sports over time to see how their contribution to this country’s medal count may have changed over the years.

For our second visualization, we will create a scatter plot with gdp_per_capita on the x axis and medals_total on the y axis, faceting and animating by year. We plan to highlight interesting outliers with their country names, and we will visualize the winter and summer Olympics separately because of the different natures of these events (for example, because overall medal counts are much higher in the summer games than the winter games). Although we plan to look at data over time, a faceted or animated scatter plot will preserve all the information in our data (importantly our two numeric variables, gdp_per_capita and medals_total), where a line graph would not. We will use a text layer to annotate outliers, but a plot made of entirely geom_text() would overwhelm the reader. Also, to maintain granularity, since individual country identities are important to us, a scatter plot will serve us better than a sole model or smooth line.

Analysis

# reload the Olympic dataset and also the full country names into the noc_regions object
olympics <- read_csv(file = paste0(here::here(), "/data/olympics_data.csv"), show_col_types = FALSE)
noc_regions <- read_csv(file = paste0(here::here(), "/data/noc_regions.csv"), show_col_types = FALSE)

# make appropriate date for chronological plotting and convert characters in medal wins to numeric values
olympics %<>%
  mutate(
    date = ifelse(season == "Winter", paste0(year, "-02-01"), paste0(year, "-07-01")),
    date = ymd(date),
    medal_winner = ifelse(is.na(medal), 0, 1),
    medal_score = case_when(
      medal == "Bronze" ~ 1,
      medal == "Silver" ~ 2,
      medal == "Gold" ~ 3,
      TRUE ~ 0
    )
  )

noc_regions %<>%
  janitor::clean_names() %>%
  # fix typo
  mutate(region = ifelse(region == "Boliva", "Bolivia", region))

olympics <- left_join(olympics, noc_regions, by = "noc")

olympics_filter <- olympics %>%
  filter(season == "Summer") %>%
  group_by(region) %>%
  dplyr::summarise(total_winners = sum(medal_winner), total_score = sum(medal_score)) %>%
  filter(total_winners >= 910)

The data wrangling step above involves adding the month and day to the dates of our summer and winter olympic data to help us plot it as time series data, should we need to do this. We then clean the noc_regions data.

We have now filtered the data to be used in our first visualization for only Summer Olympic success, as different countries may be successful in different seasons. Given the limitations of the scope of analysis for this project, we will only look at the summer Olympics for this visualization as we feel it is more meaningful than combining both the summer and winter medal counts.

ggplot(olympics_filter, aes(x = reorder(region, total_winners), y = total_winners)) +
  geom_bar(stat = "identity", fill = "#5CC0AB") +
  labs(x = "Country",
       y = "Medal Count",
       title = "All-time Summer Olympic Medal Counts",
       subtitle = "For the top 10 most winningest countries in history",
       caption = "Source: Sports Reference & OlympStats\nCompiled by kaggle.com") +
  theme_minimal() +
  theme(strip.background = element_blank(),
        strip.text.y = element_blank(),
        panel.spacing.y = unit(0.4, "cm"),
        plot.title = element_text(family = "Oswald", color = "#092260", size = 20, hjust = 0.5),
        plot.caption = element_text(family = "Oswald", color = "#092260", size = 11),
        plot.subtitle = element_text(family = "Oswald", color = "#092260", size = 14, hjust = 0.5),
        axis.text.x = element_text(family = "Oswald", color = "#092260"),
        axis.title.x = element_markdown(family = "Oswald", color = "#092260", size = 14),
        axis.text.y = element_text(family = "Oswald", color = "#092260"),
        axis.title.y = element_markdown(family = "Oswald", color = "#092260", size = 14),
        legend.position = "none")

Here is our first plot, which represents the 10 countries that have won the most medals in Summer Olympic history.

After taking a closer look at the data used to create the first visualization, we decided to represent the contribution of the USA’s two most successful sports to our medal count at each summer Olympic games over time. We filter by the country code “USA” and by sport and year, filtering for only swimming and athletics.

usa_fil <- olympics %>%
  filter(noc == "USA") %>%
  group_by(sport, year)%>%
  dplyr::summarize(total_winners = sum(medal_winner), total_score = sum(medal_score), .groups = "drop")%>%
  filter(sport %in% c("Swimming", "Athletics"))

ggplot(usa_fil, aes(x = year, y = total_winners, color = sport, group = sport)) +
  geom_line() +
  geom_point() +
  labs(
    x = "Year",
    y = "Medal Count",
    title = "USA Olympic Medals Won in Athletics and Swimming",
    caption = "Source: Sports Reference & OlympStats\nCompiled by kaggle.com"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("#7B38EC", "#5CC0AB")) +
  scale_x_continuous(limits = c(1896, 2016), breaks = seq(1896, 2016, by = 8)) +
  theme(
    strip.background = element_blank(),
    strip.text.y = element_blank(),
    panel.spacing.y = unit(0.4, "cm"),
    plot.title = element_text(family = "Oswald", color = "#092260", size = 20, hjust = 0.5),
    plot.caption = element_text(family = "Oswald", color = "#092260", size = 11),
    plot.subtitle = element_text(family = "Oswald", color = "#092260", size = 14, hjust = 0.5),
    axis.text.x = element_text(family = "Oswald", color = "#092260"),
    axis.title.x = element_markdown(family = "Oswald", color = "#092260", size = 14),
    axis.text.y = element_text(family = "Oswald", color = "#092260"),
    axis.title.y = element_markdown(family = "Oswald", color = "#092260", size = 14),
    legend.text = element_text(family = "Oswald", color = "#092260", size = 14),
    legend.title = element_blank(),
    legend.position = c(0.8, 0.18),
    legend.background = element_rect(size = 0.3),
    legend.margin = margin(1, 5, 5, 5),
    legend.key.size = unit(0.5, "cm")
  )

Here we have plotted the number of medals won in Athletics and Swimming at each Summer Olympic games over the years.

Next, we will look at the economic characteristics of these winning countries.

# warnings suppressed due to parsing failures of gdp data which do not affect the data of our concern
detach(package:plyr, unload = TRUE)
gdp <- read_csv("data/gdp-per-capita-maddison-2020.csv", show_col_types = FALSE)

gdp %<>%
  janitor::clean_names() %>%
  mutate(entity = case_when(
    entity == "United States" ~ "USA",
    entity == "United Kingdom" ~ "UK",
    entity == "Congo" ~ "Republic of Congo",
    entity == "Democratic Republic of Congo" ~ "Democratic Republic of the Congo",
    entity == "Cote d'Ivoire" ~ "Ivory Coast",
    entity == "Trinidad and Tobago" ~ "Trinidad",
    entity == "North Macedonia" ~ "Macedonia",
    entity == "Czechia" ~ "Czech Republic",
    TRUE ~ entity
  ))

olympics_gdp <- olympics %>%
  group_by(region, date, season) %>%
  dplyr::summarise(total_winners = sum(medal_winner), total_score = sum(medal_score), .groups = "drop") %>%
  rename(entity = region) %>%
  mutate(
    year = year(date),
    summer_highlight = ifelse(total_winners > 100, T, F),
    winter_highlight = ifelse(total_winners > 30, T, F)
  ) %>%
  left_join(gdp, by = c("entity", "year")) %>%
  rename(country = entity)

We load our historical GDP data and join it to Olympic victory data after cleaning country naming discrepancies between the two.

winter_highlight <- olympics_gdp %>%
  filter(!is.na(gdp_per_capita), season == "Winter", year %in% c(2000:2020), total_winners > 30)

olympics_gdp %>%
  filter(!is.na(gdp_per_capita), season == "Winter", year %in% c(2000:2020)) %>%
  ggplot(aes(x = gdp_per_capita, y = total_winners)) +
  geom_point(aes(color = winter_highlight, size = winter_highlight), alpha = .5, size = .75, show.legend = FALSE) +
  geom_text_repel(
    data = winter_highlight, aes(x = gdp_per_capita, y = total_winners, label = country),
    color = "#7B38EC", size = 3.5, family = "Oswald"
  ) +
  facet_wrap(~year, ncol = 1) +
  scale_color_manual(values = c("#5CC0AB", "#7B38EC")) +
  labs(
    title = "Total Medals Won vs. GDP per capita",
    subtitle = "Winter Olympics, 2002-2014",
    y = "Total Medals Won",
    x = "GDP per capita (USD)",
    caption = "Source: Sports Reference & OlympStats\nCompiled by kaggle.com"
  ) +
  scale_y_continuous(breaks = seq(from = 0, to = 100, by = 50)) +
  scale_x_continuous(
    breaks = seq(from = 0, to = 90000, by = 30000),
    labels = label_dollar(scale = .001, prefix = "$", suffix = "K")
  ) +
  theme_minimal() +
  theme(
    strip.background = element_blank(),
    strip.text.y = element_blank(),
    panel.spacing.y = unit(0.4, "cm"),
    legend.title = element_blank(),
    legend.position = c(0.8, 0.68),
    legend.background = element_rect(size = 0.3),
    legend.margin = margin(1, 5, 5, 5),
    legend.key.size = unit(0.5, "cm"),
    plot.title = element_text(family = "Oswald", color = "#092260", size = 20, hjust = 0.5),
    plot.caption = element_text(family = "Oswald", color = "#092260", size = 11),
    strip.text = element_text(family = "Oswald", color = "#092260", size = 14, hjust = 0.5),
    plot.subtitle = element_text(family = "Oswald", color = "#092260", size = 14, hjust = 0.5),
    axis.text.x = element_text(family = "Oswald", color = "#092260"),
    axis.title.x = element_markdown(family = "Oswald", color = "#092260", size = 14),
    axis.text.y = element_text(family = "Oswald", color = "#092260", size = 8),
    axis.title.y = element_text(family = "Oswald", color = "#092260", size = 14),
    legend.text = element_text(family = "Oswald", color = "#092260", size = 14)
  )

summer_highlight <- olympics_gdp %>%
  filter(!is.na(gdp_per_capita), season == "Summer", year %in% c(2004:2020), total_winners > 100)

olympics_gdp %>%
  filter(!is.na(gdp_per_capita), season == "Summer", year %in% c(2004:2020)) %>%
  ggplot(aes(x = gdp_per_capita, y = total_winners)) +
  geom_point(aes(color = summer_highlight, size = summer_highlight),
    alpha = .5, size = .75, show.legend = FALSE
  ) +
  geom_text_repel(
    data = summer_highlight, aes(x = gdp_per_capita, y = total_winners, label = country),
    color = "#7B38EC", size = 3.5, family = "Oswald"
  ) +
  facet_wrap(~year, ncol = 1) +
  scale_color_manual(values = c("#5CC0AB", "#7B38EC")) +
  labs(
    title = "Total Medals Won vs. GDP per capita",
    subtitle = "Summer Olympics, 2004-2016",
    y = "Total Medals Won",
    x = "GDP per capita (USD)",
    caption = "Source: Sports Reference & OlympStats\nCompiled by kaggle.com"
  ) +
  scale_y_continuous(breaks = seq(from = 0, to = 300, by = 100)) +
  scale_x_continuous(
    breaks = seq(from = 0, to = 90000, by = 30000),
    labels = label_dollar(scale = .001, prefix = "$", suffix = "K")
  ) +
  theme_minimal() +
  theme(
    strip.background = element_blank(),
    strip.text.y = element_blank(),
    panel.spacing.y = unit(0.4, "cm"),
    legend.title = element_blank(),
    legend.position = c(0.8, 0.68),
    legend.background = element_rect(size = 0.3),
    legend.margin = margin(1, 5, 5, 5),
    legend.key.size = unit(0.5, "cm"),
    plot.title = element_text(family = "Oswald", color = "#092260", size = 20, hjust = 0.5),
    plot.caption = element_text(family = "Oswald", color = "#092260", size = 11),
    strip.text = element_text(family = "Oswald", color = "#092260", size = 14, hjust = 0.5),
    plot.subtitle = element_text(family = "Oswald", color = "#092260", size = 14, hjust = 0.5),
    axis.text.x = element_text(family = "Oswald", color = "#092260"),
    axis.title.x = element_markdown(family = "Oswald", color = "#092260", size = 14),
    axis.text.y = element_text(family = "Oswald", color = "#092260", size = 8),
    axis.title.y = element_text(family = "Oswald", color = "#092260", size = 14),
    legend.text = element_text(family = "Oswald", color = "#092260", size = 14)
  )

olympics_gdp %>%
  distinct(country, gdp_per_capita) %>%
  arrange(desc(gdp_per_capita)) %>%
  head(10)
## # A tibble: 10 × 2
##    country              gdp_per_capita
##    <chr>                         <dbl>
##  1 Qatar                       156299 
##  2 Qatar                       153922 
##  3 Qatar                       107402.
##  4 Norway                       82814 
##  5 Norway                       82216 
##  6 Norway                       81759 
##  7 Kuwait                       78801 
##  8 Norway                       78476.
##  9 Norway                       76522.
## 10 United Arab Emirates         75876
olympics_gdp %>%
  filter(!is.na(gdp_per_capita), country != "Qatar", season == "Summer", year %in% c(2004:2020)) %>%
  ggplot(aes(x = gdp_per_capita, y = total_winners)) +
  geom_point(aes(color = summer_highlight, size = summer_highlight),
    alpha = .5, size = .75, show.legend = FALSE
  ) +
  geom_text_repel(
    data = summer_highlight, aes(x = gdp_per_capita, y = total_winners, label = country),
    color = "#7B38EC", size = 3.5, family = "Oswald"
  ) +
  facet_wrap(~year, ncol = 1) +
  scale_color_manual(values = c("#5CC0AB", "#7B38EC")) +
  labs(title = "Total Medals Won vs. GDP per capita",
       subtitle = "Summer Olympics, 2004-2016",
       y = "Total Medals Won",
       x = "GDP per capita (USD)",
       caption = "Source: Sports Reference & OlympStats\nCompiled by kaggle.com") +
  scale_y_continuous(breaks = seq(from = 0, to = 300, by = 100)) +
  scale_x_continuous(
    breaks = seq(from = 0, to = 90000, by = 30000),
    labels = label_dollar(scale = .001, prefix = "$", suffix = "K")
  ) +
  theme_minimal() +
  theme(
    strip.background = element_blank(),
    strip.text.y = element_blank(),
    panel.spacing.y = unit(0.4, "cm"),
    legend.title = element_blank(),
    legend.position = c(0.8, 0.68),
    legend.background = element_rect(size = 0.3),
    legend.margin = margin(1, 5, 5, 5),
    legend.key.size = unit(0.5, "cm"),
    plot.title = element_text(family = "Oswald", color = "#092260", size = 20, hjust = 0.5),
    plot.caption = element_text(family = "Oswald", color = "#092260", size = 11),
    strip.text = element_text(family = "Oswald", color = "#092260", size = 14, hjust = 0.5),
    plot.subtitle = element_text(family = "Oswald", color = "#092260", size = 14, hjust = 0.5),
    axis.text.x = element_text(family = "Oswald", color = "#092260"),
    axis.title.x = element_markdown(family = "Oswald", color = "#092260", size = 14),
    axis.text.y = element_text(family = "Oswald", color = "#092260", size = 8),
    axis.title.y = element_text(family = "Oswald", color = "#092260", size = 14),
    legend.text = element_text(family = "Oswald", color = "#092260", size = 14)
  )

Then we plot GDP per capita against medals won over time in the summer and winter Olympics separately. Qatar, which has a GDP per capita of nearly double the next highest nation, makes it difficult to view trends among remaining countries, so we remove them for a second look at the summer games.

winter_highlight <- olympics_gdp %>%
  filter(!is.na(gdp_per_capita), season == "Winter", year %in% c(1950:2020), total_winners > 30)

winter_gdp <- olympics_gdp %>%
  filter(!is.na(gdp_per_capita), season == "Winter", year %in% c(1950:2020)) %>%
  ggplot(aes(x = gdp_per_capita, y = total_winners)) +
  geom_point(aes(color = winter_highlight), alpha = .5, size = 2, show.legend = FALSE) +
  geom_text(
    data = winter_highlight, aes(x = gdp_per_capita, y = total_winners, label = country),
    color = "#7B38EC", size = 4, family = "Oswald", vjust = 1
  ) +
  scale_color_manual(values = c("#5CC0AB", "#7B38EC")) +
  labs(
    title = "Total Medals Won vs. GDP per capita",
    subtitle = "Winter Olympics, 1950-2020\nYear: {closest_state}",
    y = "Total Medals Won",
    x = "GDP per capita (USD)",
    caption = "Source: Sports Reference & OlympStats\nCompiled by kaggle.com"
  ) +
  scale_x_continuous(
    breaks = seq(from = 0, to = 90000, by = 30000),
    labels = label_dollar(scale = .001, prefix = "$", suffix = "K")
  ) +
  theme_minimal() +
  theme(strip.background = element_blank(),
        strip.text.y = element_blank(),
        panel.spacing.y = unit(0.4, "cm"),
        legend.title = element_blank(),
        legend.position = c(0.8, 0.68),
        legend.background = element_rect(size = 0.3),
        legend.margin = margin(1, 5, 5, 5),
        legend.key.size = unit(0.5, "cm"),
        plot.title = element_text(family = "Oswald", color = "#092260", size = 17, hjust = 0.5),
        plot.caption = element_text(family = "Oswald", color = "#092260", size = 8),
        plot.subtitle = element_text(family = "Oswald", color = "#092260", size = 11, hjust = 0.5),
        axis.text.x = element_text(family = "Oswald", color = "#092260"),
        axis.title.x = element_markdown(family = "Oswald", color = "#092260", size = 11),
        axis.text.y = element_text(family = "Oswald", color = "#092260", size = 7),
        axis.title.y = element_text(family = "Oswald", color = "#092260", size = 11),
        legend.text = element_text(family = "Oswald", color = "#092260", size = 11)) +
  transition_states(year, transition_length = 1, state_length = .3)

animate(winter_gdp, duration = 16, fps = 10, renderer = gifski_renderer())

To look at these economic trends over a longer period of time, we animate our winter GDP plot from the 1950s onward.

Discussion

After plotting the 10 most successful summer Olympic countries in history, it is clear that the USA has a wide lead over the USSR, which is in second place, and other countries behind them, including Great Britain, Germany, France, and Italy. The United States, as seen in the visualization, has won over 5,000 summer Olympic medals, while the USSR has won just over 2,000. Britain has earned a similar amount, and we can see a steady decline after that. Because of the wide gap between first and second place, we decided to examine the data to see which sports contributed most to the success of the Americans. After examining the data and grouping the USA’s medals by sport, we found that the two most successful American summer Olympic sports were swimming and athletics.

The immense contribution of swimming and athletics to American summer Olympic success feels intuitive, as these two sports consist of a myriad of events with many medals up for grabs, whereas sports like basketball offer only a few medal opportunities at each summer Olympics. When we examine how the USA accumulated successes in these sports over time, we find that until the 1960 games, the USA won more athletic medals than swimming medals at every occurrence of the summer Olympics. But, starting in 1960, the USA medal count in swimming overtook athletics, and since then, the United States has won more swimming than athletic medals at every Olympic games. The data is missing values for 1940 and 1944 because of World War II and for 1980 because of the worldwide boycott of the Russian Olympic games. We found this observed trend interesting because in the late 20th century, there was a small gap between the number of medals won in swimming and athletics, but starting in 1996 and continuing through the turn of the century, the gap started to widen. This could be because of a multitude of reasons, including more development of Team USA swimming and coaching and more young athletes interested and with entryway into the sport, or the addition of more swimming events. In the last three Olympics for which we have data, we can see that the gap between the two sports stays relatively steady but success in both is on the rise. Perhaps this tells us that countries that are successful in sports with more medals up for grabs tend to see higher total medals for their country. Of course, more visualizations could be done to understand the sports that other countries like Russia do well in or how this varies by season. However, since we are working withing constraints for the length of the project, this will have to remain as a limitation.

In addition to the specific sports a country excels in being potentially related to medal success, our visualizations of economic indicators against Olympic success reveal both an association we expected and some meaningful outliers we did not predict. In the recent winter Olympics (since 2002), all of the countries with over 30 medal wins per games had a GDP per capita of over 30,000 USD, with the exception of Russia. It seems that perhaps Russia’s strong cultural history in winter sports like figure skating, their legacy of coaches and athletes that transformed the sport, speaks stronger than nationwide production and wealth. When we take the long view, Russia emerges as a high medal winner much earlier on, before the association of high GDP per capita with high medal counts for countries such as the US arises. This further testifies to the powerful history of winter sports in what is now Russia. And in the summer games, another country with GDP per capita under 30,000 USD shows outstanding performance – China. China’s significantly stronger showing in the summer Olympics than the winter games has left a mark in the 2000s. During this period, China’s performance as a nation peaked on its home turf, at the 2008 Beijing games. The Olympics offer nations a significant opportunity to present a positive image of themselves internationally, and it appears that despite economic trends, China has taken full advantage of that chance to demonstrate their power.

It is important to note that there are limitations with GDP per capita in terms of understanding the potential resources and opportunities available in a country as it does not fully take into account how wealth is spread.

Presentation

Our presentation can be found here.

Data

Sports Reference 2018, 120 years of Olympic history: athletes and results, electronic dataset, Kaggle, viewed 16 September 2021, https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results

Bolt, J, van Zanden J. L. 2020, GDP per capita, 1820-2018, electronic dataset, OurWorldInData, viewed 30 September 2021, https://ourworldindata.org/grapher/gdp-per-capita-maddison-2020

References

Sports Reference 2018, 120 years of Olympic history: athletes and results, electronic dataset, Kaggle, viewed 16 September 2021, https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results

Bolt, J, van Zanden J. L. 2020, GDP per capita, 1820-2018, electronic dataset, OurWorldInData, viewed 30 September 2021, https://ourworldindata.org/grapher/gdp-per-capita-maddison-2020

Packages Used

Claus O. Wilke 2020, ggtext: Improved Text Rendering Support for ‘ggplot2’., R package version 0.1.1. https://CRAN.R-project.org/package=ggtext

Garrett Grolemund, Hadley Wickham 2011, Dates and Times Made Easy with lubridate., Journal of Statistical Software, 40(3), 1-25. https://www.jstatsoft.org/v40/i03/

Hadley Wickham 2011, The Split-Apply-Combine Strategy for Data Analysis., Journal of Statistical Software, 40(1), 1-29. http://www.jstatsoft.org/v40/i01/

Hadley Wickham, Romain François, Lionel Henry and Kirill Müller 2021, dplyr: A Grammar of Data Manipulation., https://dplyr.tidyverse.org, https://github.com/tidyverse/dplyr

Hadley Wickham and Dana Seidel 2020, scales: Scale Functions for Visualization., https://scales.r-lib.org, https://github.com/r-lib/scales

Jeroen Ooms 2021, gifski: Highest Quality GIF Encoder., https://gif.ski/ (upstream), https://github.com/r-rust/gifski (devel).

Kamil Slowikowski 2021, ggrepel: Automatically Position Non-Overlapping Text Labels with ‘ggplot2’., R package version 0.9.1. https://github.com/slowkow/ggrepel

Kirill Müller and Lorenz Walthert 2021, styler: Non-Invasive Pretty Printing of R Code. https://github.com/r-lib/styler, https://styler.r-lib.org

Thomas Lin Pedersen and David Robinson 2020, gganimate: A Grammar of Animated Graphics., https://gganimate.com, https://github.com/thomasp85/gganimate

Stefan Milton Bache and Hadley Wickham 2020, magrittr: A Forward-Pipe Operator for R., https://magrittr.tidyverse.org, https://github.com/tidyverse/magrittr

Wickham et al. 2019, Welcome to the tidyverse., Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

Yixuan Qiu et al. 2021, showtext: Using Fonts More Easily in R Graphs., R package version 0.9-4. https://CRAN.R-project.org/package=showtext

Coding & Outside Research

CruzeiroDoSul, 2016, Font used on the Tokyo 2020 logo [online] Available at: https://www.reddit.com/r/identifythisfont/comments/4ig8ua/font_used_on_the_tokyo_2020_logo/ [Accessed 1 Oct. 2021].

Graphic Design Stack Exchange, 2014, *web safe - Is there a DIN font free alternative? * [online] Available at: https://graphicdesign.stackexchange.com/questions/7178/is-there-a-din-font-free-alternative [Accessed 1 Oct. 2021].

GmbH, D, 2018, An alternative to pink & blue: Colors for gender data [online] Chartable. Available at: https://blog.datawrapper.de/gendercolor/

Young, D.C. and Abrahams, H.M., 2019, Olympic Games | History, Locations, & Winners In: Encyclopædia Britannica. [online] Available at: https://www.britannica.com/sports/Olympic-Games

Wikipedia Contributors, Olympic_rings_without_rims. [online] Available at: https://upload.wikimedia.org/wikipedia/commons/thumb/5/5c/Olympic_rings_without_rims.svg/1200px-Olympic_rings_without_rims.svg.png

ChinaPower Project 2018, How dominant is China at the Olympic Games? [online]. Available at: https://chinapower.csis.org/dominant-china-olympic-games/

NHS Choices, 2018, Sexual health [online] NHS. Available at: https://www.nhs.uk/live-well/sexual-health/stages-of-puberty-what-happens-to-boys-and-girls/

Wikipedia Contributors, 2019, Body mass index. [online] Wikipedia. Available at: https://en.wikipedia.org/wiki/Body_mass_index