project-2-tidy_team

Global Deaths by Risk Factors

Tidy Team 12/2/2021

Write Up

Introduction

Our high-level goal was to create an R Shiny app that acts as an interactive spatio-temporal visualization of worldwide deaths related to various risk factors, specifically air pollution, substance use, and lack of sanitation.

Our Shiny App can be found here.

Goals and Motivation

The goal of our Shiny App is to communicate the harmful effects of certain risk factors (i.e., air pollution, substance use, and sanitation) in the world by visually displaying the deaths caused by these risk factors over time. Our interactive maps display the death rate (deaths per 100,000) in countries from 1990 to 2015 caused by different types of air pollution, substance use, and sanitation concerns. The motivation to create this app is to effectively show the deadly effects of risk factors across the globe and how they compare by country and change over time. By consolidating the vast amount of data in an informative way, our Shiny App could address deadly global factors to a greater population. The intended users for this application are the general public who are curious about death rates, and researchers who are interested in gaining a perspective on death rates from specific risk factors compared to other countries and through time.

Data Description

We utilize three different data sets.

The first data set is The World Pollution Data from Our World in Data via Kaggle. This data set contains two separate CSV files. The first includes the death rate by air pollution type and country from 1990 to 2017 - 7 variables and 6468 observations. The second CSV has the raw number of deaths by risk factor and by country from 1990 to 2017 - 32 variables and 6468 observations. The second data set is The World Map Data from Maps Package, which includes 6 variables and 99338 observations. The third data set is Population Data from Our World in Data, which includes 4 different variables and 6814 observations.

Justification of Approach

Data Cleaning and Merging

The data can be found in our data dictionary. We first created a data cleaning script to separate all the data cleaning from the app. We compressed the .csv files into .rds files and combined and separated components of the data sets to create our visualizations.

Specifically, we created compressed_final_data.rds by combining all the data sets above so each region has information about the different death rates by various risk factors. However, we needed to organize death rates by different risk factors for each country for the map. Therefore, we created three rds files for each risk factor category (death by air pollution, substance use, and sanitation). We chose three categories because the Shiny App rendered too slowly with more than three interactive maps. Also, we selected air pollution, substance use, and sanitation because those categories had the most significant effect on the death rate worldwide.

The Data cleaning file can be found here.

Map Dataset

To create the map, we used a series of case_when functions to resolve naming inconsistencies between the two data sets. However, a limitation was that we could not fix all the country names, especially those we could not identify on the map. We aimed to make all three maps show death rate–not total death count–to account for varying population sizes. However, only the air pollution category had the death rate in the original data set, the others had death count. Therefore, we used population data to find the total population of each country between 1990 and 2017. Then we mutated a new column for the death rate by dividing the total number of deaths by total population, then multiplying by 100,000 to show the number of deaths per 100,000 people. However, we did not age-standardize the death rate as we did not feel equipped to undertake this calculation.

Time

For the map, we filtered for every three years between 1990 and 2015 instead of plotting every year to enhance the loading speed. These intervals also show more meaningful change when playing the animation. We chose to keep data for every year for the line graphs for the integrity of the chart. This decision did not slow down the Shiny app since it is plotted in a line graph and not a plotly map. The code for the plotly map and line graph can be found here.

Shiny App

Each of the three main categories of risk factors are given a separate tab: Air pollution, Substance Use, and Sanitation.

The world map is a plotly map highlighting the death rates for the chosen risk factor tab. Each type of risk factor is given a different continuous color scale.

Within the three tabs, the user will interact with a side panel, allowing them to select the subcategory to display. For instance, within the Substance Use tab, the side menu will allow the user to display death rates from secondhand smoking, alcohol use, drug use, or smoking. All subcategories of risk factors are listed in the data dictionary.

After the map reloads according to the user’s selection, the user can interact with the map by sliding the time slider located under the map to select a specific year to display, ranging from 1990 to 2017. Additionally, the user can click the play button near the time slider to animate.

On the map, the user can use their mouse to hover over each country, displaying a small box that shows the country’s name and death rate by the chosen category.

Below the map and time slider, a line graph indicates the changes in the number of deaths for the given risk factor over the time the data was collected. The user will be able to choose up to three regions to view how the death number per risk factor changed from 1990 to 2017. There are 17 regions the user can select from. Instead of the death rate, we used the total number of deaths for the y-axis because this allows absolute number of deaths to be evaluated between regions so decision makers know where to allocate resources. While death rate is helpful for comparative purposes, the absolute number is also essential to understand where most people are suffering. We set the maximum number of regions to three, so the plot displays trends and comparisons with meaningful color differences, time loading, and clarity.

If the user wants to explore our general findings and conclusions, they can click on our hyperlink located at the bottom of each sidebar. This hyperlink brings the user to our GitHub Write-up to find our data sources, approach, and discussion.

Use of Plotly for Map

Initially, our group attempted to utilize Leaflet, but this method did not work because it required spatial data that we could not join with our data given the skills we learned and also limited color filling in the polygons. Next, we tried using the Shiny slider which worked visually but limited interactivity and speed. We were determined to include hover and zoom-in features with detailed death rates, because the color scale alone is not detailed enough to compare death rates across countries and time. We finally chose the plotly package for geom_polygon to enhance interactivity and speed.

Accessibility

All the color scales chosen are color-blind friendly using a Viridis palette so comparing death rates between different countries is more accessible. Each map shows a different Viridis palette, while using a different, yet consistent Viridis palette for the line graphs. For the line graph, each color corresponds with a region instead of corresponding with the death rate. Additionally, we added Alt Text to our app, specifically when the user hovers over the map and the line plot. This way, users who can not see the plots clearly can use their computer’s text reader to listen to our descriptions.

Discussion

Below, we discuss critical findings from our visualizations for each of the three risk factors.

Air Pollution

Death rates attributable to air pollution in total have consistently fallen throughout the 25 years on the time slider. For instance, the US death rate due to air pollution (total) has dropped from 31.19 in 1990 to 19.95 in 2015. This was a surprising finding since our group’s intuition was that air pollution has increased over the years due to industrialization; therefore, air pollution-related deaths would also increase. However, this finding does not suggest that air pollution has necessarily decreased, but rather the possibility that people have developed better methods to prevent death from air pollution.

Regarding the line plots, our group wanted to compare the raw deaths in North America, Eastern Europe, and Latin America and the Caribbean since we wanted to compare two well-developed regions with a less developed one. For instance, in 2017, North America had 115,000 deaths attributed to air pollution, whereas Eastern Europe and Latin America and the Caribbean had approximately 150,000 and 200,000, respectively. This finding is surprising given that Latin America and the Caribbean have a much smaller population than the other two regions.

Substance Use

Alcohol use and secondhand smoking provided fascinating insights. For instance, from 1990 to 2015, death rates due to secondhand smoking in the US continuously decreased, from 1.14 to 0.87. Similarly, Canada’s death rate due to secondhand smoking decreased from 1.13 to 0.93 between 1990 and 2015. China and Russia faced the highest death rates due to secondhand smoking, with the latter experiencing continuously rising death rates until 2008, after which death rates started to decrease

Alcohol use death rates painted a surprisingly different story. Canada and the US in 1990 started with polar opposite death rates of 0.37 and 1.08, respectively. However, both countries faced continuously rising death rates, and by 2015, death rates due to alcohol use in Canada and the US were 1.16 and 1.38. However, the region with the highest death rates due to alcohol in 2015 was Eastern Europe - particularly Russia, Romania, and Ukraine, each with a death rate higher than two.

Comparing the three regions mentioned earlier for the line plots, we can see that Eastern Europe continuously has a more significant number of deaths for alcohol use-related deaths, followed by Latin America and the Caribbean and North America. Interestingly, the number of alcohol-related deaths for Eastern Europe and Latin America and the Caribbean were roughly the same in 1990; however, by 2017, the disparity was vast (100,000 deaths).

Sanitation

This risk factor indicates the most significant difference between developed and less developed countries. For instance, in 1990, the regions with the highest death rates due to unsafe water sources were Africa, South America, and South/South East Asia. Countries in this region would be considered less developed countries; therefore, this aligns with our group’s initial hypothesis. A similar trend can also be seen for death rates attributable to unsafe sanitation and a lack of access to hand washing facilities.

Latin America and the Caribbean have far greater deaths attributed to unsafe water sources than Eastern Europe or North America. In 2017, Eastern Europe and North America had around 300 deaths due to unsafe water sources, whereas Latin America and the Caribbean had around 12,000 deaths. Although Latin America and the Caribbean have a much smaller population, they are less economically developed than the other two regions, and so the death count disparity makes intuitive sense.

Conclusion and Limitations

We hope this tool will be helpful to policy analysts when they are analyzing where policy interventions are required and tracking the progression of risk factors on countries over time.

One significant limitation regards the sanitation tab, which includes a few negative death rates. This must be taken into account when analyzing and comparing countries.

Similarly, the death rates attributable to sanitation and substance use are not age-standardized, so our results cannot be used to compare populations with dramatically different age structures.

Another limitation we faced is the hover feature on the map presents the risk factor as a variable name and the death rate in scientific notation.

Our shiny app takes a long time to load.

All these factors would require a large time investment towards technical troubleshooting, and we thought our time was better spent focusing on the visualizations.

Presentation Slides

Our presentation can be found here

References

Data References:

Although accessed via Kaggle, the risk factor death data’s original source is Our World in Data, which is a public data collection initiative led by the nonprofit Global Change Data Lab. The nonprofit shares knowledge (and data) about the world’s most pressing issues so that anyone has the tools and information they need to work toward combating a particular global problem. Our World in Data notes that their data sources include the Institute for Health Metrics and Evaluation (Global Burden of Disease) and State of Global Air The data was posted to Kaggle about a month ago and includes data spanning the years 1990 to 2017.

Since some of the risk factors had death rates and some included the absolute number of deaths, we used population data over the years of interest via the Our World in Data to make a rate. It appears from the Our World in D

Data website that the data was collected via Gapminder, HYDE, and United Nations Population Division databases.

Package References:

Winston Chang, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert and Barbara Borges (2021). shiny: Web Application Framework for R. R package version 1.7.1. https://shiny.rstudio.com/

Winston Chang (2021). shinythemes: Themes for Shiny. R package version 1.2.0. https://rstudio.github.io/shinythemes/

Eric Bailey (2015). shinyBS: Twitter Bootstrap Components for Shiny. R package version 0.61. https://ebailey78.github.io/shinyBS

David Gohel and Panagiotis Skintzos (2021). ggiraph: Make ‘ggplot2’ Graphics Interactive. R package version 0.7.10. https://davidgohel.github.io/ggiraph/

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

Kirill Müller (2020). here: A Simpler Way to Find Your Files. https://here.r-lib.org/, https://github.com/r-lib/here.

Original S code by Richard A. Becker, Allan R. Wilks. R version by Ray Brownrigg. Enhancements by Thomas P Minka and Alex Deckmyn. (2021). maps: Draw Geographical Maps. R package version 3.4.0.

Hadley Wickham and Dana Seidel (2020). scales: Scale Functions for Visualization. https://scales.r-lib.org, https://github.com/r-lib/scales.

C. Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida, 2020.

Hadley Wickham and Jim Hester (2021). readr: Read Rectangular Text Data. https://readr.tidyverse.org, https://github.com/tidyverse/readr.

Doug McIlroy. Packaged for R by Ray Brownrigg, Thomas P Minka and transition to Plan 9 codebase by Roger Bivand. (2020). mapproj: Map Projections. R package version 1.2.7.

Aron Atkins, Jonathan McPherson and JJ Allaire (2021). rsconnect: Deployment Interface for R Markdown Documents and Shiny Applications. R package version 0.8.25. https://github.com/rstudio/rsconnect

Kirill Müller and Lorenz Walthert (2021). styler: Non-Invasive Pretty Printing of R Code. https://github.com/r-lib/styler, https://styler.r-lib.org.

Sam Firke (2021). janitor: Simple Tools for Examining and Cleaning Dirty Data. R package version 2.1.0. https://github.com/sfirke/janitor

Hadley Wickham and Jennifer Bryan (2019). readxl: Read Excel Files. https://readxl.tidyverse.org, https://github.com/tidyverse/readxl.

Yihui Xie (2021). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.36.

Garrick Aden-Buie (2021). xaringanthemer: Custom ‘xaringan’ CSS Themes. https://pkg.garrickadenbuie.com/xaringanthemer/, https://github.com/gadenbuie/xaringanthemer.

Shiny App Code References:

The sources used to build the Shiny app can be found in the README.md of the death_risk_factors folder where the app is located. The direct link is here.

Presentation Image References: