data wrangling II

# data wrangling II
### 2021-09-15

---

# Welcome

---

## Announcements

- Project:
  - Take note of deadlines: [vizdata.org/project-1.html#due-dates](https://www.vizdata.org/project-1.html#due-dates)
  - Proposals for peer review due due Fri, Sep 17 at 5pm
  - Any questions about project?

- RQ 2 review:
  - Today: pivot and join functions, setting limits / zooming in, pie charts
  - Friday: Visualizing various models (i.e., `lm()` <-> `geom_smooth()`)

---

## Agenda for today

Wrap up: Data wrangling
  
- Transforming and reshaping a single data frame
- Bringing together multiple data frames

---

## Setup

```r
# load packages
library(tidyverse)
library(glue)
library(lubridate)
library(scales)
library(knitr)
library(palmerpenguins)
library(openintro)
library(ggrepel)
library(waffle)

# set default theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 8, fig.asp = 0.618, fig.retina = 3,
  dpi = 300, out.width = "60%"
)

# dplyr print min and max
options(dplyr.print_max = 6, dplyr.print_min = 6)
```
]

---

# Transforming and reshaping a single data frame

---

## Data: Hotel bookings

- Data from two hotels: one resort and one city hotel
- Observations: Each row represents a hotel booking

```r
hotels <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv")
```

---

## Setup for next example: `hotel_summary`

```r
hotels_summary <- hotels %>%
  group_by(hotel, is_canceled) %>%
  summarise(
    across(
      .cols = starts_with("stays"),
      list(mean = mean),
      .names = "{.fn}_{.col}"
    ),
    .groups = "drop"
  )

hotels_summary
```

```
## # A tibble: 4 × 4
##   hotel        is_canceled mean_stays_in_weekend_nights mean_stays_in_week_nigh…
##   <chr>              <dbl>                        <dbl>                    <dbl>
## 1 City Hotel             0                        0.801                     2.12
## 2 City Hotel             1                        0.788                     2.27
## 3 Resort Hotel           0                        1.13                      3.01
## 4 Resort Hotel           1                        1.34                      3.44
```

---

.task[
Which variables are plotted in the following visualization? Which aesthetics are they mapped to?
]

.panelset[
.panel[.panel-name[Plot]
<img src="07-data-wrangling-II_files/figure-html/stay-nights-1.png" width="88%" />
]
.panel[.panel-name[Discuss]
<iframe src="https://app.sli.do/event/rxg9buzy" height="100%" width="100%" frameBorder="0" style="min-height: 560px;" title="Slido"></iframe>
]
]

---

## .hand[livecoding]

(See next slide for code developed during live coding session)

---

```r
hotels_summary %>%
  mutate(is_canceled = if_else(is_canceled == 0, "Not canceled", "Canceled")) %>%
  pivot_longer(cols = starts_with("mean"),
               names_to = "day_type",
               values_to = "mean_stays",
               names_prefix = "mean_stays_in_") %>%
  mutate(
    day_type = if_else(str_detect(day_type, "weekend"), "Weekend", "Weekday")
    ) %>%
  ggplot(aes(x = str_wrap(is_canceled, 10), y = mean_stays, 
             group = hotel, color = hotel)) +
  geom_point(show.legend = FALSE) +
  geom_line(aes(linetype = hotel), size = 1) +
  facet_wrap(~day_type) +
  labs(
    x = "Booking status",
    y = "Mean number of\nnights of stay",
    color = NULL, linetype = NULL,
    title = "Mean number of stays",
    subtitle = "By hotel type and booking status",
    caption = "Source: Antonio, Almeida and Nunes (2019) | TidyTuesday"
  ) +
  scale_color_manual(values = c("cornsilk4", "deepskyblue3")) +
  scale_y_continuous(limits = c(0, 4), breaks = 0:4) +
  theme(legend.position = "bottom")
```
]

---

## `pivot_wider()` and `pivot_longer()`

.pull-left[
- From **tidyr**
- Incredibly useful for reshaping for plotting
- Lots of extra arguments to help with reshaping pain!
- Refer to [pivoting vignette](https://tidyr.tidyverse.org/articles/pivot.html) when needed
]
.pull-right[
<img src="images/semi-join.gif" width="80%" />
]

---

# Bringing together multiple data frames

---

# .hand[We...]

---

## Data: 10 women in science who changed the world

|name               |
|:------------------|
|Ada Lovelace       |
|Marie Curie        |
|Janaki Ammal       |
|Chien-Shiung Wu    |
|Katherine Johnson  |
|Rosalind Franklin  |
|Vera Rubin         |
|Gladys West        |
|Flossie Wong-Staal |
|Jennifer Doudna    |
]

.footnote[
Source: [Discover Magazine](https://www.discovermagazine.com/the-sciences/meet-10-women-in-science-who-changed-the-world)
]

---

## Setup update

We have only 10 observations, so let's print them all going forward:

```r
options(
  dplyr.print_min = 10, 
  dplyr.print_max = 10
  )
```

---

## Inputs

```r
professions
```

```
## # A tibble: 10 × 2
##    name               profession                        
##    <chr>              <chr>                             
##  1 Ada Lovelace       Mathematician                     
##  2 Marie Curie        Physicist and Chemist             
##  3 Janaki Ammal       Botanist                          
##  4 Chien-Shiung Wu    Physicist                         
##  5 Katherine Johnson  Mathematician                     
##  6 Rosalind Franklin  Chemist                           
##  7 Vera Rubin         Astronomer                        
##  8 Gladys West        Mathematician                     
##  9 Flossie Wong-Staal Virologist and Molecular Biologist
## 10 Jennifer Doudna    Biochemist
```
]

```r
dates
```

```
## # A tibble: 8 × 3
##   name               birth_year death_year
##   <chr>                   <dbl>      <dbl>
## 1 Janaki Ammal             1897       1984
## 2 Chien-Shiung Wu          1912       1997
## 3 Katherine Johnson        1918       2020
## 4 Rosalind Franklin        1920       1958
## 5 Vera Rubin               1928       2016
## 6 Gladys West              1930         NA
## 7 Flossie Wong-Staal       1947         NA
## 8 Jennifer Doudna          1964         NA
```
]

```r
works
```

```
## # A tibble: 9 × 2
##   name               known_for                                                  
##   <chr>              <chr>                                                      
## 1 Ada Lovelace       first computer algorithm                                   
## 2 Marie Curie        theory of radioactivity,  first woman Nobel Prize win      
## 3 Janaki Ammal       hybrid species, biodiversity protection                    
## 4 Chien-Shiung Wu    experiment overturning theory of parity                    
## 5 Katherine Johnson  orbital mechanics critical to sending first Americans into…
## 6 Vera Rubin         existence of dark matter                                   
## 7 Gladys West        mathematical modeling of the shape of the Earth            
## 8 Flossie Wong-Staal first to clone HIV and map its genes, which led to test fo…
## 9 Jennifer Doudna    one of the primary developers of CRISPR
```
]

]

---

## Desired output

```
## # A tibble: 10 × 5
##    name               profession      birth_year death_year known_for           
##    <chr>              <chr>                <dbl>      <dbl> <chr>               
##  1 Ada Lovelace       Mathematician           NA         NA first computer algo…
##  2 Marie Curie        Physicist and …         NA         NA theory of radioacti…
##  3 Janaki Ammal       Botanist              1897       1984 hybrid species, bio…
##  4 Chien-Shiung Wu    Physicist             1912       1997 experiment overturn…
##  5 Katherine Johnson  Mathematician         1918       2020 orbital mechanics c…
##  6 Rosalind Franklin  Chemist               1920       1958 <NA>                
##  7 Vera Rubin         Astronomer            1928       2016 existence of dark m…
##  8 Gladys West        Mathematician         1930         NA mathematical modeli…
##  9 Flossie Wong-Staal Virologist and…       1947         NA first to clone HIV …
## 10 Jennifer Doudna    Biochemist            1964         NA one of the primary …
```

---

## Inputs, reminder

```r
names(professions)
```

```
## [1] "name"       "profession"
```

```r
names(dates)
```

```
## [1] "name"       "birth_year" "death_year"
```

```r
names(works)
```

```
## [1] "name"      "known_for"
```
]
.pull-right[

```r
nrow(professions)
```

```
## [1] 10
```

```r
nrow(dates)
```

```
## [1] 8
```

```r
nrow(works)
```

```
## [1] 9
```
]

---

## Joining data frames

```r
something_join(x, y)
```

- `left_join()`: all rows from x
- `right_join()`: all rows from y
- `full_join()`: all rows from both x and y
- `semi_join()`: all rows from x where there are matching values in y, keeping just columns from x
- `inner_join()`: all rows from x where there are matching values in y, return 
all combination of multiple matches in the case of multiple matches
- `anti_join()`: return all rows from x where there are not matching values in y, never duplicate rows of x
- ...
 
---

## Setup

For the next few slides...

```r
x
```

```
## # A tibble: 3 × 2
##      id value_x
##   <dbl> <chr>  
## 1     1 x1     
## 2     2 x2     
## 3     3 x3
```
]
.pull-right[

```r
y
```

```
## # A tibble: 3 × 2
##      id value_y
##   <dbl> <chr>  
## 1     1 y1     
## 2     2 y2     
## 3     4 y4
```
]

---

## `left_join()`

.pull-left[
<img src="images/left-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

```r
left_join(x, y)
```

```
## Joining, by = "id"
```

```
## # A tibble: 3 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     3 x3      <NA>
```
]

---

## `left_join()`

```r
professions %>%
* left_join(dates)
```

```
## Joining, by = "name"
```

```
## # A tibble: 10 × 4
##    name               profession                         birth_year death_year
##    <chr>              <chr>                                   <dbl>      <dbl>
##  1 Ada Lovelace       Mathematician                              NA         NA
##  2 Marie Curie        Physicist and Chemist                      NA         NA
##  3 Janaki Ammal       Botanist                                 1897       1984
##  4 Chien-Shiung Wu    Physicist                                1912       1997
##  5 Katherine Johnson  Mathematician                            1918       2020
##  6 Rosalind Franklin  Chemist                                  1920       1958
##  7 Vera Rubin         Astronomer                               1928       2016
##  8 Gladys West        Mathematician                            1930         NA
##  9 Flossie Wong-Staal Virologist and Molecular Biologist       1947         NA
## 10 Jennifer Doudna    Biochemist                               1964         NA
```

---

## `right_join()`

.pull-left[
<img src="images/right-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

```r
right_join(x, y)
```

```
## Joining, by = "id"
```

```
## # A tibble: 3 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     4 <NA>    y4
```
]

---

## `right_join()`

```r
professions %>%
* right_join(dates)
```

```
## Joining, by = "name"
```

```
## # A tibble: 8 × 4
##   name               profession                         birth_year death_year
##   <chr>              <chr>                                   <dbl>      <dbl>
## 1 Janaki Ammal       Botanist                                 1897       1984
## 2 Chien-Shiung Wu    Physicist                                1912       1997
## 3 Katherine Johnson  Mathematician                            1918       2020
## 4 Rosalind Franklin  Chemist                                  1920       1958
## 5 Vera Rubin         Astronomer                               1928       2016
## 6 Gladys West        Mathematician                            1930         NA
## 7 Flossie Wong-Staal Virologist and Molecular Biologist       1947         NA
## 8 Jennifer Doudna    Biochemist                               1964         NA
```

---

## `full_join()`

.pull-left[
<img src="images/full-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

```r
full_join(x, y)
```

```
## Joining, by = "id"
```

```
## # A tibble: 4 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2     
## 3     3 x3      <NA>   
## 4     4 <NA>    y4
```
]

---

## `full_join()`

```r
dates %>%
* full_join(works)
```

```
## Joining, by = "name"
```

```
## # A tibble: 10 × 4
##    name               birth_year death_year known_for                           
##    <chr>                   <dbl>      <dbl> <chr>                               
##  1 Janaki Ammal             1897       1984 hybrid species, biodiversity protec…
##  2 Chien-Shiung Wu          1912       1997 experiment overturning theory of pa…
##  3 Katherine Johnson        1918       2020 orbital mechanics critical to sendi…
##  4 Rosalind Franklin        1920       1958 <NA>                                
##  5 Vera Rubin               1928       2016 existence of dark matter            
##  6 Gladys West              1930         NA mathematical modeling of the shape …
##  7 Flossie Wong-Staal       1947         NA first to clone HIV and map its gene…
##  8 Jennifer Doudna          1964         NA one of the primary developers of CR…
##  9 Ada Lovelace               NA         NA first computer algorithm            
## 10 Marie Curie                NA         NA theory of radioactivity,  first wom…
```

---

## `inner_join()`

.pull-left[
<img src="images/inner-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

```r
inner_join(x, y)
```

```
## Joining, by = "id"
```

```
## # A tibble: 2 × 3
##      id value_x value_y
##   <dbl> <chr>   <chr>  
## 1     1 x1      y1     
## 2     2 x2      y2
```
]

---

## `inner_join()`

```r
dates %>%
* inner_join(works)
```

```
## Joining, by = "name"
```

```
## # A tibble: 7 × 4
##   name               birth_year death_year known_for                            
##   <chr>                   <dbl>      <dbl> <chr>                                
## 1 Janaki Ammal             1897       1984 hybrid species, biodiversity protect…
## 2 Chien-Shiung Wu          1912       1997 experiment overturning theory of par…
## 3 Katherine Johnson        1918       2020 orbital mechanics critical to sendin…
## 4 Vera Rubin               1928       2016 existence of dark matter             
## 5 Gladys West              1930         NA mathematical modeling of the shape o…
## 6 Flossie Wong-Staal       1947         NA first to clone HIV and map its genes…
## 7 Jennifer Doudna          1964         NA one of the primary developers of CRI…
```

---

## `semi_join()`

.pull-left[
<img src="images/semi-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

```r
semi_join(x, y)
```

```
## Joining, by = "id"
```

```
## # A tibble: 2 × 2
##      id value_x
##   <dbl> <chr>  
## 1     1 x1     
## 2     2 x2
```
]

---

## `semi_join()`

```r
dates %>%
* semi_join(works)
```

```
## Joining, by = "name"
```

```
## # A tibble: 7 × 3
##   name               birth_year death_year
##   <chr>                   <dbl>      <dbl>
## 1 Janaki Ammal             1897       1984
## 2 Chien-Shiung Wu          1912       1997
## 3 Katherine Johnson        1918       2020
## 4 Vera Rubin               1928       2016
## 5 Gladys West              1930         NA
## 6 Flossie Wong-Staal       1947         NA
## 7 Jennifer Doudna          1964         NA
```

---

## `anti_join()`

.pull-left[
<img src="images/anti-join.gif" width="80%" style="background-color: #FDF6E3" />
]
.pull-right[

```r
anti_join(x, y)
```

```
## Joining, by = "id"
```

```
## # A tibble: 1 × 2
##      id value_x
##   <dbl> <chr>  
## 1     3 x3
```
]

---

## `anti_join()`

```r
dates %>%
* anti_join(works)
```

```
## Joining, by = "name"
```

```
## # A tibble: 1 × 3
##   name              birth_year death_year
##   <chr>                  <dbl>      <dbl>
## 1 Rosalind Franklin       1920       1958
```

---

## Putting it altogether

```r
scientists <- professions %>%
  left_join(dates) %>%
  left_join(works)
```

```
## Joining, by = "name"
## Joining, by = "name"
```

```r
scientists
```

---

## `*_join()` functions

- From **dplyr**
- Incredibly useful for bringing datasets with common information (e.g., unique identifier) together
- Use `by` argument when the names of the column containing the common information are not the same across datasets
- Always check that the numbers of rows and columns of the result dataset makes sense
- Refer to [two-table verbs vignette](https://dplyr.tidyverse.org/articles/two-table.html) when needed

---

## Visualizing joined data

---

## .hand[livecoding]

(See next two slides for code developed during live coding session)

---

```r
scientists_longer <- scientists %>%
  mutate(
    birth_year = case_when(
      name == "Ada Lovelace" ~ 1815,
      name == "Marie Curie" ~ 1867,
      TRUE ~ birth_year
    ),
    death_year = case_when(
      name == "Ada Lovelace" ~ 1852,
      name == "Marie Curie" ~ 1934,
      name == "Flossie Wong-Staal" ~ 2020,
      TRUE ~ death_year
    ),
    status = if_else(is.na(death_year), "alive", "deceased"),
    death_year = if_else(is.na(death_year), 2021, death_year),
    known_for = if_else(name == "Rosalind Franklin", "understanding of the molecular structures of DNA ", known_for)
  ) %>%
  pivot_longer(
    cols = contains("year"),
    names_to = "year_type",
    values_to = "year"
  ) %>%
  mutate(death_year_fake = if_else(year == 2021, TRUE, FALSE))
```
]

---

```r
library(colorblindr) # clauswilke/colorblindr

ggplot(scientists_longer, 
       aes(x = year, y = fct_reorder(name, as.numeric(factor(profession))), group = name, color = profession)) +
  geom_point(aes(shape = death_year_fake), show.legend = FALSE) +
  geom_line(aes(linetype = status), show.legend = FALSE) +
  scale_shape_manual(values = c("circle", NA)) +
  scale_linetype_manual(values = c("dashed", "solid")) +
  scale_color_OkabeIto(darken = 0.4) +
  scale_x_continuous(expand = c(0.01, 0), breaks = seq(1820, 2020, 50)) +
  geom_text(aes(y = name, label = known_for), x = 2030, show.legend = FALSE, hjust = 0) +
  geom_text(aes(label = profession), x = 1809, y = Inf, hjust = 1, vjust = 1, show.legend = FALSE) +
  coord_cartesian(clip = "off") +
  labs(
    x = "Year", y = NULL,
    title = "10 women in science who changed the world",
    caption = "Source: Discover magazine"
  ) +
  facet_grid(profession ~ ., scales = "free_y", space = "free_y", switch = "x") +
  theme(
    plot.margin = unit(c(1, 23, 1, 4), "lines"),
    plot.title.position = "plot",
    plot.caption.position = "plot",
    plot.caption = element_text(hjust = 2), # manual hack
    strip.background = element_blank(),
    strip.text = element_blank(),
    axis.title.x = element_text(hjust = 0),
    panel.background = element_rect(fill = "#f0f0f0", color = "white"),
    panel.grid.major = element_line(color = "white", size = 0.5)
  )
```

```
## Warning: Removed 2 rows containing missing values (geom_point).
```
]

---

<center>
<blockquote class="twitter-tweet" width="400"><p lang="en" dir="ltr">TIL: debug = TRUE argument for element_text() in <a href="https://twitter.com/hashtag/ggplot2?src=hash&amp;ref_src=twsrc%5Etfw">#ggplot2</a> &quot;aids visual debugging by drawing a solid rectangle behind the complete text area, and a point where each label is anchored&quot; 🤯 <br><br>I don&#39;t know how I&#39;ve never seen this before, it&#39;s super helpful! <a href="https://t.co/87VVhMpeLm">pic.twitter.com/87VVhMpeLm</a></p>&mdash; Mine Çetinkaya-Rundel (@minebocek) <a href="https://twitter.com/minebocek/status/1437844443588136960?ref_src=twsrc%5Etfw">September 14, 2021</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</center>

---

## Gantt charts

<center>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Oh! Also, if this is for a blogpost could I recommend noting that the datviz you’ve made is called a Gantt chart? <br><br>In my experience that name isn’t very well known, but knowing it helps when googling how to customise these charts 😊</p>&mdash; Charlie 👩‍💻 (@charliejhadley) <a href="https://twitter.com/charliejhadley/status/1437849467793952775?ref_src=twsrc%5Etfw">September 14, 2021</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</center>

---