data wrangling I

# data wrangling I
### 2021-09-10

---

# Welcome

---

## Announcements

- Lab: Start work on Project 1

- Reading quiz 2 next week

- Homework 1 grading in process

---

## Agenda for today:

Review: Data wrangling

Transforming and reshaping a single data frame

---

## Setup

```r
# load packages
library(tidyverse)
library(glue)
library(lubridate)
library(scales)
library(knitr)

# set default theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 8,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "60%"
)

# dplyr print min and max
options(dplyr.print_max = 6, dplyr.print_min = 6)
```
]

---

# Transforming and reshaping a single data frame

---

## Data: Hotel bookings

- Data from two hotels: one resort and one city hotel
- Observations: Each row represents a hotel booking

```r
hotels <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv")
```

---

# .hand[We...]

.huge[.yellow[want]] .hand[to slice it, and dice it, and juice it, and process it, so we can plot it]

---

## **dplyr** 101

- `distinct()`
- `select()`, `relocate()`
- `arrange()`, `arrange(desc())`
- `slice()`, `slice_head()`, `slice_tail()`, `slice_sample()`
- `filter()`
- `mutate()`
- `summarise()`, `count()`

---

---

## .hand[livecoding]

(See next slide for code developed during live coding session)

---

```r
hotels %>%
  mutate(
    arrival_date = glue("{arrival_date_year}-{arrival_date_month}-{arrival_date_day_of_month}"),
    arrival_date = ymd(arrival_date)
    ) %>%
  group_by(hotel, arrival_date) %>%
  summarise(mean_adr = mean(adr), .groups = "drop") %>%
  ggplot(aes(x = arrival_date, y = mean_adr, group = hotel, color = hotel)) +
  geom_line() +
  scale_color_manual(values = c("cornsilk4", "deepskyblue3")) +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    x = "Arrival date",
    y = "Mean average\ndaily rate (USD)",
    color = NULL,
    title = "Cost of daily hotel stay",
    subtitle = "July 2015 to August 2017",
    caption = "Source: Antonio, Almeida and Nunes (2019) | TidyTuesday"
  ) +
  theme(
    legend.position = c(0.15, 0.9),
    legend.box.background = element_rect(fill = "white",
                                         color = "white"),
    plot.subtitle = element_text(color = "cornsilk4"),
    plot.caption = element_text(color = "cornsilk4")
  )
```
]

---

.panelset[
.panel[.panel-name[Plot]
<img src="06-data-wrangling-I_files/figure-html/monthly-bookings-1.png" width="88%" />
]
.panel[.panel-name[Discuss]
<iframe src="https://app.sli.do/event/rxg9buzy" height="100%" width="100%" frameBorder="0" style="min-height: 560px;" title="Slido"></iframe>
]
]

---

## .hand[livecoding]

(See next slide for code developed during live coding session)

---

```r
hotels %>%
  mutate(
    arrival_date_month = fct_relevel(arrival_date_month, month.name),
    season = case_when(
      arrival_date_month %in% c("December", "January", "February") ~ "Winter",
      arrival_date_month %in% c("March", "April", "May")           ~ "Spring",
      arrival_date_month %in% c("June", "July", "August")          ~ "Summer",
      TRUE                                                         ~ "Fall"
    ),
    season = fct_relevel(season, "Winter", "Spring", "Summer", "Fall"),
    season_emoji = case_when(
      season == "Winter" ~ "❄️",
      season == "Spring" ~ "⛅️️",
      season == "Summer" ~ "☀️",
      season == "Fall"   ~ "☂️"
    )
  ) %>%
  count(season_emoji, hotel, arrival_date_month) %>%
  ggplot(aes(x = arrival_date_month, y = n, group = hotel, linetype = hotel)) +
  geom_line(size = 0.8, color = "cornsilk4") +
  geom_text(aes(label = season_emoji), size = 6, show.legend = FALSE) +
  scale_x_discrete(labels = month.abb) +
  labs(
    x = "Arrival month", y = "Number of bookings", linetype = NULL,
    title = "Number of monthly bookings",
    subtitle = "July 2015 to August 2017",
    caption = "Source: Antonio, Almeida and Nunes (2019) | TidyTuesday"
  ) +
  coord_cartesian(clip = "off") +
  theme(
    legend.position = c(0.12, 0.9),
    legend.box.background = element_rect(fill = "white", color = "white"),
    plot.subtitle = element_text(color = "cornsilk4"),
    plot.caption = element_text(color = "cornsilk4")
  )
```
]

---

## A few takeaways

- `forcats::fct_relevel()` in a `mutate()` is useful for custom ordering of levels of a factor variable

--
- `summarise()` after `group_by()` with multiple variables results in a message about the grouping structure of the resulting data frame -- the message can be supressed by defining `.groups` (e.g., `.groups = "drop"` or `.groups = "keep"`)

--
- `summarise()` also lets you get away with being sloppy and not naming your new column, but that's not recommended!

---

## Rowwise operations

.task[
We want to calculate the total number of guests for each booking. Why does the following not work?
]

```r
hotels %>%
  select(adults, children, babies) %>%
  mutate(guests = sum(c(adults, children, babies)))
```

```
## # A tibble: 119,390 × 4
##   adults children babies guests
##    <dbl>    <dbl>  <dbl>  <dbl>
## 1      2        0      0     NA
## 2      2        0      0     NA
## 3      1        0      0     NA
## 4      1        0      0     NA
## 5      2        0      0     NA
## 6      2        0      0     NA
## # … with 119,384 more rows
```

---

## Rowwise operations

```r
hotels %>%
  select(adults, children, babies) %>%
* rowwise() %>%
  mutate(guests = sum(c(adults, children, babies))) %>%
  filter(adults > 0, children > 0, babies > 0) # to show sum works
```

```
## # A tibble: 172 × 4
## # Rowwise: 
##   adults children babies guests
##    <dbl>    <dbl>  <dbl>  <dbl>
## 1      2        1      1      4
## 2      2        1      1      4
## 3      2        1      1      4
## 4      2        1      1      4
## 5      2        1      1      4
## 6      2        1      1      4
## # … with 166 more rows
```

---

## Columnwise operations

Use `across()` combined with `summarise()` to calculate summary statistics for multiple columns at once:

```r
hotels %>%
* summarise(across(.cols = starts_with("stays"),  mean))
```

```
## # A tibble: 1 × 2
##   stays_in_weekend_nights stays_in_week_nights
##                     <dbl>                <dbl>
## 1                   0.928                 2.50
```

```r
hotels %>%
* summarise(across(.cols = starts_with("stays"),  list(mean, sd)))
```

```
## # A tibble: 1 × 4
##   stays_in_weekend_nights_1 stays_in_weekend… stays_in_week_ni… stays_in_week_n…
##                       <dbl>             <dbl>             <dbl>            <dbl>
## 1                     0.928             0.999              2.50             1.91
```

---

## Select helpers

- `starts_with()`: Starts with a prefix
- `ends_with()`: Ends with a suffix
- `contains()`: Contains a literal string
- `num_range()`: Matches a numerical range like x01, x02, x03
- `one_of()`: Matches variable names in a character vector
- `everything()`: Matches all variables
- `last_col()`: Select last variable, possibly with an offset
- `matches()`: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text)

---

## Columnwise operations

```r
hotels %>%
  group_by(hotel, is_canceled) %>%
  summarise(
*   across(.cols = starts_with("stays"),  list(mean = mean, sd = sd), .names = "{.fn}_{.col}")
    )
```

```
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
```

```
## # A tibble: 4 × 6
## # Groups:   hotel [2]
##   hotel        is_canceled mean_stays_in_wee… sd_stays_in_week… mean_stays_in_w…
##   <chr>              <dbl>              <dbl>             <dbl>            <dbl>
## 1 City Hotel             0              0.801             0.862             2.12
## 2 City Hotel             1              0.788             0.917             2.27
## 3 Resort Hotel           0              1.13              1.14              3.01
## 4 Resort Hotel           1              1.34              1.14              3.44
## # … with 1 more variable: sd_stays_in_week_nights <dbl>
```

---

## Columnwise operations

```r
hotels %>%
  group_by(hotel, is_canceled) %>%
  summarise(
    across(.cols = starts_with("stays"),  list(mean = mean, sd = sd), .names = "{.fn}_{.col}"),
*   .groups = "drop"
    )
```

```
## # A tibble: 4 × 6
##   hotel        is_canceled mean_stays_in_wee… sd_stays_in_week… mean_stays_in_w…
##   <chr>              <dbl>              <dbl>             <dbl>            <dbl>
## 1 City Hotel             0              0.801             0.862             2.12
## 2 City Hotel             1              0.788             0.917             2.27
## 3 Resort Hotel           0              1.13              1.14              3.01
## 4 Resort Hotel           1              1.34              1.14              3.44
## # … with 1 more variable: sd_stays_in_week_nights <dbl>
```