class: center, middle, inverse, title-slide # data wrangling I ### 2021-09-10 --- class: middle, inverse # Welcome --- ## Announcements - Lab: Start work on Project 1 - Reading quiz 2 next week - Homework 1 grading in process --- ## Agenda for today: Review: Data wrangling Transforming and reshaping a single data frame --- ## Setup .midi[ ```r # load packages library(tidyverse) library(glue) library(lubridate) library(scales) library(knitr) # set default theme for ggplot2 ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16)) # set default figure parameters for knitr knitr::opts_chunk$set( fig.width = 8, fig.asp = 0.618, fig.retina = 3, dpi = 300, out.width = "60%" ) # dplyr print min and max options(dplyr.print_max = 6, dplyr.print_min = 6) ``` ] --- class: middle, inverse # Transforming and reshaping a single data frame --- ## Data: Hotel bookings - Data from two hotels: one resort and one city hotel - Observations: Each row represents a hotel booking ```r hotels <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv") ``` --- class: middle # .hand[We...] .huge[.blue[have]] .hand[a single data frame] .huge[.yellow[want]] .hand[to slice it, and dice it, and juice it, and process it, so we can plot it] --- ## **dplyr** 101 .task[ Which of the following (if any) are unfamiliar to you? ] - `distinct()` - `select()`, `relocate()` - `arrange()`, `arrange(desc())` - `slice()`, `slice_head()`, `slice_tail()`, `slice_sample()` - `filter()` - `mutate()` - `summarise()`, `count()` --- .task[ Let's recreate this visualization! ] <img src="06-data-wrangling-I_files/figure-html/daily-stay-cost-1.png" width="95%" /> --- class: middle ## .hand[livecoding] (See next slide for code developed during live coding session) --- .small[ ```r hotels %>% mutate( arrival_date = glue("{arrival_date_year}-{arrival_date_month}-{arrival_date_day_of_month}"), arrival_date = ymd(arrival_date) ) %>% group_by(hotel, arrival_date) %>% summarise(mean_adr = mean(adr), .groups = "drop") %>% ggplot(aes(x = arrival_date, y = mean_adr, group = hotel, color = hotel)) + geom_line() + scale_color_manual(values = c("cornsilk4", "deepskyblue3")) + scale_y_continuous(labels = label_dollar()) + labs( x = "Arrival date", y = "Mean average\ndaily rate (USD)", color = NULL, title = "Cost of daily hotel stay", subtitle = "July 2015 to August 2017", caption = "Source: Antonio, Almeida and Nunes (2019) | TidyTuesday" ) + theme( legend.position = c(0.15, 0.9), legend.box.background = element_rect(fill = "white", color = "white"), plot.subtitle = element_text(color = "cornsilk4"), plot.caption = element_text(color = "cornsilk4") ) ``` ] --- .task[ Come up with a plan for making the following visualization and write the pseudocode. ] .panelset[ .panel[.panel-name[Plot] <img src="06-data-wrangling-I_files/figure-html/monthly-bookings-1.png" width="88%" /> ] .panel[.panel-name[Discuss] <iframe src="https://app.sli.do/event/rxg9buzy" height="100%" width="100%" frameBorder="0" style="min-height: 560px;" title="Slido"></iframe> ] ]
05
:
00
--- class: middle ## .hand[livecoding] (See next slide for code developed during live coding session) --- .tiny[ ```r hotels %>% mutate( arrival_date_month = fct_relevel(arrival_date_month, month.name), season = case_when( arrival_date_month %in% c("December", "January", "February") ~ "Winter", arrival_date_month %in% c("March", "April", "May") ~ "Spring", arrival_date_month %in% c("June", "July", "August") ~ "Summer", TRUE ~ "Fall" ), season = fct_relevel(season, "Winter", "Spring", "Summer", "Fall"), season_emoji = case_when( season == "Winter" ~ "❄️", season == "Spring" ~ "⛅️️", season == "Summer" ~ "☀️", season == "Fall" ~ "☂️" ) ) %>% count(season_emoji, hotel, arrival_date_month) %>% ggplot(aes(x = arrival_date_month, y = n, group = hotel, linetype = hotel)) + geom_line(size = 0.8, color = "cornsilk4") + geom_text(aes(label = season_emoji), size = 6, show.legend = FALSE) + scale_x_discrete(labels = month.abb) + labs( x = "Arrival month", y = "Number of bookings", linetype = NULL, title = "Number of monthly bookings", subtitle = "July 2015 to August 2017", caption = "Source: Antonio, Almeida and Nunes (2019) | TidyTuesday" ) + coord_cartesian(clip = "off") + theme( legend.position = c(0.12, 0.9), legend.box.background = element_rect(fill = "white", color = "white"), plot.subtitle = element_text(color = "cornsilk4"), plot.caption = element_text(color = "cornsilk4") ) ``` ] --- ## A few takeaways - `forcats::fct_relevel()` in a `mutate()` is useful for custom ordering of levels of a factor variable -- - `summarise()` after `group_by()` with multiple variables results in a message about the grouping structure of the resulting data frame -- the message can be supressed by defining `.groups` (e.g., `.groups = "drop"` or `.groups = "keep"`) -- - `summarise()` also lets you get away with being sloppy and not naming your new column, but that's not recommended! --- ## Rowwise operations .task[ We want to calculate the total number of guests for each booking. Why does the following not work? ] ```r hotels %>% select(adults, children, babies) %>% mutate(guests = sum(c(adults, children, babies))) ``` ``` ## # A tibble: 119,390 × 4 ## adults children babies guests ## <dbl> <dbl> <dbl> <dbl> ## 1 2 0 0 NA ## 2 2 0 0 NA ## 3 1 0 0 NA ## 4 1 0 0 NA ## 5 2 0 0 NA ## 6 2 0 0 NA ## # … with 119,384 more rows ``` --- ## Rowwise operations ```r hotels %>% select(adults, children, babies) %>% * rowwise() %>% mutate(guests = sum(c(adults, children, babies))) %>% filter(adults > 0, children > 0, babies > 0) # to show sum works ``` ``` ## # A tibble: 172 × 4 ## # Rowwise: ## adults children babies guests ## <dbl> <dbl> <dbl> <dbl> ## 1 2 1 1 4 ## 2 2 1 1 4 ## 3 2 1 1 4 ## 4 2 1 1 4 ## 5 2 1 1 4 ## 6 2 1 1 4 ## # … with 166 more rows ``` --- ## Columnwise operations Use `across()` combined with `summarise()` to calculate summary statistics for multiple columns at once: ```r hotels %>% * summarise(across(.cols = starts_with("stays"), mean)) ``` ``` ## # A tibble: 1 × 2 ## stays_in_weekend_nights stays_in_week_nights ## <dbl> <dbl> ## 1 0.928 2.50 ``` ```r hotels %>% * summarise(across(.cols = starts_with("stays"), list(mean, sd))) ``` ``` ## # A tibble: 1 × 4 ## stays_in_weekend_nights_1 stays_in_weekend… stays_in_week_ni… stays_in_week_n… ## <dbl> <dbl> <dbl> <dbl> ## 1 0.928 0.999 2.50 1.91 ``` --- ## Select helpers - `starts_with()`: Starts with a prefix - `ends_with()`: Ends with a suffix - `contains()`: Contains a literal string - `num_range()`: Matches a numerical range like x01, x02, x03 - `one_of()`: Matches variable names in a character vector - `everything()`: Matches all variables - `last_col()`: Select last variable, possibly with an offset - `matches()`: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text) .footnote[ See help for any of these functions for more info, e.g. `?everything`. ] --- ## Columnwise operations ```r hotels %>% group_by(hotel, is_canceled) %>% summarise( * across(.cols = starts_with("stays"), list(mean = mean, sd = sd), .names = "{.fn}_{.col}") ) ``` ``` ## `summarise()` has grouped output by 'hotel'. You can override using the ## `.groups` argument. ``` ``` ## # A tibble: 4 × 6 ## # Groups: hotel [2] ## hotel is_canceled mean_stays_in_wee… sd_stays_in_week… mean_stays_in_w… ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 City Hotel 0 0.801 0.862 2.12 ## 2 City Hotel 1 0.788 0.917 2.27 ## 3 Resort Hotel 0 1.13 1.14 3.01 ## 4 Resort Hotel 1 1.34 1.14 3.44 ## # … with 1 more variable: sd_stays_in_week_nights <dbl> ``` --- ## Columnwise operations ```r hotels %>% group_by(hotel, is_canceled) %>% summarise( across(.cols = starts_with("stays"), list(mean = mean, sd = sd), .names = "{.fn}_{.col}"), * .groups = "drop" ) ``` ``` ## # A tibble: 4 × 6 ## hotel is_canceled mean_stays_in_wee… sd_stays_in_week… mean_stays_in_w… ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 City Hotel 0 0.801 0.862 2.12 ## 2 City Hotel 1 0.788 0.917 2.27 ## 3 Resort Hotel 0 1.13 1.14 3.01 ## 4 Resort Hotel 1 1.34 1.14 3.44 ## # … with 1 more variable: sd_stays_in_week_nights <dbl> ```