class: center, middle, inverse, title-slide # data wrangling II ### 2021-09-15 --- class: middle, inverse # Welcome --- ## Announcements - Project: - Take note of deadlines: [vizdata.org/project-1.html#due-dates](https://www.vizdata.org/project-1.html#due-dates) - Proposals for peer review due due Fri, Sep 17 at 5pm - Any questions about project? - RQ 2 review: - Today: pivot and join functions, setting limits / zooming in, pie charts - Friday: Visualizing various models (i.e., `lm()` <-> `geom_smooth()`) --- ## Agenda for today Wrap up: Data wrangling - Transforming and reshaping a single data frame - Bringing together multiple data frames --- ## Setup .midi[ ```r # load packages library(tidyverse) library(glue) library(lubridate) library(scales) library(knitr) library(palmerpenguins) library(openintro) library(ggrepel) library(waffle) # set default theme for ggplot2 ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16)) # set default figure parameters for knitr knitr::opts_chunk$set( fig.width = 8, fig.asp = 0.618, fig.retina = 3, dpi = 300, out.width = "60%" ) # dplyr print min and max options(dplyr.print_max = 6, dplyr.print_min = 6) ``` ] --- class: middle, inverse # Transforming and reshaping a single data frame --- ## Data: Hotel bookings - Data from two hotels: one resort and one city hotel - Observations: Each row represents a hotel booking ```r hotels <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv") ``` --- ## Setup for next example: `hotel_summary` ```r hotels_summary <- hotels %>% group_by(hotel, is_canceled) %>% summarise( across( .cols = starts_with("stays"), list(mean = mean), .names = "{.fn}_{.col}" ), .groups = "drop" ) hotels_summary ``` ``` ## # A tibble: 4 × 4 ## hotel is_canceled mean_stays_in_weekend_nights mean_stays_in_week_nigh… ## <chr> <dbl> <dbl> <dbl> ## 1 City Hotel 0 0.801 2.12 ## 2 City Hotel 1 0.788 2.27 ## 3 Resort Hotel 0 1.13 3.01 ## 4 Resort Hotel 1 1.34 3.44 ``` --- .task[ Which variables are plotted in the following visualization? Which aesthetics are they mapped to? ] .panelset[ .panel[.panel-name[Plot] <img src="07-data-wrangling-II_files/figure-html/stay-nights-1.png" width="88%" /> ] .panel[.panel-name[Discuss] <iframe src="https://app.sli.do/event/rxg9buzy" height="100%" width="100%" frameBorder="0" style="min-height: 560px;" title="Slido"></iframe> ] ]
05
:
00
--- class: middle ## .hand[livecoding] (See next slide for code developed during live coding session) --- .small[ ```r hotels_summary %>% mutate(is_canceled = if_else(is_canceled == 0, "Not canceled", "Canceled")) %>% pivot_longer(cols = starts_with("mean"), names_to = "day_type", values_to = "mean_stays", names_prefix = "mean_stays_in_") %>% mutate( day_type = if_else(str_detect(day_type, "weekend"), "Weekend", "Weekday") ) %>% ggplot(aes(x = str_wrap(is_canceled, 10), y = mean_stays, group = hotel, color = hotel)) + geom_point(show.legend = FALSE) + geom_line(aes(linetype = hotel), size = 1) + facet_wrap(~day_type) + labs( x = "Booking status", y = "Mean number of\nnights of stay", color = NULL, linetype = NULL, title = "Mean number of stays", subtitle = "By hotel type and booking status", caption = "Source: Antonio, Almeida and Nunes (2019) | TidyTuesday" ) + scale_color_manual(values = c("cornsilk4", "deepskyblue3")) + scale_y_continuous(limits = c(0, 4), breaks = 0:4) + theme(legend.position = "bottom") ``` ] --- ## `pivot_wider()` and `pivot_longer()` .pull-left[ - From **tidyr** - Incredibly useful for reshaping for plotting - Lots of extra arguments to help with reshaping pain! - Refer to [pivoting vignette](https://tidyr.tidyverse.org/articles/pivot.html) when needed ] .pull-right[ <img src="images/semi-join.gif" width="80%" /> ] --- class: middle, inverse # Bringing together multiple data frames --- class: middle # .hand[We...] .huge[.blue[have]] .hand[multiple data frames] .huge[.yellow[want]] .hand[to bring them together so we can plot them] --- ## Data: 10 women in science who changed the world .small[ |name | |:------------------| |Ada Lovelace | |Marie Curie | |Janaki Ammal | |Chien-Shiung Wu | |Katherine Johnson | |Rosalind Franklin | |Vera Rubin | |Gladys West | |Flossie Wong-Staal | |Jennifer Doudna | ] .footnote[ Source: [Discover Magazine](https://www.discovermagazine.com/the-sciences/meet-10-women-in-science-who-changed-the-world) ] --- ## Setup update We have only 10 observations, so let's print them all going forward: ```r options( dplyr.print_min = 10, dplyr.print_max = 10 ) ``` --- ## Inputs .panelset[ .panel[.panel-name[professions] ```r professions ``` ``` ## # A tibble: 10 × 2 ## name profession ## <chr> <chr> ## 1 Ada Lovelace Mathematician ## 2 Marie Curie Physicist and Chemist ## 3 Janaki Ammal Botanist ## 4 Chien-Shiung Wu Physicist ## 5 Katherine Johnson Mathematician ## 6 Rosalind Franklin Chemist ## 7 Vera Rubin Astronomer ## 8 Gladys West Mathematician ## 9 Flossie Wong-Staal Virologist and Molecular Biologist ## 10 Jennifer Doudna Biochemist ``` ] .panel[.panel-name[dates] ```r dates ``` ``` ## # A tibble: 8 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Janaki Ammal 1897 1984 ## 2 Chien-Shiung Wu 1912 1997 ## 3 Katherine Johnson 1918 2020 ## 4 Rosalind Franklin 1920 1958 ## 5 Vera Rubin 1928 2016 ## 6 Gladys West 1930 NA ## 7 Flossie Wong-Staal 1947 NA ## 8 Jennifer Doudna 1964 NA ``` ] .panel[.panel-name[works] ```r works ``` ``` ## # A tibble: 9 × 2 ## name known_for ## <chr> <chr> ## 1 Ada Lovelace first computer algorithm ## 2 Marie Curie theory of radioactivity, first woman Nobel Prize win ## 3 Janaki Ammal hybrid species, biodiversity protection ## 4 Chien-Shiung Wu experiment overturning theory of parity ## 5 Katherine Johnson orbital mechanics critical to sending first Americans into… ## 6 Vera Rubin existence of dark matter ## 7 Gladys West mathematical modeling of the shape of the Earth ## 8 Flossie Wong-Staal first to clone HIV and map its genes, which led to test fo… ## 9 Jennifer Doudna one of the primary developers of CRISPR ``` ] ] --- ## Desired output ``` ## # A tibble: 10 × 5 ## name profession birth_year death_year known_for ## <chr> <chr> <dbl> <dbl> <chr> ## 1 Ada Lovelace Mathematician NA NA first computer algo… ## 2 Marie Curie Physicist and … NA NA theory of radioacti… ## 3 Janaki Ammal Botanist 1897 1984 hybrid species, bio… ## 4 Chien-Shiung Wu Physicist 1912 1997 experiment overturn… ## 5 Katherine Johnson Mathematician 1918 2020 orbital mechanics c… ## 6 Rosalind Franklin Chemist 1920 1958 <NA> ## 7 Vera Rubin Astronomer 1928 2016 existence of dark m… ## 8 Gladys West Mathematician 1930 NA mathematical modeli… ## 9 Flossie Wong-Staal Virologist and… 1947 NA first to clone HIV … ## 10 Jennifer Doudna Biochemist 1964 NA one of the primary … ``` --- ## Inputs, reminder .pull-left[ ```r names(professions) ``` ``` ## [1] "name" "profession" ``` ```r names(dates) ``` ``` ## [1] "name" "birth_year" "death_year" ``` ```r names(works) ``` ``` ## [1] "name" "known_for" ``` ] .pull-right[ ```r nrow(professions) ``` ``` ## [1] 10 ``` ```r nrow(dates) ``` ``` ## [1] 8 ``` ```r nrow(works) ``` ``` ## [1] 9 ``` ] --- ## Joining data frames ```r something_join(x, y) ``` - `left_join()`: all rows from x - `right_join()`: all rows from y - `full_join()`: all rows from both x and y - `semi_join()`: all rows from x where there are matching values in y, keeping just columns from x - `inner_join()`: all rows from x where there are matching values in y, return all combination of multiple matches in the case of multiple matches - `anti_join()`: return all rows from x where there are not matching values in y, never duplicate rows of x - ... --- ## Setup For the next few slides... .pull-left[ ```r x ``` ``` ## # A tibble: 3 × 2 ## id value_x ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ## 3 3 x3 ``` ] .pull-right[ ```r y ``` ``` ## # A tibble: 3 × 2 ## id value_y ## <dbl> <chr> ## 1 1 y1 ## 2 2 y2 ## 3 4 y4 ``` ] --- ## `left_join()` .pull-left[ <img src="images/left-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ```r left_join(x, y) ``` ``` ## Joining, by = "id" ``` ``` ## # A tibble: 3 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ``` ] --- ## `left_join()` ```r professions %>% * left_join(dates) ``` ``` ## Joining, by = "name" ``` ``` ## # A tibble: 10 × 4 ## name profession birth_year death_year ## <chr> <chr> <dbl> <dbl> ## 1 Ada Lovelace Mathematician NA NA ## 2 Marie Curie Physicist and Chemist NA NA ## 3 Janaki Ammal Botanist 1897 1984 ## 4 Chien-Shiung Wu Physicist 1912 1997 ## 5 Katherine Johnson Mathematician 1918 2020 ## 6 Rosalind Franklin Chemist 1920 1958 ## 7 Vera Rubin Astronomer 1928 2016 ## 8 Gladys West Mathematician 1930 NA ## 9 Flossie Wong-Staal Virologist and Molecular Biologist 1947 NA ## 10 Jennifer Doudna Biochemist 1964 NA ``` --- ## `right_join()` .pull-left[ <img src="images/right-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ```r right_join(x, y) ``` ``` ## Joining, by = "id" ``` ``` ## # A tibble: 3 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 4 <NA> y4 ``` ] --- ## `right_join()` ```r professions %>% * right_join(dates) ``` ``` ## Joining, by = "name" ``` ``` ## # A tibble: 8 × 4 ## name profession birth_year death_year ## <chr> <chr> <dbl> <dbl> ## 1 Janaki Ammal Botanist 1897 1984 ## 2 Chien-Shiung Wu Physicist 1912 1997 ## 3 Katherine Johnson Mathematician 1918 2020 ## 4 Rosalind Franklin Chemist 1920 1958 ## 5 Vera Rubin Astronomer 1928 2016 ## 6 Gladys West Mathematician 1930 NA ## 7 Flossie Wong-Staal Virologist and Molecular Biologist 1947 NA ## 8 Jennifer Doudna Biochemist 1964 NA ``` --- ## `full_join()` .pull-left[ <img src="images/full-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ```r full_join(x, y) ``` ``` ## Joining, by = "id" ``` ``` ## # A tibble: 4 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ## 4 4 <NA> y4 ``` ] --- ## `full_join()` ```r dates %>% * full_join(works) ``` ``` ## Joining, by = "name" ``` ``` ## # A tibble: 10 × 4 ## name birth_year death_year known_for ## <chr> <dbl> <dbl> <chr> ## 1 Janaki Ammal 1897 1984 hybrid species, biodiversity protec… ## 2 Chien-Shiung Wu 1912 1997 experiment overturning theory of pa… ## 3 Katherine Johnson 1918 2020 orbital mechanics critical to sendi… ## 4 Rosalind Franklin 1920 1958 <NA> ## 5 Vera Rubin 1928 2016 existence of dark matter ## 6 Gladys West 1930 NA mathematical modeling of the shape … ## 7 Flossie Wong-Staal 1947 NA first to clone HIV and map its gene… ## 8 Jennifer Doudna 1964 NA one of the primary developers of CR… ## 9 Ada Lovelace NA NA first computer algorithm ## 10 Marie Curie NA NA theory of radioactivity, first wom… ``` --- ## `inner_join()` .pull-left[ <img src="images/inner-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ```r inner_join(x, y) ``` ``` ## Joining, by = "id" ``` ``` ## # A tibble: 2 × 3 ## id value_x value_y ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ``` ] --- ## `inner_join()` ```r dates %>% * inner_join(works) ``` ``` ## Joining, by = "name" ``` ``` ## # A tibble: 7 × 4 ## name birth_year death_year known_for ## <chr> <dbl> <dbl> <chr> ## 1 Janaki Ammal 1897 1984 hybrid species, biodiversity protect… ## 2 Chien-Shiung Wu 1912 1997 experiment overturning theory of par… ## 3 Katherine Johnson 1918 2020 orbital mechanics critical to sendin… ## 4 Vera Rubin 1928 2016 existence of dark matter ## 5 Gladys West 1930 NA mathematical modeling of the shape o… ## 6 Flossie Wong-Staal 1947 NA first to clone HIV and map its genes… ## 7 Jennifer Doudna 1964 NA one of the primary developers of CRI… ``` --- ## `semi_join()` .pull-left[ <img src="images/semi-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ```r semi_join(x, y) ``` ``` ## Joining, by = "id" ``` ``` ## # A tibble: 2 × 2 ## id value_x ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ``` ] --- ## `semi_join()` ```r dates %>% * semi_join(works) ``` ``` ## Joining, by = "name" ``` ``` ## # A tibble: 7 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Janaki Ammal 1897 1984 ## 2 Chien-Shiung Wu 1912 1997 ## 3 Katherine Johnson 1918 2020 ## 4 Vera Rubin 1928 2016 ## 5 Gladys West 1930 NA ## 6 Flossie Wong-Staal 1947 NA ## 7 Jennifer Doudna 1964 NA ``` --- ## `anti_join()` .pull-left[ <img src="images/anti-join.gif" width="80%" style="background-color: #FDF6E3" /> ] .pull-right[ ```r anti_join(x, y) ``` ``` ## Joining, by = "id" ``` ``` ## # A tibble: 1 × 2 ## id value_x ## <dbl> <chr> ## 1 3 x3 ``` ] --- ## `anti_join()` ```r dates %>% * anti_join(works) ``` ``` ## Joining, by = "name" ``` ``` ## # A tibble: 1 × 3 ## name birth_year death_year ## <chr> <dbl> <dbl> ## 1 Rosalind Franklin 1920 1958 ``` --- ## Putting it altogether .midi[ ```r scientists <- professions %>% left_join(dates) %>% left_join(works) ``` ``` ## Joining, by = "name" ## Joining, by = "name" ``` ```r scientists ``` ``` ## # A tibble: 10 × 5 ## name profession birth_year death_year known_for ## <chr> <chr> <dbl> <dbl> <chr> ## 1 Ada Lovelace Mathematician NA NA first computer algo… ## 2 Marie Curie Physicist and … NA NA theory of radioacti… ## 3 Janaki Ammal Botanist 1897 1984 hybrid species, bio… ## 4 Chien-Shiung Wu Physicist 1912 1997 experiment overturn… ## 5 Katherine Johnson Mathematician 1918 2020 orbital mechanics c… ## 6 Rosalind Franklin Chemist 1920 1958 <NA> ## 7 Vera Rubin Astronomer 1928 2016 existence of dark m… ## 8 Gladys West Mathematician 1930 NA mathematical modeli… ## 9 Flossie Wong-Staal Virologist and… 1947 NA first to clone HIV … ## 10 Jennifer Doudna Biochemist 1964 NA one of the primary … ``` ] --- ## `*_join()` functions - From **dplyr** - Incredibly useful for bringing datasets with common information (e.g., unique identifier) together - Use `by` argument when the names of the column containing the common information are not the same across datasets - Always check that the numbers of rows and columns of the result dataset makes sense - Refer to [two-table verbs vignette](https://dplyr.tidyverse.org/articles/two-table.html) when needed --- ## Visualizing joined data <img src="07-data-wrangling-II_files/figure-html/scientists-gantt-plot-1.png" width="100%" /> --- class: middle ## .hand[livecoding] (See next two slides for code developed during live coding session) --- .small[ ```r scientists_longer <- scientists %>% mutate( birth_year = case_when( name == "Ada Lovelace" ~ 1815, name == "Marie Curie" ~ 1867, TRUE ~ birth_year ), death_year = case_when( name == "Ada Lovelace" ~ 1852, name == "Marie Curie" ~ 1934, name == "Flossie Wong-Staal" ~ 2020, TRUE ~ death_year ), status = if_else(is.na(death_year), "alive", "deceased"), death_year = if_else(is.na(death_year), 2021, death_year), known_for = if_else(name == "Rosalind Franklin", "understanding of the molecular structures of DNA ", known_for) ) %>% pivot_longer( cols = contains("year"), names_to = "year_type", values_to = "year" ) %>% mutate(death_year_fake = if_else(year == 2021, TRUE, FALSE)) ``` ] --- .small[ ```r library(colorblindr) # clauswilke/colorblindr ggplot(scientists_longer, aes(x = year, y = fct_reorder(name, as.numeric(factor(profession))), group = name, color = profession)) + geom_point(aes(shape = death_year_fake), show.legend = FALSE) + geom_line(aes(linetype = status), show.legend = FALSE) + scale_shape_manual(values = c("circle", NA)) + scale_linetype_manual(values = c("dashed", "solid")) + scale_color_OkabeIto(darken = 0.4) + scale_x_continuous(expand = c(0.01, 0), breaks = seq(1820, 2020, 50)) + geom_text(aes(y = name, label = known_for), x = 2030, show.legend = FALSE, hjust = 0) + geom_text(aes(label = profession), x = 1809, y = Inf, hjust = 1, vjust = 1, show.legend = FALSE) + coord_cartesian(clip = "off") + labs( x = "Year", y = NULL, title = "10 women in science who changed the world", caption = "Source: Discover magazine" ) + facet_grid(profession ~ ., scales = "free_y", space = "free_y", switch = "x") + theme( plot.margin = unit(c(1, 23, 1, 4), "lines"), plot.title.position = "plot", plot.caption.position = "plot", plot.caption = element_text(hjust = 2), # manual hack strip.background = element_blank(), strip.text = element_blank(), axis.title.x = element_text(hjust = 0), panel.background = element_rect(fill = "#f0f0f0", color = "white"), panel.grid.major = element_line(color = "white", size = 0.5) ) ``` ``` ## Warning: Removed 2 rows containing missing values (geom_point). ``` ] --- <center> <blockquote class="twitter-tweet" width="400"><p lang="en" dir="ltr">TIL: debug = TRUE argument for element_text() in <a href="https://twitter.com/hashtag/ggplot2?src=hash&ref_src=twsrc%5Etfw">#ggplot2</a> "aids visual debugging by drawing a solid rectangle behind the complete text area, and a point where each label is anchored" 🤯 <br><br>I don't know how I've never seen this before, it's super helpful! <a href="https://t.co/87VVhMpeLm">pic.twitter.com/87VVhMpeLm</a></p>— Mine Çetinkaya-Rundel (@minebocek) <a href="https://twitter.com/minebocek/status/1437844443588136960?ref_src=twsrc%5Etfw">September 14, 2021</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </center> --- ## Gantt charts <center> <blockquote class="twitter-tweet"><p lang="en" dir="ltr">Oh! Also, if this is for a blogpost could I recommend noting that the datviz you’ve made is called a Gantt chart? <br><br>In my experience that name isn’t very well known, but knowing it helps when googling how to customise these charts 😊</p>— Charlie 👩💻 (@charliejhadley) <a href="https://twitter.com/charliejhadley/status/1437849467793952775?ref_src=twsrc%5Etfw">September 14, 2021</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </center> ---