class: center, middle, inverse, title-slide # looking at data ### 2021-09-03 --- class: middle, inverse # Announcements --- ## Announcements - Thank you for completing RQ 1 -- review of challenging topics in lecture today - COVID updates: - Lectures: Planning to continue in person, recordings posted on Duke Capture, linked from course website - Labs: Not recorded, plan for teamwork in place - Regularly check email for any updates - Any questions from lab? --- ## Midori <img src="images/midori.jpeg" title="Photo of my cat Midori" alt="Photo of my cat Midori" width="50%" style="display: block; margin: auto;" /> --- class: middle, inverse # A/B testing --- ## Data: Sale prices of houses in Duke Forest .pull-left[ - Data on houses that were sold in the Duke Forest neighborhood of Durham, NC around November 2020 - Scraped from Zillow - Source: `openintro::duke_forest` ] .pull-right[ <img src="images/duke_forest_home.jpg" title="Home in Duke Forest" alt="Home in Duke Forest" width="100%" style="display: block; margin: auto 0 auto auto;" /> ] --- ## `openintro::duke_forest` ```r library(tidyverse) library(openintro) glimpse(duke_forest) ``` ``` ## Rows: 98 ## Columns: 13 ## $ address <chr> "1 Learned Pl, Durham, NC 27705", "1616 Pinecrest Rd, Durha… ## $ price <dbl> 1520000, 1030000, 420000, 680000, 428500, 456000, 1270000, … ## $ bed <dbl> 3, 5, 2, 4, 4, 3, 5, 4, 4, 3, 4, 4, 3, 5, 4, 5, 3, 4, 4, 3,… ## $ bath <dbl> 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 5.0, 3.0, 5.0, 2.0, 3.0, 3.0,… ## $ area <dbl> 6040, 4475, 1745, 2091, 1772, 1950, 3909, 2841, 3924, 2173,… ## $ type <chr> "Single Family", "Single Family", "Single Family", "Single … ## $ year_built <dbl> 1972, 1969, 1959, 1961, 2020, 2014, 1968, 1973, 1972, 1964,… ## $ heating <chr> "Other, Gas", "Forced air, Gas", "Forced air, Gas", "Heat p… ## $ cooling <fct> central, central, central, central, central, central, centr… ## $ parking <chr> "0 spaces", "Carport, Covered", "Garage - Attached, Covered… ## $ lot <dbl> 0.97, 1.38, 0.51, 0.84, 0.16, 0.45, 0.94, 0.79, 0.53, 0.73,… ## $ hoa <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… ## $ url <chr> "https://www.zillow.com/homedetails/1-Learned-Pl-Durham-NC-… ``` --- ## A simple visualization .panelset[ .panel[.panel-name[Code] ```r ggplot(duke_forest, aes(x = area, y = price)) + geom_point(alpha = 0.7, size = 2) + geom_smooth(method = "lm", se = FALSE, size = 0.7) + labs( x = "Area (square feet)", y = "Sale price (USD)", title = "Price and area of houses in Duke Forest" ) ``` ] .panel[.panel-name[Plot] <img src="03-look-at-data_files/figure-html/unnamed-chunk-4-1.png" width="70%" /> ] ] --- ## New variable: `decade_built` ```r duke_forest <- duke_forest %>% mutate(decade_built = (year_built %/% 10) * 10) duke_forest %>% select(year_built, decade_built) ``` ``` ## # A tibble: 98 × 2 ## year_built decade_built ## <dbl> <dbl> ## 1 1972 1970 ## 2 1969 1960 ## 3 1959 1950 ## 4 1961 1960 ## 5 2020 2020 ## 6 2014 2010 ## 7 1968 1960 ## 8 1973 1970 ## 9 1972 1970 ## 10 1964 1960 ## # … with 88 more rows ``` --- ## Distribution of `decade_built` ```r duke_forest <- duke_forest %>% mutate( decade_built = (year_built %/% 10) * 10 ) duke_forest %>% count(decade_built) ``` ``` ## # A tibble: 11 × 2 ## decade_built n ## <dbl> <int> ## 1 1920 1 ## 2 1930 2 ## 3 1940 5 ## 4 1950 26 ## 5 1960 32 ## 6 1970 11 ## 7 1980 13 ## 8 1990 1 ## 9 2000 1 ## 10 2010 5 ## 11 2020 1 ``` --- ## New variable: `decade_built_cat` ```r duke_forest <- duke_forest %>% mutate( decade_built_cat = case_when( decade_built <= 1940 ~ "1940 or before", decade_built >= 1990 ~ "1990 or after", TRUE ~ as.character(decade_built) ) ) duke_forest %>% count(decade_built_cat) ``` ``` ## # A tibble: 6 × 2 ## decade_built_cat n ## <chr> <int> ## 1 1940 or before 8 ## 2 1950 26 ## 3 1960 32 ## 4 1970 11 ## 5 1980 13 ## 6 1990 or after 8 ``` --- ## A slightly more complex visualization .panelset[ .panel[.panel-name[Code] ```r ggplot(duke_forest, aes(x = area, y = price, color = decade_built_cat)) + geom_point(alpha = 0.7, show.legend = FALSE) + geom_smooth(method = "lm", se = FALSE, size = 0.5, show.legend = FALSE) + facet_wrap(~decade_built_cat) + labs( x = "Area (square feet)", y = "Sale price (USD)", color = "Decade built", title = "Price and area of houses in Duke Forest" ) ``` ] .panel[.panel-name[Plot] <img src="03-look-at-data_files/figure-html/unnamed-chunk-8-1.png" width="90%" /> ] ] --- class: middle .task[ In the next two slides, the same plots are created with different "cosmetic" choices. Examine the plots two given (Plot A and Plot B), and indicate your preference by voting for one of them in the Vote tab. ] --- ## Test 1 .panelset[ .panel[.panel-name[Plot A] <img src="03-look-at-data_files/figure-html/test-1-a-1.png" width="90%" /> ] .panel[.panel-name[Plot B] <img src="03-look-at-data_files/figure-html/test-1-b-1.png" width="90%" /> ] .panel[.panel-name[Vote] <iframe src="https://app.sli.do/event/rxg9buzy" height="100%" width="100%" frameBorder="0" style="min-height: 560px;" title="Slido"></iframe> ] ] --- ## Test 2 .panelset[ .panel[.panel-name[Plot A] <img src="03-look-at-data_files/figure-html/unnamed-chunk-9-1.png" width="90%" /> ] .panel[.panel-name[Plot B] <img src="03-look-at-data_files/figure-html/test-2-b-1.png" width="90%" /> ] .panel[.panel-name[Vote] <iframe src="https://app.sli.do/event/rxg9buzy" height="100%" width="100%" frameBorder="0" style="min-height: 560px;" title="Slido"></iframe> ] ] --- class: middle .large[ .hand[ a deeper look at the plotting code... ] ] --- ## Minimal theme + viridis scale, default option .panelset[ .panel[.panel-name[Plot] <img src="03-look-at-data_files/figure-html/unnamed-chunk-10-1.png" width="90%" /> ] .panel[.panel-name[Code] ```r ggplot(duke_forest, aes(x = area, y = price, color = decade_built_cat)) + geom_point(alpha = 0.7, show.legend = FALSE) + geom_smooth(method = "lm", se = FALSE, size = 0.5, show.legend = FALSE) + facet_wrap(~decade_built_cat) + labs( x = "Area (square feet)", y = "Sale price (USD)", color = "Decade built", title = "Price and area of houses in Duke Forest" ) + * theme_minimal(base_size = 16) + * scale_color_viridis_d(end = 0.9) ``` ] ] --- ## Viridis scale, option A (magma) .panelset[ .panel[.panel-name[Plot] <img src="03-look-at-data_files/figure-html/unnamed-chunk-12-1.png" width="90%" /> ] .panel[.panel-name[Code] ```r ggplot(duke_forest, aes(x = area, y = price, color = decade_built_cat)) + geom_point(alpha = 0.5, size = 2, show.legend = FALSE) + geom_smooth(method = "lm", se = FALSE, size = 0.5, show.legend = FALSE) + facet_wrap(~decade_built_cat) + labs( x = "Area (square feet)", y = "Sale price (USD)", color = "Decade built", title = "Price and area of houses in Duke Forest" ) + * scale_color_viridis_d(end = 0.8, option = "A") ``` ] ] --- ## Dark theme + further theme customization .panelset[ .panel[.panel-name[Plot] <img src="03-look-at-data_files/figure-html/unnamed-chunk-14-1.png" width="90%" /> ] .panel[.panel-name[Code] ```r ggplot(duke_forest, aes(x = area, y = price, color = decade_built_cat)) + geom_point(alpha = 0.7, show.legend = FALSE) + geom_smooth(method = "lm", se = FALSE, size = 0.5, show.legend = FALSE) + facet_wrap(~decade_built_cat) + labs( x = "Area (square feet)", y = "Sale price (USD)", color = "Decade built", title = "Price and area of houses in Duke Forest", ) + * theme_dark(base_size = 16) + * scale_color_manual(values = c("yellow", "blue", "orange", "red", "green", "white")) + * theme( * text = element_text(color = "red", face = "bold.italic"), * plot.background = element_rect(fill = "yellow") * ) ``` ] ] --- class: middle, inverse # What makes bad figures bad? --- ## Bad taste <img src="03-look-at-data_files/figure-html/unnamed-chunk-16-1.png" width="90%" /> --- ## Data-to-ink ratio .pull-left-wide[ Tufte strongly recommends maximizing the **data-to-ink ratio** this in the Visual Display of Quantitative Information (Tufte, 1983). > Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency. … [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space … [It] is nearly always multivariate … And graphical excellence requires telling the truth about the data. (Tufte, 1983, p. 51). ] .pull-right-narrow[ <img src="images/tufte-visual-display-cover.png" title="Cover of Visual Display of Quantitative Information, Tufte (1983)." alt="Cover of Visual Display of Quantitative Information, Tufte (1983)." width="100%" style="display: block; margin: auto 0 auto auto;" /> ] --- ## Review: RQ 1 .task[ **Question 8.** Which of the following is **FALSE** about the data-to-ink ratio? ] a. Data-to-ink ratio can be increased by removing content-free decoration. (TRUE) b. **Data-to-ink ratio can be increased by choosing a more accessible color palette.** **(FALSE)** c. Visualizations with high data-to-ink ratio can be cognitively difficult for viewers to interpret. (TRUE) d. Visualizations with excessive 'chart junk' tend to have low data-to-ink ratio. (TRUE) .footnote[ Error on quiz: Question asked for TRUE when it should have asked for FALSE. Everyone will get a point for this question regardless of their answer. ] --- .task[ Which of the plots has higher data-to-ink ratio? ] .panelset[ .panel[.panel-name[Plot A] <img src="03-look-at-data_files/figure-html/mean-area-decade-a-1.png" width="70%" /> ] .panel[.panel-name[Plot B] <img src="03-look-at-data_files/figure-html/mean-area-decade-b-1.png" width="70%" /> ] .panel[.panel-name[Vote] <iframe src="https://app.sli.do/event/rxg9buzy" height="100%" width="100%" frameBorder="0" style="min-height: 560px;" title="Slido"></iframe> ] ] --- class: middle .large[ .hand[ a deeper look at the plotting code... ] ] --- ## Summary statistics ```r mean_area_decade <- duke_forest %>% group_by(decade_built_cat) %>% summarise(mean_area = mean(area)) mean_area_decade ``` ``` ## # A tibble: 6 × 2 ## decade_built_cat mean_area ## <chr> <dbl> ## 1 1940 or before 2072. ## 2 1950 2545. ## 3 1960 2873. ## 4 1970 3413. ## 5 1980 2889. ## 6 1990 or after 2822. ``` --- ## Barplot .panelset[ .panel[.panel-name[Plot] <img src="03-look-at-data_files/figure-html/unnamed-chunk-19-1.png" width="70%" /> ] .panel[.panel-name[Code] ```r ggplot(mean_area_decade, aes(y = decade_built_cat, x = mean_area)) + * geom_col() + labs( x = "Mean area (square feet)", y = "Decade built", title = "Mean area of houses in Duke Forest, by decade built" ) + theme_minimal(base_size = 16) ``` ] ] --- ## Scatterplot .panelset[ .panel[.panel-name[Plot] <img src="03-look-at-data_files/figure-html/unnamed-chunk-21-1.png" width="70%" /> ] .panel[.panel-name[Code] ```r ggplot(mean_area_decade, aes(y = decade_built_cat, x = mean_area)) + * geom_point(size = 4) + labs( x = "Mean area (square feet)", y = "Decade built", title = "Mean area of houses in Duke Forest, by decade built" ) + theme_minimal(base_size = 16) ``` ] ] --- ## Lollipop plot -- a happy medium? .panelset[ .panel[.panel-name[Plot] <img src="03-look-at-data_files/figure-html/mean-area-decade-lollipop-1.png" width="70%" /> ] .panel[.panel-name[Code] ```r ggplot(mean_area_decade, aes(y = decade_built_cat, x = mean_area)) + geom_point(size = 4) + * geom_segment( * aes( * x = 0, xend = mean_area, * y = decade_built_cat, yend = decade_built_cat * ) * ) + labs( x = "Mean area (square feet)", y = "Decade built", title = "Mean area of houses in Duke Forest, by decade built" ) + theme_minimal(base_size = 16) ``` ] ] --- ## Activity: Napoleon’s retreat .pull-left-wide[ .task[ .small[ This is Minard’s visualization of Napoleon’s retreat. Discuss in a pair (or group) the features of the following visualization. What are the variables plotted? Which aesthetics are they mapped to? Click [here](https://app.sli.do/event/rxg9buzy) to submit summary of discussion. ] ] ] <img src="images/minard.jpg" title="Minard’s visualization of Napoleon’s retreat" alt="Minard’s visualization of Napoleon’s retreat" width="83%" style="display: block; margin: auto;" /> ```r countdown(minutes = 5, top = 0) ```
05
:
00
--- ## Bad data .panelset[ .panel[.panel-name[Original] <img src="images/healy-democracy-nyt-version.png" title="A crisis of faith in democracy? New York Times." alt="A crisis of faith in democracy? New York Times." width="50%" /> ] .panel[.panel-name[Improved] <img src="images/healy-democracy-voeten-version-2.png" title="A crisis of faith in democracy? New York Times." alt="A crisis of faith in democracy? New York Times." width="50%" /> ] ] .footnote[ Healy, Data Visualization: A practical introduction. [Chapter 1](https://socviz.co/lookatdata.html). Figures 1.8 and 1.9. ] --- ## Bad perception <img src="images/healy-perception-curves.png" title="Aspect ratios affect our perception of rates of change, modeled after an example by William S. Cleveland." alt="Aspect ratios affect our perception of rates of change, modeled after an example by William S. Cleveland." width="80%" /> .footnote[ Healy, Data Visualization: A practical introduction. [Chapter 1](https://socviz.co/lookatdata.html). Figure 1.12. ] --- class: middle, inverse # Aesthetic mappings in ggplot2 --- ## A second look: lollipop plot .panelset[ .panel[.panel-name[Plot] <img src="03-look-at-data_files/figure-html/mean-area-decade-lollipop-layer-1.png" width="70%" /> ] .panel[.panel-name[Code] ```r ggplot(mean_area_decade, aes(y = decade_built_cat, x = mean_area)) + geom_point(size = 4) + geom_segment(aes( x = 0, xend = mean_area, y = decade_built_cat, yend = decade_built_cat )) + labs( x = "Mean area (square feet)", y = "Decade built", title = "Mean area of houses in Duke Forest, by decade built" ) + theme_minimal(base_size = 16) ``` ] ] --- ## Activity: Spot the difference I .task[ Can you spot the differences between the code here and the one provided in the previous slide? Are there any differences in the resulting plot? Work in a pair (or group) to answer. Click [here](https://app.sli.do/event/rxg9buzy) to submit summary of discussion. ] .panelset[ .panel[.panel-name[Plot] <img src="03-look-at-data_files/figure-html/mean-area-decade-lollipop-global-1.png" width="50%" /> ] .panel[.panel-name[Code] ```r ggplot(mean_area_decade, aes(y = decade_built_cat, x = mean_area)) + geom_point(size = 4) + geom_segment(aes( xend = 0, yend = decade_built_cat )) + labs( x = "Mean area (square feet)", y = "Decade built", title = "Mean area of houses in Duke Forest, by decade built" ) + theme_minimal(base_size = 16) ``` ] ]
03
:
00
--- ## Global vs. layer-specific aesthetics - Aesthetic mappings can be supplied in the initial `ggplot()` call, in individual layers, or in some combination of both. - Within each layer, you can add, override, or remove mappings. - If you only have one layer in the plot, the way you specify aesthetics doesn’t make any difference. However, the distinction is important when you start adding additional layers. --- ## Wrap up .task[ Think back to all the plots you saw in the lecture, without flipping back through the slides. Which plot first comes to mind? Describe it in words. ]