class: center, middle, inverse, title-slide # deep dive: geoms ### 2021-09-08 --- class: middle, inverse # Announcements --- ## Announcements - HW 1 is posted: - Go to the course GitHub org and look for a repo named `homework-01-YOUR_GITHUB_NAME` - Instructions are in the README - All work must be completed in the `hw-01.Rmd` document, knit, and submitted along with the output `hw-01.md` - Due Wednesday, 8 Sep, by 12pm - At the deadline you'll lose push access to your repo; if you want to submit late, email me to enable late submission - Post questions on the discussion forum - Check email before lab on Monday for your team assignments, and sit with your teams in class on Monday --- class: middle, inverse # Setup --- ## Packages ```r # load packages library(tidyverse) ``` ``` ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ── ``` ``` ## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4 ## ✓ tibble 3.1.4 ✓ dplyr 1.0.7 ## ✓ tidyr 1.1.3 ✓ stringr 1.4.0 ## ✓ readr 2.0.1 ✓ forcats 0.5.1 ``` ``` ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() ``` ```r library(openintro) ``` ``` ## Loading required package: airports ``` ``` ## Loading required package: cherryblossom ``` ``` ## Loading required package: usdata ``` --- ## ggplot2 theme ```r # set default theme for ggplot2 ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16)) ``` --- ## Figure sizing For more on including figures in R Markdown documents with the right size, resolution, etc., 
the following resources are great: - [R for Data Science - Graphics for communication](https://r4ds.had.co.nz/graphics-for-communication.html) - [Tips and tricks for working with images and figures in R Markdown documents](https://www.zevross.com/blog/2017/06/19/tips-and-tricks-for-working-with-images-and-figures-in-r-markdown-documents/) ```r # set default figure parameters for knitr knitr::opts_chunk$set( fig.width = 8, # 8" fig.asp = 0.618, # the golden ratio fig.retina = 3, # dpi multiplier for displaying HTML output on retina dpi = 300, # higher dpi, sharper image out.width = "60%" ) ``` --- ## Data prep: new variables From last class... ```r duke_forest <- duke_forest %>% mutate( decade_built = (year_built %/% 10) * 10, decade_built_cat = case_when( decade_built <= 1940 ~ "1940 or before", decade_built >= 1990 ~ "1990 or after", TRUE ~ as.character(decade_built) ), decade_built_cat = factor(decade_built_cat, ordered = TRUE) ) duke_forest %>% select(year_built, decade_built, decade_built_cat) ``` ``` ## # A tibble: 98 × 3 ## year_built decade_built decade_built_cat ## <dbl> <dbl> <ord> ## 1 1972 1970 1970 ## 2 1969 1960 1960 ## 3 1959 1950 1950 ## 4 1961 1960 1960 ## 5 2020 2020 1990 or after ## 6 2014 2010 1990 or after ## 7 1968 1960 1960 ## 8 1973 1970 1970 ## 9 1972 1970 1970 ## 10 1964 1960 1960 ## # … with 88 more rows ``` --- ## Data prep: summary table From last class... ```r mean_area_decade <- duke_forest %>% group_by(decade_built_cat) %>% summarise(mean_area = mean(area)) mean_area_decade ``` ``` ## # A tibble: 6 × 2 ## decade_built_cat mean_area ## <ord> <dbl> ## 1 1940 or before 2072. ## 2 1950 2545. ## 3 1960 2873. ## 4 1970 3413. ## 5 1980 2889. ## 6 1990 or after 2822. ``` --- class: middle, inverse # Aesthetic mappings in ggplot2 --- ## Global vs. layer-specific aesthetics - Aesthetic mappings can be supplied in the initial `ggplot()` call, in individual layers, or in some combination of both. 
- Within each layer, you can add, override, or remove mappings. - If you only have one layer in the plot, the way you specify aesthetics doesn’t make any difference. However, the distinction is important when you start adding additional layers. --- ## Activity: Spot the difference I .task[ Do you expect the following plots to be the same or different? If different, how? Discuss in a pair (or group) without running the code and sketch the resulting plots based on what you think the code will produce. ] .panelset[ .panel[.panel-name[Plots] ```r # Plot A ggplot(duke_forest, aes(x = area, y = price, color = decade_built_cat)) + geom_point(alpha = 0.7) + geom_smooth(method = "lm", se = FALSE, size = 0.5) ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` ```r # Plot B ggplot(duke_forest, aes(x = area, y = price)) + geom_point(aes(color = decade_built_cat), alpha = 0.7) + geom_smooth(method = "lm", se = FALSE, size = 0.5) ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` ] .panel[.panel-name[Discuss] <iframe src="https://app.sli.do/event/rxg9buzy" height="100%" width="100%" frameBorder="0" style="min-height: 560px;" title="Slido"></iframe> ] ]
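---

## Adding, overriding, and removing mappings

A minimal sketch of what "add, override, or remove" can look like in a layer. It colors by `factor(bed)` rather than `decade_built_cat`; that is an illustrative choice so the block runs on the unmodified `duke_forest` data, and the object names (`p_global`, `p_layer`, `p_removed`) are hypothetical.

```r
library(ggplot2)
library(openintro) # provides the duke_forest data

# Global mapping: color is inherited by every layer,
# so geom_smooth() fits one line per bedroom group
p_global <- ggplot(duke_forest, aes(x = area, y = price, color = factor(bed))) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE)

# Layer-specific mapping: color is added in geom_point() only,
# so geom_smooth() sees no grouping and fits a single line
p_layer <- ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point(aes(color = factor(bed)), alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE)

# Removing a mapping: aes(color = NULL) drops the inherited color
# mapping in that one layer, again giving a single line
p_removed <- ggplot(duke_forest, aes(x = area, y = price, color = factor(bed))) +
  geom_point(alpha = 0.7) +
  geom_smooth(aes(color = NULL), method = "lm", se = FALSE)
```

Printing each object draws the plot; `layer_data(p_global, 2)` and `layer_data(p_layer, 2)` show the data computed for the smoothing layer if you want to compare the groups directly.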
--- ## Activity: Spot the difference II .task[ Do you expect the following plots to be the same or different? If different, how? Discuss in a pair (or group) without running the code and sketch the resulting plots based on what you think the code will produce. ] .panelset[ .panel[.panel-name[Plots] ```r # Plot A ggplot(duke_forest, aes(x = area, y = price)) + geom_point(aes(color = decade_built_cat)) ``` ```r # Plot B ggplot(duke_forest, aes(x = area, y = price)) + geom_point(color = "blue") ``` ```r # Plot C ggplot(duke_forest, aes(x = area, y = price)) + geom_point(color = "#a493ba") ``` ] .panel[.panel-name[Discuss] <iframe src="https://app.sli.do/event/rxg9buzy" height="100%" width="100%" frameBorder="0" style="min-height: 560px;" title="Slido"></iframe> ] ]
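---

## Mapping vs. setting

The plots in this activity contrast mapping a variable to color inside `aes()` with setting color to a fixed value outside it. A related case, not shown in the plots above and added here only as a hedged illustration, is what happens when a constant is placed inside `aes()`. The object names are hypothetical.

```r
library(ggplot2)
library(openintro) # provides the duke_forest data

# Setting: color is a fixed value outside aes(); no legend is drawn
p_set <- ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point(color = "blue")

# Mapping a constant: "blue" is treated as data with a single level,
# so the color scale picks the color and a legend is drawn
p_mapped <- ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point(aes(color = "blue"))

# Inspect the colors that each plot actually uses for its points
unique(layer_data(p_set)$colour)
unique(layer_data(p_mapped)$colour)
```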
--- class: middle, inverse # Geoms --- ## Geoms - Geometric objects, or geoms for short, perform the actual rendering of the layer, controlling the type of plot that you create - You can think of them as "the geometric shape used to represent the data" --- ## One variable - Discrete: - `geom_bar()`: display distribution of discrete variable. - Continuous - `geom_histogram()`: bin and count continuous variable, display with bars - `geom_density()`: smoothed density estimate - `geom_dotplot()`: stack individual points into a dot plot - `geom_freqpoly()`: bin and count continuous variable, display with lines --- ## .hand[aside...] Always use "typewriter text" (monospace font) when writing function names, and follow with `()`, e.g., - `geom_freqpoly()` - `mean()` - `lm()` --- ## `geom_bar()` ```r ggplot(duke_forest, aes(x = decade_built_cat)) + geom_bar() ``` <img src="04-geoms_files/figure-html/unnamed-chunk-10-1.png" width="60%" /> --- ## `geom_bar()` ```r ggplot(duke_forest, aes(y = decade_built_cat)) + geom_bar() ``` <img src="04-geoms_files/figure-html/unnamed-chunk-11-1.png" width="60%" /> --- ## `geom_histogram()` ```r ggplot(duke_forest, aes(x = price)) + geom_histogram() ``` ``` ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. 
``` <img src="04-geoms_files/figure-html/unnamed-chunk-12-1.png" width="60%" /> --- ## `geom_histogram()` and `binwidth` .panelset[ .panel[.panel-name[20K] ```r ggplot(duke_forest, aes(x = price)) + geom_histogram(binwidth = 20000) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-13-1.png" width="60%" /> ] .panel[.panel-name[50K] ```r ggplot(duke_forest, aes(x = price)) + geom_histogram(binwidth = 50000) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-14-1.png" width="60%" /> ] .panel[.panel-name[100K] ```r ggplot(duke_forest, aes(x = price)) + geom_histogram(binwidth = 100000) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-15-1.png" width="60%" /> ] .panel[.panel-name[200K] ```r ggplot(duke_forest, aes(x = price)) + geom_histogram(binwidth = 200000) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-16-1.png" width="60%" /> ] ] --- ## `geom_density()` ```r ggplot(duke_forest, aes(x = price)) + geom_density() ``` <img src="04-geoms_files/figure-html/unnamed-chunk-17-1.png" width="60%" /> --- ## `geom_density()` and bandwidth (`bw`) .panelset[ .panel[.panel-name[1] ```r ggplot(duke_forest, aes(x = price)) + geom_density(bw = 1) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-18-1.png" width="60%" /> ] .panel[.panel-name[1000] ```r ggplot(duke_forest, aes(x = price)) + geom_density(bw = 1000) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-19-1.png" width="60%" /> ] .panel[.panel-name[50000] ```r ggplot(duke_forest, aes(x = price)) + geom_density(bw = 50000) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-20-1.png" width="60%" /> ] .panel[.panel-name[500000] ```r ggplot(duke_forest, aes(x = price)) + geom_density(bw = 500000) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-21-1.png" width="60%" /> ] ] --- ## `geom_density()` outlines .panelset[ .panel[.panel-name[full] ```r ggplot(duke_forest, aes(x = price)) + geom_density(outline.type = "full") ``` <img src="04-geoms_files/figure-html/unnamed-chunk-22-1.png" 
width="60%" /> ] .panel[.panel-name[both] ```r ggplot(duke_forest, aes(x = price)) + geom_density(outline.type = "both") ``` <img src="04-geoms_files/figure-html/unnamed-chunk-23-1.png" width="60%" /> ] .panel[.panel-name[upper] ```r ggplot(duke_forest, aes(x = price)) + geom_density(outline.type = "upper") ``` <img src="04-geoms_files/figure-html/unnamed-chunk-24-1.png" width="60%" /> ] .panel[.panel-name[lower] ```r ggplot(duke_forest, aes(x = price)) + geom_density(outline.type = "lower") ``` <img src="04-geoms_files/figure-html/unnamed-chunk-25-1.png" width="60%" /> ] ] --- ## `geom_dotplot()` .task[ What does each point represent? How are their locations determined? What do the x and y axes represent? ] .panelset[ .panel[.panel-name[Plot] ```r ggplot(duke_forest, aes(x = price)) + geom_dotplot(binwidth = 50000) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-26-1.png" width="60%" /> ] .panel[.panel-name[Discuss] <iframe src="https://app.sli.do/event/rxg9buzy" height="100%" width="100%" frameBorder="0" style="min-height: 560px;" title="Slido"></iframe> ] ]
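---

## Binning and counting by hand

The "bin and count" step behind `geom_histogram()` and `geom_freqpoly()` can be approximated with **dplyr**. This is a sketch under stated assumptions: the `boundary = 0` and `closed = "left"` arguments are chosen here so that the histogram bins start at multiples of the bin width, which is not the default placement, and the names `price_bins` and `p_hist` are hypothetical.

```r
library(ggplot2)
library(dplyr)
library(openintro) # provides the duke_forest data

binwidth <- 50000

# Bin by hand: assign each price to the left edge of its bin, then count
price_bins <- duke_forest %>%
  mutate(bin_start = floor(price / binwidth) * binwidth) %>%
  count(bin_start)

price_bins

# Let ggplot2 do the binning, with bin edges at multiples of binwidth
p_hist <- ggplot(duke_forest, aes(x = price)) +
  geom_histogram(binwidth = binwidth, boundary = 0, closed = "left")

# Both approaches place every home in exactly one bin
sum(price_bins$n)
sum(layer_data(p_hist)$count)
```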
--- ## `geom_freqpoly()` ```r ggplot(duke_forest, aes(x = price)) + geom_freqpoly(binwidth = 50000) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-28-1.png" width="60%" /> --- ## `geom_freqpoly()` for comparisons .panelset[ .panel[.panel-name[Histogram] ```r ggplot(duke_forest, aes(x = price, fill = decade_built_cat)) + geom_histogram(binwidth = 100000) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-29-1.png" width="60%" /> ] .panel[.panel-name[Frequency polygon] ```r ggplot(duke_forest, aes(x = price, color = decade_built_cat)) + geom_freqpoly(binwidth = 100000, size = 1) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-30-1.png" width="60%" /> ] ] --- ## Two variables - both continuous - `geom_point()`: scatterplot - `geom_quantile()`: smoothed quantile regression - `geom_rug()`: marginal rug plots - `geom_smooth()`: smoothed line of best fit - `geom_text()`: text labels --- ## `geom_rug()` ```r ggplot(duke_forest, aes(x = area, y = price)) + geom_point() + geom_rug() ``` <img src="04-geoms_files/figure-html/unnamed-chunk-31-1.png" width="60%" /> --- ## `geom_rug()` on the outside ```r ggplot(duke_forest, aes(x = area, y = price)) + geom_point() + geom_rug(outside = TRUE) + coord_cartesian(clip = "off") ``` <img src="04-geoms_files/figure-html/unnamed-chunk-32-1.png" width="60%" /> --- ## `geom_rug()` on the outside, but better ```r ggplot(duke_forest, aes(x = area, y = price)) + geom_point() + geom_rug(outside = TRUE, sides = "tr") + coord_cartesian(clip = "off") + theme(plot.margin = margin(1, 1, 1, 1, "cm")) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-33-1.png" width="60%" /> --- ## `geom_text()` ```r ggplot(duke_forest, aes(x = area, y = price)) + geom_text(aes(label = bed)) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-34-1.png" width="60%" /> --- ## `geom_text()` and more ```r ggplot(duke_forest, aes(x = area, y = price)) + geom_text(aes(label = bed, size = bed, color = bed)) ``` <img 
src="04-geoms_files/figure-html/unnamed-chunk-35-1.png" width="60%" /> --- ## `geom_text()` and even more ```r ggplot(duke_forest, aes(x = area, y = price)) + geom_text( aes(label = bed, size = bed, color = bed), show.legend = FALSE ) ``` <img src="04-geoms_files/figure-html/unnamed-chunk-36-1.png" width="60%" /> --- ## Two variables - show distribution - `geom_bin2d()`: bin into rectangles and count - `geom_density2d()`: smoothed 2d density estimate - `geom_hex()`: bin into hexagons and count --- ## `geom_hex()` ```r ggplot(duke_forest, aes(x = area, y = price)) + geom_hex() ``` <img src="04-geoms_files/figure-html/unnamed-chunk-37-1.png" width="60%" /> --- ## `geom_hex()` and warnings - Requires installing the [**hexbin**](https://cran.r-project.org/web/packages/hexbin/index.html) package separately! ```r install.packages("hexbin") ``` - Otherwise you might see: ``` Warning: Computation failed in `stat_binhex()` ``` --- ## Two variables - At least one discrete - `geom_count()`: count the number of points at distinct locations - `geom_jitter()`: randomly jitter overlapping points - One continuous, one discrete - `geom_col()`: a bar chart of pre-computed summaries - `geom_boxplot()`: boxplots - `geom_violin()`: show density of values in each group --- ## `geom_jitter()` .task[ How are the following three plots different? 
] .panelset[ .panel[.panel-name[Plot A] ```r ggplot(duke_forest, aes(x = bed, y = price)) + geom_point() ``` <img src="04-geoms_files/figure-html/unnamed-chunk-39-1.png" width="60%" /> ] .panel[.panel-name[Plot B] ```r ggplot(duke_forest, aes(x = bed, y = price)) + geom_jitter() ``` <img src="04-geoms_files/figure-html/unnamed-chunk-40-1.png" width="60%" /> ] .panel[.panel-name[Plot C] ```r ggplot(duke_forest, aes(x = bed, y = price)) + geom_jitter() ``` <img src="04-geoms_files/figure-html/unnamed-chunk-41-1.png" width="60%" /> ] .panel[.panel-name[Discuss] <iframe src="https://app.sli.do/event/rxg9buzy" height="100%" width="100%" frameBorder="0" style="min-height: 560px;" title="Slido"></iframe> ] ]
--- ## `geom_jitter()` and `set.seed()` .panelset[ .panel[.panel-name[Plot A] ```r set.seed(1234) ggplot(duke_forest, aes(x = bed, y = price)) + geom_jitter() ``` <img src="04-geoms_files/figure-html/unnamed-chunk-43-1.png" width="60%" /> ] .panel[.panel-name[Plot B] ```r set.seed(1234) ggplot(duke_forest, aes(x = bed, y = price)) + geom_jitter() ``` <img src="04-geoms_files/figure-html/unnamed-chunk-44-1.png" width="60%" /> ] ] --- ## Two variables - One time, one continuous - `geom_area()`: area plot - `geom_line()`: line plot - `geom_step()`: step plot - Display uncertainty: - `geom_crossbar()`: vertical bar with center - `geom_errorbar()`: error bars - `geom_linerange()`: vertical line - `geom_pointrange()`: vertical line with center - Spatial - `geom_map()`: fast version of `geom_polygon()` for map data (more on this later...) --- ## Average price per year built ```r mean_price_year <- duke_forest %>% group_by(year_built) %>% summarise( n = n(), mean_price = mean(price), sd_price = sd(price) ) mean_price_year ``` ``` ## # A tibble: 44 × 4 ## year_built n mean_price sd_price ## <dbl> <int> <dbl> <dbl> ## 1 1923 1 285000 NA ## 2 1934 1 600000 NA ## 3 1938 1 265000 NA ## 4 1940 1 105000 NA ## 5 1941 2 432500 28284. ## 6 1945 2 525000 530330. ## 7 1951 2 567500 258094. ## 8 1952 2 531250 469165. ## 9 1953 2 575000 35355. ## 10 1954 4 600000 33912. 
## # … with 34 more rows ``` --- ## `geom_line()` ```r ggplot(mean_price_year, aes(x = year_built, y = mean_price)) + geom_line() ``` <img src="04-geoms_files/figure-html/unnamed-chunk-46-1.png" width="60%" /> --- ## `geom_area()` ```r ggplot(mean_price_year, aes(x = year_built, y = mean_price)) + geom_area() ``` <img src="04-geoms_files/figure-html/unnamed-chunk-47-1.png" width="60%" /> --- ## `geom_step()` ```r ggplot(mean_price_year, aes(x = year_built, y = mean_price)) + geom_step() ``` <img src="04-geoms_files/figure-html/unnamed-chunk-48-1.png" width="60%" /> --- ## `geom_errorbar()` .task[ Describe how this plot is constructed and what the points and the lines (error bars) correspond to. ] .panelset[ .panel[.panel-name[Code] ```r ggplot(mean_price_year, aes(x = year_built, y = mean_price)) + geom_point() + geom_errorbar(aes(ymin = mean_price - sd_price, ymax = mean_price + sd_price)) ``` ] .panel[.panel-name[Plot] <img src="04-geoms_files/figure-html/unnamed-chunk-49-1.png" width="60%" /> ] .panel[.panel-name[Discuss] <iframe src="https://app.sli.do/event/rxg9buzy" height="100%" width="100%" frameBorder="0" style="min-height: 560px;" title="Slido"></iframe> ] ]
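---

## `geom_errorbar()` and missing standard deviations

Years with a single sale have `sd_price = NA` in the summary table, so no error bar can be drawn for them. A minimal sketch of one way to make that explicit: draw points for every year, and bars only where a standard deviation exists. The `lower` and `upper` columns and the object names are hypothetical additions for illustration.

```r
library(ggplot2)
library(dplyr)
library(openintro) # provides the duke_forest data

# Same summary as before, plus explicit bounds for the bars
mean_price_year <- duke_forest %>%
  group_by(year_built) %>%
  summarise(
    n = n(),
    mean_price = mean(price),
    sd_price = sd(price)
  ) %>%
  mutate(
    lower = mean_price - sd_price,
    upper = mean_price + sd_price
  )

# sd() of a single value is NA, so these years cannot get a bar
single_sale_years <- filter(mean_price_year, n == 1)

# Points for all years, error bars only for years with more than one sale
p_errorbar <- ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_point() +
  geom_errorbar(
    data = filter(mean_price_year, n > 1),
    aes(ymin = lower, ymax = upper)
  )
```

Passing a filtered data frame to the layer avoids relying on ggplot2 to drop the rows with missing bounds.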
--- ## Let's clean things up a bit! Meet your new best friend, the [**scales**](https://scales.r-lib.org/) package! ```r library(scales) ``` ``` ## ## Attaching package: 'scales' ``` ``` ## The following object is masked from 'package:purrr': ## ## discard ``` ``` ## The following object is masked from 'package:readr': ## ## col_factor ``` --- ## Let's clean things up a bit! .panelset[ .panel[.panel-name[Code] ```r ggplot(duke_forest, aes(x = area, y = price)) + geom_point(alpha = 0.6, size = 2, color = "#012169") + scale_x_continuous(labels = label_number(big.mark = ",")) + scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K")) + labs( x = "Area (square feet)", y = "Sale price (USD)", title = "Sale prices of homes in Duke Forest", subtitle = "As of November 2020", caption = "Source: Zillow.com" ) ``` ] .panel[.panel-name[Plot] <img src="04-geoms_files/figure-html/unnamed-chunk-52-1.png" width="60%" /> ] ] --- class: inverse, middle # Homework 1 --- ## Homework 1 .task[ - Locate your Homework 1 repo, clone it in RStudio (using the containers provided for you) - Put your name in the YAML of `hw-01.Rmd`, knit, commit (with a meaningful message), push - Skim the README to get a sense of the instructions and use the remaining time to ask any clarifying questions and to get started on the first question - Make sure you know how to ask a question on the course discussion forum, and set up notifications ] --- ## Homework 1 tips - Focus on Questions 1-4 before Monday - During Monday's lab you'll get a chance to work on Question 5 - But, of course, do what feels right for you in terms of pace, ordering, etc.! - And ask questions on the course discussion forum