class: center, middle, inverse, title-slide # visualizing
time series data ### 2021-10-06 --- class: middle, inverse # Welcome --- ## Announcements - HW 3 posted - Midterm grades available on DukeHub - Detailed Project 1 feedback posted on GitHub as issue - Want to improve your Project 1 write-up and reproducibility scores? - If all team members agree, you can make one more round of improvements - Final grade will be average of what you have now on these components and updated (hopefully improved!) grade - Other component scores will stay the same - Due 12pm (beginning of lab) on Monday, Oct 11 - All team members must contribute (will check commits) - Must let me know if you want to do it by end the of class on Friday --- ## High level Project 1 feedback - Outline your two questions in your introduction - Remove packages that are not being used from `library()` calls - Adjust figure size based on how they look on the project website - Read data from the `data` folder - Use `here::here()` to construct file paths - Don't use the word "significant" without a qualifier like statistical or practical - Avoid for loops, prefer vectorized approaches - Add project description to your repo details - Use tidyverse styling for your code, if you need help, use the styler package --- ## Setup .midi[ ```r # load packages library(tidyverse) library(fs) library(lubridate) library(broom) # set default theme for ggplot2 ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16)) # set default figure parameters for knitr knitr::opts_chunk$set( fig.width = 8, fig.asp = 0.618, fig.retina = 3, dpi = 300, out.width = "60%" ) # dplyr print min and max options(dplyr.print_max = 6, dplyr.print_min = 6) ``` ] --- class: middle, inverse # Working with dates --- ## Air Quality Index - The AQI is the Environmental Protection Agency's index for reporting air quality - Higher values of AQI indicate worse air quality <img src="images/aqi-levels.png" title="AQI Basics for Ozone and Particle Pollution" alt="AQI Basics for Ozone and Particle Pollution" width="100%" /> .footnote[ Source: https://www.airnow.gov/aqi-basics ] --- ## AQI levels The previous graphic in tibble form, to be used later... ```r aqi_levels <- tribble( ~aqi_min, ~aqi_max, ~color, ~level, 0, 50, "#D8EEDA", "Good", 51, 100, "#F1E7D4", "Moderate", 101, 150, "#F8E4D8", "Unhealthy for sensitive groups", 151, 200, "#FEE2E1", "Unhealthy", 201, 300, "#F4E3F7", "Very unhealthy", 301, 400, "#F9D0D4", "Hazardous" ) ``` --- ## AQI data - Source: [EPA's Daily Air Quality Tracker](https://www.epa.gov/outdoor-air-quality-data/air-data-daily-air-quality-tracker) - 2016 - 2021 AQI (Ozone and PM2.5 combined) for Durham-Chapel Hill, NC core-based statistical area (CBSA), one file per year - 2016 - 2021 AQI (Ozone and PM2.5 combined) for San Francisco-Oakland-Hayward, CA CBSA, one file per year --- ## 2021 Durham-Chapel Hill - Load data ```r dch_2021 <- read_csv(here::here("data/durham-chapel-hill/ad_aqi_tracker_data-2021.csv")) ``` -- - Metadata ```r dim(dch_2021) ``` ``` ## [1] 365 11 ``` ```r names(dch_2021) ``` ``` ## [1] "Date" "AQI Value" ## [3] "Main Pollutant" "Site Name" ## [5] "Site ID" "Source" ## [7] "20-year High (2000-2019)" "20-year Low (2000-2019)" ## [9] "5-year Average (2015-2019)" "Date of 20-year High" ## [11] "Date of 20-year Low" ``` --- ## Clean variable names ```r dch_2021 <- dch_2021 %>% janitor::clean_names() names(dch_2021) ``` ``` ## [1] "date" "aqi_value" ## [3] "main_pollutant" "site_name" ## [5] "site_id" "source" ## [7] "x20_year_high_2000_2019" "x20_year_low_2000_2019" ## [9] "x5_year_average_2015_2019" "date_of_20_year_high" ## [11] "date_of_20_year_low" ``` --- ## First look .task[ This plot looks quite bizarre. What might be going on? ] ```r ggplot(dch_2021, aes(x = date, y = aqi_value, group = 1)) + geom_line() ``` <img src="11-visualize-time-series_files/figure-html/unnamed-chunk-8-1.png" width="60%" /> --- ## Peek at data ```r dch_2021 %>% select(date, aqi_value, site_name, site_id) ``` ``` ## # A tibble: 365 × 4 ## date aqi_value site_name site_id ## <chr> <chr> <chr> <chr> ## 1 01/01/2021 25 Durham Armory 37-063-0015 ## 2 01/02/2021 22 Durham Armory 37-063-0015 ## 3 01/03/2021 19 Durham Armory 37-063-0015 ## 4 01/04/2021 16 Durham Armory 37-063-0015 ## 5 01/05/2021 38 Durham Armory 37-063-0015 ## 6 01/06/2021 19 Durham Armory 37-063-0015 ## # … with 359 more rows ``` --- ## Transforming date Using `lubridate::mdy()`: ```r dch_2021 %>% mutate(date = mdy(date)) ``` ``` ## # A tibble: 365 × 11 ## date aqi_value main_pollutant site_name site_id source x20_year_high_2… ## <date> <chr> <chr> <chr> <chr> <chr> <dbl> ## 1 2021-01-01 25 PM2.5 Durham Armory 37-063-0015 AQS 111 ## 2 2021-01-02 22 PM2.5 Durham Armory 37-063-0015 AQS 76 ## 3 2021-01-03 19 PM2.5 Durham Armory 37-063-0015 AQS 66 ## 4 2021-01-04 16 PM2.5 Durham Armory 37-063-0015 AQS 61 ## 5 2021-01-05 38 PM2.5 Durham Armory 37-063-0015 AQS 83 ## 6 2021-01-06 19 PM2.5 Durham Armory 37-063-0015 AQS 71 ## # … with 359 more rows, and 4 more variables: x20_year_low_2000_2019 <dbl>, ## # x5_year_average_2015_2019 <dbl>, date_of_20_year_high <chr>, ## # date_of_20_year_low <chr> ``` --- ## Transforming AQI values .task[ What does this warning mean? ] ```r dch_2021 %>% mutate(aqi_value = as.numeric(aqi_value)) ``` ``` ## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion ``` ``` ## # A tibble: 365 × 11 ## date aqi_value main_pollutant site_name site_id source x20_year_high_2… ## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 01/01/2021 25 PM2.5 Durham Armory 37-063-0015 AQS 111 ## 2 01/02/2021 22 PM2.5 Durham Armory 37-063-0015 AQS 76 ## 3 01/03/2021 19 PM2.5 Durham Armory 37-063-0015 AQS 66 ## 4 01/04/2021 16 PM2.5 Durham Armory 37-063-0015 AQS 61 ## 5 01/05/2021 38 PM2.5 Durham Armory 37-063-0015 AQS 83 ## 6 01/06/2021 19 PM2.5 Durham Armory 37-063-0015 AQS 71 ## # … with 359 more rows, and 4 more variables: x20_year_low_2000_2019 <dbl>, ## # x5_year_average_2015_2019 <dbl>, date_of_20_year_high <chr>, ## # date_of_20_year_low <chr> ``` --- ## Investigating AQI values - Take a peek at distinct values of AQI ```r dch_2021 %>% distinct(aqi_value) %>% pull() ``` ``` ## [1] "25" "22" "19" "16" "38" "31" "40" "32" "33" "49" "44" "57" "18" "23" "34" ## [16] "45" "20" "43" "51" "27" "30" "17" "8" "21" "39" "41" "28" "42" "24" "29" ## [31] "35" "36" "46" "58" "84" "54" "53" "48" "71" "77" "90" "64" "50" "52" "61" ## [46] "47" "37" "80" "74" "68" "55" "87" "75" "60" "56" "62" "67" "26" "63" "." ``` - `"."` likely indicates `NA`, and it's causing the entire column to be read in as characters --- ## Rewind, and start over ```r dch_2021 <- read_csv( here::here("data/durham-chapel-hill/ad_aqi_tracker_data-2021.csv"), na = c(".", "") ) ``` ```r glimpse(dch_2021) ``` ``` ## Rows: 365 ## Columns: 11 ## $ Date <chr> "01/01/2021", "01/02/2021", "01/03/2021",… ## $ `AQI Value` <dbl> 25, 22, 19, 16, 38, 19, 31, 40, 32, 33, 4… ## $ `Main Pollutant` <chr> "PM2.5", "PM2.5", "PM2.5", "PM2.5", "PM2.… ## $ `Site Name` <chr> "Durham Armory", "Durham Armory", "Durham… ## $ `Site ID` <chr> "37-063-0015", "37-063-0015", "37-063-001… ## $ Source <chr> "AQS", "AQS", "AQS", "AQS", "AQS", "AQS",… ## $ `20-year High (2000-2019)` <dbl> 111, 76, 66, 61, 83, 71, 75, 76, 57, 71, … ## $ `20-year Low (2000-2019)` <dbl> 10, 8, 14, 9, 8, 15, 18, 15, 19, 20, 14, … ## $ `5-year Average (2015-2019)` <dbl> 39.2, 36.8, 38.2, 30.4, 26.0, 32.4, 37.0,… ## $ `Date of 20-year High` <chr> "01/01/2000", "01/02/2005", "01/03/2004",… ## $ `Date of 20-year Low` <chr> "01/01/2007", "01/02/2012", "01/03/2012",… ``` --- ## Data cleaning ```r dch_2021 <- dch_2021 %>% janitor::clean_names() %>% mutate(date = mdy(date)) dch_2021 ``` ``` ## # A tibble: 365 × 11 ## date aqi_value main_pollutant site_name site_id source x20_year_high_2… ## <date> <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 2021-01-01 25 PM2.5 Durham Armory 37-063-0015 AQS 111 ## 2 2021-01-02 22 PM2.5 Durham Armory 37-063-0015 AQS 76 ## 3 2021-01-03 19 PM2.5 Durham Armory 37-063-0015 AQS 66 ## 4 2021-01-04 16 PM2.5 Durham Armory 37-063-0015 AQS 61 ## 5 2021-01-05 38 PM2.5 Durham Armory 37-063-0015 AQS 83 ## 6 2021-01-06 19 PM2.5 Durham Armory 37-063-0015 AQS 71 ## # … with 359 more rows, and 4 more variables: x20_year_low_2000_2019 <dbl>, ## # x5_year_average_2015_2019 <dbl>, date_of_20_year_high <chr>, ## # date_of_20_year_low <chr> ``` --- ## Another look ```r ggplot(dch_2021, aes(x = date, y = aqi_value, group = 1)) + geom_line() ``` ``` ## Warning: Removed 87 row(s) containing missing values (geom_path). ``` <img src="11-visualize-time-series_files/figure-html/dch-2021-first-viz-1.png" width="60%" /> --- .task[ How would you improve this visualization? ] .panelset[ .panel[.panel-name[Plots] ``` ## Warning: Removed 87 row(s) containing missing values (geom_path). ``` <img src="11-visualize-time-series_files/figure-html/unnamed-chunk-17-1.png" width="60%" /> ] .panel[.panel-name[Discuss] <iframe src="https://app.sli.do/event/rxg9buzy" height="100%" width="100%" frameBorder="0" style="min-height: 560px;" title="Slido"></iframe> ] ] --- class: middle .large[.hand[livecoding]] (See next slide for code developed during live coding session) --- class: middle .hand[to be used in the upcoming plots] ```r aqi_levels <- aqi_levels %>% mutate(aqi_mid = ((aqi_min + aqi_max) / 2)) ``` --- .small[ .panelset.sideways[ .panel[.panel-name[Code] ```r dch_2021 %>% filter(!is.na(aqi_value)) %>% ggplot(aes(x = date, y = aqi_value, group = 1)) + geom_rect(data = aqi_levels, aes(ymin = aqi_min, ymax = aqi_max, xmin = as.Date(-Inf), xmax = as.Date(Inf), fill = color, y = NULL, x = NULL)) + geom_line(size = 1) + scale_fill_identity() + scale_x_date(name = NULL, date_labels = "%b", limits = c(ymd("2021-01-01"), ymd("2022-01-01"))) + geom_text(data = aqi_levels, aes(x = ymd("2021-12-31"), y = aqi_mid, label = level), hjust = 1, size = 6, fontface = "bold", color = "white") + annotate(geom = "text", x = c(ymd("2021-01-01"), ymd("2022-01-01")), y = -60, label = c("2021", "2022"), size = 4) + coord_cartesian(clip = "off", ylim = c(0, 400)) + labs(x = NULL, y = "AQI", title = "Ozone and PM2.5 Daily AQI Values", subtitle = "Durham-Chapel Hill, NC", caption = "\nSource: EPA Daily Air Quality Tracker") + theme(plot.title.position = "plot", panel.grid.minor.y = element_blank(), panel.grid.minor.x = element_blank(), plot.margin = unit(c(1, 1, 3, 1), "lines")) ``` ] .panel[.panel-name[Plot] <img src="11-visualize-time-series_files/figure-html/unnamed-chunk-19-1.png" width="100%" /> ] ] ] --- .small[ .panelset.sideways[ .panel[.panel-name[Code] ```r dch_2021 %>% filter(!is.na(aqi_value)) %>% ggplot(aes(x = date, y = aqi_value, group = 1)) + geom_line(size = 1) + scale_x_date(name = NULL, date_labels = "%b", limits = c(ymd("2021-01-01"), ymd("2022-01-01"))) + scale_y_continuous(breaks = c(0, 50, 100, 150, 200, 300, 400)) + geom_text(data = aqi_levels, aes(x = ymd("2021-12-31"), y = aqi_mid, label = level, color = color), hjust = 1, size = 6, fontface = "bold") + scale_color_identity() + annotate(geom = "text", x = c(ymd("2021-01-01"), ymd("2022-01-01")), y = -60, label = c("2021", "2022"), size = 4) + coord_cartesian(clip = "off", ylim = c(0, 400)) + labs(x = NULL, y = "AQI", title = "Ozone and PM2.5 Daily AQI Values", subtitle = "Durham-Chapel Hill, NC", caption = "\nSource: EPA Daily Air Quality Tracker") + theme(plot.title.position = "plot", panel.grid.minor.y = element_blank(), panel.grid.minor.x = element_blank()) ``` ] .panel[.panel-name[Plot] <img src="11-visualize-time-series_files/figure-html/unnamed-chunk-20-1.png" width="100%" /> ] ] ] --- ## Highlights - The **lubridate** package is useful for converting to dates from character strings in a given format, e.g. `mdy()`, `ymd()`, etc. - `scale_x_date`: Set `date_labels` as `"%b %y"` for month-2 digit year, `"%D"` for date format such as `%m/%d/%y`, etc. See help for `strptime()` for more. - `scale_color_identity()` or `scale_fill_identity()` can be useful when your data already represents aesthetic values that ggplot2 can handle directly. By default doesn't produce a legend. --- class: middle # Calculating cumulatives --- ## Cumulatives over time - When visualizing time series data, a somewhat common task is to calculate cumulatives over time and plot them - In our example we'll calculate the number of days with "good" AQI ($\le$ 50) and plot that value on the y-axis and the date on the x-axis --- ## Calculating cumulatives Step 1. Arrange your data ```r dch_2021 %>% select(date, aqi_value) %>% filter(!is.na(aqi_value)) %>% arrange(date) ``` ``` ## # A tibble: 278 × 2 ## date aqi_value ## <date> <dbl> ## 1 2021-01-01 25 ## 2 2021-01-02 22 ## 3 2021-01-03 19 ## 4 2021-01-04 16 ## 5 2021-01-05 38 ## 6 2021-01-06 19 ## # … with 272 more rows ``` --- ## Calculating cumulatives Step 2. Identify good days ```r dch_2021 %>% select(date, aqi_value) %>% filter(!is.na(aqi_value)) %>% arrange(date) %>% mutate(good_aqi = if_else(aqi_value <= 50, 1, 0)) ``` ``` ## # A tibble: 278 × 3 ## date aqi_value good_aqi ## <date> <dbl> <dbl> ## 1 2021-01-01 25 1 ## 2 2021-01-02 22 1 ## 3 2021-01-03 19 1 ## 4 2021-01-04 16 1 ## 5 2021-01-05 38 1 ## 6 2021-01-06 19 1 ## # … with 272 more rows ``` --- ## Calculating cumulatives Step 3. Sum over time ```r dch_2021 %>% select(date, aqi_value) %>% filter(!is.na(aqi_value)) %>% arrange(date) %>% mutate( good_aqi = if_else(aqi_value <= 50, 1, 0), cum_good_aqi = cumsum(good_aqi) ) ``` ``` ## # A tibble: 278 × 4 ## date aqi_value good_aqi cum_good_aqi ## <date> <dbl> <dbl> <dbl> ## 1 2021-01-01 25 1 1 ## 2 2021-01-02 22 1 2 ## 3 2021-01-03 19 1 3 ## 4 2021-01-04 16 1 4 ## 5 2021-01-05 38 1 5 ## 6 2021-01-06 19 1 6 ## # … with 272 more rows ``` --- ## Plotting cumulatives .panelset.sideways[ .panel[.panel-name[Code] ```r dch_2021 %>% select(date, aqi_value) %>% filter(!is.na(aqi_value)) %>% arrange(date) %>% mutate( good_aqi = if_else(aqi_value <= 50, 1, 0), cum_good_aqi = cumsum(good_aqi) ) %>% ggplot(aes(x = date, y = cum_good_aqi, group = 1)) + geom_line() + scale_x_date(date_labels = "%b %Y") + labs( x = NULL, y = "Number of days", title = "Cumulative number of good AQI days (AQI < 50)", subtitle = "Durham-Chapel Hill, NC", caption = "\nSource: EPA Daily Air Quality Tracker" ) + theme(plot.title.position = "plot") ``` ] .panel[.panel-name[Plot] <img src="11-visualize-time-series_files/figure-html/unnamed-chunk-24-1.png" width="90%" /> ] ] --- class: middle # Detrending --- ## Detrending - Detrending is removing prominent long-term trend in time series to specifically highlight any notable deviations - Let's demonstrate using multiple years of AQI data --- ## Read in multiple years of Durham-Chapel Hill data .midi[ ```r dch_files <- fs::dir_ls(here::here("data/durham-chapel-hill")) ``` ] .midi[ ```r dch <- read_csv(dch_files, na = c(".", "")) dch <- dch %>% janitor::clean_names() %>% mutate( date = mdy(date), good_aqi = if_else(aqi_value <= 50, 1, 0) ) %>% filter(!is.na(aqi_value)) %>% arrange(date) %>% mutate(cum_good_aqi = cumsum(good_aqi), .after = aqi_value) dch ``` ``` ## # A tibble: 2,096 × 13 ## date aqi_value cum_good_aqi main_pollutant site_name site_id source ## <date> <dbl> <dbl> <chr> <chr> <chr> <chr> ## 1 2016-01-01 32 1 PM2.5 Durham Armo… 37-063-0… AQS ## 2 2016-01-02 37 2 PM2.5 Durham Armo… 37-063-0… AQS ## 3 2016-01-03 45 3 PM2.5 Durham Armo… 37-063-0… AQS ## 4 2016-01-04 33 4 PM2.5 Durham Armo… 37-063-0… AQS ## 5 2016-01-05 27 5 PM2.5 Durham Armo… 37-063-0… AQS ## 6 2016-01-06 39 6 PM2.5 Durham Armo… 37-063-0… AQS ## # … with 2,090 more rows, and 6 more variables: x20_year_high_2000_2019 <dbl>, ## # x20_year_low_2000_2019 <dbl>, x5_year_average_2015_2019 <dbl>, ## # date_of_20_year_high <chr>, date_of_20_year_low <chr>, good_aqi <dbl> ``` ] --- ## Plot trend since 2016 .panelset.sideways[ .panel[.panel-name[Code] ```r dch %>% ggplot(aes(x = date, y = cum_good_aqi, group = 1)) + geom_smooth(method = "lm", color = "pink") + geom_line() + scale_x_date( expand = expansion(mult = 0.07), date_labels = "%Y" ) + labs( x = NULL, y = "Number of days", title = "Cumulative number of good AQI days (AQI < 50)", subtitle = "Durham-Chapel Hill, NC", caption = "\nSource: EPA Daily Air Quality Tracker" ) + theme(plot.title.position = "plot") ``` ] .panel[.panel-name[Plot] <img src="11-visualize-time-series_files/figure-html/unnamed-chunk-28-1.png" width="90%" /> ] ] --- ## Detrend Step 1. Fit a simple linear regression ```r m <- lm(cum_good_aqi ~ date, data = dch) m ``` ``` ## ## Call: ## lm(formula = cum_good_aqi ~ date, data = dch) ## ## Coefficients: ## (Intercept) date ## -1.306e+04 7.758e-01 ``` --- ## Detrend Step 2. Augment the data with model results (using `broom::augment()`) ```r dch_aug <- augment(m) dch_aug ``` ``` ## # A tibble: 2,096 × 8 ## cum_good_aqi date .fitted .resid .hat .sigma .cooksd .std.resid ## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 2016-01-01 -27.6 28.6 0.00190 20.8 0.00180 1.37 ## 2 2 2016-01-02 -26.8 28.8 0.00190 20.8 0.00183 1.39 ## 3 3 2016-01-03 -26.0 29.0 0.00190 20.8 0.00185 1.40 ## 4 4 2016-01-04 -25.3 29.3 0.00189 20.8 0.00188 1.41 ## 5 5 2016-01-05 -24.5 29.5 0.00189 20.8 0.00190 1.42 ## 6 6 2016-01-06 -23.7 29.7 0.00189 20.8 0.00193 1.43 ## # … with 2,090 more rows ``` --- ## Detrend Step 3. Divide the observed value of `cum_good_aqi` by the respective value in the long-term trend (i.e., `.fitted`) ```r dch_aug <- dch_aug %>% mutate(ratio = cum_good_aqi / .fitted, .after = .fitted) dch_aug ``` ``` ## # A tibble: 2,096 × 9 ## cum_good_aqi date .fitted ratio .resid .hat .sigma .cooksd ## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 2016-01-01 -27.6 -0.0362 28.6 0.00190 20.8 0.00180 ## 2 2 2016-01-02 -26.8 -0.0746 28.8 0.00190 20.8 0.00183 ## 3 3 2016-01-03 -26.0 -0.115 29.0 0.00190 20.8 0.00185 ## 4 4 2016-01-04 -25.3 -0.158 29.3 0.00189 20.8 0.00188 ## 5 5 2016-01-05 -24.5 -0.204 29.5 0.00189 20.8 0.00190 ## 6 6 2016-01-06 -23.7 -0.253 29.7 0.00189 20.8 0.00193 ## # … with 2,090 more rows, and 1 more variable: .std.resid <dbl> ``` --- ## Visualize detrended data .panelset.sideways[ .panel[.panel-name[Code] ```r dch_aug %>% ggplot(aes(x = date, y = ratio, group = 1)) + geom_hline(yintercept = 1, color = "gray") + geom_line() + scale_x_date( expand = expansion(mult = 0.07), date_labels = "%Y" ) + labs( x = NULL, y = "Number of days (detrended)", title = "Cumulative number of good AQI days (AQI < 50)", subtitle = "Durham-Chapel Hill, NC", caption = "\nSource: EPA Daily Air Quality Tracker" ) + theme(plot.title.position = "plot") ``` ] .panel[.panel-name[Plot] <img src="11-visualize-time-series_files/figure-html/unnamed-chunk-32-1.png" width="90%" /> ] ] --- class: middle .large[.hand[barely anything interesting happening!]] --- class: middle .large[.hand[let's look at data from somewhere with a bit more "interesting" air quality data...]] --- ## Read in multiple years of SF data .midi[ ```r sf_files <- fs::dir_ls(here::here("data/san-francisco")) ``` ] .midi[ ```r sf <- read_csv(sf_files, na = c(".", "")) sf <- sf %>% janitor::clean_names() %>% mutate( date = mdy(date), good_aqi = if_else(aqi_value <= 50, 1, 0) ) %>% filter(!is.na(aqi_value)) %>% arrange(date) %>% mutate(cum_good_aqi = cumsum(good_aqi), .after = aqi_value) sf ``` ``` ## # A tibble: 2,105 × 13 ## date aqi_value cum_good_aqi main_pollutant site_name site_id source ## <date> <dbl> <dbl> <chr> <chr> <chr> <chr> ## 1 2016-01-01 32 1 PM2.5 Durham Armo… 37-063-0… AQS ## 2 2016-01-02 37 2 PM2.5 Durham Armo… 37-063-0… AQS ## 3 2016-01-03 45 3 PM2.5 Durham Armo… 37-063-0… AQS ## 4 2016-01-04 33 4 PM2.5 Durham Armo… 37-063-0… AQS ## 5 2016-01-05 27 5 PM2.5 Durham Armo… 37-063-0… AQS ## 6 2016-01-06 39 6 PM2.5 Durham Armo… 37-063-0… AQS ## # … with 2,099 more rows, and 6 more variables: x20_year_high_2000_2019 <dbl>, ## # x20_year_low_2000_2019 <dbl>, x5_year_average_2015_2019 <dbl>, ## # date_of_20_year_high <chr>, date_of_20_year_low <chr>, good_aqi <dbl> ``` ] --- ## Plot trend since 2016 .panelset.sideways[ .panel[.panel-name[Code] ```r sf %>% ggplot(aes(x = date, y = cum_good_aqi, group = 1)) + geom_smooth(method = "lm", color = "pink") + geom_line() + scale_x_date( expand = expansion(mult = 0.07), date_labels = "%Y" ) + labs( x = NULL, y = "Number of days", title = "Cumulative number of good AQI days (AQI < 50)", subtitle = "San Francisco-Oakland-Hayward, CA", caption = "\nSource: EPA Daily Air Quality Tracker" ) + theme(plot.title.position = "plot") ``` ] .panel[.panel-name[Plot] <img src="11-visualize-time-series_files/figure-html/unnamed-chunk-36-1.png" width="90%" /> ] ] --- ## Detrend 1. Fit a simple linear regression ```r m_sf <- lm(cum_good_aqi ~ date, data = sf) ``` -- 2. Augment the data with model results ```r sf_aug <- augment(m_sf) ``` -- 3. Divide the observed value of `cum_good_aqi` by the respective value in the long-term trend (i.e., `.fitted`) ```r sf_aug <- sf_aug %>% mutate(ratio = cum_good_aqi / .fitted, .after = .fitted) ``` --- ## Visualize detrended data .panelset.sideways[ .panel[.panel-name[Code] ```r sf_aug %>% ggplot(aes(x = date, y = ratio, group = 1)) + geom_hline(yintercept = 1, color = "gray") + geom_line() + scale_x_date( expand = expansion(mult = 0.07), date_labels = "%Y" ) + labs( x = NULL, y = "Number of days (detrended)", title = "Cumulative number of good AQI days (AQI < 50)", subtitle = "San Francisco-Oakland-Hayward, CA", caption = "\nSource: EPA Daily Air Quality Tracker" ) + theme(plot.title.position = "plot") ``` ] .panel[.panel-name[Plot] <img src="11-visualize-time-series_files/figure-html/unnamed-chunk-40-1.png" width="90%" /> ] ] --- ## Detrending - In step 2 we fit a very simple model - Depending on the complexity you're trying to capture you might choose to fit a much more complex model - You can also decompose the trend into multiple trends, e.g. monthly, long-term, seasonal, etc. <br> .note[ Interested in learning more? Take STA 344 - Spatio-temporal analysis! ]