visualizing time series data

# visualizing<br>time series data
### 2021-10-06

---

# Welcome

---

## Announcements

- HW 3 posted

- Midterm grades available on DukeHub

- Detailed Project 1 feedback posted on GitHub as issue

- Want to improve your Project 1 write-up and reproducibility scores?
  - If all team members agree, you can make one more round of improvements
  - Final grade will be average of what you have now on these components and updated (hopefully improved!) grade
  - Other component scores will stay the same
  - Due 12pm (beginning of lab) on Monday, Oct 11
  - All team members must contribute (will check commits)
  - Must let me know if you want to do it by end the of class on Friday

---

## High level Project 1 feedback

- Outline your two questions in your introduction
- Remove packages that are not being used from `library()` calls
- Adjust figure size based on how they look on the project website
- Read data from the `data` folder
- Use `here::here()` to construct file paths
- Don't use the word "significant" without a qualifier like statistical or practical
- Avoid for loops, prefer vectorized approaches
- Add project description to your repo details
- Use tidyverse styling for your code, if you need help, use the styler package

---

## Setup

```r
# load packages
library(tidyverse)
library(fs)
library(lubridate)
library(broom)

# set default theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 8, fig.asp = 0.618, fig.retina = 3,
  dpi = 300, out.width = "60%"
)

# dplyr print min and max
options(dplyr.print_max = 6, dplyr.print_min = 6)
```
]

---

# Working with dates

---

## Air Quality Index

- The AQI is the Environmental Protection Agency's index for reporting air quality

- Higher values of AQI indicate worse air quality

---

## AQI levels

The previous graphic in tibble form, to be used later...

```r
aqi_levels <- tribble(
  ~aqi_min, ~aqi_max, ~color,    ~level,
  0,        50,       "#D8EEDA", "Good",
  51,       100,      "#F1E7D4", "Moderate",
  101,      150,      "#F8E4D8", "Unhealthy for sensitive groups",
  151,      200,      "#FEE2E1", "Unhealthy",
  201,      300,      "#F4E3F7", "Very unhealthy",
  301,      400,      "#F9D0D4", "Hazardous"
)
```

---

## AQI data

- Source: [EPA's  Daily Air Quality Tracker](https://www.epa.gov/outdoor-air-quality-data/air-data-daily-air-quality-tracker)

- 2016 - 2021 AQI (Ozone and PM2.5 combined) for Durham-Chapel Hill, NC core-based statistical area (CBSA), one file per year

- 2016 - 2021 AQI (Ozone and PM2.5 combined) for San Francisco-Oakland-Hayward, CA CBSA, one file per year

---

## 2021 Durham-Chapel Hill

- Load data

```r
dch_2021 <- read_csv(here::here("data/durham-chapel-hill/ad_aqi_tracker_data-2021.csv"))
```

- Metadata

```r
dim(dch_2021)
```

```
## [1] 365  11
```

```r
names(dch_2021)
```

```
##  [1] "Date"                       "AQI Value"                 
##  [3] "Main Pollutant"             "Site Name"                 
##  [5] "Site ID"                    "Source"                    
##  [7] "20-year High (2000-2019)"   "20-year Low (2000-2019)"   
##  [9] "5-year Average (2015-2019)" "Date of 20-year High"      
## [11] "Date of 20-year Low"
```

---

## Clean variable names

```r
dch_2021 <- dch_2021 %>%
  janitor::clean_names()

names(dch_2021)
```

```
##  [1] "date"                      "aqi_value"                
##  [3] "main_pollutant"            "site_name"                
##  [5] "site_id"                   "source"                   
##  [7] "x20_year_high_2000_2019"   "x20_year_low_2000_2019"   
##  [9] "x5_year_average_2015_2019" "date_of_20_year_high"     
## [11] "date_of_20_year_low"
```

---

## First look

```r
ggplot(dch_2021, aes(x = date, y = aqi_value, group = 1)) +
  geom_line()
```

---

## Peek at data

```r
dch_2021 %>%
  select(date, aqi_value, site_name, site_id)
```

```
## # A tibble: 365 × 4
##   date       aqi_value site_name     site_id    
##   <chr>      <chr>     <chr>         <chr>      
## 1 01/01/2021 25        Durham Armory 37-063-0015
## 2 01/02/2021 22        Durham Armory 37-063-0015
## 3 01/03/2021 19        Durham Armory 37-063-0015
## 4 01/04/2021 16        Durham Armory 37-063-0015
## 5 01/05/2021 38        Durham Armory 37-063-0015
## 6 01/06/2021 19        Durham Armory 37-063-0015
## # … with 359 more rows
```

---

## Transforming date

Using `lubridate::mdy()`:

```r
dch_2021 %>%
  mutate(date = mdy(date))
```

```
## # A tibble: 365 × 11
##   date       aqi_value main_pollutant site_name     site_id     source x20_year_high_2…
##   <date>     <chr>     <chr>          <chr>         <chr>       <chr>             <dbl>
## 1 2021-01-01 25        PM2.5          Durham Armory 37-063-0015 AQS                 111
## 2 2021-01-02 22        PM2.5          Durham Armory 37-063-0015 AQS                  76
## 3 2021-01-03 19        PM2.5          Durham Armory 37-063-0015 AQS                  66
## 4 2021-01-04 16        PM2.5          Durham Armory 37-063-0015 AQS                  61
## 5 2021-01-05 38        PM2.5          Durham Armory 37-063-0015 AQS                  83
## 6 2021-01-06 19        PM2.5          Durham Armory 37-063-0015 AQS                  71
## # … with 359 more rows, and 4 more variables: x20_year_low_2000_2019 <dbl>,
## #   x5_year_average_2015_2019 <dbl>, date_of_20_year_high <chr>,
## #   date_of_20_year_low <chr>
```

---

## Transforming AQI values

```r
dch_2021 %>%
  mutate(aqi_value = as.numeric(aqi_value))
```

```
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
```

```
## # A tibble: 365 × 11
##   date       aqi_value main_pollutant site_name     site_id     source x20_year_high_2…
##   <chr>          <dbl> <chr>          <chr>         <chr>       <chr>             <dbl>
## 1 01/01/2021        25 PM2.5          Durham Armory 37-063-0015 AQS                 111
## 2 01/02/2021        22 PM2.5          Durham Armory 37-063-0015 AQS                  76
## 3 01/03/2021        19 PM2.5          Durham Armory 37-063-0015 AQS                  66
## 4 01/04/2021        16 PM2.5          Durham Armory 37-063-0015 AQS                  61
## 5 01/05/2021        38 PM2.5          Durham Armory 37-063-0015 AQS                  83
## 6 01/06/2021        19 PM2.5          Durham Armory 37-063-0015 AQS                  71
## # … with 359 more rows, and 4 more variables: x20_year_low_2000_2019 <dbl>,
## #   x5_year_average_2015_2019 <dbl>, date_of_20_year_high <chr>,
## #   date_of_20_year_low <chr>
```

---

## Investigating AQI values

- Take a peek at distinct values of AQI

```r
dch_2021 %>%
  distinct(aqi_value) %>%
  pull()
```

```
##  [1] "25" "22" "19" "16" "38" "31" "40" "32" "33" "49" "44" "57" "18" "23" "34"
## [16] "45" "20" "43" "51" "27" "30" "17" "8"  "21" "39" "41" "28" "42" "24" "29"
## [31] "35" "36" "46" "58" "84" "54" "53" "48" "71" "77" "90" "64" "50" "52" "61"
## [46] "47" "37" "80" "74" "68" "55" "87" "75" "60" "56" "62" "67" "26" "63" "."
```

- `"."` likely indicates `NA`, and it's causing the entire column to be read in as characters

---

## Rewind, and start over

```r
dch_2021 <- read_csv(
  here::here("data/durham-chapel-hill/ad_aqi_tracker_data-2021.csv"),
  na = c(".", "")
  )
```

```r
glimpse(dch_2021)
```

```
## Rows: 365
## Columns: 11
## $ Date                         <chr> "01/01/2021", "01/02/2021", "01/03/2021",…
## $ `AQI Value`                  <dbl> 25, 22, 19, 16, 38, 19, 31, 40, 32, 33, 4…
## $ `Main Pollutant`             <chr> "PM2.5", "PM2.5", "PM2.5", "PM2.5", "PM2.…
## $ `Site Name`                  <chr> "Durham Armory", "Durham Armory", "Durham…
## $ `Site ID`                    <chr> "37-063-0015", "37-063-0015", "37-063-001…
## $ Source                       <chr> "AQS", "AQS", "AQS", "AQS", "AQS", "AQS",…
## $ `20-year High (2000-2019)`   <dbl> 111, 76, 66, 61, 83, 71, 75, 76, 57, 71, …
## $ `20-year Low (2000-2019)`    <dbl> 10, 8, 14, 9, 8, 15, 18, 15, 19, 20, 14, …
## $ `5-year Average (2015-2019)` <dbl> 39.2, 36.8, 38.2, 30.4, 26.0, 32.4, 37.0,…
## $ `Date of 20-year High`       <chr> "01/01/2000", "01/02/2005", "01/03/2004",…
## $ `Date of 20-year Low`        <chr> "01/01/2007", "01/02/2012", "01/03/2012",…
```

---

## Data cleaning

```r
dch_2021 <- dch_2021 %>%
  janitor::clean_names() %>%
  mutate(date = mdy(date))

dch_2021
```

```
## # A tibble: 365 × 11
##   date       aqi_value main_pollutant site_name     site_id     source x20_year_high_2…
##   <date>         <dbl> <chr>          <chr>         <chr>       <chr>             <dbl>
## 1 2021-01-01        25 PM2.5          Durham Armory 37-063-0015 AQS                 111
## 2 2021-01-02        22 PM2.5          Durham Armory 37-063-0015 AQS                  76
## 3 2021-01-03        19 PM2.5          Durham Armory 37-063-0015 AQS                  66
## 4 2021-01-04        16 PM2.5          Durham Armory 37-063-0015 AQS                  61
## 5 2021-01-05        38 PM2.5          Durham Armory 37-063-0015 AQS                  83
## 6 2021-01-06        19 PM2.5          Durham Armory 37-063-0015 AQS                  71
## # … with 359 more rows, and 4 more variables: x20_year_low_2000_2019 <dbl>,
## #   x5_year_average_2015_2019 <dbl>, date_of_20_year_high <chr>,
## #   date_of_20_year_low <chr>
```

---

## Another look

```r
ggplot(dch_2021, aes(x = date, y = aqi_value, group = 1)) +
  geom_line()
```

```
## Warning: Removed 87 row(s) containing missing values (geom_path).
```

---

```
## Warning: Removed 87 row(s) containing missing values (geom_path).
```

<img src="11-visualize-time-series_files/figure-html/unnamed-chunk-17-1.png" width="60%" />
]
.panel[.panel-name[Discuss]
<iframe src="https://app.sli.do/event/rxg9buzy" height="100%" width="100%" frameBorder="0" style="min-height: 560px;" title="Slido"></iframe>
]
]

---

(See next slide for code developed during live coding session)

---

```r
aqi_levels <- aqi_levels %>%
  mutate(aqi_mid = ((aqi_min + aqi_max) / 2))
```

---

```r
dch_2021 %>%
  filter(!is.na(aqi_value)) %>%
  ggplot(aes(x = date, y = aqi_value, group = 1)) +
  geom_rect(data = aqi_levels,
            aes(ymin = aqi_min, ymax = aqi_max,
                xmin = as.Date(-Inf), xmax = as.Date(Inf),
                fill = color, y = NULL, x = NULL)) +
  geom_line(size = 1) +
  scale_fill_identity() +
  scale_x_date(name = NULL, date_labels = "%b",
               limits = c(ymd("2021-01-01"), ymd("2022-01-01"))) +
  geom_text(data = aqi_levels,
            aes(x = ymd("2021-12-31"), y = aqi_mid, label = level),
            hjust = 1, size = 6, fontface = "bold", color = "white") +
  annotate(geom = "text",
           x = c(ymd("2021-01-01"), ymd("2022-01-01")), y = -60,
           label = c("2021", "2022"), size = 4) +
  coord_cartesian(clip = "off", ylim = c(0, 400)) +
  labs(x = NULL, y = "AQI",
       title = "Ozone and PM2.5 Daily AQI Values",
       subtitle = "Durham-Chapel Hill, NC",
       caption = "\nSource: EPA Daily Air Quality Tracker") +
  theme(plot.title.position = "plot",
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        plot.margin = unit(c(1, 1, 3, 1), "lines"))
```

]

]
]
]

---

```r
dch_2021 %>%
  filter(!is.na(aqi_value)) %>%
  ggplot(aes(x = date, y = aqi_value, group = 1)) +
  geom_line(size = 1) +
  scale_x_date(name = NULL, date_labels = "%b",
               limits = c(ymd("2021-01-01"), ymd("2022-01-01"))) +
  scale_y_continuous(breaks = c(0, 50, 100, 150, 200, 300, 400)) +
  geom_text(data = aqi_levels,
            aes(x = ymd("2021-12-31"), y = aqi_mid, label = level, color = color),
            hjust = 1, size = 6, fontface = "bold") +
  scale_color_identity() +
  annotate(geom = "text",
           x = c(ymd("2021-01-01"), ymd("2022-01-01")), y = -60,
           label = c("2021", "2022"), size = 4) +
  coord_cartesian(clip = "off", ylim = c(0, 400)) +
  labs(x = NULL, y = "AQI",
       title = "Ozone and PM2.5 Daily AQI Values",
       subtitle = "Durham-Chapel Hill, NC",
       caption = "\nSource: EPA Daily Air Quality Tracker") +
  theme(plot.title.position = "plot",
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank()) 
```

]

]
]
]

---

## Highlights

- The **lubridate** package is useful for converting to dates from character strings in a given format, e.g. `mdy()`, `ymd()`, etc.

- `scale_x_date`: Set `date_labels` as `"%b %y"` for month-2 digit year, `"%D"` for date format such as `%m/%d/%y`, etc. See help for `strptime()` for more.

- `scale_color_identity()` or `scale_fill_identity()` can be useful when your data already represents aesthetic values that ggplot2 can handle directly. By default doesn't produce  a legend.

---

# Calculating cumulatives

---

## Cumulatives over time

- When visualizing time series data, a somewhat common task is to calculate cumulatives over time and plot them

- In our example we'll calculate the number of days with "good" AQI ($\le$ 50) and plot that value on the y-axis and the date on the x-axis

---

## Calculating cumulatives

Step 1. Arrange your data

```r
dch_2021 %>%
  select(date, aqi_value) %>%
  filter(!is.na(aqi_value)) %>%
  arrange(date)
```

```
## # A tibble: 278 × 2
##   date       aqi_value
##   <date>         <dbl>
## 1 2021-01-01        25
## 2 2021-01-02        22
## 3 2021-01-03        19
## 4 2021-01-04        16
## 5 2021-01-05        38
## 6 2021-01-06        19
## # … with 272 more rows
```

---

## Calculating cumulatives

Step 2. Identify good days

```r
dch_2021 %>%
  select(date, aqi_value) %>%
  filter(!is.na(aqi_value)) %>%
  arrange(date) %>%
  mutate(good_aqi = if_else(aqi_value <= 50, 1, 0))
```

```
## # A tibble: 278 × 3
##   date       aqi_value good_aqi
##   <date>         <dbl>    <dbl>
## 1 2021-01-01        25        1
## 2 2021-01-02        22        1
## 3 2021-01-03        19        1
## 4 2021-01-04        16        1
## 5 2021-01-05        38        1
## 6 2021-01-06        19        1
## # … with 272 more rows
```

---

## Calculating cumulatives

Step 3. Sum over time

```
## # A tibble: 278 × 4
##   date       aqi_value good_aqi cum_good_aqi
##   <date>         <dbl>    <dbl>        <dbl>
## 1 2021-01-01        25        1            1
## 2 2021-01-02        22        1            2
## 3 2021-01-03        19        1            3
## 4 2021-01-04        16        1            4
## 5 2021-01-05        38        1            5
## 6 2021-01-06        19        1            6
## # … with 272 more rows
```

---

## Plotting cumulatives

.panelset.sideways[
.panel[.panel-name[Code]

```r
dch_2021 %>%
  select(date, aqi_value) %>%
  filter(!is.na(aqi_value)) %>%
  arrange(date) %>%
  mutate(
    good_aqi = if_else(aqi_value <= 50, 1, 0),
    cum_good_aqi = cumsum(good_aqi)
    ) %>%
  ggplot(aes(x = date, y = cum_good_aqi, group = 1)) +
  geom_line() +
  scale_x_date(date_labels = "%b %Y") +
  labs(
    x = NULL, y = "Number of days",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")
```

]

]
]

---

# Detrending

---

## Detrending

- Detrending is removing prominent long-term trend in time series to specifically highlight any notable deviations

- Let's demonstrate using multiple years of AQI data

---

## Read in multiple years of Durham-Chapel Hill data

```r
dch_files <- fs::dir_ls(here::here("data/durham-chapel-hill"))
```
]

```r
dch <- read_csv(dch_files, na = c(".", ""))

dch <- dch %>%
  janitor::clean_names() %>%
  mutate(
    date = mdy(date),
    good_aqi = if_else(aqi_value <= 50, 1, 0)
  ) %>%
  filter(!is.na(aqi_value)) %>%
  arrange(date) %>%
  mutate(cum_good_aqi = cumsum(good_aqi), .after = aqi_value)

dch
```

```
## # A tibble: 2,096 × 13
##   date       aqi_value cum_good_aqi main_pollutant site_name    site_id   source
##   <date>         <dbl>        <dbl> <chr>          <chr>        <chr>     <chr> 
## 1 2016-01-01        32            1 PM2.5          Durham Armo… 37-063-0… AQS   
## 2 2016-01-02        37            2 PM2.5          Durham Armo… 37-063-0… AQS   
## 3 2016-01-03        45            3 PM2.5          Durham Armo… 37-063-0… AQS   
## 4 2016-01-04        33            4 PM2.5          Durham Armo… 37-063-0… AQS   
## 5 2016-01-05        27            5 PM2.5          Durham Armo… 37-063-0… AQS   
## 6 2016-01-06        39            6 PM2.5          Durham Armo… 37-063-0… AQS   
## # … with 2,090 more rows, and 6 more variables: x20_year_high_2000_2019 <dbl>,
## #   x20_year_low_2000_2019 <dbl>, x5_year_average_2015_2019 <dbl>,
## #   date_of_20_year_high <chr>, date_of_20_year_low <chr>, good_aqi <dbl>
```
]

---

## Plot trend since 2016

.panelset.sideways[
.panel[.panel-name[Code]

```r
dch %>%
  ggplot(aes(x = date, y = cum_good_aqi, group = 1)) +
  geom_smooth(method = "lm", color = "pink") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
    ) +
  labs(
    x = NULL, y = "Number of days",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")
```

]

]
]

---

## Detrend

Step 1. Fit a simple linear regression

```r
m <- lm(cum_good_aqi ~ date, data = dch)

m
```

```
## 
## Call:
## lm(formula = cum_good_aqi ~ date, data = dch)
## 
## Coefficients:
## (Intercept)         date  
##  -1.306e+04    7.758e-01
```

---

## Detrend

Step 2. Augment the data with model results (using `broom::augment()`)

```r
dch_aug <- augment(m)

dch_aug
```

```
## # A tibble: 2,096 × 8
##   cum_good_aqi date       .fitted .resid    .hat .sigma .cooksd .std.resid
##          <dbl> <date>       <dbl>  <dbl>   <dbl>  <dbl>   <dbl>      <dbl>
## 1            1 2016-01-01   -27.6   28.6 0.00190   20.8 0.00180       1.37
## 2            2 2016-01-02   -26.8   28.8 0.00190   20.8 0.00183       1.39
## 3            3 2016-01-03   -26.0   29.0 0.00190   20.8 0.00185       1.40
## 4            4 2016-01-04   -25.3   29.3 0.00189   20.8 0.00188       1.41
## 5            5 2016-01-05   -24.5   29.5 0.00189   20.8 0.00190       1.42
## 6            6 2016-01-06   -23.7   29.7 0.00189   20.8 0.00193       1.43
## # … with 2,090 more rows
```

---

## Detrend

Step 3. Divide the observed value of `cum_good_aqi` by the respective value in the long-term trend (i.e., `.fitted`)

```r
dch_aug <- dch_aug %>%
  mutate(ratio = cum_good_aqi / .fitted, .after = .fitted)

dch_aug
```

```
## # A tibble: 2,096 × 9
##   cum_good_aqi date       .fitted   ratio .resid    .hat .sigma .cooksd
##          <dbl> <date>       <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
## 1            1 2016-01-01   -27.6 -0.0362   28.6 0.00190   20.8 0.00180
## 2            2 2016-01-02   -26.8 -0.0746   28.8 0.00190   20.8 0.00183
## 3            3 2016-01-03   -26.0 -0.115    29.0 0.00190   20.8 0.00185
## 4            4 2016-01-04   -25.3 -0.158    29.3 0.00189   20.8 0.00188
## 5            5 2016-01-05   -24.5 -0.204    29.5 0.00189   20.8 0.00190
## 6            6 2016-01-06   -23.7 -0.253    29.7 0.00189   20.8 0.00193
## # … with 2,090 more rows, and 1 more variable: .std.resid <dbl>
```

---

## Visualize detrended data

.panelset.sideways[
.panel[.panel-name[Code]

```r
dch_aug %>%
  ggplot(aes(x = date, y = ratio, group = 1)) +
  geom_hline(yintercept = 1, color = "gray") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
    ) +
  labs(
    x = NULL, y = "Number of days (detrended)",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")
```

]

]
]

---

---

---

## Read in multiple years of SF data

```r
sf_files <- fs::dir_ls(here::here("data/san-francisco"))
```
]

```r
sf <- read_csv(sf_files, na = c(".", ""))

sf <- sf %>%
  janitor::clean_names() %>%
  mutate(
    date = mdy(date),
    good_aqi = if_else(aqi_value <= 50, 1, 0)
  ) %>%
  filter(!is.na(aqi_value)) %>%
  arrange(date) %>%
  mutate(cum_good_aqi = cumsum(good_aqi), .after = aqi_value)

sf
```

```
## # A tibble: 2,105 × 13
##   date       aqi_value cum_good_aqi main_pollutant site_name    site_id   source
##   <date>         <dbl>        <dbl> <chr>          <chr>        <chr>     <chr> 
## 1 2016-01-01        32            1 PM2.5          Durham Armo… 37-063-0… AQS   
## 2 2016-01-02        37            2 PM2.5          Durham Armo… 37-063-0… AQS   
## 3 2016-01-03        45            3 PM2.5          Durham Armo… 37-063-0… AQS   
## 4 2016-01-04        33            4 PM2.5          Durham Armo… 37-063-0… AQS   
## 5 2016-01-05        27            5 PM2.5          Durham Armo… 37-063-0… AQS   
## 6 2016-01-06        39            6 PM2.5          Durham Armo… 37-063-0… AQS   
## # … with 2,099 more rows, and 6 more variables: x20_year_high_2000_2019 <dbl>,
## #   x20_year_low_2000_2019 <dbl>, x5_year_average_2015_2019 <dbl>,
## #   date_of_20_year_high <chr>, date_of_20_year_low <chr>, good_aqi <dbl>
```
]

---

## Plot trend since 2016

.panelset.sideways[
.panel[.panel-name[Code]

```r
sf %>%
  ggplot(aes(x = date, y = cum_good_aqi, group = 1)) +
  geom_smooth(method = "lm", color = "pink") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
    ) +
  labs(
    x = NULL, y = "Number of days",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "San Francisco-Oakland-Hayward, CA",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")
```

]

]
]

---

## Detrend

1. Fit a simple linear regression

```r
m_sf <- lm(cum_good_aqi ~ date, data = sf)
```

2. Augment the data with model results

```r
sf_aug <- augment(m_sf)
```

3. Divide the observed value of `cum_good_aqi` by the respective value in the long-term trend (i.e., `.fitted`)

```r
sf_aug <- sf_aug %>%
  mutate(ratio = cum_good_aqi / .fitted, .after = .fitted)
```

---

## Visualize detrended data

.panelset.sideways[
.panel[.panel-name[Code]

```r
sf_aug %>%
  ggplot(aes(x = date, y = ratio, group = 1)) +
  geom_hline(yintercept = 1, color = "gray") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
    ) +
  labs(
    x = NULL, y = "Number of days (detrended)",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "San Francisco-Oakland-Hayward, CA",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")
```

]

]
]

---

## Detrending

- In step 2 we fit a very simple model

- Depending on the complexity you're trying to capture you might choose to fit a much more complex model

- You can also decompose the trend into multiple trends, e.g. monthly, long-term, seasonal, etc.

<br>