
project-1-r_mageddon

STA/ISS 313 - Project 1

Investigating content and audience reactions to YouTube Super Bowl commercials

Introduction

The Super Bowl draws top television ratings in the US, and its commercials are widely watched by US audiences. Thanks to the “like”, “dislike”, and “comment” features of YouTube, we can examine the interaction between the audience and the TV ads posted on YouTube. This motivated our team to explore how the content of Super Bowl ads and audience preferences for them have changed over the years, as well as how their content and descriptions differ between election and non-election years.

Our YouTube dataset comes from FiveThirtyEight, a website that focuses on opinion poll analysis, politics, economics, and sports blogging. The dataset contains a list of ads with matching videos found on YouTube from the 10 brands that aired the most Super Bowl advertisements from 2000 to 2020, according to superbowl-ads.com. The 10 brands are Toyota, Bud Light, Hyundai, Coca-Cola, Kia, Budweiser, the NFL, Pepsi, Doritos, and E-Trade. We mainly focus on 7 defining characteristics of each Super Bowl ad: funny, danger, use_sex, show_product_quickly, celebrity, patriotic, and animals, which are represented as boolean variables. We also use metrics such as view_count, like_count, dislike_count, favorite_count, comment_count, description, and title of each ad to drive our analysis.
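The analysis below assumes the youtube data frame and several packages are already loaded. A minimal setup sketch (the tidytuesday file path is our assumption; adjust it if the repository layout differs):

```r
# Packages used throughout this analysis
library(tidyverse)   # dplyr, tidyr, ggplot2, readr
library(scales)      # label_percent()
library(tm)          # text cleaning for the word clouds
library(syuzhet)     # get_sentiment()
library(ggwordcloud) # geom_text_wordcloud()
library(patchwork)   # combining plots with /

# One way to load the data; the exact tidytuesday path is an assumption
youtube <- read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-02/youtube.csv"
)
glimpse(youtube)
```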

Question 1: What is the trend of the ads’ content and audience preferences over the years?

Introduction

Many things have changed in the first 20 years of the 21st century: life science breakthroughs, political conflicts, feminist movements, and more. Therefore, we want to see whether people’s reactions to Super Bowl commercials reflect the changes in their lifestyles and views over the years. The changes in the ads themselves may also show how companies react to audience preferences.

We want to explore the trends in how the ads change over the years in terms of content and audience preferences. To analyze the change in content, we use the logical variables funny, danger, use_sex, patriotic, show_product_quickly, celebrity, and animals. To explore audience preference and engagement, like_count and view_count are used to calculate the ratio of likes to total views.

Approach

For the first plot, we used a stacked area chart to show the percentage makeup of video attributes from 2000 to 2020. To compute the percentages, we first count the total number of TRUE values for each attribute in a year and then divide it by the total number of TRUE values across all attributes in that year. The percentages sum to one because we divide by the total number of TRUE values rather than the total number of rows. The percentages are then grouped by year and plotted as a stacked area chart. We preferred this visualization over a stacked bar chart since the stacked area chart makes it easier to identify changes, if any, in the trend of content proportions over the years.
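As a sanity check of this calculation, a toy illustration with hypothetical mini-data (three ads, two attributes, one year) shows the shares summing to one:

```r
library(dplyr)
library(tidyr)

# Hypothetical mini-data: 3 ads from one year, two boolean attributes
toy <- tibble(
  year   = c(2000, 2000, 2000),
  funny  = c(TRUE, TRUE, FALSE),
  danger = c(TRUE, FALSE, FALSE)
)

toy_shares <- toy %>%
  pivot_longer(c(funny, danger), names_to = "attribute", values_to = "contain") %>%
  filter(contain) %>%                # keep only TRUE values
  count(year, attribute) %>%         # TRUE count per attribute per year
  group_by(year) %>%
  mutate(share = n / sum(n)) %>%     # funny: 2/3, danger: 1/3
  ungroup()
toy_shares
```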

For the second plot, we used line graphs faceted by attribute, a variable created from all the logical variables in the data set. The faceted line graph is well suited to showing time-dependent changes for each attribute. We would like to see whether containing a certain attribute affects audience preference for an ad in each year. In each year, all ads are divided into 2 groups: those that have a certain attribute and those that do not. In each group, the sum of the ratios of “likes” over “views” is divided by the number of ads in that group to obtain the average ratio of “likes” over “views”. Therefore, the 2 lines in each sub-plot show how the trend of audience preference differs over the years depending on whether an ad contains the attribute.
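The per-group averaging can be sketched with hypothetical mini-data:

```r
library(dplyr)

# Hypothetical mini-data: 3 ads, one attribute, like and view counts
toy_ads <- tibble(
  funny      = c(TRUE, TRUE, FALSE),
  like_count = c(10, 30, 5),
  view_count = c(1000, 1000, 500)
)

toy_pref <- toy_ads %>%
  mutate(like_ratio = like_count / view_count) %>%
  group_by(funny) %>%
  summarise(mean_like = mean(like_ratio), .groups = "drop")
toy_pref
# TRUE group: (0.01 + 0.03) / 2 = 0.02; FALSE group: 0.01
```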

Analysis

Plot 1:

# First, group attributes by year and count total number of TRUE values 
## by pivoting longer.
all_year <- youtube %>%
  drop_na(funny, show_product_quickly, patriotic, celebrity, danger, animals, use_sex) %>%
  pivot_longer(cols = c(funny, show_product_quickly, patriotic, 
                        celebrity, danger, animals, use_sex), 
               names_to = "attribute",
               values_to = "contain") %>%
  filter(contain == TRUE) %>%
  select(year, attribute, contain) %>%
  group_by(year, attribute) %>%
  summarise(attribute_occurence = n(), .groups = "drop")

# Change back the format to do rowwise percentage calculations.
all_year_summary <- all_year %>%
  pivot_wider(names_from = attribute, values_from = attribute_occurence) %>%
  replace(is.na(.), 0) %>%
  rowwise() %>%
  mutate(
    total = sum(c_across(animals:patriotic)),
    across(!c(year, total), ~ .x / total)
  ) %>%
  rename(
    "Animals" = "animals",
    "Danger" = "danger",
    "Funny" = "funny", 
    "Show product quickly" = "show_product_quickly",
    "Use sexuality" = "use_sex",
    "Celebrity" = "celebrity",
    "Patriotic" = "patriotic"
  )
# Reformat to fit the plotting step
all_year_plot <- all_year_summary %>%
  select(!total) %>%
  pivot_longer(
    cols = !year, 
    names_to = "Attributes",
    values_to = "Percentage")

# Plot Stacked Area chart
ggplot(all_year_plot,
       aes(x = year, y = Percentage, fill = Attributes)) +
  geom_area(alpha = 0.6 ,
            size = .5,
            colour = "white") +
  scale_y_continuous(
    labels = label_percent(
      accuracy = NULL,
      scale = 100,
      prefix = "",
      suffix = "%",
      big.mark = " ",
      decimal.mark = ".",
      trim = TRUE
    )
  ) +
  labs(
    x = "Year",
    y = "Percentage",
    fill = "Attributes",
    title = "Percentage comparison among commercial attributes in\nSuperbowl commercials over years (2000~2020)",
    subtitle = "By attributes"
  ) +
  scale_fill_brewer(palette = "Dark2") 

Plot 2:

# Create a comparison data frame for a given attribute.
# Function was written with the help of TA
create_compare <- function(varname, full_data) {
  full_data <- full_data %>%
    # drop N/A value for key variables
    drop_na({{ varname }}, like_count, view_count, dislike_count, comment_count) %>%
    # create new variables for preference, dislike, engagement, and attributes
    mutate(
      like = like_count / view_count,
      dislike = dislike_count / view_count,
      engage = comment_count / view_count,
      attribute = as_label(enquo(varname))
    ) %>%
    select({{ varname }}, like, dislike, engage, attribute, year) %>%
    rename(val = as_label(enquo(varname)))
  return(full_data)
}
# create a new data frame that contains key variables
all_compare <- rbind(
  create_compare(funny, youtube),
  create_compare(danger, youtube),
  create_compare(show_product_quickly, youtube),
  create_compare(patriotic, youtube),
  create_compare(celebrity, youtube),
  create_compare(animals, youtube),
  create_compare(use_sex, youtube)
) %>%
  # change variable names to be more readable
  mutate(
    attribute = recode(
      attribute,
      'animals' = 'Animals',
      'celebrity' = 'Celebrity',
      'danger' = 'Danger',
      'funny' = 'Funny',
      'patriotic' = 'Patriotic',
      'show_product_quickly' = 'Show product quickly',
      'use_sex' = 'Use sexuality'
    )
  ) %>%
  group_by(year, attribute, val) %>%
  summarise(
    mean_like = mean(like),
    mean_engage = mean(engage),
    .groups = "drop"
  )
# plotting like vs. year
ggplot(all_compare, aes(x = as.numeric(year), y = mean_like)) +
  geom_line(aes(color = val)) +
  labs(
    title = "Audience preferences for Youtube Superbowl Commercials\nEach year from 2000 to 2020",
    subtitle = "By video attributes and year",
    x = "Year of Superbowl",
    y = "Average ratio of likes over views"
  ) +
  facet_wrap(~ attribute, nrow = 3) +
  scale_y_continuous(
    labels = label_percent(
      accuracy = NULL,
      scale = 100,
      prefix = "",
      suffix = "%",
      big.mark = " ",
      decimal.mark = ".",
      trim = TRUE
    )
  ) +
  scale_x_continuous(breaks = c(2000, 2004, 2008, 2012, 2016, 2020)) +
  theme_minimal() +
  scale_color_manual(
    "Does the ad contain\nthis attribute?",
    values = c("#808080", "#FF0000"),
    labels = c("No", "Yes")
  ) +
  theme(
    axis.title.x = element_text(hjust = 0.5),
    axis.title.y = element_text(margin = margin(r = 20),
                                hjust = 0.5),
    panel.spacing = unit(1.5, "lines"),
    legend.position = c(0.9, 0.1),
    plot.title = element_text(size = 12),
    panel.grid.minor = element_blank()
  ) 

Discussion

Looking at the stacked area chart, we get an overview of the video attribute makeup. In general, funny and show_product_quickly take up the highest proportions among the seven video attributes over the years, while patriotic takes up the least. We cannot generalize a clear pattern from the stacked area chart alone since the width of each color band is hard to measure and compare. However, the chart does tell us which attributes are most prominent in Super Bowl commercials: showing products quickly and being funny. This makes sense since Super Bowl commercials are usually expected to be humorous, and showing the product quickly increases the company’s chances of maximizing consumer exposure to the product.

Over the years, the trend of audience preference toward Super Bowl ads depends on the attributes the ads contain. For example, the audience’s love for patriotic ads peaked in 2010, the year of the first midterm election after Obama’s victory, which might explain the increase. Furthermore, in specific years, the audience had strong feelings toward ads containing certain attributes. For instance, in 2002, ads that were not funny received a higher percentage of likes than those that were. It is possible that the terrorist attacks of 2001 changed the national sentiment, so the audience considered funny videos less appropriate during that time.

Combining the line graphs with the stacked area plot, we discover interesting matches and mismatches between attribute proportions and audience reactions. For example, a match appears for the celebrity attribute: celebrity-featuring ads have made up a growing share since 2015, which coincides with the increasing audience preference seen in the interaction data of plot two. However, the two do not always match. When the proportion of funny videos grew from 2000 to 2003, audiences tended to prefer videos that did not contain humor. Similarly, danger has narrowed its band in the stacked area chart over the last 5 years, while the audience tends to hit like more often for ads containing danger. There can be many underlying reasons, but our guess is a mismatch in expectations between the audience and ad companies, in which relevant regulations prevent companies from catering to their audience with dangerous content.

Question 2: How do election year ads differ from non-election year ads in terms of content and description?

Introduction

Our second question aims to explore the difference, if any, between ads aired during election years and those aired during non-election years. In particular, we want to analyze whether there is any noticeable difference in the description and content of election and non-election year ads. Answering this question involves looking at the title variable to analyze how these ads were described, and then analyzing the boolean content variables use_sex, patriotic, funny, celebrity, danger, and animals to see if the ads had any noticeable differences in their content. Finally, the year variable is needed to distinguish between election and non-election years.

We are interested in exploring this question because election years mark a significant cultural moment in the US. Therefore, we want to see whether this focus on politics translates into any noticeable effect on Super Bowl ads.

Approach

To analyze the description aspect of our question, we decided to use a word cloud visualization since we felt it was the most informative way to visualize the major descriptors used for the ads. The alternative way of analyzing title that we considered was a bar chart of the top 10-20 words. However, we opted for the word cloud since it provides more information about the relative occurrences of all the words being used (by size). Creating the word cloud involved cleaning the text (e.g., lowercasing all words, removing punctuation, removing numbers), which was done with the tm library. We also performed sentiment analysis on the words using the get_sentiment() function and colored the word cloud based on the words’ sentiment scores. We used the syuzhet library for get_sentiment() since it assigns each word a sentiment score and has the widest range of words that can be assigned a score, i.e., the largest word dictionary.
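To illustrate the scoring step, here is a small sketch (assuming the syuzhet package is installed; the sample words are ours, and the scores come from syuzhet's built-in dictionary):

```r
library(syuzhet)

# Each word gets a signed score from syuzhet's dictionary:
# positive words score above 0, negative words below 0,
# and neutral or out-of-dictionary words score 0
sample_words <- c("great", "crash", "truck")
scores <- get_sentiment(sample_words, method = "syuzhet")
scores
```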

To analyze the content aspect, we opted for a column graph with the percentage value of each content category as a label on the visualization. A possible alternative we considered was a pie chart, but since our column graph shows percentage values as well as count values, we went with the visualization that maximized information. For this plot, we decided not to use the show_product_quickly attribute since it does not provide insight into whether the content was different.

Analysis

# Election years in the dataset
election_year <- c(2000, 2004, 2008, 2012, 2016, 2020)
# Adding an election year variable
youtube <- youtube %>%
  mutate(election_years = ifelse(year %in% election_year, 1, 0))
# Data Wrangling steps
test <- youtube %>%
  filter(election_years == 1) %>%
  select(title)

word_cloud_viz <- function(testvector, freqCheck) {
  docs <- VCorpus(VectorSource(testvector))
  docs <- docs %>%
    # Removing numbers
    tm_map(removeNumbers) %>%
    # Removing punctuation
    tm_map(removePunctuation) %>%
    # Removing whitespace
    tm_map(stripWhitespace)
  docs <- tm_map(docs, content_transformer(tolower))
  docs <- tm_map(docs, removeWords, stopwords("english"))
  # Removing words that contain brand names
  docs <- tm_map(
    docs,
    removeWords,
    c(
      "super",
      "bowl",
      "commercial",
      "superbowl",
      "bud ",
      "light",
      "budweiser",
      "pepsi",
      "hyundai",
      "doritos",
      "coke",
      "cocacola",
      "cola",
      "coca",
      "kia",
      "toyota",
      "etrade",
      "nfl"
    )
  )
  dtm <- TermDocumentMatrix(docs)
  matrix <- as.matrix(dtm)
  words <- sort(rowSums(matrix), decreasing = TRUE)
  election_df <- data.frame(word = names(words), freq = words)
  election_df <- election_df %>%
    mutate(angle = sample(-45:45, nrow(election_df), replace = TRUE)) %>%
    filter(freq >= freqCheck) %>%
    # Package to get sentiment scores for each word in the dataset
    mutate(sentiment = get_sentiment(word, "syuzhet"))
  retPlot <- election_df %>%
    ggplot(aes(
      label = word,
      color = sentiment,
      size = freq,
      angle = angle
    )) +
    geom_text_wordcloud() +
    theme_minimal() +
    scale_color_gradient(low = "#FFD662FF", high = "#00539CFF")
  return(retPlot)
  
}
election_1 <- youtube %>%
  filter(election_years == 1) %>%
  select(title)

election_0 <- youtube %>%
  filter(election_years == 0) %>%
  select(title)

election_wc <- word_cloud_viz(election_1, 1)

nelection_wc <- word_cloud_viz(election_0, 2)

efinal <- election_wc +
  scale_radius(range = c(2, 15)) +
  labs(
    title = "Election Year Ad Title Word Cloud",
    subtitle = "Colored by Sentiment
      Bluer for positive sentiment
      Brown for neutral sentiment
      Yellower for negative sentiment"
  )

nefinal <- nelection_wc +
  scale_radius(range = c(3.5, 15)) +
  labs(
    title = "Non-election Year Ad Title Word Cloud",
    subtitle = "Colored by Sentiment
      Bluer for positive sentiment
      Brown for neutral sentiment
      Yellower for negative sentiment"
  )

efinal / nefinal

# Creating variable with election, using pivot_longer to get attributes in the
# same column, and creating a percentage variable
election_yr <- youtube %>%
  pivot_longer(cols = c(use_sex, funny, celebrity, patriotic, danger, animals)) %>%
  filter(election_years == 1) %>%
  filter(value == TRUE) %>%
  group_by(name) %>%
  summarise(n = n()) %>%
  mutate(perc = paste(round((n / sum(
    n
  )) * 100), "%", sep = ""))
# Creating the plot
elecyr <- election_yr %>%
  ggplot(aes(y = name, x = n, fill = name)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = perc, color = name),
            nudge_x = 2,
            show.legend = FALSE) +
  labs(
    x = "Count",
    y = "Ad Attribute",
    title = "Count and Proportion of Ad by Attribute",
    subtitle = "In Election Years"
  ) +
  scale_fill_brewer(palette = "Dark2") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() +
  scale_y_discrete(labels = rev(
    c(
      "Use Sex",
      "Patriotic",
      "Funny",
      "Danger",
      "Celebrity",
      "Animals"
    )
  ))
# Creating variable with no election, using pivot_longer to get attributes in the
# same column, and creating a percentage variable
no_election_yr <- youtube %>%
  pivot_longer(cols = c(use_sex, funny, celebrity, patriotic, danger, animals)) %>%
  filter(election_years == 0) %>%
  filter(value == TRUE) %>%
  group_by(name) %>%
  summarise(n = n()) %>%
  mutate(perc = paste(round((n / sum(
    n
  )) * 100), "%", sep = ""))
# No Election year col vizualization
nelecyr <- no_election_yr %>%
  ggplot(aes(y = name, x = n, fill = name)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = perc, color = name),
            nudge_x = 4,
            show.legend = FALSE) +
  labs(x = "Count",
       y = "Ad Attribute",
       subtitle = "In Non-Election Years") +
  scale_fill_brewer(palette = "Dark2") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() +
  scale_y_discrete(labels = rev(
    c(
      "Use Sex",
      "Patriotic",
      "Funny",
      "Danger",
      "Celebrity",
      "Animals"
    )
  ))

elecyr / nelecyr

Discussion

We can divide the discussion of our visualizations into two parts: comparison by description and comparison by content.

When comparing our two word clouds, it immediately becomes clear that there is some overlap in the words used in Super Bowl ad titles. These include winner, new, and nfl. Similarly, there does not seem to be any major difference in the overall sentiment makeup of these words. We can gauge this by observing that both word clouds have a few blue words (indicating positive sentiment), very few yellow words (indicating negative sentiment), and a majority of brown words (indicating neutral sentiment). This indicates that there is very little difference between the titles of Super Bowl ads in election years and those in non-election years. This result was surprising since our team expected election-year titles to contain more patriotic words such as “America”, “Freedom”, and “Liberty”, but it appears that ad titles are not significantly affected by whether it is an election year.

We observed a similar relation when analyzing the content of election and non-election year ads. For both categories, there was only a marginal difference in the attribute makeup, indicating that the content of ads was also very similar in election and non-election years. However, it was interesting to note that the use_sex attribute dropped from 14% in non-election years to 9% in election years. While this could indicate a slight variation in content, it is more likely that this change was a coincidence.

Presentation

Our presentation can be found here.

Data

Mock, T 2021, Superbowl commercials, electronic dataset, tidytuesday, viewed 20 September 2021, https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-03-02/readme.md.

References

https://www.r-graph-gallery.com/136-stacked-area-chart
https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html
https://cran.r-project.org/web/packages/ggwordcloud/vignettes/ggwordcloud.html