Twitter Trends from the #DuBoisChallenge

by Ctrl+Alt+Elite

# Package messages suppressed
library(tidyverse)
library(maps)
library(sf)
library(rnaturalearth)
library(rvest)
library(ggthemes)
library(lubridate)
library(stringr)
library(colorspace)
library(scales)
library(ggrepel)
library(RColorBrewer)
library(patchwork)

knitr::opts_chunk$set(
  fig.width = 8,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "90%"
)

tweets <- read.csv("data/tweets.csv")

Introduction

W.E.B. Du Bois was one of the most multifaceted and persuasive civil rights leaders in history; one of the channels he used to advance equality was his profound work in data visualization. To celebrate his legacy of using visualizations to do good in the world, the TidyTuesday #DuBoisChallenge calls upon users to recreate a selection of Du Bois’ visualizations from the 1900 Paris Exposition with the aid of modern tools.

For our project, we’re choosing to focus on the 2021 W.E.B. Du Bois & Juneteenth Twitter challenge data. Specifically, we’ll be looking at the content and metadata of tweets that used the #DuBoisChallenge hashtag to engage with the challenge via a plot submission or with commentary. The data contain 445 tweets and 13 variables describing the tweets and their authors, such as like_count, lat, long, and content. From this data, we hope to learn how people engage with culturally significant data visualizations in a social media setting.

Location Analysis of Tweets

How do user participation and inter-user interaction in the #DuBoisChallenge vary with location?

Introduction

W.E.B. Du Bois is a national hero in American culture - one would expect this hero status to result in overwhelming US participation, but the TidyTuesday #DuBoisChallenge attracted interest from all over the world. The goal of these sorts of competitions is twofold - to see what sorts of innovative choices participants make as well as to form ties and relationships between participants. By displaying where tweets come from and which geographical regions tend to have inter-user relationships (tags), we can advise future competitions on which target audiences are most receptive - the end goal of such advice is to allow larger TidyTuesday-like competitions that attract a larger audience and form more relationships among the participants.

Interestingly, Twitter metadata contains two different representations of a user’s location. There is a location variable constructed either from the user’s bio or their use of a geo-tag, and there are lat and long: the coordinates of the device at the time of tweet publication. Per our question, we’d like to look at how the tweets are spread around the world and how users’ interactions with one another differ by place. Specifically, we’ll look at how users “tag” other users in their challenge posts by counting the number of @user occurrences in each tweet’s content to create tag_count.
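As a small illustration of how a tag count can be built, the sketch below counts mentions in a few made-up tweets (the example strings are hypothetical and not taken from the challenge data). It compares counting every "@" character with a stricter pattern that only counts an "@" starting a handle; the stricter pattern is our own illustrative assumption, not part of the analysis pipeline.

```r
library(stringr)

# Hypothetical example tweets, not taken from the challenge data
example_content <- c(
  "Thanks @alice and @bob for the feedback!",
  "My #DuBoisChallenge plot, no tags here",
  "Email me at name@example.com or ping @carol"
)

# Counting every "@" character also counts the email address
at_count <- str_count(example_content, "@")

# Only count an "@" that is not preceded by a word character or another "@"
# and that is followed by at least one handle character
mention_count <- str_count(example_content, "(?<![\\w@])@\\w+")

data.frame(at_count, mention_count)
```

The two counts differ only for the third string, where the email address inflates the simple count. On data where tweets rarely contain email addresses, the two approaches should largely agree.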

Approach

In our first visualization, we will investigate the location character variable. Due to the self-reported nature of location, many of the entries are jokes, ambiguous, or in inconsistent formats. To address the inconsistent formats, we had to perform manual cleaning - US tweets tend to record the state, sometimes with a city name, while international tweets use a mixture of country and city labels. In order to standardize, we converted US cities into their respective states and converted major foreign cities into their countries; joke and ambiguous entries were set to NA and excluded. While comparing states and countries is atypical, the majority of the observed tweets are in the US, and splitting the US into states allows tweets from other countries to be studied alongside it. We then count our cleaned location variable and narrow the list to the top 10 entries. In order to display counts for discrete location names, we chose a bar plot grouped and colored by region (US Northeast, US Midwest, US South, US West, and Europe) and sorted in descending order within regions. Further, we’ve included a text layer with each bar’s percentage of the total tweets from the top 10 locations.
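The cleaning and counting steps described above can be sketched on a toy example. The locations, the lookup vector city_to_state, and the column names place and share below are hypothetical and chosen only for illustration; the real recoding uses a much longer list of cases.

```r
library(dplyr)

# Hypothetical self-reported locations
toy <- tibble(location = c(
  "Boston, MA", "New York, NY", "Buffalo, NY",
  "London, England", "World", "New York"
))

# Hypothetical lookup from city names to the state or country we plot
city_to_state <- c(
  Boston = "Massachusetts",
  Buffalo = "New York",
  London = "United Kingdom"
)

toy_counts <- toy %>%
  mutate(
    # Keep only the text before the first comma
    place = sub(",.*", "", location),
    # Treat joke or ambiguous entries as missing
    place = if_else(place %in% c("World"), NA_character_, place),
    # Recode known cities; leave everything else unchanged
    place = coalesce(unname(city_to_state[place]), place)
  ) %>%
  filter(!is.na(place)) %>%
  count(place, sort = TRUE) %>%
  # Share of the retained tweets, analogous to the bar labels
  mutate(share = n / sum(n))

toy_counts
```

Here the three New York entries collapse into one row with a share of 0.6, while "World" is dropped before the shares are computed, so the shares sum to one over the retained locations only.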

Our second visualization aims to display which areas of the world published the most tweets during the challenge and how the number of tags in a tweet might vary across the globe: the map format arose as a natural candidate due to the need to plot coordinates and our goal of comparing geographical regions. Rather than create an overcrowded world map, we instead chose to use the long and lat location variables to form two maps - one of the Northeastern US and one of Europe - which each display a layer of country border data, a layer of points, and a layer of annotations. Each tweet point is colored, sized, and given transparency values via mapping to tag_count, a variable formed by counting the occurrence of @user strings within tweet content. Further, each point is given random noise by geom_jitter() in order to alleviate some of the extreme clustering of points around NYC and the southern UK.
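To give a sense of how far jittering can move a point, the sketch below mimics a jitter of 0.5 degrees in each direction using base R. It assumes, as we understand the behavior of geom_jitter(), that the noise is uniform between minus and plus the stated width and height; the figure of roughly 111 km per degree of latitude is an approximation, and the five identical points are hypothetical.

```r
set.seed(1234)

# Five hypothetical tweets published from the same NYC coordinates
pts <- data.frame(long = rep(-74.0060, 5), lat = rep(40.7128, 5))

# Uniform noise of at most 0.5 degrees in each direction (assumed behavior)
jittered <- data.frame(
  long = pts$long + runif(nrow(pts), -0.5, 0.5),
  lat  = pts$lat + runif(nrow(pts), -0.5, 0.5)
)

# Largest north-south displacement, converted to kilometres
# (about 111 km per degree of latitude, an approximation)
max_shift_km <- max(abs(jittered$lat - pts$lat)) * 111
max_shift_km
```

Under these assumptions a point can move up to about 55 km north or south, so the jittered maps should be read as showing regional clusters rather than exact publication sites.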

Analysis

tweets <- tweets %>%
  mutate(tag_count = str_count(content, "@"))

tweets_locations <- tweets %>%
  filter(!str_detect(location, "@"), !str_detect(location, ":")) %>%
  mutate(
    location_pre_comma = gsub(",.*", "", location),
    location_pre_comma = case_when(
      location_pre_comma == "北京" ~ "Beijing",
      location_pre_comma == "God's earth" ~ "NA",
      location_pre_comma == "The City College of New York" ~ "New York",
      location_pre_comma == "World" ~ "NA",
      location_pre_comma == "he/they" ~ "NA",
      location_pre_comma == "At the home office" ~ "NA",
      location_pre_comma == "Deutschland" ~ "Germany",
      location_pre_comma == "Distrito Federal" ~ "Mexico City",
      location_pre_comma == "down in dey wid em" ~ "NA",
      location_pre_comma == "Forde-Obama Hall" ~ "NA",
      location_pre_comma == "France & UK" ~ "NA",
      location_pre_comma == "Lil’ Rudyshire" ~ "NA",
      location_pre_comma == "MIT" ~ "Boston",
      location_pre_comma == "New Yorker" ~ "New York",
      location_pre_comma == "OAK / NYC / ATL / The World" ~ "NA",
      location_pre_comma == "SP" ~ "NA",
      location_pre_comma == "Toronto || Ottawa" ~ "NA",
      location_pre_comma == "Tx" ~ "Texas",
      location_pre_comma == "UK" ~ "United Kingdom",
      location_pre_comma == "USA" ~ "United States",
      location_pre_comma == "Worldwide" ~ "NA",
      TRUE ~ location_pre_comma
    ),
    plot_state = case_when(
      location_pre_comma == "Nashville" ~ "Tennessee",
      location_pre_comma == "Merced" ~ "California",
      location_pre_comma == "Vienna" ~ "Austria",
      location_pre_comma == "Madison" ~ "Wisconsin",
      location_pre_comma == "Minneapolis" ~ "Minnesota",
      location_pre_comma == "Philadelphia" ~ "Pennsylvania",
      location_pre_comma == "Boston" ~ "Massachusetts",
      location_pre_comma == "London" ~ "United Kingdom",
      location_pre_comma == "Edinburgh" ~ "United Kingdom",
      location_pre_comma == "Amherst" ~ "Massachusetts",
      location_pre_comma == "Buffalo" ~ "New York",
      location_pre_comma == "Cambridge" ~ "Massachusetts",
      location_pre_comma == "San Diego" ~ "California",
      TRUE ~ as.character(location_pre_comma)
    )
  )

top_10_locations <- tweets_locations %>%
  count(plot_state) %>%
  arrange(desc(n)) %>%
  filter(plot_state != "NA") %>%
  head(10) %>%
  mutate(
    region = case_when(
      plot_state == "New York" ~ "US: Northeast",
      plot_state == "New Jersey" ~ "US: Northeast",
      plot_state == "Pennsylvania" ~ "US: Northeast",
      plot_state == "Massachusetts" ~ "US: Northeast",
      plot_state == "Tennessee" ~ "US: South",
      plot_state == "California" ~ "US: West",
      plot_state == "Austria" ~ "Europe",
      plot_state == "United Kingdom" ~ "Europe",
      plot_state == "Wisconsin" ~ "US: Midwest",
      plot_state == "Minnesota" ~ "US: Midwest"
    ),
    region = fct_relevel(region, c(
      "US: Northeast", "Europe", "US: Midwest", "US: South",
      "US: West"
    )),
    plot_state = fct_relevel(plot_state, c(
      "New York", "New Jersey", "Massachusetts",
      "Pennsylvania", "Austria", "United Kingdom",
      "Minnesota",
      "Wisconsin", "Tennessee", "California"
    )),
    plot_state = fct_rev(plot_state),
    percent_tweets = percent(n / sum(n), accuracy = 0.01)
  )

northeast_tweets <- tweets %>%
  filter(
    long <= -70,
    long >= -90,
    lat <= 46,
    lat >= 20
  )
europe_tweets <- tweets %>%
  filter(
    long >= -20 & long <= 45,
    lat >= 30 & lat <= 73
  )

world_map <- ne_countries(
  scale = "medium", type = "map_units",
  returnclass = "sf"
)
us_map <- map_data("state")
canada_map <- map_data("world", "canada")

US_locations <- tribble(
  ~city, ~lat, ~long,
  "NYC", 40.7128, -74.0060,
  "Baltimore", 39.2904, -76.6122
)

EU_locations <- tribble(
  ~city, ~lat, ~long,
  "London", 51.5074, -0.1278,
  "Vienna", 48.2082, 16.3738,
  "Rome", 41.9028, 12.4964
)

ggplot(data = top_10_locations, aes(y = plot_state, x = n, fill = region)) +
  geom_col() +
  geom_text(aes(label = percent_tweets, color = region),
    size = 2.5,
    nudge_x = 8,
    show.legend = FALSE
  ) +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  scale_x_continuous(
    breaks = c(0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200)
  ) +
  labs(
    title = "Top 10 #DuBoisChallenge Tweet Locations",
    caption = "% calculated out of top 10",
    x = "Number of Tweets",
    y = "User Location in Twitter Bio",
    fill = "Location"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.caption = element_text(face = "italic"),
    legend.title = element_blank(),
    legend.position = c(.837, .25),
    text = element_text(family = "Times New Roman")
  )

set.seed(1234)

NEmap <- ggplot() +
  geom_polygon(
    data = canada_map,
    aes(x = long, y = lat, group = group),
    fill = "#F0F0F0",
    color = "black"
  ) +
  geom_polygon(
    data = us_map,
    aes(x = long, y = lat, group = group),
    fill = "#F0F0F0", color = "black"
  ) +
  geom_jitter(
    data = northeast_tweets,
    width = 0.5, height = .5,
    aes(
      x = long, y = lat, size = tag_count, color = tag_count,
      alpha = tag_count
    )
  ) +
  labs(
    title = "#DuBoisChallenge Tweets",
    subtitle = "in Northeastern U.S. & Canada\n",
    caption = "Based on Location Where Tweet Was Published",
    size = "# of Users Tagged",
    color = "# of Users Tagged",
    alpha = "# of Users Tagged"
  ) +
  theme_void() +
  scale_color_continuous_sequential(
    palette = "OrYel", rev = FALSE,
    breaks = c(0, 2, 4, 6, 8, 10)
  ) +
  scale_size_continuous(range = c(2, 5), breaks = c(0, 2, 4, 6, 8, 10)) +
  scale_alpha_continuous(range = c(.27, 1), breaks = c(0, 2, 4, 6, 8, 10)) +
  guides(
    color = guide_legend(),
    size = guide_legend(),
    alpha = guide_legend()
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, face = "italic"),
    plot.caption = element_text(hjust = 0.5),
    legend.position = "bottom",
    panel.border = element_rect(color = "black", fill = NA, size = .75),
    panel.background = element_rect(color = "black", fill = "lightblue"),
    text = element_text(family = "Times New Roman")
  ) +
  geom_text_repel(
    data = US_locations,
    aes(x = long, y = lat, label = city),
    size = 3, nudge_x = -0.15,
    nudge_y = 0.95,
    segment.linetype = "dotted"
  ) +
  coord_map(
    xlim = c(-80, -65),
    ylim = c(36, 46)
  )

Euromap <- ggplot() +
  geom_sf(data = world_map, fill = "#F0F0F0", color = "black") +
  geom_jitter(
    data = europe_tweets,
    aes(
      x = long, y = lat, size = tag_count, color = tag_count,
      alpha = tag_count
    )
  ) +
  coord_sf(xlim = c(-20, 45), ylim = c(30, 73), expand = FALSE) +
  labs(
    title = "#DuBoisChallenge Tweets",
    subtitle = "in Europe\n",
    caption = "Based on Location Where Tweet Was Published",
    size = "# of Users Tagged",
    alpha = "# of Users Tagged",
    color = "# of Users Tagged"
  ) +
  theme_void() +
  scale_color_continuous_sequential(
    palette = "OrYel", rev = FALSE,
    breaks = c(0, 2, 4, 6, 8, 10)
  ) +
  scale_size_continuous(range = c(2, 5), breaks = c(0, 2, 4, 6, 8, 10)) +
  scale_alpha_continuous(range = c(.27, 1), breaks = c(0, 2, 4, 6, 8, 10)) +
  guides(
    color = guide_legend(),
    size = guide_legend(),
    alpha = guide_legend()
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, face = "italic"),
    plot.caption = element_text(hjust = 0.5),
    legend.position = "bottom",
    panel.border = element_rect(color = "black", fill = NA, size = .75),
    panel.background = element_rect(color = "black", fill = "lightblue"),
    text = element_text(family = "Times New Roman")
  ) +
  geom_text_repel(
    data = EU_locations,
    aes(x = long, y = lat, label = city),
    size = 3,
    nudge_x = -14, nudge_y = -0.3,
    segment.linetype = "dotted"
  )

(NEmap + theme(plot.margin = unit(c(0,25,0,0), "pt"))) +
(Euromap + theme(plot.margin = unit(c(0,0,0,25), "pt")))

Discussion

Our first plot clearly shows that most of the Twitter challenge activity is occurring in the US; eight of the ten locations are US states even though these states are being directly compared to entire foreign countries. The only international locations in the plot - the UK and Austria - account for merely 7.88% of the tweets from the top 10 locations. This confirms our initial assumption that most people interested in and interacting with this challenge are in the US. Domestically, four of the top ten locations are northeastern states: an unsurprising finding when considering population figures and Du Bois’ legacy in NYC. Moving forward in our analysis, we chose to focus on the Northeast and Europe to study how users in the two regions may engage with each other differently.

Our second visualization is a set of two maps of the Northeastern United States and Europe with point coloring/sizing by the number of @tags. One advantage of the long and lat variables over the previous plot’s location is that we can more precisely display areas where tweet clustering occurs rather than user home locations. For example, the Northeast map shows the dominant NYC cluster extending outwards towards Newark, NJ while another small cluster exists around Baltimore. In Europe, there is a low-density cluster centered in London and small clusters in major cultural cities such as Rome and Vienna, but the overall number of tweets is far lower than in the US. A glance at the tag_count mappings proves more fruitful - in both maps, low tag counts are the most frequent everywhere, but high tag counts appear exclusively in cluster centers. Since these clusters are all cosmopolitan cities, it’s reasonable to speculate that their users would have the most ties to other participants and thus the ability to tag more people per tweet. However, due to the low number of observations and the correlational nature of the data, we lack the evidence to make a causal claim that cosmopolitan cities generate more tags (and thus more user interactions). Further, our metric for inter-user interaction, tag_count, is an imperfect proxy for how users interact. Not every tag is a meaningful connection between users, and the relationship between tagging and user interaction is confounded by this ambiguity.

Twitter Posting Mediums and Audience Engagement

What is the relationship between the device that is used to publish a tweet and how the audience engages with the author?

Introduction

We’re interested in how the different “mediums” of publishing to Twitter are related to the success of the tweet. Based on our own anecdotal experiences, we believe that tweets from traditional computers may be constructed less impulsively than mobile tweets. Further, there are socioeconomic factors tied to which device one picks as well as content considerations; one can easily publish #DuBoisChallenge commentary from an iPhone, but a full plot submission would be exceedingly difficult. By mapping out how such limitations influence audience response, we hope to discover which devices are best-suited for both low-visibility commentary and high-visibility challenge submissions.

For this question, we intend to explore how various audience approval metrics vary with the type of device used by the tweet author. To do this, we’ll begin by manipulating the text variable - a character variable recording the platform/access point used to publish the tweet - by passing it through case_when with specific string detection to create devicetype. Then, we will examine followers and like_count and their relationship with devicetype. Unlike like_count, the number of followers is not determined by a single tweet, but by the user’s long-term success. In recognition of this disparity, we intend to display the relationship between followers and devicetype at the user level, grouping by username and computing avg_followers across all of a user’s tweets in the data.

Approach

Our first plot dives into the relationship between like_count and devicetype by showing boxplots faceted by devicetype; our goal is to directly compare how different platforms may generate likes, and boxplots are especially well-suited to comparing similar distributions. Further, the key summary statistics - medians and quartiles - are easier to view in boxplot form and better communicate the audience’s reaction to a tweet. Boxplots cannot accommodate large numbers of groups easily, however, so we chose to narrow our analysis to the three device types that form the overwhelming majority of the data (iPhone, Android, and Web App). There is an incredibly strong right skew in the like counts, but applying a log transform to the x-axis keeps the bulk of the data visible; tweets with zero likes are dropped, since log(0) is undefined.
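The effect of the log scale on tweets with zero likes can be checked directly. The like counts below are hypothetical, and the shifted transform log10(x + 1) is shown only as one common alternative that keeps zeros; it is not what our plot uses.

```r
# Hypothetical like counts, including two tweets with zero likes
likes <- c(0, 0, 1, 3, 10, 250)

# log10(0) is -Inf, so these tweets cannot be placed on a log axis
log_likes <- log10(likes)

# Number of tweets that a log-scaled plot would silently drop
n_dropped <- sum(likes == 0)

# An alternative that keeps zeros: shift by one before taking logs
shifted_log_likes <- log10(likes + 1)

log_likes
n_dropped
shifted_log_likes
```

Reporting the number of dropped tweets alongside the plot makes it clear how much of the data the log axis hides.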

Our second plot shows the distribution of follower count per user stratified across the three most popular device types. In order to assign each user a single follower count and device type, we use the average number of followers across all their tweets (avg_followers) and assign a device type based on the user’s most common way to post on Twitter. Thus, the plot examines the relationship between the primary device that a user posts from and their average number of followers while participating in the TidyTuesday competition. We chose to show the distributions of follower count as density plots in order to better examine the shapes of the distributions; we’re interested in not only typical users, but also “extreme” users from the right tail of the distribution. In order to make these extreme values more visible, a log scale is applied to average follower counts.
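The per-user summary described above can be sketched on a toy table. The usernames and counts are hypothetical, and slice_max() is used here as one convenient way to pick each user's most frequent device; ties are broken arbitrarily by taking the first row.

```r
library(dplyr)

# Hypothetical data: one row per tweet
toy_tweets <- tibble(
  username   = c("ana", "ana", "ana", "ben", "ben", "cy"),
  devicetype = c("iPhone", "Web App", "Web App", "Android", "Android", "iPhone"),
  followers  = c(100, 110, 120, 50, 50, 2000)
)

# Most frequent device per user
primary_device <- toy_tweets %>%
  count(username, devicetype) %>%
  group_by(username) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-n)

# Average follower count per user across all of their tweets
avg_followers <- toy_tweets %>%
  group_by(username) %>%
  summarise(avg_followers = mean(followers))

# One row per user: primary device plus average followers
per_user <- inner_join(primary_device, avg_followers, by = "username")
per_user
```

In this toy table "ana" posts once from an iPhone and twice from the Web App, so she is assigned Web App with an average of 110 followers. A user who splits tweets evenly between two devices would be assigned one of them arbitrarily, which is a limitation of any single "primary device" label.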

Analysis

tweets <- tweets %>%
  mutate(devicetype = case_when(
    str_detect(text, "Twitter Web App") ~ "Web App",
    str_detect(text, "Twitter for iPhone") ~ "iPhone",
    str_detect(text, "Twitter for Android") ~ "Android",
    str_detect(text, "Buffer") ~ "Buffer",
    str_detect(text, "Twitter for iPad") ~ "iPad",
    str_detect(text, "TweetDeck") ~ "TweetDeck",
    str_detect(text, "Crowdfire App") ~ "Crowdfire App",
    str_detect(text, "Twitter for Mac") ~ "Mac"
  ))

tweets_device <- tweets %>%
  filter(devicetype %in% c("iPhone", "Web App", "Android"))

users_and_devicetype <- tweets %>%
  group_by(username) %>%
  select(username, devicetype) %>%
  count(devicetype) %>%
  arrange(desc(n)) %>%
  slice(1) %>%
  select(-n)

users_and_avg_follower <- tweets %>%
  group_by(username) %>%
  summarise(avg_followers = mean(followers))

joined_devices_follower <- inner_join(users_and_devicetype,
  users_and_avg_follower,
  by = "username"
) %>%
  filter(devicetype %in% c("iPhone", "Web App", "Android"))

# Plot drops NAs and tweets with 0 likes since log(0) is undefined
ggplot(data = tweets_device, aes(x = like_count, fill = devicetype)) +
  facet_wrap(~devicetype, ncol = 1) +
  geom_boxplot(show.legend = FALSE) +
  scale_fill_manual(values = c("#a4c639", "#A2AAAD", "#1DA1F2")) +
  scale_x_log10() +
  labs(
    title = "Distribution of Likes",
    subtitle = "by Device Tweet Was Posted From",
    caption = "x-axis uses logarithmic scale (tweets with 0 likes are omitted)",
    x = "Number of Likes",
    y = NULL 
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, face = "italic"),
    plot.caption = element_text(face = "italic", hjust = 0.5),
    axis.ticks = element_blank(),
    axis.text.y = element_blank(),
    text = element_text(family = "Times New Roman")
  )

# Plot drops NAs 
ggplot(
  data = joined_devices_follower,
  aes(x = avg_followers, fill = devicetype)
) +
  geom_density(show.legend = FALSE) +
  facet_wrap(. ~ devicetype, nrow = 3) +
  theme_minimal() +
  scale_x_log10(labels = label_number(big.mark = ",")) +
  scale_fill_manual(values = c("#a4c639", "#A2AAAD", "#1DA1F2")) +
  scale_y_continuous(labels = label_number(accuracy = .1)) +
  labs(
    title = "Distribution of Follower Count",
    subtitle = "by User's Primary Device",
    caption = "x-axis uses logarithmic scale", 
    x = "Average Number of Followers Per User",
    y = NULL
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, face = "italic"),
    plot.caption = element_text(face = "italic"),
    text = element_text(family = "Times New Roman")
  )

Discussion

In our first visualization, we immediately see strong skew in the number of likes for all three device types - because the axis is logarithmic, a box that looks “symmetric” actually corresponds to strongly right-skewed data. The strongest right skew and highest median number of likes belong to Web App tweets. One possible explanation is the challenge’s format - most tweets are either going to be a detailed plot recreation or commentary on another plot. We suspect that plots are disproportionately published from the Web App due to the screen size required to work with data visualization - since plots will generally earn more viewership than commentary, Web App posts are more likely to attain superstar status.
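The claim that a symmetric shape on a log axis corresponds to right skew on the raw scale can be illustrated with simulated data. The values below are simulated, not drawn from the tweets, and the parameters (mean 2, standard deviation 1 on the log10 scale) are arbitrary assumptions.

```r
set.seed(42)

# Simulate values that are symmetric (normal) on the log10 scale
log_likes <- rnorm(10000, mean = 2, sd = 1)

# Transform back to the raw scale
likes <- 10^log_likes

# On the log scale, mean and median nearly coincide (symmetry)
c(mean = mean(log_likes), median = median(log_likes))

# On the raw scale, the mean is pulled far above the median (right skew)
c(mean = mean(likes), median = median(likes))
```

The gap between mean and median on the raw scale is the signature of a long right tail: a small number of very large values dominate the average.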

Our second plot examines average followers rather than likes in order to assess long-term success rather than individual post success. Overall, Android users have a flat distribution while iPhone and Web App users show left and right skew in the plot, respectively. However, the use of a logarithmic scale means that these shapes correspond to strongly right-skewed data. Of these distributions, the Web App distribution is skewed the furthest to the right, followed by Android and iPhone. In terms of modal values in the distributions, iPhone users have the sharpest and largest peak, followed by a slightly lower mode for Web App users and no clear peak for Android users.

We expected likes and followers to have a similar relationship with devicetype, where the device that garners more likes would also attract the most followers. Surprisingly, we see that the Web App is responsible for the “superstar” tweets with a high like count while iPhone users tend to draw the most followers. We have two possible explanations for these effects. An explanation for the Web App might be that it is much easier to type out and structure tweets on a computer than it is on a phone. In fact, one study from Grammarly showed that people make five times as many writing errors on their phone as on a PC. This may be one reason why tweets authored on the Web App seem to attract more likes than those authored on a mobile device like an Android or iPhone: they could be written more methodically and thoughtfully. With regard to followers, there are socioeconomic considerations; iPhones tend to be more expensive than most Android phones, and higher socioeconomic status could lead to more free time to create tweets on a regular basis. However, we lack data on both writing times and the cost of a user’s device; thus, we can’t draw causal conclusions about the relationship between device and tweet success, but the above pattern warrants future study with more experimental data.

Presentation

Our presentation can be found here.

Data

Starks, A, Hillery, A, Tyler, S 2021, Du Bois and Juneteenth revisited: TidyTuesday week 8 of 2021, electronic dataset, tidytuesday, retrieved 13 September 2021, https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-06-15/readme.md.