A little analysis of the training participants based on email domain name. I won’t show the actual emails here, but the emails variable contains all of the email addresses.
First we’re going to extract the top-level domain (the final suffix, such as fi or com) from each one, using str_extract() from the stringr package and a regular expression. The pattern also accepts a suffix followed by a closing >, as happens when an address is wrapped in angle brackets. We’ll send the output to a single-column data frame.
library(stringr)
email_domains <- str_extract(
emails,
"\\.[[:alpha:]]+$|\\.[[:alpha:]]+>$"
) %>%
gsub("\\.|>", "", .) %>%
data.frame(domain = .)
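To see what the regular expression keeps, here is a minimal sketch on a few made-up addresses. The `example_emails` vector is hypothetical and is not taken from the participant list; the pattern and the `gsub()` clean-up are the same as above.

```r
library(stringr)

# Hypothetical addresses, for illustration only
example_emails <- c(
  "anna@example.fi",
  "Bob Smith <bob@example.co.uk>",
  "carol@example.com"
)

# A dot followed by letters at the end of the string,
# optionally with a closing ">" after the letters
suffix_raw <- str_extract(
  example_emails,
  "\\.[[:alpha:]]+$|\\.[[:alpha:]]+>$"
)

# Strip the leading dot and any trailing ">"
example_domains <- gsub("\\.|>", "", suffix_raw)
example_domains
```

Only the last suffix survives, so an address ending in .co.uk is reduced to uk.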
Translate the domain names to countries by joining the suffixes from a CSV file to our data frame.
library(dplyr)
library(here)
domain_countries <- read.csv(here("data", "domain_suffixes.csv")) %>%
rename_with(tolower)
email_domains <- left_join(email_domains, domain_countries)
## Joining with `by = join_by(domain)`
The .com domain has no country name, so it is set to NA - change it to “Unknown”.
email_domains <- mutate(
email_domains,
country = case_when(
is.na(country) ~
"Unknown",
TRUE ~ country
)
)
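The join and the NA replacement can be tried on a small made-up example. The `toy_domains` and `toy_lookup` data frames below are hypothetical stand-ins for `email_domains` and the CSV lookup table. `coalesce()` from dplyr is used here as a shorter alternative to the `case_when()` call above; it is a sketch, not the method used in this analysis.

```r
library(dplyr)

# Hypothetical stand-ins for the real data
toy_domains <- data.frame(
  domain = c("fi", "com", "se", "fi"),
  stringsAsFactors = FALSE
)
toy_lookup <- data.frame(
  domain = c("fi", "se"),
  country = c("Finland", "Sweden"),
  stringsAsFactors = FALSE
)

# Naming the key explicitly avoids the "Joining with ..." message
toy_joined <- left_join(toy_domains, toy_lookup, by = "domain")

# Suffixes missing from the lookup (here "com") come back as NA;
# coalesce() replaces those NAs with a fixed label
toy_joined <- mutate(toy_joined, country = coalesce(country, "Unknown"))
toy_joined$country
```

A left join keeps every row of the first data frame, which is why the unmatched suffix appears as NA rather than being dropped.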
Now we plot with the ggplot2 package, using some functions from the forcats package to control the order of the bars.
library(ggplot2)
library(forcats)
ggplot(email_domains, aes(x = country)) + geom_bar()
Let’s fix it so we can see the country names by rotating the axis text, setting the horizontal justification to align with the tick and the vertical justification to centre on the tick.
ggplot(email_domains, aes(x = country)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
It’s still hard to read - long text should always be on the y-axis so you don’t have to turn your head to read it!
ggplot(email_domains, aes(y = country)) +
geom_bar()
Having the data in alphabetical order doesn’t really help with ranking the countries, so let’s put them in order of frequency. We have to set the axis label ourselves, otherwise the axis title shows the expression fct_infreq(country).
ggplot(email_domains, aes(y = fct_infreq(country))) +
geom_bar() +
labs(y = "country")
Maybe the ones with the most should go at the top, so we reverse the factors…
ggplot(email_domains, aes(y = fct_rev(fct_infreq(country)))) +
geom_bar() +
labs(y = "country")
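To check what the two forcats functions do to the level order without drawing a plot, here is a small sketch on a made-up vector (the values in `toy_country` are hypothetical). ggplot2 draws the first level of a discrete y-axis at the bottom, which is why reversing the frequency order puts the most frequent country at the top.

```r
library(forcats)

# Hypothetical countries: Finland x3, Sweden x2, Norway x1
toy_country <- c("Sweden", "Finland", "Norway", "Finland", "Sweden", "Finland")

# Default factor levels are alphabetical
levels_alpha <- levels(factor(toy_country))

# fct_infreq() orders levels from most to least frequent
levels_freq <- levels(fct_infreq(toy_country))

# fct_rev() flips that order, so the most frequent level comes last
levels_rev <- levels(fct_rev(fct_infreq(toy_country)))

levels_alpha
levels_freq
levels_rev
```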
Perhaps we can colour by country?
ggplot(
email_domains,
aes(y = fct_rev(fct_infreq(country)), fill = fct_infreq(country))
) +
geom_bar() +
labs(y = "country", fill = NULL)
There are probably too many countries for that to be meaningful. Maybe we can colour by count instead. While we’re at it, we can fix the scale breaks on the x-axis and remove the gap at the left. We don’t really need a y-axis title at all, as the data are self-explanatory. We can add a title too! Plus there’s a little bit of adjustment to get the grid lines around the bars rather than through the middle of them: the bars are nudged down by half a unit and the axis text is shifted to stay level with them.
ggplot(
email_domains,
aes(y = fct_rev(fct_infreq(country)), fill = after_stat(count))
) +
geom_bar(position = position_nudge(y = -0.5)) +
labs(
x = "Number of participants",
y = NULL,
title = "Finland has more participants at the 2022 harp Training Course\nthan any other country"
) +
scale_fill_viridis_c(guide = "none") +
scale_x_continuous(
breaks = seq(0, 10),
expand = expansion(add = c(0, 0.5))
) +
scale_y_discrete(expand = expansion(add = 0)) +
theme(
panel.grid.minor.x = element_blank(),
axis.ticks = element_blank(),
axis.text.y = element_text(vjust = 1.5)
)
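The title states a ranking, so it is worth confirming it against a table of counts before committing to the wording. A minimal sketch, using a made-up `toy_participants` data frame in place of the real `email_domains` (the numbers are invented and say nothing about the actual course):

```r
library(dplyr)

# Hypothetical participants, for illustration only
toy_participants <- data.frame(
  country = c("Finland", "Sweden", "Finland", "Norway", "Finland", "Sweden"),
  stringsAsFactors = FALSE
)

# count() tallies rows per country; sort = TRUE puts the largest group first
toy_counts <- count(toy_participants, country, sort = TRUE)
toy_counts
```

Running the same `count()` call on `email_domains` should give the bar lengths in the plot, and its largest value shows whether the hard-coded `breaks = seq(0, 10)` covers the full range.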