---
title: "R Workshop Series - Presentation 2"
author: "Eric Geesaman - Assistant Director of Workshops/Databases"
date: "Wednesday, April 14, 2021"
output:
  html_notebook: default
---

# Introduction

Hello everyone and welcome back to R! After reviewing data importing and wrangling in last week's lesson, tonight we will switch our focus to the graphical portion of R. As a program, R has *amazing* graphics, ranging from simple black-and-white histograms to multi-variable, color-coded 3D charts. While these complex charts are fun to make, it is important to keep in mind the goal of a graph: to inform the audience of the necessary data without adding anything confusing or unnecessary. With that in mind, let's discuss tonight's agenda.

## Tonight's Goals

Tonight's goal is to learn or review some R graphics basics, depending on your past experience. We will discuss:

1. What is **ggplot**, and what are some common graphs in the package?
2. What is **plotly**, and what are some common graphs in the package?

Before we dive into each of the packages, let's load them and some additional packages into our R session using the code chunk below.

```{r}
# clear R session
rm(list = ls())

# install packages for session (only needed the first time)
install.packages("scales")
install.packages("gridExtra")
# "grid" ships with base R, so it does not need to be installed

# load packages into R session
library(ggplot2)   # graphing
library(plotly)    # graphing
library(tidyverse) # chaining
library(dplyr)     # wrangling
library(scales)    # number formatting
library(gridExtra) # organizing graphs
library(grid)      # organizing graphs
```

# ggplot Overview

**ggplot** (package **ggplot2**) is a widely used graphics builder in the R community. We can create many different types of graphs using the package.
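
As a rough sketch of how a **ggplot** graph is assembled, the chunk below builds a plot layer by layer. It uses `mtcars`, a data set built into R that is not part of this workshop's data, and the object name `p` is only for illustration.

```{r}
library(ggplot2)

# Start with the data and the aesthetic mapping (which column goes on which axis)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # Add a geom, which decides how the mapped data are drawn
  geom_point() +
  # Add labels so the axes are readable
  labs(x = "Weight (1,000 lbs)", y = "Miles per Gallon")

# Printing the object draws the plot
p
```

Every graph in tonight's examples follows this same pattern: a `ggplot()` call, one or more geoms, and then optional labels, scales, and themes joined with `+`.
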
Here is an *extremely* helpful **ggplot** cheat sheet that I encourage you to review when you are creating graphics with the package:

[ggplot Cheat Sheet](https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)

Also, here is a "Colors in R" page that you can use when inputting colors into the program. There are many different color options, but always make sure your colors do not clash and make sure your audience can see them well (e.g., do not use "white" on a white background).

[Colors in R](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf?utm_source=twitterfeed&utm_medium=twitter)

### Using ggplot

Now that we have the cheat sheet for assistance, let's load some data from the last presentation and use it to build some graphs.

```{r}
### DATA LOAD/IMPORT

# neighborhood data
neighborhood <- data.frame(
  first_name = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T"),
  last_name = c("W", "X", "X", "V", "V", "U", "X", "X", "V", "V", "V", "W", "Y", "Z", "X", "V", "U", "W", "X", "W"),
  family_member = c(4, 2, 4, 1, 1, 4, 2, 3, 3, 3, 1, 5, 2, 2, 5, 3, 4, 2, 5, 5),
  height_ft = c(6, 6, 6, 6, 6, 5, 5, 5, 4, 6, 4, NA, 6, 6, NA, 5, 4, 5, NA, NA),
  height_in = c(9, 6, 0, 2, 2, 3, 8, 1, 3, 0, 7, NA, 1, 6, NA, 7, 9, 6, NA, NA),
  job_status = c(0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, NA, 0, 1, NA, 1, 1, 1, NA, NA))

# recode the numeric codes into readable labels
neighborhood_recoded <- neighborhood %>%
  mutate(family_member = case_when(family_member == 1 ~ "Grandparent",
                                   family_member == 2 ~ "Parent",
                                   family_member == 3 ~ "Child",
                                   family_member == 4 ~ "Guardian",
                                   family_member == 5 ~ "Pet"),
         job_status = case_when(job_status == 0 ~ "Unemployed",
                                job_status == 1 ~ "Employed"))

# compute total height in inches and in centimeters
neighborhood_heights <- neighborhood_recoded %>%
  mutate(height_inches = height_ft*12 + height_in,
         height_cm = height_inches*2.54)

# vaccine data
state_abbreviations <- read.csv("C:\\Users\\eric_\\OneDrive - The Pennsylvania State University\\Clubs\\Actuarial Science\\Assistant Director of Workshops-Databases\\Technical Workshop Series\\Week 1\\Presentation 1\\StateAbbreviations.csv") %>%
  rename(state = State,
         abbreviation = Abbreviation,
         region = Region)

vaccine_data <- read.csv("C:\\Users\\eric_\\OneDrive - The Pennsylvania State University\\Clubs\\Actuarial Science\\Assistant Director of Workshops-Databases\\Technical Workshop Series\\Week 1\\Presentation 1\\VaccineData.csv") %>%
  rename(state = Name,
         one_shot = `Pct..of.people.given.At.least.one.shot`,
         two_shots = `Pct..of.people.given.Two.shots`,
         delivered_doses = `Doses.delivered`,
         given_shots = `Shots.given`,
         used_doses = `Doses.used`) %>%
  mutate(one_shot = as.numeric(gsub("%","",one_shot)),
         two_shots = as.numeric(gsub("%","",two_shots)),
         used_doses = used_doses*100) %>%
  full_join(state_abbreviations) %>%
  select(state, abbreviation, region, one_shot, two_shots, delivered_doses, given_shots, used_doses) %>%
  filter(!is.na(abbreviation))
```

##### Neighborhood Data

```{r}
# Let's start with a scatterplot of height in inches to height in centimeters.
ggplot(neighborhood_heights, aes(x = height_inches, y = height_cm)) +
  geom_point()

# This graph is great. It shows that our inches to centimeters computation worked for each person, and the pets (whose heights are NA) are dropped automatically with a warning. Let's add a few extra elements to make this a "showable" graph.
ggplot(neighborhood_heights, aes(x = height_inches, y = height_cm)) +
  geom_point(color = "navy", size = 3) +
  geom_smooth(method = "lm", color = "gray70") +
  labs(title = "Family Heights",
       subtitle = "Inches to Centimeters",
       x = "Height (Inches)",
       y = "Height (Centimeters)") +
  theme_bw()

# Now, let's construct a boxplot, where the x-axis will be employment status and the y-axis will be height in centimeters.
ggplot(neighborhood_heights, aes(x = job_status, y = height_cm)) +
  geom_boxplot()

# We do not want the "NA" part of the job status. These entries are pets that do not have heights either, so let's perform a small adjustment to filter out the NA entries.
# Then, we need to use the new data frame for graphing.
neighborhood_newjobs <- neighborhood_heights %>%
  filter(!is.na(job_status))

ggplot(neighborhood_newjobs, aes(x = job_status, y = height_cm)) +
  geom_boxplot()
```

##### Vaccination Data

```{r}
# Let's make a bar chart of doses delivered by state.
ggplot(vaccine_data, aes(x = state, y = delivered_doses)) +
  geom_bar(stat = "identity")

# There are so many states that we cannot read their names! Let's fix that, bring some color to the graph, and retitle the axes.
ggplot(vaccine_data, aes(x = state, y = delivered_doses)) +
  geom_bar(stat = "identity", aes(fill = region)) +
  scale_fill_manual(values = c("goldenrod", "blue", "darkgreen", "firebrick", "lightpink")) +
  labs(title = "COVID-19 Vaccine Doses Delivered by State",
       subtitle = "Colored by State's Region in Country",
       x = "State",
       y = "Delivered Doses") +
  scale_y_continuous(labels = comma) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 1))

# Instead of coloring the bars, let's make mini plots for each of the regions. To do this, we need to use a feature in ggplot called "facet_wrap" (or "facet_grid" if you want to lay the mini plots out in rows and columns defined by variables). Let's practice below.
ggplot(vaccine_data, aes(x = state, y = delivered_doses)) +
  geom_bar(stat = "identity") +
  facet_grid(~region) +
  scale_y_continuous(labels = comma) +
  theme(axis.text.x = element_blank(),
        axis.ticks = element_blank())

# The state labels are too tight to read in this layout, which is why they are hidden above. Let's fix this using the grid and gridExtra packages as well as creating data frames for each region.
vaccine_data_mw <- vaccine_data %>% filter(region == "MW")
vaccine_data_ne <- vaccine_data %>% filter(region == "NE")
vaccine_data_se <- vaccine_data %>% filter(region == "SE")
vaccine_data_sw <- vaccine_data %>% filter(region == "SW")
vaccine_data_w <- vaccine_data %>% filter(region == "W")

# One bar chart per region. The y-axis limits are set inside scale_y_continuous();
# adding a separate ylim() would replace that scale and drop the comma labels.
barplot_mw <- ggplot(vaccine_data_mw, aes(x = state, y = delivered_doses)) +
  geom_bar(stat = "identity", fill = "goldenrod") +
  scale_y_continuous(labels = comma, limits = c(0, 12000000)) +
  labs(title = "Midwest", x = "State", y = "Delivered Doses") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 1))

barplot_ne <- ggplot(vaccine_data_ne, aes(x = state, y = delivered_doses)) +
  geom_bar(stat = "identity", fill = "darkblue") +
  scale_y_continuous(labels = comma, limits = c(0, 12000000)) +
  labs(title = "Northeast", x = "State", y = "Delivered Doses") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 1))

barplot_se <- ggplot(vaccine_data_se, aes(x = state, y = delivered_doses)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  scale_y_continuous(labels = comma, limits = c(0, 12000000)) +
  labs(title = "Southeast", x = "State", y = "Delivered Doses") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 1))

barplot_sw <- ggplot(vaccine_data_sw, aes(x = state, y = delivered_doses)) +
  geom_bar(stat = "identity", fill = "firebrick") +
  scale_y_continuous(labels = comma, limits = c(0, 12000000)) +
  labs(title = "Southwest", x = "State", y = "Delivered Doses") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 1))

barplot_w <- ggplot(vaccine_data_w, aes(x = state, y = delivered_doses)) +
  geom_bar(stat = "identity", fill = "gray") +
  scale_y_continuous(labels = comma, limits = c(0, 12000000)) +
  labs(title = "West", x = "State", y = "Delivered Doses") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 1))

# Arrange the five plots in two rows and three columns; plot 5 fills both rows of the third column.
grid.arrange(barplot_mw, barplot_ne, barplot_se, barplot_sw, barplot_w,
             layout_matrix = cbind(c(1, 2), c(3, 4), c(5, 5)),
             top = textGrob("COVID-19 Vaccine Doses Delivered by State and Region"))
```

### Using plotly

**plotly** is a little more complex than **ggplot**. For this reason, the weekly assignment will mainly focus on **ggplot**, but there will be some analysis using **plotly** as well. The advantage of **plotly** is that its graphs are interactive: the audience can click or hover over the graph and receive more information than a static graph can show. While this is great, remember that a graph should not contain more information than necessary because it could become confusing.

There is no traditional cheat sheet for this package; instead, there is an entire website dedicated to **plotly** graphics:

[plotly Graphics Website](https://plotly.com/r/)

##### United States' Sports Salaries Data

```{r}
Salaries_Final <- read.csv("C:\\Users\\eric_\\OneDrive - The Pennsylvania State University\\Clubs\\Actuarial Science\\Assistant Director of Workshops-Databases\\Technical Workshop Series\\Week 2\\Presentation 2\\Salaries_Final.csv") %>%
  select(name, team, position, rank, salary, league)

# Let's build an interactive boxplot, looking at players' salaries by league, based on the ranking of their team in the previous season.
plot_ly(Salaries_Final,
        x = ~salary,
        y = ~league,
        color = ~rank,
        colors = c("goldenrod2", "darkgreen", "yellowgreen", "darkorange", "firebrick"),
        type = "box") %>%
  layout(boxmode = "group",
         title = "Salary by Standings Position",
         xaxis = list(title = "Salary"),
         yaxis = list(title = "League"))
```

# Weekly Assignment

After going through these examples, you should now be able to complete the assignment that will be due one week from today. If you have any trouble, as always with R, the first step is to Google the error you get or the function you want to use! If you cannot find the answer or get the help you need this way, please do not hesitate to reach out to me. Again, my email is erg5331@psu.edu.
Thank you all for attending the R presentations, and I wish you luck in completing the final project, which is due Wednesday, April 28, 2021 by 11:59 PM EST.