---
title: "R Workshop Series - Presentation 1"
author: "Eric Geesaman - Assistant Director of Workshops/Databases"
date: "Thursday, April 8, 2021"
output:
  html_notebook: default
---

# Introduction

Hello everyone and welcome to R! R is a statistical programming language that lets you analyze very large sets of data. You can also import Excel spreadsheets, so it is a great tool for collaborating across programs. If you ever run into problems with R itself or with code that you wrote, I *strongly* encourage you to Google your problem. Chances are, at least one other person has run into the same issue! If you cannot find a solution, feel free to email me at erg5331@psu.edu, and we can work on resolving the error together.

## Tonight's Goals

Tonight's goal is to learn or review some R basics, depending on your past experience. We will discuss:

1. What is R, and what are some common uses?
2. What is a package, and what are some commonly-used packages in R?
3. What is a function, and what are commonly-used functions in R?
4. What is a data frame? How is one created in R and how can they be imported?
5. What is data wrangling, and how is it performed?

# R Overview

R is extremely versatile. Tonight, we are writing our R code in RStudio. Within RStudio, we are working in an R Notebook. This is the best option when creating code for others to see. The great thing about an R Notebook is that it is not exclusively for coding! For example, this portion of the notebook provides the written explanation for tonight's lesson. This brings us to the first portion of tonight's session: using R for documentation and commentary.

### R Markdown Cheat Sheets

R has many different uses and functions. As I said, Google is a great place to start, but the answers usually come as text, like this. If you are a visual learner, the R community provides awesome cheat sheets for different functions in R.
I will link cheat sheets throughout the workshop to help you understand R.

### R Notebooks

When performing data analysis, it is important to discuss and document findings. We do this with words, symbols, and formatting. All of these can be done in an R Notebook, too. Instead of diving into each of these specifically, here is a link to a cheat sheet that walks through a variety of uses within R Notebooks: [R Markdown Cheat Sheet](https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf)

Additionally, R Notebooks allow users to export their work into files such as .html, .pdf, and .docx. This makes the work easy to share with people who do not have R, and it tidies the formatting and code.

### R Script

If you want to do some simple calculations or coding and do not intend for anyone else to view it, an R Script is a great place to do so. Under the "File" menu, choose "New File" and then "R Script." This opens a blank page intended just for coding. Tonight and for the after-presentation assignments, we will use R Notebooks so that we can export our work neatly.

# Packages

R is such a great program because it is completely free. As a result, the community is expansive, and many members have created "packages" to make coding more efficient. Packages contain functions. If a package is a house, a function is a person in the house. Some packages provide similar functions, so we want to use packages that contain many functions but also meet our needs for whatever we are coding. Additionally, some functions come with R. These functions are referred to as being in the "Base" R package.

To use functions from other packages, we have to "install" the package, or simply load it if we already have it installed. For our first coding activity in R, let's install some packages. If you already have these packages, you can re-install them to check for updates.
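The install-then-load idea described above can be sketched as follows. This is a minimal sketch and not part of the workshop code: `install_if_missing()` is a hypothetical helper name, and the example uses the *stats* package, which ships with R, so nothing is actually downloaded.

```r
# Hypothetical helper: install a package only when it is not already installed.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is now available.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" comes with R, so this call does not download anything.
install_if_missing("stats")

# Load the package (the "house") so its functions (the "people") can be used.
library(stats)

# package::function names the house and the person explicitly.
stats::median(c(1, 3, 6))
```

Checking before installing avoids re-downloading a package every time the notebook is run.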
Here is a cheat sheet for "base" R functions that you can use without loading any packages: [Base R Cheat Sheet](https://rstudio.com/wp-content/uploads/2016/10/r-cheat-sheet-3.pdf)

```{r}
# This is a code chunk in R. You can insert one with the "Insert" dropdown
# above by selecting "R". More quickly, you can use the keyboard shortcut
# CTRL+ALT+I on Windows or CMD+OPTION+I on Mac.

# The "#" in a code chunk means that R will skip the rest of that line when
# running the code. To have R run a line of code, do not include a #, as
# demonstrated next.

install.packages("tidyverse") # collection of packages; includes the pipe (%>%) for chaining lines of code
install.packages("dplyr")     # data manipulation
install.packages("tidyr")     # tidy data (best data for R)
install.packages("stringr")   # character strings
install.packages("lubridate") # dates/times
install.packages("ggplot2")   # graphing
install.packages("plotly")    # graphing
install.packages("shiny")     # data presentation
install.packages("rmarkdown") # exporting R Notebooks
install.packages("xml2")      # importing data (reading HTML/XML)
install.packages("rvest")     # importing data (tables from web pages)

# Clear any existing objects from the workspace so we start fresh.
rm(list = ls())

# Now, we have to "load" those installed packages to use them.
library(tidyverse)
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(ggplot2)
library(plotly)
library(shiny)
library(rmarkdown)
library(xml2)
library(rvest)
```

# Functions

Now that we have some basic packages, we can use some basic functions. We actually already used two functions, install.packages() and library(). Both of these come with R, so we did not have to load a package before we used them. If you ever need help remembering how a function is used, you can use the help() function or type ? before the function's name. Let's practice some basic functions.

```{r}
# How can we use the mean function?
?mean

# Let's create a vector, using the "combine" function, c().
vector1 <- c(1, 3, 6, 8, 57, 204, 395)

# What is the average of our vector?
mean(vector1)

# What is the middle number of our vector?
median(vector1)

# What is the spread (standard deviation) of our vector?
sd(vector1)
```

# Dataframes

A data frame is the term R coders use for a data table, or simply, a table. You can create these directly in R, but often we want to import an Excel worksheet or a website that already has a table in it. Let's start by creating our own data frame and then importing one from Excel and another from a website.

### Creating Dataframes in R

```{r}
# Let's create a fictitious neighborhood with 20 residents, 6 families,
# 5 family roles (1 = grandparent, 2 = parent, 3 = child, 4 = guardian,
# 5 = pet), heights from 4'0" to 6'11", and job status (0 = unemployed,
# 1 = employed). Pets have no height or job status, so those values are NA.
neighborhood <- data.frame(
  first_name = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T"),
  last_name = c("W", "X", "X", "V", "V", "U", "X", "X", "V", "V", "V", "W", "Y", "Z", "X", "V", "U", "W", "X", "W"),
  family_member = c(4, 2, 4, 1, 1, 4, 2, 3, 3, 3, 1, 5, 2, 2, 5, 3, 4, 2, 5, 5),
  height_ft = c(6, 6, 6, 6, 6, 5, 5, 5, 4, 6, 4, NA, 6, 6, NA, 5, 4, 5, NA, NA),
  height_in = c(9, 6, 0, 2, 2, 3, 8, 1, 3, 0, 7, NA, 1, 6, NA, 7, 9, 6, NA, NA),
  job_status = c(0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, NA, 0, 1, NA, 1, 1, 1, NA, NA)
)

# Let's recode the numeric codes into readable labels before analyzing.
neighborhood_recoded <- neighborhood %>%
  mutate(family_member = case_when(family_member == 1 ~ "Grandparent",
                                   family_member == 2 ~ "Parent",
                                   family_member == 3 ~ "Child",
                                   family_member == 4 ~ "Guardian",
                                   family_member == 5 ~ "Pet"),
         job_status = case_when(job_status == 0 ~ "Unemployed",
                                job_status == 1 ~ "Employed"))

# What is the mean height (in centimeters) of each family?
neighborhood_familyheights <- neighborhood_recoded %>%
  # Convert feet and inches to total inches, then to centimeters.
  mutate(height_inches = height_ft * 12 + height_in,
         height_cm = height_inches * 2.54) %>%
  group_by(last_name) %>%
  # na.rm = TRUE skips residents with no recorded height (the pets).
  summarise(avg_heightcm = mean(height_cm, na.rm = TRUE))

neighborhood_familyheights
```

### Importing Dataframes from Excel

```{r}
# read.csv() reads a spreadsheet that was saved as a .csv file.
# Replace this path with the file's location on your own computer.
excel_df <- read.csv("C:\\Users\\eric_\\OneDrive - The Pennsylvania State University\\Clubs\\Actuarial Science\\Assistant Director of Workshops-Databases\\StateAbbreviations.csv")

head(excel_df)
```

### Importing Dataframes from Websites

```{r}
# Read the web page and pull out the first table on it.
vaccine_nyt <- "https://www.nytimes.com/interactive/2020/us/covid-19-vaccine-doses.html"
vaccine_converted <- read_html(vaccine_nyt)
vaccine_data <- html_table(vaccine_converted)[[1]] %>%
  as_tibble(.name_repair = "unique")

# Read the state abbreviations and give the columns lowercase names.
state_abbreviations <- read.csv("C:\\Users\\eric_\\OneDrive - The Pennsylvania State University\\Clubs\\Actuarial Science\\Assistant Director of Workshops-Databases\\StateAbbreviations.csv") %>%
  rename(state = State,
         abbreviation = Abbreviation,
         region = Region)

# Read a saved CSV copy of the vaccine table and clean it up.
# Note: this overwrites the vaccine_data object scraped above.
vaccine_data <- read.csv("C:\\Users\\eric_\\OneDrive - The Pennsylvania State University\\Clubs\\Actuarial Science\\Assistant Director of Workshops-Databases\\VaccineData.csv") %>%
  rename(state = Name,
         one_shot = `Pct..of.people.given.At.least.one.shot`,
         two_shots = `Pct..of.people.given.Two.shots`,
         delivered_doses = `Doses.delivered`,
         given_shots = `Shots.given`,
         used_doses = `Doses.used`) %>%
  # Strip the "%" signs so the percentages can be treated as numbers.
  mutate(one_shot = as.numeric(gsub("%", "", one_shot)),
         two_shots = as.numeric(gsub("%", "", two_shots)),
         used_doses = used_doses * 100) %>%
  # Add each state's abbreviation and region (joined on the shared column names).
  full_join(state_abbreviations) %>%
  select(state, abbreviation, region, one_shot, two_shots, delivered_doses, given_shots, used_doses) %>%
  # Keep only rows that matched a state abbreviation.
  filter(!is.na(abbreviation))

vaccine_data
```

# Data Wrangling

Before we begin truly picking data apart, I want to share this cheat sheet on the *dplyr* package. In R, *dplyr* is used to manipulate data with a variety of functions linked by the pipe operator (%>%). The pipe comes from the *magrittr* package and is available whenever *dplyr* or the *tidyverse* is loaded.
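To make the pipe idea above concrete, here is a small sketch comparing nested function calls with the same steps written using %>%. The `scores` data frame and its column names are made up for illustration, and the sketch assumes *dplyr* is installed.

```r
library(dplyr)

# Made-up data for illustration only.
scores <- data.frame(
  team   = c("a", "a", "b", "b"),
  points = c(10, 20, 30, 50)
)

# Without the pipe: calls are nested and read from the inside out.
nested <- summarise(group_by(scores, team), avg_points = mean(points))

# With the pipe: each step passes its result to the next function.
piped <- scores %>%
  group_by(team) %>%
  summarise(avg_points = mean(points))

piped
```

Both versions produce the same table; the piped version simply reads top to bottom in the order the steps happen.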
Here is the *dplyr* cheat sheet: [dplyr Cheat Sheet](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

## Data Manipulation on Vaccine Data

1. What is the one:two shot ratio for Pennsylvania?

```{r}
vaccine_data %>%
  filter(state == "Pennsylvania") %>%
  select(state, one_shot, two_shots) %>%
  mutate(one_two_ratio = one_shot / two_shots)
```

2. How many total doses have been delivered to states whose abbreviation begins with the first letter of the alphabet?

```{r}
vaccine_data %>%
  filter(substr(abbreviation, 1, 1) == "A") %>%
  summarise(a_delivered_doses = sum(delivered_doses))
```

3. How many states have at least 8% of their population fully vaccinated? Of these, which state has vaccinated the largest share and which the smallest?

```{r}
# The number of rows returned is the number of states. After sorting, the
# first row has the highest share and the last row has the lowest.
vaccine_data %>%
  filter(two_shots >= 8) %>%
  arrange(desc(two_shots))
```

4. Which region has used the most of its supply?

```{r}
vaccine_data %>%
  group_by(region) %>%
  summarise(avg_used_doses = round(mean(used_doses), digits = 2)) %>%
  arrange(desc(avg_used_doses))
```

# Weekly Assignment

After going through these examples, you should now be able to complete the assignment that is due before next week's lesson. If you have any trouble, as always with R, the first step is to Google the error you get or the function you want to use! If you cannot find the answer or get the help you need this way, please do not hesitate to reach out to me. Again, my email is erg5331@psu.edu.

Thank you all for attending this week's lesson, and I will see you next Wednesday, April 14, 2021.