---
title: "R Workshop Series - Assignment 1"
author: "Your Name Here"
date: "Due: Wednesday, April 14, 2021"
output: html_notebook
---

# Introduction

In this assignment, we are going to review importing data into RStudio from both websites and Excel. Additionally, we will manipulate some data to answer the questions that appear throughout the assignment. As I said in the presentation, please try to use Google as much as possible for working through the assignment. If you cannot find a solution with it, my email is erg5331@psu.edu. I will also hold two office hours for each assignment. If you cannot attend these, please email me with questions. Put the assignment and question number in the subject of your email, and we can either schedule a 1-on-1 Zoom "office hours" for help or we can engage in a discussion through email. With that said, here is the assignment!

# Load Packages

For this assignment, we will need to use the packages for manipulating data, using the pipe operator, and importing websites (hint: there are 2 packages). We are also going to use a new package that includes a function called *retype()*. Essentially, this function makes R reconsider the type of each variable, so a variable can be corrected to an integer when R thinks it is a character. To use the function, place it at the end of your data frame corrections/revisions. The function is located in the package **hablar**, which you will need to install if you do not have it already.

```{r}
# clear R environment
rm(list = ls())

# Use this code chunk to load the 5 packages needed for this assignment.
```

# Importing Data

In order to manipulate data, we need to load some data into our R session. We are going to load 2 datasets for this assignment. The first will be from a website and the second will be from 2 .csv files in the email, which are labeled "Election2008.csv" and "Election2012.csv".

### Website Data

Our first data import will be from the 2008 and 2012 United States Presidential Elections.
Elections provide a great source of data. Additionally, we can learn a lot about trends in the country by looking at election data. Here are the Wikipedia pages:

[2008 United States Presidential Election - Wikipedia](https://en.wikipedia.org/wiki/2008_United_States_presidential_election)

[2012 United States Presidential Election - Wikipedia](https://en.wikipedia.org/wiki/2012_United_States_presidential_election)

```{r}
# Use this code chunk to import the 2 data frames, one from each Wikipedia page.
# Once you have the code for one election, you can simply copy/paste it for the other election.
# I encourage you to use this method for the rest of the assignment.
```

### Excel Data

Now, let's import a data frame from Excel. The workbook has additional classifications for the states during the elections. Each state has its own row, and the columns are labeled "Abbreviation", "Region", "Winner", and "Counties". *Abbreviation* is the postal abbreviation of the state/district. *Region* is the region of the country in which each state/district is located. *Winner* is the candidate who received the most votes in each state/district. *Counties* is the number of counties in each state/district.

```{r}
# Use this code chunk to import the Excel sheets labeled "Election2008.csv" and "Election2012.csv".
```

# Wrangling Data

### Reorganize Website Data Frames

The data from the Wikipedia pages probably looks terrible at this point. Let's clean it up by giving each variable (column) a unique name and removing the abbreviation column. The abbreviations for each state/district are included in the Excel files. Additionally, we want to get rid of the final row ("U.S. Total") because we can calculate this ourselves. This is an extremely difficult clean-up, especially if this is your first time coding in R. Because of that, I have provided the code for cleaning the data frame, and I strongly encourage you to read each line and read my commentary to see what each function is accomplishing.
Feel free to copy/paste this code if you ever need to do some web scraping in the future. The only modifications you need to make are putting your data frame (df) name in each of the required spots and putting all of the code in a chunk. I could not include it in a chunk because the document would not have knit.

put_your_2008_df_name_here %>%
####### We need to rename each of the unique column names from the Wiki table.
####### Notice that we have to surround the variable names with backticks instead of quotation marks because the variables from Wikipedia contained spaces.
  rename(state_district = `...1`,
         electoral_votes = `...2`,
         obama2008_vote = `Barack ObamaDemocratic...3`,
         obama2008_vote_pct = `Barack ObamaDemocratic...4`,
         obama2008_ev = `Barack ObamaDemocratic...5`,
         mccain_vote = `John McCainRepublican...6`,
         mccain_vote_pct = `John McCainRepublican...7`,
         mccain_ev = `John McCainRepublican...8`,
         nader_vote = `Ralph NaderIndependent...9`,
         nader_vote_pct = `Ralph NaderIndependent...10`,
         nader_ev = `Ralph NaderIndependent...11`,
         barr_vote = `Bob BarrLibertarian...12`,
         barr_vote_pct = `Bob BarrLibertarian...13`,
         barr_ev = `Bob BarrLibertarian...14`,
         baldwin_vote = `Chuck BaldwinConstitution...15`,
         baldwin_vote_pct = `Chuck BaldwinConstitution...16`,
         baldwin_ev = `Chuck BaldwinConstitution...17`,
         mckinney_vote = `Cynthia McKinneyGreen...18`,
         mckinney_vote_pct = `Cynthia McKinneyGreen...19`,
         mckinney_ev = `Cynthia McKinneyGreen...20`,
         other_vote = `Others...21`,
         other_vote_pct = `Others...22`,
         other_ev = `Others...23`) %>%
####### We drop the repeated header row and the "U.S. Total" row.
####### %in% checks every row against both values; != with a two-value vector would recycle the values and only compare alternating rows.
  filter(!state_district %in% c("State/district", "U.S. Total")) %>%
  select(-`Margin...24`, -`Margin...25`, -`Total votes...26`, -`Total votes...27`) %>%
####### Now, we must remove the percentage signs and commas from the vote numbers so that R can recognize the variables as numbers.
####### We can do this using the "gsub()" function, where the first input is the part we want to replace, the second input is the replacement character ("" means to delete it and replace with nothing), and the third input is the variable name.
####### We can do the same with the footnote that Wiki included for Maine and Nebraska in the state_district column.
  mutate(state_district = as.character(gsub("[†]","", state_district)),
         obama2008_vote = as.numeric(gsub("[\\%,]","", obama2008_vote)),
         obama2008_vote_pct = as.numeric(gsub("[\\%,]","", obama2008_vote_pct)),
         obama2008_ev = as.numeric(gsub("[\\%,]","", obama2008_ev)),
         mccain_vote = as.numeric(gsub("[\\%,]","", mccain_vote)),
         mccain_vote_pct = as.numeric(gsub("[\\%,]","", mccain_vote_pct)),
         mccain_ev = as.numeric(gsub("[\\%,]","", mccain_ev)),
         nader_vote = as.numeric(gsub("[\\%,]","", nader_vote)),
         nader_vote_pct = as.numeric(gsub("[\\%,]","", nader_vote_pct)),
         nader_ev = as.numeric(gsub("[\\%,]","", nader_ev)),
         barr_vote = as.numeric(gsub("[\\%,]","", barr_vote)),
         barr_vote_pct = as.numeric(gsub("[\\%,]","", barr_vote_pct)),
         barr_ev = as.numeric(gsub("[\\%,]","", barr_ev)),
         baldwin_vote = as.numeric(gsub("[\\%,]","", baldwin_vote)),
         baldwin_vote_pct = as.numeric(gsub("[\\%,]","", baldwin_vote_pct)),
         baldwin_ev = as.numeric(gsub("[\\%,]","", baldwin_ev)),
         mckinney_vote = as.numeric(gsub("[\\%,]","", mckinney_vote)),
         mckinney_vote_pct = as.numeric(gsub("[\\%,]","", mckinney_vote_pct)),
         mckinney_ev = as.numeric(gsub("[\\%,]","", mckinney_ev)),
         other_vote = as.numeric(gsub("[\\%,]","", other_vote)),
         other_vote_pct = as.numeric(gsub("[\\%,]","", other_vote_pct)),
         other_ev = as.numeric(gsub("[\\%,]","", other_ev))) %>%
  retype()

put_your_2012_df_name_here %>%
  rename(obama2012_vote = `Barack ObamaDemocratic...2`,
         obama2012_vote_pct = `Barack ObamaDemocratic...3`,
         obama2012_ev = `Barack ObamaDemocratic...4`,
         romney_vote = `Mitt RomneyRepublican...5`,
         romney_vote_pct = `Mitt RomneyRepublican...6`,
         romney_ev = `Mitt RomneyRepublican...7`,
         johnson_vote = `Gary JohnsonLibertarian...8`,
         johnson_vote_pct = `Gary JohnsonLibertarian...9`,
         johnson_ev = `Gary JohnsonLibertarian...10`,
         stein_vote = `Jill SteinGreen...11`,
         stein_vote_pct = `Jill SteinGreen...12`,
         stein_ev = `Jill SteinGreen...13`,
         other_vote = `Others...14`,
         other_vote_pct = `Others...15`,
         other_ev = `Others...16`) %>%
  slice(c(-1, -2, -59)) %>%
  select(-`...1`, -`Margin...17`, -`Margin...18`, -`Total...19`, -`Total...20`) %>%
####### Instead of spending a lot of time on a substitution here, we just manually enter columns with entries for the state/district and electoral votes (for some reason, Wiki did not include 2012 electoral vote counts for each state in the table).
####### Notice the notation, where the first entry is the new column name followed by all of the entries.
####### Because the state/district column is character strings, each entry must be surrounded by quotations.
####### We specify where the column should go in the data frame with the second input in the function, .before.
####### We do the same with the electoral vote manual column, but we do not have to include quotations because these entries are numbers and the first .before input will continue for the electoral vote column too.
  add_column(state_district = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut",
                                "Delaware", "District of Columbia", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois",
                                "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maine's 1st",
                                "Maine's 2nd", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
                                "Montana", "Nebraska", "Nebraska's 1st", "Nebraska's 2nd", "Nebraska's 3rd", "Nevada", "New Hampshire",
                                "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma",
                                "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas",
                                "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"),
             .before = 1,
             electoral_votes = c(9, 3, 11, 6, 55, 9, 7, 3, 3, 29, 16, 4, 4, 20,
                                 11, 6, 6, 8, 8, 2, 1, 1, 10, 11, 16, 10, 6, 10,
                                 3, 2, 1, 1, 1, 6, 4, 14, 5, 29, 15, 3, 18, 7,
                                 7, 20, 4, 9, 3, 11, 38, 6, 3, 13, 12, 5, 10, 3)) %>%
  mutate(obama2012_vote = as.numeric(gsub("[\\%,]","", obama2012_vote)),
         obama2012_vote_pct = as.numeric(gsub("[\\%,]","", obama2012_vote_pct)),
         obama2012_ev = as.numeric(gsub("[\\%,]","", obama2012_ev)),
         romney_vote = as.numeric(gsub("[\\%,]","", romney_vote)),
         romney_vote_pct = as.numeric(gsub("[\\%,]","", romney_vote_pct)),
         romney_ev = as.numeric(gsub("[\\%,]","", romney_ev)),
         johnson_vote = as.numeric(gsub("[\\%,]","", johnson_vote)),
         johnson_vote_pct = as.numeric(gsub("[\\%,]","", johnson_vote_pct)),
         johnson_ev = as.numeric(gsub("[\\%,]","", johnson_ev)),
         stein_vote = as.numeric(gsub("[\\%,]","", stein_vote)),
         stein_vote_pct = as.numeric(gsub("[\\%,]","", stein_vote_pct)),
         stein_ev = as.numeric(gsub("[\\%,]","", stein_ev)),
         other_vote = as.numeric(gsub("[\\%,]","", other_vote)),
         other_vote_pct = as.numeric(gsub("[\\%,]","", other_vote_pct)),
         other_ev = as.numeric(gsub("[\\%,]","", other_ev))) %>%
  retype()

### Join Website/Excel Data Frames

After importing both the 2008 and 2012 election data
from the Wikipedia pages and Excel csv files, we can join the website and Excel data frames. Be sure to end up with 2 finalized data frames, one each for the 2008 and 2012 election data. The key to doing this is clearly labeling stored objects. For example, do not label both data frames "election_data". Instead, come up with unique labels like "election_2008" or "final_2012".

```{r}
# Use this code chunk to join the Wikipedia and Excel data frames together, ensuring that the final result is 2 data frames, one for each year of election data.
```

## Data-Specific Questions

Now that we have a data frame for each election year (2008 and 2012), we are ready to dig into the data and see what we can find! Before that, verify that each variable name is unique and contains no spaces, each row is a state or district, and each variable has been correctly categorized (i.e., "dbl" or "int" for a number and "chr" for a word). If all of these look good, we can answer the questions below. Each question is listed with its assigned points. Carefully work through the code and check that every part is answered before submitting the assignment. Remember, you can submit this assignment twice, so the first submission does not have to be perfect if you are just looking for general feedback.

A. (3 pts.) What was the margin of the electoral vote between the major party candidates (Democratic and Republican) in each election?

```{r}

```

B. (3 pts.) What was the margin of the popular vote between the major party candidates (Democratic and Republican) in each election?

```{r}

```

C. (2 pts.) Which region of the country cast the most votes for the Libertarian party in 2008, accounting for differences in population? (Hint: Think in percentages, not raw numbers.)

```{r}

```

D. (2 pts.) Of states/districts beginning with the letter "M", which had the tightest margin between Obama and Romney? (Hint: To find the absolute value in R, use the abs() function around a formula or number.)

```{r}

```

E. (6 pts.)
For each party that participated in both elections (Democratic, Republican, Libertarian, and Green), what was the percentage change from 2008 to 2012? (Hint: % change over time = ((later - earlier)/earlier)*100.)

```{r}

```

F. (2 pts.) In the 2008 election, what percentage of total votes did the Constitution candidate (Chuck Baldwin) receive?

```{r}

```

G. (2 pts.) How many counties voted for each candidate in each election, based on the winner in each state?

```{r}

```

# Conclusion

Data analysis in R can be difficult, especially if this is your first time doing something like this assignment. For that reason, do not hesitate to come to office hours or email me individually for help! Once you are comfortable with your progress, email a PDF document or HTML notebook to me at erg5331@psu.edu so I can assess your work. I will comment on your work and give you another chance to submit the assignment if you did not get 100% on the first submission. For next week's lesson and activity, we will use some of the techniques from this lesson to dive into one of R's best features: graphics.
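
As a quick reference for the gsub() clean-up and the percent-change hint in question E, here is a minimal sketch on made-up numbers (not real election data). The helper name `to_number` and all of the values are hypothetical and chosen only for illustration; the pattern inside gsub() is the same one used in the clean-up code above.

```{r}
# Made-up vote strings in the style of a scraped table (NOT real election data)
votes_earlier <- c("1,200", "800")
votes_later   <- c("1,500", "600")

# Hypothetical helper: delete percent signs and commas, then convert to a number
to_number <- function(x) as.numeric(gsub("[\\%,]", "", x))

earlier <- to_number(votes_earlier)
later   <- to_number(votes_later)

# % change over time = ((later - earlier) / earlier) * 100
pct_change <- (later - earlier) / earlier * 100
pct_change
```

With these made-up values, the first entry grows by 25 percent and the second shrinks by 25 percent. The same two steps (clean the text, then apply the formula) carry over to any pair of numeric columns.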