---
title: "Intro to R training"
author: "Cigna Group AEDP"
date: "8/23/2024"
output:
  html_document: default
  word_document: default
---

-------------------------------------------

# Section 1: R Markdown Basics

-------------------------------------------

# R Markdown

## What is R Markdown?

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see .

R Markdown is a way to write R code that is easy to save and digest, because the code can be organized into separate chunks.

# R Basics (for people who have never seen R before)

## What am I currently looking at on my screen?

If you are unfamiliar with R, please refer to the attached Word document, which shows a screenshot of the screen and what the different buttons and areas look like.

## What are these blue #'s and what do they mean?

The blue #'s mark headings, which are one of the ways to format text. The number of #'s sets the heading level: the more #'s, the smaller the heading.

## What is a chunk of code?

A chunk of code is a formatted section of the document that looks like the example below. You can add a new chunk of code in an R Markdown file using Ctrl + Alt + I.

```{r CHUNKNAME}
#Code here
#to here
```

CHUNKNAME is the name of that specific chunk. Naming a chunk is optional, but it is good practice. The darker grey area inside the chunk is where you write your code.

To run this code you have a few options:

1) Click the play button on the side of the specific chunk to run that chunk only
2) Click the "Run" button on the top right of the screen, which lets you run everything you have highlighted, all chunks before the selected chunk, etc.
3) Ctrl + Enter to run the line of code you're in or the highlighted section
4) Ctrl + Shift + Enter to run the whole code chunk your cursor is in

An example of a completed chunk is below; try running the code. (Also note that the chunk is made using the backtick character `, which is the key below the Escape key, not to be confused with an apostrophe.)

When you knit the document (you can try this now, but it requires running all the lines of code in this document and may take a while), you will see this chunk show up with the code as well as the corresponding output.

# Exercise 1 - to be completed on your own before the R Training session

## Create a new chunk of code in the space below

## Copy and paste the following line of code into the Console below the first time you use RStudio at Cigna, and let it run

It may take over a minute to install.

install.packages("tidyverse")

```{r}
install.packages("tidyverse")
```

-------------------------------------------

# Section 2: Intro to R

-------------------------------------------

## Loading the package (you need to do this every time you open R)

```{r loadPackage}
#Note that when using the library() command you do not need quotation marks around the package name
#When you installed this package in Exercise 1 before the session, you needed tidyverse in quotes
library(tidyverse)
```

Tidyverse is the main package used in this training. It contains a group of packages, such as ggplot2 for visualizations (used later in this training) and dplyr for data wrangling.
For more details on the tidyverse package, check out this website: https://www.tidyverse.org/packages/

## Base R enables you to do things like arithmetic and variable definition, as you can see by executing the code chunk below

```{r SimpleAddition}
#Define two variables, add them, and print the result
x = 2
y = 5
z = x + y
z
```

## Useful Links

Longer tutorial video going over more of the basics (the video description has timestamps for specific topics, which may be useful): https://www.youtube.com/watch?v=ZYdXI1GteDE

List of commands, distributions and other basic lines of code: https://www.maths.usyd.edu.au/u/jchan/Rcommands.pdf

# Technical Training

## Basic R Coding (for people who have never seen R before)

Creating a variable

```{r}
#Both = and <- assign a value to a variable
x = 3
y = "Hello"
z <- 5
```

Creating a vector (a collection of values)

```{r}
v1 <- c(1, 2, 4, 6, 70)                   #combine values with c()
v2 = rep(5, 10)                           #the number 5 repeated 10 times
v3 = seq(from = 10, to = 2.5, by = -0.5)  #10, 9.5, 9, ..., 2.5
```

Selecting an item in a vector

```{r}
#Indexing starts at 1, so this returns the third item (4)
v1[3]
```

For loops

```{r}
#Print the numbers 1 through 10, one per line
for (f in 1:10){
  print(f)
}
```

## Looking at and loading data

### Loading a specific dataset

If the data in a CSV file has been uploaded to the Files section along with this R Markdown file, you should be able to run the following code. Excel files can be saved as CSV files.

```{r importCSVfile}
HealthData = read.csv("insurance.csv")
```

Data Source: https://www.kaggle.com/teertha/ushealthinsurancedataset (with some slight modifications)

### Looking at the actual data set

All of the data

```{r viewData}
#Uncomment the line below to view the data. It's been commented out so that it doesn't open a new tab every time you run this chunk
#View(HealthData)
#Try head(HealthData) and change n; the default is n=6 and thus 6 rows are returned
```

Select a specific column

```{r viewColumn}
#Uncomment the line below to view the data. It's been commented out so that it doesn't open a new tab every time you run this chunk
#View(HealthData$region)
```

The code above looks at the entire dataset as well as one individual column.
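If insurance.csv is not at hand, the same viewing commands can be tried on a small stand-in. The sketch below assumes a made-up data frame, ToyData, whose column names only mimic the training data; its values are invented for illustration.

```r
# ToyData is a hypothetical stand-in for HealthData; the values are invented
ToyData <- data.frame(
  age    = c(19, 33, 45, 52, 28, 61, 37),
  region = c("southwest", "northeast", "southeast", "northwest",
             "northeast", "southeast", "southwest"),
  stringsAsFactors = FALSE
)

# head() returns the first n rows (6 by default)
FirstThree <- head(ToyData, n = 3)
FirstThree

# The $ operator pulls out one column as a vector
Regions <- ToyData$region
Regions
```

Unlike View(), head() prints in the console, so it does not open a new tab each time the chunk runs.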
To select a column in R, use the following format: DATASET$COLUMN_NAME

You may also view the data by clicking "HealthData" in the "Environment" pane in the upper right corner of your screen.

```{r removeVariables}
#Keep only the rows with a real age and a real region
HealthData = HealthData[HealthData$age != "No Age", ]
HealthData = HealthData[HealthData$region != "null", ]
```

The first line above sets the "HealthData" dataset (left side of the equals sign) equal to all rows in the current "HealthData" dataset where the age is not equal to "No Age". This effectively deletes all entries with "No Age". The second line does the same for rows where the region is "null".

Another useful but not necessary command is selecting a subsection of the data. For instance, what if you needed to look at data in the northeast region only?

```{r}
NorthEastData = HealthData[HealthData$region=='northeast',]
```

This is called slicing and can be very useful when working with very large data sets. More can be found in this (really good) guide: https://bookdown.org/ndphillips/YaRrr/slicing-dataframes.html

### Summary of all variables in data; Structure of the data frame

```{r Summary1}
# These are two different ways to look at the variable types in a data frame and some of their values
summary(HealthData)
str(HealthData)
```

A couple of things to note about the summary() command:

For categorical (non-numeric) variables stored as factors you will get frequency counts. For example, for the sex variable we get the count of female and male. (Columns stored as plain text only show their length and type.)

For numeric variables you get the summary stats (min, 1st quartile, median, etc.).

Age, which should be a numeric variable, does not get summary stats, meaning it is being treated as a categorical variable and not a numeric one. Let's convert it to a numeric variable.

```{r convertAge}
#as.character() first, so the conversion also works if age was read in as a factor
HealthData$age = as.numeric(as.character(HealthData$age))
```

Let's do the summary again!

```{r Summary2}
summary(HealthData)
```

Your data is now all set to use.
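The cleaning steps in this section (drop placeholder rows, convert the column to numeric, slice by region) can be checked end to end on a small example. This is only a sketch: ToyData and its values are made up, and as.character() is included so the conversion behaves the same whether the column was read as text or as a factor.

```r
# ToyData is a hypothetical stand-in for HealthData; the values are invented
ToyData <- data.frame(
  age    = c("19", "No Age", "45", "52", "28"),
  region = c("southwest", "northeast", "null", "northwest", "northeast"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose age and region hold real values
ToyData <- ToyData[ToyData$age != "No Age", ]
ToyData <- ToyData[ToyData$region != "null", ]

# Convert age from text to numbers
ToyData$age <- as.numeric(as.character(ToyData$age))

# Slice out one region, as was done for NorthEastData
ToyNorthEast <- ToyData[ToyData$region == "northeast", ]

str(ToyData)
mean(ToyData$age)
```

Once age is numeric, functions such as mean() and summary() treat it as a number rather than as a label.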
-------------------------------------------

# Section 3: Data Visualizations using GGPLOT

-------------------------------------------

## Data Visualization (for people who have a rudimentary understanding of R)

A very well known and useful data visualization tool is ggplot. ggplot2 is what is known as a package: a collection of code that is not included in base R but can be loaded for use.

### ggplot()

A useful ggplot tutorial can be found here: http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html

ggplot works by first creating a ggplot "object"

ggplot(DATASET, aes(x=XVARIABLE,y=YVARIABLE))

and then adding a command to it that dictates what type of visual to make. Here are some examples:

geom_bar() - bar chart

geom_boxplot() - box plot

geom_density() - density graph

See more at https://ggplot2.tidyverse.org/reference/

Here are a couple of full examples for the HealthData dataset.

### Histogram of claim amounts with line through average

```{r}
ggplot(HealthData, aes(x=charges)) +
  geom_histogram(color="red",fill="orange") +
  geom_vline(aes(xintercept=mean(charges)),color="blue")
```

### Scatterplot of age and charges with color changing based on children

```{r}
ggplot(HealthData, aes(x=age,y=charges,color=children)) +
  geom_point()
```

### Box plot of charges for different regions

```{r}
ggplot(HealthData, aes(x=region,y=charges)) +
  geom_boxplot()
```

### Exercise 2 - TRY YOUR OWN HERE. Work together.

What relationships can you find in the data? Try to investigate some relationship we haven't looked at so far (it doesn't just have to be for the target variable charges/claims!)
```{r}

```

-------------------------------------------

# Section 4: More Advanced: DPLYR and Data Wrangling

-------------------------------------------

## Dplyr (more advanced applicable work)

Basic intro: https://dplyr.tidyverse.org/

Really useful 'cheat sheet': https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf

Dplyr is a way to manipulate very large data sets (like 100k+ rows). Many data sources (claims data, pricing data) can be accessed in R, and some teams may utilize this. If you are familiar with it, dplyr can be extremely useful compared to Excel: Excel often crashes if you ever have to do a VLOOKUP on 100k rows, while dplyr can do that in less than a minute on 5x the rows. It is also great at summarizing or manipulating large datasets. There is definitely a learning curve, but dplyr can best be thought of as something similar to SQL, though not as clunky or confusing.

Fortunately dplyr is a part of tidyverse, so you don't need to execute library(dplyr), since library(tidyverse) already loaded it earlier in this file.

The '%>%' symbol is called a "pipe operator." In the examples below, it essentially passes the HealthData data frame into the function filter.
The same output can be produced without the pipe, as the second command in the chunk below shows.

### Filter by claims over 40000

```{r}
#Using pipe operators to define HealthDataOver40000
HealthDataOver40000 = HealthData %>% filter(charges>40000)
#This should also yield the same output, and it doesn't define a variable with the output
filter(HealthData, charges>40000)
```

### Yet another way to filter by claims over 40000, using a predefined variable 'z'

```{r}
z = 40000
HealthData %>% filter(charges>z)
```

### Filter by claims over 40000 for men in the southeast

```{r}
HealthData %>% filter(charges>40000,sex=='male',region=='southeast')
```

### Order by descending region and then ascending children

```{r}
HealthData %>% arrange(desc(region),children)
```

### Select age and sex of people with claims over 40000 in the southeast and sort by descending age

```{r}
HealthData %>%
  filter(charges>40000,region=='southeast') %>%
  select(age,sex) %>%
  arrange(desc(age))
```

### Create new column that is charge per child

```{r}
#Rows with 0 children divide by zero and come out as Inf
HealthData %>% mutate(childcharge = charges/children)
```

### Create new column that is charge per child with 0 set when children is 0

```{r}
HealthData %>% mutate(childcharge = ifelse(children==0,0,charges/children))
```

### Summarize average children

```{r}
HealthData %>% summarize(avg_child = mean(children))
```

### Summarize average children per region

```{r}
HealthData %>%
  group_by(region) %>%
  summarize(avg_child = mean(children))
```

### Summarize total claims per region and gender

```{r}
Claims = HealthData %>%
  group_by(region,sex) %>%
  summarize(totalclaims = sum(charges))
Claims
```

### Summarize total claims by region and gender for people over age 30 in ascending order by gender

```{r}
HealthData %>%
  filter(age>30) %>%
  group_by(region,sex) %>%
  summarize(totalclaims = sum(charges)) %>%
  arrange(sex)
```

## Exercise 3 -- Try it

### Summarize average bmi for individuals over age of 30 grouped by gender

```{r}
HealthData %>%
  filter(age>30) %>%
  group_by(sex) %>%
  summarize(AverageBMI = mean(bmi))
```

## Exercise 4 -- Try it

### Select the min and max charges for each gender (hint: use summarize)

```{r}
HealthData %>%
  group_by(sex) %>%
  summarize(MinCharges = min(charges), MaxCharges = max(charges))
```

## Exercise 5 -- Try it

### Get average claims by region for individuals who have claims over the median amount of claims

You will need multiple lines of code, but remember that dplyr can use other R functions and variables!

```{r}
HealthData %>%
  filter(charges > median(charges)) %>%
  group_by(region) %>%
  summarize(AvgClaims = mean(charges))
```

-------------------------------------------

# Section 5: More working with data

-------------------------------------------

## One of the largest advantages of R is being able to join data, which is significantly faster than in Excel (understatement)

The package readxl is installed as part of tidyverse, so we don't need to install it again. However, library(tidyverse) does not load it, so it has to be loaded with its own library command, as in the chunk below.

```{r}
library(readxl)
#The following commands read in two different sheets of data from the Hospital excel file and create two new data frames
HospitalData = read_xlsx("HospitalData.xlsx",sheet = "data")
HospitalProcedures = read_xlsx("HospitalData.xlsx",sheet = "hospital")
```

```{r}
#The following are two ways to left join the two hospital data frames. The first one creates a new data frame "JoinedHospital"
JoinedHospital = HospitalData %>% left_join(HospitalProcedures,by=c("HospitalID"="HosTag"))
#The second one prints the joined result without saving it
left_join(HospitalData,HospitalProcedures, by=c("HospitalID"="HosTag"))
```

-------------------------------------------

# Section 6: Modeling topics that can be helpful for Exam PA

-------------------------------------------

## Modeling

If, instead of just looking at relationships, we want to develop some way to predict claims, we can make a model in R to do so. This is not something you are expected to know or pick up immediately, but it's good to know the relationships and models you can create in R versus Excel.
### Tree Model

Install needed packages

```{r}
#Uncomment and install the rpart and rpart.plot packages if you haven't installed them before.
#install.packages(c("rpart", "rpart.plot"))
```

Load Packages

```{r}
library(rpart)
library(rpart.plot)
```

Make Tree Diagram

We have to add a categorical variable called LowHighClaims, because the classification tree used here (method='class') needs a categorical variable as its target. The "LowHighClaims ~ . - charges" portion of the statement sets LowHighClaims as the target variable. The "." says to use all other variables as predictors, and "- charges" excludes the charges variable from them.

```{r}
#Label each row as High or Low claims, using 15000 as the cutoff
HealthData$LowHighClaims = ifelse(HealthData$charges>15000,"High","Low")
treemodel = rpart(LowHighClaims ~ . - charges,data=HealthData,method='class',control=rpart.control(minbucket = 20, cp=.05,maxdepth = 4))
rpart.plot(treemodel)
```

The result is a very simple model with one split. It classifies an individual as having high claims if the smoker variable is equal to yes and low claims if the smoker variable is equal to no.

### GLM

GLM stands for generalized linear model, which is something you have likely seen in your classes before. Using a combination of the predictor variables, a GLM creates an equation that relates them to the target variable, claims.

```{r}
#Remove the helper column so it is not used as a predictor
HealthData$LowHighClaims = NULL
glm1 = glm(charges~.,data=HealthData)
summary(glm1)
```

The model above is a simple linear model, interpreted in the manner of $y = b_0 + b_1 x_1 + b_2 x_2 + \dots$

We can also make more complex models, for example an exponential model:

```{r}
HealthData$LowHighClaims = NULL
glm2 = glm(log(charges)~.,data=HealthData)
summary(glm2)
```

This model can be written as $\log(y) = b_0 + b_1 x_1 + b_2 x_2 + \dots$ (in R, log() is the natural log). We would do this log transformation if we did not believe the target variable was normally distributed, and looking back at the histogram we developed earlier, it does not seem to be. Instead we are claiming that log(y) is normally distributed.
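One practical consequence of modeling log(y) is that predictions come back on the log scale and have to be passed through exp() to return to dollars. The sketch below uses simulated data (SimData, with invented coefficients), not the insurance data, so treat it as an illustration only.

```r
set.seed(1)  # make the simulated data reproducible

# SimData is simulated; the coefficients 8 and 0.03 are invented
n <- 200
SimData <- data.frame(age = sample(18:64, n, replace = TRUE))
SimData$charges <- exp(8 + 0.03 * SimData$age + rnorm(n, sd = 0.3))

# Fit the log model: log(charges) = b0 + b1 * age
LogModel <- glm(log(charges) ~ age, data = SimData)

# Predictions are on the log scale; exp() converts them back to dollars
LogPred  <- predict(LogModel, SimData, type = "response")
BackPred <- exp(LogPred)

coef(LogModel)
head(BackPred)
```

Note that exp() of a predicted log value sits closer to a typical (median) charge than to the average charge, which is one reason the back-transformed numbers can look low next to a plain mean.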
You can also create models using only specific variables

```{r}
HealthData$LowHighClaims = NULL
glm3 = glm(charges~smoker+age,data=HealthData)
summary(glm3)
```

You can create even more advanced models by changing the distribution that the model follows, and how the mean relates to the predictors, using different distribution families and link functions. This is far more complex, but you can read up on it at these links:

Link functions versus data transformations: https://www.theanalysisfactor.com/the-difference-between-link-functions-and-data-transformations/

More detailed description of family and link functions: https://www.sagepub.com/sites/default/files/upm-binaries/21121_Chapter_15.pdf (the first couple of pages cover family and link functions; more advanced topics are covered later on)

### Using a model to add a predicted variable

```{r}
HealthData$PredictedClaimsModel3 = predict(glm3,HealthData,type='response')
#Uncomment the line below to view the data. It's been commented out so that it doesn't open a new tab every time you run this chunk
#View(HealthData)
```
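As a sketch of the family and link idea mentioned above: instead of transforming the target, glm() can be told to use a different distribution and link through its family argument. The example below uses simulated data (SimData, with invented coefficients) and a Gamma family with a log link, which is one common choice for positive, right-skewed amounts. It is an illustration under those assumptions, not a recommendation for the insurance data.

```r
set.seed(2)  # make the simulated data reproducible

# SimData is simulated; the coefficients 8 and 0.03 are invented
n <- 500
SimData <- data.frame(age = sample(18:64, n, replace = TRUE))
mu <- exp(8 + 0.03 * SimData$age)
SimData$charges <- rgamma(n, shape = 2, rate = 2 / mu)

# Gamma family with a log link: log(mean of charges) = b0 + b1 * age
GammaModel <- glm(charges ~ age, family = Gamma(link = "log"), data = SimData)

# type = "response" returns predictions already on the dollar scale
SimData$PredictedCharges <- predict(GammaModel, SimData, type = "response")

coef(GammaModel)
head(SimData)
```

Because the link is part of the model here, no exp() step is needed after predict() when type = "response" is used.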