Introduction to R for
Health Data Science

Hands-on training

Carlos Matos // ISPUP // November 2023

Welcome!

Language


  • What language do we use?


Checkpoint


Open RStudio!

Check R version

#Check R version
sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Europe/Lisbon
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [5] purrr_1.0.2     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
 [9] ggplot2_3.4.4   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.4      jsonlite_1.8.7    compiler_4.3.2    tidyselect_1.2.0 
 [5] systemfonts_1.0.5 scales_1.2.1      textshaping_0.3.7 yaml_2.3.7       
 [9] fastmap_1.1.1     R6_2.5.1          generics_0.1.3    knitr_1.45       
[13] munsell_0.5.0     pillar_1.9.0      tzdb_0.4.0        rlang_1.1.1      
[17] utf8_1.2.4        stringi_1.8.1     xfun_0.41         timechange_0.2.0 
[21] cli_3.6.1         withr_2.5.2       magrittr_2.0.3    digest_0.6.33    
[25] grid_4.3.2        rstudioapi_0.15.0 hms_1.1.3         lifecycle_1.0.4  
[29] vctrs_0.6.4       evaluate_0.23     glue_1.6.2        ragg_1.2.6       
[33] fansi_1.0.5       colorspace_2.1-0  rmarkdown_2.25    tools_4.3.2      
[37] pkgconfig_2.0.3   htmltools_0.5.7  

Packages

  • List of packages that we will be using throughout the course
  • Copy the code below to RStudio and run
#Install all course packages (only needs to be done once)
install.packages(
  c("tidyverse","janitor","gapminder","medicaldata","ggstatsplot",
  "outbreaks","crosstable","lme4", "datasauRus", "kableExtra", "ineptR",
  "patchwork","showtext","car","ggpmisc","MASS", "report", "glue",
  "survival","ggsurvfit","devtools","scales", "eurostat", "leaflet")
)

How I have used R

  • Public Health Medical Doctor @ Public Health Department
  • Started using R during the COVID-19 pandemic
    • Epicurves

COVID-19 epicurve. Dates and counts are omitted for anonymity

How I have used R

  • Public Health Medical Doctor @ Public Health Department
  • Started using R during the COVID-19 pandemic
    • Epicurves
    • Forecasting
    • Automating procedures

COVID-19 case number forecasts by reporting date. Dates and counts are omitted for anonymity

How I have used R

  • Data analysis
    • Deaths attributable to Covid, Influenza and extreme temperatures
  • Developed the ineptR package to facilitate and automate data extraction from Statistics Portugal with R

How I have used R

  • Now working on improving dataviz skills and portfolio…


Goals

Goals for this course

Always learning
  • Be a learning catalyst
    • Know what R is capable of
    • Learn how to search the web for answers
  • Gain a solid understanding of data wrangling with the tidyverse
  • Learn the syntax of statistical models in R
  • Be empowered to create and edit charts with ggplot

Goals for this course

Always learning
  • Communicate your results
    • Static reports
    • Dynamic dashboards
  • Reproducible research and collaboration with version control
  • A first step in the migration from other software!

Methods

  • Many worked examples
    • Assuming no prior knowledge
  • Start simple and increase complexity over time
  • Minimize redundancy
    • R has many ways to achieve the same results. Choose one and stick to it.

Cognitive load theory

Methods

  • We will be working side by side in R
  • Slides are available on the course website
  • You can copy the code from the code chunks and paste in R
    • I recommend that you use this approach as a last resort
    • It’s better if you write the code manually, to get a feel for shortcuts, code completion, bracket auto-closing and other RStudio quality of life features

Before we go into R…


Intro to R


Why R?

  • Free and Open Source
  • Workflow and analysis reproducibility
  • Community engagement
    • Pretty much all your future questions are already answered online
    • You just need to ask the right questions
  • Certain level of complexity
    • BUT, tidyverse makes it way more approachable

RStudio

R and RStudio

R vs RStudio

RStudio

Anatomy of RStudio

Frequently used shortcuts

  • For future reference
    • Esc to interrupt current execution
    • Alt/Option + - to insert <- (assignment operator)
    • Ctrl/Cmd + Shift + m to insert %>% (pipe)
    • Ctrl/Cmd + Enter to run current line/selection of code
    • Ctrl/Cmd + f to find and replace in current script

Useful functionalities

  • F1 or ? for help
  • TAB for autocomplete
  • Plot auto preview in Plots pane
  • UP and DOWN for history tracking
  • Parenthesis/brackets autoclose and highlight

Good vs bad code

The single most important thing to remember

COMMENT YOUR CODE!


Programmer and God…

Comment your code!

Key concepts

R vs R Packages

  • Objects - Everything we store in R. Can be variables, datasets, graphs, etc. Objects are assigned a name, which can be referenced in later commands.

  • Functions - A function is a code operation that accepts inputs and returns a transformed output. Functions are the basic unit of a package. Read more in the Functions section.

  • Packages - A bundle of functions that can be shared.

  • Scripts - A document/file that stores a set of commands.

Packages

  • Packages can be downloaded and installed locally with install.packages("package")
  • Once installed, the package is stored in your library
  • To use the package in the current session, we need to load the package with library(package)
    • Needs to happen in each session
  • Packages are more frequently installed from
    1. CRAN (Comprehensive R Archive Network)
    2. GitHub
  • Update a package in the packages pane, in RStudio

Functions

  • Consider the following code that calls the fictitious function get_prognosis(), to get the prognosis of a patient:
patient_prognosis <- 
  get_prognosis(gender = "Male", 
                age = 45,
                comorbidities = c("diabetes", "hypertension")
                )
  • We are calling the function get_prognosis() with 3 arguments (gender, age, and comorbidities), and storing the resulting calculation in an object called patient_prognosis.
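
To make the call above runnable, here is a toy definition of get_prognosis(). The function is fictitious, so everything inside it is an assumption: the scoring rule is invented for illustration and has no clinical meaning.

```r
# Toy definition of the fictitious get_prognosis().
# The scoring rule is invented for illustration and has no clinical meaning.
get_prognosis <- function(gender, age, comorbidities = character(0)) {
  # gender is accepted to mirror the call above, but this toy rule ignores it
  risk_points <- length(comorbidities) + as.integer(age >= 65)
  if (risk_points >= 2) "guarded" else "good"
}

# Arguments are passed by name; the result is stored in an object
patient_prognosis <- get_prognosis(gender = "Male",
                                   age = 45,
                                   comorbidities = c("diabetes", "hypertension"))
patient_prognosis
# [1] "guarded"
```

The last expression evaluated inside a function is its return value, which is why no explicit return() is needed here.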

Working with data

Data types

  • String or character
"The patient has diabetes"
  • Number (integer or double)
42L
42
  • Logical
TRUE
FALSE
[1] TRUE
[1] FALSE

Strings or Characters

  • Surrounded by double " or single ' quotes
"abc"
[1] "abc"
typeof("abc")
[1] "character"
"1" #if surrounded by quotes, it's a character.
[1] "1"
  • Some operations are not available with strings
1 + 1 #No error
[1] 2
1 + "A" #Error. Can't sum a number and a string
Error in 1 + "A": non-numeric argument to binary operator

Numbers

Integers

  • Can be of type integer or double
  • integer comes with the letter L right after the number
typeof(1)
[1] "double"
typeof(1L)
[1] "integer"
1L + 2.5 #When summing an integer and a double, R knows we want a double
[1] 3.5
  • Integers are more relevant for low level programming, not very much for our use cases. We will always use doubles

Numbers

Double / Numeric

  • In most cases numbers will be stored as double
  • Used to represent any real number
typeof(42)
[1] "double"
typeof(3.14)
[1] "double"
typeof(-5e-10)
[1] "double"
typeof(1/5)
[1] "double"

Logicals

  • Very frequently used for conditional logic (if else statements)
  • We will use them inside tidyverse functions
TRUE
[1] TRUE
FALSE
[1] FALSE
c(T,F)
[1]  TRUE FALSE
2 > 1
[1] TRUE
c(1,2,3) > c(3,2,1) #Vectorised 'greater than' operation
[1] FALSE FALSE  TRUE
c(as.logical(0), as.logical(1)) #0's and 1's can be interpreted as logicals
[1] FALSE  TRUE
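
A short sketch of the conditional logic mentioned above, using only base R. The variable names and the 140 threshold are illustrative assumptions, not clinical guidance.

```r
# Illustrative threshold only, not clinical guidance
systolic <- 150

# if / else works on a single logical value
if (systolic >= 140) {
  label <- "elevated"
} else {
  label <- "not elevated"
}
label
# [1] "elevated"

# ifelse() is the vectorised version: one answer per element
readings <- c(118, 150, 135)
labels <- ifelse(readings >= 140, "elevated", "not elevated")
labels
# [1] "not elevated" "elevated"     "not elevated"
```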

Type coercion

  • We can (and very much want to!) convert some data to other types
    • e.g. we import a dataset with a character column (e.g. outcome: “dead” or “alive”) that we want to convert to 0’s and 1’s for logistic regression
  • R has functions with syntax as.something(), that allow conversion of some types into others
as.logical(1)
[1] TRUE
as.integer("1984")
[1] 1984
as.character(42)
[1] "42"
as.numeric("Some text") #No error! Returns NA with a warning.
Warning: NAs introduced by coercion
[1] NA

Important

When converting to another data type, values that cannot be converted become NA, and the original values are lost.
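
The dead/alive example mentioned earlier could be sketched as follows. The vector and its values are made up for illustration. Anything that is neither "dead" nor "alive" is deliberately turned into NA.

```r
# Made-up outcome column, including one unexpected value
outcome <- c("dead", "alive", "alive", "dead", "unknown")

# 1 for "dead", 0 for "alive", NA for anything else
outcome_binary <- ifelse(outcome == "dead", 1,
                         ifelse(outcome == "alive", 0, NA))
outcome_binary
# [1]  1  0  0  1 NA

# Always check how many values were lost in the conversion
sum(is.na(outcome_binary))
# [1] 1
```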

Data structures

  • values
single_number <- 10
single_number
[1] 10
  • vectors
example_vector <- c("A","B","C")
example_vector
[1] "A" "B" "C"

Data structures

  • lists
example_list <- list("A",1,c(TRUE,FALSE))
example_list
[[1]]
[1] "A"

[[2]]
[1] 1

[[3]]
[1]  TRUE FALSE
  • data frames
example_dataframe <- 
  data.frame(id = c(1,2,3),
             name = c("Jon","Tyrion","Arya"))
example_dataframe
  id   name
1  1    Jon
2  2 Tyrion
3  3   Arya
  • Matrices, arrays

Values

  • The simplest data structures
  • Can be of any type
answer_to_life_the_universe_and_everything <- 42 # A numeric value
something_not_true <- FALSE

Vectors

  • Set of values of the same data type
  • They are created with the concatenate function c()
logicals <- c(TRUE, F, FALSE, T)
logicals
[1]  TRUE FALSE FALSE  TRUE
integers <- c(1:10) #sequence of numbers from 1 to 10
integers
 [1]  1  2  3  4  5  6  7  8  9 10
doubles <- integers + 0.5
doubles
 [1]  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5
characters <- c("a","b","c","d") #remember the quotes!
characters
[1] "a" "b" "c" "d"
some_long_vector <- seq(0,10,0.1) #all the values from 0 to 10, in 0.1 increments

Vectors

R is a language built around vectors!
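
A small sketch of what "built around vectors" means in practice: arithmetic is applied element by element, with no loop. The weights and heights below are made-up values.

```r
# Made-up measurements for three patients
weight_kg <- c(70, 82, 55)
height_m  <- c(1.75, 1.80, 1.60)

# One expression computes the BMI of every patient at once
bmi <- weight_kg / height_m^2
round(bmi, 1)
# [1] 22.9 25.3 21.5
```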


Type coercion

Warning

Beware of automatic type coercion when creating vectors or, more often, importing data!

years_vector <- c(2020,2021,"202a2",2023)
typeof(years_vector) #We expected 'double' but got 'character', because one year had a typo
[1] "character"
#Let's convert the vector to integer: 
as.integer(years_vector) #the year with a typo was converted to NA because R couldn't figure out what we wanted
[1] 2020 2021   NA 2023
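
One possible way to find out which entries failed to convert, sketched with base R only. The object names years_numeric and failed are assumptions introduced for this example.

```r
years_vector <- c(2020, 2021, "202a2", 2023)

# Convert, silencing the expected coercion warning
years_numeric <- suppressWarnings(as.integer(years_vector))

# Entries that became NA during conversion (but were not NA before)
failed <- is.na(years_numeric) & !is.na(years_vector)
years_vector[failed]
# [1] "202a2"
```

Inspecting the failed entries before discarding them makes it easier to fix typos at the source.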

Vectors

Type coercion

When a vector mixes data types, all of its elements are automatically coerced to the most flexible type present, following this hierarchy:

Logical → Integer → Numeric (double) → Character
# What is the data type of each of the following objects:
c(1, 2, 3)
c('a', 'b', 'c') 
c("d", "e", "f") 
c(TRUE,1L,10)
c("11",10,12)
c("Diabetes","Cancer", FALSE)
typeof(c(1, 2, 3)) #double
typeof(c('a', 'b', 'c')) #character
typeof(c("d", "e", "f")) #character
typeof(c(TRUE,1L,10)) #double
typeof(c("11",10,12)) #character
typeof(c("Diabetes","Cancer", FALSE)) #character

Vectors

  • Vector elements can have names
    • Name can be given when creating the vector, or at a later stage
#Names attributed when creating a vector
my_named_vector <- c(value_one = "A",
                     value_two = "C")
my_named_vector
value_one value_two 
      "A"       "C" 
#Names attributed to an existing vector
my_new_vector <- c(1:4)
names(my_new_vector) <- c("a","b","c","d")
my_new_vector
a b c d 
1 2 3 4 

Exploring vectors

Let’s look at the long vector we created earlier:

some_long_vector
  [1]  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3  1.4
 [16]  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4  2.5  2.6  2.7  2.8  2.9
 [31]  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9  4.0  4.1  4.2  4.3  4.4
 [46]  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9
 [61]  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9  7.0  7.1  7.2  7.3  7.4
 [76]  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9
 [91]  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9 10.0

What if we wanted to look only at some data points?

head(some_long_vector)
[1] 0.0 0.1 0.2 0.3 0.4 0.5
tail(some_long_vector)
[1]  9.5  9.6  9.7  9.8  9.9 10.0
head(some_long_vector, 3)
[1] 0.0 0.1 0.2

Exploring vectors

We can also look at specific sections/positions/indices using [] brackets

some_long_vector[5] #index 5
[1] 0.4
some_long_vector[c(5,10)] #indices 5 AND 10
[1] 0.4 0.9
some_long_vector[5:10] #indices 5 TO 10
[1] 0.4 0.5 0.6 0.7 0.8 0.9

Exploring vectors

Or look at all except some indices with the minus - sign

character_vector <- c("a","b","c","d","e","f")
character_vector[-2] #Excludes the second index
[1] "a" "c" "d" "e" "f"
character_vector[-c(2:4)] #Excludes indices 2 to 4
[1] "a" "e" "f"

Exploring vectors

  • Vector elements can be accessed by the name given to each index
#Created earlier
my_new_vector
a b c d 
1 2 3 4 
#Access the value that is named "a"
my_new_vector["a"]
a 
1 
#Access multiple values by name
my_new_vector[c("b","d")]
b d 
2 4 
#Useful logic to manually control plot colors
colors_for_a_plot <- c("Portugal" = "red", "Europe" = "blue", "World" = "gray")
colors_for_a_plot["Portugal"]
Portugal 
   "red" 

Exploring vectors

  • Vector elements can be accessed using a logical expression
#Let's create a numeric vector
x <- c(10, 33, NA, 4, 9, 2, NA)

#a logical vector. Is TRUE if x is NA
x_na <- is.na(x)
x_na
[1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
# All non-missing values of x
x[!x_na]
[1] 10 33  4  9  2
#All values greater than five (Note that NAs are included!!)
x[x > 5]
[1] 10 33 NA  9 NA

Exploring vectors

# All non-missing values of x
x[!x_na]
[1] 10 33  4  9  2
#All values GREATER THAN five (Note that NAs are included!!)
x[x > 5]
[1] 10 33 NA  9 NA
#All values EQUAL TO five (Note that NAs are included!!)
x[x == 5]
[1] NA NA

Note

! is the not/negation operator. It transforms TRUE to FALSE and vice-versa

Note

Comparisons involving NA return NA, for every comparison operator:
== (‘equal to’), != (not equal to), > (greater than), < (smaller than), >= (greater than or equal to), <= (smaller than or equal to)
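
Two common base R ways to keep NAs out of a logical subset, sketched with the same x as above.

```r
x <- c(10, 33, NA, 4, 9, 2, NA)

# Option 1: explicitly require the value to be non-missing
x[!is.na(x) & x > 5]
# [1] 10 33  9

# Option 2: which() returns only the positions where the test is TRUE
x[which(x > 5)]
# [1] 10 33  9

# Counting matches while ignoring NAs
sum(x > 5, na.rm = TRUE)
# [1] 3
```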

Modifying vectors

It’s possible to reassign values to a specified index of an existing vector

#assign the value "z" to the first position of the "character_vector"
character_vector[1] <- "z"
character_vector
[1] "z" "b" "c" "d" "e" "f"
#assign the values "four" and "five" to the 4th and 5th positions of the "character_vector"
character_vector[4:5] <- c("four","five")
character_vector
[1] "z"    "b"    "c"    "four" "five" "f"   

This approach can be used to add data to a vector

#Assign the value "some_value" to a new index
character_vector[7] <- "some_value"
character_vector
[1] "z"          "b"          "c"          "four"       "five"      
[6] "f"          "some_value"
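
A short sketch of two ways to grow a vector. The vector v is a new, made-up example. Note what happens when a position is skipped.

```r
v <- c("a", "b", "c")

# c() appends to the end without needing to know the next index
v <- c(v, "d")

# Assigning beyond the end fills the gap with NA
v[6] <- "f"
v
# [1] "a" "b" "c" "d" NA  "f"
```

Using c() avoids accidentally creating NA gaps when the index is miscounted.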

Lists

  • The most flexible object type
    • Can contain (almost) anything
  • Can be created with the function list(), similarly to c() for vectors
  • Lists are ordered, just like vectors
my_list <- list(c(1:3),"R",list(1:3))
my_list
[[1]]
[1] 1 2 3

[[2]]
[1] "R"

[[3]]
[[3]][[1]]
[1] 1 2 3

What differences do you see between the 1st and 3rd elements of the list? Shouldn’t they be the same?

Exploring lists

  • Lists can be tricky
my_list[1][1]
[[1]]
[1] 1 2 3
my_list[1][1][1]
[[1]]
[1] 1 2 3
my_list[1][1][1][1]
[[1]]
[1] 1 2 3

???????

Exploring lists

  • Lists can be tricky
my_list[1]
[[1]]
[1] 1 2 3
my_list[[1]]
[1] 1 2 3
  • A list can be accessed with single [ or double [[ brackets.
  • [ returns a smaller list, while [[ returns the contents of that smaller list.
  • Usually we want [[.
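
The difference between [ and [[ becomes visible when checking the class of the result, as in this sketch using the same my_list as above.

```r
my_list <- list(c(1:3), "R", list(1:3))

class(my_list[1])    # still a list (of length 1)
# [1] "list"
class(my_list[[1]])  # the integer vector stored inside
# [1] "integer"

# Only the contents can be used in calculations
sum(my_list[[1]])
# [1] 6

# [[ can be chained to reach nested elements
my_list[[3]][[1]]
# [1] 1 2 3
```

Calling sum(my_list[1]) instead would fail, because the argument is a list rather than a numeric vector.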

Exploring lists

https://adv-r.hadley.nz/subsetting.html#subset-single

Exploring lists

  • List elements can be named, just like vector elements
my_named_list <- list(some_vector = c(1:3),
                some_value = "R",
                other_value = list(1:3))

my_named_list
$some_vector
[1] 1 2 3

$some_value
[1] "R"

$other_value
$other_value[[1]]
[1] 1 2 3

Exploring lists

  • Lists can be accessed by name with $
  • $ works like [[
my_named_list$some_vector
[1] 1 2 3

Important

This behaviour is very relevant for our use cases, because data frames also behave the same way with column names!

#Statistical models output lists in R. 
#Select the coefficient associated with gdpPercap in the model below
#HINT: use the object explorer to visually explore the output
some_model_output <- lm(lifeExp ~ gdpPercap, data = gapminder::gapminder)
some_model_output$coefficients[[2]]
[1] 0.0007648826
some_model_output %>% 
  pluck("coefficients") %>% #pluck() extracts an element from a list by name
  magrittr::extract(2)
   gdpPercap 
0.0007648826 

Modifying lists

Lists

#Print the original list
my_list
[[1]]
[1] 1 2 3

[[2]]
[1] "R"

[[3]]
[[3]][[1]]
[1] 1 2 3

#Replace the first element of the list with the value "A"
my_list[1] <- "A"
#Replace the second value of the vector nested in the third element with "nested_modification"
my_list[[3]][[1]][2] <- "nested_modification"
my_list
[[1]]
[1] "A"

[[2]]
[1] "R"

[[3]]
[[3]][[1]]
[1] "1"                   "nested_modification" "3"                  

How much are we going to work directly with lists in this course?


Pretty much zero…

But we need to know how they work, because some outputs are lists (notably, outputs of statistical models), and we may want to grab some values from those lists (e.g. a p-value from a linear regression model)
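
As a sketch of the p-value case, the example below fits a linear model on mtcars, a dataset that ships with R (an assumption made here so that no extra package is needed). The object names are illustrative.

```r
# Fit a simple linear regression on a built-in dataset
model <- lm(mpg ~ wt, data = mtcars)

# summary() also returns a list; its coefficients element is a matrix
model_summary <- summary(model)
coef_table <- model_summary$coefficients

# Select by row name (the term) and column name (the p-value)
p_value_wt <- coef_table["wt", "Pr(>|t|)"]
p_value_wt
```

Selecting by name rather than by position keeps the code readable and robust if terms are added to the model.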

Data frames

  • A 2D object (aka, a table…)
    • You can think of it as a more rigorous Excel spreadsheet
  • Unquestionably the most useful storage structure for data analysis
  • Each column/variable is a vector
    • All values in a column ALWAYS have the same type (contrary to Excel, where a single column can mix types and cause errors)

Exploring data frames

  • Vectors and lists are 1D objects, therefore inside [] we only need to specify an index.
  • Dataframes are 2D, meaning that we need to specify 2 dimensions inside []: rows and columns
    • df[rows, cols] selects the given rows and columns (a single value if one of each is given)
    • df[rows, ] selects some rows, all columns
    • df[,cols] selects all rows, some columns

Exploring data frames

df <- data.frame(col1 = c(1,2,3),
                 col2 = c("A","B","C"))
df
  col1 col2
1    1    A
2    2    B
3    3    C
#Selecting value from third row, second col
df[3,2]
[1] "C"
#Selecting all cols from second row
df[2,]
  col1 col2
2    2    B

Exploring data frames

#Selecting all rows from first column by index
df[,1]
[1] 1 2 3
#Selecting all rows from first column by name
df[,"col1"]
[1] 1 2 3
#Or alternatively, by name with the $ operator
df$col1
[1] 1 2 3

Data frames vs tibbles

  • A tibble is the tidyverse version of a data frame
  • Very similar, with some quality of life improvements
  • Main differences
    • Tibbles don’t print all rows for large data frames, only the first 10
    • Stricter subsetting (need to specify entire correct name of variable, no abbreviations)
    • Less prone to errors (converting variable types by mistake)
df <- data.frame(some_col = c(1,2,3),
                 other_col = c("A","B","C"))
tb <- tibble(some_col = c(1,2,3),
             other_col = c("A","B","C"))
#Abbreviated column name
df$som #Works
[1] 1 2 3
tb$som #Returns NULL with a warning (no partial matching)
NULL

Modifying data frames

Note

I will henceforth use data frame and tibble interchangeably, unless otherwise specified, but we will be working with tibbles.

Note

Keep in mind that all the ways to access vectors and lists that we have seen before (e.g. using vectors, sequences or logical expressions to subset) also work with data frames, with the appropriate adaptations to 2D space

df <- tibble(col1 = c("A","B","C"), col2 = c(1,2,3), col3 = c(TRUE, FALSE, TRUE))
df[1:2,c(1,3)] # Select only rows one and two, from columns one and three
# A tibble: 2 × 2
  col1  col3 
  <chr> <lgl>
1 A     TRUE 
2 B     FALSE
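
Logical expressions can be used in the rows position too. The sketch below uses a base data.frame() so that it runs without extra packages. With a tibble, the last expression would return a one-column tibble instead of a vector.

```r
df <- data.frame(col1 = c("A", "B", "C"),
                 col2 = c(1, 2, 3),
                 col3 = c(TRUE, FALSE, TRUE))

# Rows where col2 is greater than 1, all columns (rows 2 and 3)
rows_over_one <- df[df$col2 > 1, ]
rows_over_one

# Rows where col3 is TRUE, only the col1 column
df[df$col3, "col1"]
# [1] "A" "C"
```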

Modifying data frames

  • Adding a column to a data frame
df <- tibble(col1 = c("A","B","C"),
             col2 = c(1,2,3),
             col3 = c(TRUE, FALSE, TRUE))

#Add a new variable to the tibble
df$my_awesome_new_col <- c("Awe-","wait for it","-some!")
df
# A tibble: 3 × 4
  col1   col2 col3  my_awesome_new_col
  <chr> <dbl> <lgl> <chr>             
1 A         1 TRUE  Awe-              
2 B         2 FALSE wait for it       
3 C         3 TRUE  -some!            

Important

The new column must have the same number of rows as the existing data frame (or be a single value, which is recycled to every row), or you will get an error.

Modifying data frames

#Create a temporary tibble
df <- tibble(col1 = c("A","B","C"),
             col2 = c(1,2,3),
             col3 = c(TRUE, FALSE, TRUE))

#Change the value of Row 3, Col 2
df[3,2] <- "some_character"
Error in `[<-`:
! Assigned data `"some_character"` must be compatible with existing
  data.
ℹ Error occurred for column `col2`.
Caused by error in `vec_assign()`:
! Can't convert <character> to <double>.
  • We get an error because we are trying to assign a character value to a numeric vector (col2).

Note

With data.frame() instead of tibble(), the whole column would be silently converted to character, which is usually not the desired result. This type of conversion is prone to cause errors in your code, particularly if you use new data of substandard quality.
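
The base R behaviour can be checked with a sketch like the one below. The object name df_base is an assumption introduced for this example.

```r
df_base <- data.frame(col1 = c("A", "B", "C"),
                      col2 = c(1, 2, 3))
typeof(df_base$col2)
# [1] "double"

# No error: the whole column is converted to character
df_base[3, 2] <- "some_character"
typeof(df_base$col2)
# [1] "character"

# Numeric operations on col2 (e.g. mean) no longer work as intended
df_base$col2
# [1] "1"              "2"              "some_character"
```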

Exercises