Exploratory Data Analysis (EDA)

Carlos Matos // ISPUP // November 2023

EDA

Perform initial investigations to identify patterns and anomalies and check assumptions, with the help of summary statistics and graphical representations.

Any method of looking at data that does not include formal statistical modelling and inference.

Goals of EDA

  • Identify patterns, errors, missing values, outliers
  • Check assumptions
  • Identify relationships among variables (direction, size)
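
As a rough illustration of the first goal, the base R sketch below counts missing values per column and flags candidate outliers with the 1.5 × IQR rule of thumb. The built-in `airquality` dataset and the helper name `flag_outliers` are assumptions chosen for illustration; they are not part of the course material.

```r
# Illustrative only: airquality ships with base R (datasets package)
data(airquality)

# Number of missing values in each column
n_missing <- colSums(is.na(airquality))
print(n_missing)

# Hypothetical helper: TRUE for values more than 1.5 * IQR beyond the quartiles
flag_outliers <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Candidate outliers in wind speed
wind_outliers <- airquality$Wind[flag_outliers(airquality$Wind)]
print(wind_outliers)
```

Flagged values are only candidates for a closer look; whether they are errors or genuine extremes is a judgment about the data, not something the rule decides.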

EDA

One way to categorize

  • Univariate vs Multivariate
  • Graphical vs Non-graphical

Univariate
Non-graphical

Univariate non-graphical

Categorical data

  • Tabulation (counts, proportions)
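#Base R
# polyps is assumed here to be the dataset shipped with the medicaldata package
library(medicaldata)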
table(polyps$sex)

female   male 
     9     13 
prop.table(table(polyps$sex))

   female      male 
0.4090909 0.5909091 
#crosstable
library(crosstable)

crosstable(data = polyps,
           cols = sex) %>% 
  as_flextable()

Univariate, categorical

label  variable  value
sex    female    9 (40.91%)
       male      13 (59.09%)

Univariate non-graphical

Quantitative data

  • Summary statistics
    • Central tendency
      • Mean
      • Median
    • Dispersion/Spread
      • Interquartile range (IQR)
      • Variance
      • Standard deviation
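
To make these definitions concrete, here is a minimal sketch that computes each statistic by hand and checks it against the corresponding built-in function. The vector `x` is a small made-up sample, used only for illustration.

```r
# Small made-up sample, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Central tendency
x_mean   <- sum(x) / n
x_median <- median(x)

# Dispersion: the sample variance divides by n - 1, matching var()
x_var <- sum((x - x_mean)^2) / (n - 1)
x_sd  <- sqrt(x_var)

# IQR is the distance between the 3rd and 1st quartiles
x_iqr <- diff(quantile(x, probs = c(0.25, 0.75), names = FALSE))

# The hand-computed values agree with the built-in functions
stopifnot(
  all.equal(x_mean, mean(x)),
  all.equal(x_var, var(x)),
  all.equal(x_sd, sd(x)),
  all.equal(x_iqr, IQR(x))
)
```

Note that `quantile()` supports several estimation types; the sketch relies on the default, which is also what `IQR()` uses.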

Univariate non-graphical

Quantitative data

#Base R
library(gapminder)
summary(gapminder$lifeExp[gapminder$year == max(gapminder$year)])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  39.61   57.16   71.94   67.01   76.41   82.60 
#Base R - Specific quantiles (computed over all years, not only the most recent)
quantile(x = gapminder$lifeExp, 
         probs = c(.01,.99))
      1%      99% 
33.49260 80.23892 
#tidyverse
library(tidyverse)
gapminder %>% 
  filter(year == max(year)) %>% 
  summarise(mean = mean(lifeExp),
            median = median(lifeExp),
            iqr = IQR(lifeExp),
            variance = var(lifeExp),
            sd = sd(lifeExp)) %>% 
  pivot_longer(everything())
# A tibble: 5 × 2
  name     value
  <chr>    <dbl>
1 mean      67.0
2 median    71.9
3 iqr       19.3
4 variance 146. 
5 sd        12.1

Univariate non-graphical

Quantitative data

What is the mean and standard deviation of GDP per capita in the most recent year, by continent?
library(gapminder)

gapminder %>% 
  filter(year == max(year)) %>%   # keep the most recent year overall, before grouping
  group_by(continent) %>% 
  summarise(mean = mean(gdpPercap),
            sd = sd(gdpPercap))
# A tibble: 5 × 3
  continent   mean     sd
  <fct>      <dbl>  <dbl>
1 Africa     3089.  3618.
2 Americas  11003.  9713.
3 Asia      12473. 14155.
4 Europe    25054. 11800.
5 Oceania   29810.  6541.
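
The same split-and-summarise pattern is also available without the tidyverse. The sketch below is a base R version that uses the built-in `iris` data as a stand-in (an assumption, made so the example runs without extra packages); for the question above, the analogous call would summarise `gdpPercap` by `continent`.

```r
# Stand-in data: iris ships with base R; Species plays the role of continent
data(iris)

# aggregate() splits Sepal.Length by Species and applies the summary function
group_mean <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
group_sd   <- aggregate(Sepal.Length ~ Species, data = iris, FUN = sd)

# Combine into one table with one row per group
group_summary <- merge(group_mean, group_sd, by = "Species",
                       suffixes = c("_mean", "_sd"))
print(group_summary)
```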

Univariate
Graphical

Univariate graphical


Multivariate
Non-graphical

Multivariate non-graphical

Categorical data

  • Cross-tabulation
#Base R
table(sex = polyps$sex, 
      treatment = polyps$treatment, 
      useNA = "ifany")
        treatment
sex      placebo sulindac
  female       4        5
  male         7        6
#crosstable package
crosstable(data = polyps,
           cols = c(treatment, age),
           by = sex,
           total = "column", 
           percent_digits=1) %>% 
  as_flextable()

Example crosstable

label      variable    sex
                       female            male
treatment  placebo     4 (36.4%)         7 (63.6%)
           sulindac    5 (45.5%)         6 (54.5%)
           Total       9 (40.9%)         13 (59.1%)
age        Min / Max   13.0 / 23.0       13.0 / 50.0
           Med [IQR]   22.0 [18.0;23.0]  23.0 [19.0;34.0]
           Mean (std)  20.1 (3.5)        26.8 (10.8)
           N (NA)      9 (0)             13 (0)

Multivariate non-graphical

Quantitative data

  • Correlation, covariance
#Base R
with(gapminder,
     cor(x = lifeExp[year == max(year)],
         y = gdpPercap[year == max(year)], 
         method = "spearman"))
[1] 0.8565899
#crosstable package
gapminder %>% 
  filter(year == max(year)) %>% 
  crosstable(cols = c(lifeExp, pop),
             by = gdpPercap,
             cor_method = "spearman",
             percent_digits=2) %>%
  as_flextable()

Multivariate, 2 numerical variables

label    variable  gdpPercap
lifeExp  spearman  0.86
pop      spearman  -0.06
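
Covariance is listed above but not demonstrated, so here is a minimal sketch of how it relates to correlation. It uses the built-in `mtcars` data as a stand-in (an assumption, so the example runs without extra packages) and checks two identities: Pearson correlation is covariance rescaled by both standard deviations, and Spearman correlation is Pearson correlation computed on ranks.

```r
# Stand-in data: mtcars ships with base R
data(mtcars)
x <- mtcars$wt   # car weight
y <- mtcars$mpg  # fuel efficiency

# Covariance depends on the units of x and y
cov_xy <- cov(x, y)

# Pearson correlation is the covariance rescaled by both standard deviations
pearson_manual <- cov_xy / (sd(x) * sd(y))

# Spearman correlation is the Pearson correlation of the ranks
spearman_manual <- cor(rank(x), rank(y))

# Both hand-computed values agree with cor()
stopifnot(
  all.equal(pearson_manual, cor(x, y)),
  all.equal(spearman_manual, cor(x, y, method = "spearman"))
)
```

Because Spearman works on ranks, it is less sensitive to skewed variables and extreme values than Pearson, which is one plausible reason to prefer it for a variable such as GDP per capita.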

Multivariate
Graphical

Multivariate Graphical


Exercises