Study design

Warm up

Announcements

  • Thanks to those who read the syllabus and followed the prompts

  • Lab 1 is due Sunday (May 19) at 11:59 pm on Gradescope

  • Late policy review

Any questions about the first assignment due?

Study design

Reading check in

Any questions on the readings or tutorials?

SAT scores and teacher salaries

What is going on in the following plot?

Modern Data Science with R. Baumer, Kaplan, Horton. (2023)

SAT scores and teacher salaries

What about this plot?

COVID vaccine and deaths from Delta variant

The main question we’ll explore today is “How do deaths from COVID cases compare between vaccinated and unvaccinated?”

What do you think?

Goals

  • Creating data visualizations and calculating summary statistics for comparing trends across groups

  • Distinguishing observational studies and experiments

  • Reviewing various sampling methods

  • Identifying confounding variables and Simpson’s paradox

Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::%||%()   masks base::%||%()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Data

The data for this case study come from a technical briefing published by Public Health England in August 2021 on COVID cases, vaccinations, and deaths from the Delta variant.

First, let’s load the data:

delta <- read_csv("delta.csv")
Rows: 268166 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): vaccine, age, outcome

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Visualizing and summarizing categorical data

Exercise 1

How many rows and columns are in this dataset? Answer in a full sentence using inline code. What does each row represent and what does each column represent? For each variable, identify its type.

There are 268166 rows and 3 columns in the dataset. Each row represents a person with COVID, and the columns record whether the person was vaccinated, their age group, and whether they died or survived. All three variables are categorical (read in as character); age is recorded as two groups, <50 and 50+.
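The exercise asks for inline code rather than typed-in numbers. Here is a minimal sketch of one way to do that, assuming the notes are written in Quarto or R Markdown, where an inline `r` expression in the prose is evaluated at render time. Because delta.csv is not reproduced here, the sketch rebuilds a stand-in `delta` from the cell counts printed in the Exercise 5 output; `delta_counts` is a name introduced only for this illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Cell counts copied from the Exercise 5 output (stand-in for delta.csv).
delta_counts <- tribble(
  ~age,  ~vaccine,       ~outcome,        ~n,
  "50+", "Unvaccinated", "died",         205,
  "50+", "Unvaccinated", "survived",    3235,
  "50+", "Vaccinated",   "died",         459,
  "50+", "Vaccinated",   "survived",   26848,
  "<50", "Unvaccinated", "died",          45,
  "<50", "Unvaccinated", "survived",  147567,
  "<50", "Vaccinated",   "died",          18,
  "<50", "Vaccinated",   "survived",   89789
)

# Expand the counts back to one row per person, as in the original file.
delta <- delta_counts |>
  uncount(n) |>
  select(vaccine, age, outcome)

nrow(delta)    # number of rows (cases)
ncol(delta)    # number of columns (variables)
glimpse(delta) # column names and types at a glance

# In the prose of a .qmd or .Rmd file, the same calls can be used inline:
# There are `r nrow(delta)` rows and `r ncol(delta)` columns in the dataset.
```

Written this way, the sentence updates itself if the data file changes.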

Exercise 2

Do these data come from an observational study or experiment? Why?

This is an observational study: people chose whether or not to get vaccinated; they were not randomly assigned to groups.

Exercise 3

Create a visualization of health outcome by vaccine status that allows you to compare the proportion of deaths across those who are and are not vaccinated. What can you say about death rates in these two groups based on this visualization?

While this is very difficult to see, the proportion of patients who died is slightly higher for the vaccinated group compared to the unvaccinated group.

ggplot(delta, aes(x = vaccine, fill = outcome)) + 
  geom_bar(position = "fill")

Exercise 4

Calculate the proportion of deaths among those who are vaccinated. Then, calculate the proportion among those who are not vaccinated.

The proportion of deaths among the vaccinated is 0.00407, and the proportion of deaths among the unvaccinated is 0.00166.

delta |>
  count(vaccine, outcome) |>
  group_by(vaccine) |>
  mutate(prop = n/sum(n))
# A tibble: 4 × 4
# Groups:   vaccine [2]
  vaccine      outcome       n    prop
  <chr>        <chr>     <int>   <dbl>
1 Unvaccinated died        250 0.00166
2 Unvaccinated survived 150802 0.998  
3 Vaccinated   died        477 0.00407
4 Vaccinated   survived 116637 0.996  
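With the proportions in hand, the difference that was hard to see in the stacked bars can be plotted directly. The following is a sketch, not part of the original exercise: it plots only the proportion who died, starting from the counts printed above. The names `delta_counts`, `death_props`, and `death_plot` are illustrative.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Counts copied from the Exercise 4 output (stand-in for delta.csv).
delta_counts <- tribble(
  ~vaccine,       ~outcome,        ~n,
  "Unvaccinated", "died",         250,
  "Unvaccinated", "survived",  150802,
  "Vaccinated",   "died",         477,
  "Vaccinated",   "survived",  116637
)

# Proportion of cases who died within each vaccine group.
death_props <- delta_counts |>
  group_by(vaccine) |>
  mutate(prop = n / sum(n)) |>
  ungroup() |>
  filter(outcome == "died")

# Dropping the "survived" segment lets the y-axis scale to the small death proportions.
death_plot <- ggplot(death_props, aes(x = vaccine, y = prop)) +
  geom_col() +
  labs(x = "Vaccine status", y = "Proportion of cases who died")

death_plot
```

The trade-off is that this plot no longer shows how rare death is overall, so it is best read alongside the stacked version.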

Exercise 5

Create the visualization and calculate proportions from the two previous exercises, this time controlling for age. How do the proportions compare?

Among both the younger patients (<50) and the older patients (50+), the proportion of deaths is smaller for the vaccinated than for the unvaccinated.

ggplot(delta, aes(x = vaccine, fill = outcome)) + 
  geom_bar(position = "fill") +
  facet_wrap(~age)

delta |>
  count(age, vaccine, outcome) |>
  group_by(age, vaccine) |>
  mutate(prop = n/sum(n))
# A tibble: 8 × 5
# Groups:   age, vaccine [4]
  age   vaccine      outcome       n     prop
  <chr> <chr>        <chr>     <int>    <dbl>
1 50+   Unvaccinated died        205 0.0596  
2 50+   Unvaccinated survived   3235 0.940   
3 50+   Vaccinated   died        459 0.0168  
4 50+   Vaccinated   survived  26848 0.983   
5 <50   Unvaccinated died         45 0.000305
6 <50   Unvaccinated survived 147567 1.00    
7 <50   Vaccinated   died         18 0.000200
8 <50   Vaccinated   survived  89789 1.00    

Exercise 6

Based on your findings so far, fill in the blanks with more, less, or equally. Is there anything surprising about these statements? Speculate on what, if anything, the discrepancy might be due to.

  • In 2021, among those in the UK who were COVID Delta cases, the vaccinated were more likely to die than the unvaccinated.

  • For those under 50, those who were unvaccinated were more likely to die than those who were vaccinated.

  • For those 50 and up, those who were unvaccinated were more likely to die than those who were vaccinated.

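    The direction of the relationship between outcome and vaccine status reverses once we account for age: within each age group the vaccinated were less likely to die, even though overall they appear more likely to die.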

Simpson’s Paradox

Simpson’s paradox is a phenomenon in which a trend appears in subsets of the data, but disappears or reverses when the subsets are combined. The paradox can be resolved when confounding variables and causal relations are appropriately addressed in the analysis.

Exercise 7

Let’s rephrase the previous question, which asked you to speculate on why the death rate is higher among vaccinated cases overall, yet higher among unvaccinated cases when we split the data into two age groups (below 50 and 50 and up). What might be the confounding variable in the relationship between vaccination and deaths?

Age.

Exercise 8

Visualize and describe the distribution of seniors (50 and up) based on (a.k.a. conditional on) vaccination status. Hint: Your description will benefit from calculating proportions of seniors in each of the vaccination groups and working those values into your narrative.

The proportion of seniors (50+) is higher for the vaccinated group (0.233) compared to the unvaccinated group (0.0228).

ggplot(delta, aes(x = vaccine, fill = age)) + 
  geom_bar(position = "fill")

delta |>
  count(vaccine, age) |>
  group_by(vaccine) |>
  mutate(prop = n/sum(n))
# A tibble: 4 × 4
# Groups:   vaccine [2]
  vaccine      age        n   prop
  <chr>        <chr>  <int>  <dbl>
1 Unvaccinated 50+     3440 0.0228
2 Unvaccinated <50   147612 0.977 
3 Vaccinated   50+    27307 0.233 
4 Vaccinated   <50    89807 0.767 
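These age shares account for the reversal. As a worked check (the notation is introduced here and is not from the original notes), each group's overall death rate is a weighted average of its age-specific death rates from Exercise 5, with the age shares above as weights:

```latex
\begin{aligned}
P(\text{died} \mid v) &= \sum_{a \in \{50+,\ <50\}} P(a \mid v)\, P(\text{died} \mid a, v) \\[4pt]
P(\text{died} \mid \text{Vaccinated}) &\approx 0.233 \times 0.0168 + 0.767 \times 0.000200 \approx 0.00407 \\
P(\text{died} \mid \text{Unvaccinated}) &\approx 0.0228 \times 0.0596 + 0.977 \times 0.000305 \approx 0.00166
\end{aligned}
```

Nearly a quarter of vaccinated cases are 50+, where death rates are highest, compared with about 2% of unvaccinated cases. That pulls the vaccinated group's overall rate up even though its rate is lower within each age group.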

Summary

The percentages of deaths from COVID across vaccination groups are as follows:

delta |> 
  count(vaccine, outcome) |>
  group_by(vaccine) |>
  mutate(perc = round(n/sum(n)*100,2)) |>
  filter(outcome == "died") |>
  select(-outcome, -n)
# A tibble: 2 × 2
# Groups:   vaccine [2]
  vaccine       perc
  <chr>        <dbl>
1 Unvaccinated  0.17
2 Vaccinated    0.41

Also considering age groups, the death rates are as follows:

vaccine_age_outcome_perc <- delta |> 
  count(vaccine, age, outcome) |>
  group_by(vaccine, age) |>
  mutate(perc = round(n/sum(n)*100,2)) |>
  filter(outcome == "died") |>
  select(-outcome, -n)

vaccine_age_outcome_perc
# A tibble: 4 × 3
# Groups:   vaccine, age [4]
  vaccine      age    perc
  <chr>        <chr> <dbl>
1 Unvaccinated 50+    5.96
2 Unvaccinated <50    0.03
3 Vaccinated   50+    1.68
4 Vaccinated   <50    0.02

We can pivot these data for better display; we’ll learn more about these “data moves” soon:

vaccine_age_outcome_perc |> 
  pivot_wider(names_from = age, values_from = perc)
# A tibble: 2 × 3
# Groups:   vaccine [2]
  vaccine      `50+` `<50`
  <chr>        <dbl> <dbl>
1 Unvaccinated  5.96  0.03
2 Vaccinated    1.68  0.02

We identified age as a potential confounding variable in the relationship between vaccination and deaths. So let’s take a look at the distribution of age in the data:

age_props <- delta |>
  count(age) |>
  mutate(p = n / sum(n))

And then, let’s use these proportions to weight the percentages of deaths, so that both vaccination groups are compared as if they had the same age distribution.

# 0.115 and 0.885 are the rounded overall shares of 50+ and <50 cases from age_props
vaccine_age_outcome_perc |>
  mutate(perc_wt = if_else(age == "50+", perc * 0.115, perc * 0.885)) |>
  group_by(vaccine) |>
  summarize(perc = sum(perc_wt))
# A tibble: 2 × 2
  vaccine       perc
  <chr>        <dbl>
1 Unvaccinated 0.712
2 Vaccinated   0.211
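The weights above are typed in by hand. As an alternative sketch (not the approach used in the original notes), the weights can be joined in from the age proportions so that nothing is hard-coded. The data frame is rebuilt from the counts printed in Exercise 5, and `delta_counts`, `death_perc`, and `adjusted` are illustrative names.

```r
library(dplyr)
library(tibble)

# Cell counts copied from the Exercise 5 output (stand-in for delta.csv).
delta_counts <- tribble(
  ~age,  ~vaccine,       ~outcome,        ~n,
  "50+", "Unvaccinated", "died",         205,
  "50+", "Unvaccinated", "survived",    3235,
  "50+", "Vaccinated",   "died",         459,
  "50+", "Vaccinated",   "survived",   26848,
  "<50", "Unvaccinated", "died",          45,
  "<50", "Unvaccinated", "survived",  147567,
  "<50", "Vaccinated",   "died",          18,
  "<50", "Vaccinated",   "survived",   89789
)

# Share of all cases in each age group: the common weights for both vaccine groups.
age_props <- delta_counts |>
  group_by(age) |>
  summarize(n = sum(n)) |>
  mutate(p = n / sum(n)) |>
  select(age, p)

# Age-specific death percentages within each vaccine group (unrounded).
death_perc <- delta_counts |>
  group_by(vaccine, age) |>
  mutate(perc = n / sum(n) * 100) |>
  ungroup() |>
  filter(outcome == "died") |>
  select(vaccine, age, perc)

# Weight each age-specific percentage by the overall age share, then sum per group.
adjusted <- death_perc |>
  left_join(age_props, by = "age") |>
  group_by(vaccine) |>
  summarize(perc_adj = sum(perc * p))

adjusted
```

The results should land close to the 0.712 and 0.211 shown above; small differences are expected because this version does not round the weights or the percentages before combining them.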

Revisiting the question we posed to start with: How do deaths from COVID cases compare between vaccinated and unvaccinated?

Acknowledgements

This case study is inspired by Statistical Literacy: Simpson’s Paradox and Covid Deaths by Milo Schield.