Hypothesis testing with randomization

Hypothesis testing

A hypothesis test is a statistical technique used to evaluate competing claims using data.

  • Null hypothesis, \(H_0\): An assumption about the population. “There is nothing going on.”

  • Alternative hypothesis, \(H_A\): A research question about the population. “There is something going on.”

Note

Hypotheses are always at the population level!

Populations vs. samples

Suppose you’re cooking a pot of soup:

  • Taste a spoonful and decide if that spoonful has enough salt \(\rightarrow\) exploratory data analysis of the sample

  • Decide the pot of soup must also have enough salt since the spoonful does \(\rightarrow\) inference

  • Mixing the soup in the pot before taking a spoonful \(\rightarrow\) randomizing

  • Taking a spoonful with the intention of making an inference about the pot \(\rightarrow\) sampling

Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::%||%()   masks base::%||%()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.5          ✔ rsample      1.2.0     
✔ dials        1.2.0          ✔ tune         1.1.2     
✔ infer        1.0.5.9000     ✔ workflows    1.1.3     
✔ modeldata    1.2.0          ✔ workflowsets 1.0.1     
✔ parsnip      1.1.1          ✔ yardstick    1.2.0     
✔ recipes      1.0.8          
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::%||%()     masks base::%||%()
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
library(openintro)  # yawn data for case study 2
Loading required package: airports
Loading required package: cherryblossom
Loading required package: usdata

Attaching package: 'openintro'

The following object is masked from 'package:modeldata':

    ames

Case study 1: Donors

Organ donors

People providing an organ for donation sometimes seek the help of a special “medical consultant”. These consultants assist the patient in all aspects of the surgery, with the goal of reducing the possibility of complications during the medical procedure and recovery. Patients might choose a consultant based in part on the historical complication rate of the consultant’s clients.

One consultant tried to attract patients by noting that the average complication rate for liver donor surgeries in the US is about 10%, but her clients have only had 3 complications in the 62 liver donor surgeries she has facilitated. She claims this is strong evidence that her work meaningfully contributes to reducing complications (and therefore she should be hired!).

Data

organ_donor <- read_csv("organ-donor.csv")

organ_donor |>
  count(outcome)
# A tibble: 2 × 2
  outcome             n
  <chr>           <int>
1 complication        3
2 no complication    59

Parameter vs. statistic

A parameter for a hypothesis test is the “true” value of interest. We typically estimate the parameter using a sample statistic as a point estimate.

\(p~\): true rate of complication

\(\hat{p}~\): rate of complication in the sample = \(\frac{3}{62}\) = 0.048

Correlation vs. causation

Is it possible to assess the consultant’s claim using the data?

No. The claim is that there is a causal connection, but the data are observational. For example, maybe patients who can afford a medical consultant can afford better medical care, which can also lead to a lower complication rate.

While it is not possible to assess the causal claim, it is still possible to test for an association using these data. For this question we ask, could the low complication rate of \(\hat{p}\) = 0.048 be due to chance?

Two claims

  • Null hypothesis: “There is nothing going on”

Complication rate for this consultant is no different than the US average of 10%

  • Alternative hypothesis: “There is something going on”

Complication rate for this consultant is different than the US average of 10%

Hypothesis testing as a court trial

  • Hypotheses:

    • Null hypothesis, \(H_0\): Defendant is innocent

    • Alternative hypothesis, \(H_A\): Defendant is guilty

  • Present the evidence: Collect data

  • Judge the evidence: “Could these data plausibly have happened by chance if the null hypothesis were true?”

    • Yes: Fail to reject \(H_0\)
    • No: Reject \(H_0\)

Hypothesis testing framework

  • Start with a null hypothesis, \(H_0\), that represents the status quo
  • Set an alternative hypothesis, \(H_A\), that represents the research question, i.e. what we’re testing for
  • Conduct a hypothesis test under the assumption that the null hypothesis is true and calculate a p-value (probability of observed or more extreme outcome given that the null hypothesis is true)
    • if the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, stick with the null hypothesis
    • if they do, then reject the null hypothesis in favor of the alternative

Setting the hypotheses

Which of the following is the correct set of hypotheses for evaluating whether the consultant’s complication rate is different than the US average of 10%?

  1. \(H_0: p = 0.10\); \(H_A: p \ne 0.10\)
  2. \(H_0: p = 0.10\); \(H_A: p > 0.10\)
  3. \(H_0: p = 0.10\); \(H_A: p < 0.10\)
  4. \(H_0: \hat{p} = 0.10\); \(H_A: \hat{p} \ne 0.10\)
  5. \(H_0: \hat{p} = 0.10\); \(H_A: \hat{p} > 0.10\)
  6. \(H_0: \hat{p} = 0.10\); \(H_A: \hat{p} < 0.10\)

Simulating the null distribution

Since \(H_0: p = 0.10\), we need to simulate a null distribution where the probability of success (complication) for each trial (patient) is 0.10.


Describe how you would simulate the null distribution for this study using a bag of, say, poker chips. How many chips? What colors? What do the colors indicate? How many draws? With replacement or without replacement?

We can take a bag of 100 chips: say 10 of them are red (corresponding to complications) and 90 are white (corresponding to no complications). Then, if we draw a chip from the bag, there is a 10% chance that it is red (a complication). This “bag” now corresponds to a world where the true complication rate is 10%, i.e., where the null hypothesis is true. Every time we draw a chip, we record its color and put it back (if we did not, the second chip drawn would have either a 9/99 or a 10/99 chance of being red instead of 10/100, depending on the color of the first chip). This is what we call sampling with replacement.

Now, draw 62 chips with replacement and calculate the proportion of red chips out of these 62 chips. This reconstructs the consultant’s sample under the assumption that her true complication rate is 10%.

We repeat this draw of 62 chips many times and record the proportion of red chips every time.
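
To make this concrete, here is a minimal base-R sketch of a single draw of 62 chips; the chip colors and the seed are illustrative:

# One simulated draw of 62 chips under H0: p = 0.10
set.seed(1234)  # illustrative seed
chips <- c("red", "white")  # red = complication, white = no complication
draw <- sample(chips, size = 62, replace = TRUE, prob = c(0.10, 0.90))
table(draw)
mean(draw == "red")  # simulated proportion of complications, p-hat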

What do we expect?

When sampling from the null distribution, what is the expected proportion of success (complications)?

We expect to see around 6.2 (10% of 62) red chips. Of course, it does not make sense to draw 0.2 of a chip, but if we were to repeat this many, many times, the average number of red chips observed would be close to 6.2.

If we were to plot a histogram of the resulting proportions, we expect it to be centered around 0.1, since the probability of drawing a red chip out of the bag is 0.1.

Simulation #1

sim1
   complication no complication 
              7              55 
[1] 0.1129032

Simulation #2

sim2
   complication no complication 
             10              52 
[1] 0.1612903

Simulation #3

sim3
   complication no complication 
              5              57 
[1] 0.08064516

This is getting boring…

We need a way to automate this process!
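
Before turning to the infer package below, one way to automate this in base R is replicate() — a hedged sketch:

# Repeat the 62-chip draw many times with replicate() (illustrative sketch)
set.seed(1234)
sim_props <- replicate(100, {
  draw <- sample(c("red", "white"), size = 62, replace = TRUE, prob = c(0.10, 0.90))
  mean(draw == "red")
})
summary(sim_props)  # simulated p-hats, centered near 0.10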

The first dataset we’ll use is organ_donor:

organ_donor <- read_csv("organ-donor.csv")
Rows: 62 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): outcome

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The hypotheses we are testing are:

\(H_0: p = 0.10\)

\(H_A: p \ne 0.10\)

where \(p\) is the true complication rate for this consultant.

Exercise 1

Construct the null distribution with 100 resamples. Name it null_dist_donor. How many rows does null_dist_donor have? How many columns? What does each row and each column represent?

set.seed(10)

null_dist_donor <- organ_donor |>
  specify(response = outcome, success = "complication") |>
  hypothesise(null = "point", p = 0.1) |>
  generate(reps = 100, type = "draw") |>
  calculate(stat = "prop")

null_dist_donor
Response: outcome (factor)
Null Hypothesis: point
# A tibble: 100 × 2
   replicate   stat
       <int>  <dbl>
 1         1 0.0323
 2         2 0.0645
 3         3 0.0968
 4         4 0.0161
 5         5 0.161 
 6         6 0.0968
 7         7 0.0645
 8         8 0.129 
 9         9 0.161 
10        10 0.0968
# ℹ 90 more rows

null_dist_donor has 100 rows and 2 columns. Each row is one replicate: the replicate column gives the replicate number, and the stat column gives the simulated proportion, \(\hat{p}\).

Exercise 2

Where do you expect the center of the null distribution to be? Visualize it to confirm.

# option 1

ggplot(null_dist_donor, aes(x = stat)) +
  geom_histogram(bins = 15, color = "white")

# option 2

visualize(null_dist_donor)

Exercise 3

Calculate the observed complication rate of this consultant. Name it obs_stat_donor.

obs_stat_donor <- organ_donor |>
  specify(response = outcome, success = "complication") |>
  calculate(stat = "prop")

obs_stat_donor
Response: outcome (factor)
# A tibble: 1 × 1
    stat
   <dbl>
1 0.0484

Exercise 4

Overlay the observed statistic on the null distribution and comment on whether an observed outcome as extreme as the observed statistic, or lower, is a likely or unlikely outcome, if in fact the null hypothesis is true.

# option 1

ggplot(null_dist_donor, aes(x = stat)) +
  geom_histogram(bins = 15, color = "white") + 
  geom_vline(xintercept = obs_stat_donor |> pull(stat), color = "red")

# option 2

visualize(null_dist_donor) +
  shade_p_value(obs_stat = obs_stat_donor, direction = "two-sided")

Exercise 5

Calculate the p-value and comment on whether it provides convincing evidence that this consultant incurs a lower complication rate than 10% (overall US complication rate).

Since the p-value (0.26, calculated below) is greater than the discernability level of 0.05, we fail to reject the null hypothesis. These data do not provide convincing evidence that this consultant’s complication rate is different from the overall US rate of 10%.

# option 1

null_dist_donor |>
  filter(stat <= pull(obs_stat_donor, stat)) |>
  nrow()*2/100
[1] 0.26
# option 2

null_dist_donor |>
  get_p_value(obs_stat = obs_stat_donor, direction = "two-sided")
# A tibble: 1 × 1
  p_value
    <dbl>
1    0.26

Exercise 6

Let’s get real! Redo the test with 15,000 simulations. Note: This can take some time to run.

null_dist_donor <- organ_donor |>
  specify(response = outcome, success = "complication") |>
  hypothesise(null = "point", p = 0.1) |>
  generate(reps = 15000, type = "draw") |>
  calculate(stat = "prop")

null_dist_donor |>
  visualise() + 
  shade_p_value(obs_stat = obs_stat_donor, direction = "two-sided")

null_dist_donor |>
  get_p_value(obs_stat = obs_stat_donor, direction = "two-sided")
# A tibble: 1 × 1
  p_value
    <dbl>
1   0.250

Case study 2: Yawners

Is yawning contagious?

Do you think yawning is contagious?

Is yawning contagious?

An experiment conducted by the MythBusters tested if a person can be subconsciously influenced into yawning if another person near them yawns.

If you’re interested, you can watch the full episode at https://www.dailymotion.com/video/x7ydkt2.

Study description

In this study, 50 people were randomly assigned to two groups: 34 to a treatment group where a person near them yawned, and 16 to a control group where no one near them yawned.

The data are in the openintro package: yawn

yawn |>
  count(group, result)
# A tibble: 4 × 3
  group result       n
  <fct> <fct>    <int>
1 ctrl  not yawn    12
2 ctrl  yawn         4
3 trmt  not yawn    24
4 trmt  yawn        10

Proportion of yawners

yawn |>
  count(group, result) |>
  group_by(group) |>
  mutate(p_hat = n / sum(n))
# A tibble: 4 × 4
# Groups:   group [2]
  group result       n p_hat
  <fct> <fct>    <int> <dbl>
1 ctrl  not yawn    12 0.75 
2 ctrl  yawn         4 0.25 
3 trmt  not yawn    24 0.706
4 trmt  yawn        10 0.294
  • Proportion of yawners in the treatment group: \(\frac{10}{34} = 0.2941\)
  • Proportion of yawners in the control group: \(\frac{4}{16} = 0.25\)
  • Difference: \(0.2941 - 0.25 = 0.0441\)
  • Our results match the ones calculated on the MythBusters episode.

Independence?

Based on the proportions we calculated, do you think yawning is really contagious, i.e. are seeing someone yawn and yawning dependent?


Dependence, or another possible explanation?

  • The observed differences might suggest that yawning is contagious, i.e. seeing someone yawn and yawning are dependent.
  • But the differences are small enough that we might wonder if they might simply be due to chance.
  • Perhaps if we were to repeat the experiment, we would see slightly different results.
  • So we will do just that - well, somewhat - and see what happens.
  • Instead of actually conducting the experiment many times, we will simulate our results.

Two competing claims

Null hypothesis:

“There is nothing going on.” Yawning and seeing someone yawn are independent, yawning is not contagious, observed difference in proportions is simply due to chance.

Alternative hypothesis:

“There is something going on.” Yawning and seeing someone yawn are dependent, yawning is contagious, observed difference in proportions is not due to chance.

Simulation by hand - setup

  1. A regular deck of cards comprises 52 cards: 4 aces, 4 each of the numbers 2-10, 4 jacks, 4 queens, and 4 kings.

  2. Take out two aces from the deck of cards and set them aside.

  3. Use the remaining 50 playing cards to represent the participants in the study:

    • 14 face cards (the 12 jacks, queens, and kings, plus the 2 remaining aces) represent the people who yawn.
    • 36 non-face cards represent the people who don’t yawn.

Simulation by hand - running

  1. Shuffle the 50 cards at least 7 times to ensure that the cards counted out are from a random process.

  2. Count out the top 16 cards and set them aside. These cards represent the people in the control group.

  3. Out of the remaining 34 cards (treatment group) count the number of face cards (the number of people who yawned in the treatment group).

  4. Calculate the difference in proportions of yawners (treatment - control).

  5. Mark the difference you find on the dot plot on the board.

Simulation by computation
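
Before running the full randomization test in the exercise below, here is a hedged base-R sketch of a single card shuffle: permute the group labels while holding the yawn results fixed.

# One "shuffle of the deck" in base R: permute group labels, keep results fixed
# (illustrative sketch; the infer pipeline below automates this)
set.seed(1234)
shuffled_group <- sample(yawn$group)  # random reassignment to trmt/ctrl
p_trmt <- mean(yawn$result[shuffled_group == "trmt"] == "yawn")
p_ctrl <- mean(yawn$result[shuffled_group == "ctrl"] == "yawn")
p_trmt - p_ctrl  # one simulated difference in proportions under H0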

Exercise 7

Using the yawn dataset in the openintro package, conduct a hypothesis test for evaluating whether yawning is contagious. First, set the hypotheses. Then, conduct a randomization test using 1000 simulations. Visualize and calculate the p-value and use it to make a conclusion about the statistical discernability of the difference in proportions of yawners in the two groups. Then, comment on whether you “buy” this conclusion.

Null hypothesis: Yawning is not contagious (there is no relationship between seeing someone yawn and yawning). \(H_0: p_{trt} = p_{ctr}\), where \(p_{trt}\) is the proportion of those who yawn after seeing someone yawn and \(p_{ctr}\) is the proportion of those who yawn in the control group.

Alternative hypothesis: Yawning is contagious (there is a relationship). \(H_A: p_{trt} > p_{ctr}\).

null_dist_yawner <- yawn |>
  specify(response = result, explanatory = group, success = "yawn") |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in props", order = c("trmt", "ctrl"))

obs_stat_yawner <- yawn |>
  specify(response = result, explanatory = group, success = "yawn") |>
  calculate(stat = "diff in props", order = c("trmt", "ctrl") )

null_dist_yawner |>
  visualise() +
  shade_p_value(obs_stat = obs_stat_yawner, direction = "greater")

null_dist_yawner |>
  filter(stat >= obs_stat_yawner |> pull(stat)) |>
  nrow()/1000
[1] 0.522
null_dist_yawner |>
  get_p_value(obs_stat = obs_stat_yawner, direction = "greater")
# A tibble: 1 × 1
  p_value
    <dbl>
1   0.522

Since the p-value is 0.522 and is greater than the discernability level of 0.05, we fail to reject the null hypothesis. There is not enough evidence to conclude that yawning is contagious.

Things to report:

  • p-value

  • compare it to discernability level

  • conclude what that means for the null hypothesis

  • enough/not enough evidence to conclude …. (context of the problem)

Recap

Types of alternative hypotheses

  • One-sided (one-tailed) alternatives: The parameter is hypothesized to be less than or greater than the null value, < or >

  • Two-sided (two-tailed) alternatives: The parameter is hypothesized to be not equal to the null value, \(\ne\)

    • Calculated as two times the tail area beyond the observed sample statistic (see the sketch after this list)
    • More objective, and hence more widely preferred
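
The two-sided p-value can be computed directly from a simulated null distribution. A sketch reusing null_dist_donor and obs_stat_donor from Case Study 1 (equivalent in spirit to get_p_value(direction = "two-sided")):

# Two-sided p-value: double the tail area beyond the observed statistic
# (here the observed statistic falls in the lower tail, so we double the lower tail)
null_dist_donor |>
  summarise(p_value = 2 * mean(stat <= pull(obs_stat_donor, stat)))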

Average systolic blood pressure of people with Stage 1 Hypertension is 150 mm Hg. Suppose we want to use a hypothesis test to evaluate whether a new blood pressure medication has an effect on the average blood pressure of heart patients. What are the hypotheses?

Discernability level

We often use 5% as the cutoff for whether the p-value is low enough that the data are unlikely to have come from the null model. This cutoff value is called the discernability level, \(\alpha\).

  • If p-value < \(\alpha\), reject \(H_0\) in favor of \(H_A\): The data provide convincing evidence for the alternative hypothesis.

  • If p-value > \(\alpha\), fail to reject \(H_0\): The data do not provide convincing evidence for the alternative hypothesis.
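
As a minimal sketch of this decision rule, using the two-sided p-value from Case Study 1:

# Decision rule at the 5% discernability level (illustrative)
alpha <- 0.05
p_value <- 0.26  # two-sided p-value from Case Study 1
if (p_value < alpha) "Reject H0" else "Fail to reject H0"
[1] "Fail to reject H0"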

Statistically discernable

  • If you’ve taken a statistics course before, or read papers that use hypothesis testing for making a conclusion, you might have encountered the term “statistically significant” or “significance level”.

  • We will use the terms “statistically discernable” and “discernability level”, because “significant” has a different meaning in everyday language and this often causes misconceptions of what “statistically significant” means. It doesn’t necessarily mean a notable or important event has happened, it just means the data are unlikely to have come from the null model.

Setting a seed

  • Goal: Pin down the random generation so that the same random generation happens each time a document is rendered (by you or by someone else wanting to replicate your results)

  • When: Set a seed each time right before generate()ing new resamples. Setting a seed once in a document would also work for re-rendering the document, but considering we often run the code chunk interactively, it’s best to set the seed again in each code chunk that does random generation.
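
For example, set the seed immediately before generate() — a sketch reusing the Exercise 1 pipeline:

# Any fixed seed works; set it right before the random generation
set.seed(1234)
null_dist_donor <- organ_donor |>
  specify(response = outcome, success = "complication") |>
  hypothesize(null = "point", p = 0.1) |>
  generate(reps = 100, type = "draw") |>
  calculate(stat = "prop")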