library(tidyverse)
library(tidymodels)
library(openintro)
Lab 5
By the end of this lab you will formulate and conduct hypothesis tests using randomization.
For all visualizations you create, be sure to include informative titles for the plot, axes, and legend!
Getting started
Go to Duke Container Manager and create a new folder titled lab-5 - Hypothesis testing.
Download the files from Canvas and upload them into the new folder you created in step 1.
Complete the exercises in this document.
Part 1: IMS Exercises
The exercises in this section do not require code. Make sure to answer the questions in full sentences.
Exercise 1
IMS - Chapter 11 exercises, #2: Identify the parameter, II.
Exercise 2
IMS - Chapter 11 exercises, #6: Identify hypotheses, II.
Exercise 3
IMS - Chapter 11 exercises, #8: Heart transplants.
Part 2: Heart transplants - outcome
Today, we will be working with data from The Stanford University Heart Transplant Study [@turnbull1974].
We’ll use the tidyverse and tidymodels packages for our analysis, so let’s load those first.
And then, let’s read the data and take a peek at it:
heart_transplant <- read_csv("heart-transplant.csv")
glimpse(heart_transplant)
Rows: 103
Columns: 8
$ id <dbl> 15, 43, 61, 75, 6, 42, 54, 38, 85, 2, 103, 12, 48, 102, 35,…
$ outcome <chr> "deceased", "deceased", "deceased", "deceased", "deceased",…
$ transplant <chr> "control", "control", "control", "control", "control", "con…
$ age <dbl> 53, 43, 52, 52, 54, 36, 47, 41, 47, 51, 39, 53, 56, 40, 43,…
$ survtime <dbl> 1, 2, 2, 2, 3, 3, 3, 5, 5, 6, 6, 8, 9, 11, 12, 16, 16, 16, …
$ acceptyear <dbl> 68, 70, 71, 72, 68, 70, 71, 70, 73, 68, 67, 68, 71, 74, 70,…
$ prior <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no",…
$ wait <dbl> NA, NA, NA, NA, NA, NA, NA, 5, NA, NA, NA, NA, NA, NA, NA, …
The data dictionary is as follows:
Variable | Description |
---|---|
id | ID number of the patient |
outcome | Survival status with levels alive and deceased |
transplant | Transplant group with levels control (did not receive a transplant) and treatment (received a transplant) |
age | Age of the patient at the beginning of the study |
survtime | Number of days patients were alive after the date they were determined to be a candidate for a heart transplant until the termination date of the study |
acceptyear | Year of acceptance as a heart transplant candidate |
prior | Whether or not the patient had prior surgery with levels yes and no |
wait | Waiting time for transplant |
Exercise 4
Recreate the stacked bar plot in Exercise 3.
The colors are #569BBD and #114B5F. Use scale_fill_manual() to indicate these are the colors you want to use.
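As a rough sketch of how scale_fill_manual() attaches specific colors to fill levels, the example below builds a stacked bar plot from a small made-up data frame. The toy data, the column names group and status, the label text, and the choice of position = "fill" (which stacks proportions rather than counts) are all assumptions for illustration; adapt them to heart_transplant and to the plot shown in Exercise 3.

```r
library(ggplot2)

# Made-up data for illustration only (not the heart transplant data)
toy <- data.frame(
  group  = c("a", "a", "a", "b", "b", "b", "b"),
  status = c("yes", "no", "no", "yes", "yes", "no", "yes")
)

# Stacked bar plot: one bar per group, filled by status.
# position = "fill" rescales each bar to proportions; drop it to stack counts.
p <- ggplot(toy, aes(x = group, fill = status)) +
  geom_bar(position = "fill") +
  # Named vector maps each fill level to a specific color
  scale_fill_manual(values = c("no" = "#569BBD", "yes" = "#114B5F")) +
  labs(
    title = "Status by group (toy data)",
    x = "Group",
    y = "Proportion",
    fill = "Status"
  )

p
```

Naming the elements of values ties each color to a level explicitly, so the mapping does not depend on the order of the levels.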
Now that you’ve completed a few exercises, pause and render your document. If it renders without any issues, great! Move on to the next exercise. If it does not, debug the issue before moving on. Ask for help if you need it. Do not proceed without rendering your document.
Exercise 5
Calculate the point estimate for the difference in proportions of patients who died between the treatment and control groups: \(\hat{p}_{treatment} - \hat{p}_{control}\), where \(\hat{p}\) is the observed proportion of patients who died in each group. Do this in two ways:
- Construct a frequency table of outcome by transplant and calculate the difference in proportions “by hand”.
- Use specify() and calculate() to calculate the difference in proportions as shown below.
obs_diff_outcome <- heart_transplant |>
  specify(response = outcome, explanatory = transplant, success = "deceased") |>
  calculate(stat = "diff in props", order = c("treatment", "control"))
Then, confirm that you obtained the same results with the two methods.
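The “by hand” route can be sketched on a small made-up data frame as follows. The toy values are invented for illustration and are not the heart transplant data; only the column names transplant and outcome mirror the lab. The idea is to count each combination, convert counts to within-group proportions, and subtract.

```r
library(dplyr)

# Made-up data for illustration only (not the heart transplant data)
toy <- data.frame(
  transplant = c(rep("treatment", 5), rep("control", 4)),
  outcome    = c("deceased", "deceased", "alive", "alive", "alive",
                 "deceased", "deceased", "deceased", "alive")
)

# Frequency table of outcome by transplant, with within-group proportions
freq_table <- toy |>
  count(transplant, outcome) |>
  group_by(transplant) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

# Keep only the proportion deceased in each group
p_deceased <- freq_table |>
  filter(outcome == "deceased") |>
  select(transplant, prop)

# Difference in proportions, treatment minus control
diff_by_hand <- p_deceased$prop[p_deceased$transplant == "treatment"] -
  p_deceased$prop[p_deceased$transplant == "control"]

diff_by_hand
```

In this toy example 2 of 5 treatment patients and 3 of 4 control patients are deceased, so the difference is 0.40 - 0.75 = -0.35.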
Exercise 6
Using the code below, generate a null distribution for the hypothesis test you formulated in Exercise 3. Use 100 resamples (reps). However, change the seed to a different value.
set.seed(1234)
null_dist_outcome <- heart_transplant |>
  specify(response = outcome, explanatory = transplant, success = "deceased") |>
  hypothesize(null = "independence") |>
  generate(reps = 100, type = "permute") |>
  calculate(stat = "diff in props", order = c("treatment", "control"))
Then, inspect the object null_dist_outcome using either glimpse() or just printing it to screen. How many rows and how many columns does it have? What does each row represent? What does each variable represent?
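To build intuition for what a permutation-based null distribution contains, here is a conceptual base R sketch on made-up data. It is not how the infer functions are implemented internally; it only mimics the idea of shuffling one variable relative to the other and recomputing the statistic each time. The helper diff_in_props() and the toy vectors are hypothetical names introduced for this illustration.

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility of this sketch

# Made-up data for illustration only (not the heart transplant data)
transplant <- c(rep("treatment", 5), rep("control", 4))
outcome    <- c("deceased", "deceased", "alive", "alive", "alive",
                "deceased", "deceased", "deceased", "alive")

# Hypothetical helper: proportion deceased in treatment minus control
diff_in_props <- function(outcome, transplant) {
  mean(outcome[transplant == "treatment"] == "deceased") -
    mean(outcome[transplant == "control"] == "deceased")
}

# Each replicate shuffles the outcomes, which breaks any association
# with the group labels, and then recomputes the difference in proportions
reps <- 100
null_stats <- replicate(reps, diff_in_props(sample(outcome), transplant))

length(null_stats)  # one simulated statistic per resample
```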
Exercise 7
Visualize the distribution of simulated differences in proportions of deceased between treatment and control groups. What is the shape of this distribution? What is the center of this distribution? Are these results expected? Explain your reasoning.
These simulated values are in the stat column of null_dist_outcome.
Exercise 8
Calculate the p-value. Do this in two ways:
- Filter null_dist_outcome for stat values that are at least as far away from the null value as the observed value you calculated in Exercise 5. Make sure you’re considering both directions! Note how many such values there are, and calculate the proportion out of the total number of resamples.
- Use get_p_value():
p_value_outcome <- null_dist_outcome |>
  get_p_value(obs_stat = obs_diff_outcome, direction = "two-sided")
Then, confirm that you obtained the same results with the two methods. Finally, interpret the p-value in the context of the data and the research question and comment on the statistical discernibility of this result using a discernibility level of 5%.
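The counting approach can be sketched on a short made-up vector of simulated statistics, assuming a null value of 0. The numbers below are invented for illustration and are not output from the heart transplant data; null_stats and obs_stat are hypothetical stand-ins for the stat column and the observed difference.

```r
# Made-up simulated statistics and observed statistic (illustration only)
null_stats <- c(-0.30, -0.10, 0.00, 0.05, 0.20, 0.35, -0.40, 0.10, -0.05, 0.25)
obs_stat   <- -0.30

# With a null value of 0, "at least as far away in either direction"
# means the absolute value is at least as large as the observed one
extreme <- abs(null_stats) >= abs(obs_stat)

sum(extreme)               # number of simulated values at least as extreme
p_value <- mean(extreme)   # proportion out of all resamples
p_value
```

Here three of the ten values (-0.30, 0.35, and -0.40) are at least as far from 0 as the observed statistic, giving a proportion of 0.3.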
Now that you’ve completed a few exercises, pause and render your document. If it renders without any issues, great! Move on to the next exercise. If it does not, debug the issue before moving on. Ask for help if you need it. Do not proceed without rendering your document.
Part 3: Heart transplants - survtime
Exercise 9
Recreate the side-by-side box plots in Exercise 3. The variable of interest is survtime, which is the number of days patients were alive after the date they were determined to be a candidate for a heart transplant until the termination date of the study.
Exercise 10
Do these data provide convincing evidence of a difference between average survival times of patients who did and did not get a heart transplant?
Write the hypotheses for answering this question and conduct the hypothesis test using randomization. Use 1,000 resamples. In order to avoid name collisions, use object names like obs_diff_survtime, null_dist_survtime, and p_value_survtime. Find and interpret the p-value in the context of the data and the research question and comment on the statistical discernibility of this result using a discernibility level of 5%.
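For a numerical response, the same pipeline used in Part 2 can be adapted by dropping the success argument and switching the statistic to "diff in means". The sketch below shows this on a tiny made-up data frame; the toy values, the seed, and the object names ending in _toy are assumptions for illustration, and the results say nothing about the real data.

```r
library(infer)

# Made-up data for illustration only (not the heart transplant data)
toy <- data.frame(
  transplant = c("treatment", "treatment", "treatment", "control", "control"),
  survtime   = c(10, 20, 30, 5, 15)
)

# Observed difference in mean survival time, treatment minus control
obs_diff_toy <- toy |>
  specify(response = survtime, explanatory = transplant) |>
  calculate(stat = "diff in means", order = c("treatment", "control"))

# Null distribution from 1,000 permutations under independence
set.seed(2024)  # arbitrary seed for this sketch
null_dist_toy <- toy |>
  specify(response = survtime, explanatory = transplant) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in means", order = c("treatment", "control"))

# Two-sided p-value comparing the observed statistic to the null distribution
p_value_toy <- null_dist_toy |>
  get_p_value(obs_stat = obs_diff_toy, direction = "two-sided")

obs_diff_toy
```

In the toy data the group means are 20 and 10, so the observed difference is 10.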
Wrap up
Submitting
Before you proceed, first, make sure that you have updated the document YAML with your name! Then, render your document one last time, for good measure.
To submit your assignment to Gradescope:
Go to your Files pane and check the box next to the PDF output of your document (lab-5.pdf).
Then, in the Files pane, go to More > Export. This will download the PDF file to your computer. Save it somewhere you can easily locate, e.g., your Downloads folder or your Desktop.
Go to the course Canvas page and click on Gradescope and then click on the assignment. You’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
If you fail to mark the pages associated with an exercise, that exercise won’t be graded. This means, if you fail to mark the pages for all exercises, you will receive a 0 on the assignment. The TA can’t mark your pages for you, and for them to be able to grade, you must mark them.
Grading
Exercise | Points |
---|---|
Exercise 1 | 5 |
Exercise 2 | 4 |
Exercise 3 | 6 |
Exercise 4 | 4 |
Exercise 5 | 4 |
Exercise 6 | 6 |
Exercise 7 | 5 |
Exercise 8 | 4 |
Exercise 9 | 5 |
Exercise 10 | 7 |
Total | 50 |