Inference for two independent means

Warm up

Announcements

  • Tomorrow we will talk about final report formatting
  • Monday: presentation day
  • Wednesday: project report due

Inference for comparing two means

Today’s focus

  • Inference for comparing two population means

  • Specifically, making decisions via hypothesis tests

  • And generally, what in the world is a hypothesis test?!

  • Different than comparing two paired means:

    • Two sets of observations are paired if each observation in one set has a special correspondence or connection with exactly one observation in the other dataset.

Setup

library(tidyverse)
library(openintro)
library(tidymodels)
library(ggthemes)

Randomization test for comparing two means

  1. Combine the data from the two groups and randomly shuffle them into two groups of sizes equal to the original group sample sizes

  2. Calculate the means of each group and record the different

  3. Repeat steps 1 and 2 many times to build the null distribution

  4. Find the p-value as the number of simulations with simulated differences at least as extreme (in the direction of the alternative hypothesis) as the observed difference

Case study: Birth weights of babies and smoking

Every year, the United States Department of Health and Human Services releases to the public a large dataset containing information on births recorded in the country. This dataset has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. In this case study we work with a random sample of 1,000 cases from the dataset released in 2014. The distributions of birth weights of babies, measured in pounds, by mother’s smoking habit are shown below.

glimpse(births14)
Rows: 1,000
Columns: 13
$ fage           <int> 34, 36, 37, NA, 32, 32, 37, 29, 30, 29, 30, 34, 28, 28,…
$ mage           <dbl> 34, 31, 36, 16, 31, 26, 36, 24, 32, 26, 34, 27, 22, 31,…
$ mature         <chr> "younger mom", "younger mom", "mature mom", "younger mo…
$ weeks          <dbl> 37, 41, 37, 38, 36, 39, 36, 40, 39, 39, 42, 40, 40, 39,…
$ premie         <chr> "full term", "full term", "full term", "full term", "pr…
$ visits         <dbl> 14, 12, 10, NA, 12, 14, 10, 13, 15, 11, 14, 16, 20, 15,…
$ gained         <dbl> 28, 41, 28, 29, 48, 45, 20, 65, 25, 22, 40, 30, 31, NA,…
$ weight         <dbl> 6.96, 8.86, 7.51, 6.19, 6.75, 6.69, 6.13, 6.74, 8.94, 9…
$ lowbirthweight <chr> "not low", "not low", "not low", "not low", "not low", …
$ sex            <chr> "male", "female", "female", "male", "female", "female",…
$ habit          <chr> "nonsmoker", "nonsmoker", "nonsmoker", "nonsmoker", "no…
$ marital        <chr> "married", "married", "married", "not married", "marrie…
$ whitemom       <chr> "white", "white", "not white", "white", "white", "white…

Note that there are some NAs in the habit variable.

births14 |>
  count(habit)
# A tibble: 3 × 2
  habit         n
  <chr>     <int>
1 nonsmoker   867
2 smoker      114
3 <NA>         19

Let’s drop those since we can’t use those observations in our analysis.

births14 <- births14 |>
  drop_na(habit)
births14 |>
  ggplot(aes(x = weight, color = habit, fill = habit)) +
  geom_density(alpha = 0.5)

Randomization test for comparing two means

Exercise 1

Set the hypotheses for testing if there is a difference between mean birth weight of babies born to mothers who are smokers and those born to mothers who are not smokers.

Add answer here.

Exercise 2

Calculate the observed difference between the mean birth weight of babies born to mothers who are smokers and those born to mothers who are not smokers in this sample.

# add code here

Exercise 3

Suppose the birth weights of the babies in this sample are written on pieces of paper. Explain how you would conduct the randomization test tactically.

Add answer here.

Exercise 4

Construct and visualize the null distribution. Based on your visualization, speculate on whether the p-value will be small or large.

# add code here

Add answer here.

Exercise 5

Visualize and calculate the p-value. At the 5% discernability level, what is the conclusion of the hypothesis test?

# add code here

Add answer here.

Bootstrap interval for the difference of two means

Exercise 6

Construct and interpret confidence interval at the equivalent level as the previous hypothesis test for the difference between the mean weight of babies born to mothers who are not smokers and those who are born to mothers who are smokers.

# add code here

Add response here.

Inference using mathematical models for comparing two means

The two conditions necessary to model the difference in sample means using \(t\)-distribution are: - Independent observations both, within and between samples. - Large sample size in each group and no extreme outliers.

When null hypothesis is true and the conditions are met, we can use a \(t\)-distribution with \(df = \min(n_1-1, n_2-1)\).

Mathematical model for estimating the difference in means using confidence intervals

The \(t\)-distribution can be used for inference when working with the standardized difference of two means if - The data are independent within and between the two groups - We check the outliers for each group separately.

Then, the margin or error is \(t^*_{df}\times \sqrt{\frac{s_1^2}{n_1} - \frac{s_2^2}{n_2}}\).

Exercise 7

Check that the technical conditions for conducting inference using mathematical models for comparing two means are met for these data.

Add response here.

Exercise 8

Conduct a hypothesis test using mathematical models for the hypotheses you set in Section 3.1.

# add code here