Inference for a proportion

Warm up

Announcements

  • Lab 6 due 11:59 pm
  • Intro and EDA due 11:59 pm
  • Email your reviewer!

Inference for a proportion

Proportions

If the parameter of interest is a single population proportion, what type of and how many variables are being studied?

Packages

We’ll use the tidyverse and tidymodels packages.

library(tidyverse)
library(tidymodels)

Data

In a survey conducted by Survey USA between September 30, 2023 and October 3, 2023, 2759 registered voters from all 50 US states were asked

America will hold an election for President of the United States next November. Not everyone makes the time to vote in every election. Which best describes you? Are you certain to vote? Will you probably vote? Are the chances you will vote about 50/50? Or will you probably not vote?

The data from this survey: voting-survey.csv.

voting_survey <- read_csv("voting-survey.csv")
Rows: 2759 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): vote
dbl (1): id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exercise 1

What, if anything, do you know about voter turnout in the US?

~ 60% last presidential election.

Exercise 2

Load the data and visualize the distribution of responses. Also calculate the proportion of respondents who are certain to vote in the next presidential election.

ggplot(voting_survey, aes(y = vote)) + 
  geom_bar()

voting_survey |>
  count(vote) |>
  mutate(p_hat = n/sum(n)) |>
  filter(vote == "Certain to vote")
# A tibble: 1 × 3
  vote                n p_hat
  <chr>           <int> <dbl>
1 Certain to vote  1921 0.696

Estimation

Based on these data, we want to estimate the true proportion of registered US voters who are certain to vote in the next presidential election.

Exercise 3

What is the parameter of interest?

Proportion of registered voters who are certain to vote in the next presidential election.

Exercise 4

Estimate using a 95% bootstrap interval.

voting_survey <- voting_survey |>
  mutate(vote_certain = if_else(
    vote == "Certain to vote", 
    "Certain to vote",
    "Not certain to vote"
  ))

obs_stat <- voting_survey |>
  specify(response = vote_certain, success = "Certain to vote") |>
  calculate(stat = "prop")

set.seed(1234)

boot_dist <- voting_survey |>
  specify(response = vote_certain, success = "Certain to vote") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "prop")

ci <- boot_dist |>
  get_ci()

visualise(boot_dist) +
  shade_ci(ci)

ci
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.678    0.714

Exercise 5

L = 0.6781443 ; U = 0.7140268

Suppose the bounds of this interval are L = 0.678 and U = 0.714, your friend interprets the interval as

95% of the time, the true proportion of proportion of registered US voters who are certain to vote in the next presidential election is between L and U.

Comment on this interpretation. Is it correct? If not, how would ix it?

We are 95% confident that 67.8% to 71.4% of all registered voters are certain to vote in the next presidential election.

Testing

A newspaper claims that more than 60% of registered US are certain to vote in the next presidential election and cites this study as evidence. Do these data provide convincing evidence for this claim?

Exercise 6

What are the hypotheses?

\(H_0\) 60% of registered US voters are certain to vote ($p =0.6$)

\(H_A\) More than 60% of registered US voters are certain to vote ($p > 0.6$).

Exercise 7

Conduct a randomization test, at 5% discernability level, for this claim. What is the conclusion of the test?

set.seed(1234)

null_dist <- voting_survey |>
  specify(response = vote_certain, success = "Certain to vote") |>
  hypothesise(null = "point", p = 0.6) |>
  generate(reps = 1000, type = "draw") |>
  calculate(stat = "prop")


null_dist |>
  get_p_value(obs_stat, direction = "greater")
Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
# A tibble: 1 × 1
  p_value
    <dbl>
1       0
visualize(null_dist) +
  shade_p_value(obs_stat, direction = "greater")
Warning in min(diff(unique_loc)): no non-missing arguments to min; returning
Inf

The p-value is smaller than 0.001, which is smaller than 0.05 (disernability level), so we reject the null hypothesis. The data provides evidence that the true proportion of US voters certain to vote is greater than 60%.

Exercise 8

Suppose the p-value you found is approx 0, and your friends are in disagreement about the interpretation about this value. One friend claims:

The probability that 60% of all of registered US voters are certain to vote in the next presidential election is approx 0.

Another friend claims:

The probability that more than 60% of all of registered US voters are certain to vote in the next presidential election is approx 0.

Who is right? Explain your reasoning.

Both are wrong. P-value does not give us a probability that \(H_0\) is true or the probability that \(H_A\) is true.

Conceptual

Exercise 9

What is \(p\) vs. \(\hat{p}\) vs. p-value. Explain generically as well as in the context of these data and research question.

Add response here.

Exercise 10

What is sampling distribution vs. bootstrap distribution vs. null distribution? Explain generically as well as in the context of these data and research question.

Add response here.