library(tidyverse)
library(tidymodels)
Inference for a proportion
Warm up
Announcements
- Lab 6 due 11:59 pm
- Intro and EDA due 11:59 pm
- Email your reviewer!
Inference for a proportion
Proportions
If the parameter of interest is a single population proportion, what type of and how many variables are being studied?
Packages
We’ll use the tidyverse and tidymodels packages.
Data
In a survey conducted by Survey USA between September 30, 2023 and October 3, 2023, 2759 registered voters from all 50 US states were asked
America will hold an election for President of the United States next November. Not everyone makes the time to vote in every election. Which best describes you? Are you certain to vote? Will you probably vote? Are the chances you will vote about 50/50? Or will you probably not vote?
The data from this survey: voting-survey.csv
.
<- read_csv("voting-survey.csv") voting_survey
Rows: 2759 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): vote
dbl (1): id
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Exercise 1
What, if anything, do you know about voter turnout in the US?
~ 60% last presidential election.
Exercise 2
Load the data and visualize the distribution of responses. Also calculate the proportion of respondents who are certain to vote in the next presidential election.
ggplot(voting_survey, aes(y = vote)) +
geom_bar()
|>
voting_survey count(vote) |>
mutate(p_hat = n/sum(n)) |>
filter(vote == "Certain to vote")
# A tibble: 1 × 3
vote n p_hat
<chr> <int> <dbl>
1 Certain to vote 1921 0.696
Estimation
Based on these data, we want to estimate the true proportion of registered US voters who are certain to vote in the next presidential election.
Exercise 3
What is the parameter of interest?
Proportion of registered voters who are certain to vote in the next presidential election.
Exercise 4
Estimate using a 95% bootstrap interval.
<- voting_survey |>
voting_survey mutate(vote_certain = if_else(
== "Certain to vote",
vote "Certain to vote",
"Not certain to vote"
))
<- voting_survey |>
obs_stat specify(response = vote_certain, success = "Certain to vote") |>
calculate(stat = "prop")
set.seed(1234)
<- voting_survey |>
boot_dist specify(response = vote_certain, success = "Certain to vote") |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = "prop")
<- boot_dist |>
ci get_ci()
visualise(boot_dist) +
shade_ci(ci)
ci
# A tibble: 1 × 2
lower_ci upper_ci
<dbl> <dbl>
1 0.678 0.714
Exercise 5
L = 0.6781443 ; U = 0.7140268
Suppose the bounds of this interval are L = 0.678 and U = 0.714, your friend interprets the interval as
95% of the time, the true proportion of proportion of registered US voters who are certain to vote in the next presidential election is between L and U.
Comment on this interpretation. Is it correct? If not, how would ix it?
We are 95% confident that 67.8% to 71.4% of all registered voters are certain to vote in the next presidential election.
Testing
A newspaper claims that more than 60% of registered US are certain to vote in the next presidential election and cites this study as evidence. Do these data provide convincing evidence for this claim?
Exercise 6
What are the hypotheses?
\(H_0\) 60% of registered US voters are certain to vote ($p =0.6$)
\(H_A\) More than 60% of registered US voters are certain to vote ($p > 0.6$).
Exercise 7
Conduct a randomization test, at 5% discernability level, for this claim. What is the conclusion of the test?
set.seed(1234)
<- voting_survey |>
null_dist specify(response = vote_certain, success = "Certain to vote") |>
hypothesise(null = "point", p = 0.6) |>
generate(reps = 1000, type = "draw") |>
calculate(stat = "prop")
|>
null_dist get_p_value(obs_stat, direction = "greater")
Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
# A tibble: 1 × 1
p_value
<dbl>
1 0
visualize(null_dist) +
shade_p_value(obs_stat, direction = "greater")
Warning in min(diff(unique_loc)): no non-missing arguments to min; returning
Inf
The p-value is smaller than 0.001, which is smaller than 0.05 (disernability level), so we reject the null hypothesis. The data provides evidence that the true proportion of US voters certain to vote is greater than 60%.
Exercise 8
Suppose the p-value you found is approx 0, and your friends are in disagreement about the interpretation about this value. One friend claims:
The probability that 60% of all of registered US voters are certain to vote in the next presidential election is approx 0.
Another friend claims:
The probability that more than 60% of all of registered US voters are certain to vote in the next presidential election is approx 0.
Who is right? Explain your reasoning.
Both are wrong. P-value does not give us a probability that \(H_0\) is true or the probability that \(H_A\) is true.
Conceptual
Exercise 9
What is \(p\) vs. \(\hat{p}\) vs. p-value. Explain generically as well as in the context of these data and research question.
Add response here.
Exercise 10
What is sampling distribution vs. bootstrap distribution vs. null distribution? Explain generically as well as in the context of these data and research question.
Add response here.