By the end of this lab you will
For all visualizations you create, be sure to include informative titles for the plot, axes, and legend!
Getting started
Go to Duke Container Manager and create a new folder titled **lab-8 - Inference for proportions*.
Download the files from Canvas and upload them into the new folder you created in step 1.
Complete the exercises in this document.
Packages
We’ll use the following packages in this lab:
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::%||%() masks base::%||%()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom 1.0.5 ✔ rsample 1.2.0
✔ dials 1.2.0 ✔ tune 1.1.2
✔ infer 1.0.5.9000 ✔ workflows 1.1.3
✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
✔ parsnip 1.1.1 ✔ yardstick 1.2.0
✔ recipes 1.0.8
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::%||%() masks base::%||%()
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter() masks stats::filter()
✖ recipes::fixed() masks stringr::fixed()
✖ dplyr::lag() masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step() masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/
Attaching package: 'dsbox'
The following object is masked from 'package:infer':
gss
Part 1: General Social Survey
The General Social Survey (GSS) is a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. The GSS has been conducted each year since 1972 by the National Opinion Research Center at the University of Chicago. The GSS collects data on demographic characteristics and attitudes of residents of the United States. The GSS sample is designed as a multistage stratified sample.
For the exercises in this part, we’ll use the gss16
data set from the dsbox
package. You can find out more about the dataset by inspecting its documentation, which you can access by running ?gss16
in the Console or using the Help menu in RStudio to search for gss16
. You can also find this information here.
In 2016, the GSS added a new question on harassment at work. The question is phrased as the following.
Over the past five years, have you been harassed by your superiors or co-workers at your job, for example, have you experienced any bullying, physical or psychological abuse?
Answers to this question are stored in the harass5
variable in gss16
set.
Exercise 1
Create a subset of the data that only contains Yes and No answers for the harassment question. How many responses chose each of these answers?
Exercise 2
Describe how bootstrapping can be used to estimate the proportion of all Americans who have been harassed by their superiors or co-workers at their job.
Exercise 3
- Calculate a 95% bootstrap confidence interval for the proportion of Americans who have been harassed by their superiors or co-workers at their job. Use 1,000 iterations when creating your bootstrap distribution. Set the seed to
1234
. Interpret this interval in context of the data. Make sure to visualize this interval as well.
- Where was your 95% confidence interval centered? Why does this make sense?
- Now, calculate 90% bootstrap confidence interval for the proportion of Americans who have been harassed by their superiors or co-workers at their job. Interpret this interval in context of the data. Is it wider or more narrow than the 95% confidence interval? Additionally, visualize the interval by layering it on the visualization (for the 95% confidence interval) from part (a). You can do this simply by adding another shading layer where the endpoints now come from the new 90% confidence interval. Use a different pair of colors for the
color
and fill
arguments, e.g., darkorange4
and darkorange1
.
- Now, suppose you created a bootstrap distribution with 50,000 simulations instead of 1,000. What would you expect to change (if anything)? Center of your confidence interval? Width of your confidence interval? Do not run this simulation, just conceptually think about it.
Now that you’ve completed the second part of the exercises, pause and render your document. If it renders without any issues, great! Move on to the next exercise. If it does not, debug the issue before moving on. Ask for help from your TA if you need. Do not proceed without rendering your document.
Part 2: Course scores
Students in large university course that is offered each semester average 32 (out of 50) on the midterm, with a standard deviation of 4. The scores are distributed normally.
Exercise 4
What is the probability a randomly selected student scored less than 28?
What is the probability a randomly selected student scored higher than than 35?
What score did only 20 percent of the class exceed?
Exercise 5
I pick 10 students at random from the class.
[Without looking at the next question] How does the standard error of their average score compare to the standard deviation of the population of all STA 101 students? Explain your reasoning.
The standard error of the mean can be calculated as \(SE = \frac{\sigma}{\sqrt{n}}\). What is the probability the average midterm score of these 10 students is less than 28?
What is the probability the average midterm score of these 10 students is higher than 35?
How do the probabilities you calculated in the previous two questions (less than 28 and higher than 35) compare to the probabilities you calculated in Exercise 4? Why?
Now that you’ve completed the first part of the exercises, pause and render your document. If it renders without any issues, great! Move on to the next exercise. If it does not, debug the issue before moving on. Ask for help from your TA if you need. Do not proceed without rendering your document.
Part 3: IMS exercises
The exercises in this section do not require code. Make sure to answer the questions in full sentences.
Wrap up
Submitting
Before you proceed, first, make sure that you have updated the document YAML with your name! Then, render your document one last time, for good measure.
To submit your assignment to Gradescope:
Go to your Files pane and check the box next to the PDF output of your document (lab-8.pdf
).
Then, in the Files pane, go to More > Export. This will download the PDF file to your computer. Save it somewhere you can easily locate, e.g., your Downloads folder or your Desktop.
Go to the course Canvas page and click on Gradescope and then click on the assignment. You’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the papers of your lab should be associated with at least one question (i.e., should be “checked”).
If you fail to mark the pages associated with an exercise, that exercise won’t be graded. This means, if you fail to mark the pages for all exercises, you will receive a 0 on the assignment. The TAs can’t mark your pages for you, and for them to be able to grade, you must mark them.
Grading
The lab will be graded out of 50 total points.
Exercise 1 |
2 |
Exercise 2 |
6 |
Exercise 3 |
10 |
Exercise 4 |
3 |
Exercise 5 |
4 |
Exercise 6 |
4 |
Exercise 7 |
4 |
Exercise 8 |
5 |
Exercise 9 |
4 |
Exercise 10 |
8 |
Total |
50 |