library(tidyverse)
library(tidymodels)
library(ggthemes)
Inference for two-way tables
Warm up
Announcements
- Peer review due Sunday at 11:59 pm
- Monday lab: I will talk to each of you individually for about 15 minutes about the project.
- Make sure to read chapters 16 and 17 in the text book (specifically mathematical model section)
Inference for two-way tables
So far:
- Single proportion: 1 categorical variable with 2 levels (“success” vs “failure”).
- Difference in proportions: 2 categorical variables with 2 levels each (“success” vs “failure” and group 1 vs group 2).
Suppose we have a categorical variable with 3 response levels. How do we define “success”?
What is a difference in proportions in this setting?
In such setting, a difference across two groups is not sufficient, and the proportion of “success” is not well defined if there are 3 or 4 or more possible response levels. The primary way to summarize categorical data where the explanatory and response variables both have 2 or more levels is through a two-way table.
Note that with two-way tables, there is not an obvious single parameter of interest. Instead, research questions usually focus on how the proportions of the response variable changes (or not) across the different levels of the explanatory variable.
Packages
We’ll use the tidyverse and tidymodels packages.
Data
A September 16-19, 2023, asked North Carolina voters, among other issues, about issues of equality and women’s progress. Specifically, one of the questions asked:
If you had to choose just one issue that you would like candidates to talk more about in the 2024 campaigns, what would that issue be?
Economy
Abortion/Reproductive rights
Immigration
Crime
Gun rights/restrictions
Something else
Don’t know
The survey also asked respondents’ party affiliation:
What political party do you most identify with?
Democrat
Republican
Unaffiliated
Other
The results of this survey are summarized in this report and the data is called candidate-talk.csv
.
Data
Exercise 1
Load the data.
= read_csv("candidate-talk.csv") candidate_talk
Exercise 2
Create a two-way table of the issues across parties and visualize the frequency distribution.
# add code here
Exercise 3
Which do you think should be the explanatory variable and which the response variable? Accordingly, create a visualization that shows the correct conditional probabilities.
# add code here
You should also be asking yourself: could we see these results due to chance alone if there really is no difference in the party affiliation, or is this in fact evidence that parties find different issues to be of interest?
Hypotheses
Exercise 4
State the hypotheses for evaluating whether the issue of choice is independent of party affiliation.
Add response here.
Expected counts in two way tables
While we would not expect the number of votes for each issue to be exactly the same across the parties, the proportion of responses seems substantially different across the three groups. In order to investigate whether the differences in rates is due to natural variability in people’s interest in issues or due to a treatment effect (i.e., party affiliation), we need to compute estimated counts for each cell in a two-way table.
From the data, we can compute the proportion of all survey participants who chose Economy
as their number 1 issue to be 325/801 = 0.4057.
If there is really no difference among the parties and 40.57% of voters were going to choose economy as the number 1 issue to discuss, how many of the 286 people in the Democrat
group would we have expected to choose economy as the number 1 issue?
We would repeat this calculation for each party and for each issue.
In general, expected counts for a two-way table may be computed using the row totals, column totals, and the table total. For instance, if there was no difference between the groups, then about 40.57% of each row should be in the Economy
.
The expected value of a cell of column issue i row party p is: \[\left(\frac{\text{column i total}}{\text{table total}}\right) (\text{row p total}).\]
The observed chi-squared statistic
The chi-squared test statistic for a two-way table is found by finding the ratio of how far the observed counts are from the expected counts, as compared to the expected counts, for every cell in the table. For each table count, compute: \[\frac{(\text{observed count - expected count})^2}{\text{expected count}}.\] For example, for democrat/economy cell, we would get \(\frac{(94 - 116.04)^2}{116.04} = 4.19\). Then, to find the chi-squared test statistic \(X^2\), we would add the values each cell gives us.
Testing
Exercise 5
Calculate the observed sample statistic.
# add code here
Exercise 6
Conduct the hypothesis test using randomization and visualize and report the p-value.
# add code here
Exercise 7
What is the conclusion of the hypothesis test?
Add response here.
Mathematical model
Turns out, the chi-squared test statistic follows a Chi-squared distribution when the null hypothesis is true and 2 conditions are satisfied: - Independent observations; - Large samples (5 expected counts in each cell).
For the two-way tables, the degrees of freedom is equal to \(df = (\text{number of rows}-1)\times(\text{number of columns}-1)\). In our example, the degrees of freedom parameter is \(df = (4-1)\times(7-1) = 18\).
Let’s find the p-value using mathematical model.