Inference for two-way tables

Warm up

Announcements

  • Peer review due Sunday at 11:59 pm
  • Monday lab: I will talk to each of you individually for about 15 minutes about the project.
  • Make sure to read chapters 16 and 17 in the text book (specifically mathematical model section)

Inference for two-way tables

So far:

  • Single proportion: 1 categorical variable with 2 levels (“success” vs “failure”).
  • Difference in proportions: 2 categorical variables with 2 levels each (“success” vs “failure” and group 1 vs group 2).
Note

Suppose we have a categorical variable with 3 response levels. How do we define “success”?

What is a difference in proportions in this setting?

In such setting, a difference across two groups is not sufficient, and the proportion of “success” is not well defined if there are 3 or 4 or more possible response levels. The primary way to summarize categorical data where the explanatory and response variables both have 2 or more levels is through a two-way table.

Note that with two-way tables, there is not an obvious single parameter of interest. Instead, research questions usually focus on how the proportions of the response variable changes (or not) across the different levels of the explanatory variable.

Packages

We’ll use the tidyverse and tidymodels packages.

library(tidyverse)
library(tidymodels)
library(ggthemes)

Data

A September 16-19, 2023, asked North Carolina voters, among other issues, about issues of equality and women’s progress. Specifically, one of the questions asked:

If you had to choose just one issue that you would like candidates to talk more about in the 2024 campaigns, what would that issue be?

  • Economy

  • Abortion/Reproductive rights

  • Immigration

  • Crime

  • Gun rights/restrictions

  • Something else

  • Don’t know

The survey also asked respondents’ party affiliation:

What political party do you most identify with?

  • Democrat

  • Republican

  • Unaffiliated

  • Other

The results of this survey are summarized in this report and the data is called candidate-talk.csv.

Data

Exercise 1

Load the data.

candidate_talk = read_csv("candidate-talk.csv")

Exercise 2

Create a two-way table of the issues across parties and visualize the frequency distribution.

# add code here

Exercise 3

Which do you think should be the explanatory variable and which the response variable? Accordingly, create a visualization that shows the correct conditional probabilities.

# add code here

You should also be asking yourself: could we see these results due to chance alone if there really is no difference in the party affiliation, or is this in fact evidence that parties find different issues to be of interest?

Hypotheses

Exercise 4

State the hypotheses for evaluating whether the issue of choice is independent of party affiliation.

Add response here.

Expected counts in two way tables

While we would not expect the number of votes for each issue to be exactly the same across the parties, the proportion of responses seems substantially different across the three groups. In order to investigate whether the differences in rates is due to natural variability in people’s interest in issues or due to a treatment effect (i.e., party affiliation), we need to compute estimated counts for each cell in a two-way table.

From the data, we can compute the proportion of all survey participants who chose Economy as their number 1 issue to be 325/801 = 0.4057.

Note

If there is really no difference among the parties and 40.57% of voters were going to choose economy as the number 1 issue to discuss, how many of the 286 people in the Democrat group would we have expected to choose economy as the number 1 issue?

We would repeat this calculation for each party and for each issue.

In general, expected counts for a two-way table may be computed using the row totals, column totals, and the table total. For instance, if there was no difference between the groups, then about 40.57% of each row should be in the Economy.

The expected value of a cell of column issue i row party p is: \[\left(\frac{\text{column i total}}{\text{table total}}\right) (\text{row p total}).\]

The observed chi-squared statistic

The chi-squared test statistic for a two-way table is found by finding the ratio of how far the observed counts are from the expected counts, as compared to the expected counts, for every cell in the table. For each table count, compute: \[\frac{(\text{observed count - expected count})^2}{\text{expected count}}.\] For example, for democrat/economy cell, we would get \(\frac{(94 - 116.04)^2}{116.04} = 4.19\). Then, to find the chi-squared test statistic \(X^2\), we would add the values each cell gives us.

Testing

Exercise 5

Calculate the observed sample statistic.

# add code here

Exercise 6

Conduct the hypothesis test using randomization and visualize and report the p-value.

# add code here

Exercise 7

What is the conclusion of the hypothesis test?

Add response here.

Mathematical model

Turns out, the chi-squared test statistic follows a Chi-squared distribution when the null hypothesis is true and 2 conditions are satisfied: - Independent observations; - Large samples (5 expected counts in each cell).

For the two-way tables, the degrees of freedom is equal to \(df = (\text{number of rows}-1)\times(\text{number of columns}-1)\). In our example, the degrees of freedom parameter is \(df = (4-1)\times(7-1) = 18\).

Let’s find the p-value using mathematical model.