Exploring categorical data

2020 Durham City and County Resident Survey

Sample survey questions - services

Sample survey questions - safety

Sample survey questions - demographics (Today’s focus)

The main question we’ll explore today is “What are the demographics and priorities of City of Durham residents?”

Goals

  • Getting familiar with survey data

  • Visualizing and summarizing categorical data

  • Making connections between variable types in a study and variable types in R

  • Exploring relationships between categorical variables

  • Improving visualizations for visual appeal and better communication

Packages

library(tidyverse)

Data

The data for this case study come from the 2020 Durham City and County Resident Survey.

First, let’s load the data:

durham <- read_csv("durham-2020.csv")
Rows: 803 Columns: 49
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): please_define_other_32_5, primary_language, please_define_other_34...
dbl (42): id, overall_quality_of_services_3_01, overall_quality_of_services_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Visualizing and summarizing categorical data

Exercise 1

How many rows and columns are in this dataset? Answer in a full sentence using inline code. What does each row represent and what does each column represent?

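There are 803 rows and 49 columns in this dataset. Each row represents one survey respondent, and each column represents a variable recorded for that respondent, such as the response to a survey question.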
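In a Quarto document, inline code is written as `r nrow(durham)` inside the sentence, so the numbers update if the data change. A minimal sketch of the same idea in a regular code chunk, assuming a small made-up data frame (toy is hypothetical and only stands in for durham):

```r
# Hypothetical stand-in for the survey data: 3 rows, 2 columns
toy <- data.frame(
  id = 1:3,
  primary_language = c("English", "Spanish", "English")
)

# Build the sentence from the data instead of typing the numbers by hand
sentence <- paste0(
  "There are ", nrow(toy), " rows and ", ncol(toy), " columns in this dataset."
)
sentence
```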

Exercise 2

The variables we’ll use in this analysis are as follows. Rename the variables to the updated names shown below.

Original name                              Updated name
primary_language                           primary_language
do_you_own_or_rent_your_current_resi_31    own_rent
would_you_say_your_total_annual_hous_35    income

durham <- durham |>
  rename(own_rent = do_you_own_or_rent_your_current_resi_31,
         income = would_you_say_your_total_annual_hous_35)
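The argument order in rename() is easy to mix up: the new name goes on the left and the existing name on the right. A small sketch on a made-up data frame (toy and its column names are hypothetical):

```r
library(dplyr)

# Hypothetical data frame with an unwieldy column name
toy <- data.frame(id = 1:2, do_you_own_or_rent = c(1, 2))

# rename() takes new_name = old_name; other columns are left untouched
toy_renamed <- toy |>
  rename(own_rent = do_you_own_or_rent)

names(toy_renamed)
```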

Exercise 3

What languages do Durham residents speak? Explore the primary_language variable, which records responses to the survey question “What is the primary language used in your household?”

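Most respondents (768 of 803) report English as the primary language used in their household. Spanish is the primary language for 18 respondents, 3 report another language, and 14 did not answer.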

durham |>
  ggplot(aes(x = primary_language)) +
  geom_bar()

durham |>
  count(primary_language)
# A tibble: 4 × 2
  primary_language     n
  <chr>            <int>
1 English            768
2 Other                3
3 Spanish             18
4 <NA>                14
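Counts are easier to compare once they are turned into proportions. The sketch below retypes the counts printed above into a small data frame (language_counts is a hypothetical name, not part of the original exercise) and computes relative frequencies, largest first:

```r
library(dplyr)

# Counts copied from the count() output above
language_counts <- data.frame(
  primary_language = c("English", "Other", "Spanish", NA),
  n = c(768, 3, 18, 14)
)

# Relative frequencies, sorted from most to least common
language_props <- language_counts |>
  mutate(prop = n / sum(n)) |>
  arrange(desc(n))

language_props
```

With the real data, the same result should follow from piping count(primary_language) directly into mutate().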

Exercise 4

Make similar bar plots of own_rent and income. What distinct values do these variables take?

durham |>
  ggplot(aes(x = income)) +
  geom_bar()
Warning: Removed 110 rows containing non-finite values (`stat_count()`).

durham |>
  ggplot(aes(x = own_rent)) +
  geom_bar()
Warning: Removed 2 rows containing non-finite values (`stat_count()`).

Exercise 5

The variables own_rent and income are both categorical, but they’re stored as numbers. In R, categorical variables are stored as factors. Recode these variables as factors with the as_factor() function.

durham <- durham |>
  mutate(income = as_factor(income),
         own_rent = as_factor(own_rent))
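To see what the conversion does, here is a minimal sketch on a made-up vector of codes (own_rent_codes is hypothetical). It assumes a forcats version in which as_factor() accepts numeric input:

```r
library(forcats)

# Hypothetical numeric codes like those stored in own_rent (NA = no answer)
own_rent_codes <- c(2, 1, 1, NA, 2)

own_rent_factor <- as_factor(own_rent_codes)

levels(own_rent_factor)  # the distinct codes, now stored as level labels
class(own_rent_factor)
```

The missing value stays missing: it is not turned into a level of its own.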

Exercise 6

Recreate the income bar plot from Exercise 4, improving it for both visual appeal and better communication of findings.

Add your answer here.

durham |>
  ggplot(aes(y = income, fill = income)) + 
  geom_bar(show.legend = FALSE) +
  scale_fill_viridis_d(na.value = "darkgray") +
  scale_y_discrete(
    labels = c(
      "1" = "Under $30,000",
      "2" = "$30,000 - $59,999",
      "3" = "$60,000 - $99,999",
      "4" = "$100,000 or more"
    )
  ) + 
  labs(
    x = "Count",
    y = NULL,
    title = "Would you say your total annual household income is ... "
  )

Exercise 7

Recreate the visualization from the previous exercise, but first calculate relative frequencies (proportions) of income (the marginal distribution) and plot the proportions instead of counts.

durham |> 
  count(income) |>
  mutate(prop = n / sum(n)) |>  # relative frequency of each income level (NA included)
  ggplot(aes(y = income, x = prop, fill = income)) +
  geom_col(show.legend = FALSE) +
  scale_fill_viridis_d(na.value = "darkgray") +
  scale_y_discrete(
    labels = c(
      "1" = "Under $30,000",
      "2" = "$30,000 - $59,999",
      "3" = "$60,000 - $99,999",
      "4" = "$100,000 or more"
    )
  ) + 
  labs(
    x = "Proportion",
    y = NULL,
    title = "Would you say your total annual household income is ... "
  )

Visualizing relationships

Exercise 8

Visualize and describe the relationship between income and home ownership of Durham residents.

Stretch goal: Customize the colors using named colors from http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf.

durham |>
  select(income, own_rent) |>
  drop_na() |>
  ggplot(aes(y = income, fill = own_rent)) + 
  geom_bar(position = "fill") +
  scale_y_discrete(
    labels = c(
      "1" = "Under $30,000",
      "2" = "$30,000 - $59,999",
      "3" = "$60,000 - $99,999",
      "4" = "$100,000 or more"
    )
  ) + 
  scale_fill_manual(
    values = c("1" = "cadetblue", "2" = "coral"), # can choose your own colors
    labels = c("1" = "Own", "2" = "Rent")
  ) +
  labs(
    x = "Proportion",
    y = "Would you say your total\n annual household income is...",
    fill = "Do you own\n or rent\n your current\n residence?",
    title = "Income v. home ownership of Durham residents"
  )

Exercise 9

Calculate the proportions of home owners for each income category of Durham residents. Describe the relationship between these two variables, this time with the actual values from the conditional distribution of home ownership based on income level.

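Home ownership generally increases with income. Among respondents who answered both questions, 36.2% of those with household income under $30,000 own their home, compared with 56.5% for $30,000 - $59,999, 55.2% for $60,000 - $99,999, and 93.0% for $100,000 or more.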

durham |>
  select(income, own_rent) |>
  drop_na() |>
  count(income, own_rent) |>
  group_by(income) |>
  mutate(prop = n/sum(n))
# A tibble: 8 × 4
# Groups:   income [4]
  income own_rent     n   prop
  <fct>  <fct>    <int>  <dbl>
1 1      1           51 0.362 
2 1      2           90 0.638 
3 2      1          105 0.565 
4 2      2           81 0.435 
5 3      1          107 0.552 
6 3      2           87 0.448 
7 4      1          160 0.930 
8 4      2           12 0.0698

Exercise 10

Stretch goal: Recode the levels of these two variables to be more informatively labeled and calculate the proportions from the previous exercise again.

durham <- durham |>
  mutate(
    income = case_when(
      income == "1" ~ "Under $30,000",
      income == "2" ~ "$30,000 - $59,999",
      income == "3" ~ "$60,000 - $99,999",
      income == "4" ~ "$100,000 or more"
    ),
    own_rent = if_else(own_rent == "1", "Own", "Rent")
  )



durham |>
  select(income, own_rent) |>
  drop_na() |>
  count(income, own_rent) |>
  group_by(income) |>
  mutate(prop = n/sum(n))
# A tibble: 8 × 4
# Groups:   income [4]
  income            own_rent     n   prop
  <chr>             <chr>    <int>  <dbl>
1 $100,000 or more  Own        160 0.930 
2 $100,000 or more  Rent        12 0.0698
3 $30,000 - $59,999 Own        105 0.565 
4 $30,000 - $59,999 Rent        81 0.435 
5 $60,000 - $99,999 Own        107 0.552 
6 $60,000 - $99,999 Rent        87 0.448 
7 Under $30,000     Own         51 0.362 
8 Under $30,000     Rent        90 0.638 
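The case_when() approach above converts income to a character vector, which is why the rows in the output sort alphabetically rather than by income level. One possible alternative, sketched here on a made-up factor (this is not the approach used in the exercise), is fct_recode() from forcats, which relabels the levels while keeping their order:

```r
library(forcats)

# Hypothetical factor with the same integer codes as income
income <- factor(c("1", "4", "2", "3", "1"), levels = c("1", "2", "3", "4"))

# fct_recode() takes new_label = old_level and keeps the level order
income_labeled <- fct_recode(
  income,
  "Under $30,000" = "1",
  "$30,000 - $59,999" = "2",
  "$60,000 - $99,999" = "3",
  "$100,000 or more" = "4"
)

levels(income_labeled)
```

Because the result is still a factor, tables and plots built from it would list the income groups from lowest to highest.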

Recap

Conceptual

Some of the terms we introduced are:

  • Marginal distribution: Distribution of a single variable.

  • Conditional distribution: Distribution of a variable conditioned on the values (or levels, in the context of categorical data) of another.
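To make the difference concrete, the sketch below retypes the counts from the Exercise 9 output into a small data frame (counts, marginal, and conditional are hypothetical names) and computes both kinds of distribution:

```r
library(dplyr)

# Counts copied from the Exercise 9 output (income and own_rent codes)
counts <- data.frame(
  income   = rep(c("1", "2", "3", "4"), each = 2),
  own_rent = rep(c("1", "2"), times = 4),
  n        = c(51, 90, 105, 81, 107, 87, 160, 12)
)

# Marginal distribution of income: add up over own_rent,
# then divide by the overall total
marginal <- counts |>
  group_by(income) |>
  summarize(n = sum(n)) |>
  mutate(prop = n / sum(n))

# Conditional distribution of own_rent given income:
# proportions are computed within each income level
conditional <- counts |>
  group_by(income) |>
  mutate(prop = n / sum(n)) |>
  ungroup()
```

The marginal proportions sum to 1 across all income levels, while the conditional proportions sum to 1 within each income level.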

R

In this application exercise we:

  • Defined factors – the data type that R uses for categorical variables, i.e., variables that can take on values from a finite set of levels.
  • Reviewed data import, visualization, and wrangling functions encountered before:
    • Import: read_csv(): Read data from a CSV (comma separated values) file
    • Visualization:
      • ggplot(): Create a plot using the ggplot2 package
      • aes(): Map variables from the data to aesthetic elements of the plot, generally passed as an argument to ggplot() or to geom_*() functions
      • geom_bar(): Represent data with bars, after calculating heights of bars under the hood (define only the x or y aesthetic)
      • labs(): Label x axis, y axis, legend for color of plot, title of plot, etc.
    • Wrangling:
      • mutate(): Mutate the data frame by creating a new column or overwriting one of the existing columns
      • count(): Count the number of observations for each level of a categorical variable (factor) or each distinct value of any other type of variable
      • group_by(): Perform each subsequent action once per group, where groups can be defined based on the levels of one or more variables
  • Introduced new data wrangling functions:
    • rename(): Rename columns in a data frame
    • as_factor(): Convert a variable to a factor
    • drop_na(): Drop rows that have NA in one or more specified variables
    • if_else(): Write logic for what happens if a condition is true and what happens if it’s not
    • case_when(): Write a generalized if_else() logic for more than one condition
  • Introduced new data visualization functions:
    • geom_col(): Represent data with bars (columns), for heights that have already been calculated (must define x and y aesthetics)
    • scale_fill_viridis_d(): Customize the discrete fill scale, using a color-blind friendly, ordinal discrete color scale
    • scale_y_discrete(): Customize the discrete y scale
    • scale_fill_manual(): Customize the fill scale by manually adjusting values for colors
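
As a reminder of how if_else() and case_when() from the list above differ, here is a small sketch on made-up codes (the vectors and the "Middle income" label are hypothetical):

```r
library(dplyr)

# Hypothetical codes: 1 = own, 2 = rent, NA = no answer
own_rent_codes <- c(1, 2, NA, 1)

# Two outcomes: if_else() is enough; a missing code stays missing
own_rent_label <- if_else(own_rent_codes == 1, "Own", "Rent")

# More than two outcomes: case_when() checks the conditions in order
income_codes <- c(1, 4, 2, NA)
income_label <- case_when(
  income_codes == 1 ~ "Under $30,000",
  income_codes == 4 ~ "$100,000 or more",
  !is.na(income_codes) ~ "Middle income"  # hypothetical catch-all label
)
```

Values that match none of the conditions in case_when(), including missing values here, come out as NA.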

Quarto

We also introduced chunk options for managing figure sizes:

  • fig-width: Width of figure
  • fig-asp: Aspect ratio of figure (height / width)
  • fig-height: Height of figure – but I recommend using fig-width and fig-asp, instead of fig-width and fig-height
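
A sketch of how these options might look at the top of a code chunk. The values 6 and 0.618 are illustrative assumptions, not settings taken from this exercise:

```r
#| fig-width: 6
#| fig-asp: 0.618

# The #| lines above only take effect inside a Quarto code chunk.
# The implied figure height is the width times the aspect ratio:
fig_width <- 6
fig_asp <- 0.618
fig_height <- fig_width * fig_asp  # in inches
```

Setting the width and aspect ratio means the figure keeps its shape if the width is changed later.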

Acknowledgements

This dataset was cleaned and prepared for analysis by Duke StatSci PhD student Sam Rosen.