library(tidyverse)
Exploring categorical data
2020 Durham City and County Resident Survey
Sample survey questions - services
Sample survey questions - safety
Sample survey questions - demographics (Today’s focus)
The main question we’ll explore today is “What are the demographics and priorities of City of Durham residents?”
Goals
Getting familiar with survey data
Visualizing and summarizing categorical data
Making connections between concepts of variable types in a study and variable types in R
Exploring relationships between categorical variables
Improving visualizations for visual appeal and better communication
Packages
Data
The data for this case study come from the 2020 Durham City and County Resident Survey.
First, let’s load the data:
durham <- read_csv("durham-2020.csv")
Rows: 803 Columns: 49
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): please_define_other_32_5, primary_language, please_define_other_34...
dbl (42): id, overall_quality_of_services_3_01, overall_quality_of_services_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Visualizing and summarizing categorical data
Exercise 1
How many rows and columns are in this dataset? Answer in a full sentence using inline code. What does each row represent and what does each column represent?
Add your answer here.
Exercise 2
The variables we’ll use in this analysis are as follows. Rename the variables to the updated names shown below.
| Original name | Updated name |
|---|---|
| primary_language | primary_language |
| do_you_own_or_rent_your_current_resi_31 | own_rent |
| would_you_say_your_total_annual_hous_35 | income |
# add code here
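As a rough illustration of the renaming pattern (not a solution for the survey data), the sketch below uses a made-up tibble, toy, with a hypothetical long column name. The one thing to remember is that rename() takes new_name = old_name pairs.

```r
library(tidyverse)

# Made-up data; the long column name is hypothetical, not from the survey
toy <- tibble(
  id = 1:3,
  a_very_long_survey_question_name_99 = c(1, 2, 1)
)

# rename() uses new_name = old_name; columns not mentioned are left alone
toy_renamed <- toy |>
  rename(short_name = a_very_long_survey_question_name_99)

names(toy_renamed)
```

Several columns can be renamed in one call by separating the pairs with commas.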
Exercise 3
What language do Durham residents speak (primary_language)?
What is the primary language used in your household?
Add your answer here.
# add code here
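For reference, a bar plot of a single categorical variable can be sketched as follows. The tibble toy and its language column are invented for illustration and do not reflect the survey responses.

```r
library(tidyverse)

# Made-up responses; not values from the Durham survey
toy <- tibble(
  language = c("English", "Spanish", "English", "Other", "English")
)

# geom_bar() counts the rows in each category and draws one bar per category
p <- ggplot(toy, aes(x = language)) +
  geom_bar() +
  labs(x = "Primary language (toy data)", y = "Count")
p

# count() returns the same tallies as a table
toy_counts <- toy |>
  count(language)
toy_counts
```

Looking at the count() table next to the plot is a quick way to check that the bars show what you expect.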
Exercise 4
Make similar bar plots of own_rent and income. What distinct values do these variables take?
# add code here
Exercise 5
The variables own_rent and income are both categorical, but they’re stored as numbers. In R, categorical data are called factors. Recode these variables as factors with the as_factor() function.
# add code here
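A minimal sketch of the numeric-to-factor conversion, using a made-up column of numeric codes (own_rent_code is a hypothetical name, and the codes 1 and 2 are arbitrary rather than taken from the survey codebook):

```r
library(tidyverse)

# Made-up numeric codes standing in for a categorical variable
toy <- tibble(own_rent_code = c(1, 2, 2, 1, 2))

# Before: the column is numeric, so R treats it as a quantity
is.numeric(toy$own_rent_code)

# as_factor() turns each distinct number into a level
toy <- toy |>
  mutate(own_rent_fct = as_factor(own_rent_code))

levels(toy$own_rent_fct)
```

For numeric input, as_factor() orders the levels numerically, so the level order matches the order of the codes.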
Exercise 6
Recreate the income bar plot from the previous exercise, improving it for both visual appeal and better communication of findings.
Add your answer here.
# add code here
Exercise 7
Recreate the visualization from the previous exercise, but first calculate relative frequencies (proportions) of income (the marginal distribution) and plot the proportions instead of counts.
# add code here
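The counts-to-proportions step can be sketched on toy data as follows. The tibble toy and its income labels are invented for illustration; the pattern is count() followed by dividing each count by the total.

```r
library(tidyverse)

# Made-up income categories; not values from the Durham survey
toy <- tibble(income = as_factor(c("low", "low", "mid", "high")))

# count() tallies each level; dividing by sum(n) gives relative frequencies
toy_props <- toy |>
  count(income) |>
  mutate(prop = n / sum(n))
toy_props

# The heights are already computed, so geom_col() is used instead of geom_bar()
p <- ggplot(toy_props, aes(x = income, y = prop)) +
  geom_col() +
  labs(x = "Income (toy data)", y = "Proportion")
p
```

The prop column sums to 1, which is a useful sanity check for any marginal distribution.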
Visualizing relationships
Exercise 8
Visualize and describe the relationship between income and home ownership of Durham residents.
Stretch goal: Customize the colors using named colors from http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf.
Add your answer here.
# add code here
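One possible way to visualize two categorical variables together is a filled bar plot, sketched here on invented data. The tibble toy, its labels, and the color choices are assumptions for illustration only; "steelblue" and "darkorange" are two of R's built-in named colors.

```r
library(tidyverse)

# Made-up data; not values from the Durham survey
toy <- tibble(
  income   = c("low", "low", "low", "high", "high"),
  own_rent = c("rent", "rent", "own", "own", "own")
)

# position = "fill" rescales each bar to 1, so bars show proportions
# of own vs. rent within each income level
p <- ggplot(toy, aes(y = income, fill = own_rent)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c(own = "steelblue", rent = "darkorange")) +
  labs(x = "Proportion", y = "Income (toy data)", fill = "Own or rent")
p
```

Because every bar has the same length, differences in group size are hidden and only the within-group split is compared.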
Exercise 9
Calculate the proportion of home owners within each income category of Durham residents. Describe the relationship between these two variables, this time with the actual values from the conditional distribution of home ownership based on income level.
Add your answer here.
# add code here
Exercise 10
Stretch goal: Recode the levels of these two variables to be more informatively labeled and calculate the proportions from the previous exercise again.
# add code here
Recap
Conceptual
Some of the terms we introduced are:
Marginal distribution: Distribution of a single variable.
Conditional distribution: Distribution of a variable conditioned on the values (or levels, in the context of categorical data) of another.
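To make the two terms concrete, here is a small sketch on invented data (the tibble toy and its values are hypothetical). The marginal distribution divides counts by the overall total, while the conditional distribution divides counts by the total within each level of the conditioning variable.

```r
library(tidyverse)

# Made-up data; not values from the Durham survey
toy <- tibble(
  income   = c("low", "low", "low", "high", "high"),
  own_rent = c("rent", "rent", "own", "own", "own")
)

# Marginal distribution of own_rent: ignores income entirely
marginal <- toy |>
  count(own_rent) |>
  mutate(prop = n / sum(n))
marginal

# Conditional distribution of own_rent given income:
# proportions are computed separately within each income level
conditional <- toy |>
  count(income, own_rent) |>
  group_by(income) |>
  mutate(prop = n / sum(n)) |>
  ungroup()
conditional
```

In this toy example 60% of all rows are owners, but the share of owners is 100% in the "high" group and about 33% in the "low" group, which is the kind of difference a conditional distribution reveals.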
R
In this application exercise we:
- Defined factors – the data type that R uses for categorical variables, i.e., variables that can take on values from a finite set of levels.
- Reviewed data import, visualization, and wrangling functions encountered before:
  - Import:
    - read_csv(): Read data from a CSV (comma separated values) file
  - Visualization:
    - ggplot(): Create a plot using the ggplot2 package
    - aes(): Map variables from the data to aesthetic elements of the plot, generally passed as an argument to ggplot() or to geom_*() functions
    - geom_bar(): Represent data with bars, after calculating heights of bars under the hood (define only the x or y aesthetic)
    - labs(): Label x axis, y axis, legend for color of plot, title of plot, etc.
  - Wrangling:
    - mutate(): Mutate the data frame by creating a new column or overwriting one of the existing columns
    - count(): Count the number of observations for each level of a categorical variable (factor) or each distinct value of any other type of variable
    - group_by(): Perform each subsequent action once per group, where groups can be defined based on the levels of one or more variables
- Introduced new data wrangling functions:
  - rename(): Rename columns in a data frame
  - as_factor(): Convert a variable to a factor
  - drop_na(): Drop rows that have NA in one or more specified variables
  - if_else(): Write logic for what happens if a condition is true and what happens if it’s not
  - case_when(): Write generalized if_else() logic for more than one condition
- Introduced new data visualization functions:
  - geom_col(): Represent data with bars (columns), for heights that have already been calculated (must define x and y aesthetics)
  - scale_fill_viridis_d(): Customize the discrete fill scale, using a color-blind friendly, ordinal discrete color scale
  - scale_y_discrete(): Customize the discrete y scale
  - scale_fill_manual(): Customize the fill scale by manually adjusting values for colors
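Several of the functions in this recap can be chained into one short pipeline. The sketch below is only an illustration on invented data: the codes in income_code and the labels "Low", "Middle", and "High" are hypothetical and do not come from the survey codebook.

```r
library(tidyverse)

# Made-up numeric codes with one missing value
toy <- tibble(income_code = c(1, 2, 3, NA, 2, 1, 2))

toy_counts <- toy |>
  drop_na(income_code) |>          # drop the row with a missing code
  mutate(
    income = case_when(            # one label per condition
      income_code == 1 ~ "Low",
      income_code == 2 ~ "Middle",
      income_code == 3 ~ "High"
    ),
    # Set the level order explicitly so bars follow Low < Middle < High
    income = factor(income, levels = c("Low", "Middle", "High"))
  ) |>
  count(income)
toy_counts

# Heights are precomputed in n, so geom_col() needs both x and y
p <- ggplot(toy_counts, aes(x = n, y = income, fill = income)) +
  geom_col(show.legend = FALSE) +
  scale_fill_viridis_d() +
  labs(x = "Count", y = "Income (toy labels)")
p
```

Setting the factor levels before plotting matters here because the viridis scale is ordinal: the color ramp only communicates order if the levels are in a meaningful sequence.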
Quarto
We also introduced chunk options for managing figure sizes:
- fig-width: Width of figure
- fig-asp: Aspect ratio of figure (height / width)
- fig-height: Height of figure – but I recommend using fig-width and fig-asp, instead of fig-width and fig-height
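In a Quarto document, these options are written as #| comments at the top of a code chunk. The sketch below shows the contents of an R chunk; the values 6 and 0.618 are arbitrary choices for illustration, and the toy data are invented.

```r
#| fig-width: 6
#| fig-asp: 0.618

library(tidyverse)

# Made-up data, used only to have something to plot
toy <- tibble(group = c("a", "b", "a"))

ggplot(toy, aes(x = group)) +
  geom_bar()
```

With these values the figure height works out to 6 × 0.618, or about 3.7 inches, and changing fig-width later rescales the height automatically.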
Acknowledgements
This dataset was cleaned and prepared for analysis by Duke StatSci PhD student Sam Rosen.