library(tidyverse)
Lab 3
The goal of this lab is to effectively visualize numerical and categorical data.
For all visualizations you create, be sure to include informative titles for the plot, axes, and legend!
Getting started
1. Go to Duke Container Manager and create a new folder titled lab-3 - Exploratory data analysis.
2. Download the files from Canvas and upload them into the new folder you created in step 1.
3. Complete the exercises in this document.
Packages
In this lab we will work with the tidyverse, a collection of packages for doing data analysis in a “tidy” way.
Part 1: Nobel laureates
The dataset for this lab can be found in the nobel.csv file in the data folder of your repository. You can read it in using the following.
nobel <- read_csv("nobel.csv")
The descriptions of the variables are as follows:
- id: ID number
- firstname: First name of laureate
- surname: Surname
- year: Year prize won
- category: Category of prize
- affiliation: Affiliation of laureate
- city: City of laureate in prize year
- country: Country of laureate in prize year
- born_date: Birth date of laureate
- died_date: Death date of laureate
- gender: Gender of laureate
- born_city: City where laureate was born
- born_country: Country where laureate was born
- born_country_code: Code of country where laureate was born
- died_city: City where laureate died
- died_country: Country where laureate died
- died_country_code: Code of country where laureate died
- overall_motivation: Overall motivation for recognition
- share: Number of other winners award is shared with
- motivation: Motivation for recognition
In a few cases the name of the city/country changed after the prize was awarded (e.g. in 1975 Bosnia and Herzegovina was part of the Socialist Federal Republic of Yugoslavia). In these cases the variables below reflect a different name than their counterparts without the suffix _original.
- born_country_original: Original country where laureate was born
- born_city_original: Original city where laureate was born
- died_country_original: Original country where laureate died
- died_city_original: Original city where laureate died
- city_original: Original city where laureate lived at the time of winning the award
- country_original: Original country where laureate lived at the time of winning the award
Exercise 1
How many observations and how many variables are in the dataset? Use inline code to answer this question. What does each row represent?
Exercise 2
There are some observations in this dataset that we will exclude from our analysis to match the Buzzfeed results.
Create a new data frame called nobel_living that filters for
- laureates for whom country is available: !is.na(country)
- laureates who are people as opposed to organizations, i.e., organizations are denoted with "org" as their gender: gender != "org"
- laureates who are still alive, i.e., their died_date is NA: is.na(died_date)
Confirm that once you have filtered for these characteristics you are left with a data frame with 228 observations, once again using inline code.
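To see how several conditions combine inside filter(), here is a minimal sketch on a small made-up tibble. The toy data frame and its rows are invented for illustration and are not taken from nobel.csv. Conditions separated by commas are combined with AND, so a row is kept only if every condition is TRUE for it.

```r
library(tidyverse)

# Hypothetical toy data, not the real nobel.csv
toy <- tibble(
  name      = c("A", "B", "C", "D"),
  country   = c("USA", NA, "France", "USA"),
  gender    = c("female", "male", "org", "male"),
  died_date = as.Date(c(NA, NA, NA, "2001-05-01"))
)

toy_living <- toy |>
  filter(
    !is.na(country),   # country is available
    gender != "org",   # people, not organizations
    is.na(died_date)   # no recorded death date
  )

# Only row "A" passes all three conditions
nrow(toy_living)
```

In a Quarto document, an expression such as nrow(toy_living) can also be placed in inline code so the number appears in the rendered text.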
Most living Nobel laureates were based in the US when they won their prizes
… says the Buzzfeed article. Let’s see if that’s true.
First, we’ll create a new variable to identify whether the laureate was in the US when they won their prize. We’ll use the mutate() function for this. The following pipeline mutates the nobel_living data frame by adding a new variable called country_us. We use an if statement to create this variable. The first argument in the if_else() function we’re using to write this if statement is the condition we’re testing for. If country is equal to "USA", we set country_us to "USA". If not, we set country_us to "Other".
nobel_living <- nobel_living |>
  mutate(
    country_us = if_else(country == "USA", "USA", "Other")
  )
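As a quick illustration of how if_else() evaluates element by element, here is a small sketch on a made-up character vector (the values are hypothetical). Note that a missing value in the condition yields a missing value in the result rather than "Other"; this does not matter for nobel_living, where rows with a missing country have already been filtered out.

```r
library(tidyverse)

# Hypothetical values for illustration
country <- c("USA", "Germany", NA, "USA")

# if_else(condition, value_if_true, value_if_false)
country_us <- if_else(country == "USA", "USA", "Other")

country_us
# "USA" "Other" NA "USA"
```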
Next, we will limit our analysis to only the following categories: Physics, Medicine, Chemistry, and Economics.
nobel_living_science <- nobel_living |>
  filter(category %in% c("Physics", "Medicine", "Chemistry", "Economics"))
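The %in% operator used above checks, for each value on its left, whether it appears anywhere in the vector on its right. A minimal sketch on made-up data (the toy tibble and the keep vector are names introduced here for illustration):

```r
library(tidyverse)

# Hypothetical toy data, not the real nobel.csv
toy <- tibble(
  name     = c("A", "B", "C", "D", "E"),
  category = c("Physics", "Peace", "Literature", "Economics", "Medicine")
)

keep <- c("Physics", "Medicine", "Chemistry", "Economics")

# One TRUE/FALSE per row
toy$category %in% keep
# TRUE FALSE FALSE TRUE TRUE

# Keep only the rows where the test is TRUE
toy_science <- toy |>
  filter(category %in% keep)
```

A category listed in keep that never occurs in the data (here "Chemistry") simply matches nothing and causes no error.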
For the following exercises, work with the nobel_living_science data frame you created above. This means you’ll need to define this data frame in your Quarto document, even though the next exercise doesn’t explicitly ask you to do so.
Exercise 3
Create a faceted bar plot visualizing the relationship between the category of prize and whether the laureate was in the US when they won the Nobel Prize. Interpret your visualization, and say a few words about whether the Buzzfeed headline is supported by the data.
- Your visualization should be faceted by category.
- For each facet you should have two bars, one for winners in the US and one for Other.
- Flip the coordinates so the bars are horizontal, not vertical.
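The three ingredients listed above (a bar plot, faceting, and flipped coordinates) can be sketched on an unrelated, made-up dataset. The shelter tibble and its variables are hypothetical and only illustrate the general pattern; the labels are placeholders.

```r
library(tidyverse)

# Hypothetical toy data, unrelated to the Nobel dataset
shelter <- tibble(
  species = c("cat", "cat", "cat", "dog", "dog", "dog", "dog"),
  adopted = c("Yes", "No", "Yes", "Yes", "Yes", "No", "Yes")
)

p <- ggplot(shelter, aes(x = adopted)) +
  geom_bar() +              # one bar per value of adopted, height = count
  facet_wrap(~species) +    # one panel per species
  coord_flip() +            # horizontal bars
  labs(
    title = "Adoption status by species",
    x = "Adopted",
    y = "Number of animals"
  )

p
```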
Exercise 4
Next, let’s investigate what proportion of the US-based Nobel laureates were born in other countries.
Create a new variable called born_country_us in nobel_living_science that has the value "USA" if the laureate was born in the US, and "Other" otherwise. How many of the winners were born in the US?
Exercise 5
Add a second variable to your visualization from Exercise 3 based on whether the laureate was born in the US or not.
Create two visualizations with this new variable added:
- Plot 1: Segmented frequency bar plot
- Plot 2: Segmented relative frequency bar plot (Hint: Add position = "fill" to geom_bar().)
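The difference between the two kinds of segmented bar plot can be sketched on an unrelated, made-up dataset (the shelter tibble is hypothetical). With the default position the bar heights are counts; with position = "fill" every bar is rescaled to height 1, so the segments show proportions within each bar.

```r
library(tidyverse)

# Hypothetical toy data, unrelated to the Nobel dataset
shelter <- tibble(
  species = c("cat", "cat", "cat", "dog", "dog", "dog", "dog"),
  adopted = c("Yes", "No", "Yes", "Yes", "Yes", "No", "Yes")
)

# Segmented frequency bar plot: default position is "stack"
p_count <- ggplot(shelter, aes(x = species, fill = adopted)) +
  geom_bar() +
  labs(x = "Species", y = "Number of animals", fill = "Adopted")

# Segmented relative frequency bar plot: bars rescaled to sum to 1
p_prop <- ggplot(shelter, aes(x = species, fill = adopted)) +
  geom_bar(position = "fill") +
  labs(x = "Species", y = "Proportion of animals", fill = "Adopted")
```

The count version makes group sizes visible, while the proportion version makes the split within each group easier to compare across groups of different sizes.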
Here are some instructions that apply to both of these visualizations:
- Your final visualization should contain a facet for each category.
- Within each facet, there should be two bars for whether the laureate won the award in the US or not.
- Each bar should have segments for whether the laureate was born in the US or not.
Which of these visualizations is a better fit for answering the following question: “Do the data appear to support Buzzfeed’s claim that of those US-based Nobel laureates, many were born in other countries?” First, state which plot you’re using to answer the question. Then, answer the question, explaining your reasoning in 1-2 sentences.
Exercise 6
In a single pipeline, filter the nobel_living_science data frame for laureates who won their prize in the US, but were born outside of the US, and then create a frequency table (with the count() function) for their birth country (born_country) and arrange the resulting data frame in descending order of number of observations for each country. Which country is the most common?
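The filter, count, arrange pattern described above can be sketched on an unrelated, made-up dataset. The trips tibble and its columns are hypothetical names used only for illustration.

```r
library(tidyverse)

# Hypothetical toy data, unrelated to the Nobel dataset
trips <- tibble(
  city    = c("Lyon", "Oslo", "Lyon", "Kyoto", "Lyon", "Oslo", "Lyon"),
  visited = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
)

city_counts <- trips |>
  filter(visited) |>     # keep only rows meeting the condition
  count(city) |>         # one row per city, with the count in column n
  arrange(desc(n))       # most frequent first

city_counts
```

count() stores the frequencies in a column named n by default, which is why arrange() sorts on desc(n). Writing count(city, sort = TRUE) gives the same ordering in one step.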
Part 2: IMS Exercises
The exercises in this section do not require code. Make sure to answer the questions in full sentences.
Exercise 7
IMS - Chapter 4 exercises, #4: Raise taxes.
Exercise 8
IMS - Chapter 5 exercises, #4: Office productivity.
Exercise 9
IMS - Chapter 5 exercises, #16: Distributions and appropriate statistics.
Exercise 10
IMS - Chapter 5 exercises, #26: NYC marathon winners.
Wrap up
Submitting
Before you proceed, first, make sure that you have updated the document YAML with your name! Then, render your document one last time, for good measure.
To submit your assignment to Gradescope:
Go to your Files pane and check the box next to the PDF output of your document (lab-3.pdf). Then, in the Files pane, go to More > Export. This will download the PDF file to your computer. Save it somewhere you can easily locate, e.g., your Downloads folder or your Desktop.
Go to the course Canvas page and click on Gradescope and then click on the assignment. You’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
If you fail to mark the pages associated with an exercise, that exercise won’t be graded. This means, if you fail to mark the pages for all exercises, you will receive a 0 on the assignment. The TA can’t mark your pages for you, and for them to be able to grade, you must mark them.
Grading
Exercise | Points |
---|---|
Exercise 1 | 3 |
Exercise 2 | 4 |
Exercise 3 | 8 |
Exercise 4 | 6 |
Exercise 5 | 8 |
Exercise 6 | 8 |
Exercise 7 | 2 |
Exercise 8 | 2 |
Exercise 9 | 5 |
Exercise 10 | 4 |
Total | 50 |