Project

The deliverable for this project (what you will turn in) is a written report and a presentation. See below for more details.

Timeline

Proposal due Tue, May 28 at 11:59 pm

Intro + EDA due Wed, Jun 12 at 11:59 pm

Peer review due Sun, Jun 16 at 11:59 pm

Presentation due Mon, Jun 24 in class

Final report due Wed, Jun 26 at 11:59 pm

Description

TL;DR: Ask a question you’re curious about and answer it with a dataset of your choice. This is your project in a nutshell.

May be too long, but please do read

The final project for this class will consist of analysis on a dataset of your own choosing. The dataset may already exist, or you may collect your own data using a survey or by conducting an experiment. You can choose the data based on your interests or based on work in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a novel dataset in a meaningful way.

The goal is not to do an exhaustive data analysis i.e., do not calculate every statistic and procedure you have learned for every variable, but rather let me know that you are proficient at asking meaningful questions and answering them with results of data analysis, that you are proficient in using R, and that you are proficient at interpreting and presenting the results. Focus on methods that help you begin to answer your research questions. You do not have to apply every statistical procedure we learned. Also, critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data, and appropriateness of the statistical analysis should be discussed here.

The project is very open ended. You should

  • create some kind of compelling visualization(s) of this data, and
  • perform statistical inference and/or fit a model or descriptive or predictive purposes

There is no limit on what tools or packages you may use but sticking to packages we learned in class is required. You do not need to visualize all of the data at once. A single high-quality visualization will receive a much higher grade than a large number of poor-quality visualizations. Also pay attention to your presentation. Neatness, coherency, and clarity will count. All analyses must be done in RStudio, using R, and all components of the project must be reproducible (with the exception of the presentation).

You will work on the project alone.

The five primary deliverables for the final project are

  1. A project proposal due Tuesday, May 28th at 11:59 pm.
  2. Intro and EDA due Wednesday, June 12th at 11:59 pm + email to classmate with pdf of intro for peer review.
  3. Formal peer review of another classmate’s project in your lab section due Sunday, June 16th at 11:59 pm.
  4. A presentation in class on Monday, June 24th (slides submitted on Gradescope before class).
  5. A reproducible project writeup of your analysis submitted on Gradescope by Wednesday, June 26th at 11:59 pm.

Proposal (10 pts)

The main purpose of the project proposal is to help you think about the project thoroughly before you go too far down a path and form an analysis plan that is feasible and that will allow you to be successful for this project.

You will: - Identify two (or three) data sets and a research questions for the final project to choose from. If you’re unsure where to find data, you can use the list of potential data sources in the Resources for datasets section as a starting point. It may also help to think of topics you’re interested in investigating and find data sets on those topics. - Submit the summary of each dataset and research question in order of which one you’d want to work on the most. I will then look through these and help you choose which data I think is better for the project.

Criteria for datasets

The data sets should meet the following criteria:

  • At least 100 observations

  • At least 5 columns

  • At least 4 of the columns must be useful and unique explanatory variables.

    • Identifier variables such as “name”, “social security number”, etc. are not useful explanatory variables.
    • If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique explanatory variables.
  • You may not use data that has previously been used in course, or any derivation of data that has been used in course materials (this or previous semesters of STA101).

Tip

Please ask a member of the teaching team if you’re unsure whether your data set meets the criteria.

If you set your hearts on a dataset that has fewer observations or variables than what’s suggested here, that might still be ok; use these numbers as guidance for a successful proposal, not as minimum requirements.

Resources for datasets

You can find data wherever you like, but here are some recommendations to get you started. You shouldn’t feel constrained to datasets that are already in a tidy format, you can start with data that needs cleaning and tidying, scrape data off the web, or collect your own data.

Proposal grading

Each component will be graded as follows:

  • Meets expectations (full credit): Two (or more) data sets and research questions are identified. There is a plan for completing the project as envisioned.

  • Close to expectations (half credit): Only one data set and/or research question identified. The plan for completing the project as envisioned is not well designed.

  • Does not meet expectations (no credit): No dataset or research question identified. There is no plan for completing the project as envisioned.

Even if you earn full credit, it may not mean that your proposal is perfect.

Important

You will only be working with one of the data sets for your project. I ask you to find two or more to have a back-up plan in case I don’t think the first data set will result in a successful project.

Intro + EDA (10 pts)

There are three reasons behind this project check-in:

  • To ensure you have incorporated suggestions given to you during project proposal stage.

  • To get a head-start on the project and receive advice on how to improve before final submission.

  • To prepare an introduction file for a peer review.

You will begin writing the first part of you report. Your report must be written using Quarto. The template will be provided.

Introduction

The introduction provides motivation and context for your research.

To begin, introduce the data set in a few short sentences. Next, create a code book (aka a “data dictionary”) of the variables in the data set. Only include in your report a code book of the variables that you use.

Complete the introduction by providing a concise, clear statement of your research question and hypotheses. Be sure to motivate why the research question is interesting/useful.

Example research question and hypotheses (if we were predicting penguin weights instead of baby weights):

Can we predict body mass with bill depth? We hypothesize that penguins with deeper bills will also have more mass.

EDA

Include any preliminary summary statistics or figures you use to explore the data. Present plots and tables that give a clear idea of the main variables of interest in your data set. If you have many variables, focus on the ones that are most relevant to your research question. Tables and figures should be well formatted with clear labels and descriptions (axes should be labeled with units, and variable names should not be used (e.g., “Height (in)” not “ht_in”). You should organize this section as follows:

  • Primary relationship of interest: Present descriptive statistics or exploratory plots in whichever format you think is best (tables, figures) for your primary relationship of interest (outcome variable and primary independent variable). Describe your findings (no more than 3 plots/tables).

  • Briefly describe other variables in the data. If there are many, do not list them all. Rather, describe the types of variables that are present (e.g., “demographic information”). You do not need to provide plots for these variables.

  • Describe which variable(s) in your data set have missing data, if any.

Tip

If you are not sure if your plots/tables are satisfactory, or if you think you need more than 3 plots/tables, come talk to me.

Intro + EDA grading

You will be graded on whether you incorporated the feedback from project proposal submission as well as whether the requirements listed above have been completed. Even if you earn full credit, it may not mean that your submission is perfect. I expect you to incorporate the feedback in the final report submission.

Peer review (10 pts)

Critically reviewing others’ work is a crucial part of the scientific process, and STA 101 is no exception. You will be assigned one classmate to review. This feedback is intended to help you create a high quality final project, as well as give you experience reading and constructively critiquing the work of others.

At the beginning of the peer feedback process, you will be asked to email your project (PDF of your introduction + EDA) to one classmate (cc me (Kat)) by Wednesday, June 12th at 11:59 pm. The code should be visible in the PDF you share with your reviewer. Then, you will read and provide feedback in the form of an email (with me (Kat) cc’ed) to the peer whose work you’re reviewing by Sunday, June 16th at 11:59 pm, answering the following questions:

  • Peer review by: [NAME OF PEER DOING THE REVIEW]
  • Describe the goal of the project.
  • Describe the data used or collected, if any. If the introduction does not include the use of a specific data set, comment on whether the project would be strengthened by the inclusion of a data set.
  • Provide constructive feedback on how your might be able to improve their project. Make sure your feedback includes at least one comment on the statistical reasoning aspect of the project, but do feel free to comment on aspects beyond the reasoning as well.
  • What aspect of this project are you most interested in and would like to see highlighted in the presentation.
  • Provide constructive feedback on any issues with report or code organization.
  • What have you learned from this project that you are considering implementing in your own project?
  • (Optional) Any further comments or feedback?

Peer reviews will be graded on the extent to which it comprehensively and constructively addresses the components of the reviewee’s report.

Important

Only introduction and EDA sections will likely be completed. Your review should address these sections only! Thus, comments like “lack of analysis” or such will not be constructive, as I do not expect the rest to be completed yet.

Report (50 pts)

Your report must be written using Quarto. Your written report should not exceed six pages inclusive of all tables and figures. If you have additional tables you would like to provide (ex. data dictionary table), you can include it in the appendix. However, your report should be clear and easy to understand even without looking at the appendix. If you choose to include a data dictionary table in the appendix, make sure you describe the variables you are using in the main text too.

Before you finalize your report, make sure the printing of code chunks is turned off by setting echo to false in your document YAML.

Components

You should include, at a minimum, the following sections in your report.

Introduction (10 pts)

The introduction provides motivation and context for your research. Write 1-2 paragraphs of background information to motivate your research question. Your writing should answer the question: “Why is this work important? What might the broader implications be?” You must cite at least two additional references to adequately support your motivation.

Then introduce the data set in a few short sentences. What is the source of the data, how were the data collected, and what general information is contained in the data set? How many observations and variables do you have? Next, create a code book (aka a “data dictionary”) of the variables in the data set. Specifically, only include in your report a code book of the variables that you use.

Complete the introduction by providing a concise, clear statement of your research question and hypotheses. Be sure to motivate why the research question is interesting/useful.

Example research question and hypotheses (if we were predicting penguin weights instead of baby weights):

Can we predict body mass with bill depth? We hypothesize that penguins with deeper bills will also have more mass.

Methodology (15 pts)

Here you should introduce any statistical methods you use and describe why you choose the methods you do to answer your question. This is where you might want to include any preliminary summary statistics or figures you use to explore the data. Your EDA should be supporting your choice of statistical methods (ex. if you don’t observe linear trend, you should not be fitting a linear model).

Results (15 pts)

Place figure(s) here to illustrate the main results from your analysis. 1 beautiful figure is worth more than several poorly formatted figures. You must have at least 1 figure.

Provide only the main results from your analysis. The goal is not to do an exhaustive data analysis (calculate every possible statistic and create every possible model for all variables). Rather, you should demonstrate that you are proficient at asking meaningful questions and answering them using data, that you are skilled in writing about and interpreting results, and that you can accomplish these tasks using R. More is not better.

Discussion (5 pts)

This section is a conclusion and discussion. You should

  • Summarize your main finding in a sentence or two.

  • Discuss your finding and why it is useful (put in the context of your motivation from the introduction).

  • Critique your own analyses and include a brief paragraph on what you would do differently if you were able to start the project over.

Formatting (5 pts)

Your project should be professionally formatted. For example, this means labeling graphs and figures, turning off code chunks, using proper citations and cross-references, and following typical style guidelines.

Submission

  • Upload the PDF submission to Gradescope.

  • Associate all pages with “Full report”.

Presentation (20 pts)

Your presentation must be no longer than 15 minutes (aim for around 10-12 minutes).

Slides

For your presentation, you must create presentation slides that summarize and showcase your project. Introduce your research question and data set, showcase visualizations, and provide some conclusions. These slides should serve as a brief visual accompaniment to your write-up and will be graded for content and quality.

Here is a suggested outline as you think through the slides; you do not have to use this exact format for the slide deck.

  • Title Slide

  • Slide 1: Introduce the topic and motivation

  • Slide 2: Introduce the data

  • Slide 3 - 4: Highlights from exploratory data analysis

  • Slide 4 - 5: Highlights from inference and/or modeling

  • Slide 6: Conclusions + critique/shortcomings

You can create your slides with any software you like (Keynote, PowerPoint, Google Slides, etc.).

Overall grading

The grade breakdown is as follows:

Total 100 pts
Project proposal 10 pts
Intro + EDA update 10 pts
Peer review 10 pts
Final report 50 pts
Presentation 20 pts

Grading summary

Grading of the project will take into account the following:

  • Content - What is the quality of research and/or policy question and relevancy of data to those questions?
  • Correctness - Are statistical procedures carried out and explained correctly?
  • Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?
  • Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?

A general breakdown of scoring is as follows:

  • 90%-100%: Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
  • 80%-89%: Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
  • 70%-79%: Passing effort. Student has misunderstanding of concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
  • 60%-69%: Struggling effort. Student is making some effort, but has misunderstanding of many concepts and is unable to put together a cogent argument. Communication of results is unclear.
  • Below 60%: Student is not making a sufficient effort.

Late work policy

Be sure to turn in your work early to avoid any technological mishaps.

Warning

There is no late work accepted on the final report and presentation.