HW 5 - Model and review

Due: 10 December, 16:00 UK time

Image by mauro mora on Unsplash

In this two-part assignment you get to review modeling and inference, and also review three projects from your peers!

Part 1 - General Social Survey

The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviours, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years.

The GSS contains a standard core of demographic, behavioural, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.

In this assignment we analyze data from the 2016 GSS, using it to estimate values of population parameters of interest about US adults.¹

¹ Smith, Tom W., Peter Marsden, Michael Hout, and Jibum Kim. General Social Surveys, 1972-2016 [machine-readable data file]. Principal Investigator, Tom W. Smith; Co-Principal Investigators, Peter V. Marsden and Michael Hout. Sponsored by the National Science Foundation. NORC ed. Chicago: NORC at the University of Chicago [producer and distributor]. Data accessed from the GSS Data Explorer website at gssdataexplorer.norc.org.

We will work with the following packages. They should already be installed in your project, and you can load them with the following:

library(tidyverse)
library(tidymodels)
library(dsbox)

The dataset we will use is called gss16 and it’s in the dsbox package.
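If you want a quick look at the data before you start, a minimal sketch (assuming the packages above are loaded):

# Peek at the variables in the GSS data: one row per respondent
glimpse(gss16)

# Open the documentation for the dataset and its variables
?gss16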

Scientific research

In this section we’re going to build a model to predict whether someone agrees or doesn’t agree with the following statement:

Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government.

The responses to the question on the GSS about this statement are in the advfront variable.
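Before re-leveling, it can be helpful to see the raw response categories and how often each occurs:

# Count the original levels of advfront, including NAs
gss16 %>%
  count(advfront)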

It’s important that you don’t recode the NAs, just the remaining levels.

  1. Re-level the advfront variable such that it has two levels: "Strongly agree" and "Agree" combined into a new level called "Agree", and the remaining levels (except NAs) combined into "Not agree". Then, re-order the levels in the following order: "Agree" and "Not agree". Finally, count() how many times each new level appears in the advfront variable. A sketch of one approach is given below.
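Here is a minimal sketch of the forcats pattern you might use, on a made-up factor (the toy levels are hypothetical, not the actual advfront levels):

# Toy example: collapse two levels into "Agree", everything else into "Not agree",
# then make the level order explicit
x <- factor(c("Strongly agree", "Agree", "Disagree", "Don't know"))

x %>%
  fct_collapse("Agree" = c("Strongly agree", "Agree"), other_level = "Not agree") %>%
  fct_relevel("Agree", "Not agree")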

There are various ways to do the re-leveling in the next exercise. One option is to use the str_detect() function to detect the presence of words like liberal or conservative. Note that these sometimes show up with a lowercase first letter and sometimes with an uppercase first letter. To detect either in str_detect(), you can use the patterns "[Ll]iberal" and "[Cc]onservative". But feel free to solve the problem however you like; this is just one option!
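As a quick illustration of the pattern (the example strings are made up):

# The character class [Ll] matches either capitalisation
str_detect(c("Liberal", "Slightly liberal", "Moderate"), "[Ll]iberal")
# returns TRUE TRUE FALSE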

  2. Combine the levels of the polviews variable such that levels that have the word "liberal" in them are lumped into a level called "Liberal" and those that have the word "conservative" in them are lumped into a level called "Conservative". Then, re-order the levels in the following order: "Conservative", "Moderate", and "Liberal". Finally, count() how many times each new level appears in the polviews variable.
  3. Create a new data frame called gss16_advfront that includes the variables advfront, educ, polviews, and wrkstat. Then, use the drop_na() function to remove rows that contain NAs from this new data frame. Sample code is provided below.
gss16_advfront <- gss16 %>%
  select(___, ___, ___, ___) %>%
  drop_na()
  4. Split the data into training (75%) and testing (25%) data sets. Make sure to set a seed before you do the initial_split(). Call the training data gss16_train and the testing data gss16_test. Sample code is provided below. Use these specific names to make it easier to follow the rest of the instructions.
set.seed(___)
gss16_split <- initial_split(gss16_advfront)
gss16_train <- training(gss16_split)
gss16_test  <- testing(gss16_split)
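Note that initial_split() uses a 75%/25% training/testing split by default (prop = 3/4), so the call above already gives you the split the exercise asks for. If you'd rather make that choice explicit, you can spell out the proportion:

gss16_split <- initial_split(gss16_advfront, prop = 0.75)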
  5. Create a recipe with the following steps for predicting advfront from polviews, wrkstat, and educ. Name this recipe gss16_rec_1. (We’ll create one more recipe later, that’s why we’re naming this recipe _1.) Sample code is provided below.

    • step_other() to pool values that occur less than 10% of the time (threshold = 0.10) in the wrkstat variable into "Other".

    • step_dummy() to create dummy variables for all nominal predictors, i.e. all_nominal() variables excluding the outcome (which is what -all_outcomes() does in the sample code)

gss16_rec_1 <- recipe(___ ~ ___, data = ___) %>%
  step_other(wrkstat, threshold = ___, other = "Other") %>%
  step_dummy(all_nominal(), -all_outcomes())
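Once you've filled in the blanks, one optional way to sanity-check what the recipe does is to prep() it and bake() it on the training data (a sketch, not a required part of the exercise):

# Estimate the recipe steps, apply them to the training data, and inspect the result
gss16_rec_1 %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()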
  6. Specify a logistic regression model using "glm" as the engine. Name this specification gss16_spec. Sample code is provided below.
gss16_spec <- ___() %>%
  set_engine("___")
  7. Build a workflow that uses the recipe you defined (gss16_rec_1) and the model you specified (gss16_spec). Name this workflow gss16_wflow_1. Sample code is provided below.
gss16_wflow_1 <- workflow() %>%
  add_model(___) %>%
  add_recipe(___)
  8. Perform 5-fold cross-validation. Specifically,

    • split the training data into 5 folds (don’t forget to set a seed first!),

    • apply the workflow you defined earlier to the folds with fit_resamples(),

    • collect_metrics() and comment on the consistency of metrics across folds (you can get the area under the ROC curve and the accuracy for each fold by setting summarize = FALSE in collect_metrics()), and

    • report the average area under the ROC curve and the accuracy across all cross-validation folds with collect_metrics()

set.seed(___)
gss16_folds <- vfold_cv(___, v = ___)

gss16_fit_rs_1 <- gss16_wflow_1 %>%
  fit_resamples(___)

collect_metrics(___, summarize = FALSE)
collect_metrics(___)
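A note on the defaults: for classification models, fit_resamples() computes accuracy and the area under the ROC curve out of the box (recent versions of tune may add further default metrics), so you don't need to pass a metric set here.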
  9. Now, try a different, simpler model: predict advfront from only polviews and educ. Specifically,

    • update the recipe to reflect this simpler model specification (and name it gss16_rec_2),

    • redefine the workflow with the new recipe (and name this new workflow gss16_wflow_2; one way to do this is sketched after this list),

    • perform cross-validation, and

    • report the average area under the ROC curve and the accuracy across all cross-validation folds with collect_metrics().
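For the workflow step, a minimal sketch (assuming you have already created gss16_rec_2): rather than building a new workflow from scratch, you can swap the recipe in the existing one with update_recipe():

gss16_wflow_2 <- gss16_wflow_1 %>%
  update_recipe(gss16_rec_2)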

  10. Comment on which model performs better (the one including wrkstat, model 1, or the one excluding wrkstat, model 2) on the training data, based on the area under the ROC curve.
  11. Fit both models to the testing data, plot the ROC curves for the predictions from both models, and calculate the areas under the ROC curve. Does your answer to the previous exercise hold for the testing data as well? Explain your reasoning. Note: if you haven't yet done so, you'll need to first train your workflows on the training data with the following, and then use these fit objects to calculate predictions for the test data.
gss16_fit_1 <- gss16_wflow_1 %>%
  fit(gss16_train)

gss16_fit_2 <- gss16_wflow_2 %>%
  fit(gss16_train)
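From there, one way to get test-set predictions, plot a ROC curve, and compute the area under it is sketched below (the .pred_Agree column name assumes "Agree" is the first level of your re-leveled advfront; adjust if yours differs):

# Attach class and probability predictions to the test data
gss16_test_pred_1 <- augment(gss16_fit_1, gss16_test)

# ROC curve for model 1
gss16_test_pred_1 %>%
  roc_curve(truth = advfront, .pred_Agree) %>%
  autoplot()

# Area under the ROC curve for model 1
gss16_test_pred_1 %>%
  roc_auc(truth = advfront, .pred_Agree)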

Harassment at work

In 2016, the GSS added a new question on harassment at work. The question is phrased as follows.

Over the past five years, have you been harassed by your superiors or co-workers at your job, for example, have you experienced any bullying, physical or psychological abuse?

Answers to this question are stored in the harass5 variable in our dataset.

  1. Create a subset of the data that only contains Yes and No answers for the harassment question. How many respondents chose each of these answers?
  2. Describe how bootstrapping can be used to estimate the proportion of Americans who have been harassed by their superiors or co-workers at their job.
  3. Calculate a 95% bootstrap confidence interval for the proportion of Americans who have been harassed by their superiors or co-workers at their job. Interpret this interval in the context of the data. (One possible approach is sketched after this list.)
  4. Would you expect a 90% confidence interval to be wider or narrower than the interval you calculated above? Explain your reasoning.
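Since the tidymodels meta-package also attaches infer, one possible approach is sketched below. The name gss16_harass is a placeholder for the Yes/No subset you created in the first exercise of this section, and the seed and number of reps are arbitrary choices:

# Bootstrap distribution of the sample proportion of "Yes" answers
set.seed(1234)
boot_dist <- gss16_harass %>%
  specify(response = harass5, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop")

# 95% percentile confidence interval
boot_dist %>%
  get_confidence_interval(level = 0.95, type = "percentile")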

Part 2 - Project peer review

In this part you're asked to review three project presentations we didn't get to watch during Friday's presentation session. To complete this part:

  1. Write a paragraph that very briefly describes each project (1-2 sentences each), highlights what you liked in each project and what you think could be improved, and finally comments on how you could improve your own project (analysis, presentation, or any other aspect) based on the strengths of these three projects, if you had extra time to do so.