Image by mauro mora on Unsplash
In this two-part assignment you get to review modeling and inference and also review two projects from your peers!
The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviours, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years.
The GSS contains a standard core of demographic, behavioural, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.
In this assignment we analyze data from the 2016 GSS, using it to estimate values of population parameters of interest about US adults.[1]

[1] Smith, Tom W, Peter Marsden, Michael Hout, and Jibum Kim. General Social Surveys, 1972-2016 [machine-readable data file] /Principal Investigator, Tom W. Smith; Co-Principal Investigator, Peter V. Marsden; Co-Principal Investigator, Michael Hout; Sponsored by National Science Foundation. -NORC ed.- Chicago: NORC at the University of Chicago [producer and distributor]. Data accessed from the GSS Data Explorer website at gssdataexplorer.norc.org.

We will work with the following packages. They should already be installed in your project, and you can load them with the following:
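A minimal sketch of that loading step, assuming the packages in question are tidyverse, tidymodels, and dsbox. Only dsbox is named in this text; the other two are inferred from the functions used later (count(), drop_na(), initial_split(), recipe()), so adjust the list to match your project.

```r
# Assumed package list: only dsbox is named explicitly in the assignment text
library(tidyverse)   # dplyr, tidyr, stringr, forcats, ...
library(tidymodels)  # rsample, recipes, parsnip, workflows, tune, yardstick
library(dsbox)       # provides the gss16 dataset
```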
The dataset we will use is called gss16, and it's in the dsbox package.
In this section we’re going to build a model to predict whether someone agrees or doesn’t agree with the following statement:
Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government.
The responses to the question on the GSS about this statement are in the advfront variable.
It’s important that you don’t recode the NAs, just the remaining levels.
Re-level the advfront variable such that it has two levels: "Strongly agree" and "Agree" combined into a new level called "Agree", and the remaining levels (except NAs) combined into "Not agree". Then, re-order the levels in the following order: "Agree" and "Not agree". Finally, count() how many times each new level appears in the advfront variable.

You can do this in various ways. One option is to use the str_detect() function to detect the existence of words like liberal or conservative. Note that these sometimes show up with a lowercase first letter and sometimes with an uppercase first letter. To detect either in the str_detect() function, you can use "[Ll]iberal" and "[Cc]onservative". But feel free to solve the problem however you like; this is just one option!
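The str_detect() approach can be sketched on a small made-up vector. The data frame toy and its labels are hypothetical stand-ins, not the actual levels in gss16, and case_when() plus fct_relevel() is only one of several ways to do the lumping and re-ordering.

```r
library(dplyr)
library(tibble)
library(stringr)
library(forcats)

# Hypothetical labels for illustration; check the real levels in gss16 first
toy <- tibble(
  polviews = c("Extremely liberal", "Slightly liberal", "Moderate",
               "Conservative", "Extrmly conservative", NA)
)

toy_recoded <- toy %>%
  mutate(
    # Lump by pattern; rows matching neither pattern keep their value,
    # and NA stays NA because an NA condition is treated as no match
    polviews = case_when(
      str_detect(polviews, "[Ll]iberal")      ~ "Liberal",
      str_detect(polviews, "[Cc]onservative") ~ "Conservative",
      TRUE                                    ~ polviews
    ),
    # Put the levels in the requested order
    polviews = fct_relevel(polviews, "Conservative", "Moderate", "Liberal")
  )

toy_recoded %>% count(polviews)
```

The character class [Ll] is what lets a single pattern match both "liberal" and "Liberal".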
Combine the levels of the polviews variable such that levels that have the word "liberal" in them are lumped into a level called "Liberal" and those that have the word "conservative" in them are lumped into a level called "Conservative". Then, re-order the levels in the following order: "Conservative", "Moderate", and "Liberal". Finally, count() how many times each new level appears in the polviews variable.

Create a new data frame called gss16_advfront that includes the variables advfront, educ, polviews, and wrkstat. Then, use the drop_na() function to remove rows that contain NAs from this new data frame. Sample code is provided below.

Split the data into training and testing sets using initial_split(). Call the training data gss16_train and the testing data gss16_test. Sample code is provided below. Use these specific names to make it easier to follow the rest of the instructions.

set.seed(___)
gss16_split <- initial_split(gss16_advfront)
gss16_train <- training(gss16_split)
gss16_test <- testing(gss16_split)
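To see what the split produces, here is a small sketch on a hypothetical 100-row data frame (toy, with made-up columns). It relies on the rsample default of putting three quarters of the rows into the training set; the seed value is arbitrary.

```r
library(rsample)
library(tibble)

set.seed(1234)  # arbitrary seed, chosen only for this illustration

# Hypothetical data standing in for gss16_advfront
toy <- tibble(id = 1:100, x = rnorm(100))

toy_split <- initial_split(toy)  # default prop = 3/4
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

nrow(toy_train)
nrow(toy_test)
```

Every row lands in exactly one of the two sets, so the test set can later be used to evaluate a model on data it has not seen.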
Create a recipe with the following steps for predicting advfront from polviews, wrkstat, and educ. Name this recipe gss16_rec_1. (We'll create one more recipe later, that's why we're naming this recipe _1.) Sample code is provided below.
step_other() to pool values that occur less than 10% of the time (threshold = 0.10) in the wrkstat variable into "Other".
step_dummy() to create dummy variables for all_nominal() variables that are predictors, i.e. all_predictors().

gss16_rec_1 <- recipe(___ ~ ___, data = ___) %>%
  step_other(wrkstat, threshold = ___, other = "Other") %>%
  step_dummy(all_nominal(), -all_outcomes())
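What the two steps do can be sketched on made-up data. The data frame toy below is hypothetical: its wrkstat counts are chosen so that two levels fall below the 10% threshold, and the advfront values are arbitrary.

```r
library(recipes)
library(dplyr)
library(tibble)

# Hypothetical toy data: 100 rows, two wrkstat levels below 10%
toy <- tibble(
  advfront = factor(rep(c("Agree", "Not agree"), times = 50)),
  wrkstat = factor(c(
    rep("Working fulltime", 60),
    rep("Retired", 25),
    rep("Keeping house", 8),
    rep("In school", 7)
  ))
)

# Pooling step only, so the effect on the factor levels is visible
pooled <- recipe(advfront ~ wrkstat, data = toy) %>%
  step_other(wrkstat, threshold = 0.10, other = "Other") %>%
  prep() %>%
  bake(new_data = NULL)

count(pooled, wrkstat)

# Adding step_dummy() replaces the pooled factor with indicator columns
dummied <- recipe(advfront ~ wrkstat, data = toy) %>%
  step_other(wrkstat, threshold = 0.10, other = "Other") %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep() %>%
  bake(new_data = NULL)

names(dummied)
```

In this toy case the two rare levels are merged into "Other", and the three remaining levels become two indicator columns, with one level serving as the reference.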
"glm"
as the engine. Name this specification gss16_spec. Sample code is provided below.

Build a workflow that uses the recipe you defined (gss16_rec_1) and the model you specified (gss16_spec). Name this workflow gss16_wflow_1. Sample code is provided below.

Perform 5-fold cross validation. Specifically,
split the training data into 5 folds (don't forget to set a seed first!),
apply the workflow you defined earlier to the folds with fit_resamples(),
collect_metrics() and comment on the consistency of metrics across folds (you can get the area under the ROC curve and the accuracy for each fold by setting summarize = FALSE in collect_metrics()), and
report the average area under the ROC curve and the accuracy for all cross validation folds with collect_metrics().

set.seed(___)
gss16_folds <- vfold_cv(___, v = ___)
gss16_fit_rs_1 <- gss16_wflow_1 %>%
  fit_resamples(___)
collect_metrics(___, summarize = FALSE)
collect_metrics(___)
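The cross validation steps can be sketched end to end on simulated data. Everything named toy_* below is hypothetical, the simulated relationship between educ and advfront is invented for illustration, and the seed is arbitrary; the sketch is meant to show the shape of the output, not results for the GSS.

```r
library(tidymodels)

set.seed(2016)  # arbitrary seed for the simulation and the folds

# Simulated stand-in for the training data
toy <- tibble(educ = sample(8:20, 300, replace = TRUE)) %>%
  mutate(
    p_agree  = plogis(-2 + 0.2 * educ),
    advfront = factor(
      if_else(runif(300) < p_agree, "Agree", "Not agree"),
      levels = c("Agree", "Not agree")
    )
  ) %>%
  select(-p_agree)

# Workflow: a bare recipe plus a logistic regression with the glm engine
toy_wflow <- workflow() %>%
  add_recipe(recipe(advfront ~ educ, data = toy)) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# Five folds, then fit the workflow once per fold
toy_folds  <- vfold_cv(toy, v = 5)
toy_fit_rs <- toy_wflow %>% fit_resamples(toy_folds)

# One row per fold and metric
collect_metrics(toy_fit_rs, summarize = FALSE)

# Averages across the five folds
collect_metrics(toy_fit_rs)
```

Comparing the per-fold rows with the averaged rows is what lets you judge how consistent accuracy and area under the ROC curve are across folds.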
Now, try a different, simpler model: predict advfront from only polviews and educ. Specifically,
update the recipe to reflect this simpler model specification (and name it gss16_rec_2),
redefine the workflow with the new recipe (and name this new workflow gss16_wflow_2),
perform cross validation, and
report the average area under the ROC curve and the accuracy for all cross validation folds with collect_metrics().

Comment on which model performs better (the one including wrkstat, model 1, or the one excluding wrkstat, model 2) on the training data based on area under the ROC curve.

In 2016, the GSS added a new question on harassment at work. The question is phrased as the following.
Over the past five years, have you been harassed by your superiors or co-workers at your job, for example, have you experienced any bullying, physical or psychological abuse?
Answers to this question are stored in the harass5 variable in our dataset.

Create a subset of the data that only contains Yes and No answers for the harassment question. How many responses chose each of these answers?

In this part you're asked to review three project presentations we didn't get to watch during the presentations on Friday. To complete this part: