class: center, middle, inverse, title-slide

# Feature engineering
## Introduction to Data Science
### introds.org
### Dr. Mine Çetinkaya-Rundel

---

layout: true

<div class="my-footer">
<span>
<a href="https://introds.org" target="_blank">introds.org</a>
</span>
</div>

---

class: middle

# Feature engineering

---

## Feature engineering

- We prefer simple models when possible, but **parsimony** does not mean sacrificing accuracy (or predictive performance) in the interest of simplicity

--

- Variables that go into the model and how they are represented are just as critical to the success of the model

--

- **Feature engineering** allows us to get creative with our predictors in an effort to make them more useful for our model (to increase its predictive performance)

---

## Same training and testing sets as before

```r
# Fix random numbers by setting the seed
# Enables analysis to be reproducible when random numbers are used
set.seed(1116)

# Put 80% of the data into the training set
email_split <- initial_split(email, prop = 0.80)

# Create data frames for the two sets:
train_data <- training(email_split)
test_data  <- testing(email_split)
```

---

## A simple approach: `mutate()`

```r
train_data %>%
  mutate(
    date  = date(time),
    dow   = wday(time),
    month = month(time)
  ) %>%
  select(time, date, dow, month) %>%
  sample_n(size = 5) # shuffle to show a variety
```

```
## # A tibble: 5 x 4
##   time                date         dow month
##   <dttm>              <date>     <dbl> <dbl>
## 1 2012-03-18 16:47:58 2012-03-18     1     3
## 2 2012-02-08 04:50:43 2012-02-08     4     2
## 3 2012-03-16 15:06:21 2012-03-16     6     3
## 4 2012-02-26 05:37:38 2012-02-26     1     2
## 5 2012-01-21 15:08:13 2012-01-21     7     1
```

---

## Modeling workflow, revisited

- Create a **recipe** for feature engineering steps to be applied to the training data

--

- Fit the model to the training data after these steps have been applied

--

- Using the model estimates from the training data, predict outcomes for the test data

--

- Evaluate the performance of the model on the test data

---

class: middle

# Building recipes

---

## Initiate a recipe

```r
email_rec <- recipe(
  spam ~ .,          # formula
  data = train_data  # data to use for cataloguing names and types of variables
)

summary(email_rec)
```

.xsmall[
```
## # A tibble: 21 x 4
##    variable     type    role      source  
##    <chr>        <chr>   <chr>     <chr>   
##  1 to_multiple  nominal predictor original
##  2 from         nominal predictor original
##  3 cc           numeric predictor original
##  4 sent_email   nominal predictor original
##  5 time         date    predictor original
##  6 image        numeric predictor original
##  7 attach       numeric predictor original
##  8 dollar       numeric predictor original
##  9 winner       nominal predictor original
## 10 inherit      numeric predictor original
## 11 viagra       numeric predictor original
## 12 password     numeric predictor original
## 13 num_char     numeric predictor original
## 14 line_breaks  numeric predictor original
## 15 format       nominal predictor original
## 16 re_subj      nominal predictor original
## 17 exclaim_subj numeric predictor original
## 18 urgent_subj  nominal predictor original
## 19 exclaim_mess numeric predictor original
## 20 number       nominal predictor original
## 21 spam         nominal outcome   original
```
]

---

## Remove certain variables

```r
email_rec <- email_rec %>%
  step_rm(from, sent_email)
```

.small[
```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         20
## 
## Operations:
## 
## Delete terms from, sent_email
```
]

---

## Feature engineer date

```r
email_rec <- email_rec %>%
  step_date(time, features = c("dow", "month")) %>%
  step_rm(time)
```

.small[
```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         20
## 
## Operations:
## 
## Delete terms from, sent_email
## Date features from time
## Delete terms time
```
]
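---

## Aside: peek at the engineered features (sketch)

The steps above build up `email_rec` without ever printing the resulting data. A minimal sanity-check sketch, not part of the original analysis and assuming the tidymodels packages are loaded as elsewhere in these slides: `prep()` estimates the recipe on its training data, and `bake(new_data = NULL)` returns that processed training set, where `step_date()` has added `time_dow` and `time_month` columns.

```r
# Sketch only: inspect the date features created so far
email_rec %>%
  prep() %>%                     # estimate the recipe on the training data
  bake(new_data = NULL) %>%      # return the processed training set
  select(starts_with("time_")) %>%
  slice_head(n = 5)
```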
---

## Discretize numeric variables

```r
email_rec <- email_rec %>%
  step_cut(cc, attach, dollar, breaks = c(0, 1)) %>%
  step_cut(inherit, password, breaks = c(0, 1, 5, 10, 20))
```

.small[
```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         20
## 
## Operations:
## 
## Delete terms from, sent_email
## Date features from time
## Delete terms time
## Cut numeric for cc, attach, dollar
## Cut numeric for inherit, password
```
]

---

## Create dummy variables

```r
email_rec <- email_rec %>%
  step_dummy(all_nominal(), -all_outcomes())
```

.small[
```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         20
## 
## Operations:
## 
## Delete terms from, sent_email
## Date features from time
## Delete terms time
## Cut numeric for cc, attach, dollar
## Cut numeric for inherit, password
## Dummy variables from all_nominal(), -all_outcomes()
```
]

---

## Remove zero variance variables

Variables that contain only a single value

```r
email_rec <- email_rec %>%
  step_zv(all_predictors())
```

.small[
```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         20
## 
## Operations:
## 
## Delete terms from, sent_email
## Date features from time
## Delete terms time
## Cut numeric for cc, attach, dollar
## Cut numeric for inherit, password
## Dummy variables from all_nominal(), -all_outcomes()
## Zero variance filter on all_predictors()
```
]

---

## All in one place

```r
email_rec <- recipe(spam ~ ., data = email) %>%
  step_rm(from, sent_email) %>%
  step_date(time, features = c("dow", "month")) %>%
  step_rm(time) %>%
  step_cut(cc, attach, dollar, breaks = c(0, 1)) %>%
  step_cut(inherit, password, breaks = c(0, 1, 5, 10, 20)) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors())
```

---

class: middle

# Building workflows

---

## Define model

```r
email_mod <- logistic_reg() %>%
  set_engine("glm")

email_mod
```

```
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
```

---

## Define workflow

**Workflows** bring together models and recipes so that they can be easily applied to both the training and test data.

```r
email_wflow <- workflow() %>%
  add_model(email_mod) %>%
  add_recipe(email_rec)
```

.small[
```
## ══ Workflow ════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────
## 7 Recipe Steps
## 
## ● step_rm()
## ● step_date()
## ● step_rm()
## ● step_cut()
## ● step_cut()
## ● step_dummy()
## ● step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
```
]

---

## Fit model to training data

```r
email_fit <- email_wflow %>%
  fit(data = train_data)
```
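---

## Aside: fitting and evaluating in one step (sketch)

These slides fit the workflow on the training data and then predict on the test data by hand. A minimal alternative sketch, assuming the tune package (loaded with tidymodels) is available: `last_fit()` takes the workflow and the original split, fits on the training set, and evaluates on the test set in one call. This is an aside, not the approach used in the rest of the deck.

```r
# Sketch only: one-step final fit on the train/test split
email_last_fit <- last_fit(email_wflow, split = email_split)

# Test-set metrics (accuracy and ROC AUC by default)
collect_metrics(email_last_fit)
```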
---

.small[
```r
tidy(email_fit) %>%
  print(n = 31)
```

```
## # A tibble: 31 x 5
##    term               estimate std.error statistic  p.value
##    <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)        -0.950     0.272    -3.50    4.74e- 4
##  2 image              -1.02      0.662    -1.54    1.23e- 1
##  3 num_char            0.0405    0.0294    1.38    1.68e- 1
##  4 line_breaks        -0.00588   0.00163  -3.61    3.08e- 4
##  5 exclaim_subj        0.0843    0.268     0.315   7.53e- 1
##  6 exclaim_mess        0.0112    0.00221   5.09    3.66e- 7
##  7 to_multiple_X1     -3.17      0.453    -6.99    2.70e-12
##  8 cc_X.1.68.         -0.0389    0.456    -0.0852  9.32e- 1
##  9 attach_X.1.21.      1.84      0.398     4.63    3.67e- 6
## 10 dollar_X.1.64.     -0.0217    0.230    -0.0941  9.25e- 1
## 11 winner_yes          2.08      0.408     5.10    3.31e- 7
## 12 inherit_X.1.5.    -10.5    1478.       -0.00710 9.94e- 1
## 13 inherit_X.5.10.     1.94      1.27      1.52    1.28e- 1
## 14 password_X.1.5.    -2.50      1.03     -2.43    1.50e- 2
## 15 password_X.5.10.  -13.3     817.       -0.0162  9.87e- 1
## 16 password_X.10.20. -15.1    1148.       -0.0132  9.89e- 1
## 17 password_X.20.28. -14.9    1330.       -0.0112  9.91e- 1
## 18 format_X1          -0.797     0.161    -4.95    7.35e- 7
## 19 re_subj_X1         -3.03      0.441    -6.88    6.05e-12
## 20 urgent_subj_X1      3.80      1.03      3.68    2.37e- 4
## 21 number_small       -0.654     0.168    -3.89    9.84e- 5
## 22 number_big          0.143     0.249     0.574   5.66e- 1
## 23 time_dow_Mon       -0.310     0.320    -0.971   3.32e- 1
## 24 time_dow_Tue        0.128     0.287     0.447   6.55e- 1
## 25 time_dow_Wed       -0.141     0.283    -0.499   6.18e- 1
## 26 time_dow_Thu       -0.0452    0.287    -0.157   8.75e- 1
## 27 time_dow_Fri        0.119     0.283     0.421   6.74e- 1
## 28 time_dow_Sat        0.352     0.301     1.17    2.42e- 1
## 29 time_month_Feb      0.869     0.181     4.81    1.51e- 6
## 30 time_month_Mar      0.464     0.184     2.52    1.18e- 2
## 31 time_month_Apr    -13.9     990.       -0.0141  9.89e- 1
```
]

---

## Make predictions for test data

```r
email_pred <- predict(email_fit, test_data, type = "prob") %>%
  bind_cols(test_data)

email_pred
```

```
## # A tibble: 784 x 23
##   .pred_0 .pred_1 spam  to_multiple from     cc sent_email
##     <dbl>   <dbl> <fct> <fct>       <fct> <int> <fct>     
## 1   0.960 0.0404  0     0           1         0 0         
## 2   0.937 0.0634  0     0           1         0 0         
## 3   0.923 0.0766  0     0           1         0 1         
## 4   0.999 0.00144 0     0           1         2 0         
## 5   0.904 0.0955  0     0           1         0 0         
## 6   0.908 0.0919  0     0           1         0 0         
## # … with 778 more rows, and 16 more variables: time <dttm>,
## #   image <dbl>, attach <dbl>, dollar <dbl>, winner <fct>,
## #   inherit <dbl>, viagra <dbl>, password <dbl>, num_char <dbl>,
## #   line_breaks <int>, format <fct>, re_subj <fct>,
## #   exclaim_subj <dbl>, urgent_subj <fct>, exclaim_mess <dbl>,
## #   number <fct>
```

---

## Evaluate the performance

.pull-left[
```r
email_pred %>%
  roc_curve(
    truth = spam,
    .pred_1,
    event_level = "second"
  ) %>%
  autoplot()
```
]
.pull-right[
<img src="w9-d04-feature-engineering_files/figure-html/unnamed-chunk-22-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Evaluate the performance

.pull-left[
```r
email_pred %>%
  roc_auc(
    truth = spam,
    .pred_1,
    event_level = "second"
  )
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.829
```
]
.pull-right[
<img src="w9-d04-feature-engineering_files/figure-html/unnamed-chunk-24-1.png" width="100%" style="display: block; margin: auto;" />
]

---

class: middle

# Making decisions

---

## Cutoff probability: 0.5

.panelset[
.panel[.panel-name[Output]

Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.5**.

|                        | Email is not spam| Email is spam|
|:-----------------------|-----------------:|-------------:|
|Email labelled not spam |               703|            67|
|Email labelled spam     |                 4|            10|

]
.panel[.panel-name[Code]

```r
cutoff_prob <- 0.5

email_pred %>%
  mutate(
    spam      = if_else(spam == 1, "Email is spam", "Email is not spam"),
    spam_pred = if_else(.pred_1 > cutoff_prob, "Email labelled spam", "Email labelled not spam")
  ) %>%
  count(spam_pred, spam) %>%
  pivot_wider(names_from = spam, values_from = n) %>%
  kable(col.names = c("", "Email is not spam", "Email is spam"))
```

]
]
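---

## Aside: a confusion matrix with yardstick (sketch)

The table above is assembled by hand with `count()` and `pivot_wider()`. A minimal alternative sketch, assuming yardstick is loaded (it comes with tidymodels): convert the predicted probabilities into class labels at the chosen cutoff and pass them to `conf_mat()`. The `spam_class` column name is made up for this example.

```r
cutoff_prob <- 0.5

email_pred %>%
  mutate(
    # turn the spam probability into a factor with the same levels as spam
    spam_class = factor(if_else(.pred_1 > cutoff_prob, "1", "0"), levels = levels(spam))
  ) %>%
  conf_mat(truth = spam, estimate = spam_class)
```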
---

## Cutoff probability: 0.25

.panelset[
.panel[.panel-name[Output]

Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.25**.

|                        | Email is not spam| Email is spam|
|:-----------------------|-----------------:|-------------:|
|Email labelled not spam |               659|            40|
|Email labelled spam     |                48|            37|

]
.panel[.panel-name[Code]

```r
cutoff_prob <- 0.25

email_pred %>%
  mutate(
    spam      = if_else(spam == 1, "Email is spam", "Email is not spam"),
    spam_pred = if_else(.pred_1 > cutoff_prob, "Email labelled spam", "Email labelled not spam")
  ) %>%
  count(spam_pred, spam) %>%
  pivot_wider(names_from = spam, values_from = n) %>%
  kable(col.names = c("", "Email is not spam", "Email is spam"))
```

]
]

---

## Cutoff probability: 0.75

.panelset[
.panel[.panel-name[Output]

Suppose we decide to label an email as spam if the model predicts the probability of spam to be **more than 0.75**.

|                        | Email is not spam| Email is spam|
|:-----------------------|-----------------:|-------------:|
|Email labelled not spam |               706|            72|
|Email labelled spam     |                 1|             5|

]
.panel[.panel-name[Code]

```r
cutoff_prob <- 0.75

email_pred %>%
  mutate(
    spam      = if_else(spam == 1, "Email is spam", "Email is not spam"),
    spam_pred = if_else(.pred_1 > cutoff_prob, "Email labelled spam", "Email labelled not spam")
  ) %>%
  count(spam_pred, spam) %>%
  pivot_wider(names_from = spam, values_from = n) %>%
  kable(col.names = c("", "Email is not spam", "Email is spam"))
```

]
]
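---

## Aside: comparing cutoffs programmatically (sketch)

The three slides above repeat the same code with a different `cutoff_prob`. A minimal sketch, assuming purrr and yardstick are loaded with the tidyverse/tidymodels setup used in these slides, that computes sensitivity and specificity at each cutoff; the `spam_class` and `cutoffs` names are made up for illustration.

```r
# Sketch only: sweep over candidate cutoffs
cutoffs <- c(0.25, 0.5, 0.75)

cutoffs %>%
  map_dfr(function(cutoff) {
    email_pred %>%
      mutate(
        spam_class = factor(if_else(.pred_1 > cutoff, "1", "0"), levels = levels(spam))
      ) %>%
      summarise(
        cutoff      = cutoff,
        sensitivity = sens_vec(spam, spam_class, event_level = "second"),
        specificity = spec_vec(spam, spam_class, event_level = "second")
      )
  })
```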