class: center, middle, inverse, title-slide

# Bootstrapping
## Introduction to Data Science
### introds.org
### Dr. Mine Çetinkaya-Rundel

---

layout: true

<div class="my-footer">
<span>
<a href="https://introds.org" target="_blank">introds.org</a>
</span>
</div>

---

class: middle

# Rent in Edinburgh

---

## Rent in Edinburgh

.question[
Take a guess! How much does a typical 3 BR flat in Edinburgh rent for?
]

---

## Sample

Fifteen 3 BR flats in Edinburgh were randomly selected on rightmove.co.uk.

```r
library(tidyverse)
edi_3br <- read_csv2("data/edi-3br.csv") # semicolon-separated file
```

.small[
```
## # A tibble: 15 x 4
##   flat_id  rent title               address
##   <chr>   <dbl> <chr>               <chr>
## 1 flat_01   825 3 bedroom apartmen… Burnhead Grove, Edinburgh, M…
## 2 flat_02  2400 3 bedroom flat to … Simpson Loan, Quartermile, E…
## 3 flat_03  1900 3 bedroom flat to … FETTES ROW, NEW TOWN, EH3 6SE
## 4 flat_04  1500 3 bedroom apartmen… Eyre Crescent, Edinburgh, Mi…
## 5 flat_05  3250 3 bedroom flat to … Walker Street, Edinburgh
## 6 flat_06  2145 3 bedroom flat to … George Street, City Centre, …
## # … with 9 more rows
```
]

---

## Observed sample

<img src="w10-d06-bootstrap_files/figure-html/unnamed-chunk-4-1.png" width="80%" style="display: block; margin: auto;" />

---

## Observed sample

Sample mean ≈ £1895 😱

<br>

<img src="img/rent-bootsamp.png" width="90%" style="display: block; margin: auto;" />

---

## Bootstrap population

Generated assuming there are more flats like the ones in the observed sample...

Population mean = ❓

<img src="img/rent-bootpop.png" width="65%" style="display: block; margin: auto;" />

---

## Bootstrapping scheme

1. Take a bootstrap sample - a random sample taken **with replacement** from the original sample, of the same size as the original sample
2. Calculate the bootstrap statistic - a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap sample
3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics
4.
Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution

---

class: middle

# Bootstrapping with tidymodels

---

## Generate bootstrap means

```r
library(tidymodels) # attaches infer: specify(), generate(), calculate()

edi_3br %>%
  # specify the variable of interest
  specify(response = rent)
```

---

## Generate bootstrap means

```r
edi_3br %>%
  # specify the variable of interest
  specify(response = rent) %>%
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap")
```

---

## Generate bootstrap means

```r
edi_3br %>%
  # specify the variable of interest
  specify(response = rent) %>%
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>%
  # calculate the mean of each bootstrap sample
  calculate(stat = "mean")
```

---

## Generate bootstrap means

```r
# save resulting bootstrap distribution
boot_df <- edi_3br %>%
  # specify the variable of interest
  specify(response = rent) %>%
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>%
  # calculate the mean of each bootstrap sample
  calculate(stat = "mean")
```

---

## The bootstrap sample

.question[
How many observations are there in `boot_df`? What does each observation represent?
]

```r
boot_df
```

```
## # A tibble: 15,000 x 2
##   replicate  stat
## *     <int> <dbl>
## 1         1 1793.
## 2         2 1938.
## 3         3 2175
## 4         4 2159.
## 5         5 2084
## 6         6 1761
## # … with 14,994 more rows
```

---

## Visualize the bootstrap distribution

```r
ggplot(data = boot_df, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 100) +
  labs(title = "Bootstrap distribution of means")
```

<img src="w10-d06-bootstrap_files/figure-html/unnamed-chunk-13-1.png" width="60%" style="display: block; margin: auto;" />

---

## Calculate the confidence interval

A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution.

```r
boot_df %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
```

```
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 1603. 2213.
```

---

## Visualize the confidence interval

<img src="w10-d06-bootstrap_files/figure-html/unnamed-chunk-16-1.png" width="70%" style="display: block; margin: auto;" />

---

## Interpret the confidence interval

.question[
The 95% confidence interval for the mean rent of three bedroom flats in Edinburgh was calculated as (1603, 2213). Which of the following is the correct interpretation of this interval?

**(a)** 95% of the time the mean rent of three bedroom flats in this sample is between £1603 and £2213.

**(b)** 95% of all three bedroom flats in Edinburgh have rents between £1603 and £2213.

**(c)** We are 95% confident that the mean rent of all three bedroom flats in Edinburgh is between £1603 and £2213.

**(d)** We are 95% confident that the mean rent of three bedroom flats in this sample is between £1603 and £2213.
]

---

class: middle

# Accuracy vs. precision

---

## Confidence level

**We are 95% confident that ...**

- Suppose we took many samples from the original population and built a 95% confidence interval based on each sample.
- Then about 95% of those intervals would contain the true population parameter.

---

## Commonly used confidence levels

.question[
Which line (orange dash/dot, blue dash, green dot) represents which confidence level?
]

<img src="w10-d06-bootstrap_files/figure-html/unnamed-chunk-17-1.png" width="60%" style="display: block; margin: auto;" />

---

## Precision vs. accuracy

.question[
If we want to be very certain that we capture the population parameter, should we use a wider or a narrower interval? What drawbacks are associated with using a wider interval?
]

--

<img src="img/garfield.png" width="60%" style="display: block; margin: auto;" />

--

.question[
How can we get the best of both worlds -- high precision and high accuracy?
]

---

## Changing confidence level

.question[
How would you modify the following code to calculate a 90% confidence interval? How would you modify it for a 99% confidence interval?
]

```r
edi_3br %>%
  specify(response = rent) %>%
  generate(reps = 15000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
```

---

## Recap

- Sample statistic `\(\ne\)` population parameter, but if the sample is good, it can be a good estimate
- We report the estimate with a confidence interval, and the width of this interval depends on the variability of sample statistics from different samples from the population
- Since we can't continue sampling from the population, we bootstrap from the one sample we have to estimate sampling variability
- We can do this for any sample statistic:
  - For a mean: `calculate(stat = "mean")`
  - For a median: `calculate(stat = "median")`
- Learn about calculating bootstrap intervals for other statistics in your homework
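
---

## Bootstrapping by hand

The four-step bootstrapping scheme can also be sketched in base R, which may help show what the tidymodels pipeline does at each step. This is only an illustration: it uses just the six rents printed on the Sample slide (the other nine rows are not shown), so its interval will not match (1603, 2213). The names `rents`, `boot_means`, and `ci_95` and the seed are illustrative choices, not part of the original analysis.

```r
# Six rents visible in the printed tibble; the full sample has 15 flats,
# so this is an assumption-laden subset used only for illustration.
rents <- c(825, 2400, 1900, 1500, 3250, 2145)

set.seed(1234) # arbitrary seed, only so the sketch is reproducible

# Steps 1-3: draw a sample with replacement, of the same size as the
# original sample, take its mean, and repeat 15000 times
boot_means <- replicate(
  15000,
  mean(sample(rents, size = length(rents), replace = TRUE))
)

# Step 4: the middle 95% of the bootstrap distribution
ci_95 <- quantile(boot_means, probs = c(0.025, 0.975))
ci_95
```

Passing different values to `probs` in the same `quantile()` call gives intervals at other confidence levels.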