# Web scraping considerations
## <br><br> Introduction to Data Science
### <a href="https://introds.org/">introds.org</a>
### <br> Dr. Mine Çetinkaya-Rundel

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://introds.org" target="_blank">introds.org</a>
</span>
</div>

---

# Ethics

---

## "Can you?" vs "Should you?"

.footnote[.small[
Source: Brian Resnick, [Researchers just released profile data on 70,000 OkCupid users without permission](https://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release), Vox.
]]

---

## "Can you?" vs "Should you?"

---

# Challenges

---

## Unreliable formatting at the source

---

## Data broken into many pages

---

# Workflow

---

## Screen scraping vs. APIs

Two different scenarios for web scraping:

- Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy)

- Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML data
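
A minimal sketch of the two scenarios, assuming the `rvest` and `jsonlite` packages are installed. The HTML and JSON strings below are made-up stand-ins for what a website or an API might return, so the example runs without a network connection.

```r
library(rvest)     # HTML parsing; also re-exports the %>% pipe
library(jsonlite)  # JSON parsing

# Screen scraping: parse the page source and select nodes with a CSS selector.
# `html` is a made-up snippet standing in for a downloaded page.
html <- '<ul><li class="title">Moon</li><li class="title">Sun</li></ul>'
titles_html <- minimal_html(html) %>%
  html_elements(".title") %>%
  html_text2()

# Web API: the response is already structured, so it only needs to be parsed.
# `json` is a made-up snippet standing in for an API response body.
json <- '[{"title": "Moon"}, {"title": "Sun"}]'
titles_json <- fromJSON(json)$title
```

Both routes end with the same character vector of titles; the API route skips the step of locating the data inside the page markup.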

---

## A new R workflow

- When working in an R Markdown document, your analysis is re-run each time you knit

- If you scrape the web in an R Markdown document, you re-scrape the data each time you knit, which is undesirable (and not *nice*)!

- An alternative workflow:
  - Keep your scraping code in an R script
  - Save the data scraped by that script as CSV or RDS files
  - Use the saved data in your analysis in your R Markdown document
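
The alternative workflow could look roughly like the sketch below, assuming the `rvest` package is installed. The page content, the CSS classes, and the file name `items.rds` are hypothetical; an inline page stands in for `read_html()` on a real URL so that the sketch runs offline.

```r
# --- scrape.R: run by hand, not when knitting ---
library(rvest)

# Stand-in for read_html("<url of the page>"); the markup is made up.
page <- minimal_html('
  <ul>
    <li><span class="name">Tea</span> <span class="price">3.5</span></li>
    <li><span class="name">Coffee</span> <span class="price">4</span></li>
  </ul>
')

items <- data.frame(
  name  = page %>% html_elements(".name") %>% html_text2(),
  price = page %>% html_elements(".price") %>% html_text2() %>% as.numeric()
)

# Save the interim data. tempdir() keeps the sketch self-contained;
# in a project this might be a path such as "data/items.rds".
out_path <- file.path(tempdir(), "items.rds")
saveRDS(items, out_path)

# --- analysis.Rmd chunk: read the saved copy, no scraping ---
items_saved <- readRDS(out_path)
```

Knitting the R Markdown document then only reads the saved file, so the website is contacted once, when the script is run.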