Scraping top 250 movies on IMDB

# Scraping top 250 movies on IMDB
## Introduction to Data Science
### <a href="https://introds.org/">introds.org</a>
### Dr. Mine Çetinkaya-Rundel

---

layout: true
 
<div class="my-footer">

<a href="https://introds.org" target="_blank">introds.org</a>

</div>

---

# Top 250 movies on IMDB

---

## Top 250 movies on IMDB

Take a look at the source code, look for the tag `table` tag:
 
http://www.imdb.com/chart/top

.pull-left[
<img src="img/imdb-top-250.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/imdb-top-250-source.png" width="94%" style="display: block; margin: auto;" />
]

---

## First check if you're allowed!

```r
library(robotstxt)
paths_allowed("http://www.imdb.com")
```

```
## [1] TRUE
```

vs. e.g.

```r
paths_allowed("http://www.facebook.com")
```

```
## [1] FALSE
```

---

## Plan

---

## Plan

1. Read the whole page

2. Scrape movie titles and save as `titles`

3. Scrape years movies were made in and save as `years`

4. Scrape IMDB ratings and save as `ratings`

5. Create a data frame called `imdb_top_250` with variables `title`, `year`, and `rating`

---

# Step 1. Read the whole page

---

## Read the whole page

```r
page <- read_html("https://www.imdb.com/chart/top/")
page
```

```
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html ...
## [2] <body id="styleguide-v2" class="fixed">\n <img ...
```

---

## A webpage in R

- Result is a list with 2 elements

```r
typeof(page)
```

```
## [1] "list"
```

- that we need to convert to something more familiar, like a data frame....

```r
class(page)
```

```
## [1] "xml_document" "xml_node"
```

---

# Step 2. Scrape movie titles and save as `titles`

---

## Scrape movie titles

---

## Scrape the nodes

```r
page %>%
  html_nodes(".titleColumn a")
```

```
## {xml_nodeset (250)}
## [1] <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [2] <a href="/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [3] <a href="/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [4] <a href="/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [5] <a href="/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [6] <a href="/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [7] <a href="/title/tt0167260/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [8] <a href="/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [9] <a href="/title/tt0060196/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [10] <a href="/title/tt0120737/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [11] <a href="/title/tt0137523/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [12] <a href="/title/tt0109830/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [13] <a href="/title/tt1375666/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [14] <a href="/title/tt0167261/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [15] <a href="/title/tt0080684/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [16] <a href="/title/tt0133093/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
...
```
]
.pull-right[
<img src="img/titles.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract the text from the nodes

```r
page %>%
  html_nodes(".titleColumn a") %>%
  html_text()
```

```
## [1] "The Shawshank Redemption" 
## [2] "The Godfather" 
## [3] "The Godfather: Part II" 
## [4] "The Dark Knight" 
## [5] "12 Angry Men" 
## [6] "Schindler's List" 
## [7] "The Lord of the Rings: The Return of the King" 
## [8] "Pulp Fiction" 
## [9] "The Good, the Bad and the Ugly" 
## [10] "The Lord of the Rings: The Fellowship of the Ring" 
## [11] "Fight Club" 
## [12] "Forrest Gump" 
## [13] "Inception" 
## [14] "The Lord of the Rings: The Two Towers" 
## [15] "Star Wars: Episode V - The Empire Strikes Back" 
## [16] "The Matrix" 
...
```
]
.pull-right[
<img src="img/titles.png" width="100%" style="display: block; margin: auto;" />
]

---

## Save as `titles`

```r
titles <- page %>%
 html_nodes(".titleColumn a") %>%
 html_text()

titles
```

```
## [1] "The Shawshank Redemption" 
## [2] "The Godfather" 
## [3] "The Godfather: Part II" 
## [4] "The Dark Knight" 
## [5] "12 Angry Men" 
## [6] "Schindler's List" 
## [7] "The Lord of the Rings: The Return of the King" 
## [8] "Pulp Fiction" 
## [9] "The Good, the Bad and the Ugly" 
## [10] "The Lord of the Rings: The Fellowship of the Ring" 
## [11] "Fight Club" 
## [12] "Forrest Gump" 
## [13] "Inception" 
## [14] "The Lord of the Rings: The Two Towers" 
...
```
]
.pull-right[
<img src="img/titles.png" width="100%" style="display: block; margin: auto;" />
]

---

# Step 3. Scrape year movies were made and save as `years`

---

## Scrape years movies were made in

---

## Scrape the nodes

```r
page %>%
  html_nodes(".secondaryInfo")
```

```
## {xml_nodeset (250)}
## [1] (1994)
## [2] (1972)
## [3] (1974)
## [4] (2008)
## [5] (1957)
## [6] (1993)
## [7] (2003)
## [8] (1994)
## [9] (1966)
## [10] (2001)
## [11] (1999)
## [12] (1994)
## [13] (2010)
## [14] (2002)
## [15] (1980)
## [16] (1999)
...
```
]
.pull-right[
<img src="img/years.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract the text from the nodes

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text()
```

```
## [1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)"
## [7] "(2003)" "(1994)" "(1966)" "(2001)" "(1999)" "(1994)"
## [13] "(2010)" "(2002)" "(1980)" "(1999)" "(1990)" "(1975)"
## [19] "(1954)" "(1995)" "(1997)" "(2002)" "(1991)" "(1946)"
## [25] "(1977)" "(1998)" "(2001)" "(1999)" "(2019)" "(2014)"
## [31] "(1994)" "(1995)" "(1962)" "(1994)" "(1985)" "(2002)"
## [37] "(1991)" "(1998)" "(1936)" "(1960)" "(2000)" "(1931)"
## [43] "(2006)" "(2011)" "(2020)" "(2014)" "(2006)" "(1988)"
## [49] "(1968)" "(1942)" "(1988)" "(1954)" "(1979)" "(1979)"
## [55] "(2000)" "(1981)" "(1940)" "(2012)" "(2006)" "(2019)"
## [61] "(1957)" "(2008)" "(1980)" "(2018)" "(1950)" "(1957)"
## [67] "(2018)" "(2003)" "(1997)" "(1964)" "(2012)" "(1984)"
## [73] "(1986)" "(2016)" "(2019)" "(2017)" "(1999)" "(1995)"
## [79] "(2009)" "(1981)" "(1995)" "(1963)" "(1984)" "(2018)"
## [85] "(2007)" "(2009)" "(1983)" "(1992)" "(1997)" "(1968)"
## [91] "(2000)" "(1958)" "(1931)" "(2004)" "(2016)" "(2012)"
...
```
]
.pull-right[
<img src="img/years.png" width="100%" style="display: block; margin: auto;" />
]

---

## Clean up the text

We need to go from `"(1994)"` to `1994`:

- Remove `(` and `)`: string manipulation

- Convert to numeric: `as.numeric()`

---

## stringr

.pull-left-wide[
- **stringr** provides a cohesive set of functions designed to make working with strings as easy as possible
- Functions in stringr start with `str_*()`, e.g.
 - `str_remove()` to remove a pattern from a string
 
 ```r
 str_remove(string = "jello", pattern = "el")
 ```
 
 ```
 ## [1] "jlo"
 ```
 - `str_replace()` to replace a pattern with another
 
 ```r
 str_replace(string = "jello", pattern = "j", replacement = "h")
 ```
 
 ```
 ## [1] "hello"
 ```
]
.pull-right-narrow[
<img src="img/stringr.png" width="100%" style="display: block; margin: auto auto auto 0;" />
]

---

## Clean up the text

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") # remove (
```

```
##   [1] "1994)" "1972)" "1974)" "2008)" "1957)" "1993)" "2003)"
##   [8] "1994)" "1966)" "2001)" "1999)" "1994)" "2010)" "2002)"
##  [15] "1980)" "1999)" "1990)" "1975)" "1954)" "1995)" "1997)"
##  [22] "2002)" "1991)" "1946)" "1977)" "1998)" "2001)" "1999)"
##  [29] "2019)" "2014)" "1994)" "1995)" "1962)" "1994)" "1985)"
##  [36] "2002)" "1991)" "1998)" "1936)" "1960)" "2000)" "1931)"
##  [43] "2006)" "2011)" "2020)" "2014)" "2006)" "1988)" "1968)"
##  [50] "1942)" "1988)" "1954)" "1979)" "1979)" "2000)" "1981)"
##  [57] "1940)" "2012)" "2006)" "2019)" "1957)" "2008)" "1980)"
##  [64] "2018)" "1950)" "1957)" "2018)" "2003)" "1997)" "1964)"
##  [71] "2012)" "1984)" "1986)" "2016)" "2019)" "2017)" "1999)"
##  [78] "1995)" "2009)" "1981)" "1995)" "1963)" "1984)" "2018)"
##  [85] "2007)" "2009)" "1983)" "1992)" "1997)" "1968)" "2000)"
##  [92] "1958)" "1931)" "2004)" "2016)" "2012)" "1941)" "2019)"
##  [99] "1987)" "1948)" "1921)" "1952)" "1971)" "1959)" "2000)"
...
```

---

## Clean up the text

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\$") %>% # remove (
  str_remove("\$") # remove )
```

```
##   [1] "1994" "1972" "1974" "2008" "1957" "1993" "2003" "1994"
##   [9] "1966" "2001" "1999" "1994" "2010" "2002" "1980" "1999"
##  [17] "1990" "1975" "1954" "1995" "1997" "2002" "1991" "1946"
##  [25] "1977" "1998" "2001" "1999" "2019" "2014" "1994" "1995"
##  [33] "1962" "1994" "1985" "2002" "1991" "1998" "1936" "1960"
##  [41] "2000" "1931" "2006" "2011" "2020" "2014" "2006" "1988"
##  [49] "1968" "1942" "1988" "1954" "1979" "1979" "2000" "1981"
##  [57] "1940" "2012" "2006" "2019" "1957" "2008" "1980" "2018"
##  [65] "1950" "1957" "2018" "2003" "1997" "1964" "2012" "1984"
##  [73] "1986" "2016" "2019" "2017" "1999" "1995" "2009" "1981"
##  [81] "1995" "1963" "1984" "2018" "2007" "2009" "1983" "1992"
##  [89] "1997" "1968" "2000" "1958" "1931" "2004" "2016" "2012"
##  [97] "1941" "2019" "1987" "1948" "1921" "1952" "1971" "1959"
## [105] "2000" "1983" "1976" "1952" "1962" "2001" "2010" "1973"
...
```

---

## Convert to numeric

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\$") %>% # remove (
  str_remove("\$") %>% # remove )
  as.numeric()
```

```
##   [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 2001 1999 1994
##  [13] 2010 2002 1980 1999 1990 1975 1954 1995 1997 2002 1991 1946
##  [25] 1977 1998 2001 1999 2019 2014 1994 1995 1962 1994 1985 2002
##  [37] 1991 1998 1936 1960 2000 1931 2006 2011 2020 2014 2006 1988
##  [49] 1968 1942 1988 1954 1979 1979 2000 1981 1940 2012 2006 2019
##  [61] 1957 2008 1980 2018 1950 1957 2018 2003 1997 1964 2012 1984
##  [73] 1986 2016 2019 2017 1999 1995 2009 1981 1995 1963 1984 2018
##  [85] 2007 2009 1983 1992 1997 1968 2000 1958 1931 2004 2016 2012
##  [97] 1941 2019 1987 1948 1921 1952 1971 1959 2000 1983 1976 1952
## [109] 1962 2001 2010 1973 1927 2011 2010 1965 1985 1960 1944 1962
## [121] 2009 1989 1997 1995 1988 1975 1950 1961 2005 2018 1997 2004
## [133] 1992 1985 1959 2004 2001 1950 1995 1963 2013 2006 1971 2009
## [145] 1998 1980 1988 2007 1961 1948 2017 1954 2017 1974 1925 2005
...
```

---

## Save as `years`

```r
years <- page %>%
 html_nodes(".secondaryInfo") %>%
 html_text() %>%
 str_remove("\$") %>% # remove (
 str_remove("\$") %>% # remove )
 as.numeric()

years
```

```
## [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 2001 1999 1994
## [13] 2010 2002 1980 1999 1990 1975 1954 1995 1997 2002 1991 1946
## [25] 1977 1998 2001 1999 2019 2014 1994 1995 1962 1994 1985 2002
## [37] 1991 1998 1936 1960 2000 1931 2006 2011 2020 2014 2006 1988
## [49] 1968 1942 1988 1954 1979 1979 2000 1981 1940 2012 2006 2019
## [61] 1957 2008 1980 2018 1950 1957 2018 2003 1997 1964 2012 1984
## [73] 1986 2016 2019 2017 1999 1995 2009 1981 1995 1963 1984 2018
## [85] 2007 2009 1983 1992 1997 1968 2000 1958 1931 2004 2016 2012
## [97] 1941 2019 1987 1948 1921 1952 1971 1959 2000 1983 1976 1952
## [109] 1962 2001 2010 1973 1927 2011 2010 1965 1985 1960 1944 1962
## [121] 2009 1989 1997 1995 1988 1975 1950 1961 2005 2018 1997 2004
...
```
]
.pull-right[
<img src="img/years.png" width="100%" style="display: block; margin: auto;" />
]

---

# Step 4. Scrape IMDB ratings and save as `ratings`

---

## Scrape IMDB ratings

---

## Scrape the nodes

```r
page %>%
  html_nodes("strong")
```

```
## {xml_nodeset (250)}
## [1] 9.2</ ...
## [2] 9.1</ ...
## [3] 9.0</ ...
## [4] 9.0</ ...
## [5] 8.9</st ...
## [6] 8.9</ ...
## [7] 8.9</ ...
## [8] 8.8</ ...
## [9] 8.8</st ...
## [10] 8.8</ ...
## [11] 8.8</ ...
## [12] 8.8</ ...
## [13] 8.7</ ...
## [14] 8.7</ ...
## [15] 8.7</ ...
## [16] 8.6</ ...
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract the text from the nodes

```r
page %>%
  html_nodes("strong") %>%
  html_text()
```

```
## [1] "9.2" "9.1" "9.0" "9.0" "8.9" "8.9" "8.9" "8.8" "8.8" "8.8"
## [11] "8.8" "8.8" "8.7" "8.7" "8.7" "8.6" "8.6" "8.6" "8.6" "8.6"
## [21] "8.6" "8.6" "8.6" "8.6" "8.6" "8.5" "8.5" "8.5" "8.5" "8.5"
## [31] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5"
## [41] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.4" "8.4"
## [51] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4"
## [61] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.3"
## [71] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3"
## [81] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3"
## [91] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.2" "8.2"
## [101] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [111] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [121] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [131] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [141] "8.2" "8.2" "8.2" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"
## [151] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

## Convert to numeric

```r
page %>%
  html_nodes("strong") %>%
  html_text() %>%
  as.numeric()
```

```
## [1] 9.2 9.1 9.0 9.0 8.9 8.9 8.9 8.8 8.8 8.8 8.8 8.8 8.7 8.7 8.7
## [16] 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5 8.5 8.5 8.5 8.5
## [31] 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5
## [46] 8.5 8.5 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4
## [61] 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.3 8.3 8.3 8.3 8.3 8.3
## [76] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3
## [91] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [106] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [121] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [136] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [151] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [166] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [181] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [196] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [211] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

## Save as `ratings`

```r
ratings <- page %>%
 html_nodes("strong") %>%
 html_text() %>%
 as.numeric()

ratings
```

```
## [1] 9.2 9.1 9.0 9.0 8.9 8.9 8.9 8.8 8.8 8.8 8.8 8.8 8.7 8.7 8.7
## [16] 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5 8.5 8.5 8.5 8.5
## [31] 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5
## [46] 8.5 8.5 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4
## [61] 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.3 8.3 8.3 8.3 8.3 8.3
## [76] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3
## [91] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [106] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [121] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [136] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [151] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [166] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

# Step 5. Create a data frame called `imdb_top_250`

---

## Create a data frame: `imdb_top_250`

```r
imdb_top_250 <- tibble(
 title = titles, 
 year = years, 
 rating = ratings
 )

imdb_top_250
```

```
## # A tibble: 250 x 3
## title year rating
## <chr> <dbl> <dbl>
## 1 The Shawshank Redemption 1994 9.2
## 2 The Godfather 1972 9.1
## 3 The Godfather: Part II 1974 9 
## 4 The Dark Knight 2008 9 
## 5 12 Angry Men 1957 8.9
## 6 Schindler's List 1993 8.9
## # … with 244 more rows
```

---

<div id="htmlwidget-8f836166d559454ecd73" style="width:100%;height:400px;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-8f836166d559454ecd73">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50","51","52","53","54","55","56","57","58","59","60","61","62","63","64","65","66","67","68","69","70","71","72","73","74","75","76","77","78","79","80","81","82","83","84","85","86","87","88","89","90","91","92","93","94","95","96","97","98","99","100","101","102","103","104","105","106","107","108","109","110","111","112","113","114","115","116","117","118","119","120","121","122","123","124","125","126","127","128","129","130","131","132","133","134","135","136","137","138","139","140","141","142","143","144","145","146","147","148","149","150","151","152","153","154","155","156","157","158","159","160","161","162","163","164","165","166","167","168","169","170","171","172","173","174","175","176","177","178","179","180","181","182","183","184","185","186","187","188","189","190","191","192","193","194","195","196","197","198","199","200","201","202","203","204","205","206","207","208","209","210","211","212","213","214","215","216","217","218","219","220","221","222","223","224","225","226","227","228","229","230","231","232","233","234","235","236","237","238","239","240","241","242","243","244","245","246","247","248","249","250"],["The Shawshank Redemption","The Godfather","The Godfather: Part II","The Dark Knight","12 Angry Men","Schindler's List","The Lord of the Rings: The Return of the King","Pulp Fiction","The Good, the Bad and the Ugly","The Lord of the Rings: The Fellowship of the Ring","Fight Club","Forrest Gump","Inception","The Lord of the Rings: The Two Towers","Star Wars: Episode V - The Empire Strikes Back","The Matrix","Goodfellas","One Flew Over the Cuckoo's Nest","Seven Samurai","Seven","Life Is Beautiful","City of God","The Silence of the Lambs","It's a Wonderful Life","Star Wars: Episode IV - A New Hope","Saving Private Ryan","Spirited Away","The Green Mile","Parasite","Interstellar","Leon","The Usual Suspects","Harakiri","The Lion King","Back to the Future","The Pianist","Terminator 2: Judgment Day","American History X","Modern Times","Psycho","Gladiator","City Lights","The Departed","Untouchable","Hamilton","Whiplash","The Prestige","Grave of the Fireflies","Once Upon a Time in the West","Casablanca","Cinema Paradiso","Rear Window","Alien","Apocalypse Now","Memento","Raiders of the Lost Ark","The Great Dictator","Django Unchained","The Lives of Others","Joker","Paths of Glory","WALL·E","The Shining","Avengers: Infinity War","Sunset Blvd.","Witness for the Prosecution","Spider-Man: Into the Spider-Verse","Oldboy","Princess Mononoke","Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb","The Dark Knight Rises","Once Upon a Time in America","Aliens","Your Name.","Avengers: Endgame","Coco","American Beauty","Braveheart","3 Idiots","Das Boot","Toy Story","High and Low","Amadeus","Capernaum","Taare Zameen Par","Inglourious Basterds","Star Wars: Episode VI - Return of the Jedi","Reservoir Dogs","Good Will Hunting","2001: A Space Odyssey","Requiem for a Dream","Vertigo","M","Eternal Sunshine of the Spotless Mind","Dangal","The Hunt","Citizen Kane","1917","Full Metal Jacket","Bicycle Thieves","The Kid","Singin' in the Rain","A Clockwork Orange","North by Northwest","Snatch","Scarface","Taxi Driver","Ikiru","Lawrence of Arabia","Amélie","Toy Story 3","The Sting","Metropolis","A Separation","Incendies","For a Few Dollars More","Come and See","The Apartment","Double Indemnity","To Kill a Mockingbird","Up","Indiana Jones and the Last Crusade","L.A. Confidential","Heat","Die Hard","Monty Python and the Holy Grail","Rashomon","Yojimbo","Batman Begins","Green Book","Children of Heaven","Downfall","Unforgiven","Ran","Some Like It Hot","Howl's Moving Castle","A Beautiful Mind","All About Eve","Casino","The Great Escape","The Wolf of Wall Street","Pan's Labyrinth","Anand","The Secret in Their Eyes","Lock, Stock and Two Smoking Barrels","Raging Bull","My Neighbour Totoro","There Will Be Blood","Judgment at Nuremberg","The Treasure of the Sierra Madre","Three Billboards Outside Ebbing, Missouri","Dial M for Murder","Vikram Vedha","Chinatown","The Gold Rush","My Father and My Son","Shutter Island","No Country for Old Men","V for Vendetta","The Seventh Seal","Inside Out","Warrior","The Elephant Man","The Thing","The Sixth Sense","Trainspotting","Jurassic Park","Gone with the Wind","The Truman Show","Wild Strawberries","Finding Nemo","Blade Runner","Stalker","Kill Bill: Vol. 1","Room","The Bridge on the River Kwai","Fargo","Memories of Murder","Tokyo Story","The Third Man","Gran Torino","On the Waterfront","Wild Tales","The Deer Hunter","Klaus","In the Name of the Father","Mary and Max","Gone Girl","Andhadhun","Hacksaw Ridge","The Grand Budapest Hotel","Before Sunrise","Catch Me If You Can","The Big Lebowski","Persona","To Be or Not to Be","Prisoners","The Bandit","Sherlock Jr.","The General","How to Train Your Dragon","Le Mans '66","Mr. Smith Goes to Washington","12 Years a Slave","Barry Lyndon","Mad Max: Fury Road","Million Dollar Baby","Stand by Me","Network","Cool Hand Luke","Dead Poets Society","Ben-Hur","Hachi: A Dog's Tale","Harry Potter and the Deathly Hallows: Part 2","Platoon","Into the Wild","Logan","Monty Python's Life of Brian","Rush","The Wages of Fear","The Handmaiden","Joan of Arc","The 400 Blows","Andrei Rublev","Hotel Rwanda","Spotlight","Amores Perros","Rififi","La Haine","Nausicaä of the Valley of the Wind","Gangs of Wasseypur","Rocky","Monsters, Inc.","Rebecca","Rang De Basanti","Before Sunset","Portrait of a Lady on Fire","In the Mood for Love","Paris, Texas","It Happened One Night","Drishyam","The Invisible Guest","The Help","The Princess Bride","The Circus","The Battle of Algiers","The Terminator","Tangerines","Aladdin","A Silent Voice"],[1994,1972,1974,2008,1957,1993,2003,1994,1966,2001,1999,1994,2010,2002,1980,1999,1990,1975,1954,1995,1997,2002,1991,1946,1977,1998,2001,1999,2019,2014,1994,1995,1962,1994,1985,2002,1991,1998,1936,1960,2000,1931,2006,2011,2020,2014,2006,1988,1968,1942,1988,1954,1979,1979,2000,1981,1940,2012,2006,2019,1957,2008,1980,2018,1950,1957,2018,2003,1997,1964,2012,1984,1986,2016,2019,2017,1999,1995,2009,1981,1995,1963,1984,2018,2007,2009,1983,1992,1997,1968,2000,1958,1931,2004,2016,2012,1941,2019,1987,1948,1921,1952,1971,1959,2000,1983,1976,1952,1962,2001,2010,1973,1927,2011,2010,1965,1985,1960,1944,1962,2009,1989,1997,1995,1988,1975,1950,1961,2005,2018,1997,2004,1992,1985,1959,2004,2001,1950,1995,1963,2013,2006,1971,2009,1998,1980,1988,2007,1961,1948,2017,1954,2017,1974,1925,2005,2010,2007,2005,1957,2015,2011,1980,1982,1999,1996,1993,1939,1998,1957,2003,1982,1979,2003,2015,1957,1996,2003,1953,1949,2008,1954,2014,1978,2019,1993,2009,2014,2018,2016,2014,1995,2002,1998,1966,1942,2013,1996,1924,1926,2010,2019,1939,2013,1975,2015,2004,1986,1976,1967,1989,1959,2009,2011,1986,2007,2017,1979,2013,1953,2016,1928,1959,1966,2004,2015,2000,1955,1995,1984,2012,1976,2001,1940,2006,2004,2019,2000,1984,1934,2015,2016,2011,1987,1928,1966,1984,2013,1992,2016],[9.2,9.1,9,9,8.9,8.9,8.9,8.8,8.8,8.8,8.8,8.8,8.7,8.7,8.7,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8]],"container":"<table class=\"display\">\n <thead>\n <tr>\n <th> <\/th>\n <th>title<\/th>\n <th>year<\/th>\n <th>rating<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"dom":"p","pageLength":8,"columnDefs":[{"className":"dt-right","targets":[2,3]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false,"lengthMenu":[8,10,25,50,100]}},"evals":[],"jsHooks":[]}</script>

---

## Clean up / enhance

May or may not be a lot of work depending on how messy the data are

- See if you like what you got:

```r
glimpse(imdb_top_250)
```

```
## Rows: 250
## Columns: 3
## $ title <chr> "The Shawshank Redemption", "The Godfather", "T…
## $ year <dbl> 1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994,…
## $ rating <dbl> 9.2, 9.1, 9.0, 9.0, 8.9, 8.9, 8.9, 8.8, 8.8, 8.…
```

- Add a variable for rank

```r
imdb_top_250 <- imdb_top_250 %>%
 mutate(rank = 1:nrow(imdb_top_250)) %>%
 relocate(rank)
```

---

```
## # A tibble: 250 x 4
## rank title year rating
## <int> <chr> <dbl> <dbl>
## 1 1 The Shawshank Redemption 1994 9.2
## 2 2 The Godfather 1972 9.1
## 3 3 The Godfather: Part II 1974 9 
## 4 4 The Dark Knight 2008 9 
## 5 5 12 Angry Men 1957 8.9
## 6 6 Schindler's List 1993 8.9
## 7 7 The Lord of the Rings: The Return of the K… 2003 8.9
## 8 8 Pulp Fiction 1994 8.8
## 9 9 The Good, the Bad and the Ugly 1966 8.8
## 10 10 The Lord of the Rings: The Fellowship of t… 2001 8.8
## 11 11 Fight Club 1999 8.8
## 12 12 Forrest Gump 1994 8.8
## 13 13 Inception 2010 8.7
## 14 14 The Lord of the Rings: The Two Towers 2002 8.7
## 15 15 Star Wars: Episode V - The Empire Strikes … 1980 8.7
## 16 16 The Matrix 1999 8.6
## 17 17 Goodfellas 1990 8.6
## 18 18 One Flew Over the Cuckoo's Nest 1975 8.6
## 19 19 Seven Samurai 1954 8.6
## 20 20 Seven 1995 8.6
## # … with 230 more rows
```

---

# What next?

---

```r
imdb_top_250 %>% 
  count(year, sort = TRUE)
```

```
## # A tibble: 84 x 2
## year n
## <dbl> <int>
## 1 1995 8
## 2 2019 7
## 3 1957 6
## 4 2000 6
## 5 2004 6
## 6 2009 6
## # … with 78 more rows
```

---

```r
imdb_top_250 %>% 
  filter(year == 1995) %>%
  print(n = 8)
```

```
## # A tibble: 8 x 4
## rank title year rating
## <int> <chr> <dbl> <dbl>
## 1 20 Seven 1995 8.6
## 2 32 The Usual Suspects 1995 8.5
## 3 78 Braveheart 1995 8.3
## 4 81 Toy Story 1995 8.3
## 5 124 Heat 1995 8.2
## 6 139 Casino 1995 8.2
## 7 192 Before Sunrise 1995 8.1
## 8 229 La Haine 1995 8
```

---

.question[
Visualize the average yearly rating for movies that made it on the top 250 list over time.
]

.panelset[
.panel[.panel-name[Plot]
<img src="w6-d03-top-250-imdb_files/figure-html/unnamed-chunk-46-1.png" width="58%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
imdb_top_250 %>% 
  group_by(year) %>%
  summarise(avg_score = mean(rating)) %>%
  ggplot(aes(y = avg_score, x = year)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Year", y = "Average score")
```
]
]