class: center, middle, inverse, title-slide

# Scraping top 250 movies on IMDB
## Introduction to Data Science
### introds.org
### Dr. Mine Çetinkaya-Rundel

---

layout: true

<div class="my-footer">
<span>
<a href="https://introds.org" target="_blank">introds.org</a>
</span>
</div>

---

class: middle

# Top 250 movies on IMDB

---

## Top 250 movies on IMDB

Take a look at the source code and look for the `table` tag:
<br>
http://www.imdb.com/chart/top

.pull-left[
<img src="img/imdb-top-250.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/imdb-top-250-source.png" width="94%" style="display: block; margin: auto;" />
]

---

## First check if you're allowed!

```r
library(robotstxt)
paths_allowed("http://www.imdb.com")
```

```
## [1] TRUE
```

vs. e.g.

```r
paths_allowed("http://www.facebook.com")
```

```
## [1] FALSE
```

---

## Plan

<img src="img/plan.png" width="90%" style="display: block; margin: auto;" />

---

## Plan

1. Read the whole page
2. Scrape movie titles and save as `titles`
3. Scrape years movies were made in and save as `years`
4. Scrape IMDB ratings and save as `ratings`
5. Create a data frame called `imdb_top_250` with variables `title`, `year`, and `rating`

---

class: middle

# Step 1. Read the whole page

---

## Read the whole page

```r
library(tidyverse) # stringr, tibble, dplyr, and ggplot2 are used later
library(rvest)     # read_html(), html_nodes(), html_text()
page <- read_html("https://www.imdb.com/chart/top/")
page
```

```
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html ...
## [2] <body id="styleguide-v2" class="fixed">\n <img ...
```

---

## A webpage in R

- Result is a list with 2 elements

```r
typeof(page)
```

```
## [1] "list"
```

--

- that we need to convert to something more familiar, like a data frame...

```r
class(page)
```

```
## [1] "xml_document" "xml_node"
```

---

class: middle

# Step 2. Scrape movie titles and save as `titles`

---

## Scrape movie titles

<img src="img/titles.png" width="70%" style="display: block; margin: auto;" />

---

## Scrape the nodes

.pull-left[

```r
page %>%
  html_nodes(".titleColumn a")
```

```
## {xml_nodeset (250)}
##  [1] <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [2] <a href="/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [3] <a href="/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [4] <a href="/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [5] <a href="/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [6] <a href="/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [7] <a href="/title/tt0167260/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [8] <a href="/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
##  [9] <a href="/title/tt0060196/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
## [10] <a href="/title/tt0120737/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
## [11] <a href="/title/tt0137523/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
## [12] <a href="/title/tt0109830/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
## [13] <a href="/title/tt1375666/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
## [14] <a href="/title/tt0167261/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
## [15] <a href="/title/tt0080684/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
## [16] <a href="/title/tt0133093/?pf_rd_m=A2FGELUUNOQJNL&pf_ ...
...
```
]
.pull-right[
<img src="img/titles.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract the text from the nodes

.pull-left[

```r
page %>%
  html_nodes(".titleColumn a") %>%
  html_text()
```

```
##  [1] "The Shawshank Redemption"
##  [2] "The Godfather"
##  [3] "The Godfather: Part II"
##  [4] "The Dark Knight"
##  [5] "12 Angry Men"
##  [6] "Schindler's List"
##  [7] "The Lord of the Rings: The Return of the King"
##  [8] "Pulp Fiction"
##  [9] "The Good, the Bad and the Ugly"
## [10] "The Lord of the Rings: The Fellowship of the Ring"
## [11] "Fight Club"
## [12] "Forrest Gump"
## [13] "Inception"
## [14] "The Lord of the Rings: The Two Towers"
## [15] "Star Wars: Episode V - The Empire Strikes Back"
## [16] "The Matrix"
...
```
]
.pull-right[
<img src="img/titles.png" width="100%" style="display: block; margin: auto;" />
]

---

## Save as `titles`

.pull-left[

```r
titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()

titles
```

```
##  [1] "The Shawshank Redemption"
##  [2] "The Godfather"
##  [3] "The Godfather: Part II"
##  [4] "The Dark Knight"
##  [5] "12 Angry Men"
##  [6] "Schindler's List"
##  [7] "The Lord of the Rings: The Return of the King"
##  [8] "Pulp Fiction"
##  [9] "The Good, the Bad and the Ugly"
## [10] "The Lord of the Rings: The Fellowship of the Ring"
## [11] "Fight Club"
## [12] "Forrest Gump"
## [13] "Inception"
## [14] "The Lord of the Rings: The Two Towers"
...
```
]
.pull-right[
<img src="img/titles.png" width="100%" style="display: block; margin: auto;" />
]

---

class: middle

# Step 3. Scrape years movies were made in and save as `years`

---

## Scrape years movies were made in

<img src="img/years.png" width="70%" style="display: block; margin: auto;" />

---

## Scrape the nodes

.pull-left[

```r
page %>%
  html_nodes(".secondaryInfo")
```

```
## {xml_nodeset (250)}
##  [1] <span class="secondaryInfo">(1994)</span>
##  [2] <span class="secondaryInfo">(1972)</span>
##  [3] <span class="secondaryInfo">(1974)</span>
##  [4] <span class="secondaryInfo">(2008)</span>
##  [5] <span class="secondaryInfo">(1957)</span>
##  [6] <span class="secondaryInfo">(1993)</span>
##  [7] <span class="secondaryInfo">(2003)</span>
##  [8] <span class="secondaryInfo">(1994)</span>
##  [9] <span class="secondaryInfo">(1966)</span>
## [10] <span class="secondaryInfo">(2001)</span>
## [11] <span class="secondaryInfo">(1999)</span>
## [12] <span class="secondaryInfo">(1994)</span>
## [13] <span class="secondaryInfo">(2010)</span>
## [14] <span class="secondaryInfo">(2002)</span>
## [15] <span class="secondaryInfo">(1980)</span>
## [16] <span class="secondaryInfo">(1999)</span>
...
```
]
.pull-right[
<img src="img/years.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract the text from the nodes

.pull-left[

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text()
```

```
##  [1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)"
##  [7] "(2003)" "(1994)" "(1966)" "(2001)" "(1999)" "(1994)"
## [13] "(2010)" "(2002)" "(1980)" "(1999)" "(1990)" "(1975)"
## [19] "(1954)" "(1995)" "(1997)" "(2002)" "(1991)" "(1946)"
## [25] "(1977)" "(1998)" "(2001)" "(1999)" "(2019)" "(2014)"
## [31] "(1994)" "(1995)" "(1962)" "(1994)" "(1985)" "(2002)"
## [37] "(1991)" "(1998)" "(1936)" "(1960)" "(2000)" "(1931)"
## [43] "(2006)" "(2011)" "(2020)" "(2014)" "(2006)" "(1988)"
## [49] "(1968)" "(1942)" "(1988)" "(1954)" "(1979)" "(1979)"
## [55] "(2000)" "(1981)" "(1940)" "(2012)" "(2006)" "(2019)"
## [61] "(1957)" "(2008)" "(1980)" "(2018)" "(1950)" "(1957)"
## [67] "(2018)" "(2003)" "(1997)" "(1964)" "(2012)" "(1984)"
## [73] "(1986)" "(2016)" "(2019)" "(2017)" "(1999)" "(1995)"
## [79] "(2009)" "(1981)" "(1995)" "(1963)" "(1984)" "(2018)"
## [85] "(2007)" "(2009)" "(1983)" "(1992)" "(1997)" "(1968)"
## [91] "(2000)" "(1958)" "(1931)" "(2004)" "(2016)" "(2012)"
...
```
]
.pull-right[
<img src="img/years.png" width="100%" style="display: block; margin: auto;" />
]

---

## Clean up the text

We need to go from `"(1994)"` to `1994`:

- Remove `(` and `)`: string manipulation
- Convert to numeric: `as.numeric()`

---

## stringr

.pull-left-wide[
- **stringr** provides a cohesive set of functions designed to make working with strings as easy as possible
- Functions in stringr start with `str_*()`, e.g.
  - `str_remove()` to remove a pattern from a string

```r
str_remove(string = "jello", pattern = "el")
```

```
## [1] "jlo"
```

  - `str_replace()` to replace a pattern with another

```r
str_replace(string = "jello", pattern = "j", replacement = "h")
```

```
## [1] "hello"
```
]
.pull-right-narrow[
<img src="img/stringr.png" width="100%" style="display: block; margin: auto auto auto 0;" />
]

---

## Clean up the text

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") # remove (
```

```
##  [1] "1994)" "1972)" "1974)" "2008)" "1957)" "1993)" "2003)"
##  [8] "1994)" "1966)" "2001)" "1999)" "1994)" "2010)" "2002)"
## [15] "1980)" "1999)" "1990)" "1975)" "1954)" "1995)" "1997)"
## [22] "2002)" "1991)" "1946)" "1977)" "1998)" "2001)" "1999)"
## [29] "2019)" "2014)" "1994)" "1995)" "1962)" "1994)" "1985)"
## [36] "2002)" "1991)" "1998)" "1936)" "1960)" "2000)" "1931)"
## [43] "2006)" "2011)" "2020)" "2014)" "2006)" "1988)" "1968)"
## [50] "1942)" "1988)" "1954)" "1979)" "1979)" "2000)" "1981)"
## [57] "1940)" "2012)" "2006)" "2019)" "1957)" "2008)" "1980)"
## [64] "2018)" "1950)" "1957)" "2018)" "2003)" "1997)" "1964)"
## [71] "2012)" "1984)" "1986)" "2016)" "2019)" "2017)" "1999)"
## [78] "1995)" "2009)" "1981)" "1995)" "1963)" "1984)" "2018)"
## [85] "2007)" "2009)" "1983)" "1992)" "1997)" "1968)" "2000)"
## [92] "1958)" "1931)" "2004)" "2016)" "2012)" "1941)" "2019)"
## [99] "1987)" "1948)" "1921)" "1952)" "1971)" "1959)" "2000)"
...
```

---

## Clean up the text

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>% # remove (
  str_remove("\\)")     # remove )
```

```
##   [1] "1994" "1972" "1974" "2008" "1957" "1993" "2003" "1994"
##   [9] "1966" "2001" "1999" "1994" "2010" "2002" "1980" "1999"
##  [17] "1990" "1975" "1954" "1995" "1997" "2002" "1991" "1946"
##  [25] "1977" "1998" "2001" "1999" "2019" "2014" "1994" "1995"
##  [33] "1962" "1994" "1985" "2002" "1991" "1998" "1936" "1960"
##  [41] "2000" "1931" "2006" "2011" "2020" "2014" "2006" "1988"
##  [49] "1968" "1942" "1988" "1954" "1979" "1979" "2000" "1981"
##  [57] "1940" "2012" "2006" "2019" "1957" "2008" "1980" "2018"
##  [65] "1950" "1957" "2018" "2003" "1997" "1964" "2012" "1984"
##  [73] "1986" "2016" "2019" "2017" "1999" "1995" "2009" "1981"
##  [81] "1995" "1963" "1984" "2018" "2007" "2009" "1983" "1992"
##  [89] "1997" "1968" "2000" "1958" "1931" "2004" "2016" "2012"
##  [97] "1941" "2019" "1987" "1948" "1921" "1952" "1971" "1959"
## [105] "2000" "1983" "1976" "1952" "1962" "2001" "2010" "1973"
...
```

---

## Convert to numeric

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>% # remove (
  str_remove("\\)") %>% # remove )
  as.numeric()
```

```
##   [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 2001 1999 1994
##  [13] 2010 2002 1980 1999 1990 1975 1954 1995 1997 2002 1991 1946
##  [25] 1977 1998 2001 1999 2019 2014 1994 1995 1962 1994 1985 2002
##  [37] 1991 1998 1936 1960 2000 1931 2006 2011 2020 2014 2006 1988
##  [49] 1968 1942 1988 1954 1979 1979 2000 1981 1940 2012 2006 2019
##  [61] 1957 2008 1980 2018 1950 1957 2018 2003 1997 1964 2012 1984
##  [73] 1986 2016 2019 2017 1999 1995 2009 1981 1995 1963 1984 2018
##  [85] 2007 2009 1983 1992 1997 1968 2000 1958 1931 2004 2016 2012
##  [97] 1941 2019 1987 1948 1921 1952 1971 1959 2000 1983 1976 1952
## [109] 1962 2001 2010 1973 1927 2011 2010 1965 1985 1960 1944 1962
## [121] 2009 1989 1997 1995 1988 1975 1950 1961 2005 2018 1997 2004
## [133] 1992 1985 1959 2004 2001 1950 1995 1963 2013 2006 1971 2009
## [145] 1998 1980 1988 2007 1961 1948 2017 1954 2017 1974 1925 2005
...
```

---

## Save as `years`

.pull-left[

```r
years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>% # remove (
  str_remove("\\)") %>% # remove )
  as.numeric()

years
```

```
##   [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 2001 1999 1994
##  [13] 2010 2002 1980 1999 1990 1975 1954 1995 1997 2002 1991 1946
##  [25] 1977 1998 2001 1999 2019 2014 1994 1995 1962 1994 1985 2002
##  [37] 1991 1998 1936 1960 2000 1931 2006 2011 2020 2014 2006 1988
##  [49] 1968 1942 1988 1954 1979 1979 2000 1981 1940 2012 2006 2019
##  [61] 1957 2008 1980 2018 1950 1957 2018 2003 1997 1964 2012 1984
##  [73] 1986 2016 2019 2017 1999 1995 2009 1981 1995 1963 1984 2018
##  [85] 2007 2009 1983 1992 1997 1968 2000 1958 1931 2004 2016 2012
##  [97] 1941 2019 1987 1948 1921 1952 1971 1959 2000 1983 1976 1952
## [109] 1962 2001 2010 1973 1927 2011 2010 1965 1985 1960 1944 1962
## [121] 2009 1989 1997 1995 1988 1975 1950 1961 2005 2018 1997 2004
...
```
]
.pull-right[
<img src="img/years.png" width="100%" style="display: block; margin: auto;" />
]

---

class: middle

# Step 4. Scrape IMDB ratings and save as `ratings`

---

## Scrape IMDB ratings

<img src="img/ratings.png" width="70%" style="display: block; margin: auto;" />

---

## Scrape the nodes

.pull-left[

```r
page %>%
  html_nodes("strong")
```

```
## {xml_nodeset (250)}
##  [1] <strong title="9.2 based on 2,298,079 user ratings">9.2</ ...
##  [2] <strong title="9.1 based on 1,586,009 user ratings">9.1</ ...
##  [3] <strong title="9.0 based on 1,108,096 user ratings">9.0</ ...
##  [4] <strong title="9.0 based on 2,262,171 user ratings">9.0</ ...
##  [5] <strong title="8.9 based on 675,216 user ratings">8.9</st ...
##  [6] <strong title="8.9 based on 1,192,643 user ratings">8.9</ ...
##  [7] <strong title="8.9 based on 1,615,457 user ratings">8.9</ ...
##  [8] <strong title="8.8 based on 1,794,283 user ratings">8.8</ ...
##  [9] <strong title="8.8 based on 677,585 user ratings">8.8</st ...
## [10] <strong title="8.8 based on 1,631,223 user ratings">8.8</ ...
## [11] <strong title="8.8 based on 1,821,575 user ratings">8.8</ ...
## [12] <strong title="8.8 based on 1,770,767 user ratings">8.8</ ...
## [13] <strong title="8.7 based on 2,024,196 user ratings">8.7</ ...
## [14] <strong title="8.7 based on 1,460,135 user ratings">8.7</ ...
## [15] <strong title="8.7 based on 1,140,047 user ratings">8.7</ ...
## [16] <strong title="8.6 based on 1,645,106 user ratings">8.6</ ...
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract the text from the nodes

.pull-left[

```r
page %>%
  html_nodes("strong") %>%
  html_text()
```

```
##   [1] "9.2" "9.1" "9.0" "9.0" "8.9" "8.9" "8.9" "8.8" "8.8" "8.8"
##  [11] "8.8" "8.8" "8.7" "8.7" "8.7" "8.6" "8.6" "8.6" "8.6" "8.6"
##  [21] "8.6" "8.6" "8.6" "8.6" "8.6" "8.5" "8.5" "8.5" "8.5" "8.5"
##  [31] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5"
##  [41] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.4" "8.4"
##  [51] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4"
##  [61] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.3"
##  [71] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3"
##  [81] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3"
##  [91] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.2" "8.2"
## [101] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [111] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [121] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [131] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [141] "8.2" "8.2" "8.2" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"
## [151] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

## Convert to numeric

.pull-left[

```r
page %>%
  html_nodes("strong") %>%
  html_text() %>%
  as.numeric()
```

```
##   [1] 9.2 9.1 9.0 9.0 8.9 8.9 8.9 8.8 8.8 8.8 8.8 8.8 8.7 8.7 8.7
##  [16] 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5 8.5 8.5 8.5 8.5
##  [31] 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5
##  [46] 8.5 8.5 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4
##  [61] 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.3 8.3 8.3 8.3 8.3 8.3
##  [76] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3
##  [91] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [106] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [121] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [136] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [151] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [166] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [181] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [196] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [211] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

## Save as `ratings`

.pull-left[

```r
ratings <- page %>%
  html_nodes("strong") %>%
  html_text() %>%
  as.numeric()

ratings
```

```
##   [1] 9.2 9.1 9.0 9.0 8.9 8.9 8.9 8.8 8.8 8.8 8.8 8.8 8.7 8.7 8.7
##  [16] 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5 8.5 8.5 8.5 8.5
##  [31] 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5
##  [46] 8.5 8.5 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4
##  [61] 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.3 8.3 8.3 8.3 8.3 8.3
##  [76] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3
##  [91] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [106] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [121] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [136] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [151] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [166] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

class: middle

# Step 5. Create a data frame called `imdb_top_250`

---

## Create a data frame: `imdb_top_250`

```r
imdb_top_250 <- tibble(
  title = titles,
  year = years,
  rating = ratings
)

imdb_top_250
```

```
## # A tibble: 250 x 3
##   title                     year rating
##   <chr>                    <dbl>  <dbl>
## 1 The Shawshank Redemption  1994    9.2
## 2 The Godfather             1972    9.1
## 3 The Godfather: Part II    1974    9
## 4 The Dark Knight           2008    9
## 5 12 Angry Men              1957    8.9
## 6 Schindler's List          1993    8.9
## # … with 244 more rows
```

---
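## Aside: check lengths before combining

`tibble()` expects `titles`, `years`, and `ratings` to have the same length, and a selector may match a different number of nodes if the page markup changes. The sketch below is a minimal illustration under stated assumptions, not part of the original pipeline: `toy_page` is a hypothetical two-row HTML snippet that only imitates the class names used above, and `str_remove_all("[()]")` is swapped in as a one-step alternative to the two `str_remove()` calls.

```r
library(rvest)
library(stringr)
library(tibble)

# Hypothetical stand-in for the real page: two hand-written rows that reuse
# the class names from the slides (this is not actual IMDB source code)
toy_page <- read_html('
<table>
  <tr>
    <td class="titleColumn">
      <a href="/title/tt0111161/">The Shawshank Redemption</a>
      <span class="secondaryInfo">(1994)</span>
    </td>
    <td><strong>9.2</strong></td>
  </tr>
  <tr>
    <td class="titleColumn">
      <a href="/title/tt0068646/">The Godfather</a>
      <span class="secondaryInfo">(1972)</span>
    </td>
    <td><strong>9.1</strong></td>
  </tr>
</table>
')

toy_titles <- toy_page %>%
  html_nodes(".titleColumn a") %>%
  html_text()

toy_years <- toy_page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove_all("[()]") %>% # drop both parentheses in one step
  as.numeric()

toy_ratings <- toy_page %>%
  html_nodes("strong") %>%
  html_text() %>%
  as.numeric()

# Stop early if the three vectors do not have the same length
stopifnot(
  length(toy_titles) == length(toy_years),
  length(toy_titles) == length(toy_ratings)
)

toy_top <- tibble(title = toy_titles, year = toy_years, rating = toy_ratings)
toy_top
```

The same `stopifnot()` call could be placed just before the `tibble()` call on the real vectors, so a changed page fails loudly instead of producing misaligned rows.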
---

## Clean up / enhance

May or may not be a lot of work depending on how messy the data are

- See if you like what you got:

```r
glimpse(imdb_top_250)
```

```
## Rows: 250
## Columns: 3
## $ title  <chr> "The Shawshank Redemption", "The Godfather", "T…
## $ year   <dbl> 1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994,…
## $ rating <dbl> 9.2, 9.1, 9.0, 9.0, 8.9, 8.9, 8.9, 8.8, 8.8, 8.…
```

- Add a variable for rank

```r
imdb_top_250 <- imdb_top_250 %>%
  mutate(rank = 1:nrow(imdb_top_250)) %>%
  relocate(rank)
```

---

```
## # A tibble: 250 x 4
##     rank title                                        year rating
##    <int> <chr>                                       <dbl>  <dbl>
##  1     1 The Shawshank Redemption                     1994    9.2
##  2     2 The Godfather                                1972    9.1
##  3     3 The Godfather: Part II                       1974    9
##  4     4 The Dark Knight                              2008    9
##  5     5 12 Angry Men                                 1957    8.9
##  6     6 Schindler's List                             1993    8.9
##  7     7 The Lord of the Rings: The Return of the K…  2003    8.9
##  8     8 Pulp Fiction                                 1994    8.8
##  9     9 The Good, the Bad and the Ugly               1966    8.8
## 10    10 The Lord of the Rings: The Fellowship of t…  2001    8.8
## 11    11 Fight Club                                   1999    8.8
## 12    12 Forrest Gump                                 1994    8.8
## 13    13 Inception                                    2010    8.7
## 14    14 The Lord of the Rings: The Two Towers        2002    8.7
## 15    15 Star Wars: Episode V - The Empire Strikes …  1980    8.7
## 16    16 The Matrix                                   1999    8.6
## 17    17 Goodfellas                                   1990    8.6
## 18    18 One Flew Over the Cuckoo's Nest              1975    8.6
## 19    19 Seven Samurai                                1954    8.6
## 20    20 Seven                                        1995    8.6
## # … with 230 more rows
```

---

class: middle

# What next?

---

.question[
Which years have the most movies on the list?
]

--

```r
imdb_top_250 %>%
  count(year, sort = TRUE)
```

```
## # A tibble: 84 x 2
##    year     n
##   <dbl> <int>
## 1  1995     8
## 2  2019     7
## 3  1957     6
## 4  2000     6
## 5  2004     6
## 6  2009     6
## # … with 78 more rows
```

---

.question[
Which 1995 movies made the list?
]

--

```r
imdb_top_250 %>%
  filter(year == 1995) %>%
  print(n = 8)
```

```
## # A tibble: 8 x 4
##    rank title               year rating
##   <int> <chr>              <dbl>  <dbl>
## 1    20 Seven               1995    8.6
## 2    32 The Usual Suspects  1995    8.5
## 3    78 Braveheart          1995    8.3
## 4    81 Toy Story           1995    8.3
## 5   124 Heat                1995    8.2
## 6   139 Casino              1995    8.2
## 7   192 Before Sunrise      1995    8.1
## 8   229 La Haine            1995    8
```

---

.question[
Visualize the average yearly rating for movies that made it on the top 250 list over time.
]

--

.panelset[
.panel[.panel-name[Plot]
<img src="w6-d03-top-250-imdb_files/figure-html/unnamed-chunk-46-1.png" width="58%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
imdb_top_250 %>%
  group_by(year) %>%
  summarise(avg_score = mean(rating)) %>%
  ggplot(aes(y = avg_score, x = year)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Year", y = "Average score")
```
]
]
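---

## Aside: what the summary step computes

The plotting code first collapses the data to one row per year. As a small illustration (a sketch on a hypothetical three-row tibble, `toy_movies`, with made-up titles, not the scraped data), the `group_by()` and `summarise()` step on its own looks like this:

```r
library(dplyr)
library(tibble)

# Hypothetical mini data set with the same column names as imdb_top_250
toy_movies <- tibble(
  title  = c("Movie A", "Movie B", "Movie C"),
  year   = c(1994, 1994, 1972),
  rating = c(9.2, 8.8, 9.1)
)

# One row per year, holding the mean rating of that year's movies
toy_avg <- toy_movies %>%
  group_by(year) %>%
  summarise(avg_score = mean(rating))

toy_avg
```

Each point in the plot corresponds to one row of a table like `toy_avg`, so a year with a single movie on the list counts as much in the fitted line as a year with several.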