class: center, middle, inverse, title-slide

# Visualising categorical data
## Introduction to Data Science
### introds.org
### Dr. Mine Çetinkaya-Rundel

---
layout: true

<div class="my-footer">
<span>
<a href="https://introds.org" target="_blank">introds.org</a>
</span>
</div>

---
class: middle

# Recap

---

## Variables

- **Numerical** variables can be classified as **continuous** or **discrete** based on whether the variable can take on an infinite number of values or only non-negative whole numbers, respectively.
- If the variable is **categorical**, it is **ordinal** if its levels have a natural ordering.

---

## Data

```r
library(tidyverse)  # provides %>%, select(), glimpse(), and ggplot()
library(openintro)

loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade,
         state, annual_income, homeownership, debt_to_income)

glimpse(loans)
```

```
## Rows: 10,000
## Columns: 8
## $ loan_amount    <int> 28000, 5000, 2000, 21600, 23000, 5000, …
## $ interest_rate  <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72,…
## $ term           <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36,…
## $ grade          <ord> C, C, D, A, C, A, C, B, C, A, C, B, C, …
## $ state          <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL,…
## $ annual_income  <dbl> 90000, 40000, 40000, 30000, 35000, 3400…
## $ homeownership  <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, …
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46,…
```

---
class: middle

# Bar plot

---

## Bar plot

```r
ggplot(loans, aes(x = homeownership)) +
  geom_bar()
```

<img src="w2-d05-viz-cat_files/figure-html/unnamed-chunk-3-1.png" width="60%" style="display: block; margin: auto;" />

---

## Segmented bar plot

```r
ggplot(loans, aes(x = homeownership,
*                 fill = grade)) +
  geom_bar()
```

<img src="w2-d05-viz-cat_files/figure-html/unnamed-chunk-4-1.png" width="60%" style="display: block; margin: auto;" />

---

## Segmented bar plot

```r
ggplot(loans, aes(x = homeownership, fill = grade)) +
* geom_bar(position = "fill")
```

<img src="w2-d05-viz-cat_files/figure-html/unnamed-chunk-5-1.png" width="60%" style="display: block; margin: auto;" />

---

.question[
Which bar plot is a more useful representation for visualizing the relationship between homeownership and grade?
]

.pull-left[
<img src="w2-d05-viz-cat_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w2-d05-viz-cat_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Customizing bar plots

.panelset[

.panel[.panel-name[Plot]
<img src="w2-d05-viz-cat_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />
]

.panel[.panel-name[Code]

```r
*ggplot(loans, aes(y = homeownership, fill = grade)) +
  geom_bar(position = "fill") +
* labs(
*   x = "Proportion",
*   y = "Homeownership",
*   fill = "Grade",
*   title = "Grades of Lending Club loans",
*   subtitle = "and homeownership of lendee"
* )
```
]

]

---
class: middle

# Relationships between numerical and categorical variables

---

## Already talked about...

- Colouring and faceting histograms and density plots
- Side-by-side box plots

---

## Violin plots

```r
ggplot(loans, aes(x = homeownership, y = loan_amount)) +
  geom_violin()
```

<img src="w2-d05-viz-cat_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" />

---

## Ridge plots

```r
library(ggridges)

ggplot(loans, aes(x = loan_amount, y = grade, fill = grade, color = grade)) +
  geom_density_ridges(alpha = 0.5)
```

<img src="w2-d05-viz-cat_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />

---
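
## Proportions behind `position = "fill"`

The segmented bar plot drawn with `position = "fill"` shows, within each homeownership level, the share of loans in each grade. The sketch below is one way those shares could be computed directly; it is not part of the original slides. It assumes the dplyr and openintro packages are installed, and the names `grade_props` and `prop` are illustrative choices.

```r
library(dplyr)
library(openintro)

# Count loans in every homeownership-grade combination, then convert the
# counts to proportions within each homeownership level. These proportions
# correspond to the segment heights when position = "fill" is used.
grade_props <- loans_full_schema %>%
  count(homeownership, grade) %>%
  group_by(homeownership) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

grade_props
```

Within each homeownership level the `prop` values sum to 1, which is why every bar in the filled plot has the same height.

---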