In this assignment we’ll work with two datasets, one on earnings of different college majors and the other on Lego sales. The assignment assumes that you’ve worked through materials up to week 5.
IMPORTANT:
If there is no hw-02 GitHub repo created for you for this assignment, it means I didn’t have your GitHub username as of when I assigned the homework. Please let me know your GitHub username asap and I can create your repo.
In this assignment you will have two automated checks: one checking whether your Rmd file knits and the other checking the structure of your Rmd file. Passing the first check requires that, well, your Rmd file knits without errors. Passing the second check requires (1) that you haven’t changed the structure of the template you’re working off of (e.g. haven’t removed headings like Exercise 1, Code, Narrative and have not changed the labels of the code chunks) and (2) that you attempted all questions. These checks are meant to guide you along the way; they’re not an indicator of your score. You can’t get full marks on the assignment if you didn’t pass both, but failing one of them doesn’t mean you get a 0 either.
For each assignment in this course you will start with a GitHub repo that I created for you and that contains the starter documents you will build upon when working on your assignment. The first step is always to bring these files into RStudio so that you can edit them, run them, view your results, and interpret them. This action is called cloning.
Then you will work in RStudio on the data analysis, making commits along the way (snapshots of your changes) and finally push all your work back to GitHub.
Next, get the URL of the repo to be cloned, go to RStudio Cloud and navigate to the course workspace via the left sidebar, and clone the repo. If you would like step-by-step instructions with screenshots, please review HW 00.
Before we introduce the data, let’s warm up with some simple exercises.
Chances are your browser has already saved your password, but if not, you can ask Git to save (cache) your password for a period of time, where you indicate the period of time in seconds. For example, if you want it to cache your password for 1 hour, that would be 3,600 seconds. To do so, run the following in the Console: usethis::use_git_config(credential.helper = "cache --timeout=3600"). If you want to cache it for a longer amount of time, you can adjust the number in the code.
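As a minimal sketch of adjusting the timeout, the code below builds the helper string for a longer period. The 8-hour duration and the variable names are illustrative assumptions; only the usethis::use_git_config() call itself comes from the instructions above.

```r
# Cache duration in seconds; 8 hours is an illustrative choice, not a requirement
timeout_seconds <- 8 * 60 * 60

# Build the credential helper string; format() avoids scientific notation
helper <- paste0("cache --timeout=", format(timeout_seconds, scientific = FALSE))
helper

# Store the setting in your Git configuration (only when run interactively,
# e.g. from the Console)
if (interactive()) {
  usethis::use_git_config(credential.helper = helper)
}
```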
R is an open-source language, and developers contribute functionality to R via packages. In this assignment we will use the following packages:
We use the library() function to load packages. In your R Markdown document you should see an R chunk labelled load-packages which has the necessary code for loading both packages. You should also load these packages in your Console, which you can do by sending the code to your Console by clicking on the Run Current Chunk icon (green arrow pointing right icon).
Note that these packages are also loaded in your R Markdown environment when you Knit your R Markdown document.
In this assignment we explore data on college majors and earnings, specifically the data in the FiveThirtyEight story The Economic Guide To Picking A College Major. These data originally come from the American Community Survey (ACS) 2010-2012 Public Use Microdata Series. While this is outside the scope of this assignment, if you are curious about how raw data from the ACS were cleaned and prepared, see the code FiveThirtyEight authors used.
We should also note that there are many considerations that go into picking a major. Earnings potential and employment prospects are two of them, and they are important, but they don’t tell the whole story. Keep this in mind as you analyze the data.
Before working on this part of the homework we recommend that you work through the interactive tutorial titled What should I major in?. This tutorial will introduce you to the dataset and walk you through simple exercises with instant, automated feedback. Think of it as warming up!
If you’ve worked through the tutorial, you already know that the data frame we are working with is called college_recent_grads and it’s in the fivethirtyeight package.
To find out more about the dataset, type the following in your Console: ?college_recent_grads. A question mark before the name of an object will always bring up its help file. This command must be run in the Console.
You can take a quick peek at your data frame and view its dimensions with the glimpse() function. You can find out more about the dataset by inspecting its documentation, which you can access by running ?college_recent_grads in the Console or using the Help menu in RStudio to search for college_recent_grads. You can also find this information here.
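A quick peek at the data might look like the sketch below. It assumes the dplyr package (which provides glimpse()) and the fivethirtyeight package are installed; the output is not reproduced here.

```r
library(dplyr)            # provides glimpse()
library(fivethirtyeight)  # provides college_recent_grads

# One line per variable, showing its type and the first few values
glimpse(college_recent_grads)

# Number of rows and columns
dim(college_recent_grads)
```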
The college_recent_grads data frame is a trove of information. Let’s think about some questions we might want to answer with these data:
The income variables p25th, median, and p75th correspond to the 25th, 50th, and 75th percentiles of the income distribution of sampled individuals for a given major. Why do we often choose the median, rather than the mean, to describe the typical income of a group of people?
✏️ Write your answer in your R Markdown document under the appropriate exercise, knit the document, commit your changes with a meaningful commit message and push. Make sure you commit and push all files so your Git pane is clear after the push.
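To illustrate what the three percentile variables measure, the sketch below computes the 25th, 50th, and 75th percentiles of a made-up vector of incomes with base R’s quantile(). The income values are hypothetical and are not taken from college_recent_grads.

```r
# Hypothetical incomes for five graduates of one major (not real data)
incomes <- c(20000, 30000, 40000, 50000, 60000)

# 25th, 50th, and 75th percentiles, analogous to p25th, median, and p75th
percentiles <- quantile(incomes, probs = c(0.25, 0.50, 0.75))
percentiles
```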
Hint: The label_dollar() function can be helpful for the x-axis.
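One way the hint might be applied is sketched below, assuming the ggplot2 and scales packages are installed. The data frame, the bin width, and the choice of a histogram are illustrative assumptions, not the expected solution.

```r
library(ggplot2)
library(scales)

# Hypothetical median incomes (not the real data)
incomes <- data.frame(median = c(28000, 35000, 40000, 52000, 110000))

# label_dollar() returns a function that formats numbers as dollar amounts
dollar_labels <- label_dollar()(c(20000, 100000))
dollar_labels

# Histogram whose x-axis tick labels are formatted as dollar amounts
income_plot <- ggplot(incomes, aes(x = median)) +
  geom_histogram(binwidth = 10000) +
  scale_x_continuous(labels = label_dollar())
income_plot
```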
✏️ Write your answer in your R Markdown document under the appropriate exercise, knit the document, commit your changes with a meaningful commit message and push. Make sure you commit and push all files so your Git pane is clear after the push.
Tip: If you don’t feel like writing out the names of the majors by hand, you can use inline code with the glue_collapse() function from the glue package after you pull() the names of the majors out and save them as a vector. So, pull() the names of the majors, save them as a vector, then use inline code where you glue_collapse() the vector, separated by commas. This would be a nice exercise in constructing sentences programmatically!
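The tip above could be sketched as follows, assuming the dplyr and glue packages are installed. The selected_majors data frame and its values are hypothetical stand-ins for whatever your own pipeline produces.

```r
library(dplyr)
library(glue)

# Hypothetical result of a pipeline that keeps a few majors (not real results)
selected_majors <- tibble(
  major  = c("Major A", "Major B", "Major C"),
  median = c(60000, 55000, 50000)
)

# pull() extracts one column as a plain vector
major_names <- selected_majors %>%
  pull(major)

# glue_collapse() joins the vector into a single string
major_list <- glue_collapse(major_names, sep = ", ", last = ", and ")
major_list
```

In the narrative, the stored string can then be dropped into a sentence with inline code, for example: The majors are `r major_list`.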
"Biology & Life Science"
, "Computers & Mathematics", "Engineering", and "Physical Sciences".
✏️ Write your answer in your R Markdown document under the appropriate exercise, knit the document, commit your changes with a meaningful commit message and push. Make sure you commit and push all files so your Git pane is clear after the push.
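As a sketch of how the four categories listed above might be used, the code below stores them in a vector and flags rows whose category is in it. The assumption that they are matched against a major_category column, the names stem_categories and major_type, and the toy data frame are all illustrative.

```r
library(dplyr)

# The four categories listed above, stored as a character vector
stem_categories <- c(
  "Biology & Life Science",
  "Computers & Mathematics",
  "Engineering",
  "Physical Sciences"
)

# Hypothetical stand-in for the real data frame (not real data)
majors <- tibble(
  major          = c("Major A", "Major B", "Major C"),
  major_category = c("Engineering", "Arts", "Physical Sciences")
)

# %in% checks, row by row, whether the category is one of the four above
majors_typed <- majors %>%
  mutate(major_type = if_else(major_category %in% stem_categories, "stem", "not stem"))
majors_typed
```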
✏️ Write your answer in your R Markdown document under the appropriate exercise, knit the document, commit your changes with a meaningful commit message and push. Make sure you commit and push all files so your Git pane is clear after the push.
Before working on this part of the homework we recommend that you work through the interactive tutorial titled Lego sales. Just like the previous one, this tutorial will introduce you to the dataset and walk you through simple exercises with instant, automated feedback. The exercises in this homework are slightly more advanced than the ones in the tutorial, so it’s useful to get acquainted with the data there first.
If you’ve already worked through this tutorial, you already know that we’re working with Lego sales data from 2018 for a sample of customers who bought Legos in the United States.
The dataset is available as part of the dsbox package, and it’s called lego_sales.
You can take a quick peek at your data frame and view its dimensions with the glimpse() function. You can find out more about the dataset by inspecting its documentation, which you can access by running ?lego_sales in the Console or using the Help menu in RStudio to search for lego_sales. You can also find this information here.
For each of the exercises below, first think about which variables you need to consider from the dataset and write down some notes on how you’ll go about answering the question, and then start writing the code.
✏️ Write your answer in your R Markdown document under the appropriate exercise, knit the document, commit your changes with a meaningful commit message and push. Make sure you commit and push all files so your Git pane is clear after the push.
Hint: The str_sub() function will be helpful here!
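For reference, here is a minimal sketch of how str_sub() from the stringr package extracts part of a string by position. The codes vector is hypothetical and unrelated to the lego_sales variables.

```r
library(stringr)

# Hypothetical strings (not from lego_sales)
codes <- c("ABC-2018", "XYZ-2019")

# Characters 1 through 3 of each string
first_three <- str_sub(codes, start = 1, end = 3)
first_three

# Negative positions count from the end: the last four characters
last_four <- str_sub(codes, start = -4, end = -1)
last_four
```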
✏️ Now is a good time to commit and push your changes to GitHub with an appropriate commit message. Commit and push all changed files so that your Git pane is cleared up afterwards. Make sure that your last push to the repo comes before the deadline, 29 October, 16:00 UK time. You can confirm that what you committed and pushed are indeed in your repo that we will see by visiting your repo on GitHub. Make sure your Rmd and md files are there and that your md file contains all of your plots as well.
If you have any questions about the assignment, please post them on Piazza and/or stop by student hours!