# Tidying data
## <br><br> Introduction to Data Science
### <a href="https://introds.org/">introds.org</a>
### <br> Dr. Mine Çetinkaya-Rundel

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://introds.org" target="_blank">introds.org</a>
</span>
</div>

---

# .hand[We...]

---

## Data: Sales

<br>

.pull-left[
### .pink[We have...]

```
## # A tibble: 2 x 4
##   customer_id item_1 item_2       item_3
##         <dbl> <chr>  <chr>        <chr> 
## 1           1 bread  milk         banana
## 2           2 milk   toilet paper <NA>
```
]

--
.pull-right[
### .pink[We want...]

```
## # A tibble: 6 x 3
##   customer_id item_no item        
##         <dbl> <chr>   <chr>       
## 1           1 item_1  bread       
## 2           1 item_2  milk        
## 3           1 item_3  banana      
## 4           2 item_1  milk        
## 5           2 item_2  toilet paper
## 6           2 item_3  <NA>
```
]
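
The later slides pipe a data frame called `customers` into `pivot_longer()`, but its construction is not shown. A minimal sketch, assuming it matches the two-row tibble displayed above (the `tribble()` call is a reconstruction, not taken from the original deck):

```r
library(tibble)

# Hypothetical reconstruction of the `customers` data shown on this slide
customers <- tribble(
  ~customer_id, ~item_1, ~item_2,        ~item_3,
  1,            "bread", "milk",         "banana",
  2,            "milk",  "toilet paper", NA_character_  # customer 2 bought only two items
)

customers
```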

---

## A grammar of data tidying

.pull-left[
<img src="img/tidyr-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" />
]
.pull-right[
The goal of tidyr is to help you tidy your data via

- pivoting for going between wide and long data
- splitting and combining character columns
- nesting and unnesting columns
- clarifying how `NA`s should be treated
]
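
Pivoting is covered in the rest of these slides. For the other items in the list, here is a rough sketch on a made-up `people` tibble (hypothetical data, not from the deck), using `separate()`, `unite()`, `replace_na()` and `drop_na()` from tidyr; nesting is left out to keep it short:

```r
library(tibble)
library(tidyr)

# Hypothetical toy data for illustration only
people <- tibble(
  name = c("Ada Lovelace", "Alan Turing"),
  city = c("London", NA)
)

# Splitting one character column into two
people_split <- people %>%
  separate(name, into = c("first", "last"), sep = " ")

# Combining the two columns back into one
people_united <- people_split %>%
  unite("name", first, last, sep = " ")

# Being explicit about NAs: either fill them in or drop the rows
people_filled <- people %>%
  replace_na(list(city = "unknown"))

people_complete <- people %>%
  drop_na(city)
```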

---

# Pivoting data

---

## Not this...

---

## but this!

.center[
<img src="img/tidyr-longer-wider.gif" width="45%" style="background-color: #FDF6E3; display: block; margin: auto;" />
]

---

## Wider vs. longer

.pull-left[
### .pink[wider]
more columns
]
--
.pull-right[
### .pink[longer]
more rows
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
]
.pull-right[

```r
pivot_longer(
* data,
  cols, 
  names_to = "name", 
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format 
]
.pull-right[

```r
pivot_longer(
  data, 
* cols,
  names_to = "name", 
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format 
- `names_to`: name of the column where column names of pivoted variables go (character string)
]
.pull-right[

```r
pivot_longer(
  data, 
  cols, 
* names_to = "name",
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format 
- `names_to`: name of the column where column names of pivoted variables go (character string)
- `values_to`: name of the column where data in pivoted variables go (character string)
]
.pull-right[

```r
pivot_longer(
  data, 
  cols, 
  names_to = "name", 
* values_to = "value"
  )
```
]

---

## Customers → purchases

```r
purchases <- customers %>%
* pivot_longer(
*   cols = item_1:item_3,  # variables item_1 to item_3
*   names_to = "item_no",  # column names -> new column called item_no
*   values_to = "item"     # values in columns -> new column called item
*   )

purchases
```
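
The pivot above keeps a row for customer 2's empty third item. If those rows are not wanted, `pivot_longer()` has a `values_drop_na` argument. A self-contained sketch, assuming `customers` is the two-row tibble from the earlier slide (rebuilt here so the block runs on its own; `purchases_complete` is a name made up for this example):

```r
library(tibble)
library(tidyr)

# Assumed reconstruction of the `customers` data from the "Data: Sales" slide
customers <- tribble(
  ~customer_id, ~item_1, ~item_2,        ~item_3,
  1,            "bread", "milk",         "banana",
  2,            "milk",  "toilet paper", NA_character_
)

purchases_complete <- customers %>%
  pivot_longer(
    cols = item_1:item_3,
    names_to = "item_no",
    values_to = "item",
    values_drop_na = TRUE   # drop rows where the pivoted value is NA
  )

purchases_complete
```

Whether to drop these rows depends on the analysis: keeping them records that a slot was empty, dropping them leaves only actual purchases.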

---

## Why pivot?

Most likely, because the next step of your analysis needs it

.pull-left[

```r
prices
```

```
## # A tibble: 5 x 2
##   item         price
##   <chr>        <dbl>
## 1 avocado       0.5 
## 2 banana        0.15
## 3 bread         1   
## 4 milk          0.8 
## 5 toilet paper  3
```
]
.pull-right[

```r
purchases %>%
* left_join(prices, by = "item")  # join on the shared item column
```

```
## # A tibble: 6 x 4
##   customer_id item_no item         price
##         <dbl> <chr>   <chr>        <dbl>
## 1           1 item_1  bread         1   
## 2           1 item_2  milk          0.8 
## 3           1 item_3  banana        0.15
## 4           2 item_1  milk          0.8 
## 5           2 item_2  toilet paper  3   
## 6           2 item_3  <NA>         NA
```
]
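
Once prices are attached to the long data, per-customer summaries become a short pipeline. A sketch under the assumption that `purchases` and `prices` hold exactly the rows printed on this slide (both are rebuilt here so the block is self-contained; `spend` and `total` are names introduced for illustration):

```r
library(tibble)
library(dplyr)

# Rebuilt from the output shown on the slides
purchases <- tribble(
  ~customer_id, ~item_no, ~item,
  1,            "item_1", "bread",
  1,            "item_2", "milk",
  1,            "item_3", "banana",
  2,            "item_1", "milk",
  2,            "item_2", "toilet paper",
  2,            "item_3", NA_character_
)

prices <- tribble(
  ~item,          ~price,
  "avocado",      0.5,
  "banana",       0.15,
  "bread",        1,
  "milk",         0.8,
  "toilet paper", 3
)

# Total spend per customer; the NA item has no price, so it is skipped in the sum
spend <- purchases %>%
  left_join(prices, by = "item") %>%
  group_by(customer_id) %>%
  summarise(total = sum(price, na.rm = TRUE))

spend
```

The same calculation on the wide `customers` data would need a separate lookup for each `item_*` column, which is the practical argument for pivoting first.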

---

## Purchases → customers

.pull-left-narrow[
- `data` (as usual)
- `names_from`: which column in the long format contains what should be the column names in the wide format
- `values_from`: which column in the long format contains what should be the values in the new columns in the wide format
]
.pull-right-wide[

```r
purchases %>%
* pivot_wider(
*   names_from = item_no,
*   values_from = item
* )
```
]
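
One way to convince yourself that `pivot_wider()` undoes `pivot_longer()` is a round trip. A minimal sketch, assuming `customers` is the two-row tibble from the "Data: Sales" slide (rebuilt here; `customers_again` is a name made up for this example):

```r
library(tibble)
library(tidyr)

# Assumed reconstruction of the `customers` data from the earlier slide
customers <- tribble(
  ~customer_id, ~item_1, ~item_2,        ~item_3,
  1,            "bread", "milk",         "banana",
  2,            "milk",  "toilet paper", NA_character_
)

# Wide -> long
purchases <- customers %>%
  pivot_longer(
    cols = item_1:item_3,
    names_to = "item_no",
    values_to = "item"
  )

# Long -> wide again
customers_again <- purchases %>%
  pivot_wider(
    names_from = item_no,
    values_from = item
  )

customers_again
```

Note that the round trip depends on keeping the `NA` row in the long data; had it been dropped, the wide result would still show `NA` for customer 2's `item_3`, because `pivot_wider()` fills missing combinations with `NA` by default.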

---

# Case study: Approval rating of Donald Trump

---

.footnote[
Source: [FiveThirtyEight](https://projects.fivethirtyeight.com/trump-approval-ratings/adults/)
]

---

## Data

```r
trump
```

```
## # A tibble: 2,702 x 4
##    subgroup date       approval disapproval
##    <chr>    <date>        <dbl>       <dbl>
##  1 Voters   2020-10-04     44.7        52.2
##  2 Adults   2020-10-04     43.2        52.6
##  3 Adults   2020-10-03     43.2        52.6
##  4 Voters   2020-10-03     45.0        51.7
##  5 Adults   2020-10-02     43.3        52.4
##  6 Voters   2020-10-02     44.5        52.1
##  7 Voters   2020-10-01     44.1        52.8
##  8 Adults   2020-10-01     42.7        53.3
##  9 Adults   2020-09-30     42.2        53.7
## 10 Voters   2020-09-30     44.2        52.7
## # … with 2,692 more rows
```

---

## Goal

.pull-left-wide[
<img src="w3-d06-tidying_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" />
]
--
.pull-right-narrow[
**Aesthetic mappings:**  
✅ x = `date`  
❌ y = `rating_value`  
❌ color = `rating_type`

**Facet:**  
✅  `subgroup` (Adults and Voters)
]

---

## Pivot

```r
trump_longer <- trump %>%
  pivot_longer(
    cols = c(approval, disapproval),
    names_to = "rating_type",
    values_to = "rating_value"
  )

trump_longer
```

```
## # A tibble: 5,404 x 4
##    subgroup date       rating_type rating_value
##    <chr>    <date>     <chr>              <dbl>
##  1 Voters   2020-10-04 approval            44.7
##  2 Voters   2020-10-04 disapproval         52.2
##  3 Adults   2020-10-04 approval            43.2
##  4 Adults   2020-10-04 disapproval         52.6
##  5 Adults   2020-10-03 approval            43.2
##  6 Adults   2020-10-03 disapproval         52.6
##  7 Voters   2020-10-03 approval            45.0
##  8 Voters   2020-10-03 disapproval         51.7
...
```
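
The row count doubles (2,702 to 5,404) because each input row yields one output row per pivoted column. A small check of that on the first two rows printed on the Data slide (the full `trump` data set is not reproduced here; `trump_sample` and `trump_sample_longer` are names introduced for illustration):

```r
library(tibble)
library(tidyr)

# First two rows of `trump`, copied from the output on the Data slide
trump_sample <- tibble(
  subgroup    = c("Voters", "Adults"),
  date        = as.Date(c("2020-10-04", "2020-10-04")),
  approval    = c(44.7, 43.2),
  disapproval = c(52.2, 52.6)
)

trump_sample_longer <- trump_sample %>%
  pivot_longer(
    cols = c(approval, disapproval),
    names_to = "rating_type",
    values_to = "rating_value"
  )

# Two pivoted columns, so twice as many rows as the input
nrow(trump_sample_longer) == 2 * nrow(trump_sample)
```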

---

## Plot

```r
ggplot(trump_longer, 
       aes(x = date, y = rating_value, color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup)
```

---

.panelset[
.panel[.panel-name[Code]

```r
ggplot(trump_longer, 
       aes(x = date, y = rating_value, 
           color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup) +
* scale_color_manual(values = c("darkgreen", "orange")) +
* labs(
*   x = "Date", y = "Rating",
*   color = NULL,
*   title = "How (un)popular is Donald Trump?",
*   subtitle = "Estimates based on polls of all adults and polls of likely/registered voters",
*   caption = "Source: FiveThirtyEight modeling estimates"
* )
```
]

.panel[.panel-name[Plot]
<img src="w3-d06-tidying_files/figure-html/unnamed-chunk-23-1.png" width="75%" style="display: block; margin: auto;" />
]

]

---

.panelset[
.panel[.panel-name[Code]

```r
ggplot(trump_longer, 
       aes(x = date, y = rating_value, 
           color = rating_type, group = rating_type)) +
  geom_line() +
  facet_wrap(~ subgroup) +
  scale_color_manual(values = c("darkgreen", "orange")) + 
  labs( 
    x = "Date", y = "Rating", 
    color = NULL, 
    title = "How (un)popular is Donald Trump?", 
    subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", 
    caption = "Source: FiveThirtyEight modeling estimates" 
  ) + 
* theme_minimal() +
* theme(legend.position = "bottom")
```
]

.panel[.panel-name[Plot]
<img src="w3-d06-tidying_files/figure-html/unnamed-chunk-24-1.png" width="75%" style="display: block; margin: auto;" />
]

]