The concept of tidy data is an extremely important one. Often, we spend a lot of our time preparing the data to be analyzed instead of actually conducting the analysis. Hadley Wickham, the creator of tidyr and the tidyverse wrote a foundational paper on the topic in 2014. I suggest giving that paper a read, then coming back to learn about tidyr.
library('tidyverse')
Fill
When dealing with missing data, it can be the case that you know that missing values are supposed to be carried on from the last observation. Something along the line of “ditto” quotations on a sign-up sheet. In the tibble treatment, we see just that.
treatment
## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 <NA> 2 10
## 3 <NA> 3 9
## 4 Katherine Burke 1 4
The function fill()
is the perfect fix for this situation. fill()
takes a set of columns where you want missing values to be replaced with
the most recent non-missing value. Simply input the column in question
as the argument in fill()
, and let R do the rest. In the case of the
tibble treatment
, the column in question is person
.
treatment %>%
fill(person)
## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 Derrick Whitmore 2 10
## 3 Derrick Whitmore 3 9
## 4 Katherine Burke 1 4
Complete
When dealing with missing data it’s often important to turn implicitly missing values to explicit missing values. There are two missing values from the stocks tibble, 4th quarter 2015 and 1st quarter 2016.
stocks
## # A tibble: 7 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 2 0.92
## 6 2016 3 0.17
## 7 2016 4 2.66
The complete()
function takes a set of columns, and finds all unique
combinations. It ensures the original dataset contains all those values,
explicitly filling in NA
when necessary. The input arguments of
complete()
are simply the columns you want to cross reference. In the
case of stocks
we want to find all of the combinations between the
year
and qtr
variable, as to fill in implicit missing variables with
NA
.
stocks %>%
complete(year, qtr)
## # A tibble: 8 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 1 NA
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66
Until next time,
- Fisher
Comments