The concept of tidy data is an extremely important one. Often, we spend a lot of our time preparing the data to be analyzed instead of actually conducting the analysis. Hadley Wickham, the creator of tidyr and the tidyverse wrote a foundational paper on the topic in 2014. I suggest giving that paper a read, then coming back to learn about tidyr.
library('tidyverse')
Separate
Sometimes a column contains more than variable worth of information in
each element. In table3
this is exemplified by the rate
column.
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
A simple problem to fix, the separate()
function splits two variables
apart from a single column. The first argument in separate()
is the
column in question, in this case it’s rate
. The next argument is a bit
more complex; the into
argument needs a concatenated input, c()
to
define the names and number of new columns. As for table3, the new
columns are defined as into = c("cases", "population")
. Finally, a
separator character can be defined in the third argument. The separator
defaults to any non-alphanumeric character, but can be customized.
table3 %>%
separate(rate, into = c("cases", "population"))
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Unite
When a single variable is spread between multiple columns, it’s time to
unite them. In table5
the century and year variables should be united
to create a single variable, the full year.
table5
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
The unite()
function works in a similar way to separate()
, as it is
it’s inverse. The first argument of unite()
is to define the name of
the new, united, variable. In the case of table3
, this variable shall
be called full_year
. The next arguments define the columns that need
to be united; this can be any number of columns. century
and year
need to be united to form full_year
in this example. Finally, it’s
sometimes wise to indicate a separator character, just as with
separate()
. The separator character in unite()
defaults to the
underscore, _
.
table5 %>%
unite(full_year, century, year, sep = "")
## # A tibble: 6 x 3
## country full_year rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
Thanks for reading!
- Fisher
Comments