tidyr - Gather and Spread Functions


The concept of tidy data is an extremely important one. Often, we spend a lot of our time preparing the data to be analyzed instead of actually conducting the analysis. Hadley Wickham, the creator of tidyr and the tidyverse wrote a foundational paper on the topic in 2014. I suggest giving that paper a read, then coming back to learn about tidyr.


library('tidyverse')


Gather

Often, a dataset has column names that are not the names of variables, but the values of variables. take table4a for example.


table4a

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766


In table4a, the column names 1999 and 2000 are values of the variable year, not variables themselves. The gather() function can fix this, it is perhaps the most commonly used function in tidyr. Begin by listing the column names that are actually variable names into gather() as its first arguments. Once these column names have been entered, define a key as the second argument. In this case, 1999 and 2000 are both years, so the key, or new variable name, is year. Finally, define a value to be associated with the key as the last argument in gather(). The value is the name given to the elements that were previously linked to the previous 1999 and 2000 false variables. Because these two years have been gathered into a new year variable, the associated elements need a new column to call home. In this case, the values represent a number positive cases, so the value is now defined as cases.


table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766


Spread

When an observation is scattered across multiple rows, variable names appear repeatedly as elements in a dataframe. In table2 a single observation is a country in each year, but each observation is spread across two rows.


table2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583


To tidy this data, utilize spread(), one of the most common functions in tidyr. The first argument in spread() is the column that contains the variable names, the key column. In the case of table2, the key column is type, because cases and population are both variables, not values. The second and final argument in spread() is the column that contains values associated with the previously defined key, the value column. In table2 the it is the count column that contains the associated values.


table2 %>%
spread(key = type, value = count)

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583


Thanks for reading!

- Fisher



Comments