Unpacking the Tidyverse - dplyr


Introduction

This is the third of eight installments of my Unpacking the Tidyverse series. Each installment focuses on one of the eight core packages in Hadley Wickham’s tidyverse. Instructions given in each post are mainly derived from Hadley’s textbook, R for Data Science, and CRAN package documentation. This installment of Unpacking the Tidyverse focuses on the data-wrangling package, dplyr. The previous installment focuses on the readr package, and can be found here. The next installment focuses on the tidyr package, and can be found here.

When it comes to data wrangling, the best teacher is experience. Only through hours of using dplyr will you become proficient with it. Many professional data scientists report that more than 80% of their time is spent data wrangling, so there is endless opportunity to practice with dplyr!

dplyr is by far the largest tidyverse package, it contains countless functions to help with all your data wrangling needs. In this post, I’ll focus on the three families of dplyr functions that I find the most useful: the five data-wrangling verbs, six types of data joins, and an assortment of helper functions.


library('tidyverse')      # includes dplyr
library('nycflights13')   # example dataset 


Dplyr Overview


Data Manipulation Verbs

filter()       # select rows based on value 
arrange()      # sort rows based on values 
select()       # zoom in on specified columns
mutate()       # create new variables from existing ones 
summarize()    # collapses a data frame into a single summary
group_by()     # analyze by specified group, useful for summarize 
%>%            # connecting pipe, read as "then"


Two-Table Functions

left_join()    # retains all data from left table, includes matching data from right
right_join()   # retains all data from right table, includes matching data from left
inner_join()   # retains only the data that is present in both tables
full_join()    # combines both tables completely 
semi_join()    # retains observations in the left table that have a match in the right table
anti_join()    # retains observations in the left table that do not have a match in the right table


Data Wrangling Functions

transmute()    # create new variables from existing ones, remove existing variables
ungroup()      # literally the name
desc()         # descending order (large to small, z to a)
tally()        # sums the number of entries in a dataset
count()        # groups then sums the number of entries 
n_distinct()   # sums the number of unique values in a dataset
starts_with()  # used with select, match entries that start with a string
ends_with()    # used with select, match entries that end with a string
contains()     # used with select, match entries that contain a string
matches()      # used with select, match entries with a regular expression
num_range()    # used with select, match a number between defined range
bind_rows()    # combines two data frames according to similar rows
bind_cols()    # combines two data frames according to similar columns
rename()       # renames columns
first()        # the first element of vector x
last()         # the last element
nth()          # the nth element 


I’ve broken down this post on dplyr into three distinct sections - data manipulation verbs, two-table functions, and data wrangling functions. The first section, data manipulation verbs, covers what I believe are the six most important functions in dplyr along with “then” pipe. If you only read one section of this post, I suggest you make it this section. Also of importance are the two-table functions introduced in dplyr. These six “join” functions allow the user complete control when combining multiple datasets. Finally, the data wrangling functions are functions that I’ve found useful in niche cases while exploring and analyzing datasets. This isn’t a complete list of data wrangling functions in dplyr, they’re just functions that I’ve personally found useful in my projects thus far.

To demonstrate the power of dplyr, we’ll be working with the flights dataset loaded from library(nycflights13). Flights is a large and diverse dataset, containing 19 variables and over 335,000 observations. Flights gives on-time data for all flights that departed from the three airports associated with New York City in 2013.


flights

## # A tibble: 336,776 x 19
##     year month   day dep_t… sched_… dep_d… arr_… sched… arr_d… carr… flig…
##    <int> <int> <int>  <int>   <int>  <dbl> <int>  <int>  <dbl> <chr> <int>
##  1  2013     1     1    517     515   2.00   830    819  11.0  UA     1545
##  2  2013     1     1    533     529   4.00   850    830  20.0  UA     1714
##  3  2013     1     1    542     540   2.00   923    850  33.0  AA     1141
##  4  2013     1     1    544     545  -1.00  1004   1022 -18.0  B6      725
##  5  2013     1     1    554     600  -6.00   812    837 -25.0  DL      461
##  6  2013     1     1    554     558  -4.00   740    728  12.0  UA     1696
##  7  2013     1     1    555     600  -5.00   913    854  19.0  B6      507
##  8  2013     1     1    557     600  -3.00   709    723 -14.0  EV     5708
##  9  2013     1     1    557     600  -3.00   838    846 - 8.00 B6       79
## 10  2013     1     1    558     600  -2.00   753    745   8.00 AA      301
## # ... with 336,766 more rows, and 8 more variables: tailnum <chr>, origin
## #   <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute
## #   <dbl>, time_hour <dttm>


Filter

filter() is a simple function that finds, or ‘filters’ observations that match true to a declared condition. In this example filter() first’s argument is the flights dataset, the following arguments declare the conditions to be met. Retain observations (rows) with the months variable equal to 12, and the day variable equal to 25. The result is a data frame with 719 flights that departed New York City on Christmas Day, 2013.

filter(flights, month == 12 & day == 25)

## # A tibble: 719 x 19
##     year month   day dep_t… sched_… dep_d… arr_… sched… arr_d… carr… flig…
##    <int> <int> <int>  <int>   <int>  <dbl> <int>  <int>  <dbl> <chr> <int>
##  1  2013    12    25    456     500  -4.00   649    651 - 2.00 US     1895
##  2  2013    12    25    524     515   9.00   805    814 - 9.00 UA     1016
##  3  2013    12    25    542     540   2.00   832    850 -18.0  AA     2243
##  4  2013    12    25    546     550  -4.00  1022   1027 - 5.00 B6      939
##  5  2013    12    25    556     600  -4.00   730    745 -15.0  AA      301
##  6  2013    12    25    557     600  -3.00   743    752 - 9.00 DL      731
##  7  2013    12    25    557     600  -3.00   818    831 -13.0  DL      904
##  8  2013    12    25    559     600  -1.00   855    856 - 1.00 B6      371
##  9  2013    12    25    559     600  -1.00   849    855 - 6.00 B6      605
## 10  2013    12    25    600     600   0      850    846   4.00 B6      583
## # ... with 709 more rows, and 8 more variables: tailnum <chr>, origin
## #   <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute
## #   <dbl>, time_hour <dttm>


Arrange

Often observations within a data frame are ordered arbitrarily, or unproductively. arrange() reorders observations according to its user specified arguments. In this example, the flights data frame is called upon as the first argument once again; this is common syntax within dplyr functions and I won’t be referencing it henceforth. The second argument is desc(dep_time). desc() is one of the secondary functions listed previously; it simply transforms a vector into a format that will be sorted in descending order. dep_time is the flights variable departure time, a number from 0000 to 2400, indicating the actual departure time of an individual aircraft. The resulting data frame has all 336,776 observations sorted from latest (highest) to earliest (lowest) departure time.

arrange(flights, desc(dep_time))

## # A tibble: 336,776 x 19
##     year month   day dep_t… sched… dep_de… arr_… sche… arr_de… carr… flig…
##    <int> <int> <int>  <int>  <int>   <dbl> <int> <int>   <dbl> <chr> <int>
##  1  2013    10    30   2400   2359    1.00   327   337 - 10.0  B6      839
##  2  2013    11    27   2400   2359    1.00   515   445   30.0  B6      745
##  3  2013    12     5   2400   2359    1.00   427   440 - 13.0  B6     1503
##  4  2013    12     9   2400   2359    1.00   432   440 -  8.00 B6     1503
##  5  2013    12     9   2400   2250   70.0     59  2356   63.0  B6     1816
##  6  2013    12    13   2400   2359    1.00   432   440 -  8.00 B6     1503
##  7  2013    12    19   2400   2359    1.00   434   440 -  6.00 B6     1503
##  8  2013    12    29   2400   1700  420      302  2025  397    AA     2379
##  9  2013     2     7   2400   2359    1.00   432   436 -  4.00 B6      727
## 10  2013     2     7   2400   2359    1.00   443   444 -  1.00 B6      739
## # ... with 336,766 more rows, and 8 more variables: tailnum <chr>, origin
## #   <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute
## #   <dbl>, time_hour <dttm>


Select

In a data frame with numerous variables it’s easy become overwhelmed and broad with analysis. select() reduces the number of variables in a data frame, only keeping variables that the user inputs as arguments. In this example, flights are reduced from 19 variables to the variables year, month and day.

select(flights, year:day)

## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows


Mutate

mutate() allows users to alter current variables and create new ones through various vectorized functions. In the above example the variable speed is created and is equal to distance divided by air time multiplied by 60. The pipe function %>%, is used to move the mutate() output to select(). The new variable, speed, is output by the select() function, along with each flight’s tail number, air time, and travel distance.

mutate(flights, speed = distance / air_time * 60) %>%
  select(tailnum, distance, air_time, speed)

## # A tibble: 336,776 x 4
##    tailnum distance air_time speed
##    <chr>      <dbl>    <dbl> <dbl>
##  1 N14228      1400    227     370
##  2 N24211      1416    227     374
##  3 N619AA      1089    160     408
##  4 N804JB      1576    183     517
##  5 N668DN       762    116     394
##  6 N39463       719    150     288
##  7 N516JB      1065    158     404
##  8 N829AS       229     53.0   259
##  9 N593JB       944    140     405
## 10 N3ALAA       733    138     319
## # ... with 336,766 more rows


Summarize

summarize() is perhaps the most complicated function in dplyr. Often used in conjunction with group_by(), summarize() collapses many values into a single summary. In the above example. summarize() finds the mean departure delay of all 336,776 observations in the flights dataset. As a result, the single value 12.64 (minutes) summarizes the mean departure delay.

summarize(flights, delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6


Group By

summarize() becomes extremely useful when paired with the final primary dplyr function, group_by(). group_by() changes the unit of analysis from the complete dataset to an individual group, thus changing the scope of summarize. In the above example, the flights dataset is grouped by the dest variable, destination. Grouping by destination by itself does nothing to the data frame, so the then pipe, %>%, is used to push the output to the summarize() function. Summarize first creates a count variable that is equivalent to the function n(). n() is another one of those secondary dplyr functions that often comes in handy; n() is a function that finds the number of observations in the current group. Next, summarize() creates a variable named distance based off the mean distance travelled by each observation, as grouped by destination. Finally, summarize() creates a variable named delay, based off the mean arrival delay each observation experienced, as grouped by destination. The resulting data frame gives excellent insight into each of the 105 destinations present in the flights dataset.

group_by(flights, dest) %>%                       
  summarize(count = n(),                          
            dist = mean(distance, na.rm = TRUE),  
            delay = mean(arr_delay, na.rm = TRUE) 
            ) 

## # A tibble: 105 x 4
##    dest  count  dist  delay
##    <chr> <int> <dbl>  <dbl>
##  1 ABQ     254  1826   4.38
##  2 ACK     265   199   4.85
##  3 ALB     439   143  14.4 
##  4 ANC       8  3370 - 2.50
##  5 ATL   17215   757  11.3 
##  6 AUS    2439  1514   6.02
##  7 AVL     275   584   8.00
##  8 BDL     443   116   7.05
##  9 BGR     375   378   8.03
## 10 BHM     297   866  16.9 
## # ... with 95 more rows


Complex Inquiries

denver_xmas_delay <- flights %>%                          # create new dataframe, denver_xmas_delay, then
  select(-tailnum) %>%                                    # select all variables except for tailnum 
  filter(month == 12 & day == 25 & dest == "DEN") %>%     # filter only flights with destination Denver on Christmas
  group_by(carrier) %>%                                   # now group fights by carrier company for summary analysis
  summarize(num_flights = n(),                            # create num_flights variable, equal to the count sorted by carrier
            avg_delay = mean(dep_delay)) %>%              # create avg_delay variable, equal to mean departure delay by carrier
  arrange(desc(avg_delay))                                # arrange these carriers by the new avg_delay variable
denver_xmas_delay                                         # print results

## # A tibble: 5 x 3
##   carrier num_flights avg_delay
##   <chr>         <int>     <dbl>
## 1 UA                6     14.0 
## 2 WN                4     10.8 
## 3 DL                3    - 1.00
## 4 F9                2    - 4.00
## 5 B6                1    - 5.00



Two-Table Functions

There are two types of two-table joining functions, mutating joins and filtering joins. Mutating joins allow you to add new variables from one table to matching observations in another table. these functions are left_join(), right_join(), inner_join(), and outter_join(). Filtering Joins filter observations from one table based on matched observations in the secondary table; these two functions are semi_join() and anti_join().

All six of the join functions are structure for the first arguments to be x and y. These are the tables that are combined or filtered; the output always takes the same structure as the left table, x. The third argument, by defines the “key”, a column that occurs in both tables that you want to join. They key in the first table is declared the “primary key”, the key in the second table is declared the secondary, or foreign key. The Primary key must uniquely ID each row in the first dataset, but not necessary the second dataset. This means in some cases that the primary key is more than one column.

They trick to understanding the six join functions is understanding how to use the key values. Let’s get started with the most popular function, left_join().


Left Join

The left_join() function keeps all observations in your left table (argument x). Your primary table will never lose observations when using left join, it will only gain additional observations from the right table (argument y) based on the defined key.

Here we’ll set up a simple example tibbles -

tib_1 <- tibble(x = 1:2, y = 2:1)
tib_1

## # A tibble: 2 x 2
##       x     y
##   <int> <int>
## 1     1     2
## 2     2     1

tib_2 <- tibble(x = c(1,3), a = 10, b = "a")
tib_2

## # A tibble: 2 x 3
##       x     a b    
##   <dbl> <dbl> <chr>
## 1  1.00  10.0 a    
## 2  3.00  10.0 a


left_join(tib_1, tib_2, by = "x")

## # A tibble: 2 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1  1.00     2  10.0 a    
## 2  2.00     1  NA   <NA>


When you don’t list a by argument, dplyr finds the key columns on its own, most of the time it’s right, but not always.

left_join(tib_1, tib_2)

## Joining, by = "x"

## # A tibble: 2 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1  1.00     2  10.0 a    
## 2  2.00     1  NA   <NA>


Right Join

The mirror image of left_join(), right_join() includes all of the columns in the y table, and adds matching observations from the x table.

right_join(tib_1, tib_2)

## Joining, by = "x"

## # A tibble: 2 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1  1.00     2  10.0 a    
## 2  3.00    NA  10.0 a


Inner Join

inner_join() is simple, it only includes observations that are present in both x and y. Every row must be present in both datasets. With the key by = "x", only one row is present in both example tibbles.

inner_join(tib_1, tib_2, by = "x")

## # A tibble: 1 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1  1.00     2  10.0 a


Full Join

The most inclusive join of all, full_join() joins every observation from both tables, as the name suggests.

full_join(tib_1, tib_2, by = "x")

## # A tibble: 3 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1  1.00     2  10.0 a    
## 2  2.00     1  NA   <NA> 
## 3  3.00    NA  10.0 a


Semi Join

The first of two filter joins, semi_join(), keeps all observations in the x argument that have a match in the y argument. In this example, the key is the “x” column of tib_1; the only row that matches in tib_2 is the observation x = 1. Thus, the resulting semi_join only has one observation filtered from tib_1.

semi_join(tib_1, tib_2, by = "x")

## # A tibble: 1 x 2
##       x     y
##   <int> <int>
## 1     1     2


Anti Join

anti_join() is the antithesis of semi_join(), the function only returns observations that are not present in the y dataset.

anti_join(tib_1, tib_2)

## Joining, by = "x"

## # A tibble: 1 x 2
##       x     y
##   <int> <int>
## 1     2     1


Data Wrangling Functions


transmute()

Technically a data manipulation verb, transmute adds a new variable to the data structure, just like mutate. The difference between transmute and mutate is that mutate preserves the pre-existing variables while transmute drops them. Usage arguments are the same as mutate.

ungroup()

After using the function group_by() data operations are performed on the groups defined by this functions output. Ungroup will undo the effects of group_by(), allowing for verbs like summarize() to once again be conducted on the entire dataset.

desc()

When using the data wrangling verb arrange(), dplyr automatically sorts a datasets observations alphanumerically. If, for instance, you wanted to arrange a set of values from highest to lowest, you must use the desc() function inside arrange. I exemplify desc() in the arrange() example code above.

tally()

tally() is short hand for the function summarize(). The function simply calls the n() function, which defines the number of entries in your x argument. If you’re calling tally() multiple times for a single dataset, it will call sum(n), and create a cumulative tally. tally() will takes an x argument and optionally a wt and sort argument. x is the table to count, wt allows for a weighted tally through the defined variable wt, and sort will sort the output in descending order of n if left as TRUE.

count()

A very similar function to tally(), count() calls group_by() before performing the tally, then calls ungroup() once it’s complete.

n_distinct()

number of unique values in vector x

starts_with()

Useful with the select() data wrangling verb, starts_with() allows you to select data entries that start with a defined prefix. The first argument is a string that you want to match within the dataset; the second argument is ignore.case which defaults to TRUE.

ends_with()

Similar to starts_with(), this verb allows you to select entries from a dataset that end with a defined string.

contains()

Allows you to select() data entries that contain a defined literal string anywhere within them.

matches()

Allows you to select() data entries that match a regular expression anywhere within them.

num_range()

Allows you to select() data entries that contain a defined range of numbers given as the first argument in num_range().

bind_rows()

An efficient implementation of base R’s rbind() function, bind_rows() takes data frames as arguments and combines the rows, or observations, according to the columns they have in common.

bind_cols()

bind_rows() counterpart, bind_cols() also takes data frames as arguments and combines them column-wise according to their common observations.

rename()

rename() does exactly what it’s name suggest, taking a dataset as its first argument, and an expression defining the new names based on the old names as its second argument. for example, to rename multiple columns you would write a statement like rename(data, c(new_name_1, new_name_2) = c(old_name_1, old_name_2)).

first()

A simple wrapper that will select and return the value that occupies the first slot in a data structure. The first and only argument defines the data structure.

last()

A simple wrapper that will select and return the value that occupies the last slot in a data structure. The first and only argument defines the data structure.

nth()

A simple wrapper than will select and return the value that occupies the nth slot in a data structure. The first argument defines the data structure and the second argument defines the nth slot.


Additional Resources


Data Wrangling can sometimes be a frustrating process, and it often takes up so much time that the task seems endless. dplyr makes the task easier, but not simple. It takes hours of practice to get “good” at data wrangling, I’m certainly not there yet. Hopefully this short reference guide exposed you to an interesting function, or clarified how to use a wrangling verb for you. Stay tuned for part four of this eight part series, Unpacking the Tidyverse - tidyr.

Until next time,
- Fisher

Comments