Introduction
This is the third of eight installments of my Unpacking the Tidyverse series. Each installment focuses on one of the eight core packages in Hadley Wickham’s tidyverse. Instructions given in each post are mainly derived from Hadley’s textbook, R for Data Science, and CRAN package documentation. This installment of Unpacking the Tidyverse focuses on the data-wrangling package, dplyr. The previous installment focuses on the readr package, and can be found here. The next installment focuses on the tidyr package, and can be found here.
When it comes to data wrangling, the best teacher is experience. Only through hours of using dplyr will you become proficient with it. Many professional data scientists report that more than 80% of their time is spent data wrangling, so there is endless opportunity to practice with dplyr!
dplyr is by far the largest tidyverse package, it contains countless functions to help with all your data wrangling needs. In this post, I’ll focus on the three families of dplyr functions that I find the most useful: the five data-wrangling verbs, six types of data joins, and an assortment of helper functions.
library('tidyverse') # includes dplyr
library('nycflights13') # example dataset
Dplyr Overview
Data Manipulation Verbs
filter() # select rows based on value
arrange() # sort rows based on values
select() # zoom in on specified columns
mutate() # create new variables from existing ones
summarize() # collapses a data frame into a single summary
group_by() # analyze by specified group, useful for summarize
%>% # connecting pipe, read as "then"
Two-Table Functions
left_join() # retains all data from left table, includes matching data from right
right_join() # retains all data from right table, includes matching data from left
inner_join() # retains only the data that is present in both tables
full_join() # combines both tables completely
semi_join() # retains observations in the left table that have a match in the right table
anti_join() # retains observations in the left table that do not have a match in the right table
Data Wrangling Functions
transmute() # create new variables from existing ones, remove existing variables
ungroup() # literally the name
desc() # descending order (large to small, z to a)
tally() # sums the number of entries in a dataset
count() # groups then sums the number of entries
n_distinct() # sums the number of unique values in a dataset
starts_with() # used with select, match entries that start with a string
ends_with() # used with select, match entries that end with a string
contains() # used with select, match entries that contain a string
matches() # used with select, match entries with a regular expression
num_range() # used with select, match a number between defined range
bind_rows() # combines two data frames according to similar rows
bind_cols() # combines two data frames according to similar columns
rename() # renames columns
first() # the first element of vector x
last() # the last element
nth() # the nth element
I’ve broken down this post on dplyr into three distinct sections - data manipulation verbs, two-table functions, and data wrangling functions. The first section, data manipulation verbs, covers what I believe are the six most important functions in dplyr along with “then” pipe. If you only read one section of this post, I suggest you make it this section. Also of importance are the two-table functions introduced in dplyr. These six “join” functions allow the user complete control when combining multiple datasets. Finally, the data wrangling functions are functions that I’ve found useful in niche cases while exploring and analyzing datasets. This isn’t a complete list of data wrangling functions in dplyr, they’re just functions that I’ve personally found useful in my projects thus far.
To demonstrate the power of dplyr, we’ll be working with the flights
dataset loaded from library(nycflights13)
. Flights is a large and
diverse dataset, containing 19 variables and over 335,000 observations.
Flights gives on-time data for all flights that departed from the three
airports associated with New York City in 2013.
flights
## # A tibble: 336,776 x 19
## year month day dep_t… sched_… dep_d… arr_… sched… arr_d… carr… flig…
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545
## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714
## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141
## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725
## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461
## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696
## 7 2013 1 1 555 600 -5.00 913 854 19.0 B6 507
## 8 2013 1 1 557 600 -3.00 709 723 -14.0 EV 5708
## 9 2013 1 1 557 600 -3.00 838 846 - 8.00 B6 79
## 10 2013 1 1 558 600 -2.00 753 745 8.00 AA 301
## # ... with 336,766 more rows, and 8 more variables: tailnum <chr>, origin
## # <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute
## # <dbl>, time_hour <dttm>
Filter
filter()
is a simple function that finds, or ‘filters’ observations
that match true to a declared condition. In this example filter()
first’s argument is the flights dataset, the following arguments declare
the conditions to be met. Retain observations (rows) with the months
variable equal to 12, and the day variable equal to 25. The result is a
data frame with 719 flights that departed New York City on Christmas
Day, 2013.
filter(flights, month == 12 & day == 25)
## # A tibble: 719 x 19
## year month day dep_t… sched_… dep_d… arr_… sched… arr_d… carr… flig…
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 12 25 456 500 -4.00 649 651 - 2.00 US 1895
## 2 2013 12 25 524 515 9.00 805 814 - 9.00 UA 1016
## 3 2013 12 25 542 540 2.00 832 850 -18.0 AA 2243
## 4 2013 12 25 546 550 -4.00 1022 1027 - 5.00 B6 939
## 5 2013 12 25 556 600 -4.00 730 745 -15.0 AA 301
## 6 2013 12 25 557 600 -3.00 743 752 - 9.00 DL 731
## 7 2013 12 25 557 600 -3.00 818 831 -13.0 DL 904
## 8 2013 12 25 559 600 -1.00 855 856 - 1.00 B6 371
## 9 2013 12 25 559 600 -1.00 849 855 - 6.00 B6 605
## 10 2013 12 25 600 600 0 850 846 4.00 B6 583
## # ... with 709 more rows, and 8 more variables: tailnum <chr>, origin
## # <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute
## # <dbl>, time_hour <dttm>
Arrange
Often observations within a data frame are ordered arbitrarily, or
unproductively. arrange()
reorders observations according to its user
specified arguments. In this example, the flights data frame is called
upon as the first argument once again; this is common syntax within
dplyr functions and I won’t be referencing it henceforth. The second
argument is desc(dep_time)
. desc()
is one of the secondary functions
listed previously; it simply transforms a vector into a format that will
be sorted in descending order. dep_time
is the flights variable
departure time, a number from 0000 to 2400, indicating the actual
departure time of an individual aircraft. The resulting data frame has
all 336,776 observations sorted from latest (highest) to earliest
(lowest) departure time.
arrange(flights, desc(dep_time))
## # A tibble: 336,776 x 19
## year month day dep_t… sched… dep_de… arr_… sche… arr_de… carr… flig…
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 10 30 2400 2359 1.00 327 337 - 10.0 B6 839
## 2 2013 11 27 2400 2359 1.00 515 445 30.0 B6 745
## 3 2013 12 5 2400 2359 1.00 427 440 - 13.0 B6 1503
## 4 2013 12 9 2400 2359 1.00 432 440 - 8.00 B6 1503
## 5 2013 12 9 2400 2250 70.0 59 2356 63.0 B6 1816
## 6 2013 12 13 2400 2359 1.00 432 440 - 8.00 B6 1503
## 7 2013 12 19 2400 2359 1.00 434 440 - 6.00 B6 1503
## 8 2013 12 29 2400 1700 420 302 2025 397 AA 2379
## 9 2013 2 7 2400 2359 1.00 432 436 - 4.00 B6 727
## 10 2013 2 7 2400 2359 1.00 443 444 - 1.00 B6 739
## # ... with 336,766 more rows, and 8 more variables: tailnum <chr>, origin
## # <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute
## # <dbl>, time_hour <dttm>
Select
In a data frame with numerous variables it’s easy become overwhelmed and
broad with analysis. select()
reduces the number of variables in a
data frame, only keeping variables that the user inputs as arguments. In
this example, flights are reduced from 19 variables to the variables
year, month and day.
select(flights, year:day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
Mutate
mutate()
allows users to alter current variables and create new ones
through various vectorized functions. In the above example the variable
speed is created and is equal to distance divided by air time multiplied
by 60. The pipe function %>%, is used to move the mutate()
output
to select()
. The new variable, speed, is output by the select()
function, along with each flight’s tail number, air time, and travel
distance.
mutate(flights, speed = distance / air_time * 60) %>%
select(tailnum, distance, air_time, speed)
## # A tibble: 336,776 x 4
## tailnum distance air_time speed
## <chr> <dbl> <dbl> <dbl>
## 1 N14228 1400 227 370
## 2 N24211 1416 227 374
## 3 N619AA 1089 160 408
## 4 N804JB 1576 183 517
## 5 N668DN 762 116 394
## 6 N39463 719 150 288
## 7 N516JB 1065 158 404
## 8 N829AS 229 53.0 259
## 9 N593JB 944 140 405
## 10 N3ALAA 733 138 319
## # ... with 336,766 more rows
Summarize
summarize()
is perhaps the most complicated function in dplyr. Often
used in conjunction with group_by()
, summarize()
collapses many
values into a single summary. In the above example. summarize()
finds
the mean departure delay of all 336,776 observations in the flights
dataset. As a result, the single value 12.64 (minutes) summarizes the
mean departure delay.
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 12.6
Group By
summarize()
becomes extremely useful when paired with the final
primary dplyr function, group_by()
. group_by()
changes the unit of
analysis from the complete dataset to an individual group, thus changing
the scope of summarize. In the above example, the flights dataset is
grouped by the dest
variable, destination. Grouping by destination by
itself does nothing to the data frame, so the then pipe, %>%, is used
to push the output to the summarize()
function. Summarize first
creates a count variable that is equivalent to the function n()
. n()
is another one of those secondary dplyr functions that often comes in
handy; n()
is a function that finds the number of observations in the
current group. Next, summarize()
creates a variable named distance
based off the mean distance travelled by each observation, as grouped by
destination. Finally, summarize()
creates a variable named delay,
based off the mean arrival delay each observation experienced, as
grouped by destination. The resulting data frame gives excellent insight
into each of the 105 destinations present in the flights dataset.
group_by(flights, dest) %>%
summarize(count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
## # A tibble: 105 x 4
## dest count dist delay
## <chr> <int> <dbl> <dbl>
## 1 ABQ 254 1826 4.38
## 2 ACK 265 199 4.85
## 3 ALB 439 143 14.4
## 4 ANC 8 3370 - 2.50
## 5 ATL 17215 757 11.3
## 6 AUS 2439 1514 6.02
## 7 AVL 275 584 8.00
## 8 BDL 443 116 7.05
## 9 BGR 375 378 8.03
## 10 BHM 297 866 16.9
## # ... with 95 more rows
Complex Inquiries
denver_xmas_delay <- flights %>% # create new dataframe, denver_xmas_delay, then
select(-tailnum) %>% # select all variables except for tailnum
filter(month == 12 & day == 25 & dest == "DEN") %>% # filter only flights with destination Denver on Christmas
group_by(carrier) %>% # now group fights by carrier company for summary analysis
summarize(num_flights = n(), # create num_flights variable, equal to the count sorted by carrier
avg_delay = mean(dep_delay)) %>% # create avg_delay variable, equal to mean departure delay by carrier
arrange(desc(avg_delay)) # arrange these carriers by the new avg_delay variable
denver_xmas_delay # print results
## # A tibble: 5 x 3
## carrier num_flights avg_delay
## <chr> <int> <dbl>
## 1 UA 6 14.0
## 2 WN 4 10.8
## 3 DL 3 - 1.00
## 4 F9 2 - 4.00
## 5 B6 1 - 5.00
Two-Table Functions
There are two types of two-table joining functions, mutating joins and
filtering joins. Mutating joins allow you to add new variables from one
table to matching observations in another table. these functions are
left_join()
, right_join()
, inner_join()
, and outter_join()
.
Filtering Joins filter observations from one table based on matched
observations in the secondary table; these two functions are
semi_join()
and anti_join()
.
All six of the join functions are structure for the first arguments to
be x
and y
. These are the tables that are combined or filtered; the
output always takes the same structure as the left table, x
. The third
argument, by
defines the “key”, a column that occurs in both tables
that you want to join. They key in the first table is declared the
“primary key”, the key in the second table is declared the secondary, or
foreign key. The Primary key must uniquely ID each row in the first
dataset, but not necessary the second dataset. This means in some cases
that the primary key is more than one column.
They trick to understanding the six join functions is understanding how
to use the key values. Let’s get started with the most popular function,
left_join()
.
Left Join
The left_join()
function keeps all observations in your left table
(argument x
). Your primary table will never lose observations when
using left join, it will only gain additional observations from the
right table (argument y
) based on the defined key.
Here we’ll set up a simple example tibbles -
tib_1 <- tibble(x = 1:2, y = 2:1)
tib_1
## # A tibble: 2 x 2
## x y
## <int> <int>
## 1 1 2
## 2 2 1
tib_2 <- tibble(x = c(1,3), a = 10, b = "a")
tib_2
## # A tibble: 2 x 3
## x a b
## <dbl> <dbl> <chr>
## 1 1.00 10.0 a
## 2 3.00 10.0 a
left_join(tib_1, tib_2, by = "x")
## # A tibble: 2 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1.00 2 10.0 a
## 2 2.00 1 NA <NA>
When you don’t list a by
argument, dplyr finds the key columns on its
own, most of the time it’s right, but not always.
left_join(tib_1, tib_2)
## Joining, by = "x"
## # A tibble: 2 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1.00 2 10.0 a
## 2 2.00 1 NA <NA>
Right Join
The mirror image of left_join()
, right_join()
includes all of the
columns in the y
table, and adds matching observations from the x
table.
right_join(tib_1, tib_2)
## Joining, by = "x"
## # A tibble: 2 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1.00 2 10.0 a
## 2 3.00 NA 10.0 a
Inner Join
inner_join()
is simple, it only includes observations that are present
in both x
and y
. Every row must be present in both datasets. With
the key by = "x"
, only one row is present in both example tibbles.
inner_join(tib_1, tib_2, by = "x")
## # A tibble: 1 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1.00 2 10.0 a
Full Join
The most inclusive join of all, full_join()
joins every observation
from both tables, as the name suggests.
full_join(tib_1, tib_2, by = "x")
## # A tibble: 3 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1.00 2 10.0 a
## 2 2.00 1 NA <NA>
## 3 3.00 NA 10.0 a
Semi Join
The first of two filter joins, semi_join()
, keeps all observations in
the x
argument that have a match in the y
argument. In this example,
the key is the “x” column of tib_1
; the only row that matches in
tib_2
is the observation x = 1. Thus, the resulting semi_join only
has one observation filtered from tib_1
.
semi_join(tib_1, tib_2, by = "x")
## # A tibble: 1 x 2
## x y
## <int> <int>
## 1 1 2
Anti Join
anti_join()
is the antithesis of semi_join()
, the function only
returns observations that are not present in the y
dataset.
anti_join(tib_1, tib_2)
## Joining, by = "x"
## # A tibble: 1 x 2
## x y
## <int> <int>
## 1 2 1
Data Wrangling Functions
transmute()
Technically a data manipulation verb, transmute adds a new variable to the data structure, just like mutate. The difference between transmute and mutate is that mutate preserves the pre-existing variables while transmute drops them. Usage arguments are the same as mutate.
ungroup()
After using the function group_by()
data operations are performed on
the groups defined by this functions output. Ungroup will undo the
effects of group_by()
, allowing for verbs like summarize()
to once
again be conducted on the entire dataset.
desc()
When using the data wrangling verb arrange()
, dplyr automatically
sorts a datasets observations alphanumerically. If, for instance, you
wanted to arrange a set of values from highest to lowest, you must use
the desc()
function inside arrange. I exemplify desc()
in the
arrange()
example code above.
tally()
tally()
is short hand for the function summarize()
. The function
simply calls the n()
function, which defines the number of entries in
your x
argument. If you’re calling tally()
multiple times for a
single dataset, it will call sum(n)
, and create a cumulative tally.
tally()
will takes an x
argument and optionally a wt
and sort
argument. x
is the table to count, wt
allows for a weighted tally
through the defined variable wt
, and sort
will sort the output in
descending order of n if left as TRUE.
count()
A very similar function to tally()
, count()
calls group_by()
before performing the tally, then calls ungroup()
once it’s complete.
n_distinct()
number of unique values in vector x
starts_with()
Useful with the select()
data wrangling verb, starts_with()
allows
you to select data entries that start with a defined prefix. The first
argument is a string that you want to match within the dataset; the
second argument is ignore.case
which defaults to TRUE.
ends_with()
Similar to starts_with()
, this verb allows you to select entries from
a dataset that end with a defined string.
contains()
Allows you to select()
data entries that contain a defined literal
string anywhere within them.
matches()
Allows you to select()
data entries that match a regular expression
anywhere within them.
num_range()
Allows you to select()
data entries that contain a defined range of
numbers given as the first argument in num_range()
.
bind_rows()
An efficient implementation of base R’s rbind()
function,
bind_rows()
takes data frames as arguments and combines the rows, or
observations, according to the columns they have in common.
bind_cols()
bind_rows()
counterpart, bind_cols()
also takes data frames as
arguments and combines them column-wise according to their common
observations.
rename()
rename()
does exactly what it’s name suggest, taking a dataset as its
first argument, and an expression defining the new names based on the
old names as its second argument. for example, to rename multiple
columns you would write a statement like
rename(data, c(new_name_1, new_name_2) = c(old_name_1, old_name_2))
.
first()
A simple wrapper that will select and return the value that occupies the first slot in a data structure. The first and only argument defines the data structure.
last()
A simple wrapper that will select and return the value that occupies the last slot in a data structure. The first and only argument defines the data structure.
nth()
A simple wrapper than will select and return the value that occupies the nth slot in a data structure. The first argument defines the data structure and the second argument defines the nth slot.
Additional Resources
- R For Data Science Chapter 5
- R For Data Science Chapter 13
- CRAN Dplyr Documentation
- CRAN Introduction to Dplyr
- CRAN Two Table Verbs
- R Studio Data Wrangling Cheat Sheet
Data Wrangling can sometimes be a frustrating process, and it often takes up so much time that the task seems endless. dplyr makes the task easier, but not simple. It takes hours of practice to get “good” at data wrangling, I’m certainly not there yet. Hopefully this short reference guide exposed you to an interesting function, or clarified how to use a wrangling verb for you. Stay tuned for part four of this eight part series, Unpacking the Tidyverse - tidyr.
Until next time,
- Fisher
Comments