Dplyr - Basic Joins


There are two types of two-table joining functions, mutating joins and filtering joins. Mutating joins allow you to add new variables from one table to matching observations in another table. these functions are left_join(), right_join(), and inner_join(). Filtering joins filter observations from one table based on matched observations in the secondary table; these two functions are semi_join() and anti_join().

All six of the join functions are structure for the first arguments to be x and y. These are the tables that are combined or filtered; the output always takes the same structure as the left table, x. The third argument, by defines the “key”, a column that occurs in both tables that you want to join. They key in the first table is declared the “primary key”, the key in the second table is declared the secondary, or foreign key. The Primary key must uniquely ID each row in the first dataset, but not necessary the second dataset. This means in some cases that the primary key is more than one column.

They trick to understanding the six join functions is understanding how to use the key values. Let’s get started with the most popular function, left_join().


library('tidyverse')
library('dplyr')


Left Join

The left_join() function keeps all observations in your left table (argument x). Your primary table will never lose observations when using left join, it will only gain additional observations from the right table (argument y) based on the defined key.

Here we’ll set up a simple example tibbles -


tib_1 <- tibble(x = 1:2, y = 2:1)
tib_1

## # A tibble: 2 x 2
##       x     y
##   <int> <int>
## 1     1     2
## 2     2     1


tib_2 <- tibble(x = c(1,3), a = 10, b = "a")
tib_2

## # A tibble: 2 x 3
##       x     a b    
##   <dbl> <dbl> <chr>
## 1     1    10 a    
## 2     3    10 a


left_join(tib_1, tib_2, by = "x")

## # A tibble: 2 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA>


When you don’t list a by argument, dplyr finds the key columns on its own, most of the time it’s right, but not always.


left_join(tib_1, tib_2)

## Joining, by = "x"

## # A tibble: 2 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA>


Right Join

The mirror image of left_join(), right_join() includes all of the columns in the y table, and adds matching observations from the x table.


right_join(tib_1, tib_2)

## Joining, by = "x"

## # A tibble: 2 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     3    NA    10 a


Inner Join

inner_join() is simple, it only includes observations that are present in both x and y. Every row must be present in both datasets. With the key by = "x", only one row is present in both example tibbles.


inner_join(tib_1, tib_2, by = "x")

## # A tibble: 1 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a


That’s all for now! Check out my article on advanced joins in R for more information on this subject.

- Fisher



Comments