Dplyr - Advanced Joins


This is a continuation of the dplyr basic joins post. In that data toolkit article I covered the most basic joins: left, right, and inner. In this section, I’ll be covering full, semi, and anti joins.


library('tidyverse')
library('dplyr')


To effectively demonstrate dplyr’s join functions, I’ll again have to prepare some data.


tib_1 <- tibble(x = 1:2, y = 2:1)
tib_1

## # A tibble: 2 x 2
##       x     y
##   <int> <int>
## 1     1     2
## 2     2     1


tib_2 <- tibble(x = c(1,3), a = 10, b = "a")
tib_2

## # A tibble: 2 x 3
##       x     a b    
##   <dbl> <dbl> <chr>
## 1     1    10 a    
## 2     3    10 a


Full Join

The most inclusive join of all, full_join() joins every observation from both tables, as the name suggests. If there is no match, the cell is filled in with NA.


full_join(tib_1, tib_2, by = "x")

## # A tibble: 3 x 4
##       x     y     a b    
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a    
## 2     2     1    NA <NA> 
## 3     3    NA    10 a


Semi Join

The first of two filter joins, semi_join(), keeps all observations in the x argument that have a match in the y argument. In this example, the key is the “x” column of tib_1; the only row that matches in tib_2 is the observation x = 1. Thus, the resulting semi_join only has one observation filtered from tib_1.


semi_join(tib_1, tib_2, by = "x")

## # A tibble: 1 x 2
##       x     y
##   <int> <int>
## 1     1     2


Anti Join

anti_join() is the antithesis of semi_join(), the function only returns observations that are not present in the y dataset.


anti_join(tib_1, tib_2)

## Joining, by = "x"

## # A tibble: 1 x 2
##       x     y
##   <int> <int>
## 1     2     1


That’s it for joining! Really an easy concept once you map out what exactly you’re looking for.

Thanks for reading!

- Fisher



Comments