Unpacking the Tidyverse - tibble


Introduction

This is the fifth of eight installments of my Unpacking the Tidyverse series. Each installment focuses on one of the eight core packages in Hadley Wickham’s tidyverse. Instructions given in each post are mainly derived from Hadley’s textbook, R for Data Science, and CRAN package documentation. This installment of Unpacking the Tidyverse focuses on the modernized data frame package, tibble. The previous installment focuses on the tidyr package, and can be found [here]{ site.baseurl }/r4ds-tidyr). The next installment focuses on the stringr package, and can be found here.

Tibbles are an updated version of the base R data frames, the fundamental data structure used in R. Tibbles are default throughout the tidyverse and are compatible with most other modern packages, they keep the best qualities of the data frame while dropping the features that are less than desirable.

library('tidyverse')


Creating Tibbles

It’s simple to create a tibble - instead of using base R’s data.frame() function, use tibble’s tibble() function. If you’re looking to coerce an object into a tibble, use as_tibble() instead of as.data.frame(). The function as_tibble() was created with speed in mind, it is much quicker than the base R counterpart.

Using tibbles instead of data frames is an easy habit to form, and the benefits of using tibbles make it time well spent. Tibbles never change input types like data frames do, they also never adjust the names of variables. Tibbles evaluate arguments lazily and sequentially, resulting in more user-friendly structure creation and manipulation. They also don’t use rownames() and store variables as special attributes; tibbles are a standardized data frame that consistently simplify the user experience.


Tibble vs Data Frames

In addition to the previously mentioned benefits of tibbles, here are perhaps the three most important changes made from the outdated data frame.


Printing

Objects as a data.frame will print every column in the data frame. This behavior is rarely useful, so I’ve used the head() function to limit the output.

head(iris, n = 10)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa


When an object is stored as a tibble, calling it will automatically limit the output to ten rows.

iris.tib <- as_tibble(iris)
iris.tib

## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fctr> 
##  1         5.10        3.50         1.40       0.200 setosa 
##  2         4.90        3.00         1.40       0.200 setosa 
##  3         4.70        3.20         1.30       0.200 setosa 
##  4         4.60        3.10         1.50       0.200 setosa 
##  5         5.00        3.60         1.40       0.200 setosa 
##  6         5.40        3.90         1.70       0.400 setosa 
##  7         4.60        3.40         1.40       0.300 setosa 
##  8         5.00        3.40         1.50       0.200 setosa 
##  9         4.40        2.90         1.40       0.200 setosa 
## 10         4.90        3.10         1.50       0.100 setosa 
## # ... with 140 more rows


You’ll also notice that tibbles inform you on the data structures and dimensions, data frames do not. If you want to view the entire dataset, the View() function in RStudio is a great option.


Subsetting

Tibbles are more strict on subsetting; remember that a single bracket [ will produce another tibble (multiple vectors) and a double bracket [[ will produce a single vector.

  • [= Multiple Vectors
  • [[ = Single Vector

You can also use the $ to pull single vector of information, but only by its name.

When using $ within a tibble, don’t expect the partial matching behavior that’s found in data frames.

df <- data.frame(abc = 1)
df$a

## [1] 1


df2 <- tibble(abc = 1)
df2$a

## Warning: Unknown or uninitialised column: 'a'.

## NULL


If you’re a fan of the magrittr pipe like I am, you’ll need to use the special character . to subset the tibble.

df <- tibble(
  x = runif(5),
  y = rnorm(5)
)

df %>% .$x

## [1] 0.20769996 0.44721826 0.17946917 0.05599387 0.84797192

df %>% .[["x"]]

## [1] 0.20769996 0.44721826 0.17946917 0.05599387 0.84797192


Recycling

My favorite from data frames is the lack of vector recycling in tibbles. Within data.frames, if a vector doesn’t fit the structures dimensions it is repeated or “recycled” until it does.

data.frame(a = 1:6, b = 1:2)

##   a b
## 1 1 1
## 2 2 2
## 3 3 1
## 4 4 2
## 5 5 1
## 6 6 2


Tibbles don’t recycle vectors, unless they’re of length 1.

tibble(a = 1:6, b = 1:2)

## Error: Column `b` must be length 1 or 6, not 2


And that does it for the tibble package! A simple but useful component of the tidyverse that lays a great foundation that the other packages build from. If you’d like additional information on tibbles, check out the additional resources I’ve listed below. Keep an eye out for my next Unpacking the Tidyverse post, stringr.


Additional Resources -


Until next time,
- Fisher

Comments