Introduction
This is the sixth of eight installments of my Unpacking the Tidyverse series. Each installment focuses on one of the eight core packages in Hadley Wickham’s tidyverse. Instructions given in each post are mainly derived from Hadley’s textbook, R for Data Science, and CRAN package documentation. This installment of Unpacking the Tidyverse focuses on the string manipulation package, stringr. The previous installment focuses on the tibble package, and can be found here. The next installment focuses on the forcats package, and can be found here.
Strings are simply arrays of characters in your dataset. Strings can contain numbers, letters, special symbols; really anything you decide to stick between two quotation marks. The tidyverse’s stringr function allows you to manipulate these strings with the help of regular expressions. While I won’t be explaining regular expressions in this post, I’ll link to some useful resources to help you learn to use them. But before we do anything, we must load the ever-useful tidyverse.
library('tidyverse')
Creating Strings
Let’s quickly cover how to make strings, before we learn how to manipulate them.
string_a <- "this is how you make a string"
string_b <- 'double or single quotes work'
string_c_d <- c("what if I want quotes you ask?",
"easy! put a \ before your quotes")
strings <- c(string_a, string_b, string_c_d)
Simple Stringr Functions
All stringr functions start with str_
, so if you’ve forgot the name of
the stringr function you’re looking for, RStudio will provide you with
all the functions in its autocomplete for str_
.
The easiest stringr function is str_length()
. str_length()
measures
the number of characters in the its primary argument. There are 23
letters and 6 spaces, making a total of 29 characters in string_a
.
str_length(string_a)
## [1] 29
If you give the function a vector of strings, it will measure the length of each individual element.
str_length(strings)
## [1] 29 28 30 31
Now that we can count them, let’s concatenate them! str_c
is a
function that tacks one string onto the end of another, given a defined
separator.
str_c(string_a, string_b, sep=", ")
## [1] "this is how you make a string, double or single quotes work"
Another potentially useful function is substrings, str_sub
. Substrings
extracts and replaces substrings from a character vector, typically
taking 3 input arguments. The first is your character vector, the second
is where to start the substitution, and the third argument is where to
end the substitution.
str_sub(strings, 10) <- " ...PSYCH!"
strings
## [1] "this is h ...PSYCH!" "double or ...PSYCH!" "what if I ...PSYCH!"
## [4] "easy! put ...PSYCH!"
The final fun-and-easy function I’ll cover is str_sort
. String sort
alphabetically sorts strings in a character vector. It has several
useful options allowing for reverse alphabetical sorting, different
languages, and NA values.
str_sort(strings)
## [1] "double or ...PSYCH!" "easy! put ...PSYCH!" "this is h ...PSYCH!"
## [4] "what if I ...PSYCH!"
Regular Expression Resources
Stringr really becomes useful when you use regular expressions to help it sort through massive amounts of character vectors. RStudio’s stringr cheatsheet has an excellent section on RegEx’s on its second page. If you’re still not sure on how regular expressions work, check out this article on the stringr website.
Pattern Matching with Stringr
The vignette associated with stringr has wonderful examples on how to use stringr’s pattern matching functions; check it out for great examples of all the pattern matching functions and more.
The pattern matching functions all work in a similar way, so I won’t be giving examples for each of them. All the functions take a targeted character vector as their first argument. The regular expression that describes the pattern you’re searching for is the second argument. Things like separators and other options come last. Here’s an overview of the pattern matching functions and what they do.
Stringr Pattern Matching Functions
str_detect() - returns T/F if pattern is found
str_subset() - returns the elements that match pattern
str_count() - counts the number of pattern matches
str_locate() - returns the location of the first pattern match
str_locate_all() - returns the location of all pattern matches
str_extract() - extracts the first pattern match
str_extract_all() - extracts all pattern matches
str_replace() - replaces the first pattern match
str_replace_all() - replaces all pattern matches
str_split() - splits pattern matches into individual elements
To demonstrate two of my favorite pattern matching functions, I’ll create some data to sift through. This data comes from the stringr vignette.
# Create some strings that include phone numbers
strings <- c(
"Ralph",
"970 313 8955",
"989-344-6480",
"Work: 822-541-1234; Home: 543.355.3679"
)
# Create a regular expression that identifies phone numbers
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
String count, str_count
is a simple function that tells you how
pattern matches are in each string:
str_count(strings, phone)
## [1] 0 1 1 2
String replace all has obvious uses; it’s important to note that
str_replace_all
takes a third argument defining the replacement
string.
str_replace_all(strings, phone, "XXX-XXX-XXXX")
## [1] "Ralph"
## [2] "XXX-XXX-XXXX"
## [3] "XXX-XXX-XXXX"
## [4] "Work: XXX-XXX-XXXX; Home: XXX-XXX-XXXX"
The rest of the pattern matching functions work in a similar fashion. Sometimes stringr’s usefulness is not obvious, but that’s what makes programming fun - finding creative ways to implement what you thought was a useless function to solve a useful problem.
Whitespace Manipulation
Lastly, I’ll cover what I find to be the most useful family of functions in stringr, the three whitespace manipulation functions.
You can add white space using str_pad
, remove whitespace using
str_trim
, and manipulate existing whitespace using str_wrap
.
String pad will “pad” vectors that are shorter than your desirable
length with whitespace either side of the string. If your string is
already greater than or equal to the desired length, it remains
unaffected. str_pad
will never shorten a vector.
# create example strings
ws_example <- c("abc", "abcdefghij")
# ensure strings are 10 chr long
ws_example <- str_pad(ws_example, 10, side = "right")
ws_example
## [1] "abc " "abcdefghij"
# Now ensure the strings are 20 chr long
ws_example <- str_pad(ws_example, 20, side = "left")
ws_example
## [1] " abc " " abcdefghij"
String trim will remove the whitespace we’ve just added, returning the strings to their original characters.
# trim the whitespace returning the strings to original size
ws_example <- str_trim(ws_example, side = "both")
ws_example
## [1] "abc" "abcdefghij"
This example is directly from the stringr vignette - we’ve got one long
string telling a jabberwocky story. We want to retain the story as one
string, but break it into multiple similar-length lines to increase
accessibility. str_wrap
to the rescue.
# create jabberwocky example story
jabberwocky <- str_c(
"`Twas brillig, and the slithy toves ",
"did gyre and gimble in the wabe: ",
"All mimsy were the borogoves, ",
"and the mome raths outgrabe. "
)
# view the string wrap of jabberwocky, line width of 40 chr
cat(str_wrap(jabberwocky, width = 40))
## `Twas brillig, and the slithy toves did
## gyre and gimble in the wabe: All mimsy
## were the borogoves, and the mome raths
## outgrabe.
And that’s it for Unpacking the Tidyverse - stringr! There’s a lot more to stringr and regular expressions, but it’s probably best to learn the nitty-gritty when you need it. While strings aren’t the most glamorous, they’re prolific in many dataset and it’s useful to know how to manipulate them.
If you’d like additional information on stringr, check out the additional resources I’ve listed below. Keep an eye out for my next Unpacking the Tidyverse post, forcats.
Additional Resources
Until next time,
- Fisher
Comments