Introduction
This is the second of eight installments in my Unpacking the Tidyverse series. Each installment focuses on one of the eight core packages in Hadley Wickham’s tidyverse. Instructions given in each post are mainly derived from Hadley’s textbook, R for Data Science, and CRAN package documentation. This installment of Unpacking the Tidyverse focuses on the data-importing package, readr. The previous installment focuses on the ggplot2 package, and can be found here. The next installment focuses on the dplyr package, and can be found here.
Spending a small amount of time to properly import, parse, and export data will save countless hours of frustration later in your analysis. The Tidyverse tool built to tackle these tasks is none other than the readr package. Understanding how to properly utilize readr to increase an analysis’ reproducibility and decrease data structuring errors is a worthy goal, and the main topic of this post.
library('tidyverse')
Important Package Functions
read_delim() # Importing .csv and .tsv files
read_file() # Importing text files
read_xls() # Importing excel files (from readxl)
parse_*() # Family of parsing functions
write_delim() # Explicitly export .csv and .tsv files
write_file() # Explicitly export text files
ggsave() # Explicitly save plots (from ggplot2)
read_delim()
The two special cases of read_delim()
are read_csv()
and
read_tsv()
. These two commands are useful for the most common type of
flat data files, comma separated files and tab separated files. If
you’re using European .csv data with ;
as separators instead of
commas, use read_csv2()
.
readr functions can be used on a variety of paths, some you might not otherwise have known about:
read_csv("mtcars.csv")
read_tsv("mtcars.tsv.zip")
read_tsv("~/local/path/to/my/file/mtcars.tsv")
read_csv("https://github.com/tidyverse/readr/raw/master/inst/extdata/mtcars.csv")
As you can see, readr read_
functions can access files within the
working directory, compressed files, files in other directories, or
files from the internet.
There are over a dozen arguments that can be included in a read
function, here are the ones that I find most useful. Type ?read_delim
into the R console for the complete list of arguments.
read_csv(
"file_name.csv", # file path and name always comes first
delim=",", # single character field separator
quote = "\"", # single character to quote strings
comment = "#", # single character to signal comments
col_names = c("add", "names", "or", "T/F"), # custom name columns on import
na = ".", # string to signify missing values
skip = 0, # number of lines to skip before reading data
progress = show_progress() # display a progress bar
)
read_file()
Typically used with text files, read_file()
can also be used as a
backup read function to nearly any file type. read_file()
reads a
complete file into a single object: either a character vector of length
one, or a raw vector (read_file_raw()
). I use this function with the
single file path argument, It lacks the customization present in
read_delim()
and should be used as a last resort for uncooperative
files that aren’t .txt
.
When working with text files, I suggest looking into the tidytext
library’s unnest_tokens()
function. Read more about tidytext
here.
read_excel()
I’m cheating a little bit with including read_excel()
, as it is
actually from the readxl library. The package must be loaded separately
from the tidyverse.
library('readxl')
read_excel()
does just what you think it does! It auto detects the
format, .xls
or .xlsx
from the file extension. The function also
comes with a variety of customizable arguments, similar to
read_delim()
.
read_excel(
"path/file_name.xlsx", # path to excel file
sheet = c("sheet1","sheet3"), # name or integer position of sheets, defaults to first sheet
range = "A1:D10", # range of cells to be read, takes precedence over skip, n_max, sheet
col_names = TRUE, # true to use first row as col names
col_types = NULL, # NULL to have readr guess from spreadsheet
trim_ws = TRUE, # should leading and trailing white space be trimmed?
skip = 100, # number of columns to skip before reading
n_max = 1000 # maximum number of rows to read in
)
parse_*()
When the data you read in isn’t structured properly, it’s time to parse.
The readr library has a family of parsing functions built in to help
format your data. There are several parsing functions, including
parse_logical
, parse_factor
, parse_atomic
, parse_number
,
parse_datetime
and more. Typically I use the last two, parse_number
and parse_datetime
, so I’ll cover those in detail.
Numeric entries can be surrounded by unwanted characters, such as “$100” or “60%”. There’s also the issue of grouping characters, e.g. 1,000,000 instead of 10000000. Finally, you may work with foreign data sources that use the comma as a decimal point instead of the period, e.g. 1,00 instead of 1.00. Correcting these issues is simple with the parse function.
a <- "The price is $1993"
str(a)
## chr "The price is $1993"
a <- parse_number(a)
str(a)
## num 1993
a <- "1,99"
str(a)
## chr "1,99"
a <- parse_double(a, locale = locale(decimal_mark = ","))
str(a)
## num 1.99
a <- "$100.000.000"
str(a)
## chr "$100.000.000"
a <- parse_number(a, locale = locale(grouping_mark = "."))
str(a)
## num 1e+08
Parsing numbers is easy, but dates can be a little trickier. There are so many ways to denote a date-time, but ISO8601 is an international date-time standard that is often used. ISO8601 orders the components from largest to smallest - year, month, day and optionally a T followed by hour, minute, second. If your date-time is in this format, parsing it is a breeze.
a <- "20180228"
parse_datetime(a)
## [1] "2018-02-28 UTC"
b <- "2018-02-28T20:10:59"
parse_datetime(b)
## [1] "2018-02-28 20:10:59 UTC"
If you’re only working with dates or only times, use parse_date
or
parse_time
. The dates function expects a four digit year with a “-“ or
“/” followed by the month with a “-“ or “/” followed by the day. The
times function expects the hour value followed by a “:” followed by the
minutes value with another “:” and finally then the seconds value.
a <- "2018/02/28"
parse_date(a)
## [1] "2018-02-28"
b <- "11:11:11"
parse_time(b)
## 11:11:11
Sometimes your data doesn’t follow any of these formatting requirements, readr gives you the ability to build your own parsing formulas using these building blocks.
Year
%Y - 4 digit year number
%y - last 2 digit year number; 00-69 =
2000 - 2069, 70-99 = 1970 - 1999
Month
%m - 2 digit month number
%b - abbreviated month name, e.g.
“Jan”
%B - full month name, e.g. “January”
Day
%d - 2 digit day number
%e - option leading space
Time
%H - 0 to 23 hour
%I - 0 to 12 hour, must be used with %p
%p - AM / PM indicator
%M - minutes
%S - integer
seconds
%OS - real seconds
%Z - time zone, e.g.
America/Chicago
%z - time zone, offset from UTC, e.g. +0800
Other
%. - skips one non-digit character
%* - skips any number of
non-digits
a <- "Jan 7 2018"
parse_date(a, "%b %d %Y")
## [1] "2018-01-07"
b <- "12:45 am"
parse_time(b, "%I:%M %p")
## 00:45:00
write_delim()
When your analysis is complete and you’re ready to save a .csv or .tsv file, readr comes back into action. It’s important to explicitly save files from within R scripts to increase an analysis’ reproducibility.
write_csv(
dataframe, # The R object you want to save
"path/to/file/filename.csv", # The saved files name and path
delim = " ", # Custom delimiter
na = "NA", # Set NA values
append = FALSE # Concatenate to a file or overwrite, T/F
)
A straight forward function, write_csv( ) saves your defined data as a .csv file to the destination and name you specify as the second argument. write_csv( ) allows you to specify delimiters, missing values and more. Type ?write_csv into the R console for more arguments.
write_file()
There isn’t much explaining to be done about write_file
, it only has
four arguments to worry about, just liek read_file
. They’re the same
arguments and the function does just what you would expect; it’s
included in this post just so you know that it exists!
ggsave()
From the ggplot2 library, which is also loaded with the tidyverse,
ggsave
is the write_file
equivalent to plots. A key difference is
the order of arguments in the function. With ggsave
the filename is
defined before the plot is defined, as shown in this example.
library(‘ggplot2’)
ggsave(
filename = "path/to/file/filename.png", #
plot = last_plot(), #
scale = 1, #
width = NA, #
height = NA #
)
That covers the ins and outs of the readr package as taught in the R for Data Science book by Hadley Wickham. These basic functions are essential to getting your data science project up and running. For more tips and tricks on properly maintaining a data science project, check out my post Data Science Project Management.
If you found this summary helpful, check out the other posts on Unpacking the Tidyverse.
Additional Resources:
- CRAN
Documentation
- Github Repository
-
Other Import
Methods
Until next time,
- Fisher
Comments