Library
library('tidyverse')
Iteration by Hand
Iteration allows you to conduct the same operation on multiple inputs
without tediously copying and pasting code. To illustrate the need for
iteration, I’ll set up a repetitive task and complete it by hand. Once
we have a baseline of tediousness, I’ll complete the same task using a
for loop. After that, we’ll move to the purrr package’s map functions
for maximum-awesome.
For our repetitive task, let’s compute a simple summary statistic on a
dataset. First we’ll create a dataframe with 4 variables: a, b, c, and
d. Each of these variables will contain 10 randomly generated numbers
from the standard normal distribution (mean 0, standard deviation 1)
using the function rnorm().
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
df
## # A tibble: 10 x 4
## a b c d
## <dbl> <dbl> <dbl> <dbl>
## 1 -0.554 -0.0567 -0.550 0.527
## 2 -0.364 0.697 -0.00842 -0.329
## 3 -0.247 0.395 0.179 -0.477
## 4 0.701 0.943 -0.760 -1.26
## 5 0.724 0.0190 -1.66 -0.0738
## 6 -0.0406 -0.427 -0.315 -0.645
## 7 1.16 -0.185 -0.931 0.529
## 8 0.498 -0.378 0.636 0.00134
## 9 -0.832 -0.594 0.355 -0.176
## 10 -0.786 0.468 0.882 -1.27
As you can see, most of the numbers fall between -1 and 1, and none comes close to -3 or 3. This is typical of the standard normal distribution, where about 68% of values land within one standard deviation of the mean and almost all land within three. To see this distribution in action, check out my previous blog post A Roll of the Dice.
Now to compute the mean of each of the variables a-d:
mean(df)
## Warning in mean.default(df): argument is not numeric or logical: returning
## NA
## [1] NA
Well that doesn’t work…
I guess we’ll have to specify each variable.
mean(df$a)
## [1] 0.02607439
mean(df$b)
## [1] 0.08807291
mean(df$c)
## [1] -0.2174574
mean(df$d)
## [1] -0.3171327
That’s better. But imagine if we had 100 variables whose means we wanted to calculate… That would be awful! Good thing we have for loops to rescue us from all of this work.
For Loops
Let’s make a for loop, and use it to calculate the mean of each of these variables.
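# Step 1: create an output vector with one slot per column
output_vector <- vector("double", ncol(df))
# Step 2: define the sequence to loop over
for (i in seq_along(df)) {
  # Step 3: the body, store the mean of column i
  output_vector[[i]] <- mean(df[[i]])
}
# display results
output_vector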
## [1] 0.02607439 0.08807291 -0.21745738 -0.31713274
We’ve successfully created a for loop to iterate through the variables! It may seem like more work at first, and maybe it is for such a simple task as calculating the mean of 4 variables. But for loops are a fundamental programming technique and can be extremely useful, even if you’re a master at purrr.
Each for loop is broken down into three steps.
The first step is to create an output vector to store your results.
It’s easy to create an empty vector of the correct size using the
vector() function. The first argument of vector() specifies what type
of vector to create. This can be any of the atomic vector types, e.g.
"integer", "logical", "character", or "double", or "list". The second
argument of vector() is the length of the vector. Clever use of ncol(),
nrow(), or length() will give you the proper length for your output
vector.
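The second step of a for loop is to define the sequence. Here we
determine what to loop over using base R’s seq_along() function, which
returns the positions 1, 2, and so on up to the length of its input.
The variable i acts as a counter that takes each of those positions in
turn.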
The third and most complicated step of creating a for loop is the body of the loop. Here we describe to R exactly what we want to do at each step as we loop through our dataframe. Sometimes we simply want to take the mean of each variable and store that number in our output vector. Other times we may want to insert a series of if/else statements, or generate graphics of the data we’re looping over. The possibilities are endless.
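To make the three steps concrete, here is a minimal sketch that applies them to a different summary, the median. The data frame scores and the vector col_medians are made-up names used only for this illustration.

```r
# Hypothetical data frame, used only for this sketch
scores <- data.frame(x = c(1, 2, 3), y = c(10, 20, 60))

# Step 1: allocate an output vector with one slot per column
col_medians <- vector("double", ncol(scores))

# Step 2: define the sequence (the column positions 1 and 2)
for (i in seq_along(scores)) {
  # Step 3: the body, compute the median of column i and store it
  col_medians[[i]] <- median(scores[[i]])
}

# Label the results with the column names
names(col_medians) <- names(scores)
col_medians
```

One reason to prefer seq_along() over writing 1:ncol(scores) is that it behaves sensibly for an empty input: seq_along() of a zero-length object is empty, so the loop body never runs, whereas 1:0 counts down through 1 and 0.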
Great, so we’ve covered iteration by hand and a simple for loop. Now let’s get to the good stuff. The reason you came here. The purrr package!
Purrr Map Function
In our quest to find the mean of each of the four variables in df, we
can use the most basic purrr function, map().
df %>%
  map(mean)
## $a
## [1] 0.02607439
##
## $b
## [1] 0.08807291
##
## $c
## [1] -0.2174574
##
## $d
## [1] -0.3171327
The function map() takes two main arguments: a target to iterate over,
and a function to apply during that iteration. map() returns a list of
values, but if you don’t want a generic list you can use the variants
of the map() function.
map_lgl() returns a logical vector.
map_int() returns an integer vector.
map_dbl() returns a double vector.
map_chr() returns a character vector.
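As a rough sketch of how the typed variants differ, the example below runs three of them over the same small list. The list words and the result names are hypothetical, chosen only for this illustration.

```r
library(purrr)

# Hypothetical named list, used only for this sketch
words <- list(a = "cat", b = "horse", c = "ox")

# nchar() returns one integer per element, so map_int() fits
lengths_int <- map_int(words, nchar)

# A comparison returns TRUE or FALSE, so map_lgl() fits
is_long <- map_lgl(words, ~ nchar(.x) > 2)

# toupper() returns one string per element, so map_chr() fits
shouted <- map_chr(words, toupper)

lengths_int
is_long
shouted
```

Each variant keeps the names of the input and returns a plain vector of the stated type, which is usually easier to work with downstream than a list.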
Here’s what map_dbl() looks like when applied to our favorite
iteration-situation.
df %>%
  map_dbl(mean)
## a b c d
## 0.02607439 0.08807291 -0.21745738 -0.31713274
That’s smooooth.
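When the function needs extra arguments, one option is purrr’s formula shorthand, where .x stands for the current element. The sketch below assumes a small made-up data frame, measurements, with one missing value.

```r
library(purrr)

# Hypothetical data with a missing value, used only for this sketch
measurements <- data.frame(
  height = c(1.5, NA, 1.7),
  weight = c(60, 70, 80)
)

# .x is the current column; na.rm = TRUE drops the NA before averaging
col_means <- map_dbl(measurements, ~ mean(.x, na.rm = TRUE))
col_means
```

Without na.rm = TRUE, the mean of the height column would be NA.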
So let’s make it more complex! How about we create our own function and
test out the map_chr() function. I want to run through a dataframe of
arbitrary size and label each element as “positive”, “negative”, or
“zero”. I want this function to be applied to every variable in the
dataframe.
First things first, let’s make the custom function!
# name the function 'classify_chr' with input = 'input'
classify_chr <- function(input) {
  # set a counter equal to 1
  i <- 1
  # create a save vector of input length
  save_vect <- vector("character", length(input))
  # main while loop, while counter is less than or equal
  # to input length, classify the elements and put that
  # character classification into the save vector.
  # Then, add one to the counter to move on to the next
  # element.
  while (i <= length(input)) {
    if (input[[i]] > 0) {
      save_vect[[i]] <- "positive"
    }
    if (input[[i]] < 0) {
      save_vect[[i]] <- "negative"
    }
    if (input[[i]] == 0) {
      save_vect[[i]] <- "zero"
    }
    i <- i + 1
  }
  # When the save vector is full of output, collapse the
  # results and separate them by a space. Finally, print
  # the resulting output vector.
  output <- str_c(save_vect, collapse = " ")
  print(output)
}
Alright. That function was relatively easy to create! Let’s test it out.
# test classify_chr function
classify_chr(-1)
## [1] "negative"
classify_chr(0)
## [1] "zero"
classify_chr(1)
## [1] "positive"
Single entries are working as expected. Let’s give the custom function a numeric vector and see what it does.
# test classify_chr on a numeric vector
a <- c(-1,0,1)
classify_chr(a)
## [1] "negative zero positive"
Looking good. How about we step it up to the final level of complexity: running through a tibble that has vectors of numbers as its variables.
# test classify_chr on a tibble
test_tib <- tibble(
  a = sample(-10:10, 10, replace = TRUE),
  b = sample(1:10, 10, replace = TRUE),
  c = 0
)
test_tib
## # A tibble: 10 x 3
## a b c
## <int> <int> <dbl>
## 1 -10 7 0
## 2 2 5 0
## 3 4 7 0
## 4 -10 5 0
## 5 -8 1 0
## 6 -10 10 0
## 7 1 7 0
## 8 5 1 0
## 9 10 5 0
## 10 0 8 0
classify(test_tib)
## Error in classify(test_tib): could not find function "classify"
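An Error! Oh no! Strictly speaking, this particular error comes from calling classify instead of classify_chr. But the correctly named call would not work either: for a tibble, input[[i]] is an entire column, and if () expects a single TRUE or FALSE. That’s exactly why we went through all of this. That’s where purrr comes in! Let’s give it a shot.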
# use map_chr to apply custom function to a tibble
test_tib %>%
map_chr(classify_chr)
## [1] "negative positive positive negative negative negative positive positive positive zero"
## [1] "positive positive positive positive positive positive positive positive positive positive"
## [1] "zero zero zero zero zero zero zero zero zero zero"
## a
## "negative positive positive negative negative negative positive positive positive zero"
## b
## "positive positive positive positive positive positive positive positive positive positive"
## c
## "zero zero zero zero zero zero zero zero zero zero"
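Alright! Not the prettiest of outputs: the first three lines come from the print() call inside classify_chr, and the named character vector below them is what map_chr() actually returns. But it does what it’s designed to do, and it’s a great illustration of when we need to use the map function.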
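For comparison, here is one possible vectorized take on the same classification. It is an alternative sketch, not the classify_chr function above: the helper classify_sign and the data frame demo are hypothetical names, and it returns one label per element instead of a single collapsed string.

```r
library(purrr)

# Hypothetical helper: label every element of a numeric vector at once
classify_sign <- function(input) {
  labels <- c("negative", "zero", "positive")
  # sign() gives -1, 0, or 1; adding 2 turns that into index 1, 2, or 3
  labels[sign(input) + 2]
}

classify_sign(c(-1, 0, 1))

# Hypothetical data frame, used only for this sketch
demo <- data.frame(a = c(-2, 0, 3), b = c(1, 1, 1))

# map() applies the helper to each column and returns a named list
labelled <- map(demo, classify_sign)
labelled
```

Because the helper returns a value instead of printing it, nothing is echoed twice when it is used inside map().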
Purrr map2 & pmap Function
When we want to map over two arguments in our function, we can use the map2 function. Say you want to generate a random number from the normal distribution with a specific mean and standard deviation.
Using the rnorm function, you could do something like this.
rnorm(10, 5, n = 1)
## [1] 16.51869
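There we’ve taken one number from the normal curve with mean (mu) 10
and standard deviation (sigma) 5. Because n is passed by name, the
unnamed 10 and 5 fill rnorm’s next two arguments, mean and sd. Now say
we want to do this 5 times each with 10 different pairs of inputs.
Instead of writing 10 separate rnorm statements, map2 can help us out.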
# create a variety of mean inputs
mu <- rep(20:24, 2)
# create a variety of standard deviation inputs
sigma <- sample(1:5, size = 10, replace = TRUE)
# check the inputs
mu
## [1] 20 21 22 23 24 20 21 22 23 24
sigma
## [1] 3 5 1 2 5 1 3 4 1 5
# apply mu and sigma inputs to the rnorm function,
# produce 5 outputs for each pair of input arguments
map2(mu, sigma, rnorm, n=5)
## [[1]]
## [1] 21.17404 22.56106 22.49158 22.15317 16.70522
##
## [[2]]
## [1] 19.08679 17.91927 21.51969 17.75422 12.80090
##
## [[3]]
## [1] 21.73921 20.89284 22.85076 22.23587 19.92050
##
## [[4]]
## [1] 25.29599 23.64171 23.27595 20.67421 21.70236
##
## [[5]]
## [1] 19.87137 15.18446 15.11524 19.59864 20.91303
##
## [[6]]
## [1] 20.39417 19.47562 18.83630 19.57695 19.72240
##
## [[7]]
## [1] 12.27347 19.89696 26.24954 18.58605 20.20357
##
## [[8]]
## [1] 14.32382 22.86421 18.62573 25.46024 21.88317
##
## [[9]]
## [1] 24.44066 23.07704 23.02711 23.51332 22.88832
##
## [[10]]
## [1] 28.77960 22.31481 33.42296 20.08290 29.53275
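Since rnorm gives different numbers on every run, the pairing that map2 does can be easier to see with a deterministic function. In this sketch, base_vals and exponents are made-up inputs, and map2_dbl() is the double-vector variant of map2().

```r
library(purrr)

# Hypothetical inputs, used only for this sketch
base_vals <- c(1, 2, 3)
exponents <- c(2, 3, 4)

# .x comes from the first vector and .y from the second,
# so this computes 1^2, 2^3, and 3^4
powers <- map2_dbl(base_vals, exponents, ~ .x^.y)
powers
```

The first element of one input is always matched with the first element of the other, the second with the second, and so on.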
It’s easy to see how you could want 3 arguments, or 4 or 5. For the
generalized case of p inputs, we use the pmap function in much the same
way we would use map2.
As a final example, we again want to use the rnorm function to draw
numbers from the normal distribution with a set mean and standard
deviation. This time, we also want to vary how many numbers the
function returns by changing the n argument in rnorm. Three varied
arguments is a job for pmap. Because the list below is unnamed, pmap
matches its elements to rnorm’s arguments by position: n, mean, then
sd.
outs <- c(10,15,20)
mu <- c(5,6,7)
sigma <- c(1,2,3)
arguments <- list(outs, mu, sigma)
arguments %>%
pmap(rnorm)
## [[1]]
## [1] 5.752082 4.529751 5.569735 3.719392 2.931446 5.323930 3.992918
## [8] 3.253352 4.440078 6.312741
##
## [[2]]
## [1] 6.363228 4.626116 7.005189 7.429690 9.127814 6.009882 4.589295
## [8] 6.587369 6.612209 3.677911 6.607614 8.443922 5.242299 5.776972
## [15] 4.124226
##
## [[3]]
## [1] 5.1630993 9.4077378 0.4833954 6.3163107 3.2688190 9.4852517
## [7] 2.4587297 9.8187866 3.0978455 7.3941717 2.1337123 7.4064330
## [13] 7.5908101 6.4423399 9.2909349 8.8124667 12.9292212 8.6884592
## [19] 8.1997732 6.3380038
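A variation worth considering is to name the list elements after rnorm’s arguments, so pmap matches them by name and not by position. This is a sketch under that assumption: args_named and draws are hypothetical names, and the seed is arbitrary.

```r
library(purrr)

set.seed(42)  # arbitrary seed so the sketch is repeatable

# Names match rnorm()'s arguments: n, mean, sd
args_named <- list(
  n    = c(10, 15, 20),
  mean = c(5, 6, 7),
  sd   = c(1, 2, 3)
)

draws <- pmap(args_named, rnorm)

# Check how many numbers each call produced
lengths(draws)
```

With names in place, the order of the list elements no longer matters, which guards against accidentally swapping the mean and the standard deviation.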
That’s all for now!
- Fisher