The Forcats Package


Categorical data can be tricky to work with. In R, a categorical variable is usually stored as a factor, and the values it can take are called its levels. Factors are a way to organize categorical variables logically. For example, if you attempt to sort a list of months, they’ll be sorted alphabetically (April, August, December, …). But wouldn’t it make more sense if they were sorted chronologically? Forcats aims to help you solve problems like this quickly and efficiently.
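
To make the month problem concrete, here is a minimal base R sketch. The four months are made-up sample values; month.name is a constant that ships with base R.

```r
# Month names as plain strings: sort() orders them alphabetically
months_chr <- c("March", "January", "December", "April")
sort(months_chr)

# The same values as a factor whose levels follow the calendar
months_fct <- factor(months_chr, levels = month.name)

# Sorting a factor follows the level order, so this comes out chronologically
sort(months_fct)
```

The factor version sorts as January, March, April, December because sort() uses the position of each value in the levels, not its spelling.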

Forcats is included in the 8 core tidyverse packages, so we can simply load the tidyverse library.

library('tidyverse')


Load Data

Let’s get some categorical data to work with. After a quick search, I’ve found a satisfactory dataset on the website of the University of California, Irvine’s Department of Information and Computer Science. The dataset we’ll be working with in this post can be found here.


# load data from the url in its .csv form, ensure the first row of data isn't used as column names
# (read_csv already returns a tibble, so the deprecated as_data_frame() wrapper isn't needed)
income_predict_data <- read_csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"), col_names = FALSE)

# name columns as described in the dataset's information page
colnames(income_predict_data) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "capital_gains", "capital_loss", "hours_per_week", "native_country", "income_prediction")

# save the dataset just in case you go offline, or the source is removed
write_csv(income_predict_data, "r4ds_tidyverse/tidyverse_packages/forcats_data.csv")

## Parsed with column specification:
## cols(
##   age = col_integer(),
##   workclass = col_character(),
##   fnlwgt = col_integer(),
##   education = col_character(),
##   education_num = col_integer(),
##   marital_status = col_character(),
##   occupation = col_character(),
##   relationship = col_character(),
##   race = col_character(),
##   sex = col_character(),
##   capital_gains = col_integer(),
##   capital_loss = col_integer(),
##   hours_per_week = col_integer(),
##   native_country = col_character(),
##   income_prediction = col_character()
## )


# view dataset
income_predict_data

## # A tibble: 32,561 x 15
##      age workc… fnlwgt educa… educa… marita… occu… rela… race  sex   capi…
##    <int> <chr>   <int> <chr>   <int> <chr>   <chr> <chr> <chr> <chr> <int>
##  1    39 State…  77516 Bache…     13 Never-… Adm-… Not-… White Male   2174
##  2    50 Self-…  83311 Bache…     13 Marrie… Exec… Husb… White Male      0
##  3    38 Priva… 215646 HS-gr…      9 Divorc… Hand… Not-… White Male      0
##  4    53 Priva… 234721 11th        7 Marrie… Hand… Husb… Black Male      0
##  5    28 Priva… 338409 Bache…     13 Marrie… Prof… Wife  Black Fema…     0
##  6    37 Priva… 284582 Maste…     14 Marrie… Exec… Wife  White Fema…     0
##  7    49 Priva… 160187 9th         5 Marrie… Othe… Not-… Black Fema…     0
##  8    52 Self-… 209642 HS-gr…      9 Marrie… Exec… Husb… White Male      0
##  9    31 Priva…  45781 Maste…     14 Never-… Prof… Not-… White Fema… 14084
## 10    42 Priva… 159449 Bache…     13 Marrie… Exec… Husb… White Male   5178
## # ... with 32,551 more rows, and 4 more variables: capital_loss <int>,
## #   hours_per_week <int>, native_country <chr>, income_prediction <chr>


The dataset is a collection of over 32,000 observations and 15 variables. It contains census data from the 1990s and is part of a study that attempts to guess an individual’s income (>$50,000 or ≤$50,000) based on the census data. We’re only going to use a few variables (education, race, hours worked per week) to demonstrate the capabilities of various forcats functions.


Forcats

Now that we’ve loaded forcats and the dataset, let’s have a closer look at the dataset. It’s hard to get a grip on variables and their possible values just by calling the entire dataset, so let’s do a count of a specific variable. The function count() is from the dplyr package, which is automatically loaded as a part of the tidyverse. My post on dplyr can be found here.


# call the dataset 'then'
income_predict_data %>%
# count number of occurrences of each element in the race variable
  count(race)

## # A tibble: 5 x 2
##   race                   n
##   <chr>              <int>
## 1 Amer-Indian-Eskimo   311
## 2 Asian-Pac-Islander  1039
## 3 Black               3124
## 4 Other                271
## 5 White              27816


The dataset shows 5 categories of possible values for the race variable. Let’s visualize a different variable’s categories using ggplot2, which is another core tidyverse package. (tutorial here)


# call the dataset
income_predict_data %>%
# plot the dataset's variable 'education'
  ggplot(aes(education)) + 
# use a bar chart
  geom_bar() +
# adjust the theme and labels
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  xlab("Highest Level of Education") + 
  ylab("Number of Entries") + 
  ggtitle("Education Variable Breakdown")


There are 16 different categories in this dataset’s education variable. Because the categories in this variable can be ordered logically, it’s a great candidate to test out our forcats functions on.

First, I’ll show you how to turn this character variable into a factor variable using base R. It’s important to some forcats functions that your variable is a factor and not a character vector.


# view possible categories for education
unique(income_predict_data$education)

##  [1] "Bachelors"    "HS-grad"      "11th"         "Masters"     
##  [5] "9th"          "Some-college" "Assoc-acdm"   "Assoc-voc"   
##  [9] "7th-8th"      "Doctorate"    "Prof-school"  "5th-6th"     
## [13] "10th"         "1st-4th"      "Preschool"    "12th"

# order categories logically by hand
education_levels <- c("Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th", "HS-grad", "Some-college", "Assoc-acdm", "Assoc-voc", "Bachelors", "Masters", "Prof-school", "Doctorate")

# use base R factor function with defined levels to overwrite education variable
income_predict_data$education <- factor(income_predict_data$education, levels=education_levels)

# plot new education variable with ordered factors
income_predict_data %>%
  ggplot(aes(education)) + 
  geom_bar() + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  xlab("Highest Level of Education") + 
  ylab("Number of Entries") + 
  ggtitle("Ordered Education Variable Breakdown")


That makes more sense.
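
One thing to watch when you type levels out by hand: base R's factor() silently turns any value that doesn't match a level into NA. Here's a small sketch with a made-up vector (the misspelled "Hs-grad" is deliberate) showing how you might check for that.

```r
# Hypothetical values; the third one is intentionally misspelled
x <- c("HS-grad", "Masters", "Hs-grad")

# Build a factor with hand-written levels
f <- factor(x, levels = c("HS-grad", "Masters"))

is.factor(f)     # TRUE: f is now a factor, not a character vector
levels(f)        # the levels, in the order we supplied them
sum(is.na(f))    # 1: the misspelled value matched no level and became NA
```

Running something like sum(is.na(...)) on the real education column right after the factor() call is a cheap way to confirm that every category made it into your level list.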

Now let’s put the factors to use by visualizing the average hours worked per week, as broken down by our factor variable, education.


# copy original dataset into new dataset, 'then'
education_hours_summary <- income_predict_data %>%
# group new dataset by education factored variable, 'then'
  group_by(education) %>%
# use dplyr's summarize function to...
  summarize(
# calculate the average hours worked (grouped by education factor)
    avg_hours_worked = mean(hours_per_week, na.rm = TRUE)
  )

# plot average hours worked per week as grouped by education
ggplot(education_hours_summary) + 
  geom_point(aes(avg_hours_worked, education)) + 
  theme_minimal() + 
  xlab("Average Hours Worked Per Week") + 
  ylab("Highest Level of Education") + 
  ggtitle("Average Hours Worked per Week by Ordered Education")


Interesting plot, but let’s start using forcats functions to manipulate the data. Say I want to re-order the education factor according to the newly calculated average hours worked per week. I’ll use the forcats function fct_reorder to accomplish this.


# take the previously created education_hours_summary dataset, 'then'
education_hours_summary %>%
# reorder the factor, education, by avg_hours_worked, 'then'
  mutate(education = fct_reorder(education, avg_hours_worked)) %>%
# plot the newly ordered factor and avg_hours_worked
  ggplot(aes(avg_hours_worked, education)) + 
  geom_point() + 
  theme_minimal() + 
  xlab("Average Hours Worked Per Week") + 
  ylab("Highest Level of Education") + 
  ggtitle("Ordered Average Hours Worked per Week by Education")


Well that was easy. Now we can clearly see that those who have only completed schooling until the 11th grade work the fewest hours each week, and professional school graduates work the most hours per week, on average.
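
If you want to see what fct_reorder does without a plot, here's a tiny sketch on toy data (the group names and values are made up, not taken from the census dataset). It only changes the order of the levels; the values themselves stay put.

```r
library(forcats)

# Toy data: three groups and one number per group
grp <- factor(c("a", "b", "c"))
val <- c(30, 10, 20)

levels(grp)                          # default order is alphabetical

# Reorder the levels of grp by the matching values in val (ascending)
reordered <- fct_reorder(grp, val)
levels(reordered)                    # "b" "c" "a"

# .desc = TRUE flips the direction
levels(fct_reorder(grp, val, .desc = TRUE))   # "a" "c" "b"
```

When a level has several values, fct_reorder summarizes them with the median by default (its .fun argument). In the summary table above each level has exactly one value, so that default doesn't matter there.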

Let’s use another forcats function, fct_infreq, to reorder the levels of the education factor according to their frequency of occurrence in the dataset. We’ll also throw in the forcats function fct_rev to reverse the levels so they’re in ascending order of occurrence.


# take the original dataset, 'then'
income_predict_data %>%
# change the education variable so that it's ordered by reverse frequency, 'then'
  mutate(education = education %>% fct_infreq() %>% fct_rev()) %>%
# plot the new education variable
  ggplot(aes(education)) + 
  geom_bar() + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  xlab("Highest Level of Education") + 
  ylab("Number of Entries") + 
  ggtitle("Ordered Education Variable Breakdown")


The most common level of completed education is clearly high school graduate, and the least common is preschool, meaning those who never attended grade school.
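
The same two functions on a toy factor (made-up values) make the level ordering easy to check by eye:

```r
library(forcats)

# Toy factor: "z" appears three times, "y" twice, "x" once
f <- factor(c("x", "y", "y", "z", "z", "z"))

levels(f)                        # "x" "y" "z" (alphabetical)
levels(fct_infreq(f))            # "z" "y" "x" (most frequent first)
levels(fct_rev(fct_infreq(f)))   # "x" "y" "z" (least frequent first)
```

fct_infreq puts the most frequent level first, which is why the pipeline above adds fct_rev to get bars that grow from left to right.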


But what if I dislike the names of my factor levels? Do I have to go in and change them by hand? No. Forcats has a function for that: fct_recode.

# take the initial dataset, 'then'
income_predict_data %>%
# change the education variable, by recoding the education variable
  mutate(education = fct_recode(education, 
# the new category "Associates" is made up of what used to be "Assoc-acdm"
         "Associates" = "Assoc-acdm",
# the new category "Vocational" is made up of what used to be "Assoc-voc"
         "Vocational" = "Assoc-voc",
# the new category "Professional" is made up of what used to be "Prof-school"
         "Professional" = "Prof-school"
# complete this function, 'then'
         )) %>%
# count the newly altered variable
  count(education)

## # A tibble: 16 x 2
##    education        n
##    <fctr>       <int>
##  1 Preschool       51
##  2 1st-4th        168
##  3 5th-6th        333
##  4 7th-8th        646
##  5 9th            514
##  6 10th           933
##  7 11th          1175
##  8 12th           433
##  9 HS-grad      10501
## 10 Some-college  7291
## 11 Associates    1067
## 12 Vocational    1382
## 13 Bachelors     5355
## 14 Masters       1723
## 15 Professional   576
## 16 Doctorate      413


What if these categories are too fine-grained, and I want to lump some of them together? The function fct_lump (forcats) has you covered.


# lump the smallest groups together to other, num big groups = 5

# take the original dataset, 'then'
income_predict_data %>%
# change the education variable by lumping the education variable into 5 largest categories + 'other', 'then'
  mutate(education = fct_lump(education, n = 5)) %>%
# count the new variable
  count(education)

## # A tibble: 6 x 2
##   education        n
##   <fctr>       <int>
## 1 HS-grad      10501
## 2 Some-college  7291
## 3 Assoc-voc     1382
## 4 Bachelors     5355
## 5 Masters       1723
## 6 Other         6309
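
A side note: newer forcats releases (0.5.0 and later, as far as I know) split fct_lump into more specific helpers, and fct_lump_n is the one that matches the n = 5 call above. Here's a hedged sketch on a made-up factor, assuming you have one of those versions installed.

```r
library(forcats)

# Toy factor: "a" x3, "b" x2, "c" x1, "d" x1
f <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# Keep the 2 most frequent levels, lump the rest into "Other"
lumped <- fct_lump_n(f, n = 2)
table(lumped)

# The name of the catch-all level can be changed with other_level
levels(fct_lump_n(f, n = 2, other_level = "rare"))
```

Here "c" and "d" end up together in the catch-all level, which holds two observations.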


“But lumping the factor doesn’t give me enough control!” - You.

“How about fct_collapse then?” - Me.


# take the initial dataset, 'then'
income_predict_data %>%
# change the education variable using factor collapse on the education variable
  mutate(education = fct_collapse(education,
# create a new level, advanced_degree, made up of five old levels collapsed together
        advanced_degree = c("Assoc-acdm","Bachelors", "Masters",
                            "Doctorate", "Prof-school")
# 'then'
  )) %>%
# plot the new education variable 
  ggplot(aes(education)) + 
  geom_bar() + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  xlab("Highest Level of Education") + 
  ylab("Number of Entries") + 
  ggtitle("Lumped Education Variable Breakdown")


Now you might be saying, “But Fisher, that was too easy! I want a harder, more verbose way to lump factors together!” To that, I present you again with fct_recode.


# take the initial dataset, 'then'
income_predict_data %>%
# change the education variable by recoding the factors of the education variable
  mutate(education = fct_recode(education,
# make a new level, "drop_out", populate it with the old "Preschool" level
      "drop_out" = "Preschool",
# repeat ad nauseam
      "drop_out" = "1st-4th",
      "drop_out" = "5th-6th",
      "drop_out" = "7th-8th", 
      "drop_out" = "9th",
      "drop_out" = "10th", 
      "drop_out" = "11th",
      "drop_out" = "12th")
# THEN
      ) %>%
# plot the new education variable
  ggplot(aes(education)) + 
  geom_bar() + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  xlab("Highest Level of Education") + 
  ylab("Number of Entries") + 
  ggtitle("Lumped Education Variable Breakdown")
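
To convince yourself that the verbose fct_recode route and fct_collapse land in the same place, here's a small sketch on a made-up four-value factor (the values borrow names from the education variable, but the data is hypothetical).

```r
library(forcats)

# Toy factor with levels in a hand-picked order
lvls <- c("9th", "10th", "HS-grad", "Masters")
f <- factor(lvls, levels = lvls)

# Verbose route: repeat the new name once per old level
via_recode <- fct_recode(f, "drop_out" = "9th", "drop_out" = "10th")

# Compact route: one new name, a vector of old levels
via_collapse <- fct_collapse(f, drop_out = c("9th", "10th"))

levels(via_recode)     # "drop_out" "HS-grad" "Masters"
levels(via_collapse)   # "drop_out" "HS-grad" "Masters"
```

Both calls merge the two old levels into a single drop_out level and leave the others alone, so which one you reach for is mostly a matter of how much typing you enjoy.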


And that does it for the fun and friendly forcats package. See, categorical data analysis can be fun! Next time I’ll be finishing up my Unpacking the Tidyverse series with the purrr package. It should be a good one!

Here are some additional resources on Forcats:
- Forcats Github Repo
- Forcats Documentation
- Factor Chapter of R4DS


Until next time,

- Fisher


