Unpacking the Tidyverse - ggplot2


Introduction


This is the first of eight installments of my Unpacking the Tidyverse series. Each installment focuses on one of the eight core packages in Hadley Wickham’s tidyverse. Instructions given in each post are mainly derived from Hadley’s textbook, R for Data Science, and CRAN package documentation. This installment of Unpacking the Tidyverse focuses on the graphing package, ggplot2. The next installment focuses on the readr package and can be found here.

The ggplot2 package is a powerful data visualization tool based on the grammar of graphics: an idea that any plot can be described using seven distinct parameters. In this blog post, I run through these parameters attempting summarize them in an easy-to-access resource for some of the most common tasks in data visualization.


library('ggplot2')       
library('RColorBrewer')  
library('viridisLite')  


Template


ggplot( ***dataset*** ) +            
geom_***geometry***(              
  mapping = aes( ***variables*** ),    
  stat = ***statistical transformation***,          
  position = ***position adjustment***        
) + 
coord_***coordinate system****() +          
facet_***facet declaration***()              


Not all parameters must be defined for every ggplot call, but by defining the dataset, plot geometry, variables, statistical transformations, position adjustments, coordinate system and facets, you can create almost any plot with ggplot2. Of course, there are additional functions that affect the aesthetics of the plot, such as annotations, titles, axes, and legends. Let’s briefly cover all seven of these parameters, discussing the commands that I most commonly find useful for each of them.


Data


ggplot(mpg)


Just calling ggplot() with an associated dataset will create a canvas for your graphic, not an actual plot. This is because R doesn’t know what variables you want plotted from that dataset, or how you want them plotted. To create a basic plot, you’ll have to provide information on the plot’s geometry.


Geometries


ggplot(mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = cyl))


Plot geometries are the most important parameter, and the only parameter other than ggplot() that must be defined in every single visualization. There are hundreds of geometries available for your use, courtesy of the R community; check out http://www.ggplot2-exts.org/ for more information. Here’s a list of the handful of geometries, mappings, and aesthetics that I use most often.

Geometries
  • geom_density( ) - 1 variable Gaussian distribution
  • geom_point( ) - 2, 3, or 4 variable scatterplot
  • geom_smooth( ) - 2 or 3 variable line plot
  • geom_bar(stat=”identity”) - 2 or 3 variable bar plot
  • geom_hex( ) - 2 variable distribution
  • geom_tile( ) - 3 variable tile plot
  • geom_map( ) - 2, 3, or 4 variable geospatial
Mappings
  • aes( x = ) - x-axis location
  • aes( y = ) - y-axis location
  • aes( …, color = ) - color dependent on variable’s value
  • aes( …, linetype = ) - line type dependent on variable’s value
  • aes( …, size = ) - size of object dependent on variable’s value
Aestetics
  • size = size of a point, declared by an integer
  • linetype = 0 (blank), 1 (solid), 1 (dashed), 3 (dotted), 4 (dot-dash), 5 (long-dash), 6 (two-dash)
  • weight = (line thickness) integer
  • color = (outline) string
  • fill = (inside) string
  • alpha = (transparency) double from 0 to 1


Statistical Transformation


Statistical transformations are most useful with geom_bar. Each geometry has an associated transformation, and most of the time this transformation is useful and should not be changed. The default transformation associated with geom_bar is "count", or the total number of occurrences of each variable. Sometimes you want a bar chart to visualize something else, like the unique occurrences of each variable, or a different variable all together. These three transformations allow you control over bar charts to visualize your data in entirely new ways.

  • geom_bar(aes(…), stat = “count”) - visualize the number of entries in a variable, n()
  • geom_bar(aes(…), stat = “identity”) - visualize the variable, not the count
  • geom_bar(aes(…), stat = “unique”) - visualize only unique components of the variable


Position Adjustment


When there are multiple data points occupying the same space or are stacked too closely together, position adjustments allow for a clearer visualization. Most often used with bar charts and scatter plots, these are the three position adjustments I find to be the most useful.

  • geom_point(aes(…), position = “jitter”) - adds a small amount of noise to better view overlapping points
  • geom_bar(aes(…), position = “dodge”) - arranges bar elements side-by-side
  • geom_bar(aes(…), position = “fill”) - arranges bar elements on top of each other, normalizing height


Coordinate Systems


ggplot(mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  coord_flip()  


Cartesian coordinates are the go-to most visualizations, but the coord_ function does more than call different coordinate systems. Often, it’s useful to fix the x-y ratio, or flip the axis of bar charts. When you’re working with exponential data, you may want to transform an axis to a logarithmic system. These useful functions and many more can be done by the coord_ function.

  • coord_fixed(ratio, xlim, ylim) - fixing the aspect ratio between x and y
  • coord_flip(xlim, wlim) - flipping the axes
  • coord_polar(theta, start, direction) - converting Cartesian to polar coordinates
  • coord_trans(xtrans, ytrans, limx, limy) - transform Cartesian coordinates, use with log functions
  • coord_map(projection, orientation, xlim, ylim) - mapproj package projections


Facets


ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)


I don’t use facets that often, but if you’re a fan multiple miniature plots, you might find these faceting options useful.

  • facet_grid(x ~ y, labeller, scales) - facets the display into rows and columns based on the two given variables (“.”” placeholder)
  • facet_wrap(~ x, labeller, scales, nrow, ncol) - wraps facets into a rectangular layout


Labels


ggplot(mpg) + 
  labs(
    title = "title", 
    subtitle = "subtitle",
    caption = "caption",
    xlab = "x-label", 
    ylab = "y-label"
  )

Labels are helpful for properly communicating your visualization. Including these five labels with your plots can help clarify your topic, variables, and sources.


Annotations


Annotations can be helpful to point out specific properties of a visualization. I think annotating multiple individual observations is a mistake, it clutters the graphic and takes away from the effectiveness of the visualization. Packages like plotly and shiny do a better job of creating interactive graphics if you want individual labels. With that in mind, here are some annotations that are easy to use in ggplot2.

  • geom_text(aes(label = “text here”), vjust = , hjust = ) - places a summary annotation within the graphic, according to vjust and hjust
  • geom_hline(yintercept = , size = , color = , linetype) - creates a horizontal reference line through the plot according to yintercept
  • geom_vline(xintercept = , size =, color =, linetype =) - creates a vertical reference line through the plot according to xintercept
  • geom_rect(aes(xmin = , xmax = , ymin = , ymax = )) - creates a rectangle to box in and highlight interesting data
  • geom_segment(aes(x = , y = , xend = , yend = ,) arrow = ) - creates a line segment within the plot, it can be an arrow


Scales


Scales are automatically set when you create a ggplot, but sometimes it’s useful to alter the color scheme, legend, or axes. The naming scheme for these functions is scale_ followed by the name of the aesthetic then another _, and then the name of the scale.

Aestetic Names
  • _x_
  • _y_
  • _alpha_
  • _color_
  • _fill_
  • _linetype_
  • _shape_
  • _size_
Scale Names
  • _discrete - maps discrete variables to visual values
  • _continous - maps continuous variables to visual values
  • _identity - uses data values as visual values
  • _manual - values = c( ), maps discrete values to manually chose visual values
  • _gradient - creates a two-color gradient, low - high
  • _gradient2 - creates a diverging two color gradients, positive - negative
  • _gray -creates a gradient in gray
  • _brewer - brewer( palette = “ “), calls RColorBrewer palettes


Axes

Scales can also be used to alter axes in ggplot2, these are some commented examples for fixing common behavioral issues you may have with your axes.


scale_x_continuous(breaks = seq(0, 10, by = 1), labels = c(1:9, "ten")) 
# define breaks by sequence, define labels of breaks

scale_x_discrete(breaks=c("ctrl", "trt1", "trt2"),            # defined tick marks
                 labels=c("Control", "Treat 1", "Treat 2"))   # new names
                 
expand_limits(x = c(0,8), y = c(0,8))                         # expand the limits of the graph to visualize specific values

scale_y_reverse()     # Reverse y-axis direction (zero on top)

element_blank()       # Hiding axis labels 

theme(axis.title.x = element_text(face="bold", color="red", size=20),  # change axis labels, font and style
      axis.text.x = element_text(angle=90, vjust=0.5, size=16))        # change tick mark label, font, style

theme(panel.grid.major=element_blank(),  # hiding major gridlines
      panel.grid.minor=element_blank())  # hiding minor gridlines

library(scales)   # changing the y-axis to log 10
scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x)))


Legends


Legends can be a bit harder to wrangle than axes. Here are examples to several common problems I have with legends.

theme(legend.position = "left")    # legend to the left of plot
theme(legend.position = "top")     # legend above plot
theme(legend.position = "bottom")  # legend below plot
theme(legend.position = "right")   # the default
theme(legend.position = "none")    # no legend will be generated

geom_point(aes(...), show.legend=FALSE) # don't include this geom in the legend    

theme(legend.title = element_blank()) # hides the legends title

# changing the order of items in your legend, scale_ function is dependent upon the type of plot
scale_fill_discrete(breaks=(c("item1", "item2", "item3")))

# simply reverse the current item display in the legend
scale_fill_discrete(guide = guide_legend(reverse=TRUE))

# manually alter color, break, label, and name of legend items.
scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"), 
                  name="Experimental\nCondition",
                  breaks=c("ctrl", "trt1", "trt2"),
                  labels=c("Control", "Treatment 1", "Treatment 2"))


Colors


Some of my favorite preset colors include

  • thistle
  • slateblue2
  • orchid
  • midnightblue
  • deepskyblue
  • lightslateblue

The RColorBrewer package also has some great color palettes.

RColorBrewer::display.brewer.all()


The viridisLite Package has contains 5 color scales that are color-blind and gray-scale friendly.


Themes


Sometimes you don’t want to go through the effort of creating your own theme, so ggplot2 has some built in. Here are four basic themes that are excellent for exploratory analysis or creating a solid foundation to build off.

  • theme_bw() - A while background with gridlines
  • theme_grey() - the default theme
  • theme_classic() - A white background with no gridlines
  • theme_minimal() - mystery theme, check it out!


Final Thoughts


This is just a small taste of what ggplot2 can do, but with this information you should be able to solve 80% of your graphing needs. Again, the references listed at the beginning of this write up are excellent and worth checking out. If you want to learn more about R for Data Science written by Hadley Wickham, it’s free to read at http://r4ds.had.co.nz/. If you don’t want to read through the entire book, check out the other posts in my R for Data Science Summary Series.

For more useful resources, including advanced ggplot2 techniques, check out these links:


Thanks for reading,
- Fisher

Comments