Efficient Visualization of Summary Statistics in ggplot2: Tracking Changes in Milk Protein Content

In this article, I’ll demonstrate how to rapidly create visual summaries of statistics in ggplot2. The demo graphic shows how the protein content in cows’ milk changes in the weeks following calving. The cattle are grouped according to whether they are fed a diet of barley alone, lupins alone, or both barley and lupins. The dataset used is built into the nlme package (popular for Nonlinear and Linear Mixed Effects models).

Major techniques explained in this article include:

Use Google Fonts.
Rapid generation of visual statistic summaries with stat_summary().
Display graphic elements beyond the panel boundary.
Annotate a single faceted panel (subplot).

Packages and data cleanup

library(ggplot2)
library(dplyr)
library(nlme) # load the Milk dataset

# set up the default theme
theme_set(theme_classic(
  base_size = 14, base_family = "Abril"))
Milk <- Milk %>% as_tibble() # convert to tibble format
head(Milk, 3)

Output:
# A tibble: 3 × 4
  protein  Time Cow   Diet
    <dbl> <dbl> <ord> <fct>
1    3.63     1 B01   barley
2    3.57     2 B01   barley
3    3.47     3 B01   barley

Visualization

Load the font “Abril Fatface” from the Google Fonts repository, and turn on showtext rendering so that the font family set in the global theme is used when the plot is drawn.

library(showtext)
font_add_google(name = "Abril Fatface", family = "Abril")
showtext_auto() # current name of the deprecated showtext.auto()
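Downloading a Google font needs an internet connection, so the step above fails when working offline. The following is a minimal sketch of one way to guard against that, assuming a fallback to the built-in "serif" family is acceptable. The variable name base_family is hypothetical and not part of the original workflow.

```r
# Try to register the Google font; if anything fails (no network,
# package not installed), fall back to the built-in "serif" family.
base_family <- tryCatch({
  sysfonts::font_add_google(name = "Abril Fatface", family = "Abril")
  showtext::showtext_auto()
  "Abril"
}, error = function(e) "serif")

base_family
```

The result could then be passed to theme_classic(base_family = base_family) in place of the hard-coded family name.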

Create a line plot, with each line representing a cow. The plot is faceted based on the variable Diet. As the legend information is already displayed in the titles of the faceted panels (subplots), we remove the legend at this step. The fill aesthetic so far is not useful, but will be utilized for the ribbons added in the following steps.

p1 <- Milk %>% 
  ggplot(aes(x = Time, y = protein, 
             color = Diet, 
             fill = Diet)) + # for ribbon layers added later
  geom_line(aes(group = Cow), alpha = .3) +
  facet_wrap(~Diet) +
  theme(legend.position = "none")
p1

Create a ribbon representing one standard deviation (SD) above and below the mean. Because more than one statistic is calculated (the mean, the upper SD limit, and the lower SD limit), we use the fun.data argument, in contrast to fun, which handles a single statistic and is used to create plot p4 at a later step. fun.args = list(...) passes additional arguments to the mean_sdl function; mult = 1 requests one SD above and below the mean (the default, mult = 2, gives two SDs on each side). In like manner, the standard error of the mean can be calculated using the function mean_se. Note that mean_sdl, like mean_cl_normal in the next step, is a wrapper around a function from the Hmisc package, which must be installed.

p2 <- p1 + 
  stat_summary(fun.data = mean_sdl, 
               fun.args = list(mult = 1),
               geom = "ribbon", alpha = .3, 
               color = NA) # do not show the ribbon outline
p2
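To make clear what a fun.data function has to return, here is a minimal base R sketch of a summary function that computes the same three values as the SD ribbon. The name mean_sd_manual is hypothetical; it is an illustration, not a function from ggplot2.

```r
# Hypothetical helper: returns a one-row data frame with the columns
# y, ymin and ymax, which is the shape stat_summary(fun.data = ...) expects.
mean_sd_manual <- function(x, mult = 1) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  data.frame(y = m, ymin = m - mult * s, ymax = m + mult * s)
}

# quick check on three made-up protein values
res <- mean_sd_manual(c(3.2, 3.4, 3.6))
res
```

Under these assumptions, stat_summary(fun.data = mean_sd_manual, geom = "ribbon") should draw the same band using base R only.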

Create a ribbon depicting 95% normal (Gaussian) confidence limits (based on the t-distribution). The syntax is similar to the calculation of standard deviation.

# 95% confidence interval
p3 <- p2 +
  stat_summary(fun.data = mean_cl_normal, 
               fun.args = list(conf.int = .95), # confidence level
               geom = "ribbon", color = NA)
p3
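The t-based confidence limits can also be worked out by hand, which helps show what the ribbon represents: the mean plus or minus a t quantile times the standard error. The sketch below uses base R only; mean_ci_manual is a hypothetical name introduced for illustration.

```r
# Hypothetical helper: t-based confidence limits around the mean,
# returned in the y / ymin / ymax layout used by fun.data.
mean_ci_manual <- function(x, conf.int = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  m <- mean(x)
  # half-width = t quantile (n - 1 degrees of freedom) * standard error
  half <- qt((1 + conf.int) / 2, df = n - 1) * sd(x) / sqrt(n)
  data.frame(y = m, ymin = m - half, ymax = m + half)
}

ci <- mean_ci_manual(c(3.2, 3.4, 3.6))
ci
```

Because the half-width shrinks with the square root of the sample size, this band is narrower than the SD ribbon whenever a time point has more than a handful of cows.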

Create a central line representing the mean. As a single statistic, the mean, is calculated, we use the argument fun, instead of fun.data as used in the above two steps.

p4 <- p3 + 
  stat_summary(fun = mean, geom = "line", 
               linewidth = .7, color = "black")
p4
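Besides fun and fun.data, stat_summary() also accepts fun.min and fun.max, each taking a function that returns a single number. As a small sketch on made-up data (the data frame toy is hypothetical), the median with the full range can be summarized like this:

```r
library(ggplot2)

# made-up data: three observations at each of two x positions
toy <- data.frame(
  x = c(1, 1, 1, 2, 2, 2),
  y = c(1, 2, 6, 2, 4, 9))

# one single-value function per statistic:
# fun -> y, fun.min -> ymin, fun.max -> ymax
p_range <- ggplot(toy, aes(x, y)) +
  stat_summary(fun = median, fun.min = min, fun.max = max)

# inspect the values computed by the summary layer
summary_values <- layer_data(p_range, 1)
summary_values[, c("x", "y", "ymin", "ymax")]
```

This form is convenient when the three statistics come from separate simple functions, while fun.data suits a single function that returns all of them at once.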

Polish up. Two edits here are particularly important to prepare the plot for annotation in the right margin of the plot (at the later step of creating plot p6):

In coord_cartesian, we use clip = "off" to display graphical elements that extend beyond the panel range (otherwise they would be clipped off by default).
Use plot.margin (at the end of the theme call) to add extra margin on the right side of the plot. (If there is text in the space sandwiched between panels, as in this example, it may also be necessary to adjust the margin between panels.)

p5 <- p4 + 
  # coordinate
  coord_cartesian(
    ylim = c(2.5, 4.5), expand = FALSE, clip = "off") +
  # x-axis breaks
  scale_x_continuous(breaks = seq(0, 16, 4)) +
  # axis titles
  labs(x = "weeks", y = "content (%)", 
       title = "Protein content in milk after calving") +
  # theme
  theme(
    strip.background = element_blank(),
    strip.text = element_text(vjust = -2), # bring the facet title downward
    plot.title = element_text(hjust = .5), # center-justify the plot title
    legend.position = "none",
    # add plot margin, in particular to the right side of the plot
    # use t - r - b - l for top - right - bottom - left
    plot.margin = margin(r = 50, l = 10, t = 10, b = 10))
p5
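The two edits work as a pair, which is easier to see in isolation. The following minimal sketch uses made-up data (toy and the label text are hypothetical) to place a label just outside the panel: turning clipping off lets it be drawn, and the widened right margin gives it room.

```r
library(ggplot2)

# made-up data for illustration
toy <- data.frame(x = 1:5, y = c(2, 3, 5, 4, 6))

p_clip <- ggplot(toy, aes(x, y)) +
  geom_line() +
  # fix the visible x range to 1-5 and switch clipping off
  coord_cartesian(xlim = c(1, 5), clip = "off") +
  # the label starts at x = 5.2, i.e. to the right of the panel
  annotate("text", x = 5.2, y = 6, label = "last point", hjust = 0) +
  # widen the right margin so the label is not cut by the plot edge
  theme(plot.margin = margin(t = 10, r = 60, b = 10, l = 10))
p_clip
```

With the default clip = "on", the label would be cut off at the panel boundary; with too small a right margin, it would be cut off at the edge of the plot instead.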

Add text annotations in the right margin of the plot. Technically, the annotations are made in the third panel, “lupins”. To annotate selected panels only, a typical solution is to create a dataset (d) that specifies the annotation position and content, and that includes the faceting variable (Diet) to indicate which panel the annotations belong to. (To learn more about making unique annotations to selected panels, check this faceted arrow plot and this faceted dumbbell plot.)

# annotation to the 3rd panel
d <- tibble(
  x = rep(19.5, 3), 
  y = c(3.5, 3.23, 3.1),
  t = c("← 95%\nconf. int.", "← mean", "← SD"),
  # specifies that all texts are added to the panel "lupins"
  Diet = rep("lupins", 3))
d

Output:
# A tibble: 3 × 4
      x     y t                   Diet
  <dbl> <dbl> <chr>               <chr>
1  19.5  3.5  "← 95%\nconf. int." lupins
2  19.5  3.23 "← mean"            lupins
3  19.5  3.1  "← SD"              lupins

Add annotation using the dataset d.

p6 <- p5 + 
  geom_text(data = d, 
            aes(x = x, y = y, label = t), 
            hjust = 0, color = "black")
p6
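The same idea can be checked on a small made-up example (the data frames toy and note are hypothetical): because the annotation data carries the faceting variable, the text layer is assigned to a single panel.

```r
library(ggplot2)

# made-up data: three groups, one panel each
toy <- data.frame(
  x = rep(1:4, times = 3),
  y = c(1, 2, 3, 4,  2, 2, 3, 3,  4, 3, 2, 1),
  g = rep(c("a", "b", "c"), each = 4))

# annotation data: the column g names the only panel to annotate
note <- data.frame(x = 2.5, y = 3.5, label = "only in panel c", g = "c")

p_note <- ggplot(toy, aes(x, y)) +
  geom_line() +
  facet_wrap(~g) +
  geom_text(data = note, aes(label = label))

# the text layer holds one row, assigned to one panel
text_layer <- layer_data(p_note, 2)
text_layer[, c("x", "y", "label", "PANEL")]
```

If the faceting column were left out of the annotation data, ggplot2 would repeat the text in every panel.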

Continue Exploring — 🚀 one level up!

In a typical line plot, in addition to visual summaries of statistics, such as the central trend line and variability ribbons illustrated above, it is also common practice to highlight certain observations of interest, as demonstrated in the following annotated and highlighted line plot that shows the changing popularity of smoking in the U.S., Germany, and France, as well as other countries, over the past century.

While lines are often employed to trace trends over time, they serve many other purposes. In the following plot, each line corresponds to a distinct father-son pair, vividly illustrating the influence of a father’s occupation on his son’s career trajectory in the United States during the 1970s.

Check out this line plot with an underlying map to illustrate the global network of flights, as shown below. In addition, check this article to tweak this static network into a dreamy animation.
