Efficient Visualization of Summary Statistics in ggplot2: Tracking Changes in Milk Protein Content
In this article, I’ll demonstrate how to quickly create summary statistics in ggplot2. The demo graphic shows how the protein content of cows’ milk changes in the weeks following calving. The cattle are grouped by diet: barley alone, lupins alone, or both barley and lupins. The dataset, Milk, is built into the nlme package (popular for nonlinear and linear mixed-effects models).
Major techniques explained in this article include:
library(ggplot2)
library(dplyr)
library(nlme) # load the Milk dataset

# set up the default theme
theme_set(theme_classic(base_size = 14, base_family = "Abril"))

Milk <- Milk %>% as_tibble() # convert to tibble format

head(Milk, 3)
Output:
# A tibble: 3 × 4
  protein  Time Cow   Diet
    <dbl> <dbl> <ord> <fct>
1    3.63     1 B01   barley
2    3.57     2 B01   barley
3    3.47     3 B01   barley
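Load the Google font Abril Fatface and register it under the family name "Abril", which the default theme above refers to. (showtext_auto() is the current name of the function formerly called showtext.auto().)
library(showtext)
font_add_google(name = "Abril Fatface", family = "Abril")
showtext_auto()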
Create a line plot, with each line representing a cow. The plot is faceted by the variable Diet. As the legend information is already displayed in the titles of the faceted panels (subplots), we remove the legend at this step. The fill aesthetic has no effect yet, but it will be used by the ribbons added in the following steps.
p1 <- Milk %>%
  ggplot(aes(x = Time, y = protein,
             color = Diet, fill = Diet)) + # for ribbon layers added later
  geom_line(aes(group = Cow), alpha = .3) +
  facet_wrap(~Diet) +
  theme(legend.position = "none")

p1
Create a ribbon representing one standard deviation (SD) above and below the mean. Because more than one statistic is calculated (the mean, the upper SD limit, and the lower SD limit), we use the fun.data argument, in contrast to fun, which is for a single statistic (as used to create plot p4 at a later step). fun.args = list(...) passes additional arguments to the mean_sdl function; mult = 1 places the limits one SD above and below the mean (the default, mult = 2, places them two SDs from the mean). In like manner, the standard error of the mean can be calculated using the function mean_se. Note that mean_sdl is a wrapper around a function from the Hmisc package, which must be installed.
p2 <- p1 +
  stat_summary(
    fun.data = mean_sdl, fun.args = list(mult = 1),
    geom = "ribbon", alpha = .3,
    color = NA) # not show ribbon outline

p2
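To make concrete what a fun.data function has to return, here is a minimal base-R sketch. The name mean_sd_band is hypothetical (it is not part of ggplot2 or Hmisc); it is only meant to mirror what mean_sdl with mult = 1 computes, i.e. the mean plus and minus one sample SD, returned as the columns y, ymin, and ymax.

```r
# Hypothetical helper (not part of ggplot2 or Hmisc): returns the mean and the
# limits `mult` sample standard deviations below and above it, in the
# y / ymin / ymax layout that stat_summary(fun.data = ...) expects.
mean_sd_band <- function(x, mult = 1) {
  x <- x[!is.na(x)]   # drop missing values before summarizing
  m <- mean(x)
  s <- sd(x)          # sample SD (n - 1 denominator)
  data.frame(y = m, ymin = m - mult * s, ymax = m + mult * s)
}

# Made-up values, not taken from the Milk dataset
band <- mean_sd_band(c(3.2, 3.4, 3.6, 3.8, 4.0))
print(band)
```

A function of this shape could be passed as stat_summary(fun.data = mean_sd_band, geom = "ribbon"), with fun.args = list(mult = 2) for a two-SD band.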
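Create a ribbon depicting the 95% confidence limits of the mean, assuming normally distributed (Gaussian) data and using the t-distribution. The syntax is similar to that used for the standard deviation.
p3 <- p2 +
  stat_summary(
    fun.data = mean_cl_normal,
    fun.args = list(conf.int = .95), # confidence interval
    geom = "ribbon", color = NA)

p3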
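As a rough sketch of the calculation behind such limits, assuming the usual formula of mean ± t × standard error, the interval can be reproduced in base R. The helper name mean_ci_t is hypothetical, and the cross-check uses t.test from the stats package rather than mean_cl_normal itself.

```r
# Hypothetical helper: t-based confidence limits for the mean, returned in the
# y / ymin / ymax layout used by stat_summary(fun.data = ...).
mean_ci_t <- function(x, conf.int = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  m <- mean(x)
  se <- sd(x) / sqrt(n)                          # standard error of the mean
  t_crit <- qt((1 + conf.int) / 2, df = n - 1)   # two-sided critical value
  data.frame(y = m, ymin = m - t_crit * se, ymax = m + t_crit * se)
}

# Made-up values, not taken from the Milk dataset
x <- c(3.2, 3.4, 3.6, 3.8, 4.0)
ci <- mean_ci_t(x)
print(ci)

# Cross-check against the interval reported by a one-sample t-test
print(t.test(x, conf.level = 0.95)$conf.int)
```

Because the interval is built from the standard error rather than the SD, it narrows as the number of observations grows, which is why the confidence ribbon sits inside the SD ribbon in the plot.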
Create a central line representing the mean. Because only a single statistic, the mean, is calculated, we use the argument fun instead of fun.data as in the two steps above.
p4 <- p3 +
  stat_summary(fun = mean, geom = "line", linewidth = .7, color = "black")

p4
Polish up. Two edits here are particularly important to prepare the plot for annotation in the right margin of the plot (at the later step of creating plot p6):
In coord_cartesian (2nd line), we turn off clipping with the clip argument so that graphical elements extending beyond the panel area are displayed (by default they are clipped off).
Use plot.margin (last line) to add extra margin on the right side of the plot. (If there is text in the space between panels, as in this example, it may also be necessary to adjust the spacing between panels.)
p5 <- p4 +
  # coordinate; clip = "off" is the documented value for disabling clipping
  coord_cartesian(ylim = c(2.5, 4.5), expand = FALSE, clip = "off") +
  # x-axis breaks
  scale_x_continuous(breaks = seq(0, 16, 4)) +
  # axis titles
  labs(x = "weeks", y = "content (%)",
       title = "Protein content in milk after calving") +
  # theme
  theme(
    strip.background = element_blank(),
    strip.text = element_text(vjust = -2), # bring the facet title downward
    plot.title = element_text(hjust = .5), # center-justify the plot title
    legend.position = "none",
    # add plot margin, in particular to the right side of the plot
    # use t - r - b - l for top - right - bottom - left
    plot.margin = margin(r = 50, l = 10, t = 10, b = 10))

p5
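The interplay of the two settings is easier to see in isolation. The sketch below uses made-up toy data (not the Milk dataset) and places a label just outside the panel; it is an illustration under those assumptions, not part of the plot built in this article.

```r
library(ggplot2)

# Toy data, unrelated to the Milk dataset
toy <- data.frame(x = 1:10, y = (1:10)^2)

p <- ggplot(toy, aes(x, y)) +
  geom_line() +
  # the label starts at x = 10.5, outside the panel (x limits are 1 to 10)
  annotate("text", x = 10.5, y = 100, label = "last value", hjust = 0) +
  # expand = FALSE keeps the panel edge exactly at the limits;
  # clip = "off" lets the label be drawn beyond that edge
  coord_cartesian(xlim = c(1, 10), expand = FALSE, clip = "off") +
  # extra room on the right so the label is not cut off at the plot border
  theme(plot.margin = margin(t = 10, r = 80, b = 10, l = 10))

p
```

With clipping left at its default, the label would be drawn but hidden; with clipping off and no extra margin, it would run past the edge of the figure. Both settings are needed together.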
Add text annotations in the right margin of the plot. Technically, the annotations are drawn in the third panel, “lupins”. To annotate only selected panels, a typical solution is to create a dataset (d) that specifies the position and content of each annotation and includes the faceting variable (Diet) to indicate the panel in which each annotation should be drawn. (To learn more about making unique annotations to selected panels, check this faceted arrow plot and this faceted dumbbell plot.)
# annotation to the 3rd panel
d <- tibble(
  x = rep(19.5, 3),
  y = c(3.5, 3.23, 3.1),
  t = c("← 95%\nconf. int.", "← mean", "← SD"),
  # specifies that all texts are added to the panel "lupins"
  Diet = rep("lupins", 3))

d
Output:
# A tibble: 3 × 4
      x     y t                    Diet
  <dbl> <dbl> <chr>                <chr>
1  19.5  3.5  "← 95%\nconf. int."  lupins
2  19.5  3.23 "← mean"             lupins
3  19.5  3.1  "← SD"               lupins
Add annotation using the dataset d.
p6 <- p5 +
  geom_text(
    data = d,
    aes(x = x, y = y, label = t),
    hjust = 0, color = "black")

p6
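The same panel-specific annotation pattern can be tried on a small self-contained example. All names below (toy, labels, grp, txt) are illustrative and not from the article; the point is only that a label dataset carrying the faceting variable is drawn in the matching panel alone.

```r
library(ggplot2)

# Toy data with a faceting variable `grp` (all names here are illustrative)
toy <- data.frame(
  x   = rep(1:5, times = 3),
  y   = c(1:5, (1:5) * 2, (1:5) * 3),
  grp = rep(c("a", "b", "c"), each = 5)
)

# One row per label; the `grp` column assigns the label to panel "c" only
labels <- data.frame(x = 3, y = 12, txt = "only in panel c", grp = "c")

p <- ggplot(toy, aes(x, y)) +
  geom_line() +
  geom_text(data = labels, aes(label = txt)) +
  facet_wrap(~grp)

p
```

Had the label been added with annotate() instead, it would have been repeated in every panel, which is why a separate dataset with the faceting column is used.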
library(ggplot2)
library(dplyr)
library(nlme) # load the Milk dataset

# set up global default theme
theme_set(theme_classic(base_size = 14, base_family = "Abril"))

Milk <- Milk %>% as_tibble() # convert to tibble format

head(Milk, 3)

# Add the Google font Abril Fatface, registered under the family name "Abril".
library(showtext)
font_add_google(name = "Abril Fatface", family = "Abril")
showtext_auto()

# Create a line plot, with each line representing a cow.
p1 <- Milk %>%
  ggplot(aes(x = Time, y = protein,
             color = Diet, fill = Diet)) + # for ribbon layers added later
  geom_line(aes(group = Cow), alpha = .3) +
  facet_wrap(~Diet) +
  theme(legend.position = "none")
p1

# Create a ribbon representing one standard deviation (SD) above and below the mean.
p2 <- p1 +
  stat_summary(
    fun.data = mean_sdl, fun.args = list(mult = 1),
    geom = "ribbon", alpha = .3,
    color = NA) # not show ribbon outline
p2

# Create a ribbon depicting 95% normal (Gaussian) confidence interval.
p3 <- p2 +
  stat_summary(
    fun.data = mean_cl_normal,
    fun.args = list(conf.int = .95), # confidence interval
    geom = "ribbon", color = NA)
p3

# Create a central line representing the mean.
p4 <- p3 +
  stat_summary(fun = mean, geom = "line", linewidth = .7, color = "black")
p4

# Polish up; and prepare the plot (clip off, and increased margin) for the next step.
p5 <- p4 +
  # coordinate; clip = "off" is the documented value for disabling clipping
  coord_cartesian(ylim = c(2.5, 4.5), expand = FALSE, clip = "off") +
  # x-axis breaks
  scale_x_continuous(breaks = seq(0, 16, 4)) +
  # axis titles
  labs(x = "weeks", y = "content (%)",
       title = "Protein content in milk after calving") +
  # theme
  theme(
    strip.background = element_blank(),
    strip.text = element_text(vjust = -2), # bring the facet title downward
    plot.title = element_text(hjust = .5), # center-justify the plot title
    legend.position = "none",
    # add plot margin, in particular to the right side of the plot
    # use t - r - b - l for top - right - bottom - left
    plot.margin = margin(r = 50, l = 10, t = 10, b = 10))
p5

# Add text annotation in the right margin of the plot.
d <- tibble(
  x = rep(19.5, 3),
  y = c(3.5, 3.23, 3.1),
  t = c("← 95%\nconf. int.", "← mean", "← SD"),
  # specifies that all texts are added to the panel "lupins"
  Diet = rep("lupins", 3))

p6 <- p5 +
  geom_text(
    data = d,
    aes(x = x, y = y, label = t),
    hjust = 0, color = "black")
p6