Annotate Scatterplots with Confidence Ellipse and Marginal Distribution in ggplot2

In this article, I’ll illustrate how to create a scatterplot with different types of annotations. Major techniques explained in this article include:

Create confidence ellipses.
Create text annotations with geom_text.
Customize the theme.
Visualize the marginal distribution in two distinct ways.


Here we’ll use the popular penguins dataset from the palmerpenguins package.

library(ggplot2)
library(dplyr)
# install.packages("palmerpenguins")
library(palmerpenguins) # data package

# set as default theme
theme_set(theme_minimal(base_size = 16))

head(penguins, 4)

Output:
# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
# ℹ 2 more variables: sex <fct>, year <int>

Create a scatterplot, showing the relationship between the length and depth of a penguin’s bill.

p1 <- penguins %>%
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm,
             color = species)) +
  # add a small amount of random noise to the point positions to reduce overlap
  geom_point(position = position_jitter(
    width = .2, height = 0.2, seed = 1)) +
  # color scale for categorical variable
  scale_color_brewer(palette = "Set2")
p1

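Draw a 95% confidence ellipse for each species. By default, stat_ellipse() assumes a multivariate t-distribution, so each ellipse encloses roughly, not exactly, 95% of that species’ data points.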

p2 <- p1 +
  stat_ellipse(level = .95, show.legend = FALSE, linewidth = .7)
p2
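To see how the distribution assumption affects the ellipse, the sketch below overlays the default t-based ellipse with a normal-theory one (type = "norm"), and then roughly checks what share of each species’ points falls inside a 95% normal contour. The names bills, p_ellipse, and coverage are illustrative, and the Mahalanobis-distance check is only an approximation under a normality assumption, not the exact computation stat_ellipse() performs.

```r
library(ggplot2)
library(palmerpenguins)

# keep only the columns used here and drop rows with missing measurements
bills <- na.omit(penguins[, c("species", "bill_length_mm", "bill_depth_mm")])

p_ellipse <- ggplot(bills, aes(x = bill_length_mm, y = bill_depth_mm,
                               color = species)) +
  geom_point(alpha = .5) +
  # solid line: default ellipse based on a multivariate t-distribution
  stat_ellipse(type = "t", level = .95, linewidth = .7) +
  # dashed line: ellipse based on a multivariate normal distribution
  stat_ellipse(type = "norm", level = .95, linetype = "dashed",
               linewidth = .7) +
  scale_color_brewer(palette = "Set2")
p_ellipse

# rough check (assumption: bivariate normal data within each species):
# share of points whose squared Mahalanobis distance lies within the
# 95% chi-square contour with 2 degrees of freedom
coverage <- sapply(split(bills[, -1], bills$species), function(d) {
  d <- as.matrix(d)
  d2 <- mahalanobis(d, center = colMeans(d), cov = cov(d))
  mean(d2 <= qchisq(.95, df = 2))
})
round(coverage, 3)
```

The shares are typically close to, but not exactly, 0.95, which is why the ellipse is best read as a summary of the fitted distribution rather than a boundary that contains a fixed fraction of the points.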

Annotate the scatterplot with the species’ names in place of the legend. We’ll remove the legend in the next step.

species <- tibble(
  x = c(34, 56.5, 56),
  y = c(20, 16.5, 19),
  species = c("Adelie", "Gentoo", "Chinstrap"))

p3 <- p2 + geom_text(
  data = species,
  aes(x = x, y = y, label = species, color = species),
  # do not inherit the aesthetic mapping from the ggplot() call
  inherit.aes = FALSE,
  fontface = "bold.italic", size = 6)
p3
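The label coordinates above are chosen by hand. As an alternative sketch, the positions could be derived from the data, for example by placing each label at its species’ mean bill length and depth. The names label_pos and p_labels are illustrative, and labels placed at the centroids sit on top of the points, so in practice they may still need manual nudging.

```r
library(ggplot2)
library(dplyr)
library(palmerpenguins)

# one row per species, holding the centroid of its bill measurements
label_pos <- penguins %>%
  group_by(species) %>%
  summarise(
    x = mean(bill_length_mm, na.rm = TRUE),
    y = mean(bill_depth_mm, na.rm = TRUE)
  )

p_labels <- penguins %>%
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  # faded points so that the labels remain readable
  geom_point(alpha = .3, na.rm = TRUE) +
  geom_text(
    data = label_pos,
    aes(x = x, y = y, label = species, color = species),
    inherit.aes = FALSE,
    fontface = "bold.italic", size = 6, show.legend = FALSE
  ) +
  scale_color_brewer(palette = "Set2")
p_labels
```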

Polish the remaining details, as described in the code comments below.

p4 <- p3 +
  # rename axis titles
  labs(x = "bill length (mm)", y = "bill depth (mm)") +
  # adjust axis breaks
  scale_x_continuous(breaks = seq(from = 20, to = 60, by = 4)) +
  scale_y_continuous(breaks = seq(from = 10, to = 25, by = 2)) +
  theme(
    # remove legend
    legend.position = "none",
    # remove the minor grids, and adjust the width of the major grids
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(linewidth = .3),
    # bold axis titles
    axis.title = element_text(face = "bold"),
    # increase the margin on the right side of y-axis title, and on top of x-axis title
    axis.title.y = element_text(margin = margin(r = 10)),
    axis.title.x = element_text(margin = margin(t = 10)))
p4

Visualize the marginal distributions. geom_rug() draws barcode-like tick marks along the edges of the plot, one per data point, giving an intuitive picture of how the data are distributed along each axis.

p5 <- p4 +
  geom_rug(
    # apply the same "jitter" position to align with points
    position = position_jitter(width = .2, height = 0.2, seed = 1),
    sides = "tr", # t, top; r, right; b, bottom; l, left
    length = unit(10, "pt"), # shorter bar line length
    linewidth = .1) # smaller width to reduce overlap
p5
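The rug marks line up with the points only because both layers use the same jitter settings and the same seed. A minimal sketch to check this is shown below: it builds a stripped-down plot and compares the jittered coordinates of the two layers with layer_data(). The names jit, p_rug, pts, and rug are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# one jitter specification with a fixed seed, shared by both layers
jit <- position_jitter(width = .2, height = .2, seed = 1)

p_rug <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(position = jit, na.rm = TRUE) +
  geom_rug(position = jit, sides = "tr", na.rm = TRUE)

# extract the computed (jittered) data of each layer
pts <- layer_data(p_rug, 1) # point layer
rug <- layer_data(p_rug, 2) # rug layer

# TRUE when the jittered coordinates of the two layers agree
same_x <- isTRUE(all.equal(pts$x, rug$x))
same_y <- isTRUE(all.equal(pts$y, rug$y))
c(same_x = same_x, same_y = same_y)
```

Without a fixed seed, each layer would draw its own random offsets, and the rug marks would no longer sit exactly above or beside their points.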

Further enhance the marginal distribution visualization with density plots using the ggExtra package. The package also provides an interactive gadget for customizing the marginal plots, which can be launched by calling ggMarginalGadget(p5).

# install.packages("ggExtra")
library(ggExtra)
ggMarginal(p5, groupFill = TRUE, size = 10, linewidth = 0, alpha = .4)
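ggMarginal() draws density plots by default, but its type argument accepts other marginal plots as well. The sketch below uses grouped boxplots on a simplified scatterplot. The names bills, p_base, and p_box are illustrative, and the size value is an arbitrary choice for this example.

```r
library(ggplot2)
library(ggExtra)
library(palmerpenguins)

# drop rows with missing bill measurements
bills <- subset(penguins, !is.na(bill_length_mm) & !is.na(bill_depth_mm))

p_base <- ggplot(bills, aes(x = bill_length_mm, y = bill_depth_mm,
                            color = species)) +
  geom_point() +
  scale_color_brewer(palette = "Set2")

# marginal boxplots, with outline and fill grouped by the variable
# mapped to color in the scatterplot
p_box <- ggMarginal(p_base, type = "boxplot",
                    groupColour = TRUE, groupFill = TRUE, size = 8)
p_box
```

Boxplots take less space than density curves and make the medians easy to compare across species, while density plots show the shape of each distribution in more detail.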


Continue Exploring — 🚀 one level up!

A scatterplot is often drawn on a semi-logarithmic or log-log scale when the data are strongly skewed along one or both axes. Check out the following scatterplot on a semi-logarithmic scale, which reveals a linear relationship between the percentage of urbanization and log(GDP per capita).

Besides scatterplots, line plots are among the most widely used tools in data visualization. Check out the following annotated line plot, which shows how the popularity of smoking has changed worldwide, with the United States, Germany, and France highlighted.
