Annotate Scatterplots with Confidence Ellipse and Marginal Distribution in ggplot2

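In this article, I’ll illustrate how to create a scatterplot with several types of annotation. The major techniques covered are confidence ellipses, direct text labels in place of a legend, and marginal distributions drawn with rug marks and density plots.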


Here we’ll use the popular penguins dataset from the palmerpenguins package.

library(ggplot2)
library(dplyr)
# install.packages("palmerpenguins")
library(palmerpenguins) # data package

# set as default theme
theme_set(theme_minimal(base_size = 16))

head(penguins, 4)

Output:

# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
# ℹ 2 more variables: sex <fct>, year <int>
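
Row 4 above has missing bill measurements, and ggplot2 drops such rows (with a warning) when it draws the points. As a quick check before plotting, here is a minimal sketch that counts them. The names bill_cols and n_missing are my own and are used only for illustration.

```r
library(palmerpenguins)

# columns used for the scatterplot in this article
bill_cols <- c("bill_length_mm", "bill_depth_mm")

# number of rows with at least one missing bill measurement
n_missing <- sum(!complete.cases(penguins[, bill_cols]))
n_missing
```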

Create a scatterplot, showing the relationship between the length and depth of a penguin’s bill.

p1 <- penguins %>%
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm,
             color = species)) +
  # add a small amount of random noise to the point positions to reduce overlap
  geom_point(position = position_jitter(
    width = .2, height = 0.2, seed = 1)) +
  # color scale for categorical variable
  scale_color_brewer(palette = "Set2")
p1
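
To get a feel for how far the jitter moves the points, the sketch below applies the same position_jitter() settings to a tiny toy dataset (hypothetical values, not the penguins data) and reads the displaced positions back with layer_data(). With width = .2 and height = .2, each point should move by at most 0.2 units in either direction, which is small relative to the range of the bill measurements.

```r
library(ggplot2)

# toy data (hypothetical values, for illustration only)
toy <- data.frame(x = c(40, 45, 50), y = c(15, 18, 20))

p_toy <- ggplot(toy, aes(x, y)) +
  geom_point(position = position_jitter(width = .2, height = .2, seed = 1))

# layer_data() returns the positions after the jitter has been applied
jittered <- layer_data(p_toy, 1)

# displacement of each point from its original position
shift_x <- jittered$x - toy$x
shift_y <- jittered$y - toy$y

# largest absolute displacement along each axis
max(abs(shift_x))
max(abs(shift_y))
```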

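Draw a 95% confidence ellipse for each species. By default, stat_ellipse() assumes a multivariate t-distribution, so each ellipse is expected to enclose roughly 95% of that species’ data points.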

p2 <- p1 +
  stat_ellipse(level = .95, show.legend = FALSE, linewidth = .7)
p2
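
stat_ellipse() also takes a type argument that controls the distributional assumption behind the ellipse. The sketch below is an optional side comparison, not part of the main workflow: it overlays the default "t" ellipse (solid) and a "norm" ellipse (dashed) for each species so the two can be compared visually. The object name p_ellipse is my own.

```r
library(ggplot2)
library(palmerpenguins)

p_ellipse <- ggplot(penguins,
                    aes(x = bill_length_mm, y = bill_depth_mm,
                        color = species)) +
  # default: multivariate t-distribution (solid line)
  stat_ellipse(type = "t", level = .95, na.rm = TRUE) +
  # multivariate normal distribution (dashed line)
  stat_ellipse(type = "norm", level = .95, linetype = "dashed",
               na.rm = TRUE)
p_ellipse
```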

Annotate the scatterplot with the species’ names in place of the legend. We’ll remove the legend in the next step.

species <- tibble(
  x = c(34, 56.5, 56),
  y = c(20, 16.5, 19),
  species = c("Adelie", "Gentoo", "Chinstrap"))

p3 <- p2 +
  geom_text(
    data = species,
    aes(x = x, y = y, label = species, color = species),
    # do not inherit the aesthetic mapping from the ggplot() call
    inherit.aes = FALSE,
    fontface = "bold.italic", size = 6)
p3
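
The label coordinates above were picked by hand. As a possible starting point for choosing them, here is a minimal sketch that computes one position per species from the group means. This is an alternative I am suggesting, not the approach used in this article, and label_pos is a name introduced only for illustration.

```r
library(dplyr)
library(palmerpenguins)

# one candidate label position per species: the mean bill length and depth
label_pos <- penguins %>%
  group_by(species) %>%
  summarise(
    x = mean(bill_length_mm, na.rm = TRUE),
    y = mean(bill_depth_mm, na.rm = TRUE))
label_pos
```

Labels placed at the group means sit on top of the points, so some manual nudging toward empty areas of the plot is usually still needed.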

Polish the remaining details, as described in the code comments below.

p4 <- p3 +
  # rename axis titles
  labs(x = "bill length (mm)", y = "bill depth (mm)") +
  # adjust axis breaks
  scale_x_continuous(breaks = seq(from = 20, to = 60, by = 4)) +
  scale_y_continuous(breaks = seq(from = 10, to = 25, by = 2)) +
  theme(
    # remove legend
    legend.position = "none",
    # remove the minor grids, and adjust the width of the major grids
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(linewidth = .3),
    # bold axis titles
    axis.title = element_text(face = "bold"),
    # increase the margin on the right side of y-axis title, and on top of x-axis title
    axis.title.y = element_text(margin = margin(r = 10)),
    axis.title.x = element_text(margin = margin(t = 10))) +
  # center justify the legend title
  guides(color = guide_legend(title.hjust = .5))
p4
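
To see exactly which tick positions the two seq() calls request, they can be evaluated on their own. This is only a small sketch; x_breaks and y_breaks are names I introduce here for illustration.

```r
# breaks passed to scale_x_continuous()
x_breaks <- seq(from = 20, to = 60, by = 4)
# breaks passed to scale_y_continuous()
y_breaks <- seq(from = 10, to = 25, by = 2)

x_breaks  # 20, 24, 28, ..., 60
y_breaks  # 10, 12, 14, ..., 24 (a step of 2 starting at 10 never lands on 25)
```

Only the breaks that fall within the axis limits appear on the plot, so requesting a range wider than the data is harmless.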

Visualize the marginal distribution. geom_rug() draws barcode-like marks along the edges of the plot, which intuitively depict the data distribution.

p5 <- p4 +
  geom_rug(
    # apply the same "jitter" position to align with points
    position = position_jitter(width = .2, height = 0.2, seed = 1),
    sides = "tr", # t, top; r, right; b, bottom; l, left
    length = unit(10, "pt"), # shorter bar line length
    linewidth = .1) # smaller width to reduce overlap
p5
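
The rug marks line up with the points because both layers use position_jitter() with the same width, height, and seed. The sketch below checks this on a tiny toy dataset (hypothetical values) by comparing the computed positions of the two layers with layer_data(). The object names are my own.

```r
library(ggplot2)

# toy data (hypothetical values, for illustration only)
toy <- data.frame(x = c(40, 45, 50), y = c(15, 18, 20))

p_rug <- ggplot(toy, aes(x, y)) +
  geom_point(
    position = position_jitter(width = .2, height = .2, seed = 1)) +
  geom_rug(
    position = position_jitter(width = .2, height = .2, seed = 1),
    sides = "tr")

pts <- layer_data(p_rug, 1) # jittered point positions
rug <- layer_data(p_rug, 2) # jittered rug positions

# both comparisons should report that the positions match
all.equal(pts$x, rug$x)
all.equal(pts$y, rug$y)
```

Changing the seed (or the width and height) in only one of the two layers would break this alignment.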

Enhance the marginal distributions further with density plots from the ggExtra package. The package also provides an interactive gadget for customizing the marginal plots, which can be launched by calling ggMarginalGadget(p5).

# install.packages("ggExtra")
library(ggExtra)
ggMarginal(p5, groupFill = TRUE, size = 10, linewidth = 0, alpha = .4)
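
ggMarginal() draws density plots by default, and its type argument switches to other marginal plots such as histograms. Here is a minimal, self-contained sketch of that variation on a plain scatterplot. It is an optional alternative rather than part of the workflow above, and the names penguins_complete, p_base, and p_hist are my own.

```r
library(ggplot2)
library(ggExtra)
library(palmerpenguins)

# keep only the rows that have both bill measurements
penguins_complete <- penguins[
  complete.cases(penguins[, c("bill_length_mm", "bill_depth_mm")]), ]

p_base <- ggplot(penguins_complete,
                 aes(x = bill_length_mm, y = bill_depth_mm,
                     color = species)) +
  geom_point() +
  theme(legend.position = "none")

# marginal histograms, filled by species
p_hist <- ggMarginal(p_base, type = "histogram", groupFill = TRUE)
p_hist
```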



