Create Scatterplot on Semi-log Scale in ggplot2 to Visualize GDP vs. Urbanization

In this article, we’ll create a scatterplot to display the linear relationship between the log10 of GDP per capita and the percent of urbanization across different countries.

Major techniques covered in this visualization include:

Point shape customization.
Override of aesthetic inheritance in the legend.
Axis with logarithmic scale and annotation.
Label texts with reduced overlap.


Packages and data cleanup

library(ggplot2)
library(dplyr)
library(ggrepel) # text with minimized overlap

# install.packages("carData")
library(carData) # dataset package

UN2 <- UN %>% filter(!is.na(region)) # %>% as_tibble()

# display first 3 rows in tibble format
as_tibble(UN2) %>% head(n = 3)

Output:
# A tibble: 3 × 7
  region group  fertility ppgdp lifeExpF pctUrban infantMortality
  <fct>  <fct>      <dbl> <dbl>    <dbl>    <dbl>           <dbl>
1 Asia   other       5.97  499      49.5       23           125.
2 Europe other       1.52 3677.     80.4       53            16.6
3 Africa africa      2.14 4473      75         67            21.5

Visualization

Create a simple scatterplot. We use point shape 21 so that each point has a circular outline and a filled interior (check out more basics of points). The outline color is controlled by the color aesthetic, and the interior color by the fill aesthetic. Because the same color scale ("Set3") is applied to both aesthetics, the outline and interior merge into a solid point, as desired. Keeping the outline and interior as separate attributes lets us style them independently in the legend keys at the next step.

p1 <- UN2 %>%
  ggplot(aes(x = pctUrban, y = ppgdp, color = region, fill = region)) +
  geom_point(shape = 21) +
  # color scale
  scale_color_brewer(palette = "Set3", name = "") + # remove legend title
  scale_fill_brewer(palette = "Set3", name = "")
p1
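
To see how shape 21 separates outline and interior, here is a minimal sketch on the built-in mtcars data (an assumption for illustration; it is not the UN dataset used in this article). It inspects the built layer with layer_data() and checks that, with the same palette on both scales, every point receives the same outline and interior color.

```r
library(ggplot2)

# Illustrative data only: built-in mtcars, with cyl as the grouping variable
p_demo <- ggplot(mtcars, aes(x = wt, y = mpg,
                             color = factor(cyl), fill = factor(cyl))) +
  geom_point(shape = 21, size = 3) +
  scale_color_brewer(palette = "Set3", name = "") +
  scale_fill_brewer(palette = "Set3", name = "")

# layer_data() returns the computed aesthetics of the first layer
pts <- layer_data(p_demo, 1)

# TRUE when outline (colour) and interior (fill) resolve to the same colors
same_color <- all(pts$colour == pts$fill)
same_color
```

If the two scales used different palettes, the outline and interior would differ and each point would show a visible rim.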

Enhance the legend. By default, the legend keys inherit the aesthetic properties of the associated geom_*. In this example, the legend keys inherit the outline, fill, shape, size, etc. of geom_point(), and are drawn as small solid points. To make the legend more prominent, we override the aesthetic inheritance in the legend: we draw a black outline around the keys and increase their size.

p2 <- p1 + guides(color = guide_legend(
  override.aes = list(color = "black", size = 5)))
p2
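
As a small check that override.aes affects only the legend keys, the sketch below repeats the idea on the built-in mtcars data (chosen here only for illustration) and inspects the sizes of the points actually drawn in the panel.

```r
library(ggplot2)

# Illustrative data only: built-in mtcars
p_legend <- ggplot(mtcars, aes(x = wt, y = mpg,
                               color = factor(cyl), fill = factor(cyl))) +
  geom_point(shape = 21) +
  # override.aes restyles the legend keys, not the plotted points
  guides(color = guide_legend(
    override.aes = list(color = "black", size = 5)))

# Sizes of the points in the panel: a single default value, not 5
panel_sizes <- unique(layer_data(p_legend, 1)$size)
panel_sizes
```

The points keep the default size of geom_point(), while the keys in the legend are drawn larger and with a black outline.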

Transform the y-axis to a base-10 logarithmic scale, and add log-based ticks. This reveals a roughly linear relationship between log10(y) and x. The transformed y-axis now progresses exponentially (how to read log-scale). (Check this diamonds’ scatterplot with both axes log-transformed at base 2.)

p3 <- p2 + scale_y_log10(
  # set breaks (of the main grids)
  breaks = c(100, 500, 1000, 5000, 10^4, 5*10^4, 10^5),
  labels = function(x) {paste(x/1000, "K")}) +
  # add log-10 scale ticks;
  # note the scale tick space is not evenly distributed
  annotation_logticks(sides = "l", colour = "white")
p3
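
The following base R sketch shows what the labelling function produces for these breaks, and where the breaks fall on the log10 axis. The helper name label_k is hypothetical, introduced here only for illustration.

```r
# Breaks used in scale_y_log10() above
breaks <- c(100, 500, 1000, 5000, 10^4, 5 * 10^4, 10^5)

# Same labelling rule as in the scale: value in thousands followed by "K"
# ('label_k' is a hypothetical helper name)
label_k <- function(x) paste(x / 1000, "K")
labels <- label_k(breaks)
labels
# "0.1 K" "0.5 K" "1 K" "5 K" "10 K" "50 K" "100 K"

# Positions on the log10 axis: each tenfold increase in GDP spans one unit
positions <- log10(breaks)
round(positions, 2)
# 2.0 2.7 3.0 3.7 4.0 4.7 5.0
```

The breaks at 100, 1000, 10^4 and 10^5 are evenly spaced on the transformed axis, which is why equal vertical distances in the plot correspond to equal ratios of GDP rather than equal differences.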

By default, a minor grid line is drawn midway between two major grid lines. This is informative on a linear scale but not very useful on a logarithmic scale. We’ll remove the minor grid lines of the y axis in the following step.

Customize the theme, and remove the useless minor grids on the y axis.

p4 <- p3 +
  # add titles
  labs(
    y = "GDP per capita (US $)",
    x = "Urbanization percent",
    title = "  UN National Statistics, 2009–2011") +
  # theme
  theme_minimal(base_size = 15) +
  theme(
    # remove minor grids, which are not meaningful in log scale
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "black"),
    panel.grid = element_line(color = "grey30"),
    # use 'vjust' to sink plot title downward
    plot.title = element_text(vjust = -6, color = "snow3"))
p4

Draw a regression line, calculated on the transformed data, i.e., log10(y) and x. Because color = region and fill = region are inherited from the ggplot() line, a separate regression line would be created for each region. Here, however, we override that inheritance by assigning fixed values to color and fill. This results in a single regression fitted to the entire dataset, roughly equivalent to specifying aes(group = 1) in geom_smooth().

p5 <- p4 +
  geom_smooth(method = "lm", color = "white",
              fill = "beige", alpha = .4)
p5
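
Both claims in this step can be checked with a small sketch. It uses synthetic data (an assumption for illustration, not the UN dataset) in which log10(y) is linear in x, then inspects the smooth layer with layer_data(): fixed color and fill leave a single group, and the fitted values match lm() run on log10(y).

```r
library(ggplot2)

# Synthetic data (illustration only): log10(y) is linear in x plus noise
set.seed(1)
demo <- data.frame(
  x = seq(10, 100, length.out = 60),
  grp = rep(c("a", "b", "c"), times = 20)
)
demo$y <- 10^(2 + 0.02 * demo$x + rnorm(60, sd = 0.1))

p_fit <- ggplot(demo, aes(x = x, y = y, color = grp, fill = grp)) +
  geom_point(shape = 21) +
  scale_y_log10() +
  # fixed color and fill replace the inherited mappings in this layer
  geom_smooth(method = "lm", formula = y ~ x,
              color = "black", fill = "grey70")

# Computed data of the smooth layer (layer 2), on the transformed scale
smooth <- layer_data(p_fit, 2)
n_groups <- length(unique(smooth$group))  # number of separate regressions

# Compare with a plain linear model on log10(y)
fit <- lm(log10(y) ~ x, data = demo)
max_diff <- max(abs(smooth$y -
                      predict(fit, newdata = data.frame(x = smooth$x))))

n_groups
max_diff
```

Here n_groups is 1 and max_diff is zero up to numerical precision. Dropping the fixed color and fill from geom_smooth() would instead yield one fitted line per level of grp.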

Label the countries by name using the ggrepel package. The main function geom_text_repel() works much like geom_point(), but adds text labels instead of points. It automatically repels text labels from each other and from the points to reduce overlap, and keeps all labels within the plot boundary.

# label with country name, with minimized overlap
p6 <- p5 +
  geom_text_repel(aes(label = rownames(UN2)),
                  box.padding = unit(0, "pt"),
                  max.overlaps = Inf,
                  size = 2.3, show.legend = FALSE)
p6
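
As a variation, the labels can be stored in an explicit column instead of calling rownames() inside aes(). The sketch below assumes the built-in mtcars data and a hypothetical column name model, both chosen only for illustration.

```r
library(ggplot2)
library(ggrepel)

# Illustrative data only: built-in mtcars; copy row names into a column
cars_df <- mtcars
cars_df$model <- rownames(cars_df)  # 'model' is a hypothetical column name

p_lab <- ggplot(cars_df, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_text_repel(aes(label = model),
                  max.overlaps = Inf,  # keep every label, however crowded
                  size = 2.3)

# One label is prepared for each row of the data
n_labels <- nrow(layer_data(p_lab, 2))
n_labels
```

Keeping the label in the data frame ties each name to its row, which may be easier to maintain if the data is filtered or reordered before plotting.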

Continue Exploring — 🚀 one level up!

For highly skewed data, mathematical transformations are a powerful aid for visualizing the underlying data structure, as shown above. Such transformations are also commonly applied to the color scale. Check out the following awesome heatmap on African population density, where a pseudo-logarithmic transformation of the color scale is critical to reveal the highly skewed data pattern.

A scatterplot is often enhanced by visualizing the marginal (univariate) distribution of the x and y variables, and the bivariate distribution pattern with confidence ellipses. Check out the following scatterplot with marginal and ellipses visualization.

Furthermore, check here to learn how to encircle and highlight selected points.
