Create Scatterplot on Semi-log Scale in ggplot2 to Visualize GDP vs. Urbanization
In this article, we’ll create a scatterplot to display the linear relationship between the log10 of GDP per capita and the percent of urbanization across different countries.
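Major techniques covered in this visualization include overriding the legend-key aesthetics, transforming the y-axis to a log10 scale with log tick marks, fitting a single regression line on the transformed data, and labeling points with ggrepel while minimizing overlap.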
library(ggplot2)
library(dplyr)
library(ggrepel) # text with minimized overlap

# install.packages("carData")
library(carData) # dataset package

UN2 <- UN %>%
  filter(!is.na(region)) # %>% as_tibble()

# display first 3 rows in tibble format
as_tibble(UN2) %>%
  head(n = 3)
Output:
# A tibble: 3 × 7
  region group  fertility ppgdp lifeExpF pctUrban infantMortality
  <fct>  <fct>      <dbl> <dbl>    <dbl>    <dbl>           <dbl>
1 Asia   other       5.97  499      49.5       23           125.
2 Europe other       1.52 3677.     80.4       53            16.6
3 Africa africa      2.14 4473      75         67            21.5
Visualization
Create a simple scatterplot. We use point shape 21, which has a circular outline and a filled interior. The outline color is controlled by the color aesthetic, and the interior color by the fill aesthetic. Because the same palette ("Set3") is applied to both the color and fill scales, the outline and interior merge into a solid point, as desired. Mapping both aesthetics lets us style the outline and the interior separately in the legend keys in the next step.
p1 <- UN2 %>%
  ggplot(aes(x = pctUrban, y = ppgdp, color = region, fill = region)) +
  geom_point(shape = 21) +
  # color scale
  scale_color_brewer(palette = "Set3", name = "") + # remove legend title
  scale_fill_brewer(palette = "Set3", name = "")
p1
Enhance the legend. By default, the legend keys inherit the aesthetic properties of the associated geom_*. In this example, the legend keys inherit the outline, fill, shape, size, and other properties of geom_point(), and are drawn as small solid points. To make the legend keys more prominent, we override this inheritance in the legend: we give the keys a black outline and increase their size.
p2 <- p1 +
  guides(color = guide_legend(
    override.aes = list(color = "black", size = 5)))
p2
Transform the y-axis to a base-10 logarithmic scale, and add log-based tick marks. This reveals a roughly linear relationship between log10(y) and x. On the transformed y-axis, equal distances now correspond to equal multiplicative changes in GDP per capita rather than equal additive ones. (See also the diamonds scatterplot with both axes log-transformed at base 2.)
p3 <- p2 +
  scale_y_log10(
    # set breaks (of the main grids)
    breaks = c(100, 500, 1000, 5000, 10^4, 5*10^4, 10^5),
    labels = function(x) {paste(x/1000, "K")}
  ) +
  # add log-10 scale ticks;
  # note the scale tick space is not evenly distributed
  annotation_logticks(sides = "l", colour = "white")
p3
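To preview the axis labels that the labels function above produces, it can be applied directly to the break values outside of ggplot2. This is a small base R sketch; the names gdp_breaks and label_k are introduced here for illustration only.

```r
# Break values used in scale_y_log10() above
gdp_breaks <- c(100, 500, 1000, 5000, 10^4, 5 * 10^4, 10^5)

# Same labelling rule as in the plot: express values in thousands
# (label_k is an illustrative name, not part of ggplot2)
label_k <- function(x) paste(x / 1000, "K")

gdp_labels <- label_k(gdp_breaks)
print(gdp_labels)
# "0.1 K" "0.5 K" "1 K" "5 K" "10 K" "50 K" "100 K"
```

The breaks below 1000 are shown as fractions of a thousand (for example "0.1 K"), which keeps a single unit across the whole axis.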
By default, a minor grid line is drawn midway between two major grid lines, which is informative on a linear scale but not very useful on a logarithmic one. We'll remove these minor grid lines in the next step.
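The following base R sketch illustrates why the default minor grid line is hard to read on a log scale. It assumes, as is my understanding of ggplot2's default behavior, that the minor line sits midway between two major breaks in the transformed (log10) space. The variable names are illustrative.

```r
# Two neighbouring major breaks from the plot above
major <- c(1000, 5000)

# Midpoint on the original (linear) scale
linear_mid <- mean(major)

# Midpoint in log10 space, back-transformed to the original scale;
# this equals the geometric mean of the two breaks
log_mid <- 10^mean(log10(major))

print(linear_mid) # 3000
print(log_mid)    # about 2236
```

A grid line at roughly 2236 does not correspond to any round GDP value, so it adds clutter without helping the reader.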
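Customize the theme, and remove the minor grid lines, which are uninformative on the log-scaled y-axis. Note that panel.grid.minor = element_blank() removes the minor grid lines on both axes.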
p4 <- p3 +
  # add titles
  labs(y = "GDP per capita (US $)", x = "Urbanization percent",
       title = " UN National Statistics, 2009–2011") +
  # theme
  theme_minimal(base_size = 15) +
  theme(
    # remove minor grids, which are not meaningful in log scale
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "black"),
    panel.grid = element_line(color = "grey30"),
    # use 'vjust' to sink plot title downward
    plot.title = element_text(vjust = -6, color = "snow3"))
p4
Draw a regression line, fitted on the transformed data, i.e., log10(y) against x. Because color = region and fill = region are inherited from the ggplot() call, geom_smooth() would by default fit a separate regression line for each region. Here, however, we override that inheritance by setting color and fill to fixed values. This yields a single regression line fitted to the entire dataset, roughly equivalent to specifying aes(group = 1) in geom_smooth().
p5 <- p4 +
  geom_smooth(method = "lm", color = "white", fill = "beige", alpha = .4)
p5
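The statement that the regression is computed on log10(y) and x can be checked with a small sketch. It uses synthetic data rather than the UN dataset, and the names toy, toy_plot, fit, and smooth_data are made up for this illustration. The idea is to compare the line drawn by geom_smooth() on a log10-scaled axis with an explicit lm() fit of log10(y) on x.

```r
library(ggplot2)

# Synthetic data (an assumption for illustration): y grows exponentially with x
set.seed(1)
toy <- data.frame(x = runif(50, 10, 100))
toy$y <- 10^(2 + 0.02 * toy$x + rnorm(50, sd = 0.2))

toy_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_y_log10()

# Explicit linear model on the log10-transformed response
fit <- lm(log10(y) ~ x, data = toy)

# Data computed for the smooth layer (layer 2); y is on the log10 scale
smooth_data <- layer_data(toy_plot, 2)
predicted <- unname(predict(fit, newdata = data.frame(x = smooth_data$x)))

same_line <- isTRUE(all.equal(smooth_data$y, predicted))
print(same_line)
```

If same_line is TRUE, the smooth layer and the explicit model describe the same line, which is consistent with the scale transformation being applied before the model is fitted.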
Label the countries by name using the ggrepel package. Its main function, geom_text_repel(), is used much like geom_point() but draws text labels instead of points. It automatically repels the labels from each other and from the data points to reduce overlap, and keeps all labels within the plot boundary.
# label with country name, with minimized overlap
p6 <- p5 +
  geom_text_repel(aes(label = rownames(UN2)),
                  box.padding = unit(0, "pt"),
                  max.overlaps = Inf,
                  size = 2.3, show.legend = FALSE)
p6
# Complete code for this visualization
library(ggplot2)
library(dplyr)
library(ggrepel) # text with minimized overlap

# install.packages("carData")
library(carData) # dataset package

UN2 <- UN %>%
  filter(!is.na(region)) # %>% as_tibble()

# display first 3 rows in tibble format
as_tibble(UN2) %>%
  head(n = 3)

# Create a scatter plot
p1 <- UN2 %>%
  ggplot(aes(x = pctUrban, y = ppgdp, color = region, fill = region)) +
  geom_point(shape = 21) +
  # color scale
  scale_color_brewer(palette = "Set3", name = "") + # remove legend title
  scale_fill_brewer(palette = "Set3", name = "")
p1

# Update the legend keys:
# Sketch out a black outline, and increase their size
p2 <- p1 +
  guides(color = guide_legend(
    override.aes = list(color = "black", size = 5)))
p2

# Transform y-axis to logarithmic scale, and add log-based ticks.
p3 <- p2 +
  scale_y_log10(
    # set breaks (of the main grids)
    breaks = c(100, 500, 1000, 5000, 10^4, 5*10^4, 10^5),
    labels = function(x) {paste(x/1000, "K")}
  ) +
  # add log-10 scale ticks;
  # note the scale tick space is not evenly distributed
  annotation_logticks(sides = "l", colour = "white")
p3

# Add plot titles, and customize the theme.
# And remove the useless minor grids on the y axis.
p4 <- p3 +
  # add titles
  labs(y = "GDP per capita (US $)", x = "Urbanization percent",
       title = " UN National Statistics, 2009–2011") +
  # theme
  theme_minimal(base_size = 15) +
  theme(
    # remove minor grids, which are not meaningful in log scale
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "black"),
    panel.grid = element_line(color = "grey30"),
    # use 'vjust' to sink plot title downward
    plot.title = element_text(vjust = -6, color = "snow3"))
p4

# Add regression line. The regression is calculated based on log(y) and x.
p5 <- p4 +
  geom_smooth(method = "lm", color = "white", fill = "beige", alpha = .4)
p5

# Label the name of the countries using the 'ggrepel' package with reduced overlap.
p6 <- p5 +
  geom_text_repel(aes(label = rownames(UN2)),
                  box.padding = unit(0, "pt"),
                  max.overlaps = Inf,
                  size = 2.3, show.legend = FALSE)
p6
Continue Exploring — 🚀 one level up!
For highly skewed data, mathematical transformations are a powerful tool for revealing the underlying data structure, as shown above. Such transformations are also commonly applied to the color scale. One example is a heatmap of African population density, where a pseudo-logarithmic transformation of the color scale is critical to revealing the pattern in highly skewed data.
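As a rough illustration of what a pseudo-logarithmic transformation does, here is a hand-written base R sketch. The function pseudo_log10 is a hypothetical helper defined here; it follows the asinh-based form that, to my understanding, the scales package uses for its pseudo-log transformation, with sigma fixed at 1 and base 10.

```r
# Hypothetical helper: a smooth transformation that behaves like log10
# for large values but stays finite at zero
pseudo_log10 <- function(x, sigma = 1) {
  asinh(x / (2 * sigma)) / log(10)
}

skewed <- c(0, 1, 10, 1000, 1e6)

plain_log <- log10(skewed)          # -Inf at zero
pseudo_log <- pseudo_log10(skewed)  # finite everywhere

print(plain_log)
print(pseudo_log)
```

Unlike log10(), the pseudo-log value at zero is 0 rather than -Inf, while for large inputs the two transformations nearly coincide. This is what makes the approach convenient for skewed data that contain zeros, such as population counts.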
A scatterplot is often enhanced by visualizing the marginal (univariate) distributions of the x and y variables, and the bivariate distribution pattern with confidence ellipses. A scatterplot with marginal distributions and confidence ellipses is a natural next step.