Create Elegant Population Pyramid in ggplot2

In this article, we’ll create a population pyramid displaying the German population structure in 2023. Major techniques covered in this article include:

  • Modify the dataset structure to fit the graphic designs.
  • Draw a population pyramid.
  • Create text and rectangle annotation.

Packages and data cleanup

library(ggplot2)library(dplyr)library(tidyr)library(stringr) # for string manipulation
theme_set(theme_classic())

The raw data is sourced from the Federal Statistical Office of Germany. The dataset (a cleaned version) used in this visualization can be downloaded here.

d <- read.csv(  "/Users/boyuan/Desktop/R/gallery/DATASETS/germany_population_2023.csv",   # remove the "X" letters otherwise added to headers started with numbers  check.names = F)
d <- d %>% as_tibble() # convert to tibble format
head(d, n = 3)

Output:

# A tibble: 3 × 3
age f m
<chr> <int> <int>
1 0_1 369 389
2 1_2 368 388
3 2_3 394 416

Transform the dataset into the tidy format: gender attributes (female f and male m) as one column, and the associated population as another column.

d.tidy <- d %>%   pivot_longer(-age, names_to = "gender", values_to = "population")head(d.tidy, 4)

Output:

# A tibble: 4 × 3
age gender population
<chr> <chr> <int>
1 0_1 f 369
2 0_1 m 389
3 1_2 f 368
4 1_2 m 388

Modify the population variable to create bars in a symmetrical layout:

  • The bars corresponding to male occupies the positive y-axis range, and bars of the female takes the negative y-axis range.

  • Increase the bar length by extra 80 units for each gender to offset the overlay of the central banner, which covers the bars of each gender by 80 units.

d.tidy <- d.tidy %>% as_tibble() %>%   mutate(population = ifelse(    gender == "f", - population - 80, population + 80 ))

Arrange the population bars by the age. While the dataset d.tidy is already arranged in order of age, the age variable is of a “character” type, and does not “memorize” such order when rendered in graphics. In order to retain such order, we need to convert the age variable to a factor, with levels explicitly indicated as in the current row order of d.tidy. Use unique to extract distinct ages, as the levels argument requires no duplication in the input values. (Learn all essential skills in reordering graphic elements in this complete tutorial.)

d.tidy$age <-   factor(d.tidy$age, levels = unique(d.tidy$age))

Highlight the population difference between genders. Here we create a new column pop.min to record the minimum population value of the two genders in each age group. The minimum values will be visualized as a masking layer in lighter color, and thus highlight the excess population in one of the genders.

d.tidy <- d.tidy %>%   group_by(age) %>%  mutate(pop.min = min(abs(population)),          # female in negative values, male in positive values         pop.min = ifelse(gender == "m", pop.min, -pop.min))
head(d.tidy, n = 4) # ready for visualization

Output:

# A tibble: 4 × 4
# Groups: age [2]
age gender population pop.min
<fct> <chr> <dbl> <dbl>
1 0_1 f -449 -449
2 0_1 m 469 449
3 1_2 f -448 -448
4 1_2 m 468 448

Visualization

Create bar plots with flipped axes. As the population value of males are positive, and females are negative, this creates bars in a symmetrical layout as mirror images (thanks to this step of data processing).

p1 <- d.tidy %>%   ggplot(aes(x = age, y = population, fill = gender)) +  geom_col(width = 1) +  coord_flip(clip = "off") p1

Highlight population difference between genders. The regions more saturated in color signify the population excess in one gender compared to the other. This effect is achieved by generating a masking layer of bars with semi-transparent white color, with bar length equal to the minimum population of the genders at each age (thanks to this step of data processing).

p2 <- p1 +   geom_col(aes(y = pop.min),            fill = "white", alpha = .5, width = 1,           position = "identity")  p2

Create a central banner overlay on the bars. This banner will contain the age labels, in place of the original x-axis (left vertical). Centered at y = 0, the width of the banner spans a total of 160 units, covering 80 units of bar length of each gender. To offset the coverage of the central banner to the underlying bars, we had included an additional 80 units to the population values at this early step of data processing.

p3 <- p2 +   annotate(    geom = "rect", # create rectangles    xmin = -Inf, xmax = Inf,     ymin = 80, ymax = -80,    fill = "floralwhite")p3

Create age labels in the central banner. We’ll remove the x-axis labels later.

p4 <- p3 +  annotate(    geom = "text",    x = seq(0, 100, 5), y = 0,     label = seq(0, 100, 5),    size = 3, fontface = "bold") p4

Revise the y-axis (bottom horizontal), and add plot titles.

  • To correct for the coverage of the central banner, the y-axis for male and female data should be shifted towards the right and left by an additional 80 units, respectively. Therefore, here we calculate the updated y-axis breaks, and label the axis with positive population values.

  • Use white spaces in the y-axis title to symmetrically split the title into two pieces, each half for each of the genders.

breaks <- seq(0, 700, by = 100)breaks.updated <- c(breaks + 80, - breaks - 80) 
p5 <- p4 + scale_y_continuous( breaks = breaks.updated, labels = function(x) {abs(x) - 80 }) + labs( y = "Female (in thousands) Male (in thousands) ", title = "Population Pyramid in Germany 2023", caption = "data source: https://service.destatis.de/bevoelkerungspyramide/") p5

Add Google font, and polish up the theme.

library(showtext)showtext_auto()font_add_google(name = "Abril Fatface", family = "fatface")
p6 <- p5 + theme_void() + # an empty canvas, with removal of all axial elements theme( axis.text.x = element_text(size = 10, color = "snow4", margin = margin(t = 5)), axis.title.x = element_text(face = "bold", margin = margin(t = 5, b = 5)), plot.title = element_text(hjust = .5, size = 15, face = "bold", margin = margin(b = 10), family = "fatface"), plot.caption = element_text(hjust = .9, size = 10, color = "snow4"), plot.margin = margin(t = 10, b = 10), # increase plot top and bottom margin legend.position = "none")
p6
### Packages and data importlibrary(ggplot2)library(dplyr)library(tidyr)library(stringr)   # for string manipulationlibrary(gganimate) # create animation
theme_set(theme_classic())
d <- read.csv( "/Users/boyuan/Desktop/R/gallery/DATASETS/germany_population_2023.csv", # remove the "X" letters otherwise added to headers started with numbers check.names = F)
d <- d %>% as_tibble() # convert to tibble format
head(d, n = 3)
# Transform the dataset into a tidy formatd.tidy <- d %>% pivot_longer(-age, names_to = "gender", values_to = "population")head(d.tidy, 4)

# Put the male data at the positive range of y axis, and female data at negative range. # And meanwhile increase the bar length by extra 80 units # to offset the overlay of the central banner which covers the bars of each gender by 80 units.d.tidy <- d.tidy %>% as_tibble() %>% mutate(population = ifelse( gender == "f", - population - 80, population + 80 ))

# Display the population pyramid in order of age.d.tidy$age <- factor(d.tidy$age, levels = unique(d.tidy$age))

# Highlight the population difference between genders.d.tidy <- d.tidy %>% group_by(age) %>% mutate(pop.min = min(abs(population)), # female in negative values, male in positive values pop.min = ifelse(gender == "m", pop.min, -pop.min))
head(d.tidy, n = 4) # ready for visualization
### Visualization ----------------------------------------
# Create bar plots with flipped axes.p1 <- d.tidy %>% ggplot(aes(x = age, y = population, fill = gender)) + geom_col(width = 1) + coord_flip(clip = "off") p1

# Highlight population difference between genders.p2 <- p1 + geom_col(aes(y = pop.min), fill = "white", alpha = .5, width = 1, position = "identity") p2

# Create a central banner overlay on the bars.p3 <- p2 + annotate( geom = "rect", # create rectangles xmin = -Inf, xmax = Inf, ymin = 80, ymax = -80, fill = "floralwhite")p3

# Label with ages in the central banner.p4 <- p3 + annotate( geom = "text", x = seq(0, 100, 5), y = 0, label = seq(0, 100, 5), size = 3, fontface = "bold") p4

# Revise the y axis (horizontal), and add plot titlesbreaks <- seq(0, 700, by = 100)breaks.updated <- c(breaks + 80, - breaks - 80)
p5 <- p4 + scale_y_continuous( breaks = breaks.updated, labels = function(x) {abs(x) - 80 }) + labs( y = "Female (in thousands) Male (in thousands) ", title = "Population Pyramid in Germany 2023", caption = "data source: https://service.destatis.de/bevoelkerungspyramide/") p5

# Add Google font, and polish up the theme. library(showtext)showtext_auto()font_add_google(name = "Abril Fatface", family = "fatface")
p6 <- p5 + theme_void() + # an empty canvas, with removal of all axial elements theme( axis.text.x = element_text(size = 10, color = "snow4", margin = margin(t = 5)), axis.title.x = element_text(face = "bold", margin = margin(t = 5, b = 5)), plot.title = element_text(hjust = .5, size = 15, face = "bold", margin = margin(b = 10), family = "fatface"), plot.caption = element_text(hjust = .9, size = 10, color = "snow4"), plot.margin = margin(t = 10, b = 10), # increase plot top and bottom margin legend.position = "none")
p6




Continue Exploring — 🚀 one level up!


We can slightly tweak the script to visualize the annual population pyramids in animation using the popular gganimate package. Check it out!



The barplot is the basis of pie and donut charts, which are essentially bars presented in a polar coordinate. Check the following exploded donut plots in faceted layout that visualizes the top four GDP contributors in each continent.