Create Ordered and Stacked Bar Plots in ggplot2 to Visualize U.S. Midwest Demographics

In this article, we’ll create a bar plot to visualize demographic data of the U.S. midwest counties in 2000, highlighting the relationship between poverty rate and education level. Major techniques covered in this work include:

Reorder the bar plot.
Create bar plots in a “symmetrical” layout.
Generate regression lines on stacked bar plot with synchronized position.
Customize the theme.

Stepwise instructions

Packages and data cleanup

We’ll use the midwest dataset, which ships with the ggplot2 package.

library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)

# set up global theme
theme_set(theme_classic(base_size = 15))

Reorder the bar plot. Arrange the counties (PID) in ascending order of the percentage of the population living below the poverty line (percbelowpoverty), and convert PID into a factor to “memorise” this order. This ensures that the counties are drawn in that order in the plot. (See the complete guide on how to reorder graphic elements in ggplot2.)

m.reordered <- midwest %>% 
  arrange(percbelowpoverty) %>% 
  mutate(PID = factor(PID, levels = PID))
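To see why converting to a factor preserves the sorted order, here is a minimal sketch on a toy table. The names toy, id and rate are made up for illustration and are not part of the midwest data.

```r
library(dplyr)

# Hypothetical toy data: three units with an arbitrary rate
toy <- tibble(id = c("a", "b", "c"), rate = c(12, 3, 7))

toy.ordered <- toy %>%
  arrange(rate) %>%                      # sort rows by rate, ascending
  mutate(id = factor(id, levels = id))   # freeze the row order as factor levels

levels(toy.ordered$id)
# "b" "c" "a"
```

The trick relies on the ID column holding unique values, since factor() does not accept duplicated levels.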

Tidy up the dataset. Select four variables to be plotted: the county ID (PID), the percentage of the population with a high school diploma (perchsd) and with a college degree (percollege), and the percentage living in poverty (percbelowpoverty). Transform the dataset into a tidy structure, such that the three demographic variable names are in one column, education_poverty, and the associated percentages are in another column, percent.

m.tidy <- m.reordered %>% 
  select(PID, perchsd, percollege, percbelowpoverty) %>% 
  # tidy up
  pivot_longer(
    -PID, 
    names_to = "education_poverty", 
    values_to = "percent")

head(m.tidy, n = 3)

Output:
# A tibble: 3 × 3
  PID   education_poverty percent
  <fct> <chr>               <dbl>
1 3026  perchsd             86.9
2 3026  percollege          37.4
3 3026  percbelowpoverty     2.18
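The wide-to-long reshaping can be easier to follow on a tiny table. The sketch below uses hypothetical data (wide, hs, college, poverty and measure are illustrative names only) and the same pivot_longer() arguments as above.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per unit, one column per measure
wide <- tibble(
  id      = c("x", "y"),
  hs      = c(80, 70),
  college = c(30, 20),
  poverty = c(5, 15))

long <- wide %>%
  pivot_longer(
    -id,                     # keep id, stack every other column
    names_to = "measure",    # former column names go here
    values_to = "percent")   # former cell values go here

dim(long)
# 6 3
```

Each unit now occupies three rows, one per measure, which is the shape ggplot2 needs to map the measure to the fill aesthetic.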

Create bars in “symmetrical” layout. Turn the poverty data to negative values. This way, the poverty data will be displayed on the negative range of the y-axis, while the education data on the positive range of the y-axis (as illustrated in p1). The same technique is also used to create population pyramids.

# turn poverty data to negative values
m.signed <- m.tidy %>% 
  mutate(percent = ifelse(
    education_poverty == "percbelowpoverty", -percent, percent)) 

head(m.signed, 3) # ready for visualization

Output:
# A tibble: 3 × 3
  PID   education_poverty percent
  <fct> <chr>               <dbl>
1 3026  perchsd             86.9
2 3026  percollege          37.4
3 3026  percbelowpoverty    -2.18
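As a rough illustration of the population pyramid mentioned above, the same sign flip can be applied to one of two groups. The counts below are made-up numbers, and pyramid, pyramid.signed and pyramid.plot are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# Hypothetical counts per age group and sex (made-up numbers)
pyramid <- tibble(
  age   = rep(c("0-19", "20-39", "40-59", "60+"), times = 2),
  sex   = rep(c("female", "male"), each = 4),
  count = c(30, 28, 25, 17, 32, 29, 24, 14))

# Flip one group to the negative side, as done for the poverty data
pyramid.signed <- pyramid %>%
  mutate(count = if_else(sex == "male", -count, count))

# Bars for the two groups now extend in opposite directions from zero
pyramid.plot <- ggplot(pyramid.signed, aes(x = age, y = count, fill = sex)) +
  geom_col() +
  coord_flip()
```

Printing pyramid.plot draws the bars; the axis still shows negative numbers until it is relabelled.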

Visualization

For each county (PID), create a bar to display the population percent living below the poverty line in the negative range of the y-axis, and percent with high school or college education in the positive range of y-axis. (Due to the large number of counties, the bars are squeezed into line segments.)

p1 <- m.signed %>% 
  ggplot(aes(x = PID, y = percent, fill = education_poverty)) +
  geom_col(alpha = .6) +
  coord_flip(expand = FALSE) # flip the plot; no padding around the data
p1

Add a central line at y = 0 with geom_hline() (it appears vertical because the coordinates are flipped), separating the poverty data from the education data. Relabel both halves of the y-axis with positive numbers.

p2 <- p1 +
  geom_hline(yintercept = 0, linewidth = 1.5) +
  scale_y_continuous(
    breaks = seq(-40, 120, 20),
    labels = function(x){ifelse(x < 0, -x, x)})
p2
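The labelling function only changes what is printed on the axis, not the data. A quick check outside the plot shows what it returns for the chosen breaks; relabel is a hypothetical name given here to the anonymous function used above.

```r
# The same breaks and labelling rule as in scale_y_continuous() above
breaks <- seq(-40, 120, 20)
relabel <- function(x) ifelse(x < 0, -x, x)

relabel(c(-40, 0, 20))
# 40 0 20

# For numeric breaks this is equivalent to taking the absolute value
all.equal(relabel(breaks), abs(breaks))
# TRUE
```

Under that equivalence, labels = abs would be a shorter way to write the same thing.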

Add simple linear regression lines to outline how education levels change across counties, which are arranged in ascending order of poverty rate (lowest at the bottom of the flipped plot).

Use the group aesthetic to specify the subsets of data to be regressed separately.
Use position = "stack" to align the regression lines with the stacked bars.
Use manual color and fill scales to keep the bars and lines consistent in color.

p3 <- p2 +
  geom_smooth(
    # use a data subset NOT containing the poverty data
    data = filter(m.signed, education_poverty != "percbelowpoverty"),
    aes(group = education_poverty, 
        color = education_poverty), 
    method = "lm", # linear model
    se = FALSE, # do not show the confidence ribbon
    position = "stack", # align with bars in position
    linetype = "dashed", linewidth = 1) +
  # make the color of bars and regression lines consistent
  scale_color_manual(values = c("steelblue2", "tomato")) + 
  scale_fill_manual(values = c("snow3", "steelblue1", "tomato")) 
p3
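For readers who want numbers behind the dashed lines, the sketch below fits the same kind of simple linear model outside the plot. It assumes that, because PID is a factor, geom_smooth() regresses on the counties' integer positions along the axis, so the county rank is used as the predictor here. The names ranked, rank, fit.hsd and fit.college are illustrative.

```r
library(ggplot2)  # provides the midwest dataset
library(dplyr)

# Rank counties by poverty rate, mirroring the factor order used in the plot
ranked <- midwest %>%
  arrange(percbelowpoverty) %>%
  mutate(rank = row_number())

# One simple linear regression per education variable
fit.hsd     <- lm(perchsd ~ rank, data = ranked)
fit.college <- lm(percollege ~ rank, data = ranked)

# Slopes: change in percentage points per one-step increase in poverty rank
coef(fit.hsd)[["rank"]]
coef(fit.college)[["rank"]]
```

Because of position = "stack", the upper dashed line in p3 should correspond to the sum of the two fitted lines rather than to the high school fit alone, so the separate slopes above are the easier quantities to interpret.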

Label the bars with the categories (poverty and education degree) they are associated with.

a <- tibble(
  Y = c(-43, 25, 105), 
  X = rep(100, 3),
  status = c("living\nin poverty", 
             "college\ndegree", 
             "high\nschool\ndegree"))

p4 <- p3 + geom_text(
  data = a,
  aes(x = X, y = Y, label = status), 
  inherit.aes = FALSE,
  hjust = 0, fontface = "bold", size = 5,
  color = c("snow4", "tomato", "steelblue4")) 
p4

Add axis and plot titles, and fine-tune the theme.

p5 <- p4 + 
  labs(
    x = "Counties", 
    y = "Percent of population in the county",
    title = "U.S. midwest demographics in 2000",
    subtitle = "Better education is strongly associated with the decrease of poverty") +
  theme(
    legend.position = "none",
    axis.ticks = element_blank(),
    axis.text.y = element_blank(),
    axis.line.y = element_blank(),
    axis.title.y = element_text(hjust = 1),
    axis.title.x = element_text(margin = margin(t = 15)),
    axis.line.x = element_line(linewidth = 1),
    plot.title = element_text(hjust = 1, face = "bold"),
    plot.subtitle = element_text(hjust = 1, face = "italic", margin = margin(b = 10))
  ) 
p5

Save the plot. By default, the most recently displayed plot will be saved. Here we save the plot to the “graphics” folder, which is in the same folder as the source code.

ggsave(filename = "bars education vs poverty.pdf",
       path = "graphics",  # a relative path
       width = 7, height = 5)
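If the target folder does not exist yet, the save may fail or prompt, depending on the ggplot2 version. A cautious sketch is to create the folder first and to pass the plot explicitly instead of relying on the last displayed plot. Here out.dir, demo.plot and demo.pdf are hypothetical names, and a temporary directory stands in for the “graphics” folder.

```r
library(ggplot2)

# Hypothetical output folder inside the session's temporary directory
out.dir <- file.path(tempdir(), "graphics")
dir.create(out.dir, showWarnings = FALSE, recursive = TRUE)

# A small stand-in plot; in the article this would be p5
demo.plot <- ggplot(midwest, aes(x = percollege, y = percbelowpoverty)) +
  geom_point()

ggsave(filename = "demo.pdf",
       plot = demo.plot,   # save this plot, not the last one displayed
       path = out.dir,
       width = 7, height = 5)

file.exists(file.path(out.dir, "demo.pdf"))
```

Naming the plot explicitly keeps the result predictable when several plots have been printed in the same session.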

Code only

library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)

theme_set(theme_classic(base_size = 15))

# Arrange counties (`PID`) in order of poverty percent.
m.reordered <- midwest %>% 
  arrange(percbelowpoverty) %>% 
  mutate(PID = factor(PID, levels = PID))

# Select useful variables, and convert data to tidy structure. 
m.tidy <- m.reordered %>% 
  select(PID, perchsd, percollege, percbelowpoverty) %>% 
  # tidy up
  pivot_longer(
    -PID, 
    names_to = "education_poverty", 
    values_to = "percent")

head(m.tidy, n = 3)

# turn the poverty percent to negative values.
m.signed <- m.tidy %>% mutate(
  percent = ifelse(education_poverty == "percbelowpoverty", -percent, percent)) 

head(m.signed, 3) # ready for visualization

### Visualization
# Create a bar plot.
p1 <- m.signed %>% 
  ggplot(aes(x = PID, y = percent, fill = education_poverty)) +
  geom_col(alpha = .6) +
  coord_flip(expand = FALSE) # flip the plot; no padding around the data
p1

# Add a central vertical line 
p2 <- p1 +
  geom_hline(yintercept = 0, linewidth = 1.5) +
  scale_y_continuous(
    breaks = seq(-40, 120, 20),
    labels = function(x){ifelse(x < 0, -x, x)})
p2

# Add linear regression to outline the changing trend of education.
p3 <- p2 +
  geom_smooth(
    # use a data subset NOT containing the poverty data
    data = filter(m.signed, education_poverty != "percbelowpoverty"),
    aes(group = education_poverty, 
        color = education_poverty), 
    method = "lm", # linear model
    se = FALSE, # do not show the confidence ribbon
    position = "stack", # align with bars in position
    linetype = "dashed", linewidth = 1) +
  # make the color of bars and regression lines consistent
  scale_color_manual(values = c("steelblue2", "tomato")) + 
  scale_fill_manual(values = c("snow3", "steelblue1", "tomato")) 
p3

# Label the bars with the categories they are associated with. 
a <- tibble(
  Y = c(-43, 25, 105), 
  X = rep(100, 3),
  status = c("living\nin poverty", 
             "college\ndegree", 
             "high\nschool\ndegree"))

p4 <- p3 + geom_text(
  data = a,
  aes(x = X, y = Y, label = status), 
  inherit.aes = FALSE,
  hjust = 0, fontface = "bold", size = 5,
  color = c("snow4", "tomato", "steelblue4")) 
p4

# Add axis and plot titles, and fine-tune the theme. 
p5 <- p4 + 
  labs(
    x = "Counties", 
    y = "Percent of population in the county",
    title = "US midwest demographics in 2000",
    subtitle = "Better education is strongly associated with the decrease of poverty") +
  theme(
    legend.position = "none",
    axis.ticks = element_blank(),
    axis.text.y = element_blank(),
    axis.line.y = element_blank(),
    axis.title.y = element_text(hjust = 1),
    axis.title.x = element_text(margin = margin(t = 15)),
    axis.line.x = element_line(linewidth = 1),
    plot.title = element_text(hjust = 1, face = "bold"),
    plot.subtitle = element_text(hjust = 1, face = "italic", margin = margin(b = 10))
  ) 
p5

# Save the plot. 
ggsave(filename = "bars education vs poverty.pdf",
       path = "graphics",  # a relative path
       width = 7, height = 5)
