Visualize Social Mobility with a Line Plot Using ggplot2
In this article, we’ll create a line plot to visualize social mobility: how a father’s occupation relates to his son’s career path, based on data surveyed in the 1970s in the United States.
Major techniques covered in this tutorial include:
Create a “jittered” line plot. Each line represents a father-son pair. A technical highlight in this visualization is the addition of a secondary axis to a discrete (categorical) scale, which ggplot2 does not officially support. The secret is to first convert the categorical variable (i.e., occupation) into a numerical variable, which makes it possible to add the secondary axis (as we do at this step). We’ll then update the axis labels in a following step.
p1 <- logan.tidy %>%
  ggplot(aes(x = father.son,
             # as.numeric to enable adding secondary axis
             y = as.numeric(occupation),
             color = education)) +
  geom_line(aes(group = subject), alpha = .4,
            # lines are "jittered" such that each occupation pair
            # between the father and the son can be individually displayed.
            position = position_jitter(.07, .07, seed = 1)) +
  coord_flip(expand = FALSE)
p1
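To see what the as.numeric() conversion does to a factor, here is a minimal base R sketch. The job labels and the object names (jobs, codes) are made up for illustration and are not taken from the logan dataset.

```r
# Hypothetical job labels, used only for illustration.
jobs <- factor(
  c("sales", "farm", "craftsmen", "farm"),
  levels = c("farm", "craftsmen", "sales")
)

# as.numeric() returns the position of each value within levels(jobs).
codes <- as.numeric(jobs)
codes
# 3 1 2 1

# The level vector maps the integer codes back to readable labels,
# which is what the axis relabeling in the next step relies on.
levels(jobs)[codes]
```

Because the codes follow the level order, the order of the levels decides where each occupation lands on the numeric axis.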
Add secondary axis. The secondary axis transformation sec_axis(trans = ~., ...) at this step is possible only because the occupation variable was converted into a numerical variable in the earlier step. Here we manually replace the numeric axis labels with the desired job names using the labels argument.
p2 <- p1 +
  # fix labels of the original y axis
  scale_y_continuous(
    labels = occupations, # update axis label
    name = "Father",
    # add a secondary axis to display the son's job
    sec.axis = sec_axis(
      trans = ~.,
      labels = occupations, # update axis label
      name = "Son"))
p2
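The labels vector is paired with whatever breaks ggplot2 computes for the axis, which we would expect to fall on 1 through 5 here. As a slightly more defensive sketch, assuming a made-up toy data frame (toy and jobs are hypothetical names, not part of the tutorial data), the breaks can be pinned explicitly on both axes so that breaks and labels always have the same length:

```r
library(ggplot2)

# Hypothetical occupation levels and father-son pairs, for illustration only.
jobs <- c("farm", "operatives", "craftsmen", "sales", "professional")
toy <- data.frame(
  subject = rep(1:3, each = 2),
  father.son = rep(c("focc", "occupation"), times = 3),
  occupation = factor(
    c("farm", "sales", "craftsmen", "craftsmen", "professional", "operatives"),
    levels = jobs
  )
)

p <- ggplot(toy, aes(x = father.son, y = as.numeric(occupation), group = subject)) +
  geom_line() +
  scale_y_continuous(
    breaks = seq_along(jobs), # one break per occupation level
    labels = jobs,
    name = "Father",
    # identity transformation, passed by position
    sec.axis = sec_axis(~ ., breaks = seq_along(jobs), labels = jobs, name = "Son")
  ) +
  coord_flip()
p
```

The transformation formula is passed by position here because, as far as we are aware, recent ggplot2 releases rename the trans argument of sec_axis() to transform; the positional form should work with either name.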
Customize the color scale. Observations with fewer years of education (0 to 10 years) are represented by green colors with a slow color transition. Because the majority of the observations have more years of education (> 10 years), most of the palette is devoted to this data range, as controlled by the bias argument. (This powerful technique is also employed to create these beautiful heatmaps.)
When customizing a color scale for a skewed data distribution like this, another powerful approach is to use a logarithmic or pseudo-logarithmic transformation, as demonstrated in this fabulous map of African population density.
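To get a feel for what bias does, the following base R sketch compares a palette built with the default bias = 1 against one built with bias = 0.4. The five anchor colors and the object names are chosen only for illustration.

```r
# Illustrative anchor colors (a shortened version of a multi-color ramp).
base_cols <- c("green4", "skyblue2", "firebrick", "gold", "white")

# Default: anchor colors are spaced evenly along the palette.
even <- colorRampPalette(base_cols)(9)

# bias < 1: the first anchor colors are stretched over a longer part of the
# palette, and the remaining colors are packed toward the high end.
low_stretched <- colorRampPalette(base_cols, bias = 0.4)(9)

# Both palettes share the same end points but differ in between.
rbind(even, low_stretched)
```

With bias = 0.4, the midpoint of the palette is still in the green-to-blue segment, whereas the evenly spaced palette has already reached the third anchor color by then.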
Adjust the legend, and add plot titles. It is important to note that title.theme in guides() takes precedence over legend.title in the generic theme syntax.
p4 <- p3 +
  # adjust legend color bar
  guides(color = guide_colorbar(
    barheight = unit(200, "pt"),
    barwidth = unit(8, "pt"),
    title.position = "left",
    title = "total education years",
    # important: "title.theme" in "guides" takes precedence over
    # "legend.title" in the generic "theme" syntax
    title.theme = element_text(angle = 90, hjust = .5, color = "snow3"))) +
  # remove x (left) - axis title
  labs(x = NULL) +
  # add plot titles
  labs(title = "Social ladder & mobility:",
       subtitle = "What does the son of a craftsman do?")
p4
Update the theme. It is worth mentioning that in the theme syntax, the vertical axis is always treated as the y-axis, and the horizontal one as the x-axis, regardless of whether coord_flip is used.
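The difference between aesthetic-based and position-based naming can be seen in a small sketch. The toy data frame and its column names are hypothetical and serve only to illustrate the point.

```r
library(ggplot2)

# Hypothetical data, for illustration only.
toy <- data.frame(group = c("a", "b", "c"), value = c(1, 3, 2))

p <- ggplot(toy, aes(x = group, y = value)) +
  geom_col() +
  coord_flip() +
  # labs() follows the aesthetic: x is `group`, which is drawn on the
  # vertical (left) axis after the flip, so this removes the left axis title.
  labs(x = NULL) +
  # theme() follows the position on the page: axis.text.y styles the
  # vertical axis, axis.text.x the horizontal one.
  theme(
    axis.text.y = element_text(color = "red"),
    axis.text.x = element_text(color = "blue")
  )
p
```

In the rendered plot, the group labels on the left should appear in red and the value labels along the bottom in blue.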
library(ggplot2)
library(dplyr)
library(tidyr)
library(survival) # provides the logan dataset

occupations <- levels(logan$occupation)

logan.tidy <- logan %>%
  as_tibble() %>%
  mutate(subject = 1:nrow(logan)) %>%
  pivot_longer(c(occupation, focc),
               names_to = "father.son",
               values_to = "occupation")

head(logan.tidy, n = 3)

# Create a "jittered" line plot.
p1 <- logan.tidy %>%
  ggplot(aes(x = father.son,
             # as.numeric to enable adding secondary axis
             y = as.numeric(occupation),
             color = education)) +
  geom_line(aes(group = subject), alpha = .4,
            # lines are "jittered" such that each occupation pair
            # between the father and the son can be individually displayed.
            position = position_jitter(.07, .07, seed = 1)) +
  coord_flip(expand = FALSE)
p1

# Add secondary axis.
p2 <- p1 +
  # fix labels of the original y axis
  scale_y_continuous(
    labels = occupations, # update axis label
    name = "Father",
    # add a secondary axis to display the son's job
    sec.axis = sec_axis(
      trans = ~.,
      labels = occupations, # update axis label
      name = "Son"))
p2

# Update the color scale.
myColors <- c("green4", "green1", "skyblue2", "blue1", "firebrick",
              "tomato", "orange", "gold", "yellow", "beige", "white")
myColors <- colorRampPalette(myColors, bias = .4)(20)

p3 <- p2 + scale_color_gradientn(colors = myColors)
p3

# Adjust the legend, and add plot titles.
p4 <- p3 +
  # adjust legend color bar
  guides(color = guide_colorbar(
    barheight = unit(200, "pt"),
    barwidth = unit(8, "pt"),
    title.position = "left",
    title = "total education years",
    # important: "title.theme" in "guides" takes precedence over
    # "legend.title" in the generic "theme" syntax
    title.theme = element_text(angle = 90, hjust = .5, color = "snow3"))) +
  # remove x (left) - axis title
  labs(x = NULL) +
  # add plot titles
  labs(title = "Social ladder & mobility:",
       subtitle = "What does the son of a craftsman do?")
p4

# Update the theme.
p5 <- p4 +
  theme(
    # background
    panel.grid = element_blank(),
    panel.background = element_rect(fill = "black"),
    plot.background = element_rect(fill = "black"),
    # legend
    legend.background = element_blank(), # remove legend background
    legend.margin = margin(l = 20), # increase legend left margin
    # texts
    text = element_text(color = "snow", size = 14),
    axis.text.y = element_blank(),
    axis.text.x = element_text(color = "snow"),
    axis.title = element_text(face = "bold"),
    plot.title = element_text(face = "bold", size = 15),
    plot.subtitle = element_text(face = "italic", size = 12),
    # increase margin around the plot (top, right, bottom, left)
    plot.margin = margin(20, 20, 20, 20, unit = "pt"))
p5
Continue Exploring — 🚀 one level up!
In the visualization above, we used colorRampPalette(..., bias = .4) to adapt the color scale to a skewed data distribution. For a highly skewed dataset, another powerful approach is to apply a mathematical transformation to the color scale. Check this stunning African population heatmap with a pseudo-logarithmic transformation in the color scale, and you’ll be truly amazed at how an appropriate color scale, adjusted with a simple piece of code, can make a tremendous difference.
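As a rough sketch of the transformation approach, the code below applies a pseudo-logarithmic transformation to a continuous color scale on simulated, strongly right-skewed data. The data frame toy and its columns are invented for illustration and are unrelated to the population map. The trans argument is used here; to our understanding, recent ggplot2 releases prefer the name transform and may print a deprecation message for trans.

```r
library(ggplot2)

# Simulated right-skewed values (log-normal), for illustration only.
set.seed(1)
toy <- data.frame(
  x = runif(200),
  y = runif(200),
  density = rlnorm(200, meanlog = 2, sdlog = 2)
)

p <- ggplot(toy, aes(x = x, y = y, color = density)) +
  geom_point() +
  # pseudo-log behaves like a log scale for large values but stays
  # defined at zero, so the many small values keep distinct colors
  scale_color_viridis_c(trans = "pseudo_log")
p
```

Without the transformation, a handful of very large values would take up most of the color range and the bulk of the points would look nearly identical.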
In the above social mobility visualization, we have plotted out each individual observation (row), and each father-son pair is visualized as a single line. Often, it is desirable to plot not only individual observations, but also statistical summaries for a group, e.g., mean and variance. Check the following line plots with summary statistics, and learn the most efficient way to combine individual data points and group summarizing statistics in a graph.
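One common way to overlay group summaries on individual observations is stat_summary(). The sketch below uses simulated repeated measurements (toy, subject, time, and value are hypothetical names) and draws the mean with standard-error bars on top of the per-subject lines. It is one possible approach, not necessarily the one used in the linked articles.

```r
library(ggplot2)

# Simulated repeated measurements for 10 subjects, for illustration only.
set.seed(1)
toy <- data.frame(
  subject = rep(1:10, each = 3),
  time = rep(1:3, times = 10)
)
toy$value <- toy$time + rnorm(nrow(toy), sd = 0.5)

p <- ggplot(toy, aes(x = time, y = value)) +
  # one faint line per subject (individual observations)
  geom_line(aes(group = subject), alpha = 0.3) +
  # group mean at each time point, connected by a line
  stat_summary(fun = mean, geom = "line", color = "firebrick") +
  # mean plus/minus one standard error at each time point
  stat_summary(fun.data = mean_se, geom = "errorbar",
               width = 0.1, color = "firebrick")
p
```

Keeping the individual lines faint with alpha lets the summary layer stand out while the spread of the raw data stays visible.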