Create Heatmap in ggplot2 to Visualize the Control of Polio Disease by Vaccination in the U.S.
In this article, we’ll create a heatmap to visualize the annual incidences of the polio disease in the U.S. history, and highlight the power of vaccination in checking polio’s occurrences. This work is a ggplot2 reproduction of the original work created by Tynan and Dov in Wall Street Journal.
To access the authentic dataset (as used in this demonstration), go to Project TYCHO, scroll down to “ACCESS DATA” - “DATASETS Pre-Compiled”, and click “START NOW”. Then scroll up to “PREVIOUS VERSIONS” - “Level 1 Data peer-reviewed research 8 diseases”, and click download ( create an account to initiate download).
For practice purpose, you can download thissimulated data herewith a single click. This simulated dataset has the same data structure as the original one.
library(ggplot2)library(dplyr)library(stringr)library(RColorBrewer) data <-read.csv("/Users/boyuan/Desktop/R/gallery/DATASETS/US_disease.csv")data <- data %>%as_tibble() head(data, n =3)
Output:
# A tibble: 3 × 7 epi_week state loc loc_type disease cases incidence_per_100000 <int> <chr> <chr> <chr> <chr> <chr> <dbl> 1 196601 MN MINNESOTA STATE HEPATITIS A 3 0.08 2 196601 CO COLORADO STATE HEPATITIS A 1 0.05 3 196601 AZ ARIZONA STATE HEPATITIS A 6 0.37
Create a dataset summarizing the annual incidences per 100K population in each state and for each disease.
data.summary <- data %>%# extract years mutate(year =str_sub(epi_week, start =1, end =4) %>%as.integer()) %>%select(-epi_week) %>%# sum up cases across all weeks for each disease in each year and state # incidences presented as per 100 K populationgroup_by(state, year, disease) %>%summarise(n.per.100K =sum(incidence_per_100000)) head(data.summary, n =3)
Output:
# A tibble: 3 × 4 # Groups: state, year [2] state year disease n.per.100K <chr> <int> <chr> <dbl> 1 AK 1950 PERTUSSIS 63.7 2 AK 1950 POLIO 45.2 3 AK 1951 PERTUSSIS 49.4
In this article, we’ll create a heatmap for the polio disease only. In the next article, we’ll update the code into a function to automate the creation of heatmaps for all eight different diseases.
Create a basic heatmap using the generic geom_tile() function. Compared with this function, geom_raster() is a high-performance alternative when all the cells have the same size, but does not draw cell outlines (as used in this example).
p1 <- polio %>%ggplot(aes(x = year, y = state, fill = n.per.100K)) +# create heatmapgeom_tile(color ="white", linewidth = .5) p1
Design a color scale to highlight the dynamic changes in the polio occurrences: greens and blues are used only when the incidences are extremely low, and yellows and reds are applied for the majority range of the color scale. The colors are selected with trial and error to bring out the most vivid color contrast. (It helps to use scales::show_col(myColors) to visualize the color hex codes.)
Label the x-axis with 5-year time interval, remove the axis titles, and add plot title and subtitle.
p3 <- p2 +# x axis: year label ending with 00 or 05scale_x_continuous(breaks =seq(from =1930, to =1965, by =5)) +# axes and plot titleslabs(x =NULL, y =NULL, fill ="yearly cases \nper 100K",title ="POLIO",subtitle ="yearly cases per 100K population"%>%paste( "\n")) p3
Polish up the theme.
# color bar customizationp4 <- p3 +theme_classic() +theme(axis.line =element_blank(),axis.text.x =element_text(size =12),axis.text.y =element_text(size =8),plot.title =element_text(face ="bold", size =20),plot.subtitle =element_text(size =12, color ="grey60"),legend.position ="bottom",plot.margin =unit(c(.5, 0.5, 0.5, 0.5), "cm"))p4
Adjust the horizontal colorbar to be wider and thinner.
Annotate the heatmap with the year of vaccine introduction. Note that clip = "off" (the last line) is critical to completely display the texts, which would otherwise be clipped away and invisible after shifting upward (by vjust) beyond the panel boundary.
p6 <- p5 +# annotate with vertical line: the year of vaccine introduction geom_vline(xintercept =1954.5, linewidth =1) +# annotate with text: the year of vaccine introduction annotate(geom ="text", x =1954.5, y =51, label ="Vaccine introduced", vjust =-1.2, # shift the label upwardhjust =0, # left justifyfontface ="bold", size =4) +# expand to fill up entire plot area, # and clip off to display the text annotationcoord_cartesian(expand =0, clip ="off") p6
Save the plot.
ggsave(filename ="heatmap_vaccine_polio.pdf",path ="graphics", # a relative pathwidth =7, height =7)
library(ggplot2)library(dplyr)library(stringr)library(RColorBrewer) data <-read.csv("/Users/boyuan/Desktop/R/gallery/DATASETS/US_disease.csv")data <- data %>%as_tibble() data.summary <- data %>%# extract years mutate(year =str_sub(epi_week, start =1, end =4) %>%as.integer()) %>%select(-epi_week) %>%# sum up cases across all weeks for each disease in each year and each state # incidences presented as per 100 K populationgroup_by(state, year, disease) %>%summarise(n.per.100K =sum(incidence_per_100000)) polio <- data.summary %>%filter(disease =="POLIO") # color scalemyColors <-colorRampPalette(c("grey95", "#d1e6f7", "#a6d3f1", "#57b7e7","#1f9ad4", "#3aab70", "#e0d141", "#fdd53e", "#f4cd3c", "#feb73e", "orange", "#eb6c10", "firebrick2", "firebrick4"),bias =3)(50) plt <- polio %>%ggplot(aes(x = year, y = state, fill = n.per.100K)) +# create heatmapgeom_tile(color ="white", linewidth = .5) +scale_fill_gradientn(colours = myColors, breaks =seq(from =0, to =150, by =50)) +# x axis: year label ending with 00 or 05scale_x_continuous(breaks =seq(from =1930, to =1965, by =5)) +# titleslabs(x =NULL, y =NULL, fill ="yearly cases \nper 100K",title ="POLIO",subtitle ="yearly cases per 100K population"%>%paste( "\n")) +# themetheme_classic() +theme(axis.line =element_blank(),axis.text.x =element_text(size =12),axis.text.y =element_text(size =8),plot.title =element_text(face ="bold", size =16),plot.subtitle =element_text(size =12, color ="grey60"),legend.position ="bottom",plot.margin =unit(c(.5, 0.5, 0.5, 0.5), "cm")) +# color bar guides(fill =guide_colorbar(barwidth =unit(250, "pt"),barheight =unit(10, "pt"),title =NULL)) +# annotate with vertical line: the year of vaccine introduction geom_vline(xintercept =1954.5, linewidth =1) +# annotate with text: the year of vaccine introduction annotate(geom ="text", x =1954.5, y =51, label ="Vaccine introduced", vjust =-1.2, # shift the label upwardhjust =0, # left justifyfontface ="bold", size =4) +# expand to fill up entire plot area, # and clip off to display the text annotationcoord_cartesian(expand =0, clip ="off") plt ggsave(filename ="heatmap_vaccine_polio.pdf",path ="graphics", # a relative pathwidth =7, height =7)
Continue Exploring — 🚀 one level up!
Now that we have a pipeline to create the heatmap for the polio disease as demonstrated above, we’ll next wrap the code into a function to automate the heatmap creation for eight infectious diseases throughout the US history, and highlight how vaccination played a vital role in the disease control.
To generate heatmaps, in addition to the generic geom_tile() and geom_raster(), geom_bin2d() is another powerful function – Check out the following awesome 2D histogram with a world map overlay that visualizes the hurricane activities in North Atlantic Ocean.