Create Heatmap in ggplot2 to Visualize the Control of Polio Disease by Vaccination in the U.S.

In this article, we’ll create a heatmap to visualize the annual incidences of the polio disease in the U.S. history, and highlight the power of vaccination in checking polio’s occurrences. This work is a ggplot2 reproduction of the original work created by Tynan and Dov in Wall Street Journal.

Major techniques covered in this article include:


Packages and data cleanup

To access the authentic dataset (as used in this demonstration), go to Project TYCHO, scroll down to “ACCESS DATA” - “DATASETS Pre-Compiled”, and click “START NOW”. Then scroll up to “PREVIOUS VERSIONS” - “Level 1 Data peer-reviewed research 8 diseases”, and click download ( create an account to initiate download).

For practice purpose, you can download this simulated data here with a single click. This simulated dataset has the same data structure as the original one.

library(ggplot2)library(dplyr)library(stringr)library(RColorBrewer) 
data <- read.csv("/Users/boyuan/Desktop/R/gallery/DATASETS/US_disease.csv")data <- data %>% as_tibble()
head(data, n = 3)

Output:

# A tibble: 3 × 7
epi_week state loc loc_type disease cases incidence_per_100000
<int> <chr> <chr> <chr> <chr> <chr> <dbl>
1 196601 MN MINNESOTA STATE HEPATITIS A 3 0.08
2 196601 CO COLORADO STATE HEPATITIS A 1 0.05
3 196601 AZ ARIZONA STATE HEPATITIS A 6 0.37

Create a dataset summarizing the annual incidences per 100K population in each state and for each disease.

data.summary <- data %>%  # extract years   mutate(year = str_sub(epi_week, start = 1, end = 4) %>% as.integer()) %>%    select(-epi_week) %>%     # sum up cases across all weeks for each disease in each year and state   # incidences presented as per 100 K population  group_by(state, year, disease) %>%   summarise(n.per.100K = sum(incidence_per_100000)) 
head(data.summary, n = 3)

Output:

# A tibble: 3 × 4
# Groups: state, year [2]
state year disease n.per.100K
<chr> <int> <chr> <dbl>
1 AK 1950 PERTUSSIS 63.7
2 AK 1950 POLIO 45.2
3 AK 1951 PERTUSSIS 49.4

In this article, we’ll create a heatmap for the polio disease only. In the next article, we’ll update the code into a function to automate the creation of heatmaps for all eight different diseases.

polio <- data.summary %>% filter(disease == "POLIO")

Visualization

Create a basic heatmap using the generic geom_tile() function. Compared with this function, geom_raster() is a high-performance alternative when all the cells have the same size, but does not draw cell outlines (as used in this example).

p1 <- polio %>%   ggplot(aes(x = year, y = state, fill = n.per.100K)) +  # create heatmap  geom_tile(color = "white", linewidth = .5) 
p1

Design a color scale to highlight the dynamic changes in the polio occurrences: greens and blues are used only when the incidences are extremely low, and yellows and reds are applied for the majority range of the color scale. The colors are selected with trial and error to bring out the most vivid color contrast. (It helps to use scales::show_col(myColors) to visualize the color hex codes.)

myColors <- colorRampPalette(  c("grey95", "#d1e6f7", "#a6d3f1", "#57b7e7","#1f9ad4", "#3aab70", "#e0d141",     "#fdd53e", "#f4cd3c", "#feb73e", "orange", "#eb6c10",  "firebrick2", "firebrick4"),  bias = 3)(50) 
p2 <- p1 + scale_fill_gradientn(colours = myColors, breaks = seq(from = 0, to = 150, by = 50))
p2

Label the x-axis with 5-year time interval, remove the axis titles, and add plot title and subtitle.

p3 <- p2 +     # x axis:  year label ending with 00 or 05  scale_x_continuous(    breaks = seq(from = 1930, to = 1965, by = 5)) +    # axes and plot titles  labs(x = NULL, y = NULL,        fill = "yearly cases \nper 100K",       title = "POLIO",       subtitle = "yearly cases per 100K population" %>% paste( "\n")) 
p3

Polish up the theme.

# color bar customizationp4 <- p3 +   theme_classic() +  theme(axis.line = element_blank(),        axis.text.x = element_text(size = 12),        axis.text.y = element_text(size = 8),        plot.title = element_text(face = "bold", size = 20),        plot.subtitle = element_text(size = 12, color = "grey60"),        legend.position = "bottom",        plot.margin = unit(c(.5, 0.5, 0.5, 0.5), "cm"))p4

Adjust the horizontal colorbar to be wider and thinner.

p5 <- p4 +   guides(fill = guide_colorbar(barwidth = unit(250, "pt"),                               barheight = unit(10, "pt"),                               title = NULL)) p5

Annotate the heatmap with the year of vaccine introduction. Note that clip = "off" (the last line) is critical to completely display the texts, which would otherwise be clipped away and invisible after shifting upward (by vjust) beyond the panel boundary.

p6 <- p5 +   # annotate with vertical line: the year of vaccine introduction     geom_vline(xintercept = 1954.5, linewidth = 1) +    # annotate with text: the year of vaccine introduction     annotate(geom = "text",            x = 1954.5, y = 51,            label = "Vaccine introduced",            vjust = -1.2, # shift the label upward           hjust = 0, # left justify           fontface = "bold", size = 4) +    # expand to fill up entire plot area,   # and clip off to display the text annotation  coord_cartesian(expand = 0, clip = "off")  p6

Save the plot.

ggsave(filename = "heatmap_vaccine_polio.pdf",       path = "graphics", # a relative path       width = 7, height = 7)
library(ggplot2)library(dplyr)library(stringr)library(RColorBrewer) 
data <- read.csv("/Users/boyuan/Desktop/R/gallery/DATASETS/US_disease.csv")data <- data %>% as_tibble()
data.summary <- data %>% # extract years mutate(year = str_sub(epi_week, start = 1, end = 4) %>% as.integer()) %>% select(-epi_week) %>% # sum up cases across all weeks for each disease in each year and each state # incidences presented as per 100 K population group_by(state, year, disease) %>% summarise(n.per.100K = sum(incidence_per_100000))
polio <- data.summary %>% filter(disease == "POLIO")
# color scalemyColors <- colorRampPalette( c("grey95", "#d1e6f7", "#a6d3f1", "#57b7e7","#1f9ad4", "#3aab70", "#e0d141", "#fdd53e", "#f4cd3c", "#feb73e", "orange", "#eb6c10", "firebrick2", "firebrick4"), bias = 3)(50)
plt <- polio %>% ggplot(aes(x = year, y = state, fill = n.per.100K)) + # create heatmap geom_tile(color = "white", linewidth = .5) + scale_fill_gradientn(colours = myColors, breaks = seq(from = 0, to = 150, by = 50)) + # x axis: year label ending with 00 or 05 scale_x_continuous( breaks = seq(from = 1930, to = 1965, by = 5)) + # titles labs(x = NULL, y = NULL, fill = "yearly cases \nper 100K", title = "POLIO", subtitle = "yearly cases per 100K population" %>% paste( "\n")) + # theme theme_classic() + theme(axis.line = element_blank(), axis.text.x = element_text(size = 12), axis.text.y = element_text(size = 8), plot.title = element_text(face = "bold", size = 16), plot.subtitle = element_text(size = 12, color = "grey60"), legend.position = "bottom", plot.margin = unit(c(.5, 0.5, 0.5, 0.5), "cm")) + # color bar guides(fill = guide_colorbar(barwidth = unit(250, "pt"), barheight = unit(10, "pt"), title = NULL)) + # annotate with vertical line: the year of vaccine introduction geom_vline(xintercept = 1954.5, linewidth = 1) + # annotate with text: the year of vaccine introduction annotate(geom = "text", x = 1954.5, y = 51, label = "Vaccine introduced", vjust = -1.2, # shift the label upward hjust = 0, # left justify fontface = "bold", size = 4) + # expand to fill up entire plot area, # and clip off to display the text annotation coord_cartesian(expand = 0, clip = "off")
plt
ggsave(filename = "heatmap_vaccine_polio.pdf", path = "graphics", # a relative path width = 7, height = 7)




Continue Exploring — 🚀 one level up!


Now that we have a pipeline to create the heatmap for the polio disease as demonstrated above, we’ll next wrap the code into a function to automate the heatmap creation for eight infectious diseases throughout the US history, and highlight how vaccination played a vital role in the disease control.



To generate heatmaps, in addition to the generic geom_tile() and geom_raster(), geom_bin2d() is another powerful function – Check out the following awesome 2D histogram with a world map overlay that visualizes the hurricane activities in North Atlantic Ocean.