Twelve Awesome Methods to Deal with Overcrowded Scatterplots

In this article, we’ll discuss 12 powerful methods to address overcrowded scatterplots, organized into two cases.

Case 1: There are identical and completely overlapping data points, although the total number of data points is not high. We’ll work on the iris dataset to illustrate Methods 1 to 5.
Case 2: There are too many data points to be displayed individually. We’ll work on the txhousing dataset (about 8,600 data points) to illustrate Methods 6 to 12.

Case 1: Identical and overlapped data points with the `iris` dataset.

library(ggplot2)
library(dplyr)
theme_set(theme_minimal())

# create the plot base
i <- iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species))

Method 1: Apply transparency to unveil overlap.

Here we apply 50% transparency to the points with alpha = .5. Higher color saturation indicates the existence of data point overlap.

i + geom_point(size = 4, alpha = .5)

Method 2: Use jittered position

A frequently used method to unveil superimposed data points is to add a small amount of random noise to the horizontal (width) and vertical (height) position. The seed argument ensures reproducing the same pattern of randomization each time the script is run.

i + geom_point(
  size = 4, alpha = .5,
  position = position_jitter(width = .2, height = .1, seed = 123))

geom_jitter() is a shorthand for geom_point() with a jittered position (but it does not have the seed argument). In addition, the ggbeeswarm package offsets overlapping points in a more organized and symmetric manner.
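Because geom_jitter() has no seed argument, one way to make it reproducible is to fix R's global random seed before the plot is created and drawn. The following is a minimal sketch under that assumption; the helper build_jittered() and the objects first_run and second_run are hypothetical names introduced here for illustration.

```r
library(ggplot2)

# hypothetical helper: fix the global seed, then create and build the plot
build_jittered <- function(seed) {
  set.seed(seed)
  jittered <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                               color = Species)) +
    geom_jitter(size = 4, alpha = .5, width = .2, height = .1)
  # layer_data() returns the data of the first layer after jittering
  layer_data(jittered)
}

first_run  <- build_jittered(123)
second_run <- build_jittered(123)

# compare the jittered coordinates of the two runs
identical(first_run$x, second_run$x)
identical(first_run$y, second_run$y)
```

In a regular script, the same idea amounts to calling set.seed() on the line just before the plot is created and printed.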

Method 3: Draw shapes as outlines

Combined with the “jittered” position, outlined shapes further help to visualize the data overlap. Shape 21 is a popular option: it has an outline and an interior that can be filled (the interior remains empty by default).

i + geom_point(
  size = 4, shape = 21,
  position = position_jitter(width = .2, height = .1, seed = 123))
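Since shape 21 has a separate interior, the fill aesthetic can be mapped as well. The sketch below is one possible variation, not part of the original example: it fills the circles by species and keeps them semi-transparent so that overlaps stay visible. The object name outlined is introduced here for illustration.

```r
library(ggplot2)

# same base plot as above, rebuilt so the block is self-contained
i <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species))

# for shape 21, `color` controls the outline and `fill` the interior
outlined <- i + geom_point(
  aes(fill = Species),
  size = 4, shape = 21, alpha = .4,
  position = position_jitter(width = .2, height = .1, seed = 123))

outlined
```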

Method 4: Use larger points for clustered data points

geom_count() maps the extent of overlap to the aesthetic size of points.

i + geom_count(alpha = .6)
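To see the numbers behind the point sizes, the counts can be reproduced with dplyr. This is a minimal sketch for inspection only; because color is mapped to Species in the base plot, the overlap is assumed here to be counted within each species. The names overlap and n_points are introduced for illustration.

```r
library(dplyr)

# count how many observations share the same position within each species
overlap <- iris %>%
  count(Species, Sepal.Length, Sepal.Width, name = "n_points") %>%
  arrange(desc(n_points))

# positions with the heaviest overlap come first
head(overlap)
```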

Method 5: Enhance with marginal distribution

geom_rug() projects each data point onto the x and y axes as short tick marks, re-imaging the scatterplot as bar codes on two sides of the plot. The density of the bar codes visualizes the distribution of the x and y variables. When the scatterplot is jittered, the rug needs to be jittered in the same way (same width, height, and seed) to stay synchronized in position with the points.

i +
  geom_point(position = position_jitter(.2, .1, 123)) +
  geom_rug(position = position_jitter(.2, .1, 123),
           linewidth = .3)

In addition, ggExtra provides an elegant solution to visualize the marginal distribution using different types of plots, including density plot, histogram, box plot, and violin plot.

# install.packages("ggExtra")
library(ggExtra)

i2 <- i + geom_jitter() +
  theme(legend.position = "bottom")
ggMarginal(i2, groupFill = TRUE)

Case 2: High number of data points with the `txhousing` dataset

The following scatterplot, derived from the txhousing dataset, contains about 8,600 data points. The heavy overlap of the data points makes it difficult to see the underlying data pattern. In addition to the techniques discussed above, here we’ll introduce 7 more methods to improve the visualization.

# too many points
p <- txhousing %>%
  ggplot(aes(x = sales, y = median))
p + geom_point()

Method 6: Use the dot shape

Drawing each point as a semi-transparent dot (shape = ".") can be a quick way to reveal the pattern of a large dataset. For highly skewed data like this dataset, however, the effectiveness of this method may still be limited.

p + geom_point(shape = ".", alpha = .5)

Method 7: Map point colors to neighbor counts

The ggpointdensity package provides an awesome solution to visualize large datasets with significant overlap. The geom_pointdensity() function works very similarly to geom_point(), but additionally colors each point based on the number of neighboring points. This way, both the overall distribution and individual points can be visualized.
Given the extreme skewness of the data distribution, we use trans = "pseudo_log" in the viridis color scale to preferentially highlight the most clustered data points. The same color scale transformation is also very helpful in heatmaps of skewed data.

# install.packages("ggpointdensity")
library(ggpointdensity)

p +
  geom_pointdensity() +
  scale_color_viridis_c(trans = "pseudo_log") +
  # elongate the color bar
  guides(color = guide_colorbar(barheight = unit(200, "pt")))
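To get a feel for what the pseudo-log transformation does to the neighbor counts, it can be applied to a few numbers directly. The sketch below uses pseudo_log_trans() from the scales package with its default settings, which is assumed here to be what trans = "pseudo_log" resolves to; the values in x are made up for illustration.

```r
library(scales)

# transformation object, assumed to match trans = "pseudo_log"
tr <- pseudo_log_trans()

# made-up counts spanning several orders of magnitude
x <- c(0, 1, 10, 100, 1000)

data.frame(
  x           = x,
  pseudo_log  = tr$transform(x),  # finite everywhere, including x = 0
  natural_log = log(x)            # -Inf at x = 0
)
```

Large values are compressed much like with a logarithm, while zero and small values remain well defined, which is why the color scale can still separate the sparse regions from the dense ones.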

Method 8: Count data points in 2D histogram

The scatterplot can be summarized as a 2D histogram. The bins argument specifies the number of intervals (bins) created along the x and y axes, and thus determines the number of cells the plot is divided into. The number of data points falling into each cell is counted and mapped to a color scale. A larger number of bins yields a higher resolution.

# Update the base plot with viridis color scale.
p2 <- p + scale_fill_viridis_c(option = "B")
p2 + geom_bin_2d(bins = 30)

Alternatively, draw the cells as hexagons (geom_hex() requires the hexbin package).

p2 + geom_hex(bins = 30)
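The effect of the bins argument on resolution can be checked by comparing a coarse and a fine 2D histogram. This is a minimal sketch; the bin numbers 10 and 60 are arbitrary choices, rows with missing values are dropped up front to avoid warnings, and the names housing, coarse, fine, n_coarse, and n_fine are introduced here for illustration.

```r
library(ggplot2)

# drop rows with missing sales or median price
housing <- subset(txhousing, !is.na(sales) & !is.na(median))

p2 <- ggplot(housing, aes(x = sales, y = median)) +
  scale_fill_viridis_c(option = "B")

coarse <- p2 + geom_bin_2d(bins = 10)  # few, large cells
fine   <- p2 + geom_bin_2d(bins = 60)  # many, small cells

# number of cells produced for each version
n_coarse <- nrow(layer_data(coarse))
n_fine   <- nrow(layer_data(fine))
c(coarse = n_coarse, fine = n_fine)
```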

Method 9: Visualize data distribution with 2D density plot

Similar to a 2D histogram, a 2D density plot is another way to unveil the underlying data distribution.

# density in binned format
p2 + stat_density_2d(aes(fill = after_stat(density)),
                     geom = "tile", contour = FALSE,
                     n = 30, color = "snow3")

As an alternative to geom = "tile", we can use geom = "raster". The “raster” is a high-performance alternative to “tile” when all cells are of the same size, but it does not have the color argument, and thus does not draw cell outlines.
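The raster variant can be sketched as follows. The block rebuilds the base plot so that it runs on its own and drops rows with missing values up front; the names housing and p_raster are introduced here for illustration.

```r
library(ggplot2)

# drop rows with missing sales or median price
housing <- subset(txhousing, !is.na(sales) & !is.na(median))

p2 <- ggplot(housing, aes(x = sales, y = median)) +
  scale_fill_viridis_c(option = "B")

# same density estimate as above, drawn as a raster without cell outlines
p_raster <- p2 +
  stat_density_2d(aes(fill = after_stat(density)),
                  geom = "raster", contour = FALSE, n = 30)

p_raster
```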

Method 10: Facet the plot into subplots

Faceting by city unveils a positive correlation between the x and y variables in many cities, which is much harder to perceive when all data points are merged into a single plot.
In the after_stat() function, density presents the density in each subplot relative to the data range of the whole dataset, while ndensity (as demonstrated) normalizes the density separately in each subplot.

p2 +
  stat_density_2d(aes(fill = after_stat(ndensity)),
                  geom = "raster", contour = FALSE, n = 30) +
  facet_wrap(~city, scales = "free") +
  theme_void() +
  # display complete subplot titles
  theme(strip.clip = "off") +
  # customize legend colorbar
  guides(fill = guide_colorbar(
    barheight = unit(150, "pt"),
    title.position = "left",
    title.theme = element_text(angle = 90, hjust = .5)))

Method 11: Interactive plot

ggiraph is a versatile ggplot2 extension package that helps generate interactive plots with a few tweaks of the script.

library(ggiraph)

p.static <- txhousing %>%
  ggplot(aes(x = sales, y = median,
             # display all points associated with the same city
             data_id = city,
             # interactively display the city name
             tooltip = city)) +
  geom_point_interactive(show.legend = FALSE) +
  ggtitle("This Is An Interactive Plot") +
  theme(plot.title = element_text(face = "bold", hjust = .5))

# girafe() renders the interactive plot; older ggiraph versions used ggiraph()
girafe(ggobj = p.static)
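The hover behavior can be tuned so that the selected city stands out from the crowd. The sketch below is one possible setup using girafe(), the rendering function in current ggiraph versions, together with opts_hover() and opts_hover_inv(); the CSS values and the names housing, p_hover, and widget are illustrative choices, not part of the original example.

```r
library(ggplot2)
library(ggiraph)

# drop rows with missing sales or median price
housing <- subset(txhousing, !is.na(sales) & !is.na(median))

p_hover <- ggplot(housing,
                  aes(x = sales, y = median,
                      data_id = city, tooltip = city)) +
  geom_point_interactive(show.legend = FALSE)

widget <- girafe(
  ggobj = p_hover,
  options = list(
    # style of the points sharing the hovered data_id (same city)
    opts_hover(css = "fill:firebrick;stroke:firebrick;"),
    # fade out all the other points while hovering
    opts_hover_inv(css = "opacity:0.1;")
  ))

widget
```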

Method 12: Animation

Animation with the popular package gganimate is another powerful tool to display complicated data. Animations are simply faceted plots but on a time scale. The following exploratory animation loops through 46 different cities and quickly displays the sales condition for each of them.

# The animation goes through 46 different cities
txhousing$city %>% n_distinct()

Output:
[1] 46

library(gganimate)
p + geom_point() +
  # label each frame with the associated city name
  labs(title = "{closest_state}") +
  # facet the plot into frames of different cities
  transition_states(states = city, transition_length = 0)

Check out this awesome animation of annual population pyramids, which dynamically illustrates the past and predicted dynamics of population structure in Germany.
