library(ggplot2)
library(dplyr)
theme_set(theme_minimal())
# create the plot base
i <- iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species))
Twelve Awesome Methods to Deal with Overcrowded Scatterplots
In this article, we’ll discuss 12 powerful methods to address overcrowded scatterplots, grouped into two cases.
Case 1: There are identical, completely overlapping data points, although the total number of data points is not high. We’ll work on the iris dataset to illustrate methods 1 ~ 5.
Case 2: There are too many data points to be displayed individually. We’ll work on the txhousing dataset (with about 8,600 data points) to illustrate methods 6 ~ 12.
Case 1: Identical and overlapped data points with the iris dataset
Method 1: Apply transparency to unveil overlap.
Here we apply 50% transparency to the points with alpha = .5. Higher color saturation indicates the existence of data point overlap.
i + geom_point(size = 4, alpha = .5)
Method 2: Use jittered position
A frequently used method to unveil superimposed data points is to add a small amount of random noise to the horizontal (width) and vertical (height) positions. The seed argument ensures that the same pattern of randomization is reproduced each time the script is run.
i + geom_point(
  size = 4, alpha = .5,
  position = position_jitter(width = .2, height = .1, seed = 123))
geom_jitter() is a shorthand for geom_point() with jittered position (but it does not have the seed argument). In addition, the ggbeeswarm package arranges the offset points in a more organized and symmetric manner; see the ggbeeswarm documentation to learn more.
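A quick way to see what the seed does is to build the same jittered layer twice and compare the computed positions. The sketch below assumes only ggplot2; the object names i_base, p_a, p_b, and p_c are illustrative and not part of the original script.

```r
library(ggplot2)

i_base <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species))

# same seed in both layers: the jittered coordinates should match
p_a <- i_base + geom_point(position = position_jitter(width = .2, height = .1, seed = 123))
p_b <- i_base + geom_point(position = position_jitter(width = .2, height = .1, seed = 123))

# a different seed: a different random pattern
p_c <- i_base + geom_point(position = position_jitter(width = .2, height = .1, seed = 456))

# layer_data() returns the positions after jittering has been applied
same_seed <- identical(layer_data(p_a)[c("x", "y")], layer_data(p_b)[c("x", "y")])
diff_seed <- identical(layer_data(p_a)[c("x", "y")], layer_data(p_c)[c("x", "y")])
same_seed
diff_seed
```

Passing position_jitter() to geom_point(), as done above, is one way to get a reproducible layout when geom_jitter() alone does not expose a seed.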
Method 3: Make shape in outlines
Combined with the “jittered” position, outlined shapes further help to visualize the data overlap. Shape 21 is a popular option; it has an outline and a fillable interior (which remains empty by default). Check the basics of points to learn more about shape customization.
i + geom_point(
  size = 4, shape = 21,
  position = position_jitter(width = .2, height = .1, seed = 123))
Method 4: Use larger points for clustered data points
geom_count() maps the extent of overlap to the size aesthetic of the points.
i + geom_count(alpha = .6)
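To get a feel for the numbers behind the point sizes, the overlap can be tallied directly with dplyr. This is a rough sketch rather than a description of geom_count()'s internals; it assumes that counting per Species mirrors the grouping implied by color = Species, and the names overlap and n_points are illustrative.

```r
library(dplyr)

# tally the observations at each unique (x, y) location within a species
overlap <- iris %>%
  count(Species, Sepal.Length, Sepal.Width, name = "n_points") %>%
  arrange(desc(n_points))

# the most heavily overlapped locations come first
head(overlap)
```

Rows with n_points greater than 1 correspond to the locations where several flowers share identical measurements.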
Method 5: Enhance with marginal distribution
geom_rug() projects each data point onto the x and y axes as short tick marks, re-imaging the scatterplot as bar codes along two sides of the plot. The density of the bar codes visualizes the distribution of the x and y variables. When the scatterplot is jittered, the rug needs to be jittered in the same manner (same width, height, and seed) to stay synchronized in position with the points.
i +
  geom_point(position = position_jitter(.2, .1, 123)) +
  geom_rug(position = position_jitter(.2, .1, 123), linewidth = .3)
In addition, ggExtra provides an elegant solution to visualize the marginal distribution using different types of plots, including density plot, histogram, box plot, and violin plot.
# install.packages("ggExtra")
library(ggExtra)
i2 <- i + geom_jitter() +
  theme(legend.position = "bottom")
ggMarginal(i2, groupFill = TRUE)
Case 2: High number of data points with the txhousing dataset
The following scatterplot, derived from the txhousing dataset, contains about 8,600 data points. The heavy overlap of the data points makes it difficult to see the underlying data pattern. In addition to the techniques discussed above, here we’ll introduce 7 more methods to improve the visualization.
# too many points
p <- txhousing %>%
  ggplot(aes(x = sales, y = median))
p + geom_point()
Method 6: Use the dot shape
Drawing each point as a semi-transparent dot (shape = ".") can be a quick way to reveal the pattern of a large dataset. For highly skewed data like this dataset, however, the effectiveness of this method may still be limited.
p + geom_point(shape = ".", alpha = .5)
Method 7: Map point colors to neighbor counts
The ggpointdensity package provides an awesome solution to visualize a large dataset with significant overlap. The geom_pointdensity() function works very similarly to geom_point(), but additionally colors the points based on the number of neighboring points. This way, both the overall distribution and individual points can be visualized.
Given the extreme skewness of the data distribution, we use trans = "pseudo_log" in the viridis color scale to preferentially highlight the most clustered data points. This color scale transformation is also a silver bullet in the creation of this stunning heatmap.
# install.packages("ggpointdensity")
library(ggpointdensity)
p + geom_pointdensity() +
  scale_color_viridis_c(trans = "pseudo_log") +
  # elongate the color bar
  guides(color = guide_colorbar(barheight = unit(200, "pt")))
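To see why the pseudo-log transformation suits skewed neighbor counts, it can help to apply it to a few sample values. The sketch below assumes that trans = "pseudo_log" resolves to scales::pseudo_log_trans() with its default settings; the sample counts are made up for illustration.

```r
library(scales)

# the transformation object assumed to sit behind trans = "pseudo_log"
pl <- pseudo_log_trans()

# hypothetical neighbor counts spanning several orders of magnitude
counts <- c(0, 1, 10, 100, 1000)
transformed <- pl$transform(counts)

data.frame(counts, transformed = round(transformed, 2))
```

Unlike a plain log transformation, the pseudo-log is defined at zero; it is roughly linear near zero and roughly logarithmic for large values, so the large counts are compressed into a much shorter stretch of the color scale.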
Method 8: Count data points in 2D histogram
The scatterplot can be summarized as a 2D histogram. The bins argument specifies the number of intervals (bins) created along the x and y axes, and so determines the number of cells the plot is divided into. The number of data points falling into each cell is counted and mapped to a color scale. A higher number of bins gives a higher resolution.
# Update the base plot with viridis color scale.
p2 <- p + scale_fill_viridis_c(option = "B")
p2 + geom_bin_2d(bins = 30)
Draw cells in hexagons.
p2 + geom_hex(bins = 30)
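The counting behind a 2D histogram can be approximated by hand with base R, which makes the role of the bins argument concrete. This is only a sketch: the exact bin boundaries chosen by geom_bin_2d() may differ slightly from those of cut(), and the names d and cells are illustrative.

```r
library(ggplot2)  # provides the txhousing dataset

# keep rows where both plotted variables are observed
d <- txhousing[complete.cases(txhousing[c("sales", "median")]), ]

# cut each axis into 30 equal-width intervals and cross-tabulate
cells <- table(
  cut(d$sales,  breaks = 30),
  cut(d$median, breaks = 30)
)

dim(cells)       # one row and one column per interval
sum(cells > 0)   # number of non-empty cells
max(cells)       # count in the most crowded cell
```

Each entry of cells plays the role of one rectangle in the plot, with its count mapped to the fill color.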
Method 9: Visualize data distribution with 2D density plot
Similar to a 2D histogram, a 2D density plot is another way to unveil the underlying data distribution.
# density in binned format
p2 + stat_density_2d(
  aes(fill = after_stat(density)),
  geom = "tile", contour = FALSE, n = 30, color = "snow3")
As an alternative to geom = "tile", we can use geom = "raster". The “raster” is a high-performance alternative to “tile” when all cells are of the same size, but it does not have the color argument, and thus does not draw cell outlines.
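For reference, a raster version of the density plot might look like the following. It is a minimal, self-contained sketch that rebuilds the base plot, so the names p2 and p_raster here are local to the example.

```r
library(ggplot2)

# base plot with the viridis fill scale, as used earlier for the 2D histogram
p2 <- ggplot(txhousing, aes(x = sales, y = median)) +
  scale_fill_viridis_c(option = "B")

# same density estimate as the tile version, drawn as a raster;
# the color argument is dropped because raster cells have no outline
p_raster <- p2 +
  stat_density_2d(aes(fill = after_stat(density)),
                  geom = "raster", contour = FALSE, n = 30)
p_raster
```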
Method 10: Facet the plot into subplots
Faceting reveals a positive correlation between the x and y variables in many cities, which is much harder to perceive when all data points are merged into a single plot.
In the after_stat() function: density presents the density in each subplot relative to the data range of the whole dataset, while ndensity (as demonstrated) normalizes the density separately in each subplot. Check out our unique ggplot2 ebook to learn more!
p2 +
  stat_density_2d(aes(fill = after_stat(ndensity)),
                  geom = "raster", contour = FALSE, n = 30) +
  facet_wrap(~city, scales = "free") +
  theme_void() +
  # display complete subplot titles
  theme(strip.clip = "off") +
  # customize legend colorbar
  guides(fill = guide_colorbar(
    barheight = unit(150, "pt"),
    title.position = "left",
    title.theme = element_text(angle = 90, hjust = .5)))
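The difference between density and ndensity can be checked numerically on a small subset. The sketch below assumes that layer_data() exposes both computed variables for this stat, and it picks three cities only to keep the check fast; the names three_cities, p_facet, and panel_max are illustrative.

```r
library(ggplot2)
library(dplyr)

# a small subset of txhousing keeps the check fast
three_cities <- txhousing %>%
  filter(city %in% c("Austin", "Dallas", "Houston"))

p_facet <- ggplot(three_cities, aes(x = sales, y = median)) +
  stat_density_2d(aes(fill = after_stat(ndensity)),
                  geom = "raster", contour = FALSE, n = 30) +
  facet_wrap(~city, scales = "free")

# per-panel maxima of the two computed variables
panel_max <- layer_data(p_facet) %>%
  group_by(PANEL) %>%
  summarise(max_density = max(density), max_ndensity = max(ndensity))
panel_max
```

If the assumption holds, max_ndensity is the same in every panel while max_density varies from city to city, which is why ndensity lets each subplot use the full color range.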
Method 11: Interactive plot
ggiraph is a versatile ggplot2 extension package that helps generate interactive plots with a few tweaks of the script.
library(ggiraph)
p.static <- txhousing %>%
  ggplot(aes(x = sales, y = median,
             # display all points associated with the same city
             data_id = city,
             # interactively display the city name
             tooltip = city)) +
  geom_point_interactive(show.legend = F) +
  ggtitle("This Is An Interactive Plot") +
  theme(plot.title = element_text(face = "bold", hjust = .5))
# girafe() is the current entry point; older ggiraph releases used ggiraph(ggobj = ...)
girafe(ggobj = p.static)
Method 12: Animation
Animation with the popular package gganimate is another powerful tool to display complicated data. An animation is essentially a faceted plot laid out over time instead of space. The following exploratory animation loops through 46 different cities and quickly displays the sales condition for each of them.
# The animation goes through 46 different cities
txhousing$city %>% n_distinct()
Output:
[1] 46
library(gganimate)
p + geom_point() +
  # label each frame with the associated city name
  labs(title = "{closest_state}") +
  # facet the plot into frames of different cities
  transition_states(states = city, transition_length = 0)
Check out this awesome animation of annual population pyramids, which dynamically illustrates the past and predicted dynamics of population structure in Germany.