Pseudo-Logarithm in Data Visualization

A Mostly Unknown Yet Powerful Method for Data Processing

Not many people have heard of this hidden gem of pseudo-logarithmic transformation. In this article, I’ll introduce you to this fantastic tool of data processing, and demonstrate how it adds a magic touch to the classic logarithm to create stunning data visuals, especially for skewed data.

Why Use Pseudo-Logarithm?

The classic logarithm is not defined for zero and negative values. This limits its use in many applications. In comparison, pseudo-logarithm fixes this limit of the classic logarithm: defined for all real numbers, it employs a signed logarithm for large absolute values, and transitions smoothly to zero as the underlying values approach zero.

What Exactly Is Pseudo-Logarithm?

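Pseudo-logarithm of base 10 (pseudo-log10) is defined as

$$\text{pseudo-log}_{10}(x) = \frac{\operatorname{arsinh}(x/2)}{\ln 10}$$

This equation involves the hyperbolic sine, sinh, with

$$\sinh(x) = \frac{e^{x} - e^{-x}}{2}$$

and arsinh is its inverse function, with

$$\operatorname{arsinh}(x) = \ln\left(x + \sqrt{x^{2} + 1}\right)$$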

In the plot below, values on the x-axis are transformed by pseudo-log10 and mapped to the y-axis, shown as the blue line. For comparison, the classic log10 transformation is drawn as the black curve.

Show the code

library(tidyverse)

pseudoLog10 <- function(x) {
  log((x / 2) + sqrt((x / 2)^2 + 1)) / log(10)
  # Alternatively, you can use the built-in function 'asinh':
  # asinh(x / 2) / log(10)
}

x <- seq(-12, 12, .05) # x-axis
y <- log(x, base = 10) # map to classic log10 (NaN for negative x, with a warning)
z <- pseudoLog10(x)    # map to pseudo-log10

# plot
p <- tibble(x, y, z) %>%
  ggplot(aes(x)) +
  geom_line(aes(y = y)) + # classic log10 as black line
  geom_line(aes(y = z), color = "#0E70C0") + # pseudo-log10 as blue line
  # set up axis scale
  coord_cartesian(xlim = c(-10, 10), ylim = c(-1, 1)) +
  scale_x_continuous(breaks = seq(-10, 10, 2)) +
  scale_y_continuous(breaks = seq(-1, 1, .2)) +
  # label the curves with transformation names
  annotate(
    geom = "text",
    x = c(-6, 3.5),
    y = -.3,
    size = 4,
    fontface = "bold",
    label = c("pseudo-log10", "classic log10"),
    color = c("#0E70C0", "black")) +
  theme_minimal(base_size = 14) +
  geom_hline(yintercept = 0, color = "tomato", alpha = .4) +
  geom_vline(xintercept = 0, color = "tomato", alpha = .4) +
  # mark critical points
  annotate(geom = "point",
           x = c(-10, 0, 1, 10),
           y = c(-1, 0, 0, 1),
           size = 2)
p

This plot shows some nice properties of pseudo-log transformation:

pseudo-log10(x) is defined for all real numbers and is monotonically increasing.
pseudo-log10(0) = 0
pseudo-log10(−x) = −pseudo-log10(x)
If x ≫ 0, pseudo-log10(x) ≈ log10(x),
If x ≪ 0, pseudo-log10(x) ≈ −log10(|x|)
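
These properties can be traced back to the closed form of arsinh. The lines below are a brief sketch of the reasoning, assuming the definition used in the pseudoLog10 code earlier, i.e. arsinh(x/2) divided by ln 10; the symbol t is just a placeholder variable introduced here.

```latex
\begin{align*}
\text{For } x \gg 0:\quad
  \operatorname{arsinh}\!\left(\tfrac{x}{2}\right)
  &= \ln\!\left(\tfrac{x}{2} + \sqrt{\tfrac{x^{2}}{4} + 1}\right)
  \approx \ln\!\left(\tfrac{x}{2} + \tfrac{x}{2}\right) = \ln x, \\
\text{so}\quad \text{pseudo-log}_{10}(x)
  &\approx \frac{\ln x}{\ln 10} = \log_{10} x. \\[4pt]
\text{Odd symmetry:}\quad
  \operatorname{arsinh}(-t) &= -\operatorname{arsinh}(t)
  \quad\text{because}\quad
  \left(\sqrt{t^{2}+1} - t\right)\left(\sqrt{t^{2}+1} + t\right) = 1. \\[4pt]
\text{Near zero:}\quad
  \operatorname{arsinh}(t) &\approx t
  \quad\Rightarrow\quad
  \text{pseudo-log}_{10}(x) \approx \frac{x}{2\ln 10}.
\end{align*}
```

The last line is what makes the transform behave almost linearly around zero, while the first line explains why it merges into the classic logarithm for large values.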

In like manner, pseudo-logarithm of any base b (pseudo-logb) can be defined as

$$\text{pseudo-log}_{b}(x) = \frac{\operatorname{arsinh}(x/2)}{\ln b}$$

Pseudo-logb(x) has the following properties:

pseudo-logb(0) = 0
If x ≫ 0, pseudo-logb(x) ≈ logb(x),
If x ≪ 0, pseudo-logb(x) ≈ −logb(|x|)
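
To make the general-base properties concrete, here is a minimal base-R sketch. The helper name pseudo_log_b is hypothetical (it does not appear in the article), and it assumes the same arsinh(x / 2) form as pseudoLog10 with log(10) replaced by log(b).

```r
# Hypothetical helper (not from the article): pseudo-logarithm of base b,
# assuming the arsinh(x / 2) form used by pseudoLog10 above.
pseudo_log_b <- function(x, b = 10) {
  stopifnot(b > 0, b != 1)  # same restriction as for a classic logarithm base
  asinh(x / 2) / log(b)
}

x <- c(-1000, -1, 0, 1, 1000)

# Zero maps to zero, and the function is odd: f(-x) equals -f(x)
pseudo_log_b(0, b = 2)
all.equal(pseudo_log_b(-x, b = 2), -pseudo_log_b(x, b = 2))

# For large |x| the result is close to the signed classic logarithm
pseudo_log_b(1000, b = 2)    # compare with  log(1000, base = 2)
pseudo_log_b(-1000, b = 2)   # compare with -log(1000, base = 2)
```

For plotting, you would not normally need such a helper: as far as I know, the scales package provides pseudo_log_trans(sigma, base), with base e as its default, which is what the "pseudo_log" transformation name refers to in the ggplot2 code later in this article.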

Pseudo-Logarithm in Data Visualization

a) pseudo-log transform in histogram

The African population has a very skewed distribution. If we divide Africa into a grid of latitude and longitude cells and count the population in each cell, we can plot the population distribution as the histogram below.

Show the code

# the African population dataset
# install.packages("remotes") # if not already installed
# remotes::install_github("afrimapr/afrilearndata") # if fails, try restarting R
library(afrilearndata)
# packages for data cleanup
library(raster)
library(sp)

# Data cleanup
afripop_df <- afripop2020 %>%
  as.data.frame(xy = TRUE) %>%
  rename(pop = 3) %>%
  filter(!is.na(pop)) %>%
  as_tibble()

# mark some fixed population values for ease of comparison
myBreaks <- c(0, .1, 1, 10, 100, 1000, 5000, 10000, 20000)

# create a function creating histograms with specified transformation
hist <- function(
    transformation = "identity",
    title = "no transform") {
  p <- afripop_df %>%
    ggplot(aes(x = pop)) +
    geom_histogram(bins = 100) +
    scale_x_continuous(
      transform = transformation,
      breaks = myBreaks,
      labels = scales::comma,
      minor_breaks = NULL,
      name = NULL
    ) +
    theme_bw() +
    theme(
      legend.position = "bottom",
      plot.title = element_text(hjust = .5, face = "bold", size = 17, color = "turquoise4"),
      panel.grid.major.y = element_blank(),
      panel.grid.minor.y = element_blank(),
      axis.text.x = element_text(angle = 90, hjust = 1),
      panel.border = element_blank(),
      plot.background = element_rect(color = "black")
    ) +
    ggtitle(title)
  return(p)
}

# Draw histograms with specified transformations
h1 <- hist()
h2 <- hist(transformation = "log", title = "log")
h3 <- hist(transformation = "pseudo_log", title = "pseudo-log")
h4 <- hist(transformation = "log1p", title = "log ( 1 + x )")

# Plot all together
cowplot::plot_grid(h1, h2, h3, h4, nrow = 2)

Most places have a very low population density (including zero), and only a small number of cells contain a large population covering a wide numeric range. The classic log, pseudo-log, and log(1+x) transformations all remedy this skewness, each to a different extent. (The impact of the transformations on values of zero is not that obvious yet.)

b) pseudo-log transform in heatmap

The impact of the transformations is most pronounced when the data is visualized on a color scale, such as in the heatmaps below (drawn with a palette from the viridis family).

Show the code

# Create a function creating heatmaps with specified transformation
heat <- function(data,
                 transformation = "identity", # default: no transformation
                 title = "no transform") {
  data %>%
    ggplot() +
    # Draw heatmaps
    geom_raster(aes(x, y, fill = pop)) +
    coord_fixed(ratio = 1.1) +
    # Adjust color scale with transformation
    scale_fill_viridis_c(
      trans = transformation,
      option = "B",
      breaks = myBreaks,
      labels = myBreaks) +
    ggtitle(title) +
    # adjust color bar style
    guides(fill = guide_colorbar(
      barwidth = unit(7, "pt"),
      barheight = unit(220, "pt"),
      title = NULL,
      title.theme = element_text(hjust = .5, face = "bold"))) +
    # theme
    theme_void() +
    theme(plot.margin = margin(5, 5, 5, 5),
          plot.title = element_text(
            hjust = .5, face = "bold", size = 18, color = "turquoise4"))
}

# plot with different logarithmic transformations
heat1 <- afripop_df %>% heat()
heat2 <- afripop_df %>% heat(transformation = "log", title = "log")
heat3 <- afripop_df %>% heat(transformation = "pseudo_log", title = "pseudo-log")
heat4 <- afripop_df %>% heat(transformation = "log1p", title = "log (1 + x )")

cowplot::plot_grid(heat1, heat2, heat3, heat4, nrow = 2)

Without data transformation: the map is almost completely blacked out. The few scattered large values (depicted in brighter colors) are drowned out by the bulk of small numbers.
Classic logarithmic transform: it nicely unveils a data pattern. Most places in Egypt, however, are shown in grey (not part of the color scale): these places have population values of zero, for which the logarithm is not defined, so the transformed data (-Inf) are treated as “missing values”. In addition, the tiny fractional numbers (e.g., 0.0000001) in the dataset produce strongly negative transformed values, which occupy and waste a large part of the color scale.
Pseudo-log transform: it behaves like the classic logarithm for large numbers, but gradually transitions to a more “linear” scale as the values approach zero. It generates an impressive heatmap with a well-defined data pattern, highlighting the most populous geographical sites and the dire inhospitality of the vast Sahara desert.
log(1+x): a somewhat blunt yet effective way to neutralize zeros and small fractional numbers by adding 1 to them. Both the pseudo-log and log(1+x) transforms lead to a similar transformed data scale starting from 0, and properly highlight the most populous regions (which are of the most interest).
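
To see why the color scales differ so much, it helps to apply the three transformations to a handful of numbers. The sketch below uses made-up illustrative values (not taken from the dataset) and assumes that "pseudo_log" with its default settings corresponds to asinh(x / 2), i.e. base e.

```r
# Illustrative values only (not from the dataset):
# zero, a tiny fraction, and progressively larger counts
pop <- c(0, 1e-7, 1, 100, 10000)

comparison <- data.frame(
  pop        = pop,
  log        = log(pop),        # classic natural log: -Inf at zero
  pseudo_log = asinh(pop / 2),  # pseudo-log (base e, assumed default): 0 at zero
  log1p      = log1p(pop)       # log(1 + x): 0 at zero
)
print(comparison, digits = 4)
```

The classic log sends zero to -Inf and 1e-7 to about -16.1, so tiny fractions stretch the scale far into negative territory. Pseudo-log and log(1+x) both keep zero at 0 and squeeze tiny fractions next to it, while large values end up close to their classic logarithm.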

c) compare with root transform

It is interesting to compare pseudo-log transformation with the commonly used root transformation. A root transform with a base larger than 1, e.g., square root and cube root, is equivalent to a power transform with an exponent smaller than 1 (for example, the cube root of x is x^(1/3)). It shrinks values bigger than 1, and reduces larger values more strongly than smaller values. This squeezes values closer to each other to fit into a shorter range, making them easier to visualize.

Show the code

# Create a function with root transformation (base specified by 'a')
r <- function(a) {
  # perform root transformation (power with fractional exponent)
  heat(afripop_df %>% mutate(pop = pop^(1 / a)), transformation = "identity") +
    # update the color scale
    scale_fill_viridis_c(
      option = "B",
      breaks = myBreaks^(1 / a),
      # reverse back to original data before transformation
      labels = function(x) {round((x^a), 4)}
    ) +
    ggtitle(paste("root base", a))
}

r3 <- r(a = 3)
r5 <- r(a = 5)
r7 <- r(a = 7)
r10 <- r(a = 10)

cowplot::plot_grid(r3, r5, r7, r10, nrow = 2)

The problem with root transformation is its opposite effect on values between 0 and 1: it enlarges them rather than shrinking them. This exaggerates the difference between fractional numbers and zero. Because of this, the tiny fractional population densities in the Sahara are unrealistically blown up. In addition, zeros in Egypt are left unaffected, which results in a sharp discontinuity between transformed fractional numbers and zero. This shows up as an abrupt color transition between Egypt (blacked out) and elsewhere, especially with a large root base, as in the 4th panel.
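
A few numbers make this discontinuity visible. The sketch below again uses made-up illustrative values; root is a hypothetical helper for x^(1/a), and asinh(x / 2) stands in for the pseudo-log (base e) as a point of comparison.

```r
# Illustrative values only: zero, tiny fractions, and larger counts
pop <- c(0, 1e-7, 1e-3, 1, 1000)

# Hypothetical helper: root transform of "base" a, i.e. x^(1/a)
root <- function(x, a) x^(1 / a)

roots <- data.frame(
  pop        = pop,
  root3      = root(pop, 3),
  root10     = root(pop, 10),
  pseudo_log = asinh(pop / 2)  # pseudo-log (base e) for comparison
)
print(roots, digits = 3)
```

With a = 10, a value of 1e-7 is mapped to roughly 0.2, about a tenth of what 1000 is mapped to (roughly 2.0), while zero stays exactly at zero. The pseudo-log, in contrast, keeps 1e-7 practically indistinguishable from zero.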

