Gather Columns into Longer and Narrower Dataset (3/4): Deal with Jammed Variables in Column Names

A more challenging situation occurs when you have multiple variables crammed into the column names. For example, consider the who dataset from the World Health Organization Global Tuberculosis Report. For ease of demo, we’ll use a subset of this data: using slice_max() to select the top 20 rows of records of the most tuberculosis outbreaks.

library(tidyr)library(dplyr)
# select top-20 rows containing most outbreaks in male at age 15-24 who.max <- who %>%   slice_max(order_by = new_sp_m1524, n = 20)
who.max

Output:
# A tibble: 20 × 60
   country iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 new_sp_m4554 new_sp_m5564 new_sp_m65 new_sp_f014 new_sp_f1524 new_sp_f2534 new_sp_f3544 new_sp_f4554 new_sp_f5564 new_sp_f65 new_sn_m014 new_sn_m1524 new_sn_m2534 new_sn_m3544 new_sn_m4554 new_sn_m5564 new_sn_m65
   <chr>   <chr> <chr> <dbl>       <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>      <dbl>       <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>      <dbl>       <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>      <dbl>
 1 India   IN    IND    2010        4871        78278        82757        90440        81210        60766      38442        8544        53415        49425        34035        22719        15527       9735          NA           NA           NA           NA           NA           NA         NA
 2 India   IN    IND    2009        5001        78177        84003        90830        80097        59163      37419        8576        51945        49747        33754        22032        14929       8944          NA           NA           NA           NA           NA           NA         NA
 3 India   IN    IND    2011        4649        78096        82762        89706        82921        63625      42443        8336        53958        49227        34698        23977        17182      10731          NA           NA           NA           NA           NA           NA         NA
 4 India   IN    IND    2008        4648        77121        83798        90498        78815        56928      36079        8319        51485        49887        33664        21486        14407       8357          NA           NA           NA           NA           NA           NA         NA
 5 India   IN    IND    2012        4697        75502        79594        88111        82356        63814      41322        8260        53975        47511        33378        23267        17300      10502          NA           NA           NA           NA           NA           NA         NA
 6 India   IN    IND    2007        4305        73947        83850        88045        76408        53414      31922        7575        50289        49519        32407        20316        13195       7395          NA           NA           NA       250051           NA           NA         NA
 7 India   IN    IND    2006        3566        68346        79037        82939        71621        49320      28716        6963        47702        47420        31128        18870        11752       6417          NA           NA           NA           NA           NA           NA         NA
 8 India   IN    IND    2005        3185        62620        74678        76870        64843        43038      24726        6292        45136        45629        28577        17042        10513       5408          NA           NA           NA           NA           NA           NA         NA
 9 India   IN    IND    2004        3018        57208        72132        74450        62173        40769      22388        5860        41017        42808        27000        16121         9705       5016          NA           NA           NA           NA           NA           NA         NA
10 India   IN    IND    2003        2411        47251        61758        63587        52865        33739      18018        4745        34511        36317        23320        14055         8322       3985          NA           NA           NA           NA           NA           NA         NA
# ℹ 10 more rows
# ℹ 35 more variables: new_sn_f014 <dbl>, new_sn_f1524 <dbl>, new_sn_f2534 <dbl>, new_sn_f3544 <dbl>, new_sn_f4554 <dbl>, new_sn_f5564 <dbl>, new_sn_f65 <dbl>, new_ep_m014 <dbl>, new_ep_m1524 <dbl>, new_ep_m2534 <dbl>, new_ep_m3544 <dbl>, new_ep_m4554 <dbl>, new_ep_m5564 <dbl>, new_ep_m65 <dbl>,
#   new_ep_f014 <dbl>, new_ep_f1524 <dbl>, new_ep_f2534 <dbl>, new_ep_f3544 <dbl>, new_ep_f4554 <dbl>, new_ep_f5564 <dbl>, new_ep_f65 <dbl>, newrel_m014 <dbl>, newrel_m1524 <dbl>, newrel_m2534 <dbl>, newrel_m3544 <dbl>, newrel_m4554 <dbl>, newrel_m5564 <dbl>, newrel_m65 <dbl>, newrel_f014 <dbl>,
#   newrel_f1524 <dbl>, newrel_f2534 <dbl>, newrel_f3544 <dbl>, newrel_f4554 <dbl>, newrel_f5564 <dbl>, newrel_f65 <dbl>

country, iso2, iso3, and year are already variables, so they can be left as is. But the columns from new_sp_m014 to the last column newrel_f65 encode four pieces of information in their names:

The new_ and new prefix indicates that the counts are new tuberculosis cases.
sp, sn, ep, and rel describe the diagnosis result: sp, positive pulmonary smear; sn, negative pulmonary smear; ep, extra pulmonary; rel, relapse.
m an f indicates the gender: m, male; f, female.
The digits show the age ranges: 014, 0-14 years of age; 1524, 15-24 years of age; 2535, 25-35 years of age, etc.

It’s desirable to break these four pieces of information into four separate columns during pivoting. We can use names_to to specify names of the four new column to create: status, diagnosis, gender and age. Meanwhile, we use names_pattern to extract values for each of the four new columns: you give it a regular expression containing capture groups defined by a pair of parentheses (), and it puts each capture group into each associated column.

who.max %>% pivot_longer(  cols = new_sp_m014 : last_col(),  # create 3 new columns  names_to = c("status", "diagnosis", "gender", "age"),  # define the associated capture group for each new column  names_pattern = "(new)_?(.*)_(.)(.*)",  values_to = "count")

Output:

# A tibble: 1,120 × 9 country iso2 iso3 year status diagnosis gender age count <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> 1 India IN IND 2010 new sp m 014 4871 2 India IN IND 2010 new sp m 1524 78278 3 India IN IND 2010 new sp m 2534 82757 4 India IN IND 2010 new sp m 3544 90440 5 India IN IND 2010 new sp m 4554 81210 6 India IN IND 2010 new sp m 5564 60766 7 India IN IND 2010 new sp m 65 38442 8 India IN IND 2010 new sp f 014 8544 9 India IN IND 2010 new sp f 1524 53415 10 India IN IND 2010 new sp f 2534 49425 # ℹ 1,110 more rows

The above script is equivalent to first calling pivot_longer(names_to = "condition", values_to = "count"), and then split column condition into separate columns (first extract the four pieces of information with capture groups, and then put the pieces into four separate columns using stringr::str_match() as demonstrated here).

We could go one step further using names_transform and the readr functions to convert gender and age to factors. In the following script, we create lambda functions on the fly, marked by the tilde sign ~, and use .x to refer to columns being processed.

who %>% pivot_longer(  cols = new_sp_m014 : last_col(),  names_to = c("diagnosis", "gender", "age"),  names_pattern = "new_?(.*)_(.)(.*)",  # convert gender and age to normal and ordered factor, respectively  names_transform = list(    gender = ~ readr::parse_factor(.x, levels = c("f", "m")),    age = ~ readr::parse_factor(      .x, levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),      ordered = T)  ),  values_to = "count")

Output:

# A tibble: 405,440 × 8 country iso2 iso3 year diagnosis gender age count <chr> <chr> <chr> <dbl> <chr> <fct> <ord> <dbl> 1 Afghanistan AF AFG 1980 sp m 014 NA 2 Afghanistan AF AFG 1980 sp m 1524 NA 3 Afghanistan AF AFG 1980 sp m 2534 NA 4 Afghanistan AF AFG 1980 sp m 3544 NA 5 Afghanistan AF AFG 1980 sp m 4554 NA 6 Afghanistan AF AFG 1980 sp m 5564 NA 7 Afghanistan AF AFG 1980 sp m 65 NA 8 Afghanistan AF AFG 1980 sp f 014 NA 9 Afghanistan AF AFG 1980 sp f 1524 NA 10 Afghanistan AF AFG 1980 sp f 2534 NA # ℹ 405,430 more rows

Amazing eBook to learn ggplot2 FAST & EASY