Gather Columns into Longer and Narrower Dataset (3/4): Deal with Jammed Variables in Column Names

A more challenging situation occurs when you have multiple variables crammed into the column names. For example, consider the who dataset from the World Health Organization Global Tuberculosis Report. For ease of demo, we’ll use a subset of this data: using slice_max() to select the top 20 rows of records of the most tuberculosis outbreaks.

library(tidyr)library(dplyr)
# select top-20 rows containing most outbreaks in male at age 15-24 who.max <- who %>% slice_max(order_by = new_sp_m1524, n = 20)
who.max

Output:

# A tibble: 20 × 60
country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 new_sp_m4554 new_sp_m5564 new_sp_m65 new_sp_f014 new_sp_f1524 new_sp_f2534 new_sp_f3544 new_sp_f4554 new_sp_f5564 new_sp_f65 new_sn_m014 new_sn_m1524 new_sn_m2534 new_sn_m3544 new_sn_m4554 new_sn_m5564 new_sn_m65
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 India IN IND 2010 4871 78278 82757 90440 81210 60766 38442 8544 53415 49425 34035 22719 15527 9735 NA NA NA NA NA NA NA
2 India IN IND 2009 5001 78177 84003 90830 80097 59163 37419 8576 51945 49747 33754 22032 14929 8944 NA NA NA NA NA NA NA
3 India IN IND 2011 4649 78096 82762 89706 82921 63625 42443 8336 53958 49227 34698 23977 17182 10731 NA NA NA NA NA NA NA
4 India IN IND 2008 4648 77121 83798 90498 78815 56928 36079 8319 51485 49887 33664 21486 14407 8357 NA NA NA NA NA NA NA
5 India IN IND 2012 4697 75502 79594 88111 82356 63814 41322 8260 53975 47511 33378 23267 17300 10502 NA NA NA NA NA NA NA
6 India IN IND 2007 4305 73947 83850 88045 76408 53414 31922 7575 50289 49519 32407 20316 13195 7395 NA NA NA 250051 NA NA NA
7 India IN IND 2006 3566 68346 79037 82939 71621 49320 28716 6963 47702 47420 31128 18870 11752 6417 NA NA NA NA NA NA NA
8 India IN IND 2005 3185 62620 74678 76870 64843 43038 24726 6292 45136 45629 28577 17042 10513 5408 NA NA NA NA NA NA NA
9 India IN IND 2004 3018 57208 72132 74450 62173 40769 22388 5860 41017 42808 27000 16121 9705 5016 NA NA NA NA NA NA NA
10 India IN IND 2003 2411 47251 61758 63587 52865 33739 18018 4745 34511 36317 23320 14055 8322 3985 NA NA NA NA NA NA NA
# ℹ 10 more rows
# ℹ 35 more variables: new_sn_f014 <dbl>, new_sn_f1524 <dbl>, new_sn_f2534 <dbl>, new_sn_f3544 <dbl>, new_sn_f4554 <dbl>, new_sn_f5564 <dbl>, new_sn_f65 <dbl>, new_ep_m014 <dbl>, new_ep_m1524 <dbl>, new_ep_m2534 <dbl>, new_ep_m3544 <dbl>, new_ep_m4554 <dbl>, new_ep_m5564 <dbl>, new_ep_m65 <dbl>,
# new_ep_f014 <dbl>, new_ep_f1524 <dbl>, new_ep_f2534 <dbl>, new_ep_f3544 <dbl>, new_ep_f4554 <dbl>, new_ep_f5564 <dbl>, new_ep_f65 <dbl>, newrel_m014 <dbl>, newrel_m1524 <dbl>, newrel_m2534 <dbl>, newrel_m3544 <dbl>, newrel_m4554 <dbl>, newrel_m5564 <dbl>, newrel_m65 <dbl>, newrel_f014 <dbl>,
# newrel_f1524 <dbl>, newrel_f2534 <dbl>, newrel_f3544 <dbl>, newrel_f4554 <dbl>, newrel_f5564 <dbl>, newrel_f65 <dbl>

country, iso2, iso3, and year are already variables, so they can be left as is. But the columns from new_sp_m014 to the last column newrel_f65 encode four pieces of information in their names:

  • The new_ and new prefix indicates that the counts are new tuberculosis cases.
  • sp, sn, ep, and rel describe the diagnosis result: sp, positive pulmonary smear; sn, negative pulmonary smear; ep, extra pulmonary; rel, relapse.
  • m an f indicates the gender: m, male; f, female.
  • The digits show the age ranges: 014, 0-14 years of age; 1524, 15-24 years of age; 2535, 25-35 years of age, etc.

It’s desirable to break these four pieces of information into four separate columns during pivoting. We can use names_to to specify names of the four new column to create: status, diagnosis, gender and age. Meanwhile, we use names_pattern to extract values for each of the four new columns: you give it a regular expression containing capture groups defined by a pair of parentheses (), and it puts each capture group into each associated column.

who.max %>% pivot_longer(  cols = new_sp_m014 : last_col(),  # create 3 new columns  names_to = c("status", "diagnosis", "gender", "age"),  # define the associated capture group for each new column  names_pattern = "(new)_?(.*)_(.)(.*)",  values_to = "count")

Output:

# A tibble: 1,120 × 9
country iso2 iso3 year status diagnosis gender age count
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 India IN IND 2010 new sp m 014 4871
2 India IN IND 2010 new sp m 1524 78278
3 India IN IND 2010 new sp m 2534 82757
4 India IN IND 2010 new sp m 3544 90440
5 India IN IND 2010 new sp m 4554 81210
6 India IN IND 2010 new sp m 5564 60766
7 India IN IND 2010 new sp m 65 38442
8 India IN IND 2010 new sp f 014 8544
9 India IN IND 2010 new sp f 1524 53415
10 India IN IND 2010 new sp f 2534 49425
# ℹ 1,110 more rows

The above script is equivalent to first calling pivot_longer(names_to = "condition", values_to = "count"), and then split column condition into separate columns (first extract the four pieces of information with capture groups, and then put the pieces into four separate columns using stringr::str_match() as demonstrated here).

We could go one step further using names_transform and the readr functions to convert gender and age to factors. In the following script, we create lambda functions on the fly, marked by the tilde sign ~, and use .x to refer to columns being processed.

who %>% pivot_longer(  cols = new_sp_m014 : last_col(),  names_to = c("diagnosis", "gender", "age"),  names_pattern = "new_?(.*)_(.)(.*)",  # convert gender and age to normal and ordered factor, respectively  names_transform = list(    gender = ~ readr::parse_factor(.x, levels = c("f", "m")),    age = ~ readr::parse_factor(      .x, levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),      ordered = T)  ),  values_to = "count")

Output:

# A tibble: 405,440 × 8
country iso2 iso3 year diagnosis gender age count
<chr> <chr> <chr> <dbl> <chr> <fct> <ord> <dbl>
1 Afghanistan AF AFG 1980 sp m 014 NA
2 Afghanistan AF AFG 1980 sp m 1524 NA
3 Afghanistan AF AFG 1980 sp m 2534 NA
4 Afghanistan AF AFG 1980 sp m 3544 NA
5 Afghanistan AF AFG 1980 sp m 4554 NA
6 Afghanistan AF AFG 1980 sp m 5564 NA
7 Afghanistan AF AFG 1980 sp m 65 NA
8 Afghanistan AF AFG 1980 sp f 014 NA
9 Afghanistan AF AFG 1980 sp f 1524 NA
10 Afghanistan AF AFG 1980 sp f 2534 NA
# ℹ 405,430 more rows