Pack Columns into a Dataframe-Column, and Unpack a Dataframe-Column into Separate Columns

pack() mimics the nested column headers popularly used in Excel, and clusters multiple columns into a single dataframe-column. unpack() is the reverse procedure, and turns the single packed data-frame column into separate columns.

pack() is barely used in practice, while unpack() is generally a more useful function, as it turns complicated nested structure into simple one. We’ll first discuss the basics of the two functions, and then demonstrate the use of unpack() in the context of a capture group.

This tutorial covers the following content:


Pack multiple columns into a single dataframe-column

In this example, the columns Sepal.Length and Sepal.Width are packed into a single dataframe-column Sepal, and Petal.Length and Petal.Width are packed into Petal. Selection helpers, e.g., starts_with(), can be used to select a range of columns to be packed up.

library(tidyr)library(dplyr)
iris %>% as_tibble() %>% pack(Sepal = starts_with("Sepal"), Petal = starts_with("Petal"))

Output:

# A tibble: 150 × 3
Species Sepal$Sepal.Length $Sepal.Width Petal$Petal.Length $Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.6 3.1 1.5 0.2
5 setosa 5 3.6 1.4 0.2
# ℹ 145 more rows

To simplify columns names, you can use .names_sep to strip off the common prefix before the specified separator in column names.

a <- iris %>% as_tibble() %>%   pack(Sepal = starts_with("Sepal"),       Petal = starts_with("Petal"),       .names_sep = ".")a

Output:

# A tibble: 150 × 3
Species Sepal$Length $Width Petal$Length $Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.6 3.1 1.5 0.2
5 setosa 5 3.6 1.4 0.2
# ℹ 145 more rows

You can extract Sepal in two ways with a slightly different effect:

# return two unpacked columnsa$Sepal

Output:

# A tibble: 150 × 2
Length Width
<dbl> <dbl>
1 5.1 3.5
2 4.9 3
3 4.7 3.2
4 4.6 3.1
5 5 3.6
# ℹ 145 more rows
# retain dataframe-column formata %>% select(Sepal)

Output:

# A tibble: 150 × 1
Sepal$Length $Width
<dbl> <dbl>
1 5.1 3.5
2 4.9 3
3 4.7 3.2
4 4.6 3.1
5 5 3.6
# ℹ 145 more rows

Unpack the dataframe-column

Use unpack() to release the clustered dataframe-column into separate columns.

a %>% unpack(Sepal)

Output:

# A tibble: 150 × 4
Species Length Width Petal$Length $Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.6 3.1 1.5 0.2
5 setosa 5 3.6 1.4 0.2
# ℹ 145 more rows
a %>% unpack(c(Sepal, Petal), names_sep = "_")

Output:

# A tibble: 150 × 5
Species Sepal_Length Sepal_Width Petal_Length Petal_Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.6 3.1 1.5 0.2
5 setosa 5 3.6 1.4 0.2
# ℹ 145 more rows

Use unpack() with capture groups

A capture group is a part of a regular expression that is enclosed in parentheses (). str_match() in the stringr package is a nice tool to capture and extract the matched patterns, and returns a matrix. This matrix can be embedded in a tibble as a matrix-column, then converted to a tibble-column, and further unpacked to release the individual columns. The following example demonstrates this procedure. (A detailed instruction about the preparation of dataset x can be found here.)

library(tidyr)library(stringr)
# create a named capture groupnamed.capture <- "(?<status>new)_?(?<type>.*)_(?<gender>.)(?<age>.*)"
x <- who %>% # select top-20 rows containing most outbreaks in male at age 15-24 slice_max(order_by = new_sp_m1524, n = 20) %>% # convert to tidy structure pivot_longer(-c(1:4), names_to = "condition", values_to = "count") %>% # split the `condition` column into 4 separate columns # with values extracted from the defined capture group # return the captured groups as a tibble-column mutate( a = str_match(condition, named.capture) %>% as_tibble(), .keep = "unused")x

Output:

# A tibble: 1,120 × 6
country iso2 iso3 year count a$V1 $status $type $gender $age
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 India IN IND 2010 4871 new_sp_m014 new sp m 014
2 India IN IND 2010 78278 new_sp_m1524 new sp m 1524
3 India IN IND 2010 82757 new_sp_m2534 new sp m 2534
4 India IN IND 2010 90440 new_sp_m3544 new sp m 3544
5 India IN IND 2010 81210 new_sp_m4554 new sp m 4554
# ℹ 1,115 more rows
# unpack the single tibble-column into separate columnsx %>% unpack(a)

Output:

# A tibble: 1,120 × 10
country iso2 iso3 year count V1 status type gender age
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 India IN IND 2010 4871 new_sp_m014 new sp m 014
2 India IN IND 2010 78278 new_sp_m1524 new sp m 1524
3 India IN IND 2010 82757 new_sp_m2534 new sp m 2534
4 India IN IND 2010 90440 new_sp_m3544 new sp m 3544
5 India IN IND 2010 81210 new_sp_m4554 new sp m 4554
# ℹ 1,115 more rows