Gather Columns into Longer and Narrower Dataset (4/4): Deal with Multiple observations Per Row

So far, we have been working with data frames that have one observation per row, but many important pivoting problems involve multiple observations per row. You can usually recognize this case when the input column names contain both variable names (as they correctly should be) and observation names (which should otherwise be recorded in different rows). In this section, you’ll learn how to pivot this sort of data.

e.g. 1. In the following dataset, each child has two pieces of record, gender and dob (date of birth). child1 and child2 are essentially different observations; instead of occupying different rows as they should be in a tidy dataset, they occupy different columns, rending the dataset not tidy enough.

library(tidyr)library(dplyr)
family <- tribble(  ~family,  ~dob_child1,  ~dob_child2, ~gender_child1, ~gender_child2,  1L, "2021-02-23", "2024-06-03",             1L,             2L,  2L, "1920-06-22",           NA,             2L,             NA,  3L, "2019-07-11", "2023-12-26",             2L,             2L,  4L, "2024-10-10", "2024-10-10",             1L,             1L,  5L, "2018-12-05", "2023-02-28",             2L,             1L,)family <- family %>% mutate_at(vars(starts_with("dob")), readr::parse_date)family

Output:

# A tibble: 5 × 5 family dob_child1 dob_child2 gender_child1 gender_child2 <int> <date> <date> <int> <int> 1 1 2021-02-23 2024-06-03 1 2 2 2 1920-06-22 NA 2 NA 3 3 2019-07-11 2023-12-26 2 2 4 4 2024-10-10 2024-10-10 1 1 5 5 2018-12-05 2023-02-28 2 1

To tidy up the dataset, child1 and child2 should be cell values in the same column, and gender and dob should be reserved as separate columns. The following script can be conceptualized as pivoting the data at the child level while maintaining the original structure of dob and gender.

family %>%   pivot_longer(    -family,     names_sep = "_",     names_to = c(".value", "child"),     values_drop_na = TRUE  )

Output:

# A tibble: 9 × 4 family child dob gender <int> <chr> <date> <int> 1 1 child1 2021-02-23 1 2 1 child2 2024-06-03 2 3 2 child1 1920-06-22 2 4 3 child1 2019-07-11 2 5 3 child2 2023-12-26 2 6 4 child1 2024-10-10 1 7 4 child2 2024-10-10 1 8 5 child1 2018-12-05 2 9 5 child2 2023-02-28 1

names_sep = "_" indicates that the underscore _ is used as the separator to split the original column names into two parts: the part before the underscore, and the part after.
The generic function of names_to = "x" is to turn the input column names as cell values under the x variable. In this case, the second part of the column names, child1 and child2, are turned into cell values under the new column child. The string .value is a special placeholder that, in this case, matches the first part of column names, i.e., dob and gender, and serves two roles: 1) the matched part are reserved as new column names, and 2) .value specifies the values being measured for the new columns in the output (reminiscent to the argument values_to).

e.g.2. Anscombe’s quartet is a famous example in statistics illustrating the importance of visualizing data distribution rather than relying solely on summary statistics. It consists of four datasets, each containing two variables (x and y). Despite having different data distribution, the four datasets share identical or very similar summary statistics such as mean, standard deviation, correlation, and regression lines. It highlights the importance of visualizing the data distribution, as different datasets with the same summary statistics can have vastly different characteristics.

anscombe

Output:

x1 x2 x3 x4 y1 y2 y3 y4 1 10 10 10 8 8.04 9.14 7.46 6.58 2 8 8 8 8 6.95 8.14 6.77 5.76 3 13 13 13 8 7.58 8.74 12.74 7.71 4 9 9 9 8 8.81 8.77 7.11 8.84 5 11 11 11 8 8.33 9.26 7.81 8.47 6 14 14 14 8 9.96 8.10 8.84 7.04 7 6 6 6 8 7.24 6.13 6.08 5.25 8 4 4 4 19 4.26 3.10 5.39 12.50 9 12 12 12 8 10.84 9.13 8.15 5.56 10 7 7 7 8 4.82 7.26 6.42 7.91 11 5 5 5 8 5.68 4.74 5.73 6.89

Below we’ll produce a dataset with columns set, x and y only. For the regular expression (.)(.), each dot is a wildcard representing any character, and the entire expression matches any two single consecutive characters, with each character being an individual capture group.

a <- anscombe %>%   pivot_longer(    everything(),    names_to = c(".value", "set"),    names_pattern = "(.)(.)"  ) %>%   arrange(set)a

Output:

# A tibble: 44 × 3 set x y <chr> <dbl> <dbl> 1 1 10 8.04 2 1 8 6.95 3 1 13 7.58 4 1 9 8.81 5 1 11 8.33 6 1 14 9.96 7 1 6 7.24 8 1 4 4.26 9 1 12 10.8 10 1 7 4.82 # ℹ 34 more rows

To demonstrate the continence brought by the tidy structure, the output can be readily streamlined with ggplot2 to visualize all datasets at the same time.

library(ggplot2)a %>%   ggplot(aes(x = x, y = y)) + geom_point() +  facet_wrap(~set, nrow = 1) +  theme_bw()

e.g.3. The same tidying approach used for anscombe above can be used to tidy up the dataset below.

pnl <- tibble(  x = 1:4,   a = c(1, 1,0, 0),  b = c(0, 1, 1, 1),  y1 = rnorm(4),  y2 = rnorm(4),  z1 = rep(3, 4),  z2 = rep(-2, 4))
pnl

Output:
# A tibble: 4 × 7
      x     a     b    y1     y2    z1    z2
  <int> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
1     1     1     0 0.619  0.485     3    -2
2     2     1     1 0.855 -0.886     3    -2
3     3     0     1 2.32  -0.808     3    -2
4     4     0     1 1.64   1.36      3    -2

pnl %>%   pivot_longer(    !c(x, a, b),     names_to = c(".value", "time"),     names_pattern = "(.)(.)"  )

Output:
# A tibble: 8 × 6
      x     a     b time       y     z
  <int> <dbl> <dbl> <chr>  <dbl> <dbl>
1     1     1     0 1      0.619     3
2     1     1     0 2      0.485    -2
3     2     1     1 1      0.855     3
4     2     1     1 2     -0.886    -2
5     3     0     1 1      2.32      3
6     3     0     1 2     -0.808    -2
7     4     0     1 1      1.64      3
8     4     0     1 2      1.36     -2

Amazing eBook to learn ggplot2 FAST & EASY