Spread Columns into Wider Dataset (2/3): Aggregate the Data During Data Pivot

You can use pivot_wider() to perform simple aggregation. Consider the following warpbreaks dataset which is built in base R.

library(tidyr)library(dplyr)
# convert to a tibble for the better print methodwarpbreaks <- warpbreaks %>% as_tibble() %>% select(wool, tension, breaks)
warpbreaks

Output:

# A tibble: 54 × 3
wool tension breaks
<fct> <fct> <dbl>
1 A L 26
2 A L 30
3 A L 54
4 A L 25
5 A L 70
6 A L 52
7 A L 51
8 A L 26
9 A L 67
10 A M 18
# ℹ 44 more rows

This is a designed experiment with 9 replicates for every combination of wool type (A and B) and tension magnitude (L, M, H).

# explore the dataset structure by counting the number of# observations (rows) of different combinations of "wool" and "tension"warpbreaks %>% count(wool, tension)

Output:

# A tibble: 6 × 3
wool tension n
<fct> <fct> <int>
1 A L 9
2 A M 9
3 A H 9
4 B L 9
5 B M 9
6 B H 9

What happens if we attempt to spread the unique levels of wool into separate columns?

# Warning message:# Values are not uniquely identified; output will contain list-cols.# * Use `values_fn = list` to suppress this warning.# * Use `values_fn = length` to identify where the duplicates arise# * Use `values_fn = {summary_fun}` to summarise duplicates 
a <- warpbreaks %>% pivot_wider( names_from = wool, values_from = breaks )a

Output:

# A tibble: 3 × 3
tension A B
<fct> <list> <list>
1 L <dbl [9]> <dbl [9]>
2 M <dbl [9]> <dbl [9]>
3 H <dbl [9]> <dbl [9]>

The default behavior produces list-columns, with a warning that each cell in the output corresponds to multiple cells in the input. We see from the code below that in the list-columns, each cell is a vector of values of the 9 experimental replicates.

a$A[[1]]

Output:

[1] 26 30 54 25 70 52 51 26 67

A more useful output would be summary statistics, e.g. the mean breaks for each combination of wool and tension.

warpbreaks %>%   pivot_wider(    names_from = wool,     values_from = breaks,    # calculate the mean of 9 replicates for each cell in list-columns    values_fn = list(breaks = mean)   )

Output:

# A tibble: 3 × 3
tension A B
<fct> <dbl> <dbl>
1 L 44.6 28.2
2 M 24 28.8
3 H 24.6 18.8

The code above is equivalent to the code below: creating your own summary dataset first, which is then pivoted to wider format.

warpbreaks %>%   # summarize mean "breaks" for each "wool" type at each "tension" test  group_by(wool, tension) %>%   summarise(breaks = mean(breaks)) %>%   # spread wool types into separate columns  pivot_wider(    names_from = wool,     values_from = breaks)

For more complex summary operations, it is recommend to summarize before reshaping (e.g., with group_by() and summarize()), but for simple cases as this example, it’s often convenient to summarize within pivot_wider().