Spread Columns into Wider Dataset (2/3): Aggregate the Data During Data Pivot

You can use pivot_wider() to perform simple aggregation. Consider the following warpbreaks dataset which is built in base R.

library(tidyr)library(dplyr)
# convert to a tibble for the better print methodwarpbreaks <- warpbreaks %>%   as_tibble() %>%   select(wool, tension, breaks)
warpbreaks

Output:

# A tibble: 54 × 3 wool tension breaks <fct> <fct> <dbl> 1 A L 26 2 A L 30 3 A L 54 4 A L 25 5 A L 70 6 A L 52 7 A L 51 8 A L 26 9 A L 67 10 A M 18 # ℹ 44 more rows

This is a designed experiment with 9 replicates for every combination of wool type (A and B) and tension magnitude (L, M, H).

# explore the dataset structure by counting the number of# observations (rows) of different combinations of "wool" and "tension"warpbreaks %>% count(wool, tension)

Output:
# A tibble: 6 × 3
  wool  tension     n
  <fct> <fct>   <int>
1 A     L           9
2 A     M           9
3 A     H           9
4 B     L           9
5 B     M           9
6 B     H           9

What happens if we attempt to spread the unique levels of wool into separate columns?

# Warning message:# Values are not uniquely identified; output will contain list-cols.# * Use `values_fn = list` to suppress this warning.# * Use `values_fn = length` to identify where the duplicates arise# * Use `values_fn = {summary_fun}` to summarise duplicates 
a <- warpbreaks %>%   pivot_wider(    names_from = wool,    values_from = breaks  )a

Output:
# A tibble: 3 × 3
  tension A         B
  <fct>   <list>    <list>
1 L       <dbl [9]> <dbl [9]>
2 M       <dbl [9]> <dbl [9]>
3 H       <dbl [9]> <dbl [9]>

The default behavior produces list-columns, with a warning that each cell in the output corresponds to multiple cells in the input. We see from the code below that in the list-columns, each cell is a vector of values of the 9 experimental replicates.

a$A[[1]]

Output:
[1] 26 30 54 25 70 52 51 26 67

A more useful output would be summary statistics, e.g. the mean breaks for each combination of wool and tension.

warpbreaks %>%   pivot_wider(    names_from = wool,     values_from = breaks,    # calculate the mean of 9 replicates for each cell in list-columns    values_fn = list(breaks = mean)   )

Output:
# A tibble: 3 × 3
  tension     A     B
  <fct>   <dbl> <dbl>
1 L        44.6  28.2
2 M        24    28.8
3 H        24.6  18.8

The code above is equivalent to the code below: creating your own summary dataset first, which is then pivoted to wider format.

warpbreaks %>%   # summarize mean "breaks" for each "wool" type at each "tension" test  group_by(wool, tension) %>%   summarise(breaks = mean(breaks)) %>%   # spread wool types into separate columns  pivot_wider(    names_from = wool,     values_from = breaks)

For more complex summary operations, it is recommend to summarize before reshaping (e.g., with group_by() and summarize()), but for simple cases as this example, it’s often convenient to summarize within pivot_wider().

Amazing eBook to learn ggplot2 FAST & EASY