Spread Columns into Wider Dataset (2/3): Aggregate the Data During Data Pivot
You can use pivot_wider() to perform simple aggregation. Consider the following warpbreaks dataset which is built in base R.
library(tidyr)library(dplyr) # convert to a tibble for the better print methodwarpbreaks <- warpbreaks %>%as_tibble() %>%select(wool, tension, breaks) warpbreaks
Output:
# A tibble: 54 × 3 wool tension breaks <fct> <fct> <dbl> 1 A L 26 2 A L 30 3 A L 54 4 A L 25 5 A L 70 6 A L 52 7 A L 51 8 A L 26 9 A L 67 10 A M 18 # ℹ 44 more rows
This is a designed experiment with 9 replicates for every combination of wool type (A and B) and tension magnitude (L, M, H).
# explore the dataset structure by counting the number of# observations (rows) of different combinations of "wool" and "tension"warpbreaks %>%count(wool, tension)
Output:
# A tibble: 6 × 3 wool tension n <fct> <fct> <int> 1 A L 9 2 A M 9 3 A H 9 4 B L 9 5 B M 9 6 B H 9
What happens if we attempt to spread the unique levels of wool into separate columns?
# Warning message:# Values are not uniquely identified; output will contain list-cols.# * Use `values_fn = list` to suppress this warning.# * Use `values_fn = length` to identify where the duplicates arise# * Use `values_fn = {summary_fun}` to summarise duplicates a <- warpbreaks %>%pivot_wider(names_from = wool,values_from = breaks )a
Output:
# A tibble: 3 × 3 tension A B <fct> <list> <list> 1 L <dbl [9]> <dbl [9]> 2 M <dbl [9]> <dbl [9]> 3 H <dbl [9]> <dbl [9]>
The default behavior produces list-columns, with a warning that each cell in the output corresponds to multiple cells in the input. We see from the code below that in the list-columns, each cell is a vector of values of the 9 experimental replicates.
a$A[[1]]
Output:
[1] 26 30 54 25 70 52 51 26 67
A more useful output would be summary statistics, e.g. the mean breaks for each combination of wool and tension.
warpbreaks %>%pivot_wider(names_from = wool, values_from = breaks,# calculate the mean of 9 replicates for each cell in list-columnsvalues_fn =list(breaks = mean) )
Output:
# A tibble: 3 × 3 tension A B <fct> <dbl> <dbl> 1 L 44.6 28.2 2 M 24 28.8 3 H 24.6 18.8
The code above is equivalent to the code below: creating your own summary dataset first, which is then pivoted to wider format.
warpbreaks %>%# summarize mean "breaks" for each "wool" type at each "tension" testgroup_by(wool, tension) %>%summarise(breaks =mean(breaks)) %>%# spread wool types into separate columnspivot_wider(names_from = wool, values_from = breaks)
For more complex summary operations, it is recommend to summarize before reshaping (e.g., with group_by() and summarize()), but for simple cases as this example, it’s often convenient to summarize within pivot_wider().