Create All Possible Combinations of Selected Variables (2/3): Use expand() with dplyr functions

  • Use group_by() before expand() to create combinations within each group.
  • Use expand() with anti_join() to find the missing combinations.
  • Use expand() with right_join() to convert implicit missing combinations to explicit missing values, a procedure that can also be performed by the complete() function.

In this tutorial, we’ll continue using the fruits dataset to demonstrate the use of expand() in conjunction with functions from the dplyr package.

library(tidyr)
library(dplyr)

fruits <- tibble(
  type = c("apple", "apple", "orange", "orange", "orange", "orange"),
  year = rep(c(2023, 2024), each = 3),
  size = factor(
    c("XS", "S", "S", "S", "S", "M"),
    levels = c("XS", "S", "M", "L")
  ),
  weights = rnorm(6, as.numeric(size) + 2)
)
fruits

Output:

# A tibble: 6 × 4
type year size weights
<chr> <dbl> <fct> <dbl>
1 apple 2023 XS 3.80
2 apple 2023 S 4.41
3 orange 2023 S 3.71
4 orange 2024 S 1.90
5 orange 2024 S 5.46
6 orange 2024 M 6.82

You can use group_by() before expand() to create combinations within each group. In that case, only the values present within each group are combined, with one exception: a factor variable always contributes its full set of levels, regardless of the group. For instance, when size is a factor, all four of its levels appear in every group's combinations; when size is a character vector (or any other non-factor type), only the values present in each group are used.

# size as a "factor"
fruits %>%
  group_by(type) %>%
  expand(year, size)

Output:

# A tibble: 12 × 3
# Groups: type [2]
type year size
<chr> <dbl> <fct>
1 apple 2023 XS
2 apple 2023 S
3 apple 2023 M
4 apple 2023 L
5 orange 2023 XS
6 orange 2023 S
7 orange 2023 M
8 orange 2023 L
9 orange 2024 XS
10 orange 2024 S
11 orange 2024 M
12 orange 2024 L

# size as a "character"
fruits %>%
  mutate(size = as.character(size)) %>%
  group_by(type) %>%
  expand(year, size)

Output:

# A tibble: 6 × 3
# Groups: type [2]
type year size
<chr> <dbl> <chr>
1 apple 2023 S
2 apple 2023 XS
3 orange 2023 M
4 orange 2023 S
5 orange 2024 M
6 orange 2024 S
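
To make the difference between the two outputs above easier to see, the following sketch counts the combinations generated per group in each case. It rebuilds the fruits dataset without the weights column (which is not needed here); the names n_factor and n_character are introduced only for this illustration.

```r
library(tidyr)
library(dplyr)

# Same fruits data as above, minus the random weights column
fruits <- tibble(
  type = c("apple", "apple", "orange", "orange", "orange", "orange"),
  year = rep(c(2023, 2024), each = 3),
  size = factor(
    c("XS", "S", "S", "S", "S", "M"),
    levels = c("XS", "S", "M", "L")
  )
)

# size as a factor: every group uses all four levels
n_factor <- fruits %>%
  group_by(type) %>%
  expand(year, size) %>%
  ungroup() %>%
  count(type)

# size as a character: every group uses only the values it contains
n_character <- fruits %>%
  mutate(size = as.character(size)) %>%
  group_by(type) %>%
  expand(year, size) %>%
  ungroup() %>%
  count(type)

n_factor      # apple: 4, orange: 8
n_character   # apple: 2, orange: 4
```

The counts follow from the data: apple appears in one year and orange in two, so the factor version yields 1 × 4 and 2 × 4 rows, while the character version yields 1 × 2 and 2 × 2 rows.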

You can use expand() with anti_join() to figure out which combinations are missing. (Recall that anti_join(A, B) returns rows found in dataset A but not in B.) For instance, the code below looks for all possible combinations of type, size, and year that are not present in fruits.

# find missing combinations (relative to all the possibilities)
all_combinations <- fruits %>% expand(type, size, year)
all_combinations %>% anti_join(fruits)

Output:

# A tibble: 11 × 3
type size year
<chr> <fct> <dbl>
1 apple XS 2024
2 apple S 2024
3 apple M 2023
4 apple M 2024
5 apple L 2023
6 apple L 2024
7 orange XS 2023
8 orange XS 2024
9 orange M 2023
10 orange L 2023
11 orange L 2024
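
When no by argument is given, anti_join() matches on every column the two datasets share and prints a message naming those columns. As a minimal sketch, the same lookup can be written with the key columns spelled out, which silences the message and keeps the join stable if more shared columns are added later. The name missing_combinations is introduced only for this illustration, and the weights column is left out because it plays no role in the join.

```r
library(tidyr)
library(dplyr)

# Same fruits data as above, minus the random weights column
fruits <- tibble(
  type = c("apple", "apple", "orange", "orange", "orange", "orange"),
  year = rep(c(2023, 2024), each = 3),
  size = factor(
    c("XS", "S", "S", "S", "S", "M"),
    levels = c("XS", "S", "M", "L")
  )
)

# All 2 x 4 x 2 = 16 possible combinations
all_combinations <- fruits %>% expand(type, size, year)

# Keep only the combinations that never occur in fruits,
# naming the join keys explicitly
missing_combinations <- all_combinations %>%
  anti_join(fruits, by = c("type", "size", "year"))

nrow(missing_combinations)   # 11, matching the output above
```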

You can use expand() with right_join() to convert implicit missing combinations to explicit missing values. In this example, the missing rows have NA values in the weights variable.

fruits %>% right_join(all_combinations)

Output:

# A tibble: 17 × 4
type year size weights
<chr> <dbl> <fct> <dbl>
1 apple 2023 XS 3.80
2 apple 2023 S 4.41
3 orange 2023 S 3.71
4 orange 2024 S 1.90
5 orange 2024 S 5.46
6 orange 2024 M 6.82
7 apple 2024 XS NA
8 apple 2024 S NA
9 apple 2023 M NA
10 apple 2024 M NA
11 apple 2023 L NA
12 apple 2024 L NA
13 orange 2023 XS NA
14 orange 2024 XS NA
15 orange 2023 M NA
16 orange 2023 L NA
17 orange 2024 L NA

The code above can also be written with the complete() function, which produces the same rows in a different order.

fruits %>% complete(type, year, size)

Output:

# A tibble: 17 × 4
type year size weights
<chr> <dbl> <fct> <dbl>
1 apple 2023 XS 3.80
2 apple 2023 S 4.41
3 apple 2023 M NA
4 apple 2023 L NA
5 apple 2024 XS NA
6 apple 2024 S NA
7 apple 2024 M NA
8 apple 2024 L NA
9 orange 2023 XS NA
10 orange 2023 S 3.71
11 orange 2023 M NA
12 orange 2023 L NA
13 orange 2024 XS NA
14 orange 2024 S 1.90
15 orange 2024 S 5.46
16 orange 2024 M 6.82
17 orange 2024 L NA
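
One way to check that the two approaches agree is to sort both results the same way and compare them. The sketch below does this under a few assumptions that are not part of the original example: set.seed(1) is added so the random weights are reproducible, the join keys are named explicitly, and via_join and via_complete are names chosen for illustration.

```r
library(tidyr)
library(dplyr)

set.seed(1)  # assumption: fixed seed so the weights are reproducible

fruits <- tibble(
  type = c("apple", "apple", "orange", "orange", "orange", "orange"),
  year = rep(c(2023, 2024), each = 3),
  size = factor(
    c("XS", "S", "S", "S", "S", "M"),
    levels = c("XS", "S", "M", "L")
  ),
  weights = rnorm(6, as.numeric(size) + 2)
)

all_combinations <- fruits %>% expand(type, size, year)

# Approach 1: expand() followed by right_join()
via_join <- fruits %>%
  right_join(all_combinations, by = c("type", "size", "year")) %>%
  arrange(type, year, size, weights)

# Approach 2: complete()
via_complete <- fruits %>%
  complete(type, year, size) %>%
  arrange(type, year, size, weights)

# Compare as plain data frames once the row order is aligned
all.equal(as.data.frame(via_join), as.data.frame(via_complete))
```

Sorting by weights as the last key breaks the tie between the two orange, 2024, S rows, so both results end up in the same order before the comparison.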

You have now learned how to use expand() to create combinations of variables (columns) of an input dataset. In the next section, you’ll learn a similar function, expand_grid(), which instead creates combinations from the values of vectors.