Spread Columns into Wider Dataset (1/3): the basics of pivot_wider()

pivot_wider() (previously known as spread()) is the opposite of pivot_longer(): it makes a dataset wider by increasing the number of columns and decreasing the number of rows. It’s relatively rare to need pivot_wider() to make tidy data, but it’s often useful for creating summary tables for presentation, or for producing data in the format required by other tools.

Below you’ll learn the basics of pivot_wider() through the following three examples.

e.g.1 population is a subset of data from the World Health Organization that records the annual population of countries from 1995 to 2013. The dataset is in a nice tidy structure.

library(tidyr)
library(dplyr)
population

Output:

# A tibble: 4,060 × 3
country year population
<chr> <dbl> <dbl>
1 Afghanistan 1995 17586073
2 Afghanistan 1996 18415307
3 Afghanistan 1997 19021226
4 Afghanistan 1998 19496836
5 Afghanistan 1999 19987071
6 Afghanistan 2000 20595360
7 Afghanistan 2001 21347782
8 Afghanistan 2002 22202806
9 Afghanistan 2003 23116142
10 Afghanistan 2004 24018682
# ℹ 4,050 more rows

You can use pivot_wider() to reshape the dataset into a wider format, which makes it easier to inspect visually and to analyze with other tools. The first argument is the dataset to reshape. Beyond that, there are two other basic arguments:

  • names_from specifies the column whose unique values will become the new column names in the pivoted wide-format data frame.

  • values_from specifies the column whose values will populate the cells of the resulting wide-format data frame.

population %>%
  pivot_wider(
    # unique values in the 'year' column are spread out as new column names
    names_from = year,
    # values in the 'population' column are used to fill up cells of the new columns
    values_from = population)

Output:

# A tibble: 219 × 20
country `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 17586073 18415307 19021226 19496836 19987071 20595360 21347782 22202806 23116142 24018682 24860855 25631282 26349243 27032197 27708187 28397812 29105480 29824536 30551674
2 Albania 3357858 3341043 3331317 3325456 3317941 3304948 3286084 3263596 3239385 3216197 3196130 3179573 3166222 3156608 3151185 3150143 3153883 3162083 3173271
3 Algeria 29315463 29845208 30345466 30820435 31276295 31719449 32150198 32572977 33003442 33461345 33960903 34507214 35097043 35725377 36383302 37062820 37762962 38481705 39208194
4 American Samoa 52874 53926 54942 55899 56768 57522 58176 58729 59117 59262 59117 58652 57919 57053 56245 55636 55274 55128 55165
5 Andorra 63854 64274 64090 63799 64084 65399 68000 71639 75643 79060 81223 81877 81292 79969 78659 77907 77865 78360 79218
6 Angola 12104952 12451945 12791388 13137542 13510616 13924930 14385283 14886574 15421075 15976715 16544376 17122409 17712824 18314441 18926650 19549124 20180490 20820525 21471618
7 Anguilla 9807 10063 10305 10545 10797 11071 11371 11693 12023 12342 12637 12903 13145 13365 13571 13768 13956 14132 14300
8 Antigua and Barbuda 68349 70245 72232 74206 76041 77648 78972 80030 80904 81718 82565 83467 84397 85349 86300 87233 88152 89069 89985
9 Argentina 34833168 35264070 35690778 36109342 36514558 36903067 37273361 37627545 37970411 38308779 38647854 38988923 39331357 39676083 40023641 40374224 40728738 41086927 41446246
10 Armenia 3223173 3173425 3137652 3112958 3093820 3076098 3059960 3047002 3036032 3025652 3014917 3002911 2989882 2977488 2968154 2963496 2964120 2969081 2976566
# ℹ 209 more rows
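
pivot_wider() and pivot_longer() are inverses, so you can recover the original tidy layout from the wide table. A minimal sketch, assuming the wide result above has been saved as population_wide (a name not used above):

population_wide <- population %>%
  pivot_wider(names_from = year, values_from = population)

population_wide %>%
  pivot_longer(
    cols = -country,            # every column except 'country'
    names_to = "year",
    values_to = "population",
    # the column names like "1995" come back as strings; convert them to numbers
    names_transform = list(year = as.numeric))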


e.g.2 When a tagged fish swims downstream in a river, each instance of its detection by an autonomous monitor (referred to as a station) is recorded as 1. The fish_encounters dataset records such an encounter history.

fish_encounters

Output:

# A tibble: 114 × 3
fish station seen
<fct> <fct> <int>
1 4842 Release 1
2 4842 I80_1 1
3 4842 Lisbon 1
4 4842 Rstr 1
5 4842 Base_TD 1
6 4842 BCE 1
7 4842 BCW 1
8 4842 BCE2 1
9 4842 BCW2 1
10 4842 MAE 1
# ℹ 104 more rows

Below we’ll pivot the dataset into a wider format, which gives a quick bird’s-eye view of which stations detected each tagged fish.

fish_encounters %>%
  pivot_wider(
    # unique values in the 'station' column are spread out as new column names
    names_from = station,
    # values in the 'seen' column are used to fill up cells of the new columns
    values_from = seen)

Output:

# A tibble: 19 × 12
fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
<fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 4842 1 1 1 1 1 1 1 1 1 1 1
2 4843 1 1 1 1 1 1 1 1 1 1 1
3 4844 1 1 1 1 1 1 1 1 1 1 1
4 4845 1 1 1 1 1 NA NA NA NA NA NA
5 4847 1 1 1 NA NA NA NA NA NA NA NA
6 4848 1 1 1 1 NA NA NA NA NA NA NA
7 4849 1 1 NA NA NA NA NA NA NA NA NA
8 4850 1 1 NA 1 1 1 1 NA NA NA NA
9 4851 1 1 NA NA NA NA NA NA NA NA NA
10 4854 1 1 NA NA NA NA NA NA NA NA NA
11 4855 1 1 1 1 1 NA NA NA NA NA NA
12 4857 1 1 1 1 1 1 1 1 1 NA NA
13 4858 1 1 1 1 1 1 1 1 1 1 1
14 4859 1 1 1 1 1 NA NA NA NA NA NA
15 4861 1 1 1 1 1 1 1 1 1 1 1
16 4862 1 1 1 1 1 1 1 1 1 NA NA
17 4863 1 1 NA NA NA NA NA NA NA NA NA
18 4864 1 1 NA NA NA NA NA NA NA NA NA
19 4865 1 1 1 NA NA NA NA NA NA NA NA

A value of 1 is recorded by a station only when a fish was detected; nothing was recorded when the fish was not detected, which produces the NA values above. We can ask pivot_wider() to fill in these missing values with zeros using the values_fill argument.

fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen,
    values_fill = 0)  # use 0 in place of NA values

Output:

# A tibble: 19 × 12
fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
<fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 4842 1 1 1 1 1 1 1 1 1 1 1
2 4843 1 1 1 1 1 1 1 1 1 1 1
3 4844 1 1 1 1 1 1 1 1 1 1 1
4 4845 1 1 1 1 1 0 0 0 0 0 0
5 4847 1 1 1 0 0 0 0 0 0 0 0
6 4848 1 1 1 1 0 0 0 0 0 0 0
7 4849 1 1 0 0 0 0 0 0 0 0 0
8 4850 1 1 0 1 1 1 1 0 0 0 0
9 4851 1 1 0 0 0 0 0 0 0 0 0
10 4854 1 1 0 0 0 0 0 0 0 0 0
11 4855 1 1 1 1 1 0 0 0 0 0 0
12 4857 1 1 1 1 1 1 1 1 1 0 0
13 4858 1 1 1 1 1 1 1 1 1 1 1
14 4859 1 1 1 1 1 0 0 0 0 0 0
15 4861 1 1 1 1 1 1 1 1 1 1 1
16 4862 1 1 1 1 1 1 1 1 1 0 0
17 4863 1 1 0 0 0 0 0 0 0 0 0
18 4864 1 1 0 0 0 0 0 0 0 0 0
19 4865 1 1 1 0 0 0 0 0 0 0 0
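
One convenience of the zero-filled wide format is that every station column is numeric, so row-wise summaries become simple. A sketch, assuming the result above has been stored as fish_wide (a hypothetical name not used in the original):

fish_wide <- fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen, values_fill = 0)

# count how many stations detected each fish
fish_wide %>%
  mutate(n_stations = rowSums(across(-fish)))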


e.g.3 Imagine you have a contact list that you’ve copied and pasted from a website:

contacts <- tribble(
  ~field, ~value,
  "name", "Jiena McLellan",
  "company", "Toyota",
  "name", "John Smith",
  "company", "google",
  "email", "john@google.com",
  "name", "Huxley Ratcliffe")
contacts

Output:

# A tibble: 6 × 2
field value
<chr> <chr>
1 name Jiena McLellan
2 company Toyota
3 name John Smith
4 company google
5 email john@google.com
6 name Huxley Ratcliffe

This is challenging because there’s no variable that identifies which observations belong together. We can fix this by noting that every contact starts with a name, so we can create a unique id by counting each time the field column equals "name".

contacts <- contacts %>%
  mutate(person_id = cumsum(field == "name"))
contacts

Output:

# A tibble: 6 × 3
field value person_id
<chr> <chr> <int>
1 name Jiena McLellan 1
2 company Toyota 1
3 name John Smith 2
4 company google 2
5 email john@google.com 2
6 name Huxley Ratcliffe 3
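
The cumsum() trick works because field == "name" produces a logical vector, and cumsum() counts each TRUE as 1 and each FALSE as 0, so the running total ticks up at every new name. A quick illustration on a standalone vector:

field <- c("name", "company", "name", "company", "email", "name")
field == "name"
#> [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE
cumsum(field == "name")
#> [1] 1 1 2 2 2 3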

Now that we have a unique identifier for each person, we can pivot the field and value columns into a wider form.

contacts %>%
  pivot_wider(names_from = field, values_from = value)

Output:

# A tibble: 3 × 4
person_id name company email
<int> <chr> <chr> <chr>
1 1 Jiena McLellan Toyota <NA>
2 2 John Smith google john@google.com
3 3 Huxley Ratcliffe <NA> <NA>
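
If you’d rather see empty strings than NA in the contact table, values_fill also works for character columns. A sketch:

contacts %>%
  pivot_wider(names_from = field, values_from = value, values_fill = "")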


You are now familiar with the basic use of pivot_wider(). The following two tutorials on pivot_wider() will cover more advanced features that let you efficiently pivot datasets with increasingly complex structures.