Gather Columns into Longer and Narrower Dataset (1/4): the Basics of pivot_longer()

pivot_longer() (previously known as gather()) is one of the most important functions in data cleanup. It converts a dataset into a tidy structure: each row is an observation, each column is a variable, and each cell contains a value entry. It is frequently needed to tidy wild-caught datasets which are often optimized for ease of data entry or visual comparison instead of ease of analysis.

Consider the following dataset relig_income that records the number of respondents of each income range in different religions.

library(tidyr)library(dplyr)
relig_income

Output:

# A tibble: 18 × 11
religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k` `$100-150k` `>150k` `Don't know/refused`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agnostic 27 34 60 81 76 137 122 109 84 96
2 Atheist 12 27 37 52 35 70 73 59 74 76
3 Buddhist 27 21 30 34 33 58 62 39 53 54
4 Catholic 418 617 732 670 638 1116 949 792 633 1489
5 Don’t know/refused 15 14 15 11 10 35 21 17 18 116
6 Evangelical Prot 575 869 1064 982 881 1486 949 723 414 1529
7 Hindu 1 9 7 9 11 34 47 48 54 37
8 Historically Black Prot 228 244 236 238 197 223 131 81 78 339
9 Jehovah's Witness 20 27 24 24 21 30 15 11 6 37
10 Jewish 19 19 25 25 30 95 69 87 151 162
11 Mainline Prot 289 495 619 655 651 1107 939 753 634 1328
12 Mormon 29 40 48 51 56 112 85 49 42 69
13 Muslim 6 7 9 10 9 23 16 8 6 22
14 Orthodox 13 17 23 32 32 47 38 42 46 73
15 Other Christian 9 7 11 13 13 14 18 14 12 18
16 Other Faiths 20 33 40 46 49 63 46 40 41 71
17 Other World Religions 5 2 3 4 2 7 3 4 4 8
18 Unaffiliated 217 299 374 365 341 528 407 321 258 597

Instead of having the income ranges <$10k, $10-20k, $20-30k … each being a separate column, we want in the tidied dataset a single column to record the income range, and another column to record the associated counts of respondents.

relig_income %>%   pivot_longer(    cols = -religion,     names_to = "income",     values_to = "count")

Output:

# A tibble: 180 × 3
religion income count
<chr> <chr> <dbl>
1 Agnostic <$10k 27
2 Agnostic $10-20k 34
3 Agnostic $20-30k 60
4 Agnostic $30-40k 81
5 Agnostic $40-50k 76
6 Agnostic $50-75k 137
7 Agnostic $75-100k 122
8 Agnostic $100-150k 109
9 Agnostic >150k 84
10 Agnostic Don't know/refused 96
# ℹ 170 more rows
  • data is the first argument, taking the value of the dataset to be tidied up (e.g., relig_income). It can be conveniently passed into this function using the pipe operator %>%.

  • cols describes which columns need to be restructured. In this case, it’s every column except religion. Note that for the cols argument, column names are not quoted, a unique data masking feature in tidyverse that allows one to easily select columns by calling column names directly.

  • names_to gives the name of the new variable (e.g., income), whose cell values will come from names of columns specified by cols. That is, the names of the cols-specified columns will become values of the income variable of the returned dataset.

  • values_to gives the name of the new variable (e.g., count), whose cell values will be created from the cells stored in columns specified by cols. That is, the cell values of the cols-selected columns will become values of the count variable of the returned dataset.

Neither the names_to nor the values_to column exists in relig_income, so we provide them as strings surrounded by quotes.

In the output result, for each religion, the people count at each income range is displayed as a single row, and each column is a single variable.

Now you have acquainted yourself with the basic use of pivot_longer(). In the following three tutorials of pivot_longer(), you’ll get additional exercise over this important function, and learn more of its advanced features to efficiently pivot dataset with increasingly complex structure.