Gather Columns into Longer and Narrower Dataset (1/4): the Basics of pivot_longer()
pivot_longer() (the successor to gather()) is one of the most important functions in data cleanup. It converts a dataset into a tidy structure: each row is an observation, each column is a variable, and each cell contains a single value. It is frequently needed to tidy wild-caught datasets, which are often optimized for ease of data entry or visual comparison rather than ease of analysis.
Consider the following dataset relig_income that records the number of respondents of each income range in different religions.
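relig_income ships with the tidyr package, so it can be inspected directly. A minimal sketch, assuming tidyr is installed:

```r
library(tidyr)

# relig_income is a built-in tidyr dataset: one row per religion,
# one column per income range (plus the religion column itself).
dim(relig_income)        # 18 rows, 11 columns
names(relig_income)      # "religion" followed by the income-range columns
head(relig_income, 3)    # first few religions with their counts
```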
Instead of having each income range (<$10k, $10-20k, $20-30k …) as a separate column, we want the tidied dataset to have a single column that records the income range, and another column that records the associated count of respondents.
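A minimal sketch of the call that performs this reshaping, using the argument values discussed in this tutorial (cols = !religion, names_to = "income", values_to = "count"). The name relig_long is only an illustrative choice for the result:

```r
library(tidyr)

# Pivot every column except `religion` into two new columns:
# `income` holds the former column names, `count` holds their cell values.
relig_long <- relig_income %>%
  pivot_longer(
    cols = !religion,
    names_to = "income",
    values_to = "count"
  )

relig_long
```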
data is the first argument: the dataset to be tidied up (e.g., relig_income). It can be conveniently passed into pivot_longer() using the pipe operator %>%.
cols describes which columns need to be restructured. In this case, it’s every column except religion. Note that for the cols argument, column names are not quoted: cols uses tidy selection, a tidyverse feature that lets you select columns by referring to their names directly.
names_to gives the name of the new variable (e.g., income), whose cell values will come from names of columns specified by cols. That is, the names of the cols-specified columns will become values of the income variable of the returned dataset.
values_to gives the name of the new variable (e.g., count), whose cell values will be created from the cells stored in columns specified by cols. That is, the cell values of the cols-selected columns will become values of the count variable of the returned dataset.
Neither the names_to nor the values_to column exists in relig_income, so we provide them as strings surrounded by quotes.
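Because cols accepts a column selection rather than a fixed string, the same set of columns can usually be described in more than one way. The sketch below is illustrative only: the helper pivot_with() and the result names are hypothetical, and the position-based form assumes the 11-column layout of relig_income as shipped with tidyr.

```r
library(tidyr)

# Hypothetical helper so the three calls differ only in `cols`.
# `{{ }}` forwards the unquoted selection to pivot_longer().
pivot_with <- function(selection) {
  relig_income %>%
    pivot_longer(
      cols = {{ selection }},
      names_to = "income",
      values_to = "count"
    )
}

by_negation <- pivot_with(!religion)  # everything except religion
by_minus    <- pivot_with(-religion)  # same idea, older minus syntax
by_position <- pivot_with(2:11)       # columns 2 through 11 by position

# All three selections pick the same ten income-range columns.
identical(by_negation, by_minus)
identical(by_negation, by_position)
```

Selecting by name is generally more robust than selecting by position, since it keeps working if the column order changes.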
In the output, each row records the number of respondents for one religion at one income range, and each column is a single variable.
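One way to sanity-check the result is to compare its shape with the original: every religion should appear once per pivoted column. A minimal sketch, where relig_long is a hypothetical name for the pivoted data:

```r
library(tidyr)

relig_long <- relig_income %>%
  pivot_longer(cols = !religion, names_to = "income", values_to = "count")

# Rows in the long data = religions x pivoted columns (18 x 10 = 180).
expected_rows <- nrow(relig_income) * (ncol(relig_income) - 1)
nrow(relig_long) == expected_rows

# Each religion now occupies one row per income range.
table(relig_long$religion)
```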
You are now acquainted with the basic use of pivot_longer(). In the following three tutorials on pivot_longer(), you’ll get more practice with this important function and learn its more advanced features for efficiently pivoting datasets with increasingly complex structures.