Gather Columns into Longer and Narrower Dataset (2/4): Deal with Numeric data in Column Names

The billboard dataset records the billboard rank of songs across all weeks in the year 2000. It has a similar structure as the relig_income dataset used in the last tutorial.

library(tidyr)library(dplyr)
billboard

Output:

# A tibble: 317 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9 wk10 wk11 wk12 wk13 wk14 wk15 wk16 wk17 wk18 wk19 wk20 wk21 wk22 wk23 wk24 wk25 wk26 wk27 wk28 wk29 wk30 wk31 wk32 wk33 wk34 wk35 wk36 wk37 wk38 wk39 wk40 wk41 wk42 wk43 wk44 wk45
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 3 Doors… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 51 51 51 51 47 44 38 28 22 18 18 14 12 7 6 6 6 5 5 4 4 4 4 3 3 3 4 5 5 9 9 15 14 13 14 16 17
4 3 Doors… Loser 2000-10-21 76 76 72 69 67 65 55 59 62 61 61 59 61 66 72 76 75 67 73 70 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49 53 57 64 70 75 76 78 85 92 96 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2 3 6 7 22 29 36 47 67 66 84 93 94 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38 38 36 37 37 38 49 61 63 62 67 83 86 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14 12 10 9 8 6 1 2 2 2 2 3 4 5 5 6 9 13 14 16 23 22 33 36 43 NA NA NA NA NA NA NA NA NA NA NA NA NA
10 Adams, … Open… 2000-08-26 76 76 74 69 68 67 61 58 57 59 66 68 61 67 59 63 67 71 79 89 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# ℹ 307 more rows
# ℹ 31 more variables: wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>, wk53 <dbl>, wk54 <dbl>, wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>, wk59 <dbl>, wk60 <dbl>, wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>, wk65 <dbl>, wk66 <lgl>, wk67 <lgl>, wk68 <lgl>,
# wk69 <lgl>, wk70 <lgl>, wk71 <lgl>, wk72 <lgl>, wk73 <lgl>, wk74 <lgl>, wk75 <lgl>, wk76 <lgl>

For all columns containing the prefix “wk”, we’ll restructure them into two columns: save the column names wk1, wk2, wk3…as values of the column week, and save all ranks across different weeks as values of the rank column. Here we use the selection helper starts_with() to select columns with the prefix “wk”.

billboard %>%   pivot_longer(    cols = starts_with("wk"),    names_to = "week",     values_to = "rank")

Output:

# A tibble: 24,092 × 5
artist track date.entered week rank
<chr> <chr> <date> <chr> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
# ℹ 24,082 more rows

The useful data encoded in billboard’s column names wk1, wk2… is essentially the week number, while the constant prefix “wk” does not carry much useful information. As such, we can use argument names_prefix to remove the prefix “wk” from the week column in the pivoted dataset.

billboard %>%   pivot_longer(    cols = starts_with("wk"),    names_to = "week",    names_prefix = "wk", # remove prefix "wk"    values_to = "rank")

Output:

# A tibble: 24,092 × 5
artist track date.entered week rank
<chr> <chr> <date> <chr> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
8 2 Pac Baby Don't Cry (Keep... 2000-02-26 8 NA
9 2 Pac Baby Don't Cry (Keep... 2000-02-26 9 NA
10 2 Pac Baby Don't Cry (Keep... 2000-02-26 10 NA
# ℹ 24,082 more rows

On top of removing the prefix, we can use names_transform and a list of column name - function pair to change the type of the week column (or to perform other processing as demonstrated in later tutorials). The following script turns column week into an integer type.

billboard %>%   pivot_longer(    cols = starts_with("wk"),    names_to = "week",    names_prefix = "wk",     # convert 'week' to integer    names_transform = list(week = as.integer),     values_to = "rank")

Output:

# A tibble: 24,092 × 5
artist track date.entered week rank
<chr> <chr> <date> <int> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
8 2 Pac Baby Don't Cry (Keep... 2000-02-26 8 NA
9 2 Pac Baby Don't Cry (Keep... 2000-02-26 9 NA
10 2 Pac Baby Don't Cry (Keep... 2000-02-26 10 NA
# ℹ 24,082 more rows

Alternatively, you can use a single argument with readr::parse_number() to automatically strip non-numeric components.

billboard %>%   pivot_longer(    cols = starts_with("wk"),     names_to = "week",     # remove non-numeric components, and convert to 'numeric' type    names_transform = list(week = readr::parse_number),    values_to = "rank")

Use values_drop_na = TRUE to remove rows containing NA values in the rank column. (This effectively converts explicit missing values to implicit missing values, and should generally be used only when missing values in data were created by its structure.)

billboard %>%   pivot_longer(    cols = starts_with("wk"),     names_to = "week",     names_transform = list(week = readr::parse_number),    values_to = "rank",    values_drop_na = T) # remove rows with NA values in 'rank'.

Output:

# A tibble: 5,307 × 5
artist track date.entered week rank
<chr> <chr> <date> <dbl> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
# ℹ 5,297 more rows