Gather Columns into Longer and Narrower Dataset (2/4): Deal with Numeric data in Column Names
The billboard dataset records the billboard rank of songs across all weeks in the year 2000. It has a similar structure as the relig_income dataset used in the last tutorial.
library(tidyr)library(dplyr) billboard
Output:
# A tibble: 317 × 79 artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9 wk10 wk11 wk12 wk13 wk14 wk15 wk16 wk17 wk18 wk19 wk20 wk21 wk22 wk23 wk24 wk25 wk26 wk27 wk28 wk29 wk30 wk31 wk32 wk33 wk34 wk35 wk36 wk37 wk38 wk39 wk40 wk41 wk42 wk43 wk44 wk45 <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 3 3 Doors… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 51 51 51 51 47 44 38 28 22 18 18 14 12 7 6 6 6 5 5 4 4 4 4 3 3 3 4 5 5 9 9 15 14 13 14 16 17 4 3 Doors… Loser 2000-10-21 76 76 72 69 67 65 55 59 62 61 61 59 61 66 72 76 75 67 73 70 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49 53 57 64 70 75 76 78 85 92 96 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2 3 6 7 22 29 36 47 67 66 84 93 94 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38 38 36 37 37 38 49 61 63 62 67 83 86 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14 12 10 9 8 6 1 2 2 2 2 3 4 5 5 6 9 13 14 16 23 22 33 36 43 NA NA NA NA NA NA NA NA NA NA NA NA NA 10 Adams, … Open… 2000-08-26 76 76 74 69 68 67 61 58 57 59 66 68 61 67 59 63 67 71 79 89 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA # ℹ 307 more rows # ℹ 31 more variables: wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>, wk53 <dbl>, wk54 <dbl>, wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>, wk59 <dbl>, wk60 <dbl>, wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>, wk65 <dbl>, wk66 <lgl>, wk67 <lgl>, wk68 <lgl>, # wk69 <lgl>, wk70 <lgl>, wk71 <lgl>, wk72 <lgl>, wk73 <lgl>, wk74 <lgl>, wk75 <lgl>, wk76 <lgl>
For all columns containing the prefix “wk”, we’ll restructure them into two columns: save the column names wk1, wk2, wk3…as values of the column week, and save all ranks across different weeks as values of the rank column. Here we use the selection helper starts_with() to select columns with the prefix “wk”.
billboard %>%pivot_longer(cols =starts_with("wk"),names_to ="week", values_to ="rank")
The useful data encoded in billboard’s column names wk1, wk2… is essentially the week number, while the constant prefix “wk” does not carry much useful information. As such, we can use argument names_prefix to remove the prefix “wk” from the week column in the pivoted dataset.
billboard %>%pivot_longer(cols =starts_with("wk"),names_to ="week",names_prefix ="wk", # remove prefix "wk"values_to ="rank")
On top of removing the prefix, we can use names_transform and a list of column name - function pair to change the type of the week column (or to perform other processing as demonstrated in later tutorials). The following script turns column week into an integer type.
billboard %>%pivot_longer(cols =starts_with("wk"),names_to ="week",names_prefix ="wk", # convert 'week' to integernames_transform =list(week = as.integer), values_to ="rank")
Alternatively, you can use a single argument with readr::parse_number() to automatically strip non-numeric components.
billboard %>%pivot_longer(cols =starts_with("wk"), names_to ="week", # remove non-numeric components, and convert to 'numeric' typenames_transform =list(week = readr::parse_number),values_to ="rank")
Use values_drop_na = TRUE to remove rows containing NA values in the rank column. (This effectively converts explicit missing values to implicit missing values, and should generally be used only when missing values in data were created by its structure.)
billboard %>%pivot_longer(cols =starts_with("wk"), names_to ="week", names_transform =list(week = readr::parse_number),values_to ="rank",values_drop_na = T) # remove rows with NA values in 'rank'.