Split a Column into Multiple Columns

separate() splits a single character column into multiple columns based on a specified separator. In this tutorial, we’ll cover the following content:

The basics of separate()
Deal with unequal length of split pieces
Use regular expression to specify separators

The basics of `separate()`

separate() has four basic arguments, and can be summarized as separate(data, col, into, sep):

data: the input data frame. It can be conveniently passed into this function using the pipe operator %>%.
col specifies the name of column that should be split into separate columns.
into specifies the names of new variables to create.
sep specifies the separator used to split the col. If not specified, any sequence of non-alphanumeric values will be used to separate the col. In this example, we use the dot as a separator. As the dot is typically interpreted as a wildcard (representing any character), we render it a literal dot by adding two backslashes before it.

library(tidyr)library(dplyr)
df <- tibble(x = c("ABC.XYZ.8701", "apple.juice.1765", NA, "yes.no.123"))df

Output:
# A tibble: 4 × 1
  x
  <chr>
1 ABC.XYZ.8701
2 apple.juice.1765
3 <NA>
4 yes.no.123

Here we split the column x into three separate columns (A, B, C) based on the dot (.) separator.

df %>% separate(  col = x,                 # split column "x"  into = c("A", "B", "C"), # create three new columns "A", "B", and "C"  sep = "\\.")             # split based on separator dot

Output:
# A tibble: 4 × 3
  A     B     C
  <chr> <chr> <chr>
1 ABC   XYZ   8701
2 apple juice 1765
3 <NA>  <NA>  <NA>
4 yes   no    123

Use NA in into to remove a specific variable from the output.

# retain only the last two columns in the outputdf %>% separate(col = x, into = c(NA, "B", "C"))

Output:
# A tibble: 4 × 2
  B     C
  <chr> <chr>
1 XYZ   8701
2 juice 1765
3 <NA>  <NA>
4 no    123

By default, all new columns are generated as characters. Use convert = T to automatically parse the columns as the appropriate data types.

df %>% separate(  col = x, into = c("A", "B", "C"),  convert = T) # column "C" is nicely parsed as integers

Output:
# A tibble: 4 × 3
  A     B         C
  <chr> <chr> <int>
1 ABC   XYZ    8701
2 apple juice  1765
3 <NA>  <NA>     NA
4 yes   no      123

Deal with unequal length of split pieces

df2 <- tibble(x = c("A", "A B", "X Y Z"))df2

Output:
# A tibble: 3 × 1
  x
  <chr>
1 A
2 A B
3 X Y Z

The following script splits column x into two columns by the white space separator. However, only the second row generates exactly two pieces; the first row generates only a single piece, and the third row generates three pieces.

# Warning messages:# 1: Expected 2 pieces. Additional pieces discarded in 1 rows [3]. # 2: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].df2 %>% separate(col = x, into = c("col_1", "col_2"))

Output:
# A tibble: 3 × 2
  col_1 col_2
  <chr> <chr>
1 A     <NA>
2 A     B
3 X     Y

In case of unequal number of split pieces, use the fill and extra arguments to control the output:

fill specifies how to address the rows when there are not enough pieces (as the case of the first row). It has three value options:
- "warn" (the default): emit a warning and fill with missing values from the right.
- "right": fill with missing values on the right.
- "left": fill with missing values on the left.
extra specifies how to address the extra split pieces (as the case of the third row). It has three value options:
- "warn" (the default): emit a warning and drop extra values.
- "drop": drop any extra values without a warning.
- "merge": only split into the most length of pieces (specified by into), and keep intact the additional non-split pieces.

df2 %>%   separate(    col = x, into = c("col_1", "col_2"),     fill = "right", # fill with NA on the right (for 1st row)    extra = "drop") # remove extra pieces (for 3rd row)

Output:
# A tibble: 3 × 2
  col_1 col_2
  <chr> <chr>
1 A     <NA>
2 A     B
3 X     Y

df2 %>%   separate(    col = x, into = c("col_1", "col_2"),     fill = "left",   # fill with NA on the left (for 1st row)    extra = "merge") # merge extra pieces (for 3rd row)

Output:
# A tibble: 3 × 2
  col_1 col_2
  <chr> <chr>
1 <NA>  A
2 A     B
3 X     Y Z

Or you can keep all split pieces by increasing the number of newly generated columns.

df2 %>% separate(col = x, into = c("col_1", "col_2", "col_3"),                 fill = "right")

Output:
# A tibble: 3 × 3
  col_1 col_2 col_3
  <chr> <chr> <chr>
1 A     <NA>  <NA>
2 A     B     <NA>
3 X     Y     Z

Use regular expression to specify separators

You can use regular expressions to separate based on multiple characters. Below we use character class [.?: ] to specify any of the characters, dot, question mark, and colon, and white space, to be valid separators.

df3 <- tibble(x = c(NA, "x?y", "x.z", "y:z", "abc ABC"))df3

Output:
# A tibble: 5 × 1
  x
  <chr>
1 <NA>
2 x?y
3 x.z
4 y:z
5 abc ABC

df3 %>% separate(col = x, into = c("A","B"), sep = "[.?: ]")

Output:
# A tibble: 5 × 2
  A     B
  <chr> <chr>
1 <NA>  <NA>
2 x     y
3 x     z
4 y     z
5 abc   ABC

The basics of separate()

Deal with unequal length of split pieces

Use regular expression to specify separators

Amazing eBook to learn ggplot2 FAST & EASY

The basics of `separate()`