Split a Column into Multiple Columns

separate() splits a single character column into multiple columns based on a specified separator. In this tutorial, we’ll cover the following content:


The basics of separate()

separate() has four basic arguments, and can be summarized as separate(data, col, into, sep):

  • data: the input data frame. It can be conveniently passed into this function using the pipe operator %>%.
  • col specifies the name of column that should be split into separate columns.
  • into specifies the names of new variables to create.
  • sep specifies the separator used to split the col. If not specified, any sequence of non-alphanumeric values will be used to separate the col. In this example, we use the dot as a separator. As the dot is typically interpreted as a wildcard (representing any character), we render it a literal dot by adding two backslashes before it.
library(tidyr)library(dplyr)
df <- tibble(x = c("ABC.XYZ.8701", "apple.juice.1765", NA, "yes.no.123"))df

Output:

# A tibble: 4 × 1
x
<chr>
1 ABC.XYZ.8701
2 apple.juice.1765
3 <NA>
4 yes.no.123

Here we split the column x into three separate columns (A, B, C) based on the dot (.) separator.

df %>% separate(  col = x,                 # split column "x"  into = c("A", "B", "C"), # create three new columns "A", "B", and "C"  sep = "\\.")             # split based on separator dot

Output:

# A tibble: 4 × 3
A B C
<chr> <chr> <chr>
1 ABC XYZ 8701
2 apple juice 1765
3 <NA> <NA> <NA>
4 yes no 123

Use NA in into to remove a specific variable from the output.

# retain only the last two columns in the outputdf %>% separate(col = x, into = c(NA, "B", "C"))

Output:

# A tibble: 4 × 2
B C
<chr> <chr>
1 XYZ 8701
2 juice 1765
3 <NA> <NA>
4 no 123

By default, all new columns are generated as characters. Use convert = T to automatically parse the columns as the appropriate data types.

df %>% separate(  col = x, into = c("A", "B", "C"),  convert = T) # column "C" is nicely parsed as integers

Output:

# A tibble: 4 × 3
A B C
<chr> <chr> <int>
1 ABC XYZ 8701
2 apple juice 1765
3 <NA> <NA> NA
4 yes no 123

Deal with unequal length of split pieces

df2 <- tibble(x = c("A", "A B", "X Y Z"))df2

Output:

# A tibble: 3 × 1
x
<chr>
1 A
2 A B
3 X Y Z

The following script splits column x into two columns by the white space separator. However, only the second row generates exactly two pieces; the first row generates only a single piece, and the third row generates three pieces.

# Warning messages:# 1: Expected 2 pieces. Additional pieces discarded in 1 rows [3]. # 2: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].df2 %>% separate(col = x, into = c("col_1", "col_2"))

Output:

# A tibble: 3 × 2
col_1 col_2
<chr> <chr>
1 A <NA>
2 A B
3 X Y

In case of unequal number of split pieces, use the fill and extra arguments to control the output:

  • fill specifies how to address the rows when there are not enough pieces (as the case of the first row). It has three value options:

    • "warn" (the default): emit a warning and fill with missing values from the right.
    • "right": fill with missing values on the right.
    • "left": fill with missing values on the left.
  • extra specifies how to address the extra split pieces (as the case of the third row). It has three value options:

    • "warn" (the default): emit a warning and drop extra values.
    • "drop": drop any extra values without a warning.
    • "merge": only split into the most length of pieces (specified by into), and keep intact the additional non-split pieces.
df2 %>%   separate(    col = x, into = c("col_1", "col_2"),     fill = "right", # fill with NA on the right (for 1st row)    extra = "drop") # remove extra pieces (for 3rd row)

Output:

# A tibble: 3 × 2
col_1 col_2
<chr> <chr>
1 A <NA>
2 A B
3 X Y
df2 %>%   separate(    col = x, into = c("col_1", "col_2"),     fill = "left",   # fill with NA on the left (for 1st row)    extra = "merge") # merge extra pieces (for 3rd row)

Output:

# A tibble: 3 × 2
col_1 col_2
<chr> <chr>
1 <NA> A
2 A B
3 X Y Z

Or you can keep all split pieces by increasing the number of newly generated columns.

df2 %>% separate(col = x, into = c("col_1", "col_2", "col_3"),                 fill = "right")

Output:

# A tibble: 3 × 3
col_1 col_2 col_3
<chr> <chr> <chr>
1 A <NA> <NA>
2 A B <NA>
3 X Y Z

Use regular expression to specify separators

You can use regular expressions to separate based on multiple characters. Below we use character class [.?: ] to specify any of the characters, dot, question mark, and colon, and white space, to be valid separators.

df3 <- tibble(x = c(NA, "x?y", "x.z", "y:z", "abc ABC"))df3

Output:

# A tibble: 5 × 1
x
<chr>
1 <NA>
2 x?y
3 x.z
4 y:z
5 abc ABC
df3 %>% separate(col = x, into = c("A","B"), sep = "[.?: ]")

Output:

# A tibble: 5 × 2
A B
<chr> <chr>
1 <NA> <NA>
2 x y
3 x z
4 y z
5 abc ABC