library(tidyr)library(dplyr)
<- tibble(x = c("ABC.XYZ.8701", "apple.juice.1765", NA, "yes.no.123")) df df
Output:
# A tibble: 4 × 1
x
<chr>
1 ABC.XYZ.8701
2 apple.juice.1765
3 <NA>
4 yes.no.123
separate()
splits a single character column into multiple columns based on a specified separator. In this tutorial, we’ll cover the following content:
separate()
separate()
separate()
has four basic arguments, and can be summarized as separate(data, col, into, sep)
:
data
: the input data frame. It can be conveniently passed into this function using the pipe operator %>%
.col
specifies the name of column that should be split into separate columns.into
specifies the names of new variables to create.sep
specifies the separator used to split the col
. If not specified, any sequence of non-alphanumeric values will be used to separate the col
. In this example, we use the dot as a separator. As the dot is typically interpreted as a wildcard (representing any character), we render it a literal dot by adding two backslashes before it.library(tidyr)library(dplyr)
<- tibble(x = c("ABC.XYZ.8701", "apple.juice.1765", NA, "yes.no.123")) df df
Output:
# A tibble: 4 × 1
x
<chr>
1 ABC.XYZ.8701
2 apple.juice.1765
3 <NA>
4 yes.no.123
Here we split the column x
into three separate columns (A
, B
, C
) based on the dot (.
) separator.
%>% separate( df col = x, # split column "x" into = c("A", "B", "C"), # create three new columns "A", "B", and "C" sep = "\\.") # split based on separator dot
Output:
# A tibble: 4 × 3
A B C
<chr> <chr> <chr>
1 ABC XYZ 8701
2 apple juice 1765
3 <NA> <NA> <NA>
4 yes no 123
Use NA
in into
to remove a specific variable from the output.
# retain only the last two columns in the output%>% separate(col = x, into = c(NA, "B", "C")) df
Output:
# A tibble: 4 × 2
B C
<chr> <chr>
1 XYZ 8701
2 juice 1765
3 <NA> <NA>
4 no 123
By default, all new columns are generated as characters. Use convert = T
to automatically parse the columns as the appropriate data types.
%>% separate( df col = x, into = c("A", "B", "C"), convert = T) # column "C" is nicely parsed as integers
Output:
# A tibble: 4 × 3
A B C
<chr> <chr> <int>
1 ABC XYZ 8701
2 apple juice 1765
3 <NA> <NA> NA
4 yes no 123
<- tibble(x = c("A", "A B", "X Y Z")) df2 df2
Output:
# A tibble: 3 × 1
x
<chr>
1 A
2 A B
3 X Y Z
The following script splits column x
into two columns by the white space separator. However, only the second row generates exactly two pieces; the first row generates only a single piece, and the third row generates three pieces.
# Warning messages:# 1: Expected 2 pieces. Additional pieces discarded in 1 rows [3]. # 2: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].%>% separate(col = x, into = c("col_1", "col_2")) df2
Output:
# A tibble: 3 × 2
col_1 col_2
<chr> <chr>
1 A <NA>
2 A B
3 X Y
In case of unequal number of split pieces, use the fill
and extra
arguments to control the output:
fill
specifies how to address the rows when there are not enough pieces (as the case of the first row). It has three value options:
"warn"
(the default): emit a warning and fill with missing values from the right."right"
: fill with missing values on the right."left"
: fill with missing values on the left.extra
specifies how to address the extra split pieces (as the case of the third row). It has three value options:
"warn"
(the default): emit a warning and drop extra values."drop"
: drop any extra values without a warning."merge"
: only split into the most length of pieces (specified by into
), and keep intact the additional non-split pieces.%>% df2 separate( col = x, into = c("col_1", "col_2"), fill = "right", # fill with NA on the right (for 1st row) extra = "drop") # remove extra pieces (for 3rd row)
Output:
# A tibble: 3 × 2
col_1 col_2
<chr> <chr>
1 A <NA>
2 A B
3 X Y
%>% df2 separate( col = x, into = c("col_1", "col_2"), fill = "left", # fill with NA on the left (for 1st row) extra = "merge") # merge extra pieces (for 3rd row)
Output:
# A tibble: 3 × 2
col_1 col_2
<chr> <chr>
1 <NA> A
2 A B
3 X Y Z
Or you can keep all split pieces by increasing the number of newly generated columns.
%>% separate(col = x, into = c("col_1", "col_2", "col_3"), df2 fill = "right")
Output:
# A tibble: 3 × 3
col_1 col_2 col_3
<chr> <chr> <chr>
1 A <NA> <NA>
2 A B <NA>
3 X Y Z
You can use regular expressions to separate based on multiple characters. Below we use character class [.?: ]
to specify any of the characters, dot, question mark, and colon, and white space, to be valid separators.
<- tibble(x = c(NA, "x?y", "x.z", "y:z", "abc ABC")) df3 df3
Output:
# A tibble: 5 × 1
x
<chr>
1 <NA>
2 x?y
3 x.z
4 y:z
5 abc ABC
%>% separate(col = x, into = c("A","B"), sep = "[.?: ]") df3
Output:
# A tibble: 5 × 2
A B
<chr> <chr>
1 <NA> <NA>
2 x y
3 x z
4 y z
5 abc ABC