Extract Matched Patterns from a String

  • str_extract() extracts the first match from each string element.
  • str_extract_all() extracts all matches from each string element.

str_extract() returns the first substring that matches a specified pattern in each string element.

In the following example, the regular expression [a-z]{1,6} combines a character class, [a-z], with a quantifier, {1,6}, and matches any sequence of 1 to 6 consecutive lowercase letters.

library(stringr)
shop_list <- c("apples *40 Walmart", "flour *12 Target", "sugar *3 Costco")
str_extract(shop_list, pattern = "[a-z]{1,6}")

Output:

[1] "apples" "flour" "sugar"

Note that str_extract() returns only the first match in each element. The lowercase letters in “Walmart”, “Target”, and “Costco” also match the pattern (the capital first letters do not, because [a-z] covers lowercase only), but these later matches are not returned.
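To see which later matches str_extract() skips, the same pattern can be passed to str_extract_all(). This is a minimal sketch that reuses the shop_list from above; the variable names first and all_matches are illustrative only.

```r
library(stringr)

shop_list <- c("apples *40 Walmart", "flour *12 Target", "sugar *3 Costco")

# First match per element only
first <- str_extract(shop_list, pattern = "[a-z]{1,6}")
print(first)

# Every match per element; [a-z] is lowercase-only, so the capital
# first letter of each store name is not part of any match
all_matches <- str_extract_all(shop_list, pattern = "[a-z]{1,6}")
print(all_matches[[1]])  # "apples" "almart"
```

The second match in each element starts after the capital letter, which is why it reads "almart" rather than "Walmart".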

As an alternative to writing regular expressions by hand, the rebus package offers a more readable and easier-to-remember syntax for defining patterns. For example, one_or_more(WRD) matches one or more consecutive word characters (WRD: letters, digits, or underscores).

# install.packages("rebus")
library(rebus)
str_extract(shop_list, pattern = one_or_more(WRD))

Output:

[1] "apples" "flour" "sugar"

Extract consecutive digits (DGT).

str_extract(shop_list, pattern = one_or_more(DGT)) 

Output:

[1] "40" "12" "3"
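For readers who prefer plain regular expressions, the two rebus patterns above can be sketched without the package. This assumes \\w+ and \\d+ as stand-ins for one_or_more(WRD) and one_or_more(DGT), which should behave the same on this data; the variable names words and digits are illustrative.

```r
library(stringr)

shop_list <- c("apples *40 Walmart", "flour *12 Target", "sugar *3 Costco")

# \\w+ : one or more word characters (letters, digits, underscore)
words <- str_extract(shop_list, pattern = "\\w+")
print(words)   # "apples" "flour" "sugar"

# \\d+ : one or more digits
digits <- str_extract(shop_list, pattern = "\\d+")
print(digits)  # "40" "12" "3"

# The digits come back as character strings; convert if numbers are needed
quantities <- as.integer(digits)
```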

str_extract_all() extracts all matches from each string element, and returns a list.

# shop_list <- c(
#   "apples x4", "bag of flour x1",
#   "bag of sugar x3", "milk x4")
str_extract_all(shop_list, pattern = one_or_more(WRD))

Output:

[[1]]
[1] "apples" "40" "Walmart"
[[2]]
[1] "flour" "12" "Target"
[[3]]
[1] "sugar" "3" "Costco"
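Because the result is a list with one character vector per input string, base R helpers can be used to work with it. The following is a small sketch, assuming the plain regex \\w+ in place of one_or_more(WRD) so that it runs without rebus; matches, n_matches, and stores are illustrative names.

```r
library(stringr)

shop_list <- c("apples *40 Walmart", "flour *12 Target", "sugar *3 Costco")

matches <- str_extract_all(shop_list, pattern = "\\w+")

# Number of matches found in each element
n_matches <- lengths(matches)
print(n_matches)  # 3 3 3

# Pick the third match (the store name) from each element
stores <- vapply(matches, function(m) m[3], character(1))
print(stores)     # "Walmart" "Target" "Costco"
```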

Use simplify = TRUE to return a character matrix instead of a list.

str_extract_all(shop_list, pattern = one_or_more(WRD),
                simplify = TRUE)

Output:

     [,1]     [,2] [,3]
[1,] "apples" "40" "Walmart"
[2,] "flour"  "12" "Target"
[3,] "sugar"  "3"  "Costco"
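When the elements do not all produce the same number of matches, the matrix form pads the shorter rows with empty strings. The sketch below illustrates this with a hypothetical vector, uneven, that is not part of the original example, and again assumes \\w+ in place of one_or_more(WRD).

```r
library(stringr)

# Hypothetical input: the second element has only two word-like tokens
uneven <- c("apples *40 Walmart", "flour Target")

m <- str_extract_all(uneven, pattern = "\\w+", simplify = TRUE)
print(dim(m))  # 2 rows, 3 columns
print(m[2, ])  # "flour" "Target" ""
```

Checking for "" in the last columns is a simple way to spot rows with fewer matches.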