Capture Group: Extract Components from Matched Patterns

A capture group is the part of a regular expression enclosed in parentheses (). As the name suggests, the text matched by the pattern inside the parentheses is captured and can be extracted. stringr::str_match() extracts the full match together with each captured component, and returns them as a character matrix.

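Below we’ll illustrate capture groups and str_match() with three examples, covering the following four aspects:

  • Basics of capture group with str_match()
  • Named capture groups
  • Extract components with multiple matches
  • Capture group with tibbles (matrix- and tibble-column)
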
Basics of capture group with str_match()

e.g. 1. Let’s first extract the month-day-year string without using a capture group.

library(stringr)

dates <- c("Sweden, 12-22-1862",
           "Norway, 5-31-1864",
           "France, 12-6-1901")

# define the pattern of a month-day-year
p1 <- "[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}"

# highlight the matches (str_view_all() is deprecated since stringr 1.5.0)
str_view(dates, p1)

Output:

[1] │ Sweden, <12-22-1862>
[2] │ Norway, <5-31-1864>
[3] │ France, <12-6-1901>

str_extract(dates, p1)

Output:

[1] "12-22-1862" "5-31-1864" "12-6-1901"

To capture the month, day, and year as separate components, wrap each corresponding sub-pattern in parentheses. str_match() returns a character matrix whose first column holds the entire month-day-year match, and whose remaining columns hold the individual captured components.

p2 <- "([0-9]{1,2})-([0-9]{1,2})-([0-9]{4})"

str_match(dates, p2)

Output:

     [,1]         [,2] [,3] [,4]  
[1,] "12-22-1862" "12" "22" "1862"
[2,] "5-31-1864"  "5"  "31" "1864"
[3,] "12-6-1901"  "12" "6"  "1901"
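
Because the result is an ordinary character matrix, a single component can be pulled out by column index. The sketch below is a minimal illustration that reuses the pattern p2 from above; the input vector x, including its non-matching second element, is made up for this example. A string with no match yields a row of NA rather than being dropped, so the output stays aligned with the input.

```r
library(stringr)

# same pattern as p2 above: three capture groups for month, day, year
p2 <- "([0-9]{1,2})-([0-9]{1,2})-([0-9]{4})"

# hypothetical input: the second element contains no date
x <- c("France, 12-6-1901", "no date here")

m <- str_match(x, p2)

# column 1 is the full match; columns 2-4 are the capture groups,
# so column 4 holds the year
years <- as.integer(m[, 4])
years
```

Keeping one row per input element is what makes it safe to bind the extracted columns back onto the original data.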

Named capture groups

To create a named capture group, add ?<name> before the pattern inside the pair of parentheses, i.e., use the syntax (?<name>pattern).

p2.named <- "(?<MON>[0-9]{1,2})-(?<DAY>[0-9]{1,2})-(?<year>[0-9]{4})"

str_match(dates, p2.named)

Output:

                  MON  DAY  year  
[1,] "12-22-1862" "12" "22" "1862"
[2,] "5-31-1864"  "5"  "31" "1864"
[3,] "12-6-1901"  "12" "6"  "1901"
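
One practical benefit of naming the groups is that a component can be selected by name instead of by position. The following is a minimal sketch, assuming a stringr version that keeps the group names as column names (as in the output above); the shortened input vector d is made up for this example.

```r
library(stringr)

# hypothetical shortened inputs, dates only
d <- c("12-22-1862", "5-31-1864")

p2.named <- "(?<MON>[0-9]{1,2})-(?<DAY>[0-9]{1,2})-(?<year>[0-9]{4})"

m <- str_match(d, p2.named)

# select a component by its group name rather than its column index
yr <- m[, "year"]
yr
```

Selecting by name keeps the code valid if another capture group is later added to the pattern, which would shift the column positions.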

Extract components with multiple matches

e.g. 2. The following example extracts the components of phone numbers whose parts are separated by hyphens, white spaces, or dots (matched with [- .]). [0-9]{3} matches the three-digit area code and the three-digit exchange code (“xxx”), and [0-9]{4} matches the four-digit line number (“xxxx”). Each of the three is wrapped in parentheses to form a capture group, so each is extracted and displayed in its own column of the output.

tel <- c("329-293-8753 ",
         "239 923 8115 and 842 566 4692",
         "Work: 579-499-7527",
         "Home: 543.355.3679")

# define the pattern of a phone number
p <- "([0-9]{3})[- .]([0-9]{3})[- .]([0-9]{4})"

str_match(tel, p)

Output:

     [,1]           [,2]  [,3]  [,4]  
[1,] "329-293-8753" "329" "293" "8753"
[2,] "239 923 8115" "239" "923" "8115"
[3,] "579-499-7527" "579" "499" "7527"
[4,] "543.355.3679" "543" "355" "3679"

In str_match(), for each vector element, only the first match is selected, while all following matches are excluded from the output (similar to str_extract()); e.g., the second phone number in the second string element is not returned. Use str_match_all() instead to extract all matches, and return the result as a list (similar to str_extract_all()).

str_match_all(tel, p)

Output:

[[1]]
     [,1]           [,2]  [,3]  [,4]  
[1,] "329-293-8753" "329" "293" "8753"

[[2]]
     [,1]           [,2]  [,3]  [,4]  
[1,] "239 923 8115" "239" "923" "8115"
[2,] "842 566 4692" "842" "566" "4692"

[[3]]
     [,1]           [,2]  [,3]  [,4]  
[1,] "579-499-7527" "579" "499" "7527"

[[4]]
     [,1]           [,2]  [,3]  [,4]  
[1,] "543.355.3679" "543" "355" "3679"
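
A list of matrices is often less convenient than a single matrix. As a rough sketch using only base R, the per-string results can be counted and stacked as shown below. The names all.matches, n.matches, combined, and source.id are illustrative choices, not part of stringr.

```r
library(stringr)

tel <- c("329-293-8753 ",
         "239 923 8115 and 842 566 4692",
         "Work: 579-499-7527",
         "Home: 543.355.3679")
p <- "([0-9]{3})[- .]([0-9]{3})[- .]([0-9]{4})"

all.matches <- str_match_all(tel, p)

# number of phone numbers found in each string
n.matches <- vapply(all.matches, nrow, integer(1))

# stack the per-string matrices into one matrix, one row per phone number
combined <- do.call(rbind, all.matches)

# illustrative bookkeeping: index of the source string for each stacked row
source.id <- rep(seq_along(tel), times = n.matches)

n.matches
combined
source.id
```

Tracking source.id matters because stacking loses the one-row-per-input alignment that str_match() provides.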

Capture group with tibbles (matrix- and tibble-column)

The tibble is the central data structure in the tidyverse, so it is worth knowing how to store the output of str_match() in tibble columns.

e.g. 3. Here we use the who dataset (from the tidyr package), which comes from the World Health Organization Global Tuberculosis Report. For ease of demonstration, we’ll first do a simple data cleanup: use slice_max() to keep the 20 rows with the highest values of new_sp_m1524 (the case count in males aged 15-24), and then use pivot_longer() to transform the dataset into a tidy structure.

library(tidyr)
library(dplyr)

# select the 20 rows with the most cases in males aged 15-24
who.max <- who %>% slice_max(order_by = new_sp_m1524, n = 20)

# convert to tidy structure
who.tidy <- who.max %>%
  pivot_longer(-c(1:4), names_to = "condition", values_to = "count")

who.tidy

Output:

# A tibble: 1,120 × 6
country iso2 iso3 year condition count
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 India IN IND 2010 new_sp_m014 4871
2 India IN IND 2010 new_sp_m1524 78278
3 India IN IND 2010 new_sp_m2534 82757
4 India IN IND 2010 new_sp_m3544 90440
5 India IN IND 2010 new_sp_m4554 81210
6 India IN IND 2010 new_sp_m5564 60766
7 India IN IND 2010 new_sp_m65 38442
8 India IN IND 2010 new_sp_f014 8544
9 India IN IND 2010 new_sp_f1524 53415
10 India IN IND 2010 new_sp_f2534 49425
# ℹ 1,110 more rows

The condition column actually consists of four pieces of information:

  • The new or new_ prefix indicates that the counts are new tuberculosis cases.
  • sp, sn, ep, and rel describe the diagnosis result: sp, positive pulmonary smear; sn, negative pulmonary smear; ep, extrapulmonary; rel, relapse.
  • m and f indicate the gender: m, male; f, female.
  • The digits show the age ranges: 014, 0-14 years of age; 1524, 15-24 years of age; 2534, 25-34 years of age, etc.

Here we use capture groups to extract the four pieces of information and split them into separate columns.

pattern <- "(new)_?(.*)_(.)(.*)"
x <- who.tidy %>% mutate(a = str_match(condition, pattern), .keep = "unused")

x

Output:

# A tibble: 1,120 × 6
country iso2 iso3 year count a[,1] [,2] [,3] [,4] [,5]
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 India IN IND 2010 4871 new_sp_m014 new sp m 014
2 India IN IND 2010 78278 new_sp_m1524 new sp m 1524
3 India IN IND 2010 82757 new_sp_m2534 new sp m 2534
4 India IN IND 2010 90440 new_sp_m3544 new sp m 3544
5 India IN IND 2010 81210 new_sp_m4554 new sp m 4554
6 India IN IND 2010 60766 new_sp_m5564 new sp m 5564
7 India IN IND 2010 38442 new_sp_m65 new sp m 65
8 India IN IND 2010 8544 new_sp_f014 new sp f 014
9 India IN IND 2010 53415 new_sp_f1524 new sp f 1524
10 India IN IND 2010 49425 new_sp_f2534 new sp f 2534
# ℹ 1,110 more rows

Note that a[,1], [,2], ..., [,5] together form a single matrix-column, which can be accessed and returned as a matrix by calling x$a. To simplify downstream analysis, we can split this matrix-column into separate columns by:

  • first turning the matrix-column into a single tibble-column with as_tibble(),
  • and then unpacking this tibble-column into separate columns with unpack().

In addition, we use named capture groups to add names to the newly generated columns. The full match has no group name, so as_tibble() labels its column V1.
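
Before unpacking, it can help to confirm that a matrix-column really behaves like an ordinary matrix. The sketch below is a minimal check on a made-up two-row tibble (toy), standing in for who.tidy. It also shows that the pattern copes with the newrel prefix, which has no underscore after new.

```r
library(stringr)
library(dplyr)
library(tibble)

# hypothetical two-row stand-in for who.tidy
toy <- tibble(condition = c("new_sp_m014", "newrel_f65"))

pattern <- "(new)_?(.*)_(.)(.*)"

x <- toy %>% mutate(a = str_match(condition, pattern))

is.matrix(x$a)   # the whole match result is stored as one column
dim(x$a)         # rows by (full match + 4 capture groups)
x$a[, 4]         # third capture group, i.e. gender
```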

# create a named capture group
named.capture <- "(?<status>new)_?(?<type>.*)_(?<gender>.)(?<age>.*)"

who.tidy %>%
  mutate(
    # capture, and return as a tibble
    a = str_match(condition, named.capture) %>% as_tibble(),
    .keep = "unused"
  ) %>%
  # unpack the single tibble-column into separate columns
  unpack(a)

Output:

# A tibble: 1,120 × 10
country iso2 iso3 year count V1 status type gender age
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 India IN IND 2010 4871 new_sp_m014 new sp m 014
2 India IN IND 2010 78278 new_sp_m1524 new sp m 1524
3 India IN IND 2010 82757 new_sp_m2534 new sp m 2534
4 India IN IND 2010 90440 new_sp_m3544 new sp m 3544
5 India IN IND 2010 81210 new_sp_m4554 new sp m 4554
6 India IN IND 2010 60766 new_sp_m5564 new sp m 5564
7 India IN IND 2010 38442 new_sp_m65 new sp m 65
8 India IN IND 2010 8544 new_sp_f014 new sp f 014
9 India IN IND 2010 53415 new_sp_f1524 new sp f 1524
10 India IN IND 2010 49425 new_sp_f2534 new sp f 2534
# ℹ 1,110 more rows
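
The V1 column above repeats the original condition string. If it is not needed, one possible variation is to drop the first column of the matrix before converting it to a tibble. This is a sketch rather than the approach used above: it runs on a made-up two-row tibble (toy) and assumes a stringr version that keeps the group names as column names.

```r
library(stringr)
library(dplyr)
library(tidyr)
library(tibble)

# hypothetical two-row stand-in for who.tidy
toy <- tibble(condition = c("new_sp_m014", "newrel_f65"),
              count = c(10, 20))

named.capture <- "(?<status>new)_?(?<type>.*)_(?<gender>.)(?<age>.*)"

toy.split <- toy %>%
  mutate(
    # drop column 1 (the full match) so only the named groups remain;
    # drop = FALSE keeps the result a matrix
    a = as_tibble(str_match(condition, named.capture)[, -1, drop = FALSE]),
    .keep = "unused"
  ) %>%
  # spread the tibble-column into ordinary columns
  unpack(a)

toy.split
```

Because every remaining column has a name, as_tibble() no longer needs to invent one for the unnamed full match.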