Lookaround Assertions

Lookaround assertions come in positive and negative forms, and each form has a lookahead and a lookbehind variant. In this tutorial, we’ll first look at positive lookarounds in detail, and then briefly touch on negative lookarounds.

Positive lookaround assertion

A positive lookaround assertion states that, for a pattern A to match, a pattern B must be present right next to it, while B itself is not part of the match. If A comes immediately before B, this is a positive lookahead, written A(?=B); if A comes immediately after B, it is a positive lookbehind, written (?<=B)A.

eg. 1. Positive lookahead. \\w+(?=ing) matches a run of word characters that is immediately followed by “ing”, but only the part before “ing” counts as the match. (Recall that, for ASCII text, \w is equivalent to the character class [a-zA-Z0-9_]; also note that inside an R string a second backslash is needed to escape the first.)

library(stringr)
x <- c("I sing", "I am singing", "She is dancing")
str_view_all(x, "\\w+(?=ing)")

Output:

[1] │ I <s>ing
[2] │ I am <sing>ing
[3] │ She is <danc>ing

str_extract(x, "\\w+(?=ing)")

Output:

[1] "s" "sing" "danc"
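
The double-backslash rule mentioned in eg. 1 can be checked directly. The sketch below is an illustrative aside rather than part of the original examples; the variable names are made up for the illustration, and the raw-string syntax r"[...]" assumes R 4.0.0 or later.

```r
# The R string literal as typed in eg. 1
escaped <- "\\w+(?=ing)"

# writeLines() prints the string as the regex engine receives it
writeLines(escaped)
#> \w+(?=ing)

# A raw string (assumes R >= 4.0.0) needs no doubled backslash
raw_pattern <- r"[\w+(?=ing)]"
identical(escaped, raw_pattern)
#> [1] TRUE
```

Either spelling can be passed to str_extract(); the doubled backslash only matters when typing the pattern as an ordinary R string.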

eg. 2. Positive lookahead. .+(?=@harvard\\.edu$) matches the username (everything before “@”) of any email address that ends with “@harvard.edu”. The dot in the domain is escaped as \\. so that it matches a literal period rather than any character. Note that the ending anchor $ sits inside the lookahead parentheses.

emails <- c("john.smith@harvard.edu",
            "mary_bosse@harvard.edu",
            "Mr.Cool@tesla.com")
p <- ".+(?=@harvard\\.edu$)"
str_view_all(emails, p)

Output:

[1] │ <john.smith>@harvard.edu
[2] │ <mary_bosse>@harvard.edu
[3] │ Mr.Cool@tesla.com

str_extract(emails, p)

Output:

[1] "john.smith" "mary_bosse" NA

eg. 3. Positive lookbehind. Using the same character vector emails, (?<=@).* matches the domain of an email address, i.e., a string of any length (written .*) that follows @, while the @ sign itself is not part of the match.

str_view_all(emails, "(?<=@).*")

Output:

[1] │ john.smith@<harvard.edu>
[2] │ mary_bosse@<harvard.edu>
[3] │ Mr.Cool@<tesla.com>

str_extract(emails, "(?<=@).*")

Output:

[1] "harvard.edu" "harvard.edu" "tesla.com"

eg. 4. Positive lookbehind. (?<=\\$\\s{0,1})\\d+ matches a monetary amount without including the dollar sign or the whitespace character (if any) in the match. Specifically, \\$ is a literal dollar sign, \\s{0,1} is zero or one whitespace character, and \\d+ is one or more digits.

a <- c("2 books at $ 33",
       "8 pens and 10 pencils at $10")
money <- "(?<=\\$\\s{0,1})\\d+"
str_view_all(a, money)

Output:

[1] │ 2 books at $ <33>
[2] │ 8 pens and 10 pencils at $<10>

str_extract(a, money)

Output:

[1] "33" "10"

Negative lookaround assertion

A negative lookaround assertion works much like its positive counterpart, but the other way around: for a pattern A to match, a pattern B must not be present next to it. A negative lookahead, written A(?!B), matches A only when it is not immediately followed by B; a negative lookbehind, written (?<!B)A, matches A only when it is not immediately preceded by B. The exclamation mark ! negates the assertion and takes the place of the equals sign = used in positive assertions.
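
To make the two negative forms concrete, here is a minimal sketch in the style of the positive examples. The orders vector and the pattern names are made up for illustration and are not part of the original tutorial. Note that negative assertions interact with backtracking: without the extra (?=@) and \\b below, the regex engine could satisfy the assertion by returning a shortened match such as "john.smit".

```r
library(stringr)

emails <- c("john.smith@harvard.edu",
            "mary_bosse@harvard.edu",
            "Mr.Cool@tesla.com")

# Negative lookahead: the username, but only when the domain is NOT harvard.edu.
# (?=@) pins the match to the full username before the negative check runs.
not_harvard <- "^[^@]+(?=@)(?!@harvard\\.edu$)"
non_harvard_users <- str_extract(emails, not_harvard)
non_harvard_users
#> [1] NA        NA        "Mr.Cool"

# Negative lookbehind: whole numbers that are NOT directly preceded by "$".
# \\b stops the match from starting in the middle of a number such as "$10".
orders <- c("8 pens and 10 pencils at $10", "2 books at $33")  # hypothetical data
quantity <- "(?<!\\$)\\b\\d+"
quantities <- str_extract_all(orders, quantity)
quantities
#> [[1]]
#> [1] "8"  "10"
#>
#> [[2]]
#> [1] "2"
```

str_extract_all() is used for the second pattern because one string can contain several quantities; it returns a list with one character vector per input string.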