library(stringr)
x <- c("raw_data.xlsx", "data_analysis.RData")
Escape Characters
An escape character is a character that signals that the character(s) following it should be interpreted differently, i.e., that they escape from their original meaning. The backslash \ is the most common escape character.
Escape a special character
For a special character to be used as a matching pattern (i.e., to be treated as a literal character), it has to be escaped, that is, immediately preceded by a backslash \. For example:
\( and \) match the literal left and right parentheses, respectively.
\[ and \] match the literal left and right square brackets, respectively.
\. is treated as a literal dot, instead of a wildcard.
\^ and \$ are treated as a literal caret and dollar sign, respectively, instead of position anchors.
Since the backslash is itself a special character in an R string, it must be escaped with another backslash for the regular expression to receive it literally, e.g., \\., \\^, \\$, and \\(.
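The double backslash can be checked directly in the console. The following is a minimal sketch, assuming stringr is installed; the vector `files` is a made-up input used only for illustration.

```r
library(stringr)

# The R string "\\." holds two characters: a backslash and a dot.
pattern <- "\\."
nchar(pattern)        # number of characters actually stored in the string
writeLines(pattern)   # prints the pattern as the regex engine receives it

files <- c("a.b", "ab")                  # hypothetical inputs
escaped   <- str_detect(files, "\\.")    # TRUE only where a literal dot occurs
unescaped <- str_detect(files, ".")      # TRUE wherever any character occurs
```

Here `escaped` is TRUE only for "a.b", while `unescaped` is TRUE for both strings, because an unescaped dot matches any character.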
eg.1. \\. matches a literal dot, and .* matches a string of any length (here the dot is a wildcard). Thus, \\..* matches a literal dot and all characters following it, i.e., the file extension.
str_view_all(x, "\\..*")
Output:
[1] │ raw_data<.xlsx>
[2] │ data_analysis<.RData>
str_extract(x, "\\..*")
Output:
[1] ".xlsx" ".RData"
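Note that \\..* starts at the first dot, so a file name with several dots returns everything after the first one. As a hedged variation (not part of the original example), the pattern \\.[^.]+$ keeps only the last extension: [^.]+ stands for one or more characters that are not a dot, and $ anchors the match to the end of the string. The file name "archive.tar.gz" is a hypothetical input.

```r
library(stringr)

f <- c("raw_data.xlsx", "archive.tar.gz")   # second name is hypothetical

# Everything from the first literal dot onward
after_first_dot <- str_extract(f, "\\..*")

# Only the final extension: a dot, then non-dot characters up to the end
last_extension <- str_extract(f, "\\.[^.]+$")
```

For the second file name, `after_first_dot` is ".tar.gz" while `last_extension` is ".gz"; which one is wanted depends on the use case.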
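Special characters within a character class
When special characters are used within a character class (inside a pair of square brackets), most of them are interpreted literally and do not need a backslash to escape them. The exceptions are the characters that have a meaning inside the brackets: a caret ^ in the first position (negation), a hyphen - between two characters (range), the closing bracket ], and the backslash itself.
eg.2. [$^*] matches “$”, “^”, and “*” as literal characters.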
s <- c("an book $", "carot or carat ^", "stars ** in the sky")
str_view_all(s, "[$^*]")
Output:
[1] │ an book <$>
[2] │ carot or carat <^>
[3] │ stars <*><*> in the sky
str_extract(s, "[$^*]")
Output:
[1] "$" "^" "*"
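A few characters keep a special role inside square brackets: a caret in the first position negates the class, a hyphen between two characters defines a range, and square brackets themselves still need escaping. The sketch below illustrates this with made-up inputs; the variable names are assumptions for illustration only.

```r
library(stringr)

v <- "3.14 * (2 + x)"   # hypothetical input

# Dot, star, plus and parentheses are literal inside a character class
literals <- str_extract_all(v, "[.*+()]")[[1]]

# A caret in the first position negates the class: anything but a-z
not_lower <- str_extract_all("a-1", "[^a-z]")[[1]]

# Square brackets inside a class are escaped with a (doubled) backslash
brackets <- str_extract_all("[x]", "[\\[\\]]")[[1]]
```

In eg.2 the caret is not in the first position of the class, which is why it is matched there as a literal character.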
Escape a regular letter
As demonstrated above, when a special character is escaped (preceded) with a backslash \, it is interpreted literally as the character itself. Conversely, an ordinary letter can be escaped to give it a special meaning:
\d matches a single digit
\D matches a single non-digit
\w matches a word character (alphanumeric or underscore)
\W matches a non-word character
\s matches any whitespace character
\S matches a non-whitespace character
\b matches a word boundary
\B matches a position that is not a word boundary
\t matches a tab character
\n matches a newline character
Again, a second backslash is needed to escape the first one in an R string, e.g., \\S.
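A short sketch of several of these escaped letters in use, assuming stringr is installed; the string `txt` is a made-up example.

```r
library(stringr)

txt <- "Room 42 is at 9am"   # hypothetical input

digits      <- str_extract_all(txt, "\\d+")[[1]]        # runs of digits
words       <- str_extract_all(txt, "\\b\\w+\\b")[[1]]  # whole words
n_spaces    <- str_count(txt, "\\s")                    # whitespace characters
digits_only <- str_replace_all(txt, "\\D", "")          # drop every non-digit
has_tab     <- str_detect("a\tb", "\\t")                # regex escape for a tab
```

Each pattern is written with two backslashes in the R source, so that the regular expression engine receives a single one.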
Consider the following examples.
eg.3. \\$ matches a literal dollar sign, and \\d+ matches one or more digits. As such, \\$\\d+ matches a dollar amount.
d <- c("book of $123", "price at 20% off")
str_view_all(d, "\\$\\d+")
Output:
[1] │ book of <$123>
[2] │ price at 20% off
str_extract(d, "\\$\\d+")
Output:
[1] "$123" NA
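The pattern in eg.3 stops at the first non-digit, so it would capture only the whole-dollar part of an amount such as $12.50. One possible extension, offered as an assumption rather than as part of the original example, adds an optional group for a literal dot followed by exactly two digits. The vector `amounts` is hypothetical.

```r
library(stringr)

amounts <- c("book of $123", "total $12.50 due", "price at 20% off")  # hypothetical

# \\$ literal dollar sign, \\d+ digits, (\\.\\d{2})? optional cents
with_cents <- str_extract(amounts, "\\$\\d+(\\.\\d{2})?")

# The original pattern, for comparison
whole_only <- str_extract(amounts, "\\$\\d+")
```

For the second string, `with_cents` gives "$12.50" whereas `whole_only` gives "$12".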
eg.4. \\d{3}\\. matches three consecutive digits followed by a literal dot. As such, \\d{3}\\.\\d{3}\\.\\d{4} matches a phone number in the form xxx.xxx.xxxx.
a <- c("Bob: 787.902.1068", "Mike: 910.087.1483")
p <- "\\d{3}\\.\\d{3}\\.\\d{4}"
str_view_all(a, p)
Output:
[1] │ Bob: <787.902.1068>
[2] │ Mike: <910.087.1483>
str_extract(a, p)
Output:
[1] "787.902.1068" "910.087.1483"
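The pattern in eg.4 can also match inside a longer run of digits and accepts only dots as separators. A hedged variation, combining the word boundary \\b and a character class from the earlier sections, might look as follows; the inputs and the name `p_strict` are assumptions made for illustration.

```r
library(stringr)

contacts <- c("Bob: 787.902.1068",
              "Ann: 555-123-4567",     # hypothetical: hyphen separators
              "ID 1234.567.89012")     # hypothetical: not a phone number

# \\b keeps the match from starting or ending inside a longer digit run;
# [.-] accepts either a literal dot or a hyphen as the separator
p_strict <- "\\b\\d{3}[.-]\\d{3}[.-]\\d{4}\\b"
strict <- str_extract(contacts, p_strict)

# The original pattern, for comparison
loose <- str_extract(contacts, "\\d{3}\\.\\d{3}\\.\\d{4}")
```

With the original pattern the third string yields "234.567.8901", a partial match inside a longer identifier, while the stricter pattern returns NA for it.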