library(stringr)
x <- c("melon - 4B", "Banana * B4", "Pineapple $ 3")
Literal Characters and Character Classes
Literal Characters
Literal characters are the simplest form of regular expression: each character matches itself. For example, the pattern “4B” matches only the exact substring “4B”, with the same characters in the same order and the same case.
str_view_all(x, "4B")
Output:
[1] │ melon - <4B>
[2] │ Banana * B4
[3] │ Pineapple $ 3
str_extract_all(x, "4B", simplify = TRUE)
Output:
[,1]
[1,] "4B"
[2,] ""
[3,] ""
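As a further illustration of literal matching, the sketch below uses str_detect() on the same vector x to show that character order and case both matter. The variable names has_4B, has_B4 and has_4b are only illustrative.

```r
library(stringr)

x <- c("melon - 4B", "Banana * B4", "Pineapple $ 3")

# A literal pattern must appear character for character, in the same order.
has_4B <- str_detect(x, "4B")   # TRUE FALSE FALSE
has_B4 <- str_detect(x, "B4")   # FALSE TRUE FALSE

# Matching is case-sensitive by default, so a lowercase "b" finds nothing here.
has_4b <- str_detect(x, "4b")   # FALSE FALSE FALSE
```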
Character Classes
Square brackets [ ] define a character class, which matches any single character listed inside the brackets. For example, [4B] matches each individual “4” or “B” in a string.
str_view_all(x, "[4B]")
Output:
[1] │ melon - <4><B>
[2] │ <B>anana * <B><4>
[3] │ Pineapple $ 3
str_extract_all(x, "[4B]", simplify = TRUE)
Output:
[,1] [,2] [,3]
[1,] "4" "B" ""
[2,] "B" "B" "4"
[3,] "" "" ""
A character class can also specify a range of characters, e.g., [1-5] matches any single digit from 1 to 5 in a string.
str_view_all(x, "[1-5]")
Output:
[1] │ melon - <4>B
[2] │ Banana * B<4>
[3] │ Pineapple $ <3>
str_extract_all(x, "[1-5]", simplify = TRUE)
Output:
[,1]
[1,] "4"
[2,] "4"
[3,] "3"
Likewise, [A-C] matches any one of the three uppercase letters “A”, “B”, or “C” in a string.
str_view_all(x, "[A-C]")
Output:
[1] │ melon - 4<B>
[2] │ <B>anana * <B>4
[3] │ Pineapple $ 3
str_extract_all(x, "[A-C]", simplify = TRUE)
Output:
[,1] [,2]
[1,] "B" ""
[2,] "B" "B"
[3,] "" ""
In like manner, [a-z] matches any single lowercase letter, and [a-zA-Z0-9] matches any single letter (lowercase or uppercase) or digit.
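The two ranges [a-z] and [a-zA-Z0-9] can be tried on the same vector x. This is a minimal sketch that uses str_count() to count the matching characters in each string; n_lower and n_alnum are illustrative names.

```r
library(stringr)

x <- c("melon - 4B", "Banana * B4", "Pineapple $ 3")

# Number of lowercase ASCII letters in each string.
n_lower <- str_count(x, "[a-z]")        # 5 5 8

# Number of ASCII letters (either case) and digits in each string.
n_alnum <- str_count(x, "[a-zA-Z0-9]")  # 7 8 10
```

Spaces and the symbols “-”, “*” and “$” fall outside both ranges, so they are not counted.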
Common character classes can also be written in shorthand, as a class name between colons wrapped in square brackets. The equivalences given below hold for ASCII text.
[:digit:] matches any digit, equivalent to [0-9].
[:alpha:] matches any alphabetic character (uppercase or lowercase), equivalent to [a-zA-Z].
[:lower:] matches any lowercase alphabetic character, equivalent to [a-z].
[:upper:] matches any uppercase alphabetic character, equivalent to [A-Z].
[:alnum:] matches any alphanumeric character (letter or digit), equivalent to [a-zA-Z0-9].
[:space:] matches a single whitespace character.
[:punct:] matches any punctuation character.
[:print:] matches any printable character (including alphanumeric characters and punctuation).
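A few of these shorthands can be checked against the vector x from the earlier examples. The sketch below assumes stringr's default regex engine, which accepts the bare [:name:] form as used throughout this section. The names n_upper, n_space and same_digits are illustrative.

```r
library(stringr)

x <- c("melon - 4B", "Banana * B4", "Pineapple $ 3")

# Count uppercase letters and whitespace characters in each string.
n_upper <- str_count(x, "[:upper:]")  # 1 2 1
n_space <- str_count(x, "[:space:]")  # 2 2 2

# For this ASCII input, [:digit:] and [0-9] extract the same characters.
same_digits <- identical(
  str_extract_all(x, "[:digit:]"),
  str_extract_all(x, "[0-9]")
)                                     # TRUE
```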
In the following example, [:upper:][:digit:] matches any two-character sequence consisting of an uppercase letter immediately followed by a digit.
a <- c("E=mc^2...!", "P = M * A4")
b1 <- "[:upper:][:digit:]"
str_view_all(a, b1)
Output:
[1] │ E=mc^2...!
[2] │ P = M * <A4>
str_extract_all(a, b1, simplify = TRUE)
Output:
[,1]
[1,] ""
[2,] "A4"
In contrast, if [:upper:] and [:digit:] are placed together inside character class brackets, i.e., [[:upper:][:digit:]], the pattern matches any single character that is either an uppercase letter or a digit.
b2 <- "[[:upper:][:digit:]]"
str_view_all(a, b2)
Output:
[1] │ <E>=mc^<2>...!
[2] │ <P> = <M> * <A><4>
str_extract_all(a, b2, simplify = TRUE)
Output:
[,1] [,2] [,3] [,4]
[1,] "E" "2" "" ""
[2,] "P" "M" "A" "4"
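The difference between the two patterns can be summarized by counting matches per string. This is a small sketch using str_count(); n_seq and n_any are illustrative names, and the vector a is redefined so the snippet runs on its own.

```r
library(stringr)

a <- c("E=mc^2...!", "P = M * A4")

# Sequence: an uppercase letter immediately followed by a digit.
n_seq <- str_count(a, "[:upper:][:digit:]")    # 0 1

# Single class: any one character that is an uppercase letter or a digit.
n_any <- str_count(a, "[[:upper:][:digit:]]")  # 2 4
```

The sequence pattern consumes two characters per match, while the combined class consumes one, which is why “A4” counts once in the first case and twice in the second.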