Literal Characters and Character Classes

Literal Characters

Literal characters are the simplest form of regular expression: each one matches itself. For example, “4B” matches only the exact sequence “4B” in a string, so “B4” is not a match.

library(stringr)
x <- c("melon - 4B", "Banana * B4", "Pineapple $ 3")
str_view_all(x, "4B")

Output:

[1] │ melon - <4B>
[2] │ Banana * B4
[3] │ Pineapple $ 3

str_extract_all(x, "4B", simplify = TRUE)

Output:

     [,1]
[1,] "4B"
[2,] ""
[3,] ""

Character Classes

Square brackets [ ] define a character class, which matches any single character listed inside the brackets. For example, [4B] matches each individual “4” or “B” in a string, regardless of the order in which they appear.

str_view_all(x, "[4B]")

Output:

[1] │ melon - <4><B>
[2] │ <B>anana * <B><4>
[3] │ Pineapple $ 3

str_extract_all(x, "[4B]", simplify = TRUE)

Output:

     [,1] [,2] [,3]
[1,] "4"  "B"  ""
[2,] "B"  "B"  "4"
[3,] ""   ""   ""

A character class can also contain a range of characters, e.g., [1-5] matches any single digit from 1 to 5 in a string.

str_view_all(x, "[1-5]")

Output:

[1] │ melon - <4>B
[2] │ Banana * B<4>
[3] │ Pineapple $ <3>

str_extract_all(x, "[1-5]", simplify = TRUE)

Output:

     [,1]
[1,] "4"
[2,] "4"
[3,] "3"

[A-C] matches any one of the three letters “A”, “B”, or “C” in a string.

str_view_all(x, "[A-C]") 

Output:

[1] │ melon - 4<B>
[2] │ <B>anana * <B>4
[3] │ Pineapple $ 3

str_extract_all(x, "[A-C]", simplify = TRUE)

Output:

     [,1] [,2]
[1,] "B"  ""
[2,] "B"  "B"
[3,] ""   ""

Likewise, [a-z] matches any single lowercase letter, and [a-zA-Z0-9] matches any single letter (lowercase or uppercase) or digit.
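As a rough illustration of these two ranges, the sketch below reuses the vector x from the examples above. It also calls str_count(), a stringr function that counts matches per string and is not otherwise used in this section.

```r
library(stringr)

x <- c("melon - 4B", "Banana * B4", "Pineapple $ 3")

# [a-z] picks out lowercase letters only; "B" and "4" are skipped
str_extract_all(x[1], "[a-z]")

# [a-zA-Z0-9] also accepts uppercase letters and digits,
# so only the spaces and symbols are left unmatched
str_count(x, "[a-zA-Z0-9]")
```

The first call should return the five letters of “melon”, and the second should return 7, 8, and 10 for the three strings.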

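Common character classes also have shorthand names, each written between [: and :]. The bracket equivalents listed below hold for ASCII text; stringr's regular expression engine (ICU) is Unicode-aware, so a class such as [:alpha:] also matches non-ASCII letters like “é”.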

[:digit:] matches any digit (0-9), equivalent to [0-9].

[:alpha:] matches any alphabetic character (uppercase or lowercase), equivalent to [a-zA-Z].

[:lower:] matches any lowercase alphabetic character, equivalent to [a-z].

[:upper:] matches any uppercase alphabetic character, equivalent to [A-Z].

[:alnum:] matches any alphanumeric character (letter or digit), equivalent to [a-zA-Z0-9].

[:space:] matches a single whitespace character.

[:punct:] matches any punctuation character.

[:print:] matches any printable character (including alphanumeric characters and punctuation).
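The shorthand classes without a bracket-range equivalent in the list above can be tried out in the same way as the earlier examples. The sketch below is a minimal illustration; the vector s is made up for this example and does not appear elsewhere in this section.

```r
library(stringr)

# Hypothetical sample strings, chosen only for illustration
s <- c("Hello, world!", "no punctuation here")

# [:punct:] matches each punctuation character
str_extract_all(s, "[:punct:]")

# [:space:] matches each whitespace character
str_count(s, "[:space:]")

# [:digit:] finds nothing because neither string contains a digit
str_detect(s, "[:digit:]")
```

Here the first string should yield the comma and the exclamation mark, the whitespace counts should be 1 and 2, and the digit test should be FALSE for both strings.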

In the following example, [:upper:][:digit:] matches any two-character sequence made up of an uppercase letter immediately followed by a digit.

a <- c("E=mc^2...!", "P = M * A4")
b1 <- "[:upper:][:digit:]"
str_view_all(a, b1) 

Output:

[1] │ E=mc^2...!
[2] │ P = M * <A4>

str_extract_all(a, b1, simplify = TRUE)

Output:

     [,1]
[1,] ""
[2,] "A4"

In contrast, if [:upper:] and [:digit:] are placed inside a single pair of character class brackets, i.e., [[:upper:][:digit:]], the pattern matches any single character that is either an uppercase letter or a digit.

b2 <- "[[:upper:][:digit:]]"
str_view_all(a, b2) 

Output:

[1] │ <E>=mc^<2>...!
[2] │ <P> = <M> * <A><4>

str_extract_all(a, b2, simplify = TRUE)

Output:

     [,1] [,2] [,3] [,4]
[1,] "E"  "2"  ""   ""
[2,] "P"  "M"  "A"  "4"
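
For plain ASCII input such as the vector a, the same two patterns can also be written with explicit ranges like [A-Z] and [0-9]. The sketch below is only a cross-check under that ASCII assumption; the names r1 and r2 are introduced here for illustration.

```r
library(stringr)

a <- c("E=mc^2...!", "P = M * A4")

# Sequence: an uppercase letter immediately followed by a digit
r1 <- "[A-Z][0-9]"
# Single class: one character that is an uppercase letter or a digit
r2 <- "[A-Z0-9]"

str_extract_all(a, r1, simplify = TRUE)
str_extract_all(a, r2, simplify = TRUE)

# Compare against the shorthand versions of the two patterns
identical(str_extract_all(a, r1), str_extract_all(a, "[:upper:][:digit:]"))
identical(str_extract_all(a, r2), str_extract_all(a, "[[:upper:][:digit:]]"))
```

Both comparisons should return TRUE for this input. The results could differ for text containing non-ASCII letters or digits, which the explicit ranges do not cover.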