stringr/inst/htmlwidgets/str_view.js:

HTMLWidgets.widget({
  name: 'str_view',
  type: 'output',
  initialize: function(el, width, height) {
  },
  renderValue: function(el, x, instance) {
    el.innerHTML = x.html;
  },
  resize: function(el, width, height, instance) {
  }
});

stringr/inst/htmlwidgets/lib/str_view.css:

.str_view ul,
.str_view li {
  list-style: none;
  padding: 0;
  margin: 0.5em 0;
  font-family: monospace;
}

.str_view .match {
  border: 1px solid #ccc;
  background-color: #eee;
}

stringr/inst/htmlwidgets/str_view.yaml:

dependencies:
  - name: str_view
    version: 0.1.0
    src: htmlwidgets/lib/
    stylesheet: str_view.css

stringr/inst/doc/stringr.R:

## ---- include = FALSE----------------------------------------------------
library(stringr)
knitr::opts_chunk$set(
  comment = "#>",
  collapse = TRUE
)

## ------------------------------------------------------------------------
str_length("abc")

## ------------------------------------------------------------------------
x <- c("abcdef", "ghifjk")
# The 3rd letter
str_sub(x, 3, 3)
# The 2nd to 2nd-to-last character
str_sub(x, 2, -2)

## ------------------------------------------------------------------------
str_sub(x, 3, 3) <- "X"
x

## ------------------------------------------------------------------------
str_dup(x, c(2, 3))

## ------------------------------------------------------------------------
x <- c("abc", "defghi")
str_pad(x, 10) # default pads on left
str_pad(x, 10, "both")

## ------------------------------------------------------------------------
str_pad(x, 4)

## ------------------------------------------------------------------------
x <- c("Short", "This is a long string")
x %>%
  str_trunc(10) %>%
  str_pad(10, "right")

## ------------------------------------------------------------------------
x <- c(" a ", "b ", " c")
str_trim(x)
str_trim(x, "left")

## ------------------------------------------------------------------------
jabberwocky <- str_c(
  "`Twas brillig, and the slithy toves ",
  "did gyre and gimble in the wabe: ",
  "All mimsy were the borogoves, ",
  "and the mome raths outgrabe. "
)
cat(str_wrap(jabberwocky, width = 40))

## ------------------------------------------------------------------------
x <- "I like horses."
str_to_upper(x)
str_to_title(x)
str_to_lower(x)
# Turkish has two sorts of i: with and without the dot
str_to_lower(x, "tr")

## ------------------------------------------------------------------------
x <- c("y", "i", "k")
str_order(x)
str_sort(x)
# In Lithuanian, y comes between i and k
str_sort(x, locale = "lt")

## ------------------------------------------------------------------------
strings <- c(
  "apple",
  "219 733 8965",
  "329-293-8753",
  "Work: 579-499-7527; Home: 543.355.3679"
)
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

## ------------------------------------------------------------------------
# Which strings contain phone numbers?
str_detect(strings, phone)
str_subset(strings, phone)

## ------------------------------------------------------------------------
# How many phone numbers in each string?
str_count(strings, phone)

## ------------------------------------------------------------------------
# Where in the string is the phone number located?
(loc <- str_locate(strings, phone))
str_locate_all(strings, phone)

## ------------------------------------------------------------------------
# What are the phone numbers?
str_extract(strings, phone)
str_extract_all(strings, phone)
str_extract_all(strings, phone, simplify = TRUE)

## ------------------------------------------------------------------------
# Pull out the three components of the match
str_match(strings, phone)
str_match_all(strings, phone)

## ------------------------------------------------------------------------
str_replace(strings, phone, "XXX-XXX-XXXX")
str_replace_all(strings, phone, "XXX-XXX-XXXX")

## ------------------------------------------------------------------------
str_split("a-b-c", "-")
str_split_fixed("a-b-c", "-", n = 2)

## ------------------------------------------------------------------------
a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
a1 == a2

## ------------------------------------------------------------------------
str_detect(a1, fixed(a2))
str_detect(a1, coll(a2))

## ------------------------------------------------------------------------
i <- c("I", "İ", "i", "ı")
i
str_subset(i, coll("i", ignore_case = TRUE))
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))

## ------------------------------------------------------------------------
x <- "This is a sentence."
str_split(x, boundary("word"))
str_count(x, boundary("word"))
str_extract_all(x, boundary("word"))

## ------------------------------------------------------------------------
str_split(x, "")
str_count(x, "")

stringr/inst/doc/stringr.html:
There are four main families of functions in stringr:
Character manipulation: these functions allow you to manipulate individual characters within the strings in character vectors.
Whitespace tools to add, remove, and manipulate whitespace.
Locale sensitive operations, whose behaviour varies from locale to locale.
Pattern matching functions. These recognise four engines of pattern description. The most common is regular expressions, but there are three other tools.
You can get the length of the string with str_length()
:
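For example (this chunk appears in the vignette source above; the expected output is shown as a comment):
str_length("abc")
#> [1] 3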
This is now equivalent to the base R function nchar()
. Previously str_length() was needed to work around issues with nchar()
such as the fact that it returned 2 for nchar(NA)
. This has been fixed as of R 3.3.0, so it is no longer so important.
You can access individual characters using str_sub()
. It takes three arguments: a character vector, a start
position and an end
position. Each position can be a positive integer, which counts from the left, or a negative integer, which counts from the right. The positions are inclusive, and if longer than the string, will be silently truncated.
x <- c("abcdef", "ghifjk")
# The 3rd letter
str_sub(x, 3, 3)
#> [1] "c" "i"
# The 2nd to 2nd-to-last character
str_sub(x, 2, -2)
#> [1] "bcde" "hifj"
You can also use str_sub()
to modify strings:
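The corresponding chunk from the vignette source, continuing with the x defined above:
str_sub(x, 3, 3) <- "X"
x
#> [1] "abXdef" "ghXfjk"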
To duplicate individual strings, you can use str_dup()
:
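Continuing with the modified x, each element is repeated the corresponding number of times:
str_dup(x, c(2, 3))
#> [1] "abXdefabXdef"       "ghXfjkghXfjkghXfjk"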
Three functions add, remove, or modify whitespace:
str_pad()
pads a string to a fixed length by adding extra whitespace on the left, right, or both sides.
x <- c("abc", "defghi")
str_pad(x, 10) # default pads on left
#> [1] " abc" " defghi"
str_pad(x, 10, "both")
#> [1] " abc " " defghi "
(You can pad with other characters by using the pad
argument.)
str_pad()
will never make a string shorter:
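For example, asking for width 4 pads the short string but leaves the longer one untouched:
str_pad(x, 4)
#> [1] " abc"   "defghi"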
So if you want to ensure that all strings are the same length (often useful for print methods), combine str_pad()
and str_trunc()
:
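A sketch of that combination, as in the vignette source (str_trunc() uses "..." as its default ellipsis):
x <- c("Short", "This is a long string")
x %>%
  str_trunc(10) %>%
  str_pad(10, "right")
#> [1] "Short     " "This is..."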
The opposite of str_pad()
is str_trim()
, which removes leading and trailing whitespace:
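For example:
x <- c(" a ", "b ", " c")
str_trim(x)
#> [1] "a" "b" "c"
str_trim(x, "left")
#> [1] "a " "b " "c"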
You can use str_wrap()
to modify existing whitespace in order to wrap a paragraph of text, such that the length of each line is as similar as possible.
jabberwocky <- str_c(
"`Twas brillig, and the slithy toves ",
"did gyre and gimble in the wabe: ",
"All mimsy were the borogoves, ",
"and the mome raths outgrabe. "
)
cat(str_wrap(jabberwocky, width = 40))
#> `Twas brillig, and the slithy toves did
#> gyre and gimble in the wabe: All mimsy
#> were the borogoves, and the mome raths
#> outgrabe.
A handful of stringr functions are locale-sensitive: they will perform differently in different regions of the world. The first of these are the case transformation functions:
x <- "I like horses."
str_to_upper(x)
#> [1] "I LIKE HORSES."
str_to_title(x)
#> [1] "I Like Horses."
str_to_lower(x)
#> [1] "i like horses."
# Turkish has two sorts of i: with and without the dot
str_to_lower(x, "tr")
#> [1] "ı like horses."
String ordering and sorting:
x <- c("y", "i", "k")
str_order(x)
#> [1] 2 3 1
str_sort(x)
#> [1] "i" "k" "y"
# In Lithuanian, y comes between i and k
str_sort(x, locale = "lt")
#> [1] "i" "y" "k"
The locale always defaults to English to ensure that the default behaviour is identical across systems. Locales always include a two letter ISO-639-1 language code (like “en” for English or “zh” for Chinese), and optionally an ISO-3166 country code (like “en_UK” vs “en_US”). You can see a complete list of available locales by running stringi::stri_locale_list()
.
The vast majority of stringr functions work with patterns. These are parameterised by the task they perform and the types of patterns they match.
Each pattern matching function has the same first two arguments, a character vector of string
s to process and a single pattern
to match. stringr provides pattern matching functions to detect, locate, extract, match, replace, and split strings. I’ll illustrate how they work with some strings and a regular expression designed to match (US) phone numbers:
strings <- c(
"apple",
"219 733 8965",
"329-293-8753",
"Work: 579-499-7527; Home: 543.355.3679"
)
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
str_detect()
detects the presence or absence of a pattern and returns a logical vector (similar to grepl()
). str_subset()
returns the elements of a character vector that match a regular expression (similar to grep()
with value = TRUE
).
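For example, with the strings and phone pattern defined above:
# Which strings contain phone numbers?
str_detect(strings, phone)
#> [1] FALSE  TRUE  TRUE  TRUE
str_subset(strings, phone)
#> [1] "219 733 8965"
#> [2] "329-293-8753"
#> [3] "Work: 579-499-7527; Home: 543.355.3679"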
str_count()
counts the number of matches:
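# How many phone numbers in each string?
str_count(strings, phone)
#> [1] 0 1 1 2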
str_locate()
locates the first position of a pattern and returns a numeric matrix with columns start and end. str_locate_all()
locates all matches, returning a list of numeric matrices. Similar to regexpr()
and gregexpr()
.
# Where in the string is the phone number located?
(loc <- str_locate(strings, phone))
#> start end
#> [1,] NA NA
#> [2,] 1 12
#> [3,] 1 12
#> [4,] 7 18
str_locate_all(strings, phone)
#> [[1]]
#> start end
#>
#> [[2]]
#> start end
#> [1,] 1 12
#>
#> [[3]]
#> start end
#> [1,] 1 12
#>
#> [[4]]
#> start end
#> [1,] 7 18
#> [2,] 27 38
str_extract()
extracts text corresponding to the first match, returning a character vector. str_extract_all()
extracts all matches and returns a list of character vectors.
# What are the phone numbers?
str_extract(strings, phone)
#> [1] NA "219 733 8965" "329-293-8753" "579-499-7527"
str_extract_all(strings, phone)
#> [[1]]
#> character(0)
#>
#> [[2]]
#> [1] "219 733 8965"
#>
#> [[3]]
#> [1] "329-293-8753"
#>
#> [[4]]
#> [1] "579-499-7527" "543.355.3679"
str_extract_all(strings, phone, simplify = TRUE)
#> [,1] [,2]
#> [1,] "" ""
#> [2,] "219 733 8965" ""
#> [3,] "329-293-8753" ""
#> [4,] "579-499-7527" "543.355.3679"
str_match()
extracts capture groups formed by ()
from the first match. It returns a character matrix with one column for the complete match and one column for each group. str_match_all()
extracts capture groups from all matches and returns a list of character matrices. Similar to regmatches()
.
# Pull out the three components of the match
str_match(strings, phone)
#> [,1] [,2] [,3] [,4]
#> [1,] NA NA NA NA
#> [2,] "219 733 8965" "219" "733" "8965"
#> [3,] "329-293-8753" "329" "293" "8753"
#> [4,] "579-499-7527" "579" "499" "7527"
str_match_all(strings, phone)
#> [[1]]
#> [,1] [,2] [,3] [,4]
#>
#> [[2]]
#> [,1] [,2] [,3] [,4]
#> [1,] "219 733 8965" "219" "733" "8965"
#>
#> [[3]]
#> [,1] [,2] [,3] [,4]
#> [1,] "329-293-8753" "329" "293" "8753"
#>
#> [[4]]
#> [,1] [,2] [,3] [,4]
#> [1,] "579-499-7527" "579" "499" "7527"
#> [2,] "543.355.3679" "543" "355" "3679"
str_replace()
replaces the first matched pattern and returns a character vector. str_replace_all()
replaces all matches. Similar to sub()
and gsub()
.
str_replace(strings, phone, "XXX-XXX-XXXX")
#> [1] "apple"
#> [2] "XXX-XXX-XXXX"
#> [3] "XXX-XXX-XXXX"
#> [4] "Work: XXX-XXX-XXXX; Home: 543.355.3679"
str_replace_all(strings, phone, "XXX-XXX-XXXX")
#> [1] "apple"
#> [2] "XXX-XXX-XXXX"
#> [3] "XXX-XXX-XXXX"
#> [4] "Work: XXX-XXX-XXXX; Home: XXX-XXX-XXXX"
str_split_fixed()
splits a string into a fixed number of pieces based on a pattern and returns a character matrix. str_split()
splits a string into a variable number of pieces and returns a list of character vectors.
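A minimal example:
str_split("a-b-c", "-")
#> [[1]]
#> [1] "a" "b" "c"
str_split_fixed("a-b-c", "-", n = 2)
#>      [,1] [,2]
#> [1,] "a"  "b-c"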
There are four main engines that stringr can use to describe patterns:
Regular expressions, the default, as shown above, and described in vignette("regular-expressions")
.
Fixed bytewise matching, with fixed()
.
Locale-sensitive character matching, with coll().
Text boundary analysis with boundary()
.
fixed(x)
only matches the exact sequence of bytes specified by x
. This is a very limited “pattern”, but the restriction can make matching much faster. Beware using fixed()
with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define “á”: either as a single character or as an “a” plus an accent:
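The two encodings, as in the vignette source:
a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
#> [1] "á" "á"
a1 == a2
#> [1] FALSE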
They render identically, but because they’re defined differently, fixed()
doesn’t find a match. Instead, you can use coll()
, explained below, to respect human character comparison rules:
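For example:
str_detect(a1, fixed(a2))
#> [1] FALSE
str_detect(a1, coll(a2))
#> [1] TRUE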
coll(x)
looks for a match to x
using human-language collation rules, and is particularly important if you want to do case insensitive matching. Collation rules differ around the world, so you’ll also need to supply a locale
parameter.
i <- c("I", "İ", "i", "ı")
i
#> [1] "I" "İ" "i" "ı"
str_subset(i, coll("i", ignore_case = TRUE))
#> [1] "I" "i"
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
#> [1] "İ" "i"
The downside of coll()
is speed. Because the rules for recognising which characters are the same are complicated, coll()
is relatively slow compared to regex()
and fixed()
. Note that while both fixed()
and regex()
have ignore_case
arguments, they perform a much simpler comparison than coll()
.
boundary()
matches boundaries between characters, lines, sentences or words. It’s most useful with str_split()
, but can be used with all pattern matching functions:
x <- "This is a sentence."
str_split(x, boundary("word"))
#> [[1]]
#> [1] "This" "is" "a" "sentence"
str_count(x, boundary("word"))
#> [1] 4
str_extract_all(x, boundary("word"))
#> [[1]]
#> [1] "This" "is" "a" "sentence"
By convention, ""
is treated as boundary("character")
:
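For example, splitting and counting with an empty pattern works character by character:
str_split(x, "")
#> [[1]]
#>  [1] "T" "h" "i" "s" " " "i" "s" " " "a" " " "s" "e" "n" "t" "e" "n" "c" "e" "."
str_count(x, "")
#> [1] 19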
Regular expressions are a concise and flexible tool for describing patterns in strings. This vignette describes the key features of stringr’s regular expressions, as implemented by stringi. It is not a tutorial, so if you’re unfamiliar with regular expressions, I’d recommend starting at http://r4ds.had.co.nz/strings.html. If you want to master the details, I’d recommend reading the classic Mastering Regular Expressions by Jeffrey E. F. Friedl.
Regular expressions are the default pattern engine in stringr. That means when you use a pattern matching function with a bare string, it’s equivalent to wrapping it in a call to regex()
:
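A small illustration (any pattern matching function behaves the same way):
str_detect("banana", "an")
#> [1] TRUE
str_detect("banana", regex("an"))
#> [1] TRUE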
You will need to use regex()
explicitly if you want to override the default options, as you’ll see in examples below.
The simplest patterns match exact strings:
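For example (a small vector used for the next few examples):
x <- c("apple", "banana", "pear")
str_extract(x, "an")
#> [1] NA   "an" NA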
You can perform a case-insensitive match using ignore_case = TRUE
:
bananas <- c("banana", "Banana", "BANANA")
str_detect(bananas, "banana")
#> [1] TRUE FALSE FALSE
str_detect(bananas, regex("banana", ignore_case = TRUE))
#> [1] TRUE TRUE TRUE
The next step up in complexity is .
, which matches any character except a newline:
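Continuing with the same vector:
str_extract(x, ".a.")
#> [1] NA    "ban" "ear"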
You can allow .
to match everything, including \n
, by setting dotall = TRUE
:
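A minimal illustration (the newline is the only character on either side of the X here):
str_detect("\nX\n", ".X.")
#> [1] FALSE
str_detect("\nX\n", regex(".X.", dotall = TRUE))
#> [1] TRUE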
If “.
” matches any character, how do you match a literal “.
”? You need to use an “escape” to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, \
, to escape special behaviour. So to match an .
, you need the regexp \.
. Unfortunately this creates a problem. We use strings to represent regular expressions, and \
is also used as an escape symbol in strings. So to create the regular expression \.
we need the string "\\."
.
# To create the regular expression, we need \\
dot <- "\\."
# But the expression itself only contains one:
writeLines(dot)
#> \.
# And this tells R to look for an explicit .
str_extract(c("abc", "a.c", "bef"), "a\\.c")
#> [1] NA "a.c" NA
If \
is used as an escape character in regular expressions, how do you match a literal \
? Well you need to escape it, creating the regular expression \\
. To create that regular expression, you need to use a string, which also needs to escape \
. That means to match a literal \
you need to write "\\\\"
— you need four backslashes to match one!
In this vignette, I use \.
to denote the regular expression, and "\\."
to denote the string that represents the regular expression.
An alternative quoting mechanism is \Q...\E
: all the characters in ...
are treated as exact matches. This is useful if you want to exactly match user input as part of a regular expression.
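A sketch of that idea, treating user input as a literal prefix (the variable names here are just for illustration):
x <- c("a.b.c.d", "aeb")
starts_with <- "a.b"
str_detect(x, paste0("^", starts_with))
#> [1] TRUE TRUE
str_detect(x, paste0("^\\Q", starts_with, "\\E"))
#> [1]  TRUE FALSE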
Escapes also allow you to specify individual characters that are otherwise hard to type. You can specify individual unicode characters in five ways, either as a variable number of hex digits (four is most common), or by name:
\xhh
: 2 hex digits.
\x{hhhh}
: 1-6 hex digits.
\uhhhh
: 4 hex digits.
\Uhhhhhhhh
: 8 hex digits.
\N{name}
, e.g. \N{grinning face}
matches the basic smiling emoji.
Similarly, you can specify many common control characters:
\a
: bell.
\cX
: match a control-X character.
\e
: escape (\u001B
).
\f
: form feed (\u000C
).
\n
: line feed (\u000A
).
\r
: carriage return (\u000D
).
\t
: horizontal tabulation (\u0009
).
\0ooo
: matches an octal character. ‘ooo’ is from one to three octal digits, from 000 to 0377. The leading zero is required.
(Many of these are only of historical interest and are only included here for the sake of completeness.)
There are a number of patterns that match more than one character. You’ve already seen .
, which matches any character (except a newline). A closely related operator is \X
, which matches a grapheme cluster, a set of individual elements that form a single symbol. For example, one way of representing “á” is as the letter “a” plus an accent: .
will match the component “a”, while \X
will match the complete symbol:
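For example:
x <- "a\u0301"
str_extract(x, ".")
#> [1] "a"
str_extract(x, "\\X")
#> [1] "á"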
There are five other escaped pairs that match narrower classes of characters:
\d
: matches any digit. The complement, \D
, matches any character that is not a decimal digit.
Technically, \d
includes any character in the Unicode Category of Nd (“Number, Decimal Digit”), which also includes numeric symbols from other languages:
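For example, an Arabic-Indic digit is still matched (a small illustration, assuming the ICU default where \d covers the Nd category):
str_detect("\u0669", "\\d")
#> [1] TRUE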
\s
: matches any whitespace. This includes tabs, newlines, form feeds, and any character in the Unicode Z Category (which includes a variety of space characters and other separators). The complement, \S
, matches any non-whitespace character.
\p{property name}
matches any character with specific unicode property, like \p{Uppercase}
or \p{Diacritic}
. The complement, \P{property name}
, matches all characters without the property. A complete list of unicode properties can be found at http://www.unicode.org/reports/tr44/#Property_Index.
\w
matches any “word” character, which includes alphabetic characters, marks and decimal numbers. The complement, \W
, matches any non-word character.
str_extract_all("Don't eat that!", "\\w+")[[1]]
#> [1] "Don" "t" "eat" "that"
str_split("Don't eat that!", "\\W")[[1]]
#> [1] "Don" "t" "eat" "that" ""
Technically, \w
also matches connector punctuation, \u200c
(zero width connector), and \u200d
(zero width joiner), but these are rarely seen in the wild.
\b
matches word boundaries, the transition between word and non-word characters. \B
matches the opposite: any position that is not a word boundary, for example between two word characters or between two non-word characters.
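One way to see the difference is to replace each boundary with an underscore (a small sketch; the outputs are what an ICU-based engine should return):
str_replace_all("The quick brown fox", "\\b", "_")
#> [1] "_The_ _quick_ _brown_ _fox_"
str_replace_all("The quick brown fox", "\\B", "_")
#> [1] "T_h_e q_u_i_c_k b_r_o_w_n f_o_x"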
You can also create your own character classes using []
:
[abc]
: matches a, b, or c.[a-z]
: matches every character between a and z (in Unicode code point order).[^abc]
: matches anything except a, b, or c.[\^\-]
: matches ^
or -
.There are a number of pre-built classes that you can use inside []
:
[:punct:]
: punctuation.[:alpha:]
: letters.[:lower:]
: lowercase letters.[:upper:]
: uppercase letters.[:digit:]
: digits.[:xdigit:]
: hex digits.[:alnum:]
: letters and numbers.[:cntrl:]
: control characters.[:graph:]
: letters, numbers, and punctuation.[:print:]
: letters, numbers, punctuation, and whitespace.[:space:]
: space characters (basically equivalent to \s
).[:blank:]
: space and tab.These all go inside the []
for character classes, i.e. [[:digit:]AX]
matches all digits, A, and X.
You can also use Unicode properties, like [\p{Letter}]
, and various set operations, like [\p{Letter}--\p{script=latin}]
. See ?"stringi-search-charclass"
for details.
|
is the alternation operator, which will pick between one or more possible matches. For example, abc|def
will match abc
or def
.
Note that the precedence for |
is low, so that abc|def
matches abc
or def
, not abcef or abdef
.
You can use parentheses to override the default precedence rules:
str_extract(c("grey", "gray"), "gre|ay")
#> [1] "gre" "ay"
str_extract(c("grey", "gray"), "gr(e|a)y")
#> [1] "grey" "gray"
Parentheses also define “groups” that you can refer to with backreferences, like \1
, \2
etc, and can be extracted with str_match()
. For example, the following regular expression finds all fruits that have a repeated pair of letters:
pattern <- "(..)\\1"
fruit %>%
str_subset(pattern)
#> [1] "banana" "coconut" "cucumber" "jujube" "papaya"
#> [6] "salal berry"
fruit %>%
str_subset(pattern) %>%
str_match(pattern)
#> [,1] [,2]
#> [1,] "anan" "an"
#> [2,] "coco" "co"
#> [3,] "cucu" "cu"
#> [4,] "juju" "ju"
#> [5,] "papa" "pa"
#> [6,] "alal" "al"
You can use (?:...)
, the non-grouping parentheses, to control precedence but not capture the match in a group. This is slightly more efficient than capturing parentheses.
str_match(c("grey", "gray"), "gr(e|a)y")
#> [,1] [,2]
#> [1,] "grey" "e"
#> [2,] "gray" "a"
str_match(c("grey", "gray"), "gr(?:e|a)y")
#> [,1]
#> [1,] "grey"
#> [2,] "gray"
This is most useful for more complex cases where you need to capture matches and control precedence independently.
By default, regular expressions will match any part of a string. It’s often useful to anchor the regular expression so that it matches from the start or end of the string:
^
matches the start of string.$
matches the end of the string.x <- c("apple", "banana", "pear")
str_extract(x, "^a")
#> [1] "a" NA NA
str_extract(x, "a$")
#> [1] NA "a" NA
To match a literal “$” or “^”, you need to escape them, \$
, and \^
.
For multiline strings, you can use regex(multiline = TRUE)
. This changes the behaviour of ^
and $
, and introduces three new operators (see the example after the list):
^
now matches the start of each line.
$
now matches the end of each line.
\A
matches the start of the input.
\z
matches the end of the input.
\Z
matches the end of the input, but before the final line terminator, if it exists.
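A sketch of the difference (x here is just an illustrative string with embedded newlines; outputs are what an ICU-based engine should return):
x <- "Line 1\nLine 2\nLine 3\n"
str_extract_all(x, "^Line..")[[1]]
#> [1] "Line 1"
str_extract_all(x, regex("^Line..", multiline = TRUE))[[1]]
#> [1] "Line 1" "Line 2" "Line 3"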
You can control how many times a pattern matches with the repetition operators:
?
: 0 or 1.+
: 1 or more.*
: 0 or more.x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_extract(x, "CC?")
#> [1] "CC"
str_extract(x, "CC+")
#> [1] "CCC"
str_extract(x, 'C[LX]+')
#> [1] "CLXXX"
Note that the precedence of these operators is high, so you can write: colou?r
to match either American or British spellings. That means most uses will need parentheses, like bana(na)+
.
You can also specify the number of matches precisely:
{n}
: exactly n{n,}
: n or more{n,m}
: between n and mstr_extract(x, "C{2}")
#> [1] "CC"
str_extract(x, "C{2,}")
#> [1] "CCC"
str_extract(x, "C{2,3}")
#> [1] "CCC"
By default these matches are “greedy”: they will match the longest string possible. You can make them “lazy”, matching the shortest string possible by putting a ?
after them:
??
: 0 or 1, prefer 0.+?
: 1 or more, match as few times as possible.*?
: 0 or more, match as few times as possible.{n,}?
: n or more, match as few times as possible.{n,m}?
: between n and m, match as few times as possible, but at least n.
#> [1] "CCC" "CC"
str_extract(x, c("C[LX]+", "C[LX]+?"))
#> [1] "CLXXX" "CL"
You can also make the matches possessive by putting a +
after them, which means that if later parts of the match fail, the repetition will not be re-tried with a smaller number of characters. This is an advanced feature used to improve performance in worst-case scenarios (called “catastrophic backtracking”).
?+
: 0 or 1, possessive.++
: 1 or more, possessive.*+
: 0 or more, possessive.{n}+
: exactly n, possessive.{n,}+
: n or more, possessive.{n,m}+
: between n and m, possessive.A related concept is the atomic-match parenthesis, (?>...)
. If a later match fails and the engine needs to back-track, an atomic match is kept as is: it succeeds or fails as a whole. Compare the following two regular expressions:
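For example (a sketch; both patterns attempt to match the string "ABC"):
# Atomic group: once A is matched, the engine cannot go back and try .B
str_detect("ABC", "(?>A|.B)C")
#> [1] FALSE
# Regular group: the engine back-tracks and the .B alternative succeeds
str_detect("ABC", "(?:A|.B)C")
#> [1] TRUE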
The atomic match fails because it matches A, and then the next character is a C so it fails. The regular match succeeds because it matches A, but then C doesn’t match, so it back-tracks and tries B instead.
These assertions look ahead or behind the current match without “consuming” any characters (i.e. changing the input position).
(?=...)
: positive look-ahead assertion. Matches if ...
matches at the current input.
(?!...)
: negative look-ahead assertion. Matches if ...
does not match at the current input.
(?<=...)
: positive look-behind assertion. Matches if ...
matches text preceding the current position, with the last character of the match being the character just before the current position. Length must be bounded
(i.e. no *
or +
).
(?<!...)
: negative look-behind assertion. Matches if ...
does not match text preceding the current position. Length must be bounded
(i.e. no *
or +
).
These are useful when you want to check that a pattern exists, but you don’t want to include it in the result:
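For example, a look-ahead can pull out just the numbers that are followed by a unit, and a look-behind just the number after a symbol (a small sketch, not taken from the original examples):
str_extract_all("1 piece, 2 pieces, 3", "\\d+(?= pieces?)")[[1]]
#> [1] "1" "2"
str_extract("price: $400", "(?<=\\$)\\d+")
#> [1] "400"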
Comments
There are two ways to include comments in a regular expression. The first is with
(?#...)
: The second is to use
regex(comments = TRUE)
. This form ignores spaces and newlines, and everything after
. To match a literal space, you’ll need to escape it:"\\ "
. This is a useful way of describing complex regular expressions: