957 lines
32 KiB
Racket
957 lines
32 KiB
Racket
#lang scribble/doc
|
|
@(require scribble/manual
|
|
scribble/eval
|
|
"guide-utils.ss")
|
|
|
|
@title[#:tag "regexp" #:style 'toc]{Regular Expressions}
|
|
|
|
@margin-note{This chapter is a modified version of @cite["Sitaram05"].}
|
|
|
|
A @deftech{regexp} value encapsulates a pattern that is described by a
|
|
string or @tech{byte string}. The regexp matcher tries to match this
|
|
pattern against (a portion of) another string or byte string, which we
|
|
will call the @deftech{text string}, when you call functions like
|
|
@scheme[regexp-match]. The text string is treated as raw text, and
|
|
not as a pattern.
|
|
|
|
@local-table-of-contents[]
|
|
|
|
@refdetails["regexp"]{regexps}
|
|
|
|
@; ----------------------------------------
|
|
|
|
@section[#:tag "regexp-intro"]{Writing Regexp Patterns}
|
|
|
|
A string or @tech{byte string} can be used directly as a @tech{regexp}
|
|
pattern, or it can be prefixed with @litchar{#rx} to form a literal
|
|
@tech{regexp} value. For example, @scheme[#rx"abc"] is a string-based
|
|
@tech{regexp} value, and @scheme[#rx#"abc"] is a @tech{byte
|
|
string}-based @tech{regexp} value. Alternately, a string or byte
|
|
string can be prefixed with @litchar{#px}, as in @scheme[#px"abc"],
|
|
for a slightly extended syntax of patterns within the string.
|
|
|
|
Most of the characters in a @tech{regexp} pattern are meant to match
|
|
occurrences of themselves in the @tech{text string}. Thus, the pattern
|
|
@scheme[#rx"abc"] matches a string that contains the characters
|
|
@litchar{a}, @litchar{b}, and @litchar{c} in succession. Other
|
|
characters act as @deftech{metacharacters}, and some character
|
|
sequences act as @deftech{metasequences}. That is, they specify
|
|
something other than their literal selves. For example, in the
|
|
pattern @scheme[#rx"a.c"], the characters @litchar{a} and @litchar{c}
|
|
stand for themselves, but the @tech{metacharacter} @litchar{.} can
|
|
match @emph{any} character. Therefore, the pattern @scheme[#rx"a.c"]
|
|
matches an @litchar{a}, any character, and @litchar{c} in succession.
|
|
|
|
@margin-note{When we want a literal @litchar{\} inside a Scheme string
|
|
or regexp literal, we must escape it so that it shows up in the string
|
|
at all. Scheme strings use @litchar{\} as the escape character, so we
|
|
end up with two @litchar{\}s: one Scheme-string @litchar{\} to escape
|
|
the regexp @litchar{\}, which then escapes the @litchar{.}. Another
|
|
character that would need escaping inside a Scheme string is
|
|
@litchar{"}.}
|
|
|
|
If we needed to match the character @litchar{.} itself, we can escape
|
|
it by precede it with a @litchar{\}. The character sequence
|
|
@litchar{\.} is thus a @tech{metasequence}, since it doesn't match
|
|
itself but rather just @litchar{.}. So, to match @litchar{a},
|
|
@litchar{.}, and @litchar{c} in succession, we use the regexp pattern
|
|
@scheme[#rx"a\\.c"]; the double @litchar{\} is an artifact of Scheme
|
|
strings, not the @tech{regexp} pattern itself.
|
|
|
|
The @scheme[regexp] function takes a string or byte string and
|
|
produces a @tech{regexp} value. Use @scheme[regexp] when you construct
|
|
a pattern to be matched against multiple strings, since a pattern is
|
|
compiled to a @tech{regexp} value before it can be used in a match.
|
|
The @scheme[pregexp] function is like @scheme[regexp], but using the
|
|
extended syntax. Regexp values as literals with @litchar{#rx} or
|
|
@litchar{#px} are compiled once and for all when they are read.
|
|
|
|
|
|
The @scheme[regexp-quote] function takes an arbitrary string and
|
|
returns a string for a pattern that matches exactly the original
|
|
string. In particular, characters in the input string that could serve
|
|
as regexp metacharacters are escaped with a backslash, so that they
|
|
safely match only themselves.
|
|
|
|
@interaction[
|
|
(regexp-quote "cons")
|
|
(regexp-quote "list?")
|
|
]
|
|
|
|
The @scheme[regexp-quote] function is useful when building a composite
|
|
@tech{regexp} from a mix of @tech{regexp} strings and verbatim strings.
|
|
|
|
|
|
@; ----------------------------------------
|
|
|
|
@section[#:tag "regexp-match"]{Matching Regexp Patterns}
|
|
|
|
The @scheme[regexp-match-positions] function takes a @tech{regexp}
|
|
pattern and a @tech{text string}, and it returns a match if the regexp
|
|
matches (some part of) the @tech{text string}, or @scheme[#f] if the regexp
|
|
did not match the string. A successful match produces a list of
|
|
@deftech{index pairs}.
|
|
|
|
@examples[
|
|
(regexp-match-positions #rx"brain" "bird")
|
|
(regexp-match-positions #rx"needle" "hay needle stack")
|
|
]
|
|
|
|
In the second example, the integers @scheme[4] and @scheme[10]
|
|
identify the substring that was matched. The @scheme[4] is the
|
|
starting (inclusive) index, and @scheme[10] the ending (exclusive)
|
|
index of the matching substring:
|
|
|
|
@interaction[
|
|
(substring "hay needle stack" 4 10)
|
|
]
|
|
|
|
In this first example, @scheme[regexp-match-positions]'s return list
|
|
contains only one index pair, and that pair represents the entire
|
|
substring matched by the regexp. When we discuss @tech{subpatterns}
|
|
later, we will see how a single match operation can yield a list of
|
|
@tech{submatch}es.
|
|
|
|
The @scheme[regexp-match-positions] function takes optional third and
|
|
fourth arguments that specify the indices of the @tech{text string} within
|
|
which the matching should take place.
|
|
|
|
@interaction[
|
|
(regexp-match-positions
|
|
#rx"needle"
|
|
"his needle stack -- my needle stack -- her needle stack"
|
|
20 39)
|
|
]
|
|
|
|
Note that the returned indices are still reckoned relative to the full
|
|
@tech{text string}.
|
|
|
|
The @scheme[regexp-match] function is like
|
|
@scheme[regexp-match-positions], but instead of returning index pairs,
|
|
it returns the matching substrings:
|
|
|
|
@interaction[
|
|
(regexp-match #rx"brain" "bird")
|
|
(regexp-match #rx"needle" "hay needle stack")
|
|
]
|
|
|
|
When @scheme[regexp-match] is used with byte-string regexp, the result
|
|
is a matching byte substring:
|
|
|
|
@interaction[
|
|
(regexp-match #rx#"needle" #"hay needle stack")
|
|
]
|
|
|
|
@margin-note{A byte-string regexp can be applied to a string, and a
|
|
string regexp can be applied to a byte string. In both
|
|
cases, the result is a byte string. Internally, all
|
|
regexp matching is in terms of bytes, and a string regexp
|
|
is expanded to a regexp that matches UTF-8 encodings of
|
|
characters. For maximum efficiency, use byte-string
|
|
matching instead of string, since matching bytes directly
|
|
avoids UTF-8 encodings.}
|
|
|
|
If you have data that is in a port, there's no need to first read it
|
|
into a string. Functions like @scheme[regexp-match] can match on the
|
|
port directly:
|
|
|
|
@interaction[
|
|
(define-values (i o) (make-pipe))
|
|
(write "hay needle stack" o)
|
|
(close-output-port o)
|
|
(regexp-match #rx#"needle" i)
|
|
]
|
|
|
|
The @scheme[regexp-match?] function is like
|
|
@scheme[regexp-match-positions], but simply returns a boolean
|
|
indicating whether the match succeeded:
|
|
|
|
@interaction[
|
|
(regexp-match? #rx"brain" "bird")
|
|
(regexp-match? #rx"needle" "hay needle stack")
|
|
]
|
|
|
|
The @scheme[regexp-split] function takes two arguments, a
|
|
@tech{regexp} pattern and a text string, and it returns a list of
|
|
substrings of the text string; the pattern identifies the delimiter
|
|
separating the substrings.
|
|
|
|
@interaction[
|
|
(regexp-split #rx":" "/bin:/usr/bin:/usr/bin/X11:/usr/local/bin")
|
|
(regexp-split #rx" " "pea soup")
|
|
]
|
|
|
|
If the first argument matches empty strings, then the list of all the
|
|
single-character substrings is returned.
|
|
|
|
@interaction[
|
|
(regexp-split #rx"" "smithereens")
|
|
]
|
|
|
|
Thus, to identify one-or-more spaces as the delimiter, take care to
|
|
use the regexp @scheme[#rx"\u20+"], not @scheme[#rx"\u20*"].
|
|
|
|
@interaction[
|
|
(regexp-split #rx" +" "split pea soup")
|
|
(regexp-split #rx" *" "split pea soup")
|
|
]
|
|
|
|
The @scheme[regexp-replace] function replaces the matched portion of
|
|
the text string by another string. The first argument is the pattern,
|
|
the second the text string, and the third is either the string to be
|
|
inserted or a procedure to convert matches to the insert string.
|
|
|
|
@interaction[
|
|
(regexp-replace #rx"te" "liberte" "ty")
|
|
(regexp-replace #rx"." "scheme" string-upcase)
|
|
]
|
|
|
|
If the pattern doesn't occur in the text string, the returned string
|
|
is identical to the text string.
|
|
|
|
The @scheme[regexp-replace*] function replaces @emph{all} matches in
|
|
the text string by the insert string:
|
|
|
|
@interaction[
|
|
(regexp-replace* #rx"te" "liberte egalite fraternite" "ty")
|
|
(regexp-replace* #rx"[ds]" "drscheme" string-upcase)
|
|
]
|
|
|
|
@; ----------------------------------------
|
|
|
|
@section[#:tag "regexp-assert"]{Basic Assertions}
|
|
|
|
The @deftech{assertions} @litchar{^} and @litchar{$} identify the
|
|
beginning and the end of the text string, respectively. They ensure
|
|
that their adjoining regexps match at one or other end of the text
|
|
string:
|
|
|
|
@interaction[
|
|
(regexp-match-positions #rx"^contact" "first contact")
|
|
]
|
|
|
|
The @tech{regexp} above fails to match because @litchar{contact} does
|
|
not occur at the beginning of the text string. In
|
|
|
|
@interaction[
|
|
(regexp-match-positions #rx"laugh$" "laugh laugh laugh laugh")
|
|
]
|
|
|
|
the regexp matches the @emph{last} @litchar{laugh}.
|
|
|
|
The metasequence @litchar{\b} asserts that a word boundary exists, but
|
|
this metasequence works only with @litchar{#px} syntax. In
|
|
|
|
@interaction[
|
|
(regexp-match-positions #px"yack\\b" "yackety yack")
|
|
]
|
|
|
|
the @litchar{yack} in @litchar{yackety} doesn't end at a word boundary
|
|
so it isn't matched. The second @litchar{yack} does and is.
|
|
|
|
The metasequence @litchar{\B} (also @litchar{#px} only) has the
|
|
opposite effect to @litchar{\b}; it asserts that a word boundary does
|
|
not exist. In
|
|
|
|
@interaction[
|
|
(regexp-match-positions #px"an\\B" "an analysis")
|
|
]
|
|
|
|
the @litchar{an} that doesn't end in a word boundary is matched.
|
|
|
|
@; ----------------------------------------
|
|
|
|
@section[#:tag "regexp-chars"]{Characters and Character Classes}
|
|
|
|
Typically, a character in the regexp matches the same character in the
|
|
text string. Sometimes it is necessary or convenient to use a regexp
|
|
@tech{metasequence} to refer to a single character. For example, the
|
|
metasequence @litchar{\.} matches the period character.
|
|
|
|
The @tech{metacharacter} @litchar{.} matches @emph{any} character
|
|
(other than newline in @tech{multi-line mode}; see
|
|
@secref["regexp-cloister"]):
|
|
|
|
@interaction[
|
|
(regexp-match #rx"p.t" "pet")
|
|
]
|
|
|
|
The above pattern also matches @litchar{pat}, @litchar{pit},
|
|
@litchar{pot}, @litchar{put}, and @litchar{p8t}, but not
|
|
@litchar{peat} or @litchar{pfffft}.
|
|
|
|
A @deftech{character class} matches any one character from a set of
|
|
characters. A typical format for this is the @deftech{bracketed
|
|
character class} @litchar{[}...@litchar{]}, which matches any one
|
|
character from the non-empty sequence of characters enclosed within
|
|
the brackets. Thus, @scheme[#rx"p[aeiou]t"] matches @litchar{pat},
|
|
@litchar{pet}, @litchar{pit}, @litchar{pot}, @litchar{put}, and
|
|
nothing else.
|
|
|
|
Inside the brackets, a @litchar{-} between two characters specifies
|
|
the Unicode range between the characters. For example,
|
|
@scheme[#rx"ta[b-dgn-p]"] matches @litchar{tab}, @litchar{tac},
|
|
@litchar{tad}, @litchar{tag}, @litchar{tan}, @litchar{tao}, and
|
|
@litchar{tap}.
|
|
|
|
An initial @litchar{^} after the left bracket inverts the set
|
|
specified by the rest of the contents; i.e., it specifies the set of
|
|
characters @emph{other than} those identified in the brackets. For
|
|
example, @scheme[#rx"do[^g]"] matches all three-character sequences
|
|
starting with @litchar{do} except @litchar{dog}.
|
|
|
|
Note that the @tech{metacharacter} @litchar{^} inside brackets means
|
|
something quite different from what it means outside. Most other
|
|
@tech{metacharacters} (@litchar{.}, @litchar{*}, @litchar{+},
|
|
@litchar{?}, etc.) cease to be @tech{metacharacters} when inside
|
|
brackets, although you may still escape them for peace of mind. A
|
|
@litchar{-} is a @tech{metacharacter} only when it's inside brackets,
|
|
and when it is neither the first nor the last character between the
|
|
brackets.
|
|
|
|
Bracketed character classes cannot contain other bracketed character
|
|
classes (although they contain certain other types of character
|
|
classes; see below). Thus, a @litchar{[} inside a bracketed character
|
|
class doesn't have to be a metacharacter; it can stand for itself.
|
|
For example, @scheme[#rx"[a[b]"] matches @litchar{a}, @litchar{[}, and
|
|
@litchar{b}.
|
|
|
|
Furthermore, since empty bracketed character classes are disallowed, a
|
|
@litchar{]} immediately occurring after the opening left bracket also
|
|
doesn't need to be a metacharacter. For example, @scheme[#rx"[]ab]"]
|
|
matches @litchar{]}, @litchar{a}, and @litchar{b}.
|
|
|
|
@subsection{Some Frequently Used Character Classes}
|
|
|
|
In @litchar{#px} syntax, some standard character classes can be
|
|
conveniently represented as metasequences instead of as explicit
|
|
bracketed expressions: @litchar{\d} matches a digit
|
|
(the same as @litchar{[0-9]}); @litchar{\s} matches an ASCII whitespace character; and
|
|
@litchar{\w} matches a character that could be part of a
|
|
``word''.
|
|
|
|
@margin-note{Following regexp custom, we identify ``word'' characters
|
|
as @litchar{[A-Za-z0-9_]}, although these are too restrictive for what
|
|
a Schemer might consider a ``word.''}
|
|
|
|
The upper-case versions of these metasequences stand for the
|
|
inversions of the corresponding character classes: @litchar{\D}
|
|
matches a non-digit, @litchar{\S} a non-whitespace character, and
|
|
@litchar{\W} a non-``word'' character.
|
|
|
|
Remember to include a double backslash when putting these
|
|
metasequences in a Scheme string:
|
|
|
|
@interaction[
|
|
(regexp-match #px"\\d\\d"
|
|
"0 dear, 1 have 2 read catch 22 before 9")
|
|
]
|
|
|
|
These character classes can be used inside a bracketed expression. For
|
|
example, @scheme[#px"[a-z\\d]"] matches a lower-case letter or a
|
|
digit.
|
|
|
|
@subsection{POSIX character classes}
|
|
|
|
A @deftech{POSIX character class} is a special @tech{metasequence} of
|
|
the form @litchar{[:}...@litchar{:]} that can be used only inside a
|
|
bracketed expression in @litchar{#px} syntax. The POSIX classes
|
|
supported are
|
|
|
|
@itemize[#:style "compact"
|
|
|
|
@item{@litchar{[:alnum:]} --- ASCII letters and digits}
|
|
|
|
@item{@litchar{[:alpha:]} --- ASCII letters}
|
|
|
|
@item{@litchar{[:ascii:]} --- ASCII characters}
|
|
|
|
@item{@litchar{[:blank:]} --- ASCII widthful whitespace: space and tab}
|
|
|
|
@item{@litchar{[:cntrl:]} --- ``control'' characters: ASCII 0 to 32}
|
|
|
|
@item{@litchar{[:digit:]} --- ASCII digits, same as @litchar{\d}}
|
|
|
|
@item{@litchar{[:graph:]} --- ASCII characters that use ink}
|
|
|
|
@item{@litchar{[:lower:]} --- ASCII lower-case letters}
|
|
|
|
@item{@litchar{[:print:]} --- ASCII ink-users plus widthful whitespace}
|
|
|
|
@item{@litchar{[:space:]} --- ASCII whitespace, same as @litchar{\s}}
|
|
|
|
@item{@litchar{[:upper:]} --- ASCII upper-case letters}
|
|
|
|
@item{@litchar{[:word:]} --- ASCII same as @litchar{\w}}
|
|
|
|
@item{@litchar{[:xdigit:]} --- ASCII hex digits}
|
|
|
|
]
|
|
|
|
For example, the @scheme[#px"[[:alpha:]_]"] matches a letter or
|
|
underscore.
|
|
|
|
@interaction[
|
|
(regexp-match #px"[[:alpha:]_]" "--x--")
|
|
(regexp-match #px"[[:alpha:]_]" "--_--")
|
|
(regexp-match #px"[[:alpha:]_]" "--:--")
|
|
]
|
|
|
|
The POSIX class notation is valid @emph{only} inside a bracketed
|
|
expression. For instance, @litchar{[:alpha:]}, when not inside a
|
|
bracketed expression, will not be read as the letter class. Rather,
|
|
it is (from previous principles) the character class containing the
|
|
characters @litchar{:}, @litchar{a}, @litchar{l}, @litchar{p},
|
|
@litchar{h}.
|
|
|
|
@interaction[
|
|
(regexp-match #px"[:alpha:]" "--a--")
|
|
(regexp-match #px"[:alpha:]" "--x--")
|
|
]
|
|
|
|
@; ----------------------------------------
|
|
|
|
@section[#:tag "regexp-quant"]{Quantifiers}
|
|
|
|
The @deftech{quantifiers} @litchar{*}, @litchar{+}, and @litchar{?}
|
|
match respectively: zero or more, one or more, and zero or one
|
|
instances of the preceding subpattern.
|
|
|
|
@interaction[
|
|
(regexp-match-positions #rx"c[ad]*r" "cadaddadddr")
|
|
(regexp-match-positions #rx"c[ad]*r" "cr")
|
|
|
|
(regexp-match-positions #rx"c[ad]+r" "cadaddadddr")
|
|
(regexp-match-positions #rx"c[ad]+r" "cr")
|
|
|
|
(regexp-match-positions #rx"c[ad]?r" "cadaddadddr")
|
|
(regexp-match-positions #rx"c[ad]?r" "cr")
|
|
(regexp-match-positions #rx"c[ad]?r" "car")
|
|
]
|
|
|
|
In @litchar{#px} syntax, you can use braces to specify much
|
|
finer-tuned quantification than is possible with @litchar{*},
|
|
@litchar{+}, @litchar{?}:
|
|
|
|
@itemize[
|
|
|
|
@item{The quantifier @litchar["{"]@math{m}@litchar["}"] matches
|
|
@emph{exactly} @math{m} instances of the preceding
|
|
@tech{subpattern}; @math{m} must be a nonnegative integer.}
|
|
|
|
@item{The quantifier
|
|
@litchar["{"]@math{m}@litchar{,}@math{n}@litchar["}"] matches
|
|
at least @math{m} and at most @math{n} instances. @litchar{m}
|
|
and @litchar{n} are nonnegative integers with @math{m} less or
|
|
equal to @math{n}. You may omit either or both numbers, in
|
|
which case @math{m} defaults to @math{0} and @math{n} to
|
|
infinity.}
|
|
|
|
]
|
|
|
|
It is evident that @litchar{+} and @litchar{?} are abbreviations for
|
|
@litchar{{1,}} and @litchar{{0,1}} respectively, and @litchar{*}
|
|
abbreviates @litchar{{,}}, which is the same as @litchar{{0,}}.
|
|
|
|
@interaction[
|
|
(regexp-match #px"[aeiou]{3}" "vacuous")
|
|
(regexp-match #px"[aeiou]{3}" "evolve")
|
|
(regexp-match #px"[aeiou]{2,3}" "evolve")
|
|
(regexp-match #px"[aeiou]{2,3}" "zeugma")
|
|
]
|
|
|
|
The quantifiers described so far are all @deftech{greedy}: they match
|
|
the maximal number of instances that would still lead to an overall
|
|
match for the full pattern.
|
|
|
|
@interaction[
|
|
(regexp-match #rx"<.*>" "<tag1> <tag2> <tag3>")
|
|
]
|
|
|
|
To make these quantifiers @deftech{non-greedy}, append a @litchar{?}
|
|
to them. Non-greedy quantifiers match the minimal number of instances
|
|
needed to ensure an overall match.
|
|
|
|
@interaction[
|
|
(regexp-match #rx"<.*?>" "<tag1> <tag2> <tag3>")
|
|
]
|
|
|
|
The non-greedy quantifiers are respectively: @litchar{*?},
|
|
@litchar{+?}, @litchar{??}, @litchar["{"]@math{m}@litchar["}?"],
|
|
@litchar["{"]@math{m}@litchar{,}@math{n}@litchar["}?"]. Note the two
|
|
uses of the metacharacter @litchar{?}.
|
|
|
|
@; ----------------------------------------
|
|
|
|
@section[#:tag "regexp-clusters"]{Clusters}
|
|
|
|
@deftech{Clustering}---enclosure within parens
|
|
@litchar{(}...@litchar{)}---identifies the enclosed
|
|
@deftech{subpattern} as a single entity. It causes the matcher to
|
|
capture the @deftech{submatch}, or the portion of the string matching
|
|
the subpattern, in addition to the overall match:
|
|
|
|
@interaction[
|
|
(regexp-match #rx"([a-z]+) ([0-9]+), ([0-9]+)" "jan 1, 1970")
|
|
]
|
|
|
|
Clustering also causes a following quantifier to treat the entire
|
|
enclosed subpattern as an entity:
|
|
|
|
@interaction[
|
|
(regexp-match #rx"(poo )*" "poo poo platter")
|
|
]
|
|
|
|
The number of submatches returned is always equal to the number of
|
|
subpatterns specified in the regexp, even if a particular subpattern
|
|
happens to match more than one substring or no substring at all.
|
|
|
|
@interaction[
|
|
(regexp-match #rx"([a-z ]+;)*" "lather; rinse; repeat;")
|
|
]
|
|
|
|
Here, the @litchar{*}-quantified subpattern matches three times, but
|
|
it is the last submatch that is returned.
|
|
|
|
It is also possible for a quantified subpattern to fail to match, even
|
|
if the overall pattern matches. In such cases, the failing submatch
|
|
is represented by @scheme[#f]
|
|
|
|
@interaction[
|
|
(define date-re
|
|
(code:comment #, @t{match `month year' or `month day, year';})
|
|
(code:comment #, @t{subpattern matches day, if present})
|
|
#rx"([a-z]+) +([0-9]+,)? *([0-9]+)")
|
|
(regexp-match date-re "jan 1, 1970")
|
|
(regexp-match date-re "jan 1970")
|
|
]
|
|
|
|
|
|
@subsection{Backreferences}
|
|
|
|
@tech{Submatch}es can be used in the insert string argument of the
|
|
procedures @scheme[regexp-replace] and @scheme[regexp-replace*]. The
|
|
insert string can use @litchar{\}@math{n} as a @deftech{backreference}
|
|
to refer back to the @math{n}th submatch, which is the the substring
|
|
that matched the @math{n}th subpattern. A @litchar{\0} refers to the
|
|
entire match, and it can also be specified as @litchar{\&}.
|
|
|
|
@interaction[
|
|
(regexp-replace #rx"_(.+?)_"
|
|
"the _nina_, the _pinta_, and the _santa maria_"
|
|
"*\\1*")
|
|
(regexp-replace* #rx"_(.+?)_"
|
|
"the _nina_, the _pinta_, and the _santa maria_"
|
|
"*\\1*")
|
|
|
|
(regexp-replace #px"(\\S+) (\\S+) (\\S+)"
|
|
"eat to live"
|
|
"\\3 \\2 \\1")
|
|
]
|
|
|
|
Use @litchar{\\} in the insert string to specify a literal backslash.
|
|
Also, @litchar{\$} stands for an empty string, and is useful for
|
|
separating a backreference @litchar{\}@math{n} from an immediately
|
|
following number.
|
|
|
|
Backreferences can also be used within a @litchar{#px} pattern to
|
|
refer back to an already matched subpattern in the pattern.
|
|
@litchar{\}@math{n} stands for an exact repeat of the @math{n}th
|
|
submatch. Note that @litchar{\0}, which is useful in an insert string,
|
|
makes no sense within the regexp pattern, because the entire regexp
|
|
has not matched yet that you could refer back to it.}
|
|
|
|
@interaction[
|
|
(regexp-match #px"([a-z]+) and \\1"
|
|
"billions and billions")
|
|
]
|
|
|
|
Note that the @tech{backreference} is not simply a repeat of the
|
|
previous subpattern. Rather it is a repeat of the particular
|
|
substring already matched by the subpattern.
|
|
|
|
In the above example, the @tech{backreference} can only match
|
|
@litchar{billions}. It will not match @litchar{millions}, even though
|
|
the subpattern it harks back to---@litchar{([a-z]+)}---would have had
|
|
no problem doing so:
|
|
|
|
@interaction[
|
|
(regexp-match #px"([a-z]+) and \\1"
|
|
"billions and millions")
|
|
]
|
|
|
|
The following example corrects doubled words:
|
|
|
|
@interaction[
|
|
(regexp-replace* #px"(\\S+) \\1"
|
|
(string-append "now is the the time for all good men to "
|
|
"to come to the aid of of the party")
|
|
"\\1")
|
|
]
|
|
|
|
The following example marks all immediately repeating patterns in a
|
|
number string:
|
|
|
|
@interaction[
|
|
(regexp-replace* #px"(\\d+)\\1"
|
|
"123340983242432420980980234"
|
|
"{\\1,\\1}")
|
|
]
|
|
|
|
@subsection{Non-capturing Clusters}
|
|
|
|
It is often required to specify a cluster (typically for
|
|
quantification) but without triggering the capture of @tech{submatch}
|
|
information. Such clusters are called @deftech{non-capturing}. To
|
|
create a non-capturing cluster, use @litchar{(?:} instead of
|
|
@litchar{(} as the cluster opener.
|
|
|
|
In the following example, a non-capturing cluster eliminates the
|
|
``directory'' portion of a given Unix pathname, and a capturing
|
|
cluster identifies the basename.
|
|
|
|
@margin-note{But don't parse paths with regexps. Use functions like
|
|
@scheme[split-path], instead.}
|
|
|
|
@interaction[
|
|
(regexp-match #rx"^(?:[a-z]*/)*([a-z]+)$"
|
|
"/usr/local/bin/mzscheme")
|
|
]
|
|
|
|
@subsection[#:tag "regexp-cloister"]{Cloisters}
|
|
|
|
The location between the @litchar{?} and the @litchar{:} of a
|
|
non-capturing cluster is called a @deftech{cloister}. You can put
|
|
modifiers there that will cause the enclustered @tech{subpattern} to
|
|
be treated specially. The modifier @litchar{i} causes the subpattern
|
|
to match case-insensitively:
|
|
|
|
@margin-note{The term @defterm{cloister} is a useful, if terminally
|
|
cute, coinage from the abbots of Perl.}
|
|
|
|
@interaction[
|
|
(regexp-match #rx"(?i:hearth)" "HeartH")
|
|
]
|
|
|
|
The modifier @litchar{m} causes the @tech{subpattern} to match in
|
|
@deftech{multi-line mode}, where @litchar{.} does not match a newline
|
|
character, @litchar{^} can match just after a newline, and @litchar{$}
|
|
can match just before a newline.
|
|
|
|
@interaction[
|
|
(regexp-match #rx"." "\na\n")
|
|
(regexp-match #rx"(?m:.)" "\na\n")
|
|
(regexp-match #rx"^A plan$" "A man\nA plan\nA canal")
|
|
(regexp-match #rx"(?m:^A plan$)" "A man\nA plan\nA canal")
|
|
]
|
|
|
|
You can put more than one modifier in the cloister:
|
|
|
|
@interaction[
|
|
(regexp-match #rx"(?mi:^A Plan$)" "a man\na plan\na canal")
|
|
]
|
|
|
|
A minus sign before a modifier inverts its meaning. Thus, you can use
|
|
@litchar{-i} in a @deftech{subcluster} to overturn the
|
|
case-insensitivities caused by an enclosing cluster.
|
|
|
|
@interaction[
|
|
(regexp-match #rx"(?i:the (?-i:TeX)book)"
|
|
"The TeXbook")
|
|
]
|
|
|
|
The above regexp will allow any casing for @litchar{the} and
|
|
@litchar{book}, but it insists that @litchar{TeX} not be differently
|
|
cased.
|
|
|
|
@; ----------------------------------------
|
|
|
|
@section[#:tag "regexp-alternation"]{Alternation}
|
|
|
|
You can specify a list of @emph{alternate} @tech{subpatterns} by
|
|
separating them by @litchar{|}. The @litchar{|} separates
|
|
@tech{subpatterns} in the nearest enclosing cluster (or in the entire
|
|
pattern string if there are no enclosing parens).
|
|
|
|
@interaction[
|
|
(regexp-match #rx"f(ee|i|o|um)" "a small, final fee")
|
|
(regexp-replace* #rx"([yi])s(e[sdr]?|ing|ation)"
|
|
(string-append
|
|
"analyse an energising organisation"
|
|
" pulsing with noisy organisms")
|
|
"\\1z\\2")
|
|
]
|
|
|
|
Note again that if you wish to use clustering merely to specify a list
|
|
of alternate subpatterns but do not want the submatch, use
|
|
@litchar{(?:} instead of @litchar{(}.
|
|
|
|
@interaction[
|
|
(regexp-match #rx"f(?:ee|i|o|um)" "fun for all")
|
|
]
|
|
|
|
An important thing to note about alternation is that the leftmost
|
|
matching alternate is picked regardless of its length. Thus, if one
|
|
of the alternates is a prefix of a later alternate, the latter may not
|
|
have a chance to match.
|
|
|
|
@interaction[
|
|
(regexp-match #rx"call|call-with-current-continuation"
|
|
"call-with-current-continuation")
|
|
]
|
|
|
|
To allow the longer alternate to have a shot at matching, place it
|
|
before the shorter one:
|
|
|
|
@interaction[
|
|
(regexp-match #rx"call-with-current-continuation|call"
|
|
"call-with-current-continuation")
|
|
]
|
|
|
|
In any case, an overall match for the entire regexp is always
|
|
preferred to an overall non-match. In the following, the longer
|
|
alternate still wins, because its preferred shorter prefix fails to
|
|
yield an overall match.
|
|
|
|
@interaction[
|
|
(regexp-match
|
|
#rx"(?:call|call-with-current-continuation) constrained"
|
|
"call-with-current-continuation constrained")
|
|
]
|
|
|
|
@; ----------------------------------------
|
|
|
|
@section{Backtracking}
|
|
|
|
We've already seen that greedy quantifiers match the maximal number of
|
|
times, but the overriding priority is that the overall match succeed.
|
|
Consider
|
|
|
|
@interaction[
|
|
(regexp-match #rx"a*a" "aaaa")
|
|
]
|
|
|
|
The regexp consists of two subregexps: @litchar{a*} followed by
|
|
@litchar{a}. The subregexp @litchar{a*} cannot be allowed to match
|
|
all four @litchar{a}'s in the text string @scheme[aaaa], even though
|
|
@litchar{*} is a greedy quantifier. It may match only the first
|
|
three, leaving the last one for the second subregexp. This ensures
|
|
that the full regexp matches successfully.
|
|
|
|
The regexp matcher accomplishes this via a process called
|
|
@deftech{backtracking}. The matcher tentatively allows the greedy
|
|
quantifier to match all four @litchar{a}'s, but then when it becomes
|
|
clear that the overall match is in jeopardy, it @emph{backtracks} to a
|
|
less greedy match of three @litchar{a}'s. If even this fails, as in
|
|
the call
|
|
|
|
@interaction[
|
|
(regexp-match #rx"a*aa" "aaaa")
|
|
]
|
|
|
|
the matcher backtracks even further. Overall, failure is conceded
|
|
only when all possible backtracking has been tried with no success.
|
|
|
|
Backtracking is not restricted to greedy quantifiers.
|
|
Nongreedy quantifiers match as few instances as
|
|
possible, and progressively backtrack to more and more
|
|
instances in order to attain an overall match. There
|
|
is backtracking in alternation too, as the more
|
|
rightward alternates are tried when locally successful
|
|
leftward ones fail to yield an overall match.
|
|
|
|
Sometimes it is efficient to disable backtracking. For example, we
|
|
may wish to commit to a choice, or we know that trying alternatives is
|
|
fruitless. A nonbacktracking regexp is enclosed in
|
|
@litchar{(?>}...@litchar{)}.
|
|
|
|
@interaction[
|
|
(regexp-match #rx"(?>a+)." "aaaa")
|
|
]
|
|
|
|
In this call, the subregexp @litchar{?>a+} greedily matches all four
|
|
@litchar{a}'s, and is denied the opportunity to backtrack. So, the
|
|
overall match is denied. The effect of the regexp is therefore to
|
|
match one or more @litchar{a}'s followed by something that is
|
|
definitely non-@litchar{a}.
|
|
|
|
@; ----------------------------------------
|
|
|
|
@section{Looking Ahead and Behind}
|
|
|
|
You can have assertions in your pattern that look @emph{ahead} or
|
|
@emph{behind} to ensure that a subpattern does or does not occur.
|
|
These ``look around'' assertions are specified by putting the
|
|
subpattern checked for in a cluster whose leading characters are:
|
|
@litchar{?=} (for positive lookahead), @litchar{?!} (negative
|
|
lookahead), @litchar{?<=} (positive lookbehind), @litchar{?<!}
|
|
(negative lookbehind). Note that the subpattern in the assertion does
|
|
not generate a match in the final result; it merely allows or
|
|
disallows the rest of the match.
|
|
|
|
@subsection{Lookahead}
|
|
|
|
Positive lookahead with @litchar{?=} peeks ahead to ensure that
|
|
its subpattern @emph{could} match.
|
|
|
|
@interaction[
|
|
(regexp-match-positions #rx"grey(?=hound)"
|
|
"i left my grey socks at the greyhound")
|
|
]
|
|
|
|
The regexp @scheme[#rx"grey(?=hound)"] matches @litchar{grey}, but
|
|
@emph{only} if it is followed by @litchar{hound}. Thus, the first
|
|
@litchar{grey} in the text string is not matched.
|
|
|
|
Negative lookahead with @litchar{?!} peeks ahead to ensure that its
|
|
subpattern @emph{could not} possibly match.
|
|
|
|
@interaction[
|
|
(regexp-match-positions #rx"grey(?!hound)"
|
|
"the gray greyhound ate the grey socks")
|
|
]
|
|
|
|
The regexp @scheme[#rx"grey(?!hound)"] matches @litchar{grey}, but
|
|
only if it is @emph{not} followed by @litchar{hound}. Thus the
|
|
@litchar{grey} just before @litchar{socks} is matched.
|
|
|
|
@subsection{Lookbehind}
|
|
|
|
Positive lookbehind with @litchar{?<=} checks that its subpattern
|
|
@emph{could} match immediately to the left of the current position in
|
|
the text string.
|
|
|
|
@interaction[
|
|
(regexp-match-positions #rx"(?<=grey)hound"
|
|
"the hound in the picture is not a greyhound")
|
|
]
|
|
|
|
The regexp @scheme[#rx"(?<=grey)hound"] matches @litchar{hound}, but
|
|
only if it is preceded by @litchar{grey}.
|
|
|
|
Negative lookbehind with @litchar{?<!} checks that its subpattern
|
|
could not possibly match immediately to the left.
|
|
|
|
@interaction[
|
|
(regexp-match-positions #rx"(?<!grey)hound"
|
|
"the greyhound in the picture is not a hound")
|
|
]
|
|
|
|
The regexp @scheme[#rx"(?<!grey)hound"] matches @litchar{hound}, but
|
|
only if it is @emph{not} preceded by @litchar{grey}.
|
|
|
|
Lookaheads and lookbehinds can be convenient when they
|
|
are not confusing.
|
|
|
|
@; ----------------------------------------
|
|
|
|
@section{An Extended Example}
|
|
|
|
@(define ex-eval (make-base-eval))
|
|
|
|
Here's an extended example from Friedl's @italic{Mastering Regular
|
|
Expressions}, page 189, that covers many of the features described in
|
|
this chapter. The problem is to fashion a regexp that will match any
|
|
and only IP addresses or @emph{dotted quads}: four numbers separated
|
|
by three dots, with each number between 0 and 255.
|
|
|
|
First, we define a subregexp @scheme[n0-255] that matches 0 through
|
|
255:
|
|
|
|
@interaction[
|
|
#:eval ex-eval
|
|
(define n0-255
|
|
(string-append
|
|
"(?:"
|
|
"\\d|" (code:comment #, @t{ 0 through 9})
|
|
"\\d\\d|" (code:comment #, @t{ 00 through 99})
|
|
"[01]\\d\\d|" (code:comment #, @t{000 through 199})
|
|
"2[0-4]\\d|" (code:comment #, @t{200 through 249})
|
|
"25[0-5]" (code:comment #, @t{250 through 255})
|
|
")"))
|
|
]
|
|
|
|
@margin-note{Note that @scheme[n0-255] lists prefixes as preferred
|
|
alternates, which is something we cautioned against in
|
|
@secref["regexp-alternation"]. However, since we intend to anchor
|
|
this subregexp explicitly to force an overall match, the order of the
|
|
alternates does not matter.}
|
|
|
|
The first two alternates simply get all single- and
|
|
double-digit numbers. Since 0-padding is allowed, we
|
|
need to match both 1 and 01. We need to be careful
|
|
when getting 3-digit numbers, since numbers above 255
|
|
must be excluded. So we fashion alternates to get 000
|
|
through 199, then 200 through 249, and finally 250
|
|
through 255.
|
|
|
|
An IP-address is a string that consists of four @scheme[n0-255]s with
|
|
three dots separating them.
|
|
|
|
@interaction[
|
|
#:eval ex-eval
|
|
(define ip-re1
|
|
(string-append
|
|
"^" (code:comment #, @t{nothing before})
|
|
n0-255 (code:comment #, @t{the first @scheme[n0-255],})
|
|
"(?:" (code:comment #, @t{then the subpattern of})
|
|
"\\." (code:comment #, @t{a dot followed by})
|
|
n0-255 (code:comment #, @t{an @scheme[n0-255],})
|
|
")" (code:comment #, @t{which is})
|
|
"{3}" (code:comment #, @t{repeated exactly 3 times})
|
|
"$" (code:comment #, @t{with nothing following})
|
|
))
|
|
]
|
|
|
|
Let's try it out:
|
|
|
|
@interaction[
|
|
#:eval ex-eval
|
|
(regexp-match (pregexp ip-re1) "1.2.3.4")
|
|
(regexp-match (pregexp ip-re1) "55.155.255.265")
|
|
]
|
|
|
|
which is fine, except that we also have
|
|
|
|
@interaction[
|
|
#:eval ex-eval
|
|
(regexp-match (pregexp ip-re1) "0.00.000.00")
|
|
]
|
|
|
|
All-zero sequences are not valid IP addresses! Lookahead to the
|
|
rescue. Before starting to match @scheme[ip-re1], we look ahead to
|
|
ensure we don't have all zeros. We could use positive lookahead to
|
|
ensure there @emph{is} a digit other than zero.
|
|
|
|
@interaction[
|
|
#:eval ex-eval
|
|
(define ip-re
|
|
(pregexp
|
|
(string-append
|
|
"(?=.*[1-9])" (code:comment #, @t{ensure there's a non-0 digit})
|
|
ip-re1)))
|
|
]
|
|
|
|
Or we could use negative lookahead to ensure that what's ahead isn't
|
|
composed of @emph{only} zeros and dots.
|
|
|
|
@interaction[
|
|
#:eval ex-eval
|
|
(define ip-re
|
|
(pregexp
|
|
(string-append
|
|
"(?![0.]*$)" (code:comment #, @t{not just zeros and dots})
|
|
(code:comment #, @t{(note: @litchar{.} is not metachar inside @litchar{[}...@litchar{]})})
|
|
ip-re1)))
|
|
]
|
|
|
|
The regexp @scheme[ip-re] will match all and only valid IP addresses.
|
|
|
|
@interaction[
|
|
#:eval ex-eval
|
|
(regexp-match ip-re "1.2.3.4")
|
|
(regexp-match ip-re "0.0.0.0")
|
|
]
|
|
|
|
@close-eval[ex-eval]
|