diff --git a/collects/scribblings/guide/byte-strings.scrbl b/collects/scribblings/guide/byte-strings.scrbl index 691cb2320d..4b555ab680 100644 --- a/collects/scribblings/guide/byte-strings.scrbl +++ b/collects/scribblings/guide/byte-strings.scrbl @@ -14,7 +14,7 @@ numbers that represent bytes. (byte? 256) ] -A @defterm{byte string} is similar to a string---see +A @deftech{byte string} is similar to a string---see @secref["strings"]---but its content is a sequence of bytes instead of characters. Byte strings can be used in applications that process pure ASCII instead of Unicode text. The printed form of a diff --git a/collects/scribblings/guide/char-strings.scrbl b/collects/scribblings/guide/char-strings.scrbl index 6fbdd0842a..f7dc0612a2 100644 --- a/collects/scribblings/guide/char-strings.scrbl +++ b/collects/scribblings/guide/char-strings.scrbl @@ -5,7 +5,7 @@ @title[#:tag "strings"]{Strings (Unicode)} -A @defterm{string} is a fixed-length array of +A @deftech{string} is a fixed-length array of @seclink["characters"]{characters}. It prints using doublequotes, where doublequote and backslash characters within the string are escaped with backslashes. Other common string escapes are supported, diff --git a/collects/scribblings/guide/guide.scrbl b/collects/scribblings/guide/guide.scrbl index eb37d08b49..3318df54f3 100644 --- a/collects/scribblings/guide/guide.scrbl +++ b/collects/scribblings/guide/guide.scrbl @@ -147,6 +147,12 @@ downloadable packages contributed by PLT Scheme users. #:is-book? 
#t #:date "2002") + (bib-entry #:key "Sitaram05" + #:author "Dorai Sitaram" + #:title "pregexp: Portable Regular Expressions for Scheme and Common Lisp" + #:url "http://www.ccs.neu.edu/home/dorai/pregexp/pregexp.html" + #:date "2002") + ) @index-section[] diff --git a/collects/scribblings/guide/regexp.scrbl b/collects/scribblings/guide/regexp.scrbl index 54482befee..033b66e8c1 100644 --- a/collects/scribblings/guide/regexp.scrbl +++ b/collects/scribblings/guide/regexp.scrbl @@ -3,4 +3,918 @@ scribble/eval "guide-utils.ss") -@title[#:tag "regexp"]{Regular Expressions} +@title[#:tag "regexp" #:style 'toc]{Regular Expressions} + +@margin-note{This section is based on @cite["Sitaram05"].} + +A @deftech{regexp} value encapsulates a pattern that is described by a +string or @tech{byte string}. The regexp matcher tries to match this +pattern against (a portion of) another string or byte string, which we +will call the @deftech{text string}, when you call functions like +@scheme[regexp-match]. The text string is treated as raw text, and +not as a pattern. + +@local-table-of-contents[] + +@; ---------------------------------------- + +@section[#:tag "regexp-intro"]{Writing Regexp Patterns} + +A string or @tech{byte string} can be used directly as a @tech{regexp} +pattern, or it can be prefixed with @litchar{#rx} to form a literal +@tech{regexp} value. For example, @scheme[#rx"abc"] is a string-based +@tech{regexp} value, and @scheme[#rx#"abc"] is a @tech{byte +string}-based @tech{regexp} value. Alternately, a string or byte +string can be prefixed with @litchar{#px}, as in @scheme[#px"abc"], +for a slightly different syntax of patterns within the string. + +Most of the characters in a @tech{regexp} pattern are meant to match +occurrences of themselves in the text string. Thus, the pattern +@scheme[#rx"abc"] matches a string that contains the characters +@litchar{a}, @litchar{b}, and @litchar{c} in succession. 
+ +In the regexp pattern, some characters act as +@deftech{metacharacters}, and some character sequences act as +@deftech{metasequences}. That is, they specify something other than +their literal selves. For example, in the pattern @scheme[#rx"a.c"], +the characters @litchar{a} and @litchar{c} stand for themselves, but +the @tech{metacharacter} @litchar{.} can match @emph{any} character. +Therefore, the pattern @scheme[#rx"a.c"] matches an @litchar{a}, any +character, and @litchar{c} in succession. + +@margin-note{When we want a literal @litchar{\} inside a Scheme string +or regexp literal, we must escape it so that it shows up in the string +at all. Scheme strings use @litchar{\} as the escape character, so we +end up with two @litchar{\}s: one Scheme-string @litchar{\} to escape +the regexp @litchar{\}, which then escapes the @litchar{.}. Another +character that would need escaping inside a Scheme string is +@litchar{"}.} + +If we need to match the character @litchar{.} itself, we can escape +it by preceding it with a @litchar{\}. The character sequence +@litchar{\.} is thus a @tech{metasequence}, since it doesn't match +itself but rather just @litchar{.}. So, to match @litchar{a}, +@litchar{.}, and @litchar{c} in succession, we use the regexp pattern +@scheme["a\\.c"]; the double @litchar{\} is an artifact of Scheme +strings, not the @tech{regexp} pattern itself. + +The @scheme[regexp] function takes a string or byte string and +produces a @tech{regexp} value. Use @scheme[regexp] when you construct +a pattern to be matched against multiple strings, since a pattern must +be compiled to a @tech{regexp} value before it can be used in a match. +The @scheme[pregexp] function is like @scheme[regexp], but it parses +the pattern in an alternate syntax. Regexp values written as literals +with @litchar{#rx} or @litchar{#px} are compiled once and for all when +they are read. 
+ + +The @scheme[regexp-quote] function takes an arbitrary string and +returns a string for a pattern that matches exactly the original +string. In particular, characters in the input string that could serve +as regexp metacharacters are escaped with a backslash, so that they +safely match only themselves. + +@interaction[ +(regexp-quote "cons") +(regexp-quote "list?") +] + +The @scheme[regexp-quote] function is useful when building a composite +@tech{regexp} from a mix of @tech{regexp} strings and verbatim strings. + + +@; ---------------------------------------- + +@section[#:tag "regexp-match"]{Matching Regexp Patterns} + +The @scheme[regexp-match-positions] function takes a @tech{regexp} +pattern and a text string, and it returns a match if the regexp +matches (some part of) the text string, or @scheme[#f] if the regexp +did not match the string. A successful match produces a list of +@deftech{index pairs}. + +@examples[ +(regexp-match-positions #rx"brain" "bird") +(regexp-match-positions #rx"needle" "hay needle stack") +] + +In the second example, the integers @scheme[4] and @scheme[10] +identify the substring that was matched. The @scheme[4] is the +starting (inclusive) index, and @scheme[10] the ending (exclusive) +index of the matching substring: + +@interaction[ +(substring "hay needle stack" 4 10) +] + +In this first example, @scheme[regexp-match-positions]'s return list +contains only one index pair, and that pair represents the entire +substring matched by the regexp. When we discuss @tech{subpatterns} +later, we will see how a single match operation can yield a list of +@tech{submatch}es. + +The @scheme[regexp-match-positions] function takes optional third and +fourth arguments that specify the indices of the text string within +which the matching should take place. 
+ +@interaction[ +(regexp-match-positions + #rx"needle" + "his needle stack -- my needle stack -- her needle stack" + 20 39) +] + +Note that the returned indices are still reckoned relative to the full +text string. + +The @scheme[regexp-match] function is like +@scheme[regexp-match-positions], but instead of returning index pairs, +it returns the matching substrings: + +@interaction[ +(regexp-match #rx"brain" "bird") +(regexp-match #rx"needle" "hay needle stack") +] + + +The @scheme[regexp-split] function takes two arguments, a +@tech{regexp} pattern and a text string, and it returns a list of +substrings of the text string; the pattern identifies the delimiter +separating the substrings. + +@interaction[ +(regexp-split #rx":" "/bin:/usr/bin:/usr/bin/X11:/usr/local/bin") +(regexp-split #rx" " "pea soup") +] + +If the first argument matches empty strings, then the list of all the +single-character substrings is returned. + +@interaction[ +(regexp-split #rx"" "smithereens") +] + +Thus, to identify one-or-more spaces as the delimiter, take care to +use the regexp @scheme[#rx"\u20+"], not @scheme[#rx"\u20*"]. + +@interaction[ +(regexp-split #rx" +" "split pea soup") +(regexp-split #rx" *" "split pea soup") +] + +The @scheme[regexp-replace] function replaces the matched portion of +the text string by another string. The first argument is the pattern, +the second the text string, and the third is either the string to be +inserted or a procedure to convert matches to the insert string. + +@interaction[ +(regexp-replace #rx"te" "liberte" "ty") +(regexp-replace #rx"." "scheme" string-upcase) +] + +If the pattern doesn't occur in the text string, the returned string +is identical to the text string. 
+ +The @scheme[regexp-replace*] function replaces @emph{all} matches in +the text string by the insert string: + +@interaction[ +(regexp-replace* #rx"te" "liberte egalite fraternite" "ty") +(regexp-replace* #rx"[ds]" "drscheme" string-upcase) +] + +@; ---------------------------------------- + +@section[#:tag "regexp-assert"]{Basic Assertions} + +The @deftech{assertions} @litchar{^} and @litchar{$} identify the +beginning and the end of the text string, respectively. They ensure +that their adjoining regexps match at one or other end of the text +string: + +@interaction[ +(regexp-match-positions #rx"^contact" "first contact") +] + +The @tech{regexp} above fails to match because @litchar{contact} does +not occur at the beginning of the text string. In + +@interaction[ +(regexp-match-positions #rx"laugh$" "laugh laugh laugh laugh") +] + +the regexp matches the @emph{last} @litchar{laugh}. + +The metasequence @litchar{\b} asserts that a word boundary exists, but +this metasequence works only with @litchar{#px} syntax. In + +@interaction[ +(regexp-match-positions #px"yack\\b" "yackety yack") +] + +the @litchar{yack} in @litchar{yackety} doesn't end at a word boundary +so it isn't matched. The second @litchar{yack} does and is. + +The metasequence @litchar{\B} (also @litchar{#px} only) has the +opposite effect to @litchar{\b}; it asserts that a word boundary does +not exist. In + +@interaction[ +(regexp-match-positions #px"an\\B" "an analysis") +] + +the @litchar{an} that doesn't end in a word boundary is matched. + +@; ---------------------------------------- + +@section[#:tag "regexp-chars"]{Characters and Character Classes} + +Typically, a character in the regexp matches the same character in the +text string. Sometimes it is necessary or convenient to use a regexp +@tech{metasequence} to refer to a single character. For example, the +metasequence @litchar{\.} matches the period character. 
+ +The @tech{metacharacter} @litchar{.} matches @emph{any} character +(other than newline in @tech{multi-line mode}; see +@secref["regexp-cloister"]): + +@interaction[ +(regexp-match #rx"p.t" "pet") +] + +The above pattern also matches @litchar["pat"], @litchar["pit"], +@litchar["pot"], @litchar["put"], and @litchar["p8t"], but not +@litchar["peat"] or @litchar["pfffft"]. + +A @deftech{character class} matches any one character from a set of +characters. A typical format for this is the @deftech{bracketed +character class} @litchar{[}...@litchar{]}, which matches any one +character from the non-empty sequence of characters enclosed within +the brackets. Thus, @scheme[#rx"p[aeiou]t"] matches @litchar{pat}, +@litchar{pet}, @litchar{pit}, @litchar{pot}, @litchar{put}, and +nothing else. + +Inside the brackets, a @litchar{-} between two characters specifies +the Unicode range between the characters. For example, +@scheme[#rx"ta[b-dgn-p]"] matches @litchar{tab}, @litchar{tac}, +@litchar{tad}, @litchar{tag}, @litchar{tan}, @litchar{tao}, and +@litchar{tap}. + +An initial @litchar{^} after the left bracket inverts the set +specified by the rest of the contents; i.e., it specifies the set of +characters @emph{other than} those identified in the brackets. For +example, @scheme[#rx"do[^g]"] matches all three-character sequences +starting with @litchar{do} except @litchar{dog}. + +Note that the @tech{metacharacter} @litchar{^} inside brackets means +something quite different from what it means outside. Most other +@tech{metacharacters} (@litchar{.}, @litchar{*}, @litchar{+}, +@litchar{?}, etc.) cease to be @tech{metacharacters} when inside +brackets, although you may still escape them for peace of mind. A +@litchar{-} is a @tech{metacharacter} only when it's inside brackets, +and when it is neither the first nor the last character between the +brackets. 
+ +Bracketed character classes cannot contain other bracketed character +classes (although they can contain certain other types of character +classes; see below). Thus, a @litchar{[} inside a bracketed character +class doesn't have to be a metacharacter; it can stand for itself. +For example, @scheme[#rx"[a[b]"] matches @litchar{a}, @litchar{[}, and +@litchar{b}. + +Furthermore, since empty bracketed character classes are disallowed, a +@litchar{]} immediately occurring after the opening left bracket also +doesn't need to be a metacharacter. For example, @scheme[#rx"[]ab]"] +matches @litchar{]}, @litchar{a}, and @litchar{b}. + +@subsection{Some Frequently Used Character Classes} + +In @litchar{#px} syntax, some standard character classes can be +conveniently represented as metasequences instead of as explicit +bracketed expressions: @litchar{\d} matches a digit +(the same as @litchar{[0-9]}); @litchar{\s} matches an ASCII whitespace character; and +@litchar{\w} matches a character that could be part of a +``word''. + +@margin-note{Following regexp custom, we identify ``word'' characters +as @litchar{[A-Za-z0-9_]}, although these are too restrictive for what +a Schemer might consider a ``word.''} + +The upper-case versions of these metasequences stand for the +inversions of the corresponding character classes: @litchar{\D} +matches a non-digit, @litchar{\S} a non-whitespace character, and +@litchar{\W} a non-``word'' character. + +Remember to include a double backslash when putting these +metasequences in a Scheme string: + +@interaction[ +(regexp-match #px"\\d\\d" + "0 dear, 1 have 2 read catch 22 before 9") +] + +These character classes can be used inside a bracketed expression. For +example, @scheme[#px"[a-z\\d]"] matches a lower-case letter or a +digit. 
+ +@subsection{POSIX character classes} + +A @deftech{POSIX character class} is a special @tech{metasequence} of +the form @litchar{[:}...@litchar{:]} that can be used only inside a +bracketed expression in @litchar{#px} syntax. The POSIX classes +supported are + +@itemize[#:style "compact" + + @item{@litchar{[:alnum:]} --- ASCII letters and digits} + + @item{@litchar{[:alpha:]} --- ASCII letters} + + @item{@litchar{[:ascii:]} --- ASCII characters} + + @item{@litchar{[:blank:]} --- ASCII widthful whitespace: space and tab} + + @item{@litchar{[:cntrl:]} --- ``control'' characters: ASCII 0 to 32} + + @item{@litchar{[:digit:]} --- ASCII digits, same as @litchar{\d}} + + @item{@litchar{[:graph:]} --- ASCII characters that use ink} + + @item{@litchar{[:lower:]} --- ASCII lower-case letters} + + @item{@litchar{[:print:]} --- ASCII ink-users plus widthful whitespace} + + @item{@litchar{[:space:]} --- ASCII whitespace, same as @litchar{\s}} + + @item{@litchar{[:upper:]} --- ASCII upper-case letters} + + @item{@litchar{[:word:]} --- ASCII same as @litchar{\w}} + + @item{@litchar{[:xdigit:]} --- ASCII hex digits} + +] + +For example, the @scheme[#px"[[:alpha:]_]"] matches a letter or +underscore. + +@interaction[ +(regexp-match #px"[[:alpha:]_]" "--x--") +(regexp-match #px"[[:alpha:]_]" "--_--") +(regexp-match #px"[[:alpha:]_]" "--:--") +] + +The POSIX class notation is valid @emph{only} inside a bracketed +expression. For instance, @litchar{[:alpha:]}, when not inside a +bracketed expression, will not be read as the letter class. Rather, +it is (from previous principles) the character class containing the +characters @litchar{:}, @litchar{a}, @litchar{l}, @litchar{p}, +@litchar{h}. 
+ +@interaction[ +(regexp-match #px"[:alpha:]" "--a--") +(regexp-match #px"[:alpha:]" "--x--") +] + +@; ---------------------------------------- + +@section[#:tag "regexp-quant"]{Quantifiers} + +The @deftech{quantifiers} @litchar{*}, @litchar{+}, and @litchar{?} +match respectively: zero or more, one or more, and zero or one +instances of the preceding subpattern. + +@interaction[ +(regexp-match-positions #rx"c[ad]*r" "cadaddadddr") +(regexp-match-positions #rx"c[ad]*r" "cr") + +(regexp-match-positions #rx"c[ad]+r" "cadaddadddr") +(regexp-match-positions #rx"c[ad]+r" "cr") + +(regexp-match-positions #rx"c[ad]?r" "cadaddadddr") +(regexp-match-positions #rx"c[ad]?r" "cr") +(regexp-match-positions #rx"c[ad]?r" "car") +] + +In @litchar{#px} syntax, you can use braces to specify much +finer-tuned quantification than is possible with @litchar{*}, +@litchar{+}, @litchar{?}: + +@itemize[ + + @item{The quantifier @litchar["{"]@math{m}@litchar["}"] matches + @emph{exactly} @math{m} instances of the preceding + @tech{subpattern}; @math{m} must be a nonnegative integer.} + + @item{The quantifier + @litchar["{"]@math{m}@litchar{,}@math{n}@litchar["}"] matches + at least @math{m} and at most @math{n} instances. @litchar{m} + and @litchar{n} are nonnegative integers with @math{m} less than or + equal to @math{n}. You may omit either or both numbers, in + which case @math{m} defaults to @math{0} and @math{n} to + infinity.} + +] + +It is evident that @litchar{+} and @litchar{?} are abbreviations for +@litchar["{1,}"] and @litchar["{0,1}"] respectively, and @litchar{*} +abbreviates @litchar["{,}"], which is the same as @litchar["{0,}"]. 
+ +@interaction[ +(regexp-match #px"[aeiou]{3}" "vacuous") +(regexp-match #px"[aeiou]{3}" "evolve") +(regexp-match #px"[aeiou]{2,3}" "evolve") +(regexp-match #px"[aeiou]{2,3}" "zeugma") +] + +The quantifiers described so far are all @deftech{greedy}: they match +the maximal number of instances that would still lead to an overall +match for the full pattern. + +@interaction[ +(regexp-match #rx"<.*>" " ") +] + +To make these quantifiers @deftech{non-greedy}, append a @litchar{?} +to them. Non-greedy quantifiers match the minimal number of instances +needed to ensure an overall match. + +@interaction[ +(regexp-match #rx"<.*?>" " ") +] + +The non-greedy quantifiers are respectively: @litchar{*?}, +@litchar{+?}, @litchar{??}, @litchar["{"]@math{m}@litchar["}?"], +@litchar["{"]@math{m}@litchar{,}@math{n}@litchar["}?"]. Note the two +uses of the metacharacter @litchar{?}. + +@; ---------------------------------------- + +@section[#:tag "regexp-clusters"]{Clusters} + +@deftech{Clustering}---enclosure within parens +@litchar{(}...@litchar{)}---identifies the enclosed +@deftech{subpattern} as a single entity. It causes the matcher to +capture the @deftech{submatch}, or the portion of the string matching +the subpattern, in addition to the overall match: + +@interaction[ +(regexp-match #rx"([a-z]+) ([0-9]+), ([0-9]+)" "jan 1, 1970") +] + +Clustering also causes a following quantifier to treat the entire +enclosed subpattern as an entity: + +@interaction[ +(regexp-match #rx"(poo )*" "poo poo platter") +] + +The number of submatches returned is always equal to the number of +subpatterns specified in the regexp, even if a particular subpattern +happens to match more than one substring or no substring at all. + +@interaction[ +(regexp-match #rx"([a-z ]+;)*" "lather; rinse; repeat;") +] + +Here, the @litchar{*}-quantified subpattern matches three times, but +it is the last submatch that is returned. 
+ +It is also possible for a quantified subpattern to fail to match, even +if the overall pattern matches. In such cases, the failing submatch +is represented by @scheme[#f]. + +@interaction[ +(define date-re + (code:comment #, @t{match `month year' or `month day, year';}) + (code:comment #, @t{subpattern matches day, if present}) + #rx"([a-z]+) +([0-9]+,)? *([0-9]+)") +(regexp-match date-re "jan 1, 1970") +(regexp-match date-re "jan 1970") +] + + +@subsection{Backreferences} + +@tech{Submatch}es can be used in the insert string argument of the +procedures @scheme[regexp-replace] and @scheme[regexp-replace*]. The +insert string can use @litchar{\}@math{n} as a @deftech{backreference} +to refer back to the @math{n}th submatch, which is the substring +that matched the @math{n}th subpattern. A @litchar{\0} refers to the +entire match, and it can also be specified as @litchar{\&}. + +@interaction[ +(regexp-replace #rx"_(.+?)_" + "the _nina_, the _pinta_, and the _santa maria_" + "*\\1*") +(regexp-replace* #rx"_(.+?)_" + "the _nina_, the _pinta_, and the _santa maria_" + "*\\1*") + +(regexp-replace #px"(\\S+) (\\S+) (\\S+)" + "eat to live" + "\\3 \\2 \\1") +] + +Use @litchar{\\} in the insert string to specify a literal backslash. +Also, @litchar{\$} stands for an empty string, and is useful for +separating a backreference @litchar{\}@math{n} from an immediately +following number. + +Backreferences can also be used within a @litchar{#px} pattern to +refer back to an already matched subpattern in the pattern. +@litchar{\}@math{n} stands for an exact repeat of the @math{n}th +submatch. Note that @litchar{\0}, which is useful in an insert string, +makes no sense within the regexp pattern, because the entire regexp +has not matched yet, so there is nothing to refer back to. + +@interaction[ +(regexp-match #px"([a-z]+) and \\1" + "billions and billions") +] + +Note that the @tech{backreference} is not simply a repeat of the +previous subpattern. 
Rather it is a repeat of the particular +substring already matched by the subpattern. + +In the above example, the @tech{backreference} can only match +@litchar{billions}. It will not match @litchar{millions}, even though +the subpattern it harks back to---@litchar{([a-z]+)}---would have had +no problem doing so: + +@interaction[ +(regexp-match #px"([a-z]+) and \\1" + "billions and millions") +] + +The following example corrects doubled words: + +@interaction[ +(regexp-replace* #px"(\\S+) \\1" + (string-append "now is the the time for all good men to " + "to come to the aid of of the party") + "\\1") +] + +The following example marks all immediately repeating patterns in a +number string: + +@interaction[ +(regexp-replace* #px"(\\d+)\\1" + "123340983242432420980980234" + "{\\1,\\1}") +] + +@subsection{Non-capturing Clusters} + +It is often required to specify a cluster (typically for +quantification) but without triggering the capture of @tech{submatch} +information. Such clusters are called @deftech{non-capturing}. To +create a non-capturing cluster, use @litchar{(?:} instead of +@litchar{(} as the cluster opener. + +In the following example, a non-capturing cluster eliminates the +``directory'' portion of a given Unix pathname, and a capturing +cluster identifies the basename. + +@margin-note{But don't parse paths with regexps. Use functions like + @scheme[split-path], instead.} + +@interaction[ +(regexp-match #rx"^(?:[a-z]*/)*([a-z]+)$" + "/usr/local/bin/mzscheme") +] + +@subsection[#:tag "regexp-cloister"]{Cloisters} + +The location between the @litchar{?} and the @litchar{:} of a +non-capturing cluster is called a @deftech{cloister}. You can put +modifiers there that will cause the enclustered @tech{subpattern} to +be treated specially. 
The modifier @litchar{i} causes the subpattern +to match case-insensitively: + +@margin-note{The term @defterm{cloister} is a useful, if terminally +cute, coinage from the abbots of Perl.} + +@interaction[ +(regexp-match #rx"(?i:hearth)" "HeartH") +] + +The modifier @litchar{m} causes the @tech{subpattern} to match in +@deftech{multi-line mode}, where @litchar{.} does not match a newline +character, @litchar{^} can match just after a newline, and @litchar{$} +can match just before a newline. + +@interaction[ +(regexp-match #rx"." "\na\n") +(regexp-match #rx"(?m:.)" "\na\n") +(regexp-match #rx"^A plan$" "A man\nA plan\nA canal") +(regexp-match #rx"(?m:^A plan$)" "A man\nA plan\nA canal") +] + +You can put more than one modifier in the cloister: + +@interaction[ +(regexp-match #rx"(?mi:^A Plan$)" "a man\na plan\na canal") +] + +A minus sign before a modifier inverts its meaning. Thus, you can use +@litchar{-i} in a @deftech{subcluster} to overturn the +case-insensitivities caused by an enclosing cluster. + +@interaction[ +(regexp-match #rx"(?i:the (?-i:TeX)book)" + "The TeXbook") +] + +The above regexp will allow any casing for @litchar{the} and +@litchar{book}, but it insists that @litchar{TeX} not be differently +cased. + +@; ---------------------------------------- + +@section[#:tag "regexp-alternation"]{Alternation} + +You can specify a list of @emph{alternate} @tech{subpatterns} by +separating them by @litchar{|}. The @litchar{|} separates +@tech{subpatterns} in the nearest enclosing cluster (or in the entire +pattern string if there are no enclosing parens). + +@interaction[ +(regexp-match #rx"f(ee|i|o|um)" "a small, final fee") +(regexp-replace* #rx"([yi])s(e[sdr]?|ing|ation)" + (string-append + "analyse an energising organisation" + " pulsing with noisy organisms") + "\\1z\\2") +] + +Note again that if you wish to use clustering merely to specify a list +of alternate subpatterns but do not want the submatch, use +@litchar{(?:} instead of @litchar{(}. 
+ +@interaction[ +(regexp-match #rx"f(?:ee|i|o|um)" "fun for all") +] + +An important thing to note about alternation is that the leftmost +matching alternate is picked regardless of its length. Thus, if one +of the alternates is a prefix of a later alternate, the latter may not +have a chance to match. + +@interaction[ +(regexp-match #rx"call|call-with-current-continuation" + "call-with-current-continuation") +] + +To allow the longer alternate to have a shot at matching, place it +before the shorter one: + +@interaction[ +(regexp-match #rx"call-with-current-continuation|call" + "call-with-current-continuation") +] + +In any case, an overall match for the entire regexp is always +preferred to an overall non-match. In the following, the longer +alternate still wins, because its preferred shorter prefix fails to +yield an overall match. + +@interaction[ +(regexp-match + #rx"(?:call|call-with-current-continuation) constrained" + "call-with-current-continuation constrained") +] + +@; ---------------------------------------- + +@section{Backtracking} + +We've already seen that greedy quantifiers match the maximal number of +times, but the overriding priority is that the overall match succeed. +Consider + +@interaction[ +(regexp-match #rx"a*a" "aaaa") +] + +The regexp consists of two subregexps: @litchar{a*} followed by +@litchar{a}. The subregexp @litchar{a*} cannot be allowed to match +all four @litchar{a}'s in the text string @scheme[aaaa], even though +@litchar{*} is a greedy quantifier. It may match only the first +three, leaving the last one for the second subregexp. This ensures +that the full regexp matches successfully. + +The regexp matcher accomplishes this via a process called +@deftech{backtracking}. The matcher tentatively allows the greedy +quantifier to match all four @litchar{a}'s, but then when it becomes +clear that the overall match is in jeopardy, it @emph{backtracks} to a +less greedy match of three @litchar{a}'s. 
If even this fails, as in +the call + +@interaction[ +(regexp-match #rx"a*aa" "aaaa") +] + +the matcher backtracks even further. Overall, failure is conceded +only when all possible backtracking has been tried with no success. + +Backtracking is not restricted to greedy quantifiers. +Nongreedy quantifiers match as few instances as +possible, and progressively backtrack to more and more +instances in order to attain an overall match. There +is backtracking in alternation too, as the more +rightward alternates are tried when locally successful +leftward ones fail to yield an overall match. + +Sometimes it is efficient to disable backtracking. For example, we +may wish to commit to a choice, or we know that trying alternatives is +fruitless. A nonbacktracking regexp is enclosed in +@litchar{(?>}...@litchar{)}. + +@interaction[ +(regexp-match #rx"(?>a+)." "aaaa") +] + +In this call, the subregexp @litchar{?>a+} greedily matches all four +@litchar{a}'s, and is denied the opportunity to backtrack. So, the +overall match is denied. The effect of the regexp is therefore to +match one or more @litchar{a}'s followed by something that is +definitely non-@litchar{a}. + +@; ---------------------------------------- + +@section{Looking Ahead and Behind} + +You can have assertions in your pattern that look @emph{ahead} or +@emph{behind} to ensure that a subpattern does or does not occur. +These ``look around'' assertions are specified by putting the +subpattern checked for in a cluster whose leading characters are: +@litchar{?=} (for positive lookahead), @litchar{?!} (negative +lookahead), @litchar{?<=} (positive lookbehind), @litchar{?