#lang scribble/doc
@(require scribble/bnf "mz.ss" "rx.ss")

@title[#:tag "regexp"]{Regular Expressions}

@section-index{regexps}
@section-index{pattern matching}
@section-index["strings" "pattern matching"]
@section-index["input ports" "pattern matching"]

@guideintro["regexp"]{regular expressions}

@deftech{Regular expressions} are specified as strings or byte
strings, using the same pattern language as the Unix utility
@exec{egrep} or Perl. A string-specified pattern produces a character
regexp matcher, and a byte-string pattern produces a byte regexp
matcher. If a character regexp is used with a byte string or input
port, it matches UTF-8 encodings (see @secref["encodings"]) of
matching character streams; if a byte regexp is used with a character
string, it matches bytes in the UTF-8 encoding of the string.

Regular expressions can be compiled into a @deftech{regexp value} for
repeated matches. The @scheme[regexp] and @scheme[byte-regexp]
procedures convert a string or byte string (respectively) into a
regexp value using one syntax of regular expressions that is most
compatible with @exec{egrep}. The @scheme[pregexp] and
@scheme[byte-pregexp] procedures produce a regexp value using a
slightly different syntax of regular expressions that is more
compatible with Perl. In addition, Scheme constants written with
@litchar{#rx} or @litchar{#px} (see @secref["reader"]) produce
compiled regexp values.

The internal size of a regexp value is limited to 32 kilobytes; this
limit roughly corresponds to a source string with 32,000 literal
characters or 5,000 operators.

@;------------------------------------------------------------------------
@section[#:tag "regexp-syntax"]{Regexp Syntax}

The following syntax specifications describe the content of a string
that represents a regular expression. The syntax of the corresponding
string may involve extra escape characters.
For example, the regular expression @litchar{(.*)\1} can be
represented with the string @scheme["(.*)\\1"] or the regexp constant
@scheme[#rx"(.*)\\1"]; the @litchar{\} in the regular expression must
be escaped to include it in a string or regexp constant.

The @scheme[regexp] and @scheme[pregexp] syntaxes share a common core:

@common-table

The following completes the grammar for @scheme[regexp], which treats
@litchar["{"] and @litchar["}"] as literals, @litchar{\} as a literal
within ranges, and @litchar{\} as a literal producer outside of
ranges.

@rx-table

The following completes the grammar for @scheme[pregexp], which uses
@litchar["{"] and @litchar["}"] for bounded repetition and uses
@litchar{\} for meta-characters both inside and outside of ranges.

@px-table

@;------------------------------------------------------------------------
@section{Additional Syntactic Constraints}

In addition to matching a grammar, regular expressions must meet two
syntactic restrictions:

@itemize[

 @item{In a @nonterm{repeat} other than @nonterm{atom}@litchar{?},
       the @nonterm{atom} must not match an empty sequence.}

 @item{In a @litchar{(?<=}@nonterm{regexp}@litchar{)} or @litchar{(?* . string?) ((bytes?) () #:rest (listof bytes?) . ->* . bytes?))]) (or/c string? bytes?)]{

Performs a match using @scheme[pattern] on @scheme[input], and then
returns a string or byte string in which the matching portion of
@scheme[input] is replaced with @scheme[insert]. If @scheme[pattern]
matches no part of @scheme[input], then @scheme[input] is returned
unmodified.

The @scheme[insert] argument can be either a (byte) string, or a
function that returns a (byte) string. In the latter case, the
function is applied to the list of values that @scheme[regexp-match]
would return (i.e., the first argument is the complete match, and then
one argument for each parenthesized sub-expression) to obtain a
replacement (byte) string.

If @scheme[pattern] is a string or character regexp and
@scheme[input] is a string, then @scheme[insert] must be a string or a
procedure that accepts strings, and the result is a string. If
@scheme[pattern] is a byte string or byte regexp, or if @scheme[input]
is a byte string, then @scheme[insert] as a string is converted to a
byte string, @scheme[insert] as a procedure is called with a byte
string, and the result is a byte string.

If @scheme[insert] contains @litchar{&}, then @litchar{&} is replaced
with the matching portion of @scheme[input] before it is substituted
into the match's place. If @scheme[insert] contains
@litchar{\}@nonterm{n} for some integer @nonterm{n}, then it is
replaced with the @nonterm{n}th matching sub-expression from
@scheme[input]. A @litchar{&} and @litchar{\0} are synonymous. If the
@nonterm{n}th sub-expression was not used in the match, or if
@nonterm{n} is greater than the number of sub-expressions in
@scheme[pattern], then @litchar{\}@nonterm{n} is replaced with the
empty string.

To substitute a literal @litchar{&} or @litchar{\}, use @litchar{\&}
and @litchar{\\}, respectively, in @scheme[insert]. A @litchar{\$} in
@scheme[insert] is equivalent to an empty sequence; this can be used
to terminate a number @nonterm{n} following @litchar{\}. If a
@litchar{\} in @scheme[insert] is followed by anything other than a
digit, @litchar{&}, @litchar{\}, or @litchar{$}, then the @litchar{\}
by itself is treated as @litchar{\0}.

Note that the @litchar{\} described in the previous paragraphs is a
character or byte of @scheme[insert]. To write such an @scheme[insert]
as a Scheme string literal, an escaping @litchar{\} is needed before
the @litchar{\}. For example, the Scheme constant @scheme["\\1"] is
@litchar{\1}.
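
As a small sketch of the escapes described above, the following
interactions use @litchar{\$} to end a sub-expression reference before
a literal digit, and contrast @litchar{&} with @litchar{\&}. The
patterns and input strings are made up for illustration only.

@examples[
(code:comment "\\$ ends the reference to group 1, so the 0 after it is literal text")
(regexp-replace #rx"(b+)" "abbc" "[\\1\\$0]")
(code:comment "an unescaped & stands for the whole match")
(regexp-replace #rx"b+" "abbc" "<&>")
(code:comment "\\& inserts a literal ampersand instead")
(regexp-replace #rx"b+" "abbc" "<\\&>")
]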

@examples[
(regexp-replace "mi" "mi casa" "su")
(regexp-replace "mi" "mi casa" string-upcase)
(regexp-replace "([Mm])i ([a-zA-Z]*)" "Mi Casa" "\\1y \\2")
(regexp-replace "([Mm])i ([a-zA-Z]*)" "mi cerveza Mi Mi Mi"
                "\\1y \\2")
(regexp-replace #rx"x" "12x4x6" "\\\\")
(display (regexp-replace #rx"x" "12x4x6" "\\\\"))
]}

@defproc[(regexp-replace* [pattern (or/c string? bytes? regexp? byte-regexp?)]
                          [input (or/c string? bytes?)]
                          [insert (or/c string? bytes?
                                        (string? . -> . string?)
                                        (bytes? . -> . bytes?))])
         (or/c string? bytes?)]{

Like @scheme[regexp-replace], except that every instance of
@scheme[pattern] in @scheme[input] is replaced with @scheme[insert],
instead of just the first match. Only non-overlapping instances of
@scheme[pattern] in @scheme[input] are replaced, so instances of
@scheme[pattern] within inserted strings are @italic{not} replaced
recursively. Zero-length matches are treated the same as in
@scheme[regexp-match*].

@examples[
(regexp-replace* "([Mm])i ([a-zA-Z]*)" "mi cerveza Mi Mi Mi"
                 "\\1y \\2")
(regexp-replace* "([Mm])i ([a-zA-Z]*)" "mi cerveza Mi Mi Mi"
                 (lambda (all one two)
                   (string-append (string-downcase one) "y"
                                  (string-upcase two))))
(display (regexp-replace* #rx"x" "12x4x6" "\\\\"))
]}

@defproc*[([(regexp-replace-quote [str string?]) string?]
           [(regexp-replace-quote [bstr bytes?]) bytes?])]{

Produces a string suitable for use as the third argument to
@scheme[regexp-replace] to insert the literal sequence of characters
in @scheme[str] or bytes in @scheme[bstr] as a replacement.
Concretely, every @litchar{\} and @litchar{&} in @scheme[str] or
@scheme[bstr] is protected by a quoting @litchar{\}.

@examples[
(regexp-replace "UT" "Go UT!" "A&M")
(regexp-replace "UT" "Go UT!" (regexp-replace-quote "A&M"))
]}