#lang scribble/doc
@require[scribble/bnf]
@require["mz.ss"]
@require["rx.ss"]

@title[#:tag "regexp"]{Regular Expressions}

@declare-exporting[(lib "scheme/regexp")]

@section-index{regexps}
@section-index{pattern matching}
@section-index["strings" "pattern matching"]
@section-index["input ports" "pattern matching"]

Regular expressions are specified as strings or byte strings, using
the same pattern language as the Unix utility @exec{egrep} or Perl. A
string-specified pattern produces a character regexp matcher, and a
byte-string pattern produces a byte regexp matcher. If a character
regexp is used with a byte string or input port, it matches UTF-8
encodings (see @secref["encodings"]) of matching character streams;
if a byte regexp is used with a character string, it matches bytes in
the UTF-8 encoding of the string.

Regular expressions can be compiled into a @defterm{regexp value} for
repeated matches. The @scheme[regexp] and @scheme[byte-regexp]
procedures convert a string or byte string (respectively) into a
regexp value using the syntax of regular expressions that is most
compatible with @exec{egrep}. The @scheme[pregexp] and
@scheme[byte-pregexp] procedures produce a regexp value using a
slightly different syntax of regular expressions that is more
compatible with Perl. In addition, Scheme constants written with
@litchar{#rx} or @litchar{#px} (see @secref["reader"]) produce
compiled regexp values.

The internal size of a regexp value is limited to 32 kilobytes; this
limit roughly corresponds to a source string with 32,000 literal
characters or 5,000 operators.

@;------------------------------------------------------------------------
@section[#:tag "regexp-syntax"]{Regexp Syntax}

The following syntax specifications describe the content of a string
that represents a regular expression. The syntax of the corresponding
string may involve extra escape characters.
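
As a minimal sketch of that distinction, assuming the illustrative
pattern @litchar["a\\.b"] (chosen here for demonstration, not taken
from the grammar tables), the four-character regular expression is
written with a doubled backslash in both a string and a regexp
constant:

@schemeblock[
; The string literal denotes four characters: a, backslash, period, b.
(string-length "a\\.b")
; Compile the illustrative pattern from a string and match a literal period.
(regexp-match (regexp "a\\.b") "a.b")
; The same pattern as a regexp constant; the escaped period rejects x.
(regexp-match #rx"a\\.b" "axb")
]

Under these assumptions, the first expression produces @scheme[4], the
second produces a one-element list containing @scheme["a.b"], and the
third produces @scheme[#f].
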
For example, the regular expression @litchar["(.*)\\1"] can be
represented with the string @scheme["(.*)\\1"] or the regexp constant
@scheme[#rx"(.*)\\1"]; the @litchar["\\"] in the regular expression
must be escaped to include it in a string or regexp constant.

The @scheme[regexp] and @scheme[pregexp] syntaxes share a common core:

@common-table

The following completes the grammar for @scheme[regexp], which treats
@litchar["{"] and @litchar["}"] as literals, @litchar["\\"] as a
literal within ranges, and @litchar["\\"] as a literal producer
outside of ranges.

@rx-table

The following completes the grammar for @scheme[pregexp], which uses
@litchar["{"] and @litchar["}"] for bounded repetition and uses
@litchar["\\"] for meta-characters both inside and outside of ranges.

@px-table

@;------------------------------------------------------------------------
@section{Additional Syntactic Constraints}

In addition to matching a grammar, regular expressions must meet two
syntactic restrictions:

@itemize{

 @item{In a @nonterm{repeat} other than @nonterm{atom}@litchar{?},
       the @nonterm{atom} must not match an empty sequence.}

 @item{In a @litchar{(?<=}@nonterm{regexp}@litchar{)} or @litchar{(?