431 lines
19 KiB
Plaintext
431 lines
19 KiB
Plaintext
_parser-tools_
|
|
|
|
This documentation assumes familiarity with lex and yacc style lexer
|
|
and parser generators.
|
|
|
|
_lex.ss_
|
|
A _regular expression_ is one of the following:
|
|
|
|
identifier expands to the named lex abbreviation
|
|
string matches the sequence of characters
|
|
character matches the character
|
|
> (repetition lo hi re) matches re repeated between lo and hi times,
|
|
inclusive. hi = +inf.0 for unbounded repetitions
|
|
> (union re ...) matches if any of the sub-expressions match
|
|
> (intersection re ...) matches if all of the sub-expressions match
|
|
> (complement re) matches anything that the sub-expressions does not
|
|
> (concatenation re ...) matches each sub-expression in succession
|
|
> (char-range char char) matches any character between the two (inclusive)
|
|
(A single character string can be used here)
|
|
> (char-complement re) matches any character not matched by the sub-re.
|
|
The sub-expression must be a set of characters re.
|
|
> (char-set string) matches any character in the string
|
|
(op form ...) expands the re macro named op
|
|
|
|
(Both (concatenation) and "" match the empty string.
|
|
(union) matches nothing.
|
|
(intersection) matches any string.
|
|
(char-complement) matches any single character.)
|
|
|
|
This regular expression language is not designed to be used directly,
|
|
but rather as a basis for a user-friendly notation written with
|
|
regular expression macros. For example, _lex-sre.ss_ supplies
|
|
operators from Olin Shivers's SREs and _lex-plt-v200.ss_ supplies
|
|
operators from the previous version of this library. The old plt
|
|
syntax is deprecated. To use one of these syntaxes, use one of the
|
|
following commands:
|
|
|
|
(require (prefix : (lib "lex-sre.ss" "parser-tools")))
|
|
(require (prefix : (lib "lex-plt-v200.ss" "parser-tools")))
|
|
|
|
The regular expression operators occupy the same namespace as other
|
|
Scheme identifiers, so common operator names, such as * and +,
|
|
conflict with built-in identifiers. The suggested prefix form of
|
|
require prefixes each imported operator with :, :* and :+, for
|
|
example. Of course, a prefix other than : (such as re-) will work
|
|
too. Unfortunately, this causes the regexps in v20x lexers to need
|
|
some translation, even with the lex-plt-v200 operators. The exports
|
|
of each of these libraries is explained in their sections below.
|
|
|
|
Since negation is not a common operator on regular expressions, here
|
|
are a few examples, using : prefixed SRE syntax:
|
|
|
|
(complement "1") matches all strings except the string "1",
|
|
including "11", "111", "0", "01", "", and so on.
|
|
|
|
(complement (:* "1")) matches all strings that are not
|
|
sequences of "1", including "0", "00", "11110", "0111", "11001010"
|
|
and so on.
|
|
|
|
(:& (:: any-string "111" any-string)
|
|
(complement (:or (:: any-string "01") (:+ "1"))))
|
|
matches all strings that have 3 consecutive ones, but not those that
|
|
end in "01" and not those that are ones only. These include "1110",
|
|
"0001000111" and "0111" but not "", "11", "11101", "111" and
|
|
"11111".
|
|
|
|
(:: "/*" (complement (:: any-string "*/" any-string)) "*/")
|
|
matches Java/C block comments. "/**/", "/******/", "/*////*/",
|
|
"/*asg4*/" and so on. It does not match "/**/*/", "/* */ */" and so
|
|
on. (:: any-string "*/" any-string) matches any string that has a
|
|
"*/" in is, so (complement (:: any-string "*/" any-string)) matches
|
|
any string without a "*/" in it.
|
|
|
|
(:: "/*" (:* (complement "*/")) "*/")
|
|
matches any string that starts with "/*" and and ends with "*/",
|
|
including "/* */ */ */". (complement "*/") matches any string
|
|
except "*/". This includes "*" and "/" separately. Thus (:*
|
|
(complement "*/")) matches "*/" by first matching "*" and then
|
|
matching "/". Any other string is matched directly by (complement
|
|
"*/"). In other words, (:* (complement "xx")) = any-string. It is
|
|
usually not correct to place a :* around a complement.
|
|
|
|
|
|
The following require imports the _lexer generator_.
|
|
|
|
(require (lib "lex.ss" "parser-tools"))
|
|
|
|
|
|
"lex.ss" exports the following named regular expressions:
|
|
|
|
> any-char: matches any character
|
|
> any-string: matches any string
|
|
> nothing: matches no string
|
|
> alphabetic: see the mzscheme manual section 3.4
|
|
> lower-case: see the mzscheme manual section 3.4
|
|
> upper-case: see the mzscheme manual section 3.4
|
|
> title-case numeric: see the mzscheme manual section 3.4
|
|
> symbolic: see the mzscheme manual section 3.4
|
|
> punctuation: see the mzscheme manual section 3.4
|
|
> graphic: see the mzscheme manual section 3.4
|
|
> whitespace: see the mzscheme manual section 3.4
|
|
> blank: see the mzscheme manual section 3.4
|
|
> iso-control: see the mzscheme manual section 3.4
|
|
|
|
|
|
"lex.ss" exports the following variables:
|
|
|
|
> start-pos
|
|
> end-pos
|
|
> lexeme
|
|
> input-port
|
|
> return-without-pos
|
|
Use of these names outside of a lexer action is a syntax error.
|
|
See the lexer form below for their meaning when used inside a
|
|
lexer action.
|
|
|
|
"lex.ss" exports the following syntactic forms:
|
|
|
|
> (define-lex-abbrev name re) which associates a regular expression
|
|
with a name to be used in other regular expressions with the
|
|
identifier form. The definition of name has the same scoping
|
|
properties as a normal mzscheme macro definition.
|
|
|
|
> (define-lex-abbrevs (name re) ...) defines several lex-abbrevs
|
|
|
|
> (define-lex-trans name trans) define a regular expression macro.
|
|
When the name appears as an operator in a regular expression, the
|
|
expression is replaced with the result of the transformer. The
|
|
definition of name has the same scoping properties as a normal
|
|
mzscheme macro definition. For examples, see the file
|
|
${PLTHOME}/collects/parser-tools/lex-sre.ss which contains simple,
|
|
but useful, regular expression macro definitions.
|
|
|
|
|
|
> (lexer (re action) ...) expands into a function that takes an
|
|
input-port, matches the re's against the buffer, and returns the
|
|
result of executing the corresponding action. Each action is
|
|
scheme code that has the same scope as the enclosing lexer definition.
|
|
The following variables have special meaning inside of a lexer
|
|
action:
|
|
start-pos - a position struct for the first character matched
|
|
end-pos - a position struct for the character after the last
|
|
character in the match
|
|
lexeme - the matched string
|
|
input-port - the input-port being processed (this is useful for
|
|
matching input with multiple lexers)
|
|
(return-without-pos x) is a function (continuation) that
|
|
immediately returns the value of x from the lexer. This useful
|
|
in a src-pos lexer to prevent the lexer from adding source
|
|
information. For example:
|
|
(define get-token
|
|
(lexer-src-pos
|
|
...
|
|
((comment) (get-token input-port))
|
|
...))
|
|
would wrap the source location information for the comment around
|
|
the value of the recursive call. Using
|
|
((comment) (return-without-pos (get-token input-port)))
|
|
will cause the value of the recursive call to be returned without
|
|
wrapping position around it.
|
|
|
|
The lexer raises an exception (exn:read) if none of the regular
|
|
expressions match the input.
|
|
Hint: If (any-char custom-error-behavior) is the last rule,
|
|
then there will always be a match and custom-error-behavior will be
|
|
executed to handle the error situation as desired, only consuming
|
|
the first character from the input buffer.
|
|
|
|
The lexer treats the rules ((eof) action), ((special) action), and
|
|
((special-comment) action) specially. In addition to returning
|
|
characters, input ports can return eof-objects. Custom input ports
|
|
can also return a special-comment value to indicate a non-textual
|
|
comment, or return another arbitrary value (a special).
|
|
|
|
The eof rule is matched when the input port returns an eof value.
|
|
If no eof rule is present, the lexer returns the symbol 'eof when
|
|
the port returns an eof value.
|
|
|
|
The special-comment rule is matched when the input port returns a
|
|
special-comment structure. If no special-comment rule is present,
|
|
the lexer automatically tries to return the next token from the
|
|
input port.
|
|
|
|
The special rule is matched when the input port returns a value
|
|
other than a character, eof-object, or special-comment structure.
|
|
If no special rule is present, the lexer returns void.
|
|
|
|
Eofs, specials, special-comments and special-errors can never be
|
|
part of a lexeme with surrounding characters.
|
|
|
|
When peeking from the input port raises an exception (such as by an
|
|
embedded XML editor with malformed syntax), the exception can be
|
|
raised before all tokens preceding the exception have been
|
|
returned.
|
|
|
|
|
|
> (lexer-src-pos (re action) ...) is a lexer that returns
|
|
(make-position-token action-result start-pos end-pos) instead of
|
|
simply action-result.
|
|
|
|
> (define-tokens group-name (token-name ...)) binds group-name to the
|
|
group of tokens being defined. For each token-name, t, a function
|
|
(token-t expr) is created. It constructs a token structure with
|
|
name t and stores the value of expr in it. The definition of
|
|
group-name has the same scoping properties as a normal mzscheme
|
|
macro definition. A token cannot be named error since it has
|
|
special use in the parser.
|
|
|
|
> (define-empty-tokens group-name (token-name ...)) is like
|
|
define-tokens, except a resulting token constructor take no
|
|
arguments and returns the name as a symbol. (token-t) returns 't.
|
|
|
|
|
|
HINT: Token definitions are usually necessary for inter-operating with
|
|
a generated parser, but may not be the right choice when using the
|
|
lexer in other situations.
|
|
|
|
"lex.ss" exports the following token helper functions:
|
|
|
|
> (token-name token) returns the name of a token that is represented either
|
|
by a symbol or a token structure.
|
|
|
|
> (token-value token) returns the value of a token that is represented either
|
|
by a symbol or a token structure. It returns #f for a symbol token.
|
|
> (token? val) returns #t if val is a token structure and #f otherwise.
|
|
|
|
|
|
"lex.ss" exports the following structures:
|
|
|
|
> (struct position (offset line col))
|
|
These structures are bound to start-pos and end-pos.
|
|
Offset is the offset of the character in the input.
|
|
Line in the line number of the character.
|
|
Col is the offset in the current line.
|
|
|
|
> (struct position-token (token start-pos end-pos))
|
|
src-pos-lexers return these.
|
|
|
|
"lex.ss" exports the following parameter for tracking the source file:
|
|
|
|
> (file-path string) - sets the parameter file-path, which the lexer
|
|
will use as the source location if it raises a read error. This
|
|
allows DrScheme to open the file containing the error.
|
|
|
|
|
|
Each time the scheme code for a lexer is compiled (e.g. when a .ss
|
|
file containing a (lex ...) is loaded/required) the lexer generator is
|
|
run. To avoid this overhead place the lexer into a module and compile
|
|
the module to a .zo with 'mzc --zo --auto-dir filename'. This should
|
|
create a .zo file in the 'compiled' subdirectory.
|
|
|
|
Compiling the lex.ss file to an extension can produce a good speedup
|
|
in generated lexers since the lex.ss file contains the interpreter for
|
|
the generated lex tables. If mzscheme is able to compile extensions
|
|
(a c compiler must be available) run the commands:
|
|
cd ${PLTHOME}/collects/parser-tools
|
|
mzc --auto-dir lex.ss
|
|
|
|
NOTE: Since the lexer gets its source information from the port, use
|
|
port-count-lines! to enable the tracking of line and column information.
|
|
Otherwise the line and column information will return #f.
|
|
|
|
The file ${PLTHOME}/collects/syntax-color/scheme-lexer.ss contains a
|
|
lexer for Scheme, as extended by mzscheme. The files in
|
|
${PLTHOME}/collects/parser-tools/examples contain simpler example
|
|
lexers.
|
|
|
|
|
|
|
|
|
|
_lex-sre.ss_
|
|
The lex-sre.ss module exports the following regular expression
|
|
operators:
|
|
> (* re ...) repetition of regexps 0 or more times
|
|
> (+ re ...) repetition of regexps 1 or more times
|
|
> (? re ...) 0 or 1 occurrence of regexps
|
|
> (= natural-number re ...)
|
|
exactly n occurrences of regexps
|
|
> (>= natural-number re ...)
|
|
at least n occurrences of regexps
|
|
> (** natural-number natural-number-or-#f-or-+inf.0 re ...)
|
|
between m and n of regexps, inclusive
|
|
#f and +inf.0 both indicate no upper limit
|
|
> (or re ...) union
|
|
> (: re ...) concatenation
|
|
> (seq re ...) concatenation
|
|
> (& re ...) intersection
|
|
> (~ re ...) character set complement, each given regexp must match
|
|
exactly one character
|
|
> (- re re ...) character set difference, each given regexp must match
|
|
exactly one character
|
|
> (/ char-or-string ...)
|
|
character ranges, matches characters between successive
|
|
pairs of chars.
|
|
|
|
|
|
_yacc.ss_
|
|
|
|
To use the _parser generator_ (require (lib "yacc.ss" "parser-tools")).
|
|
This module provides the following syntactic form:
|
|
|
|
> (parser args ...) where the possible args may come in any order (as
|
|
long as there are no duplicates and all non-optional arguments are
|
|
present) and are:
|
|
|
|
> (debug filename) OPTIONAL causes the parser generator to write the
|
|
LALR table to the file named filename (unless the file exists).
|
|
filename must be a string. Additionally, if a debug file is
|
|
specified, when a running generated parser encounters a parse
|
|
error on some input file, after the user specified error
|
|
expression returns, the complete parse stack is printed to assist
|
|
in debugging the grammar of that particular parser. The numbers
|
|
in the stack printout correspond to the state numbers in the LALR
|
|
table file.
|
|
|
|
> (yacc-output filename) OPTIONAL causes the parser generator to
|
|
write a grammar file in the syntax of YACC/Bison. The file might
|
|
not be a valid YACC file because the scheme grammar can use
|
|
symbols that are invalid in C.
|
|
|
|
> (suppress) OPTIONAL causes the parser generator not to report
|
|
shift/reduce or reduce/reduce conflicts.
|
|
|
|
> (src-pos) OPTIONAL causes the generated parser to expect input in
|
|
the form (make-position-token token position position) instead of
|
|
simply token. Include this option when using the parser with a
|
|
lexer generated with lexer-src-pos.
|
|
|
|
> (error expression) expression should evaluate to a function which
|
|
will be executed for its side-effect whenever the parser
|
|
encounters an error. If the src-pos option is present, the
|
|
function should accept 5 arguments,
|
|
(lambda (token-ok token-name token-value start-pos end-pos) ...).
|
|
Otherwise it should accept 3,
|
|
(lambda (token-ok token-name token-value) ...).
|
|
The first argument will be #f iff the error is that an invalid
|
|
token was received. The second and third arguments will be the
|
|
name and the value of the token at which the error was detected.
|
|
The fourth and fifth arguments, if present, provide the source
|
|
positions of that token.
|
|
|
|
> (tokens group-name ...) declares that all of the tokens defined in
|
|
the groups can be handled by this parser.
|
|
|
|
> (start non-terminal-name ...) declares a list of starting
|
|
non-terminals for the grammar.
|
|
|
|
> (end token-name ...) specifies a set of tokens from which some
|
|
member must follow any valid parse. For example an EOF token
|
|
would be specified for a parser that parses entire files and a
|
|
NEWLINE token for a parser that parses entire lines individually.
|
|
|
|
> (precs (assoc token-name ...) ...) OPTIONAL precedence
|
|
declarations to resolve shift/reduce and reduce/reduce conflicts
|
|
as in YACC/BISON. assoc must be one of left, right or nonassoc.
|
|
States with multiple shift/reduce or reduce/reduce conflicts or
|
|
some combination thereof are not resolved with precedence.
|
|
|
|
> (grammar (non-terminal ((grammar-symbol ...) (prec token-name) expression)
|
|
...)
|
|
...)
|
|
|
|
declares the grammar to be parsed. Each grammar-symbol must be a
|
|
token-name or non-terminal. The prec declaration is optional.
|
|
expression is a semantic action which will be evaluated when the
|
|
input is found to match its corresponding production. Each
|
|
action is scheme code that has the same scope as its parser's
|
|
definition, except that the variables $1, ..., $n are bound in
|
|
the expression and may hide outside bindings of $1, ... $n. $x
|
|
is bound to the result of the action for the $xth grammar symbol
|
|
on the right of the production, if that grammar symbol is a
|
|
non-terminal, or the value stored in the token if the grammar
|
|
symbol is a terminal. Here n is the number of grammar-symbols on
|
|
the right of the production. If the src-pos option is present in
|
|
the parser, variables $1-start-pos, ..., $n-start-pos and
|
|
$1-end-pos, ..., $n-end-pos are also available and refer to the
|
|
position structures corresponding to the start and end of the
|
|
corresponding grammar-symbol. Grammar symbols defined as
|
|
empty-tokens have no $n associated, but do have $n-start-pos and
|
|
$n-end-pos. All of the productions for a given non-terminal must
|
|
be grouped with it, i.e. No non-terminal may appear twice on the
|
|
left hand side in a parser.
|
|
|
|
The result of a parser expression with one start non-terminal is a
|
|
function, f, that takes one argument. This argument must be a zero
|
|
argument function, t, that produces successive tokens of the input
|
|
each time it is called. If desired, the t may return symbols instead
|
|
of tokens. The parser will treat symbols as tokens of the
|
|
corresponding name (with #f as a value, so it is usual to return
|
|
symbols only in the case of empty tokens). f returns the value
|
|
associated with the parse tree by the semantic actions. If the parser
|
|
encounters an error, after invoking the supplied error function, it
|
|
will try to use error productions to continue parsing. If it cannot,
|
|
it raises a read error.
|
|
|
|
If multiple start non-terminals are provided, the parser expression
|
|
will result in a list of parsing functions (each one will individually
|
|
behave as if it were the result of a parser expression with only one
|
|
start non-terminal), one for each start non-terminal, in the same order.
|
|
|
|
Each time the scheme code for a parser is compiled (e.g. when a .ss
|
|
file containing a (parser ...) is loaded/required) the parser
|
|
generator is run. To avoid this overhead place the lexer into a
|
|
module and compile the module to a .zo with 'mzc --zo --auto-dir
|
|
filename'. This should create a .zo file in the 'compiled'
|
|
subdirectory.
|
|
|
|
Compiling the yacc.ss file to an extension can produce a good speedup
|
|
in generated parsers since the yacc.ss file contains the interpreter
|
|
for the generated parse tables. If mzscheme is able to compile
|
|
extensions (a c compiler must be available) run the commands:
|
|
cd ${PLTHOME}/collects/parser-tools
|
|
mzc --auto-dir yacc.ss
|
|
|
|
|
|
_yacc-to-scheme.ss_
|
|
This library provides one function:
|
|
> (trans filename) - reads a C YACC/BISON grammar from filename and
|
|
produces an s-expression that represents a scheme parser for use
|
|
with the yacc.ss module.
|
|
This library is intended to assist in the manual conversion of
|
|
grammars for use with yacc.ss and not as a fully automatic conversion
|
|
tool. It is not entirely robust. For example, if the C actions in
|
|
the original grammar have nested blocks the tool will fail.
|
|
|
|
|
|
Annotated examples are in the examples subdirectory of the parser-tools
|
|
collection directory.
|
|
|
|
|