HtmlPrag: Pragmatic Parsing and Emitting of HTML using SXML and SHTML
*********************************************************************

Version 0.16, 2005-12-18, `http://www.neilvandyke.org/htmlprag/'

by Neil W. Van Dyke <neil@neilvandyke.org>

     Copyright (C) 2003 - 2005 Neil W. Van Dyke. This program is Free
     Software; you can redistribute it and/or modify it under the terms
     of the GNU Lesser General Public License as published by the Free
     Software Foundation; either version 2.1 of the License, or (at
     your option) any later version. This program is distributed in
     the hope that it will be useful, but without any warranty; without
     even the implied warranty of merchantability or fitness for a
     particular purpose. See <http://www.gnu.org/copyleft/lesser.html>
     for details. For other license options and consulting, contact
     the author.

Introduction
************

HtmlPrag provides permissive HTML parsing and emitting capability to
Scheme programs. The parser is useful for software agent extraction of
information from Web pages, for programmatically transforming HTML
files, and for implementing interactive Web browsers. HtmlPrag emits
"SHTML," which is an encoding of HTML in SXML
(http://pobox.com/~oleg/ftp/Scheme/SXML.html), so that conventional
HTML may be processed with XML tools such as SXPath
(http://pair.com/lisovsky/query/sxpath/). Like Oleg Kiselyov's
SSAX-based HTML parser
(http://pobox.com/~oleg/ftp/Scheme/xml.html#HTML-parser), HtmlPrag
provides a permissive tokenizer, but also attempts to recover
structure. HtmlPrag also includes procedures for encoding SHTML in
HTML syntax.

The HtmlPrag parsing behavior is permissive in that it accepts
erroneous HTML, handling several classes of HTML syntax errors
gracefully, without yielding a parse error. This is crucial for
parsing arbitrary real-world Web pages, since many pages actually
contain syntax errors that would defeat a strict or validating parser.
HtmlPrag's handling of errors is intended to generally emulate popular
Web browsers' interpretation of the structure of erroneous HTML. We
euphemistically term this kind of parse "pragmatic."

HtmlPrag also has some support for XHTML, although XML namespace
qualifiers are currently accepted but stripped from the resulting
SHTML. Note that valid XHTML input is of course better handled by a
validating XML parser like Kiselyov's SSAX
(http://pobox.com/~oleg/ftp/Scheme/xml.html#XML-parser).

HtmlPrag requires R5RS, SRFI-6, and SRFI-23.

SHTML and SXML
**************

SHTML is a variant of SXML, with two minor but useful extensions:

  1. The SXML keyword symbols, such as `*TOP*', are defined to be in all
     uppercase, regardless of the case-sensitivity of the reader of the
     hosting Scheme implementation in any context. This avoids several
     pitfalls.

  2. Since not all character entity references used in HTML can be
     converted to Scheme characters in all R5RS Scheme implementations,
     nor represented in conventional text files or other common
     external text formats to which one might wish to write SHTML,
     SHTML adds a special `&' syntax for non-ASCII (or
     non-Extended-ASCII) characters. The syntax is `(& VAL)', where
     VAL is a symbol or string giving the symbolic name of the
     character, or an integer giving the numeric value of the character.

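As an illustration of the `(& VAL)' syntax described in item 2, the
following sketch recognizes character entity references in a
handwritten SHTML fragment and collects their values. The names
`char-entity?' and `entity-values' are hypothetical helpers written for
this example. They are not part of HtmlPrag, and the code uses only
R5RS.

```scheme
;; Hypothetical helpers for illustration; not part of HtmlPrag.

;; True if OBJ has the shape (& VAL), where VAL is a symbol, a string,
;; or an integer.
(define (char-entity? obj)
  (and (pair? obj)
       (eq? (car obj) '&)
       (pair? (cdr obj))
       (null? (cddr obj))
       (let ((val (cadr obj)))
         (or (symbol? val) (string? val) (integer? val)))))

;; Collects the VAL of every (& VAL) form in NODE, in document order.
;; NODE is assumed to be a string or a proper list, as in SHTML.
(define (entity-values node)
  (cond ((char-entity? node) (list (cadr node)))
        ((pair? node) (apply append (map entity-values node)))
        (else '())))

;; A paragraph whose two non-ASCII characters (e-acute and the euro
;; sign) are written as entity references: one named, one numeric.
(define fragment '(p "Caf" (& eacute) " costs " (& 8364) "5"))

(entity-values fragment)  ; => (eacute 8364)
```

Keeping such characters as `(& VAL)' forms means the fragment can be
written to an ASCII-only file and read back unchanged.
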
> shtml-comment-symbol
> shtml-decl-symbol
> shtml-empty-symbol
> shtml-end-symbol
> shtml-entity-symbol
> shtml-pi-symbol
> shtml-start-symbol
> shtml-text-symbol
> shtml-top-symbol
     These variables are bound to the following case-sensitive symbols
     used in SHTML, respectively: `*COMMENT*', `*DECL*', `*EMPTY*',
     `*END*', `*ENTITY*', `*PI*', `*START*', `*TEXT*', and `*TOP*'.
     These can be used in lieu of the literal symbols in programs read
     by a case-insensitive Scheme reader.(1)

> shtml-named-char-id
> shtml-numeric-char-id
     These variables are bound to the SHTML entity public identifier
     strings used in SHTML `*ENTITY*' named and numeric character entity
     references.

> (make-shtml-entity val)
     Yields an SHTML character entity reference for VAL. For example:

          (make-shtml-entity "rArr")                  => (& rArr)
          (make-shtml-entity (string->symbol "rArr")) => (& rArr)
          (make-shtml-entity 151)                     => (& 151)

> (shtml-entity-value obj)
     Yields the value for the SHTML entity OBJ, or `#f' if OBJ is not a
     recognized entity. Values of named entities are symbols, and
     values of numeric entities are numbers. An error may be raised if
     OBJ is an entity with system ID inconsistent with its public ID.
     For example:

          (define (f s) (shtml-entity-value (cadr (html->shtml s))))
          (f "&nbsp;")  => nbsp
          (f "&#2000;") => 2000

Tokenizing
**********

The tokenizer is used by the higher-level structural parser, but can
also be called directly for debugging purposes or unusual applications.
Some of the list structure of tokens, such as for start tag tokens, is
mutated and incorporated into the SHTML list structure emitted by the
parser.

> (make-html-tokenizer in normalized?)
     Constructs an HTML tokenizer procedure on input port IN. If
     boolean NORMALIZED? is true, then tokens will be in a format
     conducive to use with a parser emitting normalized SXML. Each
     call to the resulting procedure yields a successive token from the
     input. When the tokens have been exhausted, the procedure returns
     the null list. For example:

          (define input (open-input-string "<a href=\"foo\">bar</a>"))
          (define next (make-html-tokenizer input #f))
          (next) => (a (@ (href "foo")))
          (next) => "bar"
          (next) => (*END* a)
          (next) => ()
          (next) => ()

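Because the procedure returns the null list once the input is
exhausted, a caller can drain it with a simple loop. The sketch below
assumes only that protocol. So that it runs without HtmlPrag loaded, it
uses `list->tokenizer', a hypothetical stand-in that replays a fixed
list of tokens; `drain-tokenizer' is likewise a name invented for this
example.

```scheme
;; Hypothetical stand-in for a procedure made by make-html-tokenizer:
;; each call yields the next token from TOKENS, then the null list once
;; the tokens are used up.
(define (list->tokenizer tokens)
  (lambda ()
    (if (null? tokens)
        '()
        (let ((token (car tokens)))
          (set! tokens (cdr tokens))
          token))))

;; Calls NEXT until it yields the null list, and returns the tokens
;; seen, in order.
(define (drain-tokenizer next)
  (let loop ((acc '()))
    (let ((token (next)))
      (if (null? token)
          (reverse acc)
          (loop (cons token acc))))))

;; Tokens shaped like those in the example above. The symbols are built
;; with string->symbol so that the code does not depend on how the
;; reader treats `@' or uppercase letters.
(define at-symbol (string->symbol "@"))
(define end-symbol (string->symbol "*END*"))

(define next
  (list->tokenizer
   (list (list 'a (list at-symbol (list 'href "foo")))
         "bar"
         (list end-symbol 'a))))

(define tokens (drain-tokenizer next))

(length tokens)  ; => 3
(cadr tokens)    ; => "bar"
(next)           ; => ()
```

With HtmlPrag loaded, `tokenize-html' already returns such a list, so a
hand-written loop like this is mainly of interest when tokens need to
be handled one at a time.
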
> (tokenize-html in normalized?)
     Returns a list of tokens from input port IN, normalizing according
     to boolean NORMALIZED?. This is probably most useful as a
     debugging convenience. For example:

          (tokenize-html (open-input-string "<a href=\"foo\">bar</a>") #f)
          => ((a (@ (href "foo"))) "bar" (*END* a))

> (shtml-token-kind token)
     Returns a symbol indicating the kind of tokenizer TOKEN:
     `*COMMENT*', `*DECL*', `*EMPTY*', `*END*', `*ENTITY*', `*PI*',
     `*START*', `*TEXT*'. This is used by higher-level parsing code.
     For example:

          (map shtml-token-kind
               (tokenize-html (open-input-string "<a<b>><c</</c") #f))
          => (*START* *START* *TEXT* *START* *END* *END*)

Parsing
*******

Most applications will call a parser procedure such as `html->shtml'
rather than calling the tokenizer directly.

> (parse-html/tokenizer tokenizer normalized?)
     Emits a parse tree like `html->shtml' and related procedures,
     except using TOKENIZER as a source of tokens, rather than
     tokenizing from an input port. This procedure is used internally,
     and generally should not be called directly.

> (html->sxml-0nf input)
> (html->sxml-1nf input)
> (html->sxml-2nf input)
> (html->sxml input)
> (html->shtml input)
     Permissively parse HTML from INPUT, which is either an input port
     or a string, and emit an SHTML equivalent or approximation. To
     borrow and slightly modify an example from Kiselyov's discussion
     of his HTML parser:

          (html->shtml
           "<html><head><title></title><title>whatever</title></head><body>
          <a href=\"url\">link</a><p align=center><ul compact style=\"aa\">
          <p>BLah<!-- comment <comment> --> <i> italic <b> bold <tt> ened</i>
          still &lt; bold </b></body><P> But not done yet...")
          =>
          (*TOP* (html (head (title) (title "whatever"))
                       (body "\n"
                             (a (@ (href "url")) "link")
                             (p (@ (align "center"))
                                (ul (@ (compact) (style "aa")) "\n"))
                             (p "BLah"
                                (*COMMENT* " comment <comment> ")
                                " "
                                (i " italic " (b " bold " (tt " ened")))
                                "\n"
                                "still < bold "))
                       (p " But not done yet...")))

     Note that in the emitted SHTML the text token `"still < bold"' is
     _not_ inside the `b' element, which represents an unfortunate
     failure to emulate all the quirks-handling behavior of some
     popular Web browsers.

     The procedures `html->sxml-Nnf' for N 0 through 2 correspond to
     the 0th through 2nd normal forms defined in the SXML
     specification, and indicate the minimal requirements that the
     emitted SXML meets.

     `html->sxml' and `html->shtml' are currently aliases for
     `html->sxml-0nf', and can be used in scripts and interactively,
     when terseness is important and any normal form of SXML would
     suffice.

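Because the parse result is ordinary list structure, it can be
inspected with plain list operations as well as with SXML tools. The
following sketch collects `href' attribute values from a small tree
modeled on the output above, with a second link added for the example.
`element-attributes' and `collect-hrefs' are hypothetical helpers, not
HtmlPrag procedures, and the sketch assumes non-normalized SHTML in
which an attribute list, when present, is the `(@ ...)' form directly
after the element name.

```scheme
;; Hypothetical helpers for illustration; not part of HtmlPrag.

;; Built with string->symbol so the code does not rely on the reader
;; accepting `@' as a symbol.
(define at-symbol (string->symbol "@"))

;; Returns the attribute entries of element NODE, or () if it has none.
(define (element-attributes node)
  (if (and (pair? (cdr node))
           (pair? (cadr node))
           (eq? (car (cadr node)) at-symbol))
      (cdr (cadr node))
      '()))

;; Collects every href attribute value in TREE, in document order.
(define (collect-hrefs tree)
  (cond ((not (pair? tree)) '())            ; text and other atoms
        ((eq? (car tree) at-symbol) '())    ; attribute lists hold no elements
        (else
         (let ((href (assq 'href (element-attributes tree))))
           (append (if (and href (pair? (cdr href)))
                       (list (cadr href))
                       '())
                   (apply append (map collect-hrefs (cdr tree))))))))

(define tree
  `(body "\n"
         (a (,at-symbol (href "url")) "link")
         (p (,at-symbol (align "center"))
            (ul (,at-symbol (compact) (style "aa")) "\n"))
         (p (a (,at-symbol (href "other")) "more"))))

(collect-hrefs tree)  ; => ("url" "other")
```

With HtmlPrag loaded, the same procedure should be applicable to the
value of `html->shtml', subject to the assumption about attribute lists
stated above.
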
Emitting HTML
*************

Two procedures encode the SHTML representation as conventional HTML:
`write-shtml-as-html' and `shtml->html'. These are perhaps most useful
for emitting the result of parsed and transformed input HTML. They can
also be used for emitting HTML from generated or handwritten SHTML.

> (write-shtml-as-html shtml [out [foreign-filter]])
     Writes a conventional HTML transliteration of the SHTML SHTML to
     output port OUT. If OUT is not specified, the default is the
     current output port. HTML elements of types that are always empty
     are written using HTML4-compatible XHTML tag syntax.

     If FOREIGN-FILTER is specified, it is a procedure of two arguments
     that is applied to any non-SHTML ("foreign") object encountered in
     SHTML, and should yield SHTML. The first argument is the object,
     and the second argument is a boolean for whether or not the object
     is part of an attribute value.

     No inter-tag whitespace or line break that is not explicit in
     SHTML is emitted. The SHTML should normally include a newline at
     the end of the document. For example:

          (write-shtml-as-html
           '((html (head (title "My Title"))
                   (body (@ (bgcolor "white"))
                         (h1 "My Heading")
                         (p "This is a paragraph.")
                         (p "This is another paragraph.")))))
          -| <html><head><title>My Title</title></head><body bgcolor="whi
          -| te"><h1>My Heading</h1><p>This is a paragraph.</p><p>This is
          -| another paragraph.</p></body></html>

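A FOREIGN-FILTER is an ordinary two-argument procedure, so it can be
written and exercised on its own. The sketch below is one possible
filter, under the assumption that the only foreign objects expected in
the SHTML are numbers; `number-filter' is a hypothetical name. It
renders a number as a string, and outside attribute values wraps it in
a `span' element, purely to show the second argument being used.

```scheme
;; Hypothetical foreign-filter for illustration; not part of HtmlPrag.
;; OBJ is the non-SHTML object, and IN-ATTRIBUTE-VALUE? is true when
;; OBJ was found inside an attribute value. The result should itself
;; be SHTML.
(define (number-filter obj in-attribute-value?)
  (cond ((not (number? obj))
         ;; Anything unexpected is replaced with the empty string, so
         ;; that the result is still SHTML text.
         "")
        (in-attribute-value?
         ;; An attribute value cannot hold an element, so yield text.
         (number->string obj))
        (else
         (list 'span (number->string obj)))))

(number-filter 42 #t)  ; => "42"
(number-filter 42 #f)  ; => (span "42")
(number-filter 'x #f)  ; => ""
```

With HtmlPrag loaded, such a filter would be passed as the third
argument, for example `(write-shtml-as-html shtml (current-output-port)
number-filter)'.
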
> (shtml->html shtml)
     Yields an HTML encoding of SHTML SHTML as a string. For example:

          (shtml->html
           (html->shtml
            "<P>This is<br<b<I>bold </foo>italic</ b > text.</p>"))
          => "<p>This is<br /><b><i>bold italic</i></b> text.</p>"

     Note that, since this procedure constructs a string, it should
     normally only be used when the HTML is relatively small. When
     encoding HTML documents of conventional size and larger,
     `write-shtml-as-html' is much more efficient.

Tests
*****

The HtmlPrag test suite can be enabled by editing the source code file
and loading Testeez (http://www.neilvandyke.org/testeez/).

History
*******

Version 0.16 -- 2005-12-18
     Documentation fix.

Version 0.15 -- 2005-12-18
     In the HTML parent element constraints that are used for structure
     recovery, `div' is now always permitted as a parent, as a stopgap
     measure until substantial time can be spent reworking the
     algorithm to better support `div' (bug reported by Corey Sweeney
     and Jepri). Also no longer convert to Scheme character any HTML
     numeric character reference with value above 126, to avoid Unicode
     problem with PLT 299/300 (bug reported by Corey Sweeney).

Version 0.14 -- 2005-06-16
     XML CDATA sections are now tokenized. Thanks to Alejandro Forero
     Cuervo for suggesting this feature. The deprecated procedures
     `sxml->html' and `write-sxml-html' have been removed. Minor
     documentation changes.

Version 0.13 -- 2005-02-23
     HtmlPrag now requires `syntax-rules', and a reader that can read
     `@' as a symbol. SHTML now has a special `&' element for
     character entities, and it is emitted by the parser rather than
     the old `*ENTITY*' kludge. `shtml-entity-value' supports both the
     new and the old character entity representations.
     `shtml-entity-value' now yields `#f' on invalid SHTML entity,
     rather than raising an error. `write-shtml-as-html' now has a
     third argument, `foreign-filter'. `write-shtml-as-html' now emits
     SHTML `&' entity references. Changed `shtml-named-char-id' and
     `shtml-numeric-char-id', as previously warned. Testeez is now
     used for the test suite. Test procedure is now the internal
     `%htmlprag:test'. Documentation changes. Notably, much
     documentation about using HtmlPrag under various particular Scheme
     implementations has been removed.

Version 0.12 -- 2004-07-12
     Forward-slash in an unquoted attribute value is now considered a
     value constituent rather than an unconsumed terminator of the
     value (thanks to Maurice Davis for reporting and a suggested fix).
     `xml:' is now preserved as a namespace qualifier (thanks to Peter
     Barabas for reporting). Output port term of `write-shtml-as-html'
     is now optional. Began documenting loading for particular
     implementation-specific packagings.

Version 0.11 -- 2004-05-13
     To reduce likely namespace collisions with SXML tools, and in
     anticipation of a forthcoming set of new features, introduced the
     concept of "SHTML," which will be elaborated upon in a future
     version of HtmlPrag. Renamed `sxml-X-symbol' to `shtml-X-symbol',
     `sxml-html-X' to `shtml-X', and `sxml-token-kind' to
     `shtml-token-kind'. `html->shtml', `shtml->html', and
     `write-shtml-as-html' have been added as names. Considered
     deprecated but still defined (see the "Deprecated" section of this
     documentation) are `sxml->html' and `write-sxml-html'. The
     growing pains should now be all but over. Internally,
     `htmlprag-internal:error' introduced for Bigloo portability. SISC
     returned to the test list; thanks to Scott G. Miller for his
     help. Fixed a new character `eq?' bug, thanks to SISC.

Version 0.10 -- 2004-05-11
     All public identifiers have been renamed to drop the "`htmlprag:'"
     prefix. The portability identifiers have been renamed to begin
     with an `htmlprag-internal:' prefix, are now considered strictly
     internal-use-only, and have otherwise been changed. `parse-html'
     and `always-empty-html-elements' are no longer public.
     `test-htmlprag' now tests `html->sxml' rather than `parse-html'.
     SISC temporarily removed from the test list, until an open source
     Java that works correctly is found.

Version 0.9 -- 2004-05-07
     HTML encoding procedures added. Added
     `htmlprag:sxml-html-entity-value'. Upper-case `X' in hexadecimal
     character entities is now parsed, in addition to lower-case `x'.
     Added `htmlprag:always-empty-html-elements'. Added additional
     portability bindings. Added more test cases.

Version 0.8 -- 2004-04-27
     Entity references (symbolic, decimal numeric, hexadecimal numeric)
     are now parsed into `*ENTITY*' SXML. SXML symbols like `*TOP*'
     are now always upper-case, regardless of the Scheme
     implementation. Identifiers such as `htmlprag:sxml-top-symbol'
     are bound to the upper-case symbols. Procedures
     `htmlprag:html->sxml-0nf', `htmlprag:html->sxml-1nf', and
     `htmlprag:html->sxml-2nf' have been added. `htmlprag:html->sxml'
     now an alias for `htmlprag:html->sxml-0nf'. `htmlprag:parse' has
     been refashioned as `htmlprag:parse-html' and should no longer be
     called directly. A number of identifiers have been renamed to be
     more appropriate when the `htmlprag:' prefix is dropped in some
     implementation-specific packagings of HtmlPrag:
     `htmlprag:make-tokenizer' to `htmlprag:make-html-tokenizer',
     `htmlprag:parse/tokenizer' to `htmlprag:parse-html/tokenizer',
     `htmlprag:html->token-list' to `htmlprag:tokenize-html',
     `htmlprag:token-kind' to `htmlprag:sxml-token-kind', and
     `htmlprag:test' to `htmlprag:test-htmlprag'. Verbatim elements
     with empty-element tag syntax are handled correctly. New versions
     of Bigloo and RScheme tested.

Version 0.7 -- 2004-03-10
     Verbatim pair elements like `script' and `xmp' are now parsed
     correctly. Two Scheme implementations have temporarily been
     dropped from regression testing: Kawa, due to a Java bytecode
     verifier error likely due to a Java installation problem on the
     test machine; and SXM 1.1, due to hitting a limit on the number of
     literals late in the test suite code. Tested newer versions of
     Bigloo, Chicken, Gauche, Guile, MIT Scheme, PLT MzScheme, RScheme,
     SISC, and STklos. RScheme no longer requires the "`(define
     get-output-string close-output-port)'" workaround.

Version 0.6 -- 2003-07-03
     Fixed uses of `eq?' in character comparisons, thanks to Scott G.
     Miller. Added `htmlprag:html->normalized-sxml' and
     `htmlprag:html->nonnormalized-sxml'. Started to add
     `close-output-port' to uses of output strings, then reverted due to
     bug in one of the supported dialects. Tested newer versions of
     Bigloo, Gauche, PLT MzScheme, RScheme.

Version 0.5 -- 2003-02-26
     Removed uses of `call-with-values'. Re-ordered top-level
     definitions, for portability. Now tests under Kawa 1.6.99,
     RScheme 0.7.3.2, Scheme 48 0.57, SISC 1.7.4, STklos 0.54, and SXM
     1.1.

Version 0.4 -- 2003-02-19
     Apostrophe-quoted element attribute values are now handled. A bug
     that incorrectly assumed left-to-right term evaluation order has
     been fixed (thanks to MIT Scheme for confronting us with this).
     Now also tests OK under Gauche 0.6.6 and MIT Scheme 7.7.1.
     Portability improvement for implementations (e.g., RScheme
     0.7.3.2.b6, Stalin 0.9) that cannot read `@' as a symbol (although
     those implementations tend to present other portability issues, as
     yet unresolved).

Version 0.3 -- 2003-02-05
     A test suite with 66 cases has been added, and necessary changes
     have been made for the suite to pass on five popular Scheme
     implementations. XML processing instructions are now parsed.
     Parent constraints have been added for `colgroup', `tbody', and
     `thead' elements. Erroneous input, including invalid hexadecimal
     entity reference syntax and extraneous double quotes in element
     tags, is now parsed better. `htmlprag:token-kind' emits symbols
     more consistent with SXML.

Version 0.2 -- 2003-02-02
     Portability improvements.

Version 0.1 -- 2003-01-31
     Dusted off old Guile-specific code from April 2001, converted to
     emit SXML, mostly ported to R5RS and SRFI-6, added some XHTML
     support and documentation. A little preliminary testing has been
     done, and the package is already useful for some applications, but
     this release should be considered a preview to invite comments.

---------- Footnotes ----------

(1) Scheme implementors who have not yet made `read'
case-sensitive by default are encouraged to do so.