459 lines
25 KiB
HTML
459 lines
25 KiB
HTML
<html lang="en">
|
|
<head>
|
|
<title>HtmlPrag: Pragmatic Parsing and Emitting of HTML using SXML and SHTML</title>
|
|
<meta http-equiv="Content-Type" content="text/html">
|
|
<meta name="description" content="HtmlPrag: Pragmatic Parsing and Emitting of HTML using SXML and SHTML">
|
|
<meta name="generator" content="makeinfo 4.7">
|
|
<link title="Top" rel="top" href="#Top">
|
|
<link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage">
|
|
<meta http-equiv="Content-Style-Type" content="text/css">
|
|
<style type="text/css"><!--
|
|
pre.display { font-family:inherit }
|
|
pre.format { font-family:inherit }
|
|
pre.smalldisplay { font-family:inherit; font-size:smaller }
|
|
pre.smallformat { font-family:inherit; font-size:smaller }
|
|
pre.smallexample { font-size:smaller }
|
|
pre.smalllisp { font-size:smaller }
|
|
span.sc { font-variant:small-caps }
|
|
span.roman { font-family: serif; font-weight: normal; }
|
|
--></style>
|
|
</head>
|
|
<body>
|
|
<a name="Top"></a>
|
|
|
|
<h2 class="unnumbered">HtmlPrag: Pragmatic Parsing and Emitting of HTML using SXML and SHTML</h2>
|
|
|
|
<p class="noindent">Version <b>0.16</b>, 2005-12-18, <a href="http://www.neilvandyke.org/htmlprag/">http://www.neilvandyke.org/htmlprag/</a>
|
|
|
|
<p class="noindent">by
|
|
Neil W. Van Dyke
|
|
<<code>neil@neilvandyke.org</code>>
|
|
|
|
<blockquote>
|
|
Copyright © 2003 - 2005 Neil W. Van Dyke. This program is Free
|
|
Software; you can redistribute it and/or modify it under the terms of the
|
|
GNU Lesser General Public License as published by the Free Software
|
|
Foundation; either version 2.1 of the License, or (at your option) any
|
|
later version. This program is distributed in the hope that it will be
|
|
useful, but without any warranty; without even the implied warranty of
|
|
merchantability or fitness for a particular purpose. See
|
|
<<code>http://www.gnu.org/copyleft/lesser.html</code>> for details. For
|
|
other license options and consulting, contact the author.
|
|
</blockquote>
|
|
|
|
<h2 class="chapter">Introduction</h2>
|
|
|
|
<p>HtmlPrag provides permissive HTML parsing and emitting capability to Scheme
|
|
programs. The parser is useful for software agent extraction of
|
|
information from Web pages, for programmatically transforming HTML files,
|
|
and for implementing interactive Web browsers. HtmlPrag emits “SHTML,”
|
|
which is an encoding of HTML in
|
|
<a href="http://pobox.com/~oleg/ftp/Scheme/SXML.html">SXML</a>, so that
|
|
conventional HTML may be processed with XML tools such as
|
|
<a href="http://pair.com/lisovsky/query/sxpath/">SXPath</a>. Like Oleg
|
|
Kiselyov's <a href="http://pobox.com/~oleg/ftp/Scheme/xml.html#HTML-parser">SSAX-based HTML parser</a>, HtmlPrag provides a permissive tokenizer, but also
|
|
attempts to recover structure. HtmlPrag also includes procedures for
|
|
encoding SHTML in HTML syntax.
|
|
|
|
<p>The HtmlPrag parsing behavior is permissive in that it accepts erroneous
|
|
HTML, handling several classes of HTML syntax errors gracefully, without
|
|
yielding a parse error. This is crucial for parsing arbitrary real-world
|
|
Web pages, since many pages actually contain syntax errors that would
|
|
defeat a strict or validating parser. HtmlPrag's handling of errors is
|
|
intended to generally emulate popular Web browsers' interpretation of the
|
|
structure of erroneous HTML. We euphemistically term this kind of parse
|
|
“pragmatic.”
|
|
|
|
<p>HtmlPrag also has some support for XHTML, although XML namespace qualifiers
|
|
are currently accepted but stripped from the resulting SHTML. Note that
|
|
valid XHTML input is of course better handled by a validating XML parser
|
|
like Kiselyov's
|
|
<a href="http://pobox.com/~oleg/ftp/Scheme/xml.html#XML-parser">SSAX</a>.
|
|
|
|
<p>HtmlPrag requires R5RS, SRFI-6, and SRFI-23.
|
|
|
|
<h2 class="chapter">SHTML and SXML</h2>
|
|
|
|
<p>SHTML is a variant of SXML, with two minor but useful extensions:
|
|
|
|
<ol type=1 start=1>
|
|
|
|
<li>The SXML keyword symbols, such as <code>*TOP*</code>, are defined to be in all
|
|
uppercase, regardless of the case-sensitivity of the reader of the hosting
|
|
Scheme implementation in any context. This avoids several pitfalls.
|
|
|
|
<li>Since not all character entity references used in HTML can be converted to
|
|
Scheme characters in all R5RS Scheme implementations, nor represented in
|
|
conventional text files or other common external text formats to which one
|
|
might wish to write SHTML, SHTML adds a special <code>&</code> syntax for
|
|
non-ASCII (or non-Extended-ASCII) characters. The syntax is <code>(&
|
|
</code><var>val</var><code>)</code>, where <var>val</var> is a symbol or string naming with the symbolic
|
|
name of the character, or an integer with the numeric value of the
|
|
character.
|
|
|
|
</ol>
|
|
|
|
<div class="defun">
|
|
— Variable: <b>shtml-comment-symbol</b><var><a name="index-shtml_002dcomment_002dsymbol-1"></a></var><br>
|
|
— Variable: <b>shtml-decl-symbol</b><var><a name="index-shtml_002ddecl_002dsymbol-2"></a></var><br>
|
|
— Variable: <b>shtml-empty-symbol</b><var><a name="index-shtml_002dempty_002dsymbol-3"></a></var><br>
|
|
— Variable: <b>shtml-end-symbol</b><var><a name="index-shtml_002dend_002dsymbol-4"></a></var><br>
|
|
— Variable: <b>shtml-entity-symbol</b><var><a name="index-shtml_002dentity_002dsymbol-5"></a></var><br>
|
|
— Variable: <b>shtml-pi-symbol</b><var><a name="index-shtml_002dpi_002dsymbol-6"></a></var><br>
|
|
— Variable: <b>shtml-start-symbol</b><var><a name="index-shtml_002dstart_002dsymbol-7"></a></var><br>
|
|
— Variable: <b>shtml-text-symbol</b><var><a name="index-shtml_002dtext_002dsymbol-8"></a></var><br>
|
|
— Variable: <b>shtml-top-symbol</b><var><a name="index-shtml_002dtop_002dsymbol-9"></a></var><br>
|
|
<blockquote>
|
|
<p>These variables are bound to the following case-sensitive symbols used in
|
|
SHTML, respectively: <code>*COMMENT*</code>, <code>*DECL*</code>, <code>*EMPTY*</code>,
|
|
<code>*END*</code>, <code>*ENTITY*</code>, <code>*PI*</code>, <code>*START*</code>, <code>*TEXT*</code>,
|
|
and <code>*TOP*</code>. These can be used in lieu of the literal symbols in
|
|
programs read by a case-insensitive Scheme reader.<a rel="footnote" href="#fn-1" name="fnd-1"><sup>1</sup></a>
|
|
</p></blockquote></div>
|
|
|
|
<div class="defun">
|
|
— Variable: <b>shtml-named-char-id</b><var><a name="index-shtml_002dnamed_002dchar_002did-10"></a></var><br>
|
|
— Variable: <b>shtml-numeric-char-id</b><var><a name="index-shtml_002dnumeric_002dchar_002did-11"></a></var><br>
|
|
<blockquote>
|
|
<p>These variables are bound to the SHTML entity public identifier strings
|
|
used in SHTML <code>*ENTITY*</code> named and numeric character entity
|
|
references.
|
|
</p></blockquote></div>
|
|
|
|
<div class="defun">
|
|
— Procedure: <b>make-shtml-entity</b><var> val<a name="index-make_002dshtml_002dentity-12"></a></var><br>
|
|
<blockquote>
|
|
<p>Yields an SHTML character entity reference for <var>val</var>. For example:
|
|
|
|
<pre class="lisp"> (make-shtml-entity "rArr") => (& rArr)
|
|
(make-shtml-entity (string->symbol "rArr")) => (& rArr)
|
|
(make-shtml-entity 151) => (& 151)
|
|
</pre>
|
|
</blockquote></div>
|
|
|
|
<div class="defun">
|
|
— Procedure: <b>shtml-entity-value</b><var> obj<a name="index-shtml_002dentity_002dvalue-13"></a></var><br>
|
|
<blockquote>
|
|
<p>Yields the value for the SHTML entity <var>obj</var>, or <code>#f</code> if <var>obj</var>
|
|
is not a recognized entity. Values of named entities are symbols, and
|
|
values of numeric entities are numbers. An error may raised if <var>obj</var>
|
|
is an entity with system ID inconsistent with its public ID. For example:
|
|
|
|
<pre class="lisp"> (define (f s) (shtml-entity-value (cadr (html->shtml s))))
|
|
(f "&nbsp;") => nbsp
|
|
(f "&#2000;") => 2000
|
|
</pre>
|
|
</blockquote></div>
|
|
|
|
<h2 class="chapter">Tokenizing</h2>
|
|
|
|
<p>The tokenizer is used by the higher-level structural parser, but can also
|
|
be called directly for debugging purposes or unusual applications. Some of
|
|
the list structure of tokens, such as for start tag tokens, is mutated and
|
|
incorporated into the SHTML list structure emitted by the parser.
|
|
|
|
<div class="defun">
|
|
— Procedure: <b>make-html-tokenizer</b><var> in normalized?<a name="index-make_002dhtml_002dtokenizer-14"></a></var><br>
|
|
<blockquote>
|
|
<p>Constructs an HTML tokenizer procedure on input port <var>in</var>. If boolean
|
|
<var>normalized?</var> is true, then tokens will be in a format conducive to use
|
|
with a parser emitting normalized SXML. Each call to the resulting
|
|
procedure yields a successive token from the input. When the tokens have
|
|
been exhausted, the procedure returns the null list. For example:
|
|
|
|
<pre class="lisp"> (define input (open-input-string "<a href=\"foo\">bar</a>"))
|
|
(define next (make-html-tokenizer input #f))
|
|
(next) => (a (@ (href "foo")))
|
|
(next) => "bar"
|
|
(next) => (*END* a)
|
|
(next) => ()
|
|
(next) => ()
|
|
</pre>
|
|
</blockquote></div>
|
|
|
|
<div class="defun">
|
|
— Procedure: <b>tokenize-html</b><var> in normalized?<a name="index-tokenize_002dhtml-15"></a></var><br>
|
|
<blockquote>
|
|
<p>Returns a list of tokens from input port <var>in</var>, normalizing according to
|
|
boolean <var>normalized?</var>. This is probably most useful as a debugging
|
|
convenience. For example:
|
|
|
|
<pre class="lisp"> (tokenize-html (open-input-string "<a href=\"foo\">bar</a>") #f)
|
|
=> ((a (@ (href "foo"))) "bar" (*END* a))
|
|
</pre>
|
|
</blockquote></div>
|
|
|
|
<div class="defun">
|
|
— Procedure: <b>shtml-token-kind</b><var> token<a name="index-shtml_002dtoken_002dkind-16"></a></var><br>
|
|
<blockquote>
|
|
<p>Returns a symbol indicating the kind of tokenizer <var>token</var>:
|
|
<code>*COMMENT*</code>, <code>*DECL*</code>, <code>*EMPTY*</code>, <code>*END*</code>,
|
|
<code>*ENTITY*</code>, <code>*PI*</code>, <code>*START*</code>, <code>*TEXT*</code>.
|
|
This is used by higher-level parsing code. For example:
|
|
|
|
<pre class="lisp"> (map shtml-token-kind
|
|
(tokenize-html (open-input-string "<a<b>><c</</c") #f))
|
|
=> (*START* *START* *TEXT* *START* *END* *END*)
|
|
</pre>
|
|
</blockquote></div>
|
|
|
|
<h2 class="chapter">Parsing</h2>
|
|
|
|
<p>Most applications will call a parser procedure such as
|
|
<code>html->shtml</code> rather than calling the tokenizer directly.
|
|
|
|
<div class="defun">
|
|
— Procedure: <b>parse-html/tokenizer</b><var> tokenizer normalized?<a name="index-parse_002dhtml_002ftokenizer-17"></a></var><br>
|
|
<blockquote>
|
|
<p>Emits a parse tree like <code>html->shtml</code> and related procedures, except
|
|
using <var>tokenizer</var> as a source of tokens, rather than tokenizing from an
|
|
input port. This procedure is used internally, and generally should not be
|
|
called directly.
|
|
</p></blockquote></div>
|
|
|
|
<div class="defun">
|
|
— Procedure: <b>html->sxml-0nf</b><var> input<a name="index-html_002d_003esxml_002d0nf-18"></a></var><br>
|
|
— Procedure: <b>html->sxml-1nf</b><var> input<a name="index-html_002d_003esxml_002d1nf-19"></a></var><br>
|
|
— Procedure: <b>html->sxml-2nf</b><var> input<a name="index-html_002d_003esxml_002d2nf-20"></a></var><br>
|
|
— Procedure: <b>html->sxml</b><var> input<a name="index-html_002d_003esxml-21"></a></var><br>
|
|
— Procedure: <b>html->shtml</b><var> input<a name="index-html_002d_003eshtml-22"></a></var><br>
|
|
<blockquote>
|
|
<p>Permissively parse HTML from <var>input</var>, which is either an input port or
|
|
a string, and emit an SHTML equivalent or approximation. To borrow and
|
|
slightly modify an example from Kiselyov's discussion of his HTML parser:
|
|
|
|
<pre class="lisp"> (html->shtml
|
|
"<html><head><title></title><title>whatever</title></head><body>
|
|
<a href=\"url\">link</a><p align=center><ul compact style=\"aa\">
|
|
<p>BLah<!-- comment <comment> --> <i> italic <b> bold <tt> ened</i>
|
|
still &lt; bold </b></body><P> But not done yet...")
|
|
=>
|
|
(*TOP* (html (head (title) (title "whatever"))
|
|
(body "\n"
|
|
(a (@ (href "url")) "link")
|
|
(p (@ (align "center"))
|
|
(ul (@ (compact) (style "aa")) "\n"))
|
|
(p "BLah"
|
|
(*COMMENT* " comment <comment> ")
|
|
" "
|
|
(i " italic " (b " bold " (tt " ened")))
|
|
"\n"
|
|
"still < bold "))
|
|
(p " But not done yet...")))
|
|
</pre>
|
|
<p>Note that in the emitted SHTML the text token <code>"still < bold"</code> is
|
|
<em>not</em> inside the <code>b</code> element, which represents an unfortunate
|
|
failure to emulate all the quirks-handling behavior of some popular Web
|
|
browsers.
|
|
|
|
<p>The procedures <code>html->sxml-</code><var>n</var><code>nf</code> for <var>n</var> 0 through 2
|
|
correspond to 0th through 2nd normal forms of SXML as specified in SXML,
|
|
and indicate the minimal requirements of the emitted SXML.
|
|
|
|
<p><code>html->sxml</code> and <code>html->shtml</code> are currently aliases for
|
|
<code>html->sxml-0nf</code>, and can be used in scripts and interactively, when
|
|
terseness is important and any normal form of SXML would suffice.
|
|
</p></blockquote></div>
|
|
|
|
<h2 class="chapter">Emitting HTML</h2>
|
|
|
|
<p>Two procedures encoding the SHTML representation as conventional HTML,
|
|
<code>write-shtml-as-html</code> and <code>shtml->html</code>. These are perhaps most
|
|
useful for emitting the result of parsed and transformed input HTML. They
|
|
can also be used for emitting HTML from generated or handwritten SHTML.
|
|
|
|
<div class="defun">
|
|
— Procedure: <b>write-shtml-as-html</b><var> shtml </var>[<var>out </var>[<var>foreign-filter</var>]]<var><a name="index-write_002dshtml_002das_002dhtml-23"></a></var><br>
|
|
<blockquote>
|
|
<p>Writes a conventional HTML transliteration of the SHTML <var>shtml</var> to
|
|
output port <var>out</var>. If <var>out</var> is not specified, the default is the
|
|
current output port. HTML elements of types that are always empty are
|
|
written using HTML4-compatible XHTML tag syntax.
|
|
|
|
<p>If <var>foreign-filter</var> is specified, it is a procedure of two argument
|
|
that is applied to any non-SHTML (“foreign”) object encountered in
|
|
<var>shtml</var>, and should yield SHTML. The first argument is the object, and
|
|
the second argument is a boolean for whether or not the object is part of
|
|
an attribute value.
|
|
|
|
<p>No inter-tag whitespace or line breaks not explicit in <var>shtml</var> is
|
|
emitted. The <var>shtml</var> should normally include a newline at the end of
|
|
the document. For example:
|
|
|
|
<pre class="lisp"> (write-shtml-as-html
|
|
'((html (head (title "My Title"))
|
|
(body (@ (bgcolor "white"))
|
|
(h1 "My Heading")
|
|
(p "This is a paragraph.")
|
|
(p "This is another paragraph.")))))
|
|
-| <html><head><title>My Title</title></head><body bgcolor="whi
|
|
-| te"><h1>My Heading</h1><p>This is a paragraph.</p><p>This is
|
|
-| another paragraph.</p></body></html>
|
|
</pre>
|
|
</blockquote></div>
|
|
|
|
<div class="defun">
|
|
— Procedure: <b>shtml->html</b><var> shtml<a name="index-shtml_002d_003ehtml-24"></a></var><br>
|
|
<blockquote>
|
|
<p>Yields an HTML encoding of SHTML <var>shtml</var> as a string. For example:
|
|
|
|
<pre class="lisp"> (shtml->html
|
|
(html->shtml
|
|
"<P>This is<br<b<I>bold </foo>italic</ b > text.</p>"))
|
|
=> "<p>This is<br /><b><i>bold italic</i></b> text.</p>"
|
|
</pre>
|
|
<p>Note that, since this procedure constructs a string, it should normally
|
|
only be used when the HTML is relatively small. When encoding HTML
|
|
documents of conventional size and larger, <code>write-shtml-as-html</code> is
|
|
much more efficient.
|
|
</p></blockquote></div>
|
|
|
|
<h2 class="chapter">Tests</h2>
|
|
|
|
<p>The HtmlPrag test suite can be enabled by editing the source code file and
|
|
loading <a href="http://www.neilvandyke.org/testeez/">Testeez</a>.
|
|
|
|
<h2 class="unnumbered">History</h2>
|
|
|
|
<dl>
|
|
<dt>Version 0.16 — 2005-12-18<dd>Documentation fix.
|
|
|
|
<br><dt>Version 0.15 — 2005-12-18<dd>In the HTML parent element constraints that are used for structure
|
|
recovery, <code>div</code> is now always permitted as a parent, as a stopgap
|
|
measure until substantial time can be spent reworking the algorithm to
|
|
better support <code>div</code> (bug reported by Corey Sweeney and Jepri). Also
|
|
no longer convert to Scheme character any HTML numeric character reference
|
|
with value above 126, to avoid Unicode problem with PLT 299/300 (bug
|
|
reported by Corey Sweeney).
|
|
|
|
<br><dt>Version 0.14 — 2005-06-16<dd>XML CDATA sections are now tokenized. Thanks to Alejandro Forero Cuervo
|
|
for suggesting this feature. The deprecated procedures <code>sxml->html</code>
|
|
and <code>write-sxml-html</code> have been removed. Minor documentation changes.
|
|
|
|
<br><dt>Version 0.13 — 2005-02-23<dd>HtmlPrag now requires <code>syntax-rules</code>, and a reader that can read
|
|
<code>@</code> as a symbol. SHTML now has a special <code>&</code> element for
|
|
character entities, and it is emitted by the parser rather than the old
|
|
<code>*ENTITY*</code> kludge. <code>shtml-entity-value</code> supports both the new
|
|
and the old character entity representations. <code>shtml-entity-value</code>
|
|
now yields <code>#f</code> on invalid SHTML entity, rather than raising an error.
|
|
<code>write-shtml-as-html</code> now has a third argument, <code>foreign-filter</code>.
|
|
<code>write-shtml-as-html</code> now emits SHTML <code>&</code> entity references.
|
|
Changed <code>shtml-named-char-id</code> and <code>shtml-numeric-char-id</code>, as
|
|
previously warned. Testeez is now used for the test suite. Test procedure
|
|
is now the internal <code>%htmlprag:test</code>. Documentation changes.
|
|
Notably, much documentation about using HtmlPrag under various particular
|
|
Scheme implementations has been removed.
|
|
|
|
<br><dt>Version 0.12 — 2004-07-12<dd>Forward-slash in an unquoted attribute value is now considered a value
|
|
constituent rather than an unconsumed terminator of the value (thanks to
|
|
Maurice Davis for reporting and a suggested fix). <code>xml:</code> is now
|
|
preserved as a namespace qualifier (thanks to Peter Barabas for
|
|
reporting). Output port term of <code>write-shtml-as-html</code> is now
|
|
optional. Began documenting loading for particular implementation-specific
|
|
packagings.
|
|
|
|
<br><dt>Version 0.11 — 2004-05-13<dd>To reduce likely namespace collisions with SXML tools, and in anticipation
|
|
of a forthcoming set of new features, introduced the concept of “SHTML,”
|
|
which will be elaborated upon in a future version of HtmlPrag. Renamed
|
|
<code>sxml-</code><var>x</var><code>-symbol</code> to <code>shtml-</code><var>x</var><code>-symbol</code>,
|
|
<code>sxml-html-</code><var>x</var> to <code>shtml-</code><var>x</var>, and
|
|
<code>sxml-token-kind</code> to <code>shtml-token-kind</code>. <code>html->shtml</code>,
|
|
<code>shtml->html</code>, and <code>write-shtml-as-html</code> have been added as
|
|
names. Considered deprecated but still defined (see the “Deprecated”
|
|
section of this documentation) are <code>sxml->html</code> and
|
|
<code>write-sxml-html</code>. The growing pains should now be all but over.
|
|
Internally, <code>htmlprag-internal:error</code> introduced for Bigloo
|
|
portability. SISC returned to the test list; thanks to Scott G. Miller
|
|
for his help. Fixed a new character <code>eq?</code> bug, thanks to SISC.
|
|
|
|
<br><dt>Version 0.10 — 2004-05-11<dd>All public identifiers have been renamed to drop the “<code>htmlprag:</code>”
|
|
prefix. The portability identifiers have been renamed to begin with an
|
|
<code>htmlprag-internal:</code> prefix, are now considered strictly
|
|
internal-use-only, and have otherwise been changed. <code>parse-html</code> and
|
|
<code>always-empty-html-elements</code> are no longer public.
|
|
<code>test-htmlprag</code> now tests <code>html->sxml</code> rather than
|
|
<code>parse-html</code>. SISC temporarily removed from the test list, until an
|
|
open source Java that works correctly is found.
|
|
|
|
<br><dt>Version 0.9 — 2004-05-07<dd>HTML encoding procedures added. Added
|
|
<code>htmlprag:sxml-html-entity-value</code>. Upper-case <code>X</code> in hexadecimal
|
|
character entities is now parsed, in addition to lower-case <code>x</code>.
|
|
Added <code>htmlprag:always-empty-html-elements</code>. Added additional
|
|
portability bindings. Added more test cases.
|
|
|
|
<br><dt>Version 0.8 — 2004-04-27<dd>Entity references (symbolic, decimal numeric, hexadecimal numeric) are now
|
|
parsed into <code>*ENTITY*</code> SXML. SXML symbols like <code>*TOP*</code> are now
|
|
always upper-case, regardless of the Scheme implementation. Identifiers
|
|
such as <code>htmlprag:sxml-top-symbol</code> are bound to the upper-case
|
|
symbols. Procedures <code>htmlprag:html->sxml-0nf</code>,
|
|
<code>htmlprag:html->sxml-1nf</code>, and <code>htmlprag:html->sxml-2nf</code> have
|
|
been added. <code>htmlprag:html->sxml</code> now an alias for
|
|
<code>htmlprag:html->sxml-0nf</code>. <code>htmlprag:parse</code> has been refashioned
|
|
as <code>htmlprag:parse-html</code> and should no longer be directly. A number
|
|
of identifiers have been renamed to be more appropriate when the
|
|
<code>htmlprag:</code> prefix is dropped in some implementation-specific
|
|
packagings of HtmlPrag: <code>htmlprag:make-tokenizer</code> to
|
|
<code>htmlprag:make-html-tokenizer</code>, <code>htmlprag:parse/tokenizer</code> to
|
|
<code>htmlprag:parse-html/tokenizer</code>, <code>htmlprag:html->token-list</code> to
|
|
<code>htmlprag:tokenize-html</code>, <code>htmlprag:token-kind</code> to
|
|
<code>htmlprag:sxml-token-kind</code>, and <code>htmlprag:test</code> to
|
|
<code>htmlprag:test-htmlprag</code>. Verbatim elements with empty-element tag
|
|
syntax are handled correctly. New versions of Bigloo and RScheme tested.
|
|
|
|
<br><dt>Version 0.7 — 2004-03-10<dd>Verbatim pair elements like <code>script</code> and <code>xmp</code> are now parsed
|
|
correctly. Two Scheme implementations have temporarily been dropped from
|
|
regression testing: Kawa, due to a Java bytecode verifier error likely due
|
|
to a Java installation problem on the test machine; and SXM 1.1, due to
|
|
hitting a limit on the number of literals late in the test suite code.
|
|
Tested newer versions of Bigloo, Chicken, Gauche, Guile, MIT Scheme, PLT
|
|
MzScheme, RScheme, SISC, and STklos. RScheme no longer requires the
|
|
“<code>(define get-output-string close-output-port)</code>” workaround.
|
|
|
|
<br><dt>Version 0.6 — 2003-07-03<dd>Fixed uses of <code>eq?</code> in character comparisons, thanks to Scott G.
|
|
Miller. Added <code>htmlprag:html->normalized-sxml</code> and
|
|
<code>htmlprag:html->nonnormalized-sxml</code>. Started to add
|
|
<code>close-output-port</code> to uses of output strings, then reverted due to
|
|
bug in one of the supported dialects. Tested newer versions of Bigloo,
|
|
Gauche, PLT MzScheme, RScheme.
|
|
|
|
<br><dt>Version 0.5 — 2003-02-26<dd>Removed uses of <code>call-with-values</code>. Re-ordered top-level definitions,
|
|
for portability. Now tests under Kawa 1.6.99, RScheme 0.7.3.2, Scheme 48
|
|
0.57, SISC 1.7.4, STklos 0.54, and SXM 1.1.
|
|
|
|
<br><dt>Version 0.4 — 2003-02-19<dd>Apostrophe-quoted element attribute values are now handled. A bug that
|
|
incorrectly assumed left-to-right term evaluation order has been fixed
|
|
(thanks to MIT Scheme for confronting us with this). Now also tests OK
|
|
under Gauche 0.6.6 and MIT Scheme 7.7.1. Portability improvement for
|
|
implementations (e.g., RScheme 0.7.3.2.b6, Stalin 0.9) that cannot read
|
|
<code>@</code> as a symbol (although those implementations tend to present other
|
|
portability issues, as yet unresolved).
|
|
|
|
<br><dt>Version 0.3 — 2003-02-05<dd>A test suite with 66 cases has been added, and necessary changes have been
|
|
made for the suite to pass on five popular Scheme implementations. XML
|
|
processing instructions are now parsed. Parent constraints have been added
|
|
for <code>colgroup</code>, <code>tbody</code>, and <code>thead</code> elements. Erroneous
|
|
input, including invalid hexadecimal entity reference syntax and extraneous
|
|
double quotes in element tags, is now parsed better.
|
|
<code>htmlprag:token-kind</code> emits symbols more consistent with SXML.
|
|
|
|
<br><dt>Version 0.2 — 2003-02-02<dd>Portability improvements.
|
|
|
|
<br><dt>Version 0.1 — 2003-01-31<dd>Dusted off old Guile-specific code from April 2001, converted to emit SXML,
|
|
mostly ported to R5RS and SRFI-6, added some XHTML support and
|
|
documentation. A little preliminary testing has been done, and the package
|
|
is already useful for some applications, but this release should be
|
|
considered a preview to invite comments.
|
|
|
|
</dl>
|
|
|
|
<div class="footnote">
|
|
<hr>
|
|
<a name="texinfo-footnotes-in-document"></a><h4>Footnotes</h4><p class="footnote"><small>[<a name="fn-1" href="#fnd-1">1</a>]</small> Scheme
|
|
implementators who have not yet made <code>read</code> case-sensitive by default
|
|
are encouraged to do so.</p>
|
|
|
|
<p><hr></div>
|
|
|
|
</body></html>
|
|
|