<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
|
<HTML><HEAD><TITLE>Man page of Parser</TITLE>
|
|
</HEAD><BODY>
|
|
<H1>Parser</H1>
|
|
Section: User Contributed Perl Documentation (3pm)<BR>Updated: 2020-02-18<BR><A HREF="#index">Index</A>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<A NAME="lbAB"> </A>
|
|
<H2>NAME</H2>
|
|
|
|
HTML::Parser - HTML parser class
|
|
<A NAME="lbAC"> </A>
|
|
<H2>SYNOPSIS</H2>
|
|
|
|
|
|
|
|
|
|
|
|
<PRE>
 use HTML::Parser ();

 # Create parser object
 $p = HTML::Parser->new( api_version => 3,
                         start_h => [\&start, "tagname, attr"],
                         end_h   => [\&end,   "tagname"],
                         marked_sections => 1,
                       );

 # Parse document text chunk by chunk
 $p->parse($chunk1);
 $p->parse($chunk2);
 #...
 $p->eof;                 # signal end of document

 # Parse directly from file
 $p->parse_file("foo.html");
 # or
 open(my $fh, "<:utf8", "foo.html") || die;
 $p->parse_file($fh);
</PRE>
|
|
|
|
|
|
<A NAME="lbAD"> </A>
|
|
<H2>DESCRIPTION</H2>
|
|
|
|
|
|
|
|
Objects of the <TT>"HTML::Parser"</TT> class will recognize markup and
|
|
separate it from plain text (alias data content) in <FONT SIZE="-1">HTML</FONT>
|
|
documents. As different kinds of markup and text are recognized, the
|
|
corresponding event handlers are invoked.
|
|
<P>
|
|
|
|
<TT>"HTML::Parser"</TT> is not a generic <FONT SIZE="-1">SGML</FONT> parser. We have tried to
|
|
make it able to deal with the <FONT SIZE="-1">HTML</FONT> that is actually ``out there'', and
|
|
it normally parses as closely as possible to the way the popular web
|
|
browsers do it instead of strictly following one of the many <FONT SIZE="-1">HTML</FONT>
|
|
specifications from W3C. Where there is disagreement, there is often
|
|
an option that you can enable to get the official behaviour.
|
|
<P>
|
|
|
|
The document to be parsed may be supplied in arbitrary chunks. This
|
|
makes on-the-fly parsing as documents are received from the network
|
|
possible.
|
|
<P>
|
|
|
|
If event driven parsing does not feel right for your application, you
|
|
might want to use <TT>"HTML::PullParser"</TT>. This is an <TT>"HTML::Parser"</TT>
|
|
subclass that allows a more conventional program structure.
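<P>
As a rough illustration of that more conventional structure, the sketch below pulls tokens in an ordinary loop instead of registering callbacks. The sample markup, the <TT>@seen</TT> array and the single-letter type tags are made up for this example; the argspec strings follow the conventions described under Argspec.

```perl
use strict;
use warnings;
use HTML::PullParser ();

# Hypothetical sample document.
my $html = '<p>Hello <a href="https://example.org/">world</a></p>';

# Each requested event becomes a token: an array reference whose
# elements follow the argspec given for that event.
my $p = HTML::PullParser->new(
    doc   => \$html,
    start => '"S", tagname, attr',
    end   => '"E", tagname',
    text  => '"T", dtext',
) or die "Can't create pull parser";

my @seen;
while (defined(my $token = $p->get_token)) {
    my ($type, $value) = @$token;   # e.g. ("S", "a") or ("T", "world")
    push @seen, "$type:$value";
}

print "$_\n" for @seen;
```

The loop owns the control flow, so state such as "am I inside a link?" can live in plain lexical variables rather than being shared between handlers.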
|
|
<A NAME="lbAE"> </A>
|
|
<H2>METHODS</H2>
|
|
|
|
|
|
|
|
The following method is used to construct a new <TT>"HTML::Parser"</TT> object:
|
|
<DL COMPACT>
|
|
<DT id="1">$p = HTML::Parser->new( %options_and_handlers )<DD>
|
|
|
|
|
|
|
|
|
|
This class method creates a new <TT>"HTML::Parser"</TT> object and
|
|
returns it. Key/value argument pairs may be provided to assign event
|
|
handlers or initialize parser options. The handlers and parser
|
|
options can also be set or modified later by the method calls described below.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
If a top level key is in the form ``<event>_h'' (e.g., ``text_h'') then it
|
|
assigns a handler to that event, otherwise it initializes a parser
|
|
option. The event handler specification value must be an array
|
|
reference. Multiple handlers may also be assigned with the 'handlers
|
|
=> [%handlers]' option. See examples below.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
If <B>new()</B> is called without any arguments, it will create a parser that
|
|
uses callback methods compatible with version 2 of <TT>"HTML::Parser"</TT>.
|
|
See the section on ``version 2 compatibility'' below for details.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The special constructor option 'api_version => 2' can be used to
|
|
initialize version 2 callbacks while still setting other options and
|
|
handlers. The 'api_version => 3' option can be used if you don't want
|
|
to set any options and don't want to fall back to v2 compatible
|
|
mode.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Examples:
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
 $p = HTML::Parser->new(api_version => 3,
                        text_h => [ sub {...}, "dtext" ]);
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This creates a new parser object with a text event handler subroutine
|
|
that receives the original text with general entities decoded.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
 $p = HTML::Parser->new(api_version => 3,
                        start_h => [ 'my_start', "self,tokens" ]);
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This creates a new parser object with a start event handler method
|
|
that receives the <TT>$p</TT> and the tokens array.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
 $p = HTML::Parser->new(api_version => 3,
                        handlers => { text => [\@array, "event,text"],
                                      comment => [\@array, "event,text"],
                                    });
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This creates a new parser object that stores the event type and the
|
|
original text in <TT>@array</TT> for text and comment events.
|
|
</DL>
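<P>
The constructor forms above can be combined into a small runnable program. This is a minimal sketch, not a canonical recipe: the <TT>@links</TT> array and the sample markup are hypothetical, and only the <TT>"tagname"</TT> and <TT>"attr"</TT> argspecs are used.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @links;   # hypothetical accumulator, filled by the start handler

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tagname, $attr) = @_;
            # Tag and attribute names arrive lower-cased by default.
            push @links, $attr->{href}
                if $tagname eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);

$p->parse('<p><A HREF="/one">1</A> <a href="/two">2</a></p>');
$p->eof;   # flush anything still buffered

print "$_\n" for @links;
```

The handler closes over <TT>@links</TT> only, not over <TT>$p</TT>, so no circular reference to the parser object is created.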
|
|
<P>
|
|
|
|
The following methods feed the <FONT SIZE="-1">HTML</FONT> document
|
|
to the <TT>"HTML::Parser"</TT> object:
|
|
<DL COMPACT>
|
|
<DT id="2">$p->parse( $string )<DD>
|
|
|
|
|
|
|
|
|
|
Parse <TT>$string</TT> as the next chunk of the <FONT SIZE="-1">HTML</FONT> document. Handlers invoked should
|
|
not attempt to modify the <TT>$string</TT> in-place until <TT>$p</TT>->parse returns.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
If an invoked event handler aborts parsing by calling <TT>$p</TT>->eof, then <TT>$p</TT>-><B>parse()</B>
|
|
will return a <FONT SIZE="-1">FALSE</FONT> value. Otherwise the return value is a reference to the
|
|
parser object ($p).
|
|
<DT id="3">$p->parse( $code_ref )<DD>
|
|
|
|
|
|
|
|
|
|
If a code reference is passed as the argument to be parsed, then the
|
|
chunks to be parsed are obtained by invoking this function repeatedly.
|
|
Parsing continues until the function returns an empty (or undefined)
|
|
result. When this happens <TT>$p</TT>->eof is automatically signaled.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Parsing will also abort if one of the event handlers calls <TT>$p</TT>->eof.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The effect of this is the same as:
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
    while (1) {
       my $chunk = &$code_ref();
       if (!defined($chunk) || !length($chunk)) {
           $p->eof;
           return $p;
       }
       $p->parse($chunk) || return undef;
    }
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
But it is more efficient as this loop runs internally in <FONT SIZE="-1">XS</FONT> code.
|
|
<DT id="4">$p->parse_file( $file )<DD>
|
|
|
|
|
|
|
|
|
|
Parse text directly from a file. The <TT>$file</TT> argument can be a
|
|
filename, an open file handle, or a reference to an open file
|
|
handle.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
If <TT>$file</TT> contains a filename and the file can't be opened, then the
|
|
method returns an undefined value and $! tells why it failed.
|
|
Otherwise the return value is a reference to the parser object.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
If a file handle is passed as the <TT>$file</TT> argument, then the file will
|
|
normally be read until <FONT SIZE="-1">EOF,</FONT> but not closed.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
If an invoked event handler aborts parsing by calling <TT>$p</TT>->eof,
|
|
then <TT>$p</TT>-><B>parse_file()</B> may not have read the entire file.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
On systems with multi-byte line terminators, the values passed for the
|
|
offset and length argspecs may be too low if <B>parse_file()</B> is called on
|
|
a file handle that is not in binary mode.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
If a filename is passed in, then <B>parse_file()</B> will open the file in
|
|
binary mode.
|
|
<DT id="5">$p->eof<DD>
|
|
|
|
|
|
|
|
|
|
Signals the end of the <FONT SIZE="-1">HTML</FONT> document. Calling the <TT>$p</TT>->eof method
|
|
outside a handler callback will flush any remaining buffered text
|
|
(which triggers the <TT>"text"</TT> event if there is any remaining text).
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Calling <TT>$p</TT>->eof inside a handler will terminate parsing at that point
|
|
and cause <TT>$p</TT>->parse to return a <FONT SIZE="-1">FALSE</FONT> value. This also terminates
|
|
parsing by <TT>$p</TT>-><B>parse_file()</B>.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
After <TT>$p</TT>->eof has been called, the <B>parse()</B> and <B>parse_file()</B> methods
|
|
can be invoked to feed new documents to the parser object.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The return value from <B>eof()</B> is a reference to the parser object.
|
|
</DL>
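<P>
The following sketch ties the feeding methods together: chunks are supplied by a code reference, and a handler aborts parsing with <TT>$p</TT>->eof once the document head has been seen. The chunk contents and the <TT>$title</TT> and <TT>$in_title</TT> variables are invented for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical input, delivered in three arbitrary chunks.
my @chunks = (
    '<html><head><title>De',
    'mo</title></head>',
    '<body><p>never parsed</p></body></html>',
);

my $title    = '';
my $in_title = 0;

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { $in_title = 1 if $_[0] eq 'title' }, 'tagname' ],
    text_h  => [ sub { $title .= $_[0] if $in_title },      'text'    ],
    end_h   => [
        sub {
            my ($self, $tagname) = @_;
            $in_title = 0 if $tagname eq 'title';
            $self->eof    if $tagname eq 'head';   # abort parsing here
        },
        'self, tagname',
    ],
);

# The code reference is invoked repeatedly for the next chunk; an
# empty or undefined result would signal the end of the document.
my $ret = $p->parse(sub { shift @chunks });

print "title: $title\n";
print "parsing was aborted by a handler\n" unless $ret;
```

Because the text handler appends rather than assigns, it does not matter whether the title text arrives as one event or is split at the chunk boundary.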
|
|
<P>
|
|
|
|
Most parser options are controlled by boolean attributes.
|
|
Each boolean attribute is enabled by calling the corresponding method
|
|
with a <FONT SIZE="-1">TRUE</FONT> argument and disabled with a <FONT SIZE="-1">FALSE</FONT> argument. The
|
|
attribute value is left unchanged if no argument is given. The return
|
|
value from each method is the old attribute value.
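<P>
A short sketch of this get/set convention, using <TT>"unbroken_text"</TT> as an arbitrary example attribute (the variable names are hypothetical):

```perl
use strict;
use warnings;
use HTML::Parser ();

my $p = HTML::Parser->new(api_version => 3);

my $old = $p->unbroken_text(1);   # enable; returns the previous (false) value
my $now = $p->unbroken_text;      # no argument: query without changing
my $was = $p->unbroken_text(0);   # disable; returns the previous (true) value

printf "old=%d now=%d was=%d\n",
    $old ? 1 : 0, $now ? 1 : 0, $was ? 1 : 0;
```

Returning the old value makes it easy to change an option temporarily and restore it afterwards.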
|
|
<P>
|
|
|
|
Methods that can be used to get and/or set parser options are:
|
|
<DL COMPACT>
|
|
<DT id="6">$p->attr_encoded<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="7">$p->attr_encoded( $bool )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
By default, the <TT>"attr"</TT> and <TT>@attr</TT> argspecs will have general
|
|
entities for attribute values decoded. Enabling this attribute leaves
|
|
entities alone.
|
|
<DT id="8">$p->backquote<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="9">$p->backquote( $bool )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
By default, only ' and " are recognized as quote characters around
|
|
attribute values. <FONT SIZE="-1">MSIE</FONT> also recognizes backquotes for some reason.
|
|
Enabling this attribute provides compatibility with this behaviour.
|
|
<DT id="10">$p->boolean_attribute_value( $val )<DD>
|
|
|
|
|
|
|
|
|
|
This method sets the value reported for boolean attributes inside <FONT SIZE="-1">HTML</FONT>
|
|
start tags. By default, the name of the attribute is also used as its
|
|
value. This affects the values reported for <TT>"tokens"</TT> and <TT>"attr"</TT>
|
|
argspecs.
|
|
<DT id="11">$p->case_sensitive<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="12">$p->case_sensitive( $bool )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
By default, tagnames and attribute names are down-cased. Enabling this
|
|
attribute leaves them as found in the <FONT SIZE="-1">HTML</FONT> source document.
|
|
<DT id="13">$p->closing_plaintext<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="14">$p->closing_plaintext( $bool )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
By default, the ``plaintext'' element can never be closed. Everything up to
|
|
the end of the document is parsed in <FONT SIZE="-1">CDATA</FONT> mode. This historical
|
|
behaviour is what at least <FONT SIZE="-1">MSIE</FONT> does. Enabling this attribute makes
|
|
the closing ``</plaintext>'' tag effective and the parsing process will resume
|
|
after seeing this tag. This emulates early gecko-based browsers.
|
|
<DT id="15">$p->empty_element_tags<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="16">$p->empty_element_tags( $bool )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
By default, empty element tags are not recognized as such and the ``/''
|
|
before ``>'' is just treated like a normal name character (unless
|
|
<TT>"strict_names"</TT> is enabled). Enabling this attribute makes
|
|
<TT>"HTML::Parser"</TT> recognize these tags.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Empty element tags look like start tags, but end with the character
|
|
sequence ``/>'' instead of ``>''. When recognized by <TT>"HTML::Parser"</TT> they
|
|
cause an artificial end event in addition to the start event. The
|
|
<TT>"text"</TT> for the artificial end event will be empty and the <TT>"tokenpos"</TT>
|
|
array will be undefined even though the token array will have one
|
|
element containing the tag name.
|
|
<DT id="17">$p->marked_sections<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="18">$p->marked_sections( $bool )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
By default, section markings like <![CDATA[...]]> are treated like
|
|
ordinary text. When this attribute is enabled section markings are
|
|
honoured.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
There are currently no events associated with the marked section
|
|
markup, but the text can be returned as <TT>"skipped_text"</TT>.
|
|
<DT id="19">$p->strict_comment<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="20">$p->strict_comment( $bool )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
By default, comments are terminated by the first occurrence of ``-->''.
|
|
This is the behaviour of most popular browsers (like Mozilla, Opera and
|
|
<FONT SIZE="-1">MSIE</FONT>), but it is not correct according to the official <FONT SIZE="-1">HTML</FONT>
|
|
standard. Officially, you need an even number of ``--'' tokens before
|
|
the closing ``>'' is recognized and there may not be anything but
|
|
whitespace between an even and an odd ``--''.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The official behaviour is enabled by enabling this attribute.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Enabling of 'strict_comment' also disables recognizing these forms as
|
|
comments:
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
  </ comment>
  <! comment>
</PRE>
|
|
|
|
|
|
<DT id="21">$p->strict_end<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="22">$p->strict_end( $bool )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
By default, attributes and other junk are allowed to be present on end tags in a
|
|
manner that emulates <FONT SIZE="-1">MSIE</FONT>'s behaviour.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The official behaviour is enabled with this attribute. If enabled,
|
|
only whitespace is allowed between the tagname and the final ``>''.
|
|
<DT id="23">$p->strict_names<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="24">$p->strict_names( $bool )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
By default, almost anything is allowed in tag and attribute names.
|
|
This is the behaviour of most popular browsers and allows us to parse
|
|
some broken tags with invalid attribute values like:
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
<IMG SRC=newprevlstGr.gif ALT=[PREV LIST] BORDER=0>
|
|
|
|
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
By default, ``<FONT SIZE="-1">LIST</FONT>]'' is parsed as a boolean attribute, not as
|
|
part of the <FONT SIZE="-1">ALT</FONT> value as was clearly intended. This is also what
|
|
Mozilla sees.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The official behaviour is enabled by enabling this attribute. If
|
|
enabled, it will cause the tag above to be reported as text
|
|
since ``<FONT SIZE="-1">LIST</FONT>]'' is not a legal attribute name.
|
|
<DT id="25">$p->unbroken_text<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="26">$p->unbroken_text( $bool )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
By default, blocks of text are given to the text handler as soon as
|
|
possible (but the parser takes care always to break text at a
|
|
boundary between whitespace and non-whitespace so single words and
|
|
entities can always be decoded safely). This might create breaks that
|
|
make it hard to do transformations on the text. When this attribute is
|
|
enabled, blocks of text are always reported in one piece. This will
|
|
delay the text event until the following (non-text) event has been
|
|
recognized by the parser.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Note that the <TT>"offset"</TT> argspec will give you the offset of the first
|
|
segment of text and <TT>"length"</TT> is the combined length of the segments.
|
|
Since there might be ignored tags in between, these numbers can't be
|
|
used to directly index in the original document file.
|
|
<DT id="27">$p->utf8_mode<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="28">$p->utf8_mode( $bool )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
Enable this option when parsing raw undecoded <FONT SIZE="-1">UTF-8.</FONT> This tells the
|
|
parser that the entities expanded for strings reported by <TT>"attr"</TT>,
|
|
<TT>@attr</TT> and <TT>"dtext"</TT> should be expanded as decoded <FONT SIZE="-1">UTF-8</FONT> so they end
|
|
up compatible with the surrounding text.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
If <TT>"utf8_mode"</TT> is enabled then it is an error to pass strings
|
|
containing characters with code above 255 to the <B>parse()</B> method, and
|
|
the <B>parse()</B> method will croak if you try.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Example: The Unicode character ``\x{2665}'' is ``\xE2\x99\xA5'' when <FONT SIZE="-1">UTF-8</FONT>
|
|
encoded. The character can also be represented by the entity
|
|
``&hearts;'' or ``&#x2665;''. If we feed the parser:
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
$p->parse("\xE2\x99\xA5&hearts;");
|
|
|
|
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
then <TT>"dtext"</TT> will be reported as ``\xE2\x99\xA5\x{2665}'' without
|
|
<TT>"utf8_mode"</TT> enabled, but as ``\xE2\x99\xA5\xE2\x99\xA5'' when enabled.
|
|
The latter string is what you want.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This option is only available with perl-5.8 or better.
|
|
<DT id="29">$p->xml_mode<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="30">$p->xml_mode( $bool )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
Enabling this attribute changes the parser to allow some <FONT SIZE="-1">XML</FONT>
|
|
constructs. This enables the behaviour controlled individually by
|
|
the <TT>"case_sensitive"</TT>, <TT>"empty_element_tags"</TT>, <TT>"strict_names"</TT> and
|
|
<TT>"xml_pic"</TT> attributes and also suppresses special treatment of
|
|
elements that are parsed as <FONT SIZE="-1">CDATA</FONT> for <FONT SIZE="-1">HTML.</FONT>
|
|
<DT id="31">$p->xml_pic<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="32">$p->xml_pic( $bool )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
By default, <I>processing instructions</I> are terminated by ``>''. When
|
|
this attribute is enabled, processing instructions are terminated by
|
|
``?>'' instead.
|
|
</DL>
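<P>
To make the effect of one of these options concrete, the sketch below enables <TT>"empty_element_tags"</TT> at construction time and records the events produced for a self-closing tag. The <TT>@events</TT> array and the sample markup are assumptions made for the example.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @events;   # hypothetical log of "event:tagname" strings

my $p = HTML::Parser->new(
    api_version        => 3,
    empty_element_tags => 1,   # parser option, set like a handler key
    handlers           => {
        start => [ sub { push @events, "start:$_[0]" }, 'tagname' ],
        end   => [ sub { push @events, "end:$_[0]"   }, 'tagname' ],
    },
);

$p->parse('<p>line<br/>break</p>');
$p->eof;

# With the option enabled, <br/> yields a start event followed by
# an artificial end event.
print "$_\n" for @events;
```

With the option left at its default, the ``/'' would instead be treated as part of the tag name and no end event would be generated for it.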
|
|
<P>
|
|
|
|
As markup and text are recognized, handlers are invoked. The following
|
|
method is used to set up handlers for different events:
|
|
<DL COMPACT>
|
|
<DT id="33">$p->handler( event => \&subroutine, $argspec )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="34">$p->handler( event => $method_name, $argspec )<DD>
|
|
|
|
|
|
|
|
|
|
<DT id="35">$p->handler( event => \@accum, $argspec )<DD>
|
|
|
|
|
|
|
|
|
|
<DT id="36">$p->handler( event => "" );<DD>
|
|
|
|
|
|
|
|
|
|
<DT id="37">$p->handler( event => undef );<DD>
|
|
|
|
|
|
|
|
|
|
<DT id="38">$p->handler( event );<DD>
|
|
|
|
|
|
|
|
|
|
|
|
This method assigns a subroutine, method, or array to handle an event.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Event is one of <TT>"text"</TT>, <TT>"start"</TT>, <TT>"end"</TT>, <TT>"declaration"</TT>, <TT>"comment"</TT>,
|
|
<TT>"process"</TT>, <TT>"start_document"</TT>, <TT>"end_document"</TT> or <TT>"default"</TT>.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The <TT>"\&subroutine"</TT> is a reference to a subroutine which is called to handle
|
|
the event.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The <TT>$method_name</TT> is the name of a method of <TT>$p</TT> which is called to handle
|
|
the event.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The <TT>@accum</TT> is an array that will hold the event information as
|
|
sub-arrays.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
If the second argument is "", the event is ignored.
|
|
If it is undef, the default handler is invoked for the event.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The <TT>$argspec</TT> is a string that describes the information to be reported
|
|
for the event. Any requested information that does not apply to a
|
|
specific event is passed as <TT>"undef"</TT>. If argspec is omitted, then it
|
|
is left unchanged.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The return value from <TT>$p</TT>->handler is the old callback routine or a
|
|
reference to the accumulator array.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Any return values from handler callback routines/methods are always
|
|
ignored. A handler callback can request parsing to be aborted by
|
|
invoking the <TT>$p</TT>->eof method. A handler callback is not allowed to
|
|
invoke the <TT>$p</TT>-><B>parse()</B> or <TT>$p</TT>-><B>parse_file()</B> method. An exception will
|
|
be raised if it tries.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Examples:
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
$p->handler(start => "start", 'self, attr, attrseq, text' );
|
|
|
|
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This causes the ``start'' method of object <TT>$p</TT> to be called for 'start' events.
|
|
The callback signature is <TT>$p</TT>->start(\%attr, \@attr_seq, <TT>$text</TT>).
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
$p->handler(start => \&start, 'attr, attrseq, text' );
|
|
|
|
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This causes subroutine <B>start()</B> to be called for 'start' events.
|
|
The callback signature is start(\%attr, \@attr_seq, <TT>$text</TT>).
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
$p->handler(start => \@accum, '"S", attr, attrseq, text' );
|
|
|
|
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This causes 'start' event information to be saved in <TT>@accum</TT>.
|
|
The array elements will be ['S', \%attr, \@attr_seq, <TT>$text</TT>].
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
$p->handler(start => "");
|
|
|
|
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This causes 'start' events to be ignored. It also suppresses
|
|
invocations of any default handler for start events. It is in most
|
|
cases equivalent to <TT>$p</TT>->handler(start => sub {}), but is more
|
|
efficient. It is different from the empty-sub-handler in that
|
|
<TT>"skipped_text"</TT> is not reset by it.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
$p->handler(start => undef);
|
|
|
|
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This causes no handler to be associated with start events.
|
|
If there is a default handler it will be invoked.
|
|
</DL>
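<P>
As a further illustration of the accumulator form, this sketch collects start and end events in one array and walks it after parsing. The <TT>"S"</TT> and <TT>"E"</TT> literals, the <TT>@accum</TT> name and the sample markup are hypothetical choices for the example.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @accum;
my $p = HTML::Parser->new(api_version => 3);

# Both events append sub-arrays to the same accumulator; the literal
# string in each argspec tells them apart afterwards.
$p->handler(start => \@accum, '"S", tagname, attr');
$p->handler(end   => \@accum, '"E", tagname');
$p->handler(text  => "");   # explicitly ignore text events

$p->parse('<ul><li class="x">one</li></ul>');
$p->eof;

for my $event (@accum) {
    my ($type, $tagname, $attr) = @$event;
    my $class = ($attr && defined $attr->{class}) ? " class=$attr->{class}" : "";
    print "$type $tagname$class\n";
}
```

An accumulator avoids a Perl-level callback per event, which fits the performance remark below about the cost of callbacks.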
|
|
<P>
|
|
|
|
Filters based on tags can be set up to limit the number of events
|
|
reported. The main bottleneck during parsing is often the huge number
|
|
of callbacks made from the parser. Applying filters can improve
|
|
performance significantly.
|
|
<P>
|
|
|
|
The following methods control filters:
|
|
<DL COMPACT>
|
|
<DT id="39">$p->ignore_elements( @tags )<DD>
|
|
|
|
|
|
|
|
|
|
Both the <TT>"start"</TT> event and the <TT>"end"</TT> event as well as any events that
|
|
would be reported in between are suppressed. The ignored elements can
|
|
contain nested occurrences of themselves. Example:
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
$p->ignore_elements(qw(script style));
|
|
|
|
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The <TT>"script"</TT> and <TT>"style"</TT> tags will always nest properly since their
|
|
content is parsed in <FONT SIZE="-1">CDATA</FONT> mode. For most other tags
|
|
<TT>"ignore_elements"</TT> must be used with caution since <FONT SIZE="-1">HTML</FONT> is often not
|
|
<I>well formed</I>.
|
|
<DT id="40">$p->ignore_tags( @tags )<DD>
|
|
|
|
|
|
|
|
|
|
Any <TT>"start"</TT> and <TT>"end"</TT> events involving any of the tags given are
|
|
suppressed. To reset the filter (i.e. don't suppress any <TT>"start"</TT> and
|
|
<TT>"end"</TT> events), call <TT>"ignore_tags"</TT> without an argument.
|
|
<DT id="41">$p->report_tags( @tags )<DD>
|
|
|
|
|
|
|
|
|
|
Any <TT>"start"</TT> and <TT>"end"</TT> events involving any of the tags <I>not</I> given
|
|
are suppressed. To reset the filter (i.e. report all <TT>"start"</TT> and
|
|
<TT>"end"</TT> events), call <TT>"report_tags"</TT> without an argument.
|
|
</DL>
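<P>
A minimal sketch of <TT>"ignore_elements"</TT> in use, assuming a made-up document and a hypothetical <TT>$text</TT> buffer: the text handler never sees the contents of the ignored elements.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $text = '';   # hypothetical buffer for the visible text

my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text .= $_[0] }, 'dtext' ],
);

# Suppress script and style elements together with everything inside them.
$p->ignore_elements(qw(script style));

$p->parse('<p>Hello</p><script>alert("x");</script>'
        . '<style>p { color: red }</style><p>World</p>');
$p->eof;

print "$text\n";
```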
|
|
<P>
|
|
|
|
Internally, the system has two filter lists, one for <TT>"report_tags"</TT>
|
|
and one for <TT>"ignore_tags"</TT>, and both filters are applied. This
|
|
effectively gives <TT>"ignore_tags"</TT> precedence over <TT>"report_tags"</TT>.
|
|
<P>
|
|
|
|
Examples:
|
|
<P>
|
|
|
|
|
|
|
|
<PRE>
  $p->ignore_tags(qw(style));
  $p->report_tags(qw(script style));
</PRE>
|
|
|
|
|
|
<P>
|
|
|
|
results in only <TT>"script"</TT> events being reported.
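<P>
The interaction of the two filter lists can be checked with a short program. This is only a sketch; the sample document and the <TT>@reported</TT> array are assumptions.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @reported;   # tag names of the start events that get through

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { push @reported, $_[0] }, 'tagname' ],
);

$p->ignore_tags(qw(style));          # drop style events ...
$p->report_tags(qw(script style));   # ... even though style is listed here

$p->parse('<html><head><style>p {}</style><script>var x;</script></head>'
        . '<body><p>hi</p></body></html>');
$p->eof;

print "@reported\n";
```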
|
|
<A NAME="lbAF"> </A>
|
|
<H3>Argspec</H3>
|
|
|
|
|
|
|
|
Argspec is a string containing a comma-separated list that describes
|
|
the information reported by the event. The following argspec
|
|
identifier names can be used:
|
|
<DL COMPACT>
|
|
<DT id="42">"attr"<DD>
|
|
|
|
|
|
|
|
|
|
Attr causes a reference to a hash of attribute name/value pairs to be
|
|
passed.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Boolean attributes' values are either the value set by
|
|
<TT>$p</TT>->boolean_attribute_value, or the attribute name if no value has been
|
|
set by <TT>$p</TT>->boolean_attribute_value.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This passes undef except for <TT>"start"</TT> events.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Unless <TT>"xml_mode"</TT> or <TT>"case_sensitive"</TT> is enabled, the attribute
|
|
names are forced to lower case.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
General entities are decoded in the attribute values and
|
|
one layer of matching quotes enclosing the attribute values is removed.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The Unicode character set is assumed for entity decoding.
|
|
<DT id="43">@attr<DD>
|
|
|
|
|
|
|
|
|
|
Basically the same as <TT>"attr"</TT>, but keys and values are passed as
|
|
individual arguments and the original sequence of the attributes is
|
|
kept. The parameters passed will be the same as the <TT>@attr</TT> calculated
|
|
here:
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
@attr = map { $_ => $attr->{$_} } @$attrseq;
|
|
|
|
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
assuming <TT>$attr</TT> and <TT>$attrseq</TT> here are the hash and array passed as the
|
|
result of <TT>"attr"</TT> and <TT>"attrseq"</TT> argspecs.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This passes no values for events besides <TT>"start"</TT>.
|
|
<DT id="44">"attrseq"<DD>
|
|
|
|
|
|
|
|
|
|
Attrseq causes a reference to an array of attribute names to be
|
|
passed. This can be useful if you want to walk the <TT>"attr"</TT> hash in
|
|
the original sequence.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This passes undef except for <TT>"start"</TT> events.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Unless <TT>"xml_mode"</TT> or <TT>"case_sensitive"</TT> is enabled, the attribute
|
|
names are forced to lower case.
|
|
<DT id="45">"column"<DD>
|
|
|
|
|
|
|
|
|
|
Column causes the column number of the start of the event to be passed.
|
|
The first column on a line is 0.
|
|
<DT id="46">"dtext"<DD>
|
|
|
|
|
|
|
|
|
|
Dtext causes the decoded text to be passed. General entities are
|
|
automatically decoded unless the event was inside a <FONT SIZE="-1">CDATA</FONT> section or
|
|
was between literal start and end tags (<TT>"script"</TT>, <TT>"style"</TT>,
|
|
<TT>"xmp"</TT>, <TT>"iframe"</TT>, <TT>"title"</TT>, <TT>"textarea"</TT> and <TT>"plaintext"</TT>).
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The Unicode character set is assumed for entity decoding. With Perl
|
|
version 5.6 or earlier only the Latin-1 range is supported, and
|
|
entities for characters outside the range 0..255 are left unchanged.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This passes undef except for <TT>"text"</TT> events.
|
|
<DT id="47">"event"<DD>
|
|
|
|
|
|
|
|
|
|
Event causes the event name to be passed.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The event name is one of <TT>"text"</TT>, <TT>"start"</TT>, <TT>"end"</TT>, <TT>"declaration"</TT>,
|
|
<TT>"comment"</TT>, <TT>"process"</TT>, <TT>"start_document"</TT> or <TT>"end_document"</TT>.
|
|
<DT id="48">"is_cdata"<DD>
|
|
|
|
|
|
|
|
|
|
Is_cdata causes a <FONT SIZE="-1">TRUE</FONT> value to be passed if the event is inside a <FONT SIZE="-1">CDATA</FONT>
|
|
section or between literal start and end tags (<TT>"script"</TT>,
|
|
<TT>"style"</TT>, <TT>"xmp"</TT>, <TT>"iframe"</TT>, <TT>"title"</TT>, <TT>"textarea"</TT> and <TT>"plaintext"</TT>).
|
|
|
|
|
|
<P>
|
|
|
|
|
|
If the flag is <FONT SIZE="-1">FALSE</FONT> for a text event, then you should normally
|
|
either use <TT>"dtext"</TT> or decode the entities yourself before the text is
|
|
processed further.
|
|
<DT id="49">"length"<DD>
|
|
|
|
|
|
|
|
|
|
Length causes the number of bytes of the source text of the event to
|
|
be passed.
|
|
<DT id="50">"line"<DD>
|
|
|
|
|
|
|
|
|
|
Line causes the line number of the start of the event to be passed.
|
|
The first line in the document is 1. Line counting doesn't start
|
|
until at least one handler requests this value to be reported.
|
|
<DT id="51">"offset"<DD>
|
|
|
|
|
|
|
|
|
|
Offset causes the byte position in the <FONT SIZE="-1">HTML</FONT> document of the start of
|
|
the event to be passed. The first byte in the document has offset 0.
|
|
<DT id="52">"offset_end"<DD>
|
|
|
|
|
|
|
|
|
|
Offset_end causes the byte position in the <FONT SIZE="-1">HTML</FONT> document of the end of
|
|
the event to be passed. This is the same as <TT>"offset"</TT> + <TT>"length"</TT>.
|
|
<DT id="53">"self"<DD>
|
|
|
|
|
|
|
|
|
|
Self causes the current object to be passed to the handler. If the
|
|
handler is a method, this must be the first element in the argspec.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
An alternative to passing self as an argspec is to register closures
|
|
that capture <TT>$self</TT> by themselves as handlers. Unfortunately this
|
|
creates circular references which prevent the HTML::Parser object
|
|
from being garbage collected. Using the <TT>"self"</TT> argspec avoids this
|
|
problem.
|
|
<DT id="54">"skipped_text"<DD>
|
|
|
|
|
|
|
|
|
|
Skipped_text returns the concatenated text of all the events that have
|
|
been skipped since the last time an event was reported. Events might
|
|
be skipped because no handler is registered for them or because some
|
|
filter applies. Skipped text also includes marked section markup,
|
|
since there are no events that can catch it.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
If an <TT>""</TT>-handler is registered for an event, then the text for this
|
|
event is not included in <TT>"skipped_text"</TT>. Skipped text both before
|
|
and after the <TT>""</TT>-event is included in the next reported
|
|
<TT>"skipped_text"</TT>.
|
|
<DT id="55">"tag"<DD>
|
|
|
|
|
|
|
|
|
|
Same as <TT>"tagname"</TT>, but prefixed with ``/'' if it belongs to an <TT>"end"</TT>
|
|
event and ``!'' for a declaration. The <TT>"tag"</TT> does not have any prefix
|
|
for <TT>"start"</TT> events, and is in this case identical to <TT>"tagname"</TT>.
|
|
<DT id="56">"tagname"<DD>
|
|
|
|
|
|
|
|
|
|
This is the element name (or <I>generic identifier</I> in <FONT SIZE="-1">SGML</FONT> jargon) for
|
|
start and end tags. Since <FONT SIZE="-1">HTML</FONT> is case insensitive, this name is
|
|
forced to lower case to ease string matching.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Since <FONT SIZE="-1">XML</FONT> is case sensitive, the tagname case is not changed when
|
|
<TT>"xml_mode"</TT> is enabled. The same happens if the <TT>"case_sensitive"</TT> attribute
|
|
is set.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The declaration type of declaration elements is also passed as a tagname,
|
|
even if that is a bit strange.
|
|
In fact, in the current implementation tagname is
|
|
identical to <TT>"token0"</TT> except that the name may be forced to lower case.
|
|
<DT id="57">"token0"<DD>
|
|
|
|
|
|
|
|
|
|
Token0 causes the original text of the first token string to be
|
|
passed. This should always be the same as <TT>$tokens</TT>->[0].
|
|
|
|
|
|
<P>
|
|
|
|
|
|
For <TT>"declaration"</TT> events, this is the declaration type.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
For <TT>"start"</TT> and <TT>"end"</TT> events, this is the tag name.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
For <TT>"process"</TT> and non-strict <TT>"comment"</TT> events, this is everything
|
|
inside the tag.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This passes undef if there are no tokens in the event.
|
|
<DT id="58">"tokenpos"<DD>
|
|
|
|
|
|
|
|
|
|
Tokenpos causes a reference to an array of token positions to be
|
|
passed. For each string that appears in <TT>"tokens"</TT>, this array
|
|
contains two numbers. The first number is the offset of the start of
|
|
the token in the original <TT>"text"</TT> and the second number is the length
|
|
of the token.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Boolean attributes in a <TT>"start"</TT> event will have (0,0) for the
|
|
attribute value offset and length.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This passes undef if there are no tokens in the event (e.g., <TT>"text"</TT>)
|
|
and for artificial <TT>"end"</TT> events triggered by empty element tags.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
If you are using these offsets and lengths to modify <TT>"text"</TT>, you
|
|
should either work from right to left, or be very careful to calculate
|
|
the changes to the offsets.
|
|
<DT id="59">"tokens"<DD>
|
|
|
|
|
|
|
|
|
|
Tokens causes a reference to an array of token strings to be passed.
|
|
The strings are exactly as they were found in the original text;
|
|
no decoding or case changes are applied.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
For <TT>"declaration"</TT> events, the array contains each word, comment, and
|
|
delimited string starting with the declaration type.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
For <TT>"comment"</TT> events, this contains each sub-comment. If
|
|
<TT>$p</TT>->strict_comments is disabled, there will be only one sub-comment.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
For <TT>"start"</TT> events, this contains the original tag name followed by
|
|
the attribute name/value pairs. The values of boolean attributes will
|
|
be either the value set by <TT>$p</TT>->boolean_attribute_value, or the
|
|
attribute name if no value has been set by
|
|
<TT>$p</TT>->boolean_attribute_value.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
For <TT>"end"</TT> events, this contains the original tag name (always one token).
|
|
|
|
|
|
<P>
|
|
|
|
|
|
For <TT>"process"</TT> events, this contains the process instructions (always one
|
|
token).
|
|
|
|
|
|
<P>
|
|
|
|
|
|
This passes <TT>"undef"</TT> for <TT>"text"</TT> events.
|
|
<DT id="60">"text"<DD>
|
|
|
|
|
|
|
|
|
|
Text causes the source text (including markup element delimiters) to be
|
|
passed.
|
|
<DT id="61">"undef"<DD>
|
|
|
|
|
|
|
|
|
|
Pass an undefined value. Useful as padding where the same handler
|
|
routine is registered for multiple events.
|
|
<DT id="62">'...'<DD>
|
|
|
|
|
|
|
|
|
|
A literal string of 0 to 255 characters enclosed
|
|
in single (') or double (") quotes is passed as entered.
|
|
</DL>
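<P>

The relationship between <TT>"tokens"</TT> and <TT>"tokenpos"</TT> described above can
be checked with a short sketch. The helper name <TT>"token_report"</TT> is
hypothetical, and the sketch assumes start tags without boolean
attributes, since those report (0,0) as their value position:

<PRE>
  use strict;
  use warnings;
  use HTML::Parser ();

  # Hypothetical helper: for every start tag in $file, pair each token
  # string with the slice of "text" that "tokenpos" points at.
  sub token_report {
      my ($file) = @_;
      my @pairs;
      my $p = HTML::Parser->new(
          api_version => 3,
          start_h     => [
              sub {
                  my ($text, $tokens, $tokenpos) = @_;
                  for my $i (0 .. $#$tokens) {
                      # tokenpos holds one (offset, length) pair per token
                      my ($off, $len) = @{$tokenpos}[2 * $i, 2 * $i + 1];
                      push @pairs,
                          [ $tokens->[$i], substr($text, $off, $len) ];
                  }
              },
              "text, tokens, tokenpos",
          ],
      );
      $p->parse_file($file) or die "Can't parse $file: $!";
      return @pairs;
  }

  # Print the pairs for each file named on the command line.
  for my $file (@ARGV) {
      print "$_->[0]\t$_->[1]\n" for token_report($file);
  }
</PRE>

<P>

Both columns should agree for every token, because the token strings
are passed exactly as they appear in the original text.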
|
|
<P>
|
|
|
|
The whole argspec string can be wrapped up in <TT>'@{...}'</TT> to signal
|
|
that the resulting event array should be flattened. This only makes a
|
|
difference if an array reference is used as the handler target.
|
|
Consider this example:
|
|
<P>
|
|
|
|
|
|
|
|
<PRE>
|
|
$p->handler(text => [], 'text');
|
|
  $p->handler(text => [], '@{text}');
|
|
|
|
</PRE>
|
|
|
|
|
|
<P>
|
|
|
|
With two text events, <TT>"foo"</TT> and <TT>"bar"</TT>, the first example will end
|
|
up with [[``foo''], [``bar'']] and the second with [``foo'', ``bar''] in
|
|
the handler target array.
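<P>

A runnable sketch of the same comparison follows. The helper name
<TT>"collect_text"</TT> is hypothetical; it registers an array reference as the
handler target and returns that array after parsing:

<PRE>
  use strict;
  use warnings;
  use HTML::Parser ();

  # Hypothetical helper: collect the text events of $file into an
  # array, using whichever argspec the caller supplies.
  sub collect_text {
      my ($file, $argspec) = @_;
      my @target;
      my $p = HTML::Parser->new(api_version => 3);
      $p->handler(text => \@target, $argspec);
      $p->parse_file($file) or die "Can't parse $file: $!";
      return \@target;
  }

  for my $file (@ARGV) {
      # One array reference per event ...
      my $nested = collect_text($file, 'text');
      # ... versus the event arguments pushed directly.
      my $flat = collect_text($file, '@{text}');
      printf "%s: %d nested entries, %d flat entries\n",
          $file, scalar @$nested, scalar @$flat;
  }
</PRE>

<P>

With a single-item argspec both arrays have the same number of entries;
the difference is whether each entry is a string or a one-element array.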
|
|
<A NAME="lbAG"> </A>
|
|
<H3>Events</H3>
|
|
|
|
|
|
|
|
Handlers for the following events can be registered:
|
|
<DL COMPACT>
|
|
<DT id="63">"comment"<DD>
|
|
|
|
|
|
|
|
|
|
This event is triggered when a markup comment is recognized.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Example:
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
<!-- This is a comment -- -- So is this -->
|
|
|
|
</PRE>
|
|
|
|
|
|
<DT id="64">"declaration"<DD>
|
|
|
|
|
|
|
|
|
|
This event is triggered when a <I>markup declaration</I> is recognized.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
For typical <FONT SIZE="-1">HTML</FONT> documents, the only declaration you are
|
|
likely to find is <!DOCTYPE ...>.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Example:
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
|
|
"<A HREF="http://www.w3.org/TR/html4/strict.dtd">http://www.w3.org/TR/html4/strict.dtd</A>">
|
|
|
|
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
DTDs inside <!DOCTYPE ...> will confuse HTML::Parser.
|
|
<DT id="65">"default"<DD>
|
|
|
|
|
|
|
|
|
|
This event is triggered for events that do not have a specific
|
|
handler. You can set up a handler for this event to catch stuff you
|
|
did not want to catch explicitly.
|
|
<DT id="66">"end"<DD>
|
|
|
|
|
|
|
|
|
|
This event is triggered when an end tag is recognized.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Example:
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
</A>
|
|
|
|
</PRE>
|
|
|
|
|
|
<DT id="67">"end_document"<DD>
|
|
|
|
|
|
|
|
|
|
This event is triggered when <TT>$p</TT>->eof is called and after any remaining
|
|
text is flushed. There is no document text associated with this event.
|
|
<DT id="68">"process"<DD>
|
|
|
|
|
|
|
|
|
|
This event is triggered when processing instruction markup is
|
|
recognized.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The format and content of processing instructions are system and
|
|
application dependent.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Examples:
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
<? HTML processing instructions >
|
|
<? XML processing instructions ?>
|
|
|
|
</PRE>
|
|
|
|
|
|
<DT id="69">"start"<DD>
|
|
|
|
|
|
|
|
|
|
This event is triggered when a start tag is recognized.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Example:
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
<A HREF="<A HREF="http://www.perl.com/">http://www.perl.com/</A>">
|
|
|
|
</PRE>
|
|
|
|
|
|
<DT id="70">"start_document"<DD>
|
|
|
|
|
|
|
|
|
|
This event is triggered before any other events for a new document. A
|
|
handler for it can be used to initialize per-document state. There is no document
|
|
text associated with this event.
|
|
<DT id="71">"text"<DD>
|
|
|
|
|
|
|
|
|
|
This event is triggered when plain text (characters) is recognized.
|
|
The text may contain multiple lines. A sequence of text may be broken
|
|
between several text events unless <TT>$p</TT>->unbroken_text is enabled.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The parser will make sure that it does not break a word or a sequence
|
|
of whitespace between two text events.
|
|
</DL>
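<P>

To see which events a given document produces, a sketch along these
lines may help. It uses the <TT>"event"</TT> argspec to record event names; the
helper name <TT>"event_log"</TT> is hypothetical:

<PRE>
  use strict;
  use warnings;
  use HTML::Parser ();

  # Hypothetical helper: return the names of the events seen in $file.
  sub event_log {
      my ($file) = @_;
      my @log;
      my $p = HTML::Parser->new(
          api_version => 3,
          # Comments get a handler of their own ...
          comment_h => [ sub { push @log, "$_[0] (own handler)" }, "event" ],
          # ... and events without a specific handler end up here.
          default_h => [ sub { push @log, $_[0] }, "event" ],
      );
      $p->parse_file($file) or die "Can't parse $file: $!";
      return @log;
  }

  for my $file (@ARGV) {
      print "$_\n" for event_log($file);
  }
</PRE>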
|
|
<A NAME="lbAH"> </A>
|
|
<H3>Unicode</H3>
|
|
|
|
|
|
|
|
<TT>"HTML::Parser"</TT> can parse Unicode strings when running under
|
|
perl-5.8 or better. If Unicode is passed to <TT>$p</TT>-><B>parse()</B> then chunks
|
|
of Unicode will be reported to the handlers. The offset and length
|
|
argspecs will also report their position in terms of characters.
|
|
<P>
|
|
|
|
It is safe to parse raw undecoded <FONT SIZE="-1">UTF-8</FONT> if you either avoid decoding
|
|
entities and make sure to not use <I>argspecs</I> that do, or enable the
|
|
<TT>"utf8_mode"</TT> for the parser. Parsing of undecoded <FONT SIZE="-1">UTF-8</FONT> might be
|
|
useful when parsing from a file where you need the reported offsets
|
|
and lengths to match the byte offsets in the file.
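<P>

As a sketch of that use case, the following reads a file in binary mode
with <TT>"utf8_mode"</TT> enabled and records byte offsets and lengths of the
text events. The helper name <TT>"text_spans"</TT> is hypothetical:

<PRE>
  use strict;
  use warnings;
  use HTML::Parser ();

  # Hypothetical helper: list [offset, length, decoded text] for every
  # text event in $file, with positions measured in bytes.
  sub text_spans {
      my ($file) = @_;
      my @spans;
      my $p = HTML::Parser->new(
          api_version => 3,
          utf8_mode   => 1,    # the input stays raw, undecoded UTF-8
          text_h      => [ sub { push @spans, [@_] },
                           "offset, length, dtext" ],
      );
      # Passing a file name makes parse_file() read in binary mode.
      $p->parse_file($file) or die "Can't parse $file: $!";
      return @spans;
  }

  for my $file (@ARGV) {
      printf "%d+%d\n", @{$_}[0, 1] for text_spans($file);
  }
</PRE>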
|
|
<P>
|
|
|
|
If a filename is passed to <TT>$p</TT>-><B>parse_file()</B> then the file will be read
|
|
in binary mode. This will be fine if the file contains only <FONT SIZE="-1">ASCII</FONT> or
|
|
Latin-1 characters. If the file contains <FONT SIZE="-1">UTF-8</FONT> encoded text then care
|
|
must be taken when decoding entities as described in the previous
|
|
paragraph, but it is better to open the file with the <FONT SIZE="-1">UTF-8</FONT> layer so that
|
|
it is decoded properly:
|
|
<P>
|
|
|
|
|
|
|
|
<PRE>
|
|
open(my $fh, "<:utf8", "index.html") || die "...: $!";
|
|
$p->parse_file($fh);
|
|
|
|
</PRE>
|
|
|
|
|
|
<P>
|
|
|
|
If the file contains text encoded in a charset besides <FONT SIZE="-1">ASCII,</FONT> Latin-1
|
|
or <FONT SIZE="-1">UTF-8</FONT> then decoding will always be needed.
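<P>

One way to do that decoding, sketched here under the assumption that
the charset is known in advance, is to open the file with an
<TT>":encoding(...)"</TT> layer and pass the handle to <TT>$p</TT>-><B>parse_file()</B>. The
helper name <TT>"decoded_text"</TT> is hypothetical:

<PRE>
  use strict;
  use warnings;
  use HTML::Parser ();

  # Hypothetical helper: return the text content of $file, which is
  # assumed to be encoded in $charset, with entities decoded.
  sub decoded_text {
      my ($file, $charset) = @_;
      # The :encoding() layer turns the file's bytes into characters.
      open(my $fh, "<:encoding($charset)", $file)
          or die "Can't open $file: $!";
      my $out = "";
      my $p = HTML::Parser->new(
          api_version => 3,
          text_h      => [ sub { $out .= $_[0] }, "dtext" ],
      );
      $p->parse_file($fh);
      close $fh;
      return $out;
  }

  # Usage: script CHARSET FILE...
  if (@ARGV) {
      my $charset = shift @ARGV;
      binmode STDOUT, ":utf8";
      print decoded_text($_, $charset) for @ARGV;
  }
</PRE>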
|
|
<A NAME="lbAI"> </A>
|
|
<H2>VERSION 2 COMPATIBILITY</H2>
|
|
|
|
|
|
|
|
When an <TT>"HTML::Parser"</TT> object is constructed with no arguments, a set
|
|
of handlers is automatically provided that is compatible with the old
|
|
HTML::Parser version 2 callback methods.
|
|
<P>
|
|
|
|
This is equivalent to the following method calls:
|
|
<P>
|
|
|
|
|
|
|
|
<PRE>
|
|
$p->handler(start => "start", "self, tagname, attr, attrseq, text");
|
|
$p->handler(end => "end", "self, tagname, text");
|
|
$p->handler(text => "text", "self, text, is_cdata");
|
|
$p->handler(process => "process", "self, token0, text");
|
|
$p->handler(comment =>
|
|
sub {
|
|
my($self, $tokens) = @_;
|
|
for (@$tokens) {$self->comment($_);}},
|
|
"self, tokens");
|
|
$p->handler(declaration =>
|
|
sub {
|
|
my $self = shift;
|
|
$self->declaration(substr($_[0], 2, -1));},
|
|
"self, text");
|
|
|
|
</PRE>
|
|
|
|
|
|
<P>
|
|
|
|
Setting up these handlers can also be requested with the ``api_version =>
|
|
2'' constructor option.
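<P>

A minimal sketch of a version 2 style parser follows. The class name
<TT>"TagCounter"</TT>, the helper <TT>"count_tags"</TT> and the hash keys used for
bookkeeping are all hypothetical; only the callback method names and
their argument lists come from the handler setup shown above:

<PRE>
  use strict;
  use warnings;

  package TagCounter;    # hypothetical subclass
  use HTML::Parser ();
  our @ISA = ('HTML::Parser');

  # Version 2 callbacks are ordinary methods on the parser object.
  sub start {
      my ($self, $tagname, $attr, $attrseq, $text) = @_;
      $self->{tag_count}{$tagname}++;
  }

  sub text {
      my ($self, $text, $is_cdata) = @_;
      $self->{text_length} += length $text;
  }

  package main;

  # Hypothetical helper: return a hash of start tag counts for $file.
  sub count_tags {
      my ($file) = @_;
      my $p = TagCounter->new(api_version => 2);
      $p->parse_file($file) or die "Can't parse $file: $!";
      return $p->{tag_count} || {};
  }

  for my $file (@ARGV) {
      my $count = count_tags($file);
      print "$_: $count->{$_}\n" for sort keys %$count;
  }
</PRE>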
|
|
<A NAME="lbAJ"> </A>
|
|
<H2>SUBCLASSING</H2>
|
|
|
|
|
|
|
|
The <TT>"HTML::Parser"</TT> class is subclassable. Parser objects are plain
|
|
hashes and <TT>"HTML::Parser"</TT> reserves only hash keys that start with
|
|
``_hparser''. The parser state can be set up by invoking the <B>init()</B>
|
|
method, which takes the same arguments as <B>new()</B>.
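<P>

A sketch of a subclass that supplies its own constructor and keeps its
own state in the parser hash might look like this. The class name
<TT>"LinkCollector"</TT>, its methods and the <TT>"links"</TT> hash key are hypothetical;
the key was chosen so that it does not start with ``_hparser'':

<PRE>
  use strict;
  use warnings;

  package LinkCollector;    # hypothetical subclass
  use HTML::Parser ();
  our @ISA = ('HTML::Parser');

  sub new {
      my ($class) = @_;
      # Own state lives in keys that do not start with "_hparser".
      my $self = bless { links => [] }, $class;
      # init() takes the same arguments as HTML::Parser->new().
      $self->init(
          api_version => 3,
          start_h     => [ "record_link", "self, tagname, attr" ],
      );
      return $self;
  }

  # Called as a method because the handler was given by name.
  sub record_link {
      my ($self, $tagname, $attr) = @_;
      push @{ $self->{links} }, $attr->{href}
          if $tagname eq 'a' and defined $attr->{href};
  }

  sub links { @{ $_[0]{links} } }

  package main;

  for my $file (@ARGV) {
      my $p = LinkCollector->new;
      $p->parse_file($file) or die "Can't parse $file: $!";
      print "$_\n" for $p->links;
  }
</PRE>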
|
|
<A NAME="lbAK"> </A>
|
|
<H2>EXAMPLES</H2>
|
|
|
|
|
|
|
|
The first simple example shows how you might strip out comments from
|
|
an <FONT SIZE="-1">HTML</FONT> document. We achieve this by setting up a comment handler that
|
|
does nothing and a default handler that will print out anything else:
|
|
<P>
|
|
|
|
|
|
|
|
<PRE>
|
|
use HTML::Parser;
|
|
HTML::Parser->new(default_h => [sub { print shift }, 'text'],
|
|
comment_h => [""],
|
|
)->parse_file(shift || die) || die $!;
|
|
|
|
</PRE>
|
|
|
|
|
|
<P>
|
|
|
|
An alternative implementation is:
|
|
<P>
|
|
|
|
|
|
|
|
<PRE>
|
|
use HTML::Parser;
|
|
HTML::Parser->new(end_document_h => [sub { print shift },
|
|
'skipped_text'],
|
|
comment_h => [""],
|
|
)->parse_file(shift || die) || die $!;
|
|
|
|
</PRE>
|
|
|
|
|
|
<P>
|
|
|
|
This will in most cases be much more efficient since only a single
|
|
callback will be made.
|
|
<P>
|
|
|
|
The next example prints out the text that is inside the <title>
|
|
element of an <FONT SIZE="-1">HTML</FONT> document. Here we start by setting up a start
|
|
handler. When it sees the title start tag it enables a text handler
|
|
that prints any text found and an end handler that will terminate
|
|
parsing as soon as the title end tag is seen:
|
|
<P>
|
|
|
|
|
|
|
|
<PRE>
|
|
use HTML::Parser ();
|
|
|
|
sub start_handler
|
|
{
|
|
return if shift ne "title";
|
|
my $self = shift;
|
|
$self->handler(text => sub { print shift }, "dtext");
|
|
$self->handler(end => sub { shift->eof if shift eq "title"; },
|
|
"tagname,self");
|
|
}
|
|
|
|
my $p = HTML::Parser->new(api_version => 3);
|
|
$p->handler( start => \&start_handler, "tagname,self");
|
|
$p->parse_file(shift || die) || die $!;
|
|
print "\n";
|
|
|
|
</PRE>
|
|
|
|
|
|
<P>
|
|
|
|
On a Debian box, more examples can be found in the
|
|
/usr/share/doc/libhtml-parser-perl/examples directory.
|
|
The program <TT>"hrefsub"</TT> shows how you can edit all links
|
|
found in a document and <TT>"htextsub"</TT> how to edit the text only; the
|
|
program <TT>"hstrip"</TT> shows how you can strip out certain tags/elements
|
|
and/or attributes; and the program <TT>"htext"</TT> shows how to obtain the
|
|
plain text, but not any script/style content.
|
|
<P>
|
|
|
|
You can browse the <I>eg/</I> directory online from the <I>[Browse]</I> link on
|
|
the <A HREF="http://search.cpan.org/~gaas/HTML-Parser/">http://search.cpan.org/~gaas/HTML-Parser/</A> page.
|
|
<A NAME="lbAL"> </A>
|
|
<H2>BUGS</H2>
|
|
|
|
|
|
|
|
The <style> and <script> sections do not end with the first ``</'', but
|
|
need the complete corresponding end tag. The standard behaviour is
|
|
not really practical.
|
|
<P>
|
|
|
|
When the <I>strict_comments</I> option is enabled, we still recognize
|
|
comments where there is something other than whitespace between even
|
|
and odd ``--'' markers.
|
|
<P>
|
|
|
|
Once <TT>$p</TT>->boolean_attribute_value has been set, there is no way to
|
|
restore the default behaviour.
|
|
<P>
|
|
|
|
There is currently no way to get both quote characters
|
|
into the same literal argspec.
|
|
<P>
|
|
|
|
Empty tags, e.g. ``<>'' and ``</>'', are not recognized. <FONT SIZE="-1">SGML</FONT> allows them
|
|
to repeat the previous start tag or close the previous start tag
|
|
respectively.
|
|
<P>
|
|
|
|
<FONT SIZE="-1">NET</FONT> tags, e.g. ``code/.../'' are not recognized. This is <FONT SIZE="-1">SGML</FONT>
|
|
shorthand for ``<code>...</code>''.
|
|
<P>
|
|
|
|
Unclosed start or end tags, e.g. ``<tt<b>...</b</tt>'' are not
|
|
recognized.
|
|
<A NAME="lbAM"> </A>
|
|
<H2>DIAGNOSTICS</H2>
|
|
|
|
|
|
|
|
The following messages may be produced by HTML::Parser. The notation
|
|
in this listing is the same as used in perldiag:
|
|
<DL COMPACT>
|
|
<DT id="72">Not a reference to a hash<DD>
|
|
|
|
|
|
(F) The object blessed into or subclassed from HTML::Parser is not a
|
|
hash as required by the HTML::Parser methods.
|
|
<DT id="73">Bad signature in parser state object at %p<DD>
|
|
|
|
|
|
|
|
|
|
(F) The _hparser_xs_state element does not refer to a valid state structure.
|
|
Something must have changed the internal value
|
|
stored in this hash element, or the memory has been overwritten.
|
|
<DT id="74">_hparser_xs_state element is not a reference<DD>
|
|
|
|
|
|
(F) The _hparser_xs_state element has been destroyed.
|
|
<DT id="75">Can't find '_hparser_xs_state' element in HTML::Parser hash<DD>
|
|
|
|
|
|
(F) The _hparser_xs_state element is missing from the parser hash.
|
|
It was either deleted, or not created when the object was created.
|
|
<DT id="76"><FONT SIZE="-1">API</FONT> version %s not supported by HTML::Parser %s<DD>
|
|
|
|
|
|
|
|
|
|
(F) The constructor option 'api_version' with an argument greater than
|
|
or equal to 4 is reserved for future extensions.
|
|
<DT id="77">Bad constructor option '%s'<DD>
|
|
|
|
|
|
(F) An unknown constructor option key was passed to the <B>new()</B> or
|
|
<B>init()</B> methods.
|
|
<DT id="78">Parse loop not allowed<DD>
|
|
|
|
|
|
(F) A handler invoked the <B>parse()</B> or <B>parse_file()</B> method.
|
|
This is not permitted.
|
|
<DT id="79">marked sections not supported<DD>
|
|
|
|
|
|
(F) The <TT>$p</TT>-><B>marked_sections()</B> method was invoked in an HTML::Parser
|
|
module that was compiled without support for marked sections.
|
|
<DT id="80">Unknown boolean attribute (%d)<DD>
|
|
|
|
|
|
(F) Something is wrong with the internal logic that set up aliases for
|
|
boolean attributes.
|
|
<DT id="81">Only code or array references allowed as handler<DD>
|
|
|
|
|
|
(F) The second argument for <TT>$p</TT>->handler must be either a subroutine
|
|
reference, the name of a subroutine or method, or a reference to an
|
|
array.
|
|
<DT id="82">No handler for %s events<DD>
|
|
|
|
|
|
|
|
|
|
(F) The first argument to <TT>$p</TT>->handler must be a valid event name, that is,
one of the events listed in the Events section above: ``start'', ``end'', ``text'',
``process'', ``declaration'', ``comment'', ``default'', ``start_document'' or
``end_document''.
|
|
<DT id="83">Unrecognized identifier %s in argspec<DD>
|
|
|
|
|
|
|
|
|
|
(F) The identifier is not a known argspec name.
|
|
Use one of the names mentioned in the argspec section above.
|
|
<DT id="84">Literal string is longer than 255 chars in argspec<DD>
|
|
|
|
|
|
(F) The current implementation limits the length of literals in
|
|
an argspec to 255 characters. Make the literal shorter.
|
|
<DT id="85">Backslash reserved for literal string in argspec<DD>
|
|
|
|
|
|
(F) The backslash character ``\'' is not allowed in argspec literals.
|
|
It is reserved to permit quoting inside a literal in a later version.
|
|
<DT id="86">Unterminated literal string in argspec<DD>
|
|
|
|
|
|
(F) The terminating quote character for a literal was not found.
|
|
<DT id="87">Bad argspec (%s)<DD>
|
|
|
|
|
|
(F) Only identifier names, literals, spaces and commas
|
|
are allowed in argspecs.
|
|
<DT id="88">Missing comma separator in argspec<DD>
|
|
|
|
|
|
(F) Identifiers in an argspec must be separated with ``,''.
|
|
<DT id="89">Parsing of undecoded <FONT SIZE="-1">UTF-8</FONT> will give garbage when decoding entities<DD>
|
|
|
|
|
|
(W) The first chunk parsed appears to contain undecoded <FONT SIZE="-1">UTF-8</FONT> and one
|
|
or more argspecs that decode entities are used for the callback
|
|
handlers.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The result of decoding will be a mix of encoded and decoded characters
|
|
for any entities that expand to characters with code above 127. This
|
|
is not a good thing.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The recommended solution is to apply <B>Encode::decode_utf8()</B> on the data before
|
|
feeding it to <TT>$p</TT>-><B>parse()</B>. For <TT>$p</TT>-><B>parse_file()</B> pass a file that has been
|
|
opened in ``:utf8'' mode.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The alternative solution is to enable the <TT>"utf8_mode"</TT> and not decode before
|
|
passing strings to <TT>$p</TT>-><B>parse()</B>. The parser can process raw undecoded <FONT SIZE="-1">UTF-8</FONT>
|
|
sanely if the <TT>"utf8_mode"</TT> is enabled, or if the ``attr'', ``@attr'' or ``dtext''
|
|
argspecs are avoided.
|
|
<DT id="90">Parsing string decoded with wrong endianness<DD>
|
|
|
|
|
|
(W) The first character in the document is U+FFFE. This is not a
|
|
legal Unicode character but a byte swapped <FONT SIZE="-1">BOM.</FONT> The result of parsing
|
|
will likely be garbage.
|
|
<DT id="91">Parsing of undecoded <FONT SIZE="-1">UTF-32</FONT><DD>
|
|
|
|
|
|
(W) The parser found the Unicode <FONT SIZE="-1">UTF-32 BOM</FONT> signature at the start
|
|
of the document. The result of parsing will likely be garbage.
|
|
<DT id="92">Parsing of undecoded <FONT SIZE="-1">UTF-16</FONT><DD>
|
|
|
|
|
|
(W) The parser found the Unicode <FONT SIZE="-1">UTF-16 BOM</FONT> signature at the start of
|
|
the document. The result of parsing will likely be garbage.
|
|
</DL>
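<P>

Since the (F) messages are fatal, code that builds parsers from
untrusted configuration may want to trap them with <TT>"eval"</TT>. The following
is a minimal sketch; the helper name <TT>"setup_error"</TT> and the deliberately
wrong argspec name are hypothetical:

<PRE>
  use strict;
  use warnings;
  use HTML::Parser ();

  # Hypothetical helper: try to construct a parser with the given
  # options and return the fatal message, or "" on success.
  sub setup_error {
      my @options = @_;
      return eval { HTML::Parser->new(@options); 1 } ? "" : $@;
  }

  # An API version that is reserved for future extensions.
  print setup_error(api_version => 4);

  # An argspec identifier that is not a known name.
  print setup_error(api_version => 3,
                    start_h     => [ sub { }, "no_such_name" ]);
</PRE>

<P>

Warnings marked (W) are not trapped this way; they are reported while
parsing rather than raised as exceptions.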
|
|
<A NAME="lbAN"> </A>
|
|
<H2>SEE ALSO</H2>
|
|
|
|
|
|
|
|
HTML::Entities, HTML::PullParser, HTML::TokeParser, HTML::HeadParser,
|
|
HTML::LinkExtor, HTML::Form
|
|
<P>
|
|
|
|
HTML::TreeBuilder (part of the <I>HTML-Tree</I> distribution)
|
|
<P>
|
|
|
|
<<A HREF="http://www.w3.org/TR/html4/">http://www.w3.org/TR/html4/</A>>
|
|
<P>
|
|
|
|
More information about marked sections and processing instructions may
|
|
be found at <<A HREF="http://www.is-thought.co.uk/book/sgml-8.htm">http://www.is-thought.co.uk/book/sgml-8.htm</A>>.
|
|
<A NAME="lbAO"> </A>
|
|
<H2>COPYRIGHT</H2>
|
|
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
Copyright 1996-2016 Gisle Aas. All rights reserved.
|
|
Copyright 1999-2000 Michael A. Chase. All rights reserved.
|
|
|
|
</PRE>
|
|
|
|
|
|
<P>
|
|
|
|
This library is free software; you can redistribute it and/or
|
|
modify it under the same terms as Perl itself.
|
|
<P>
|
|
|
|
<HR>
|
|
<A NAME="index"> </A><H2>Index</H2>
|
|
<DL>
|
|
<DT id="93"><A HREF="#lbAB">NAME</A><DD>
|
|
<DT id="94"><A HREF="#lbAC">SYNOPSIS</A><DD>
|
|
<DT id="95"><A HREF="#lbAD">DESCRIPTION</A><DD>
|
|
<DT id="96"><A HREF="#lbAE">METHODS</A><DD>
|
|
<DL>
|
|
<DT id="97"><A HREF="#lbAF">Argspec</A><DD>
|
|
<DT id="98"><A HREF="#lbAG">Events</A><DD>
|
|
<DT id="99"><A HREF="#lbAH">Unicode</A><DD>
|
|
</DL>
|
|
<DT id="100"><A HREF="#lbAI">VERSION 2 COMPATIBILITY</A><DD>
|
|
<DT id="101"><A HREF="#lbAJ">SUBCLASSING</A><DD>
|
|
<DT id="102"><A HREF="#lbAK">EXAMPLES</A><DD>
|
|
<DT id="103"><A HREF="#lbAL">BUGS</A><DD>
|
|
<DT id="104"><A HREF="#lbAM">DIAGNOSTICS</A><DD>
|
|
<DT id="105"><A HREF="#lbAN">SEE ALSO</A><DD>
|
|
<DT id="106"><A HREF="#lbAO">COPYRIGHT</A><DD>
|
|
</DL>
|
|
<HR>
|
|
This document was created by
|
|
<A HREF="/cgi-bin/man/man2html">man2html</A>,
|
|
using the manual pages.<BR>
|
|
Time: 00:05:45 GMT, March 31, 2021
|
|
</BODY>
|
|
</HTML>
|