<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD><TITLE>Man page of HTML::TokeParser</TITLE>
</HEAD><BODY>
<H1>HTML::TokeParser</H1>
Section: User Contributed Perl Documentation (3pm)<BR>Updated: 2020-02-18<BR><A HREF="#index">Index</A>
<A HREF="/cgi-bin/man/man2html">Return to Main Contents</A><HR>

<A NAME="lbAB">&nbsp;</A>
<H2>NAME</H2>

HTML::TokeParser - Alternative HTML::Parser interface
<A NAME="lbAC">&nbsp;</A>
<H2>SYNOPSIS</H2>

<PRE>
 require HTML::TokeParser;
 $p = HTML::TokeParser->new("index.html") ||
      die "Can't open: $!";
 $p->empty_element_tags(1);  # configure its behaviour

 while (my $token = $p->get_token) {
     #...
 }

</PRE>

<A NAME="lbAD">&nbsp;</A>
<H2>DESCRIPTION</H2>

<TT>"HTML::TokeParser"</TT> is an alternative interface to the
<TT>"HTML::Parser"</TT> class.  It is an <TT>"HTML::PullParser"</TT> subclass with a
predeclared set of token types.  If you wish the tokens to be reported
differently, you probably want to use <TT>"HTML::PullParser"</TT> directly.
<P>

The following methods are available:
<DL COMPACT>
<DT id="1">$p = HTML::TokeParser->new( $filename, %opt );<DD>
<DT id="2">$p = HTML::TokeParser->new( $filehandle, %opt );<DD>
<DT id="3">$p = HTML::TokeParser->new( \$document, %opt );<DD>
The object constructor argument is either a file name, a file handle
object, or the complete document to be parsed.  Extra options can be
provided as key/value pairs and are processed as documented by the base
classes.
<P>

If the argument is a plain scalar, then it is taken as the name of a
file to be opened and parsed.  If the file can't be opened for
reading, then the constructor will return <TT>"undef"</TT> and $! will tell
you why it failed.
<P>

If the argument is a reference to a plain scalar, then this scalar is
taken to be the literal document to parse.  The value of this
scalar should not be changed before all tokens have been extracted.
<P>

Otherwise the argument is taken to be some object that the
<TT>"HTML::TokeParser"</TT> can <B>read()</B> from when it needs more data.  Typically
it will be a filehandle of some kind.  The stream will be <B>read()</B> until
<FONT SIZE="-1">EOF,</FONT> but not closed.
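<P>

As a minimal sketch of the constructor forms described above, the
following parses the same markup once from a scalar reference and once
from an already opened filehandle.  The sample markup and the in-memory
filehandle are assumptions made for illustration only:
<P>

<PRE>
 use strict;
 use warnings;
 use HTML::TokeParser;

 # Sample markup, made up for this illustration.
 my $document = '&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;';

 # A reference to a plain scalar is parsed as the literal document.
 my $p = HTML::TokeParser->new( \$document )
     || die "Can't create parser";

 # A filehandle is read() from as needed; an in-memory filehandle
 # stands in for a real file here.
 open(my $fh, "&lt;", \$document) || die "Can't open in-memory file: $!";
 my $q = HTML::TokeParser->new( $fh )
     || die "Can't create parser";

 print $p->get_tag->[0], "\n";   # name of the first tag
 print $q->get_tag->[0], "\n";

 close($fh);   # the parser does not close the stream itself

</PRE>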
<P>

A newly constructed <TT>"HTML::TokeParser"</TT> differs from its base classes
by having the <TT>"unbroken_text"</TT> attribute enabled by default.  See
HTML::Parser for a description of this and other attributes that
influence how the document is parsed.  It is often a good idea to enable
the <TT>"empty_element_tags"</TT> behaviour.
<P>

Note that the parsing result will likely not be valid if raw undecoded
<FONT SIZE="-1">UTF-8</FONT> is used as a source.  When parsing <FONT SIZE="-1">UTF-8</FONT> encoded files, turn
on <FONT SIZE="-1">UTF-8</FONT> decoding:
<P>

<PRE>
 open(my $fh, "&lt;:utf8", "index.html") || die "Can't open 'index.html': $!";
 my $p = HTML::TokeParser->new( $fh );
 # ...

</PRE>

<P>

If a <TT>$filename</TT> is passed to the constructor, the file will be opened in
raw mode and the parsing result will only be valid if its content is
Latin-1 or pure <FONT SIZE="-1">ASCII.</FONT>
<P>

If parsing from a <FONT SIZE="-1">UTF-8</FONT> encoded string buffer, decode it first:
<P>

<PRE>
 utf8::decode($document);
 my $p = HTML::TokeParser->new( \$document );
 # ...

</PRE>

<DT id="4">$p->get_token<DD>
This method will return the next <I>token</I> found in the <FONT SIZE="-1">HTML</FONT> document,
or <TT>"undef"</TT> at the end of the document.  The token is returned as an
array reference.  The first element of the array will be a string
denoting the type of this token: ``S'' for start tag, ``E'' for end tag,
``T'' for text, ``C'' for comment, ``D'' for declaration, and ``<FONT SIZE="-1">PI</FONT>'' for
processing instructions.  The rest of the token array depends on the type
like this:
<P>

<PRE>
 ["S",  $tag, $attr, $attrseq, $text]
 ["E",  $tag, $text]
 ["T",  $text, $is_data]
 ["C",  $text]
 ["D",  $text]
 ["PI", $token0, $text]

</PRE>

<P>

where <TT>$attr</TT> is a hash reference, <TT>$attrseq</TT> is an array reference and
the rest are plain scalars.  The ``Argspec'' section in HTML::Parser explains the
details.
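<P>

The token layout above can be sketched as a loop that dispatches on the
type code in the first element.  The sample markup is an assumption made
for illustration; only <TT>get_token</TT> and the documented token
layout are relied on:
<P>

<PRE>
 use strict;
 use warnings;
 use HTML::TokeParser;

 # Sample markup, made up for this illustration.
 my $document = '&lt;!DOCTYPE html&gt;&lt;!-- note --&gt;&lt;p class="x"&gt;Hi&lt;/p&gt;';
 my $p = HTML::TokeParser->new( \$document )
     || die "Can't create parser";

 while (my $token = $p->get_token) {
     my $type = $token->[0];
     if ($type eq "S") {
         # Start tag: name in [1], attribute hash reference in [2].
         my ($tag, $attr) = @{$token}[1, 2];
         print "start: $tag (", join(",", sort keys %$attr), ")\n";
     }
     elsif ($type eq "E")  { print "end: $token->[1]\n" }
     elsif ($type eq "T")  { print "text: $token->[1]\n" }
     elsif ($type eq "C")  { print "comment: $token->[1]\n" }
     elsif ($type eq "D")  { print "declaration: $token->[1]\n" }
     elsif ($type eq "PI") { print "pi: $token->[2]\n" }
 }

</PRE>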
<DT id="5">$p->unget_token( @tokens )<DD>
If you find you have read too many tokens you can push them back,
so that they are returned the next time <TT>$p</TT>->get_token is called.
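<P>

One possible use is peeking at the next token without consuming it.
This is only a sketch; the sample markup is an assumption made for
illustration:
<P>

<PRE>
 use strict;
 use warnings;
 use HTML::TokeParser;

 # Sample markup, made up for this illustration.
 my $document = '&lt;h1&gt;Title&lt;/h1&gt;&lt;p&gt;Body&lt;/p&gt;';
 my $p = HTML::TokeParser->new( \$document )
     || die "Can't create parser";

 # Read one token, inspect it, then push it back so that the
 # next get_token call returns it again.
 my $first = $p->get_token;
 print "peeked at: $first->[0] $first->[1]\n";
 $p->unget_token($first);

 my $again = $p->get_token;
 print "got it again\n" if $again->[1] eq $first->[1];

</PRE>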
<DT id="6">$p->get_tag<DD>
<DT id="7">$p->get_tag( @tags )<DD>
This method returns the next start or end tag (skipping any other
tokens), or <TT>"undef"</TT> if there are no more tags in the document.  If
one or more arguments are given, then we skip tokens until one of the
specified tag types is found.  For example:
<P>

<PRE>
 $p->get_tag("font", "/font");

</PRE>

<P>

will find the next start or end tag for a font-element.
<P>

The tag information is returned as an array reference in the same form
as for <TT>$p</TT>->get_token above, but the type code (first element) is
missing.  A start tag will be returned like this:
<P>

<PRE>
 [$tag, $attr, $attrseq, $text]

</PRE>

<P>

The tag names of end tags are prefixed with ``/'', i.e. an end tag is
returned like this:
<P>

<PRE>
 ["/$tag", $text]

</PRE>
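<P>

As a sketch of how the two return forms can be told apart, the following
checks the tag name for a leading ``/''.  The sample markup and the
<TT>@seen</TT> array are assumptions made for illustration:
<P>

<PRE>
 use strict;
 use warnings;
 use HTML::TokeParser;

 # Sample markup, made up for this illustration.
 my $document = '&lt;ul&gt;&lt;li&gt;one&lt;/li&gt;&lt;li&gt;two&lt;/li&gt;&lt;/ul&gt;';
 my $p = HTML::TokeParser->new( \$document )
     || die "Can't create parser";

 my @seen;   # tag names in the order they were returned
 while (my $tag = $p->get_tag("li", "/li")) {
     my $name = $tag->[0];
     push @seen, $name;
     if ($name =~ m{^/}) {
         print "end tag: $name\n";
     }
     else {
         # Start tags also carry the attribute hash in [1].
         print "start tag: $name\n";
     }
 }

</PRE>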
<DT id="8">$p->get_text<DD>
<DT id="9">$p->get_text( @endtags )<DD>
This method returns all text found at the current position.  It will
return a zero length string if the next token is not text.  Any
entities will be converted to their corresponding character.
<P>

If one or more arguments are given, then we return all text occurring
before the first of the specified tags found.  For example:
<P>

<PRE>
 $p->get_text("p", "br");

</PRE>

<P>

will return the text up to either a paragraph or linebreak element.
<P>

The text might span tags that should be <I>textified</I>.  This is
controlled by the <TT>$p</TT>->{textify} attribute, which is a hash that
defines how certain tags can be treated as text.  If the name of a
start tag matches a key in this hash, then this tag is converted to
text.  The hash value is used to specify which tag attribute to obtain
the text from.  If this tag attribute is missing, then the upper case
name of the tag enclosed in brackets is returned, e.g. ``[<FONT SIZE="-1">IMG</FONT>]''.  The
hash value can also be a subroutine reference.  In this case the
routine is called with the start tag token content as its argument and
the return value is treated as the text.
<P>

The default <TT>$p</TT>->{textify} value is:
<P>

<PRE>
 {img => "alt", applet => "alt"}

</PRE>

<P>

This means that &lt;<FONT SIZE="-1">IMG</FONT>&gt; and &lt;<FONT SIZE="-1">APPLET</FONT>&gt; tags are treated as text, and that
the text to substitute can be found in the <FONT SIZE="-1">ALT</FONT> attribute.
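<P>

A minimal sketch of replacing the textify hash follows.  The sample
markup and the choice to render a br element as a newline are
assumptions made for illustration, not defaults of the module:
<P>

<PRE>
 use strict;
 use warnings;
 use HTML::TokeParser;

 # Sample markup, made up for this illustration.
 my $document = '&lt;p&gt;Logo: &lt;img src="logo.png" alt="ACME"&gt; or '
              . '&lt;img src="photo.png"&gt;&lt;br&gt;Next line&lt;/p&gt;';
 my $p = HTML::TokeParser->new( \$document )
     || die "Can't create parser";

 $p->{textify} = {
     img => "alt",            # take the text from the ALT attribute
     br  => sub { "\n" },     # hypothetical: turn a br into a newline
 };

 $p->get_tag("p");
 my $text = $p->get_text("/p");

 # The first image contributes its ALT text; the second has no ALT
 # attribute, so the bracketed upper case tag name is used instead.
 print $text, "\n";

</PRE>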
<DT id="10">$p->get_trimmed_text<DD>
<DT id="11">$p->get_trimmed_text( @endtags )<DD>
Same as <TT>$p</TT>->get_text above, but will collapse any sequences of white
space to a single space character.  Leading and trailing white space is
removed.
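<P>

The difference from get_text can be sketched by running both methods
over the same markup.  The sample markup is an assumption made for
illustration:
<P>

<PRE>
 use strict;
 use warnings;
 use HTML::TokeParser;

 # Sample markup with extra white space, made up for this illustration.
 my $document = "&lt;p&gt;  Hello,\n   world  &lt;/p&gt;";

 my $p1 = HTML::TokeParser->new( \$document )
     || die "Can't create parser";
 $p1->get_tag("p");
 my $raw = $p1->get_text("/p");               # white space kept as is

 my $p2 = HTML::TokeParser->new( \$document )
     || die "Can't create parser";
 $p2->get_tag("p");
 my $trimmed = $p2->get_trimmed_text("/p");   # "Hello, world"

 print "[$raw]\n[$trimmed]\n";

</PRE>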
<DT id="12">$p->get_phrase<DD>
This will return all text found at the current position ignoring any
phrasal-level tags.  Text is extracted until the first
non-phrasal-level tag.  Textification of tags is the same as for
<B>get_text()</B>.  This method will collapse white space in the same way as
<B>get_trimmed_text()</B> does.
<P>

The definition of <i>phrasal-level tags</i> is obtained from the
HTML::Tagset module.
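<P>

As a small sketch, assuming that b and i count as phrasal-level tags and
p does not (as defined by HTML::Tagset), get_phrase reads across the
inline markup and stops at the end of the first paragraph.  The sample
markup is made up for illustration:
<P>

<PRE>
 use strict;
 use warnings;
 use HTML::TokeParser;

 # Sample markup, made up for this illustration.
 my $document = '&lt;p&gt;Some &lt;b&gt;bold&lt;/b&gt; and &lt;i&gt;italic&lt;/i&gt; text&lt;/p&gt;'
              . '&lt;p&gt;Next&lt;/p&gt;';
 my $p = HTML::TokeParser->new( \$document )
     || die "Can't create parser";

 $p->get_tag("p");
 my $phrase = $p->get_phrase;   # text of the first paragraph only
 print "$phrase\n";

</PRE>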
</DL>
<A NAME="lbAE">&nbsp;</A>
<H2>EXAMPLES</H2>

This example extracts all links from a document.  It will print one
line for each link, containing the <FONT SIZE="-1">URL</FONT> and the textual description
between the &lt;A&gt;...&lt;/A&gt; tags:
<P>

<PRE>
 use HTML::TokeParser;
 $p = HTML::TokeParser->new(shift||"index.html");

 while (my $token = $p->get_tag("a")) {
     my $url = $token->[1]{href} || "-";
     my $text = $p->get_trimmed_text("/a");
     print "$url\t$text\n";
 }

</PRE>

<P>

This example extracts the &lt;<FONT SIZE="-1">TITLE</FONT>&gt; from the document:
<P>

<PRE>
 use HTML::TokeParser;
 $p = HTML::TokeParser->new(shift||"index.html");
 if ($p->get_tag("title")) {
     my $title = $p->get_trimmed_text;
     print "Title: $title\n";
 }

</PRE>

<A NAME="lbAF">&nbsp;</A>
<H2>SEE ALSO</H2>

HTML::PullParser, HTML::Parser
<A NAME="lbAG">&nbsp;</A>
<H2>COPYRIGHT</H2>

Copyright 1998-2005 Gisle Aas.
<P>

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
<P>

<HR>
<A NAME="index">&nbsp;</A><H2>Index</H2>
<DL>
<DT id="13"><A HREF="#lbAB">NAME</A><DD>
<DT id="14"><A HREF="#lbAC">SYNOPSIS</A><DD>
<DT id="15"><A HREF="#lbAD">DESCRIPTION</A><DD>
<DT id="16"><A HREF="#lbAE">EXAMPLES</A><DD>
<DT id="17"><A HREF="#lbAF">SEE ALSO</A><DD>
<DT id="18"><A HREF="#lbAG">COPYRIGHT</A><DD>
</DL>
<HR>
This document was created by
<A HREF="/cgi-bin/man/man2html">man2html</A>,
using the manual pages.<BR>
Time: 00:05:45 GMT, March 31, 2021
</BODY>
</HTML>