<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD><TITLE>Man page of HTML::PullParser</TITLE>
</HEAD><BODY>
<H1>HTML::PullParser</H1>
Section: User Contributed Perl Documentation (3pm)<BR>Updated: 2020-02-18<BR><A HREF="#index">Index</A>
<A HREF="/cgi-bin/man/man2html">Return to Main Contents</A><HR>

<A NAME="lbAB"> </A>
<H2>NAME</H2>

HTML::PullParser - Alternative HTML::Parser interface
<A NAME="lbAC"> </A>
<H2>SYNOPSIS</H2>

<PRE>
 use HTML::PullParser;

 $p = HTML::PullParser->new(file => "index.html",
                            start => 'event, tagname, @attr',
                            end   => 'event, tagname',
                            ignore_elements => [qw(script style)],
                           ) || die "Can't open: $!";
 while (my $token = $p->get_token) {
     #...do something with $token
 }
</PRE>

<A NAME="lbAD"> </A>
<H2>DESCRIPTION</H2>

HTML::PullParser is an alternative interface to the HTML::Parser class.
It turns HTML::Parser inside out: instead of the parser invoking your
handlers as it goes, you associate a file (or any IO::Handle object or
string) with the parser at construction time and then repeatedly call
<TT>$parser</TT>->get_token to obtain the tags and text found in the
parsed document.
<P>

The following methods are provided:
<DL COMPACT>
<DT id="1">$p = HTML::PullParser->new( file => $file, %options )<DD>
<DT id="2">$p = HTML::PullParser->new( doc => \$doc, %options )<DD>

An <TT>"HTML::PullParser"</TT> parses from either a file or a literal
document, depending on whether the <TT>"file"</TT> or the <TT>"doc"</TT>
option is passed to the constructor.

<P>

The <TT>"file"</TT> argument can be either a file name or a file handle
object. If a file name is passed and the file can't be opened for reading,
the constructor returns an undefined value and $! tells you why it failed.
Otherwise the argument is taken to be an object that
<TT>"HTML::PullParser"</TT> can <B>read()</B> from when it needs more data.
The stream is <B>read()</B> until <FONT SIZE="-1">EOF,</FONT> but it is not closed.

<P>

A <TT>"doc"</TT> can be passed either as a plain string or as a reference
to a scalar. If a reference is passed, the value of that scalar
should not be changed until all tokens have been extracted.

<P>

Next, the information to be returned for the different token types must
be set up. This is done by associating an argspec (as defined
in HTML::Parser) with each event you are interested in. For
instance, if you want <TT>"start"</TT> tokens to be reported as the string
<TT>'S'</TT> followed by the tag name and the attributes, you might pass a
<TT>"start"</TT> option like this:

<P>

<PRE>
  $p = HTML::PullParser->new(
         doc   => $document_to_parse,
         start => '"S", tagname, @attr',
         end   => '"E", tagname',
       );
</PRE>

<P>

Finally, other <TT>"HTML::Parser"</TT> options, such as <TT>"ignore_tags"</TT> and
<TT>"unbroken_text"</TT>, can be passed in. Note that you should not use the
<I>event</I>_h options to set up parser handlers; doing so would confuse the
inner logic of <TT>"HTML::PullParser"</TT>.
<DT id="3">$token = $p->get_token<DD>

This method returns the next <I>token</I> found in the <FONT SIZE="-1">HTML</FONT> document,
or <TT>"undef"</TT> at the end of the document. The token is returned as an
array reference. The contents of this array match the argspec set up
during <TT>"HTML::PullParser"</TT> construction.
<DT id="4">$p->unget_token( @tokens )<DD>

If you find that you have read too many tokens, you can push them back
so that they are returned again the next time <TT>$p</TT>->get_token is called.
</DL>
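<P>

The methods above can be combined as in the following minimal sketch, which
collects the target and label of each link in a document. The sample markup
in <TT>$html</TT>, the single-letter event names and the <TT>@links</TT>
array are illustrative assumptions, not part of the module;
<TT>"dtext"</TT> is one of the argspecs defined in HTML::Parser.

<PRE>
use strict;
use warnings;
use HTML::PullParser;

# Illustrative sample markup (an assumption made for this sketch).
my $html = '&lt;p&gt;Read the &lt;a href="/docs"&gt;docs&lt;/a&gt; or the &lt;a href="/faq"&gt;FAQ&lt;/a&gt;.&lt;/p&gt;';

my $p = HTML::PullParser->new(
    doc           => \$html,                 # do not modify $html while parsing
    start         => '"S", tagname, @attr',
    end           => '"E", tagname',
    text          => '"T", dtext',           # dtext: entity-decoded text
    unbroken_text => 1,                      # deliver each text run as one token
);

my @links;    # each entry: [ href, label ]
while (my $token = $p->get_token) {
    my ($type, $tag, %attr) = @$token;
    next unless $type eq 'S' and $tag eq 'a';

    # Peek at the token after the anchor's start tag. Keep it if it is
    # text; otherwise push it back so the loop sees it on the next pass.
    my $next  = $p->get_token;
    my $label = '';
    if ($next and $next->[0] eq 'T') {
        $label = $next->[1];
    }
    elsif ($next) {
        $p->unget_token($next);
    }
    push @links, [ $attr{href} // '', $label ];
}

printf "%s -> %s\n", @$_ for @links;
</PRE>

<P>

Here <TT>"unbroken_text"</TT> is enabled so that a link label arrives as a
single text token, and <B>unget_token()</B> returns a peeked-at token that
turned out not to be text, so the main loop still processes it.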

<A NAME="lbAE"> </A>
<H2>EXAMPLES</H2>

The 'eg/hform' script shows how we might parse the form section of
HTML documents using HTML::PullParser.
<A NAME="lbAF"> </A>
<H2>SEE ALSO</H2>

HTML::Parser, HTML::TokeParser
<A NAME="lbAG"> </A>
<H2>COPYRIGHT</H2>

Copyright 1998-2001 Gisle Aas.
<P>

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
<P>

<HR>
<A NAME="index"> </A><H2>Index</H2>
<DL>
<DT id="5"><A HREF="#lbAB">NAME</A><DD>
<DT id="6"><A HREF="#lbAC">SYNOPSIS</A><DD>
<DT id="7"><A HREF="#lbAD">DESCRIPTION</A><DD>
<DT id="8"><A HREF="#lbAE">EXAMPLES</A><DD>
<DT id="9"><A HREF="#lbAF">SEE ALSO</A><DD>
<DT id="10"><A HREF="#lbAG">COPYRIGHT</A><DD>
</DL>
<HR>
This document was created by
<A HREF="/cgi-bin/man/man2html">man2html</A>,
using the manual pages.<BR>
Time: 00:05:45 GMT, March 31, 2021
</BODY>
</HTML>