man-pages/man3/HTML::PullParser.3pm.html
2021-03-31 01:06:50 +01:00

184 lines
4.7 KiB
HTML

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD><TITLE>Man page of HTML::PullParser</TITLE>
</HEAD><BODY>
<H1>HTML::PullParser</H1>
Section: User Contributed Perl Documentation (3pm)<BR>Updated: 2020-02-18<BR><A HREF="#index">Index</A>
<A HREF="/cgi-bin/man/man2html">Return to Main Contents</A><HR>
<A NAME="lbAB">&nbsp;</A>
<H2>NAME</H2>
HTML::PullParser - Alternative HTML::Parser interface
<A NAME="lbAC">&nbsp;</A>
<H2>SYNOPSIS</H2>
<PRE>
use HTML::PullParser;
$p = HTML::PullParser-&gt;new(file =&gt; &quot;index.html&quot;,
start =&gt; 'event, tagname, @attr',
end =&gt; 'event, tagname',
ignore_elements =&gt; [qw(script style)],
) || die &quot;Can't open: $!&quot;;
while (my $token = $p-&gt;get_token) {
#...do something with $token
}
</PRE>
<A NAME="lbAD">&nbsp;</A>
<H2>DESCRIPTION</H2>
The HTML::PullParser is an alternative interface to the HTML::Parser class.
It basically turns the HTML::Parser inside out. You associate a file
(or any IO::Handle object or string) with the parser at construction time and
then repeatedly call <TT>$parser</TT>-&gt;get_token to obtain the tags and text
found in the parsed document.
<P>
The following methods are provided:
<DL COMPACT>
<DT id="1">$p = HTML::PullParser-&gt;new( file =&gt; $file, %options )<DD>
<DT id="2">$p = HTML::PullParser-&gt;new( doc =&gt; \$doc, %options )<DD>
A <TT>&quot;HTML::PullParser&quot;</TT> can be made to parse from either a file or a
literal document based on whether the <TT>&quot;file&quot;</TT> or <TT>&quot;doc&quot;</TT> option is
passed to the parser's constructor.
<P>
The <TT>&quot;file&quot;</TT> passed in can either be a file name or a file handle
object. If a file name is passed, and it can't be opened for reading,
then the constructor will return an undefined value and $! will tell
you why it failed. Otherwise the argument is taken to be some object
that the <TT>&quot;HTML::PullParser&quot;</TT> can <B>read()</B> from when it needs more data.
The stream will be <B>read()</B> until <FONT SIZE="-1">EOF,</FONT> but not closed.
<P>
A <TT>&quot;doc&quot;</TT> can be passed plain or as a reference
to a scalar. If a reference is passed then the value of this scalar
should not be changed before all tokens have been extracted.
<P>
Next the information to be returned for the different token types must
be set up. This is done by simply associating an argspec (as defined
in HTML::Parser) with the events you have an interest in. For
instance, if you want <TT>&quot;start&quot;</TT> tokens to be reported as the string
<TT>'S'</TT> followed by the tagname and the attributes you might pass an
<TT>&quot;start&quot;</TT>-option like this:
<P>
<PRE>
$p = HTML::PullParser-&gt;new(
doc =&gt; $document_to_parse,
start =&gt; '&quot;S&quot;, tagname, @attr',
end =&gt; '&quot;E&quot;, tagname',
);
</PRE>
<P>
At last other <TT>&quot;HTML::Parser&quot;</TT> options, like <TT>&quot;ignore_tags&quot;</TT>, and
<TT>&quot;unbroken_text&quot;</TT>, can be passed in. Note that you should not use the
<I>event</I>_h options to set up parser handlers. That would confuse the
inner logic of <TT>&quot;HTML::PullParser&quot;</TT>.
<DT id="3">$token = $p-&gt;get_token<DD>
This method will return the next <I>token</I> found in the <FONT SIZE="-1">HTML</FONT> document,
or <TT>&quot;undef&quot;</TT> at the end of the document. The token is returned as an
array reference. The content of this array match the argspec set up
during <TT>&quot;HTML::PullParser&quot;</TT> construction.
<DT id="4">$p-&gt;unget_token( @tokens )<DD>
If you find out you have read too many tokens you can push them back,
so that they are returned again the next time <TT>$p</TT>-&gt;get_token is called.
</DL>
<A NAME="lbAE">&nbsp;</A>
<H2>EXAMPLES</H2>
The 'eg/hform' script shows how we might parse the form section of
HTML::Documents using HTML::PullParser.
<A NAME="lbAF">&nbsp;</A>
<H2>SEE ALSO</H2>
HTML::Parser, HTML::TokeParser
<A NAME="lbAG">&nbsp;</A>
<H2>COPYRIGHT</H2>
Copyright 1998-2001 Gisle Aas.
<P>
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
<P>
<HR>
<A NAME="index">&nbsp;</A><H2>Index</H2>
<DL>
<DT id="5"><A HREF="#lbAB">NAME</A><DD>
<DT id="6"><A HREF="#lbAC">SYNOPSIS</A><DD>
<DT id="7"><A HREF="#lbAD">DESCRIPTION</A><DD>
<DT id="8"><A HREF="#lbAE">EXAMPLES</A><DD>
<DT id="9"><A HREF="#lbAF">SEE ALSO</A><DD>
<DT id="10"><A HREF="#lbAG">COPYRIGHT</A><DD>
</DL>
<HR>
This document was created by
<A HREF="/cgi-bin/man/man2html">man2html</A>,
using the manual pages.<BR>
Time: 00:05:45 GMT, March 31, 2021
</BODY>
</HTML>