man-pages/man3/HTML::PullParser.3pm.html


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD><TITLE>Man page of HTML::PullParser</TITLE>
</HEAD><BODY>
<H1>HTML::PullParser</H1>
Section: User Contributed Perl Documentation (3pm)<BR>Updated: 2020-02-18<BR><A HREF="#index">Index</A>
<A HREF="/cgi-bin/man/man2html">Return to Main Contents</A><HR>


<A NAME="lbAB">&nbsp;</A>
<H2>NAME</H2>

HTML::PullParser - Alternative HTML::Parser interface
<A NAME="lbAC">&nbsp;</A>
<H2>SYNOPSIS</H2>


<PRE>
 use HTML::PullParser;

 $p = HTML::PullParser-&gt;new(file =&gt; &quot;index.html&quot;,
                            start =&gt; 'event, tagname, @attr',
                            end   =&gt; 'event, tagname',
                            ignore_elements =&gt; [qw(script style)],
                           ) || die &quot;Can't open: $!&quot;;
 while (my $token = $p-&gt;get_token) {
     #...do something with $token
 }

</PRE>


<A NAME="lbAD">&nbsp;</A>
<H2>DESCRIPTION</H2>


The HTML::PullParser is an alternative interface to the HTML::Parser class.
It basically turns the HTML::Parser inside out.  You associate a file
(or any IO::Handle object or string) with the parser at construction time and
then repeatedly call <TT>$parser</TT>-&gt;get_token to obtain the tags and text
found in the parsed document.
<P>

The following methods are provided:
<DL COMPACT>
<DT id="1">$p = HTML::PullParser-&gt;new( file =&gt; $file, %options )<DD>


<DT id="2">$p = HTML::PullParser-&gt;new( doc =&gt; \$doc, %options )<DD>


A <TT>&quot;HTML::PullParser&quot;</TT> can be made to parse from either a file or a
literal document based on whether the <TT>&quot;file&quot;</TT> or <TT>&quot;doc&quot;</TT> option is
passed to the parser's constructor.


<P>


The <TT>&quot;file&quot;</TT> passed in can either be a file name or a file handle
object.  If a file name is passed, and it can't be opened for reading,
then the constructor will return an undefined value and $!  will tell
you why it failed.  Otherwise the argument is taken to be some object
that the <TT>&quot;HTML::PullParser&quot;</TT> can <B>read()</B> from when it needs more data.
The stream will be <B>read()</B> until <FONT SIZE="-1">EOF,</FONT> but not closed.


<P>


A <TT>&quot;doc&quot;</TT> can be passed plain or as a reference
to a scalar.  If a reference is passed then the value of this scalar
should not be changed before all tokens have been extracted.


<P>


Next the information to be returned for the different token types must
be set up.  This is done by simply associating an argspec (as defined
in HTML::Parser) with the events you have an interest in.  For
instance, if you want <TT>&quot;start&quot;</TT> tokens to be reported as the string
<TT>'S'</TT> followed by the tagname and the attributes you might pass an
<TT>&quot;start&quot;</TT>-option like this:


<P>


<PRE>
   $p = HTML::PullParser-&gt;new(
          doc   =&gt; $document_to_parse,
          start =&gt; '&quot;S&quot;, tagname, @attr',
          end   =&gt; '&quot;E&quot;, tagname',
        );

</PRE>


<P>


At last other <TT>&quot;HTML::Parser&quot;</TT> options, like <TT>&quot;ignore_tags&quot;</TT>, and
<TT>&quot;unbroken_text&quot;</TT>, can be passed in.  Note that you should not use the
<I>event</I>_h options to set up parser handlers.  That would confuse the
inner logic of <TT>&quot;HTML::PullParser&quot;</TT>.
<DT id="3">$token = $p-&gt;get_token<DD>


This method will return the next <I>token</I> found in the <FONT SIZE="-1">HTML</FONT> document,
or <TT>&quot;undef&quot;</TT> at the end of the document.  The token is returned as an
array reference.  The content of this array match the argspec set up
during <TT>&quot;HTML::PullParser&quot;</TT> construction.
<DT id="4">$p-&gt;unget_token( @tokens )<DD>


If you find out you have read too many tokens you can push them back,
so that they are returned again the next time <TT>$p</TT>-&gt;get_token is called.
</DL>
<A NAME="lbAE">&nbsp;</A>
<H2>EXAMPLES</H2>


The 'eg/hform' script shows how we might parse the form section of
HTML::Documents using HTML::PullParser.
<A NAME="lbAF">&nbsp;</A>
<H2>SEE ALSO</H2>


HTML::Parser, HTML::TokeParser
<A NAME="lbAG">&nbsp;</A>
<H2>COPYRIGHT</H2>


Copyright 1998-2001 Gisle Aas.
<P>

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
<P>

<HR>
<A NAME="index">&nbsp;</A><H2>Index</H2>
<DL>
<DT id="5"><A HREF="#lbAB">NAME</A><DD>
<DT id="6"><A HREF="#lbAC">SYNOPSIS</A><DD>
<DT id="7"><A HREF="#lbAD">DESCRIPTION</A><DD>
<DT id="8"><A HREF="#lbAE">EXAMPLES</A><DD>
<DT id="9"><A HREF="#lbAF">SEE ALSO</A><DD>
<DT id="10"><A HREF="#lbAG">COPYRIGHT</A><DD>
</DL>
<HR>
This document was created by
<A HREF="/cgi-bin/man/man2html">man2html</A>,
using the manual pages.<BR>
Time: 00:05:45 GMT, March 31, 2021
</BODY>
</HTML>