199 lines
4.4 KiB
HTML
199 lines
4.4 KiB
HTML
|
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
|
<HTML><HEAD><TITLE>Man page of HTML::LinkExtor</TITLE>
|
|
</HEAD><BODY>
|
|
<H1>HTML::LinkExtor</H1>
|
|
Section: User Contributed Perl Documentation (3pm)<BR>Updated: 2020-02-18<BR><A HREF="#index">Index</A>
|
|
<A HREF="/cgi-bin/man/man2html">Return to Main Contents</A><HR>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<A NAME="lbAB"> </A>
|
|
<H2>NAME</H2>
|
|
|
|
HTML::LinkExtor - Extract links from an HTML document
|
|
<A NAME="lbAC"> </A>
|
|
<H2>SYNOPSIS</H2>
|
|
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
require HTML::LinkExtor;
|
|
$p = HTML::LinkExtor->new(\&cb, "<A HREF="http://www.perl.org/">http://www.perl.org/</A>");
|
|
sub cb {
|
|
my($tag, %links) = @_;
|
|
print "$tag @{[%links]}\n";
|
|
}
|
|
$p->parse_file("index.html");
|
|
|
|
</PRE>
|
|
|
|
|
|
<A NAME="lbAD"> </A>
|
|
<H2>DESCRIPTION</H2>
|
|
|
|
|
|
|
|
<I>HTML::LinkExtor</I> is an <FONT SIZE="-1">HTML</FONT> parser that extracts links from an
|
|
<FONT SIZE="-1">HTML</FONT> document. The <I>HTML::LinkExtor</I> is a subclass of
|
|
<I>HTML::Parser</I>. This means that the document should be given to the
|
|
parser by calling the <TT>$p</TT>-><B>parse()</B> or <TT>$p</TT>-><B>parse_file()</B> methods.
|
|
<DL COMPACT>
|
|
<DT id="1">$p = HTML::LinkExtor->new<DD>
|
|
|
|
|
|
|
|
|
|
|
|
<DT id="2">$p = HTML::LinkExtor->new( $callback )<DD>
|
|
|
|
|
|
|
|
|
|
<DT id="3">$p = HTML::LinkExtor->new( $callback, $base )<DD>
|
|
|
|
|
|
|
|
|
|
|
|
The constructor takes two optional arguments. The first is a reference
|
|
to a callback routine. It will be called as links are found. If a
|
|
callback is not provided, then links are just accumulated internally
|
|
and can be retrieved by calling the <TT>$p</TT>-><B>links()</B> method.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The <TT>$base</TT> argument is an optional base <FONT SIZE="-1">URL</FONT> used to absolutize all URLs found.
|
|
You need to have the <I></I><FONT SIZE="-1"><I>URI</I></FONT><I></I> module installed if you provide <TT>$base</TT>.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The callback is called with the lowercase tag name as first argument,
|
|
and then all link attributes as separate key/value pairs. All
|
|
non-link attributes are removed.
|
|
<DT id="4">$p->links<DD>
|
|
|
|
|
|
|
|
|
|
Returns a list of all links found in the document. The returned
|
|
values will be anonymous arrays with the following elements:
|
|
|
|
|
|
<P>
|
|
|
|
|
|
|
|
|
|
<PRE>
|
|
[$tag, $attr => $url1, $attr2 => $url2,...]
|
|
|
|
</PRE>
|
|
|
|
|
|
|
|
|
|
<P>
|
|
|
|
|
|
The <TT>$p</TT>->links method will also truncate the internal link list. This
|
|
means that if the method is called twice without any parsing
|
|
between them the second call will return an empty list.
|
|
|
|
|
|
<P>
|
|
|
|
|
|
Also note that <TT>$p</TT>->links will always be empty if a callback routine
|
|
was provided when the <I>HTML::LinkExtor</I> was created.
|
|
</DL>
|
|
<A NAME="lbAE"> </A>
|
|
<H2>EXAMPLE</H2>
|
|
|
|
|
|
|
|
This is an example showing how you can extract links from a document
|
|
received using <FONT SIZE="-1">LWP:</FONT>
|
|
<P>
|
|
|
|
|
|
|
|
<PRE>
|
|
use LWP::UserAgent;
|
|
use HTML::LinkExtor;
|
|
use URI::URL;
|
|
|
|
$url = "<A HREF="http://www.perl.org/">http://www.perl.org/</A>"; # for instance
|
|
$ua = LWP::UserAgent->new;
|
|
|
|
# Set up a callback that collect image links
|
|
my @imgs = ();
|
|
sub callback {
|
|
my($tag, %attr) = @_;
|
|
return if $tag ne 'img'; # we only look closer at <img ...>
|
|
push(@imgs, values %attr);
|
|
}
|
|
|
|
# Make the parser. Unfortunately, we don't know the base yet
|
|
# (it might be different from $url)
|
|
$p = HTML::LinkExtor->new(\&callback);
|
|
|
|
# Request document and parse it as it arrives
|
|
$res = $ua->request(HTTP::Request->new(GET => $url),
|
|
sub {$p->parse($_[0])});
|
|
|
|
# Expand all image URLs to absolute ones
|
|
my $base = $res->base;
|
|
@imgs = map { $_ = url($_, $base)->abs; } @imgs;
|
|
|
|
# Print them out
|
|
print join("\n", @imgs), "\n";
|
|
|
|
</PRE>
|
|
|
|
|
|
<A NAME="lbAF"> </A>
|
|
<H2>SEE ALSO</H2>
|
|
|
|
|
|
|
|
HTML::Parser, HTML::Tagset, <FONT SIZE="-1">LWP</FONT>, <FONT SIZE="-1">URI::URL</FONT>
|
|
<A NAME="lbAG"> </A>
|
|
<H2>COPYRIGHT</H2>
|
|
|
|
|
|
|
|
Copyright 1996-2001 Gisle Aas.
|
|
<P>
|
|
|
|
This library is free software; you can redistribute it and/or
|
|
modify it under the same terms as Perl itself.
|
|
<P>
|
|
|
|
<HR>
|
|
<A NAME="index"> </A><H2>Index</H2>
|
|
<DL>
|
|
<DT id="5"><A HREF="#lbAB">NAME</A><DD>
|
|
<DT id="6"><A HREF="#lbAC">SYNOPSIS</A><DD>
|
|
<DT id="7"><A HREF="#lbAD">DESCRIPTION</A><DD>
|
|
<DT id="8"><A HREF="#lbAE">EXAMPLE</A><DD>
|
|
<DT id="9"><A HREF="#lbAF">SEE ALSO</A><DD>
|
|
<DT id="10"><A HREF="#lbAG">COPYRIGHT</A><DD>
|
|
</DL>
|
|
<HR>
|
|
This document was created by
|
|
<A HREF="/cgi-bin/man/man2html">man2html</A>,
|
|
using the manual pages.<BR>
|
|
Time: 00:05:45 GMT, March 31, 2021
|
|
</BODY>
|
|
</HTML>
|