<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD><TITLE>Man page of LWP::RobotUA</TITLE>
</HEAD><BODY>
<H1>LWP::RobotUA</H1>
Section: User Contributed Perl Documentation (3pm)<BR>Updated: 2019-11-29<BR><A HREF="#index">Index</A>
<HR>
<A NAME="lbAB">&nbsp;</A>
<H2>NAME</H2>
LWP::RobotUA - a class for well-behaved Web robots
<A NAME="lbAC">&nbsp;</A>
<H2>SYNOPSIS</H2>
<PRE>
use LWP::RobotUA;
my $ua = LWP::RobotUA-&gt;new('my-robot/0.1', '<A HREF="mailto:me@foo.com">me@foo.com</A>');
$ua-&gt;delay(10); # be very nice -- max one hit every ten minutes!
...
# Then use it just like a normal LWP::UserAgent:
my $response = $ua-&gt;get('<A HREF="http://whatever.int/...">http://whatever.int/...</A>');
...
</PRE>
<A NAME="lbAD">&nbsp;</A>
<H2>DESCRIPTION</H2>
This class implements a user agent that is suitable for robot
applications. Robots should be nice to the servers they visit. They
should consult the <I>/robots.txt</I> file to ensure that they are welcomed
and they should not make requests too frequently.
<P>
But before you consider writing a robot, take a look at
&lt;URL:<A HREF="http://www.robotstxt.org/">http://www.robotstxt.org/</A>&gt;.
<P>
When you use an <I>LWP::RobotUA</I> object as your user agent, you do not
have to handle these things yourself: <TT>&quot;robots.txt&quot;</TT> files
are automatically consulted and obeyed, and requests to the same server
are spaced out so that it isn't queried too rapidly. Just send requests
as you would with a normal <I>LWP::UserAgent</I>
object (using <TT>&quot;$ua-&gt;get(...)&quot;</TT>, <TT>&quot;$ua-&gt;head(...)&quot;</TT>,
<TT>&quot;$ua-&gt;request(...)&quot;</TT>, etc.), and this
special agent will make sure you are nice.
<A NAME="lbAE">&nbsp;</A>
<H2>METHODS</H2>
LWP::RobotUA is a subclass of LWP::UserAgent and implements the
same methods. In addition, the following methods are provided:
<A NAME="lbAF">&nbsp;</A>
<H3>new</H3>
<PRE>
my $ua = LWP::RobotUA-&gt;new( %options )
my $ua = LWP::RobotUA-&gt;new( $agent, $from )
my $ua = LWP::RobotUA-&gt;new( $agent, $from, $rules )
</PRE>
<P>
The LWP::UserAgent options <TT>&quot;agent&quot;</TT> and <TT>&quot;from&quot;</TT> are mandatory. The
options <TT>&quot;delay&quot;</TT>, <TT>&quot;use_sleep&quot;</TT> and <TT>&quot;rules&quot;</TT> initialize attributes
private to the RobotUA. If <TT>&quot;rules&quot;</TT> is not provided, a
<TT>&quot;WWW::RobotRules&quot;</TT> object is instantiated to serve as an internal
database of <I>robots.txt</I> rules.
<P>
It is also possible to just pass the value of <TT>&quot;agent&quot;</TT>, <TT>&quot;from&quot;</TT> and
optionally <TT>&quot;rules&quot;</TT> as plain positional arguments.
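<P>
As a rough illustration of the named-option form, the sketch below builds
a robot with a half-minute delay and sleeping turned off. The agent name,
address and values are placeholders chosen for the example, not
recommendations.
<P>
<PRE>
use strict;
use warnings;
use LWP::RobotUA;

# Named options: agent and from are required; delay is in minutes.
my $ua = LWP::RobotUA-&gt;new(
    agent     =&gt; 'my-robot/0.1',
    from      =&gt; 'me@foo.com',
    delay     =&gt; 0.5,    # 30 seconds between hits on the same server
    use_sleep =&gt; 0,      # hand back an error response instead of sleeping
);

printf "delay: %s min, use_sleep: %s\n",
    $ua-&gt;delay, $ua-&gt;use_sleep ? 'on' : 'off';
</PRE>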
<A NAME="lbAG">&nbsp;</A>
<H3>delay</H3>
<PRE>
my $delay = $ua-&gt;delay;
$ua-&gt;delay( $minutes );
</PRE>
<P>
Get/set the minimum delay between requests to the same server, in
<I>minutes</I>. The default is <TT>1</TT> minute. Note that this number doesn't
have to be an integer; for example, this sets the delay to <TT>10</TT> seconds:
<P>
<PRE>
$ua-&gt;delay(10/60);
</PRE>
<A NAME="lbAH">&nbsp;</A>
<H3>use_sleep</H3>
<PRE>
my $bool = $ua-&gt;use_sleep;
$ua-&gt;use_sleep( $boolean );
</PRE>
<P>
Get/set a value indicating whether the <FONT SIZE="-1">UA</FONT> should sleep if
requests arrive too fast, that is, if fewer than <TT>&quot;$ua-&gt;delay&quot;</TT> minutes have
passed since the last request to the given server. The default is true. If this value is
false, an internal <TT>&quot;SERVICE_UNAVAILABLE&quot;</TT> (503) response is generated instead.
It has a <TT>&quot;Retry-After&quot;</TT> header that indicates when it is <FONT SIZE="-1">OK</FONT> to
send another request to this server.
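<P>
With sleeping turned off, the caller has to notice the throttling response
itself. The following is a minimal sketch of one way to do that;
<TT>&quot;try_fetch&quot;</TT> is a hypothetical helper, not part of LWP::RobotUA, and it
assumes an absolute http URL. A real server may also answer 503 with a
<TT>&quot;Retry-After&quot;</TT> header, in which case the helper likewise reports a wait.
<P>
<PRE>
use strict;
use warnings;
use LWP::RobotUA;
use URI;

my $ua = LWP::RobotUA-&gt;new('my-robot/0.1', 'me@foo.com');
$ua-&gt;use_sleep(0);    # never block inside the agent

# Hypothetical helper: returns ($response, $wait).
# A non-zero $wait means "come back in that many seconds".
sub try_fetch {
    my ($ua, $url) = @_;
    my $response = $ua-&gt;get($url);
    if ($response-&gt;code == 503
        &amp;&amp; defined $response-&gt;header('Retry-After')) {
        # Ask the agent how many seconds remain for this host.
        my $wait = $ua-&gt;host_wait(URI-&gt;new($url)-&gt;host_port);
        return ($response, $wait) if $wait;
    }
    return ($response, 0);
}
</PRE>
<P>
A scheduler could call <TT>&quot;try_fetch&quot;</TT> and, when the returned wait is
non-zero, requeue the URL and work on another host in the meantime.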
<A NAME="lbAI">&nbsp;</A>
<H3>rules</H3>
<PRE>
my $rules = $ua-&gt;rules;
$ua-&gt;rules( $rules );
</PRE>
<P>
Get/set the <I>WWW::RobotRules</I> object to use.
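<P>
As a sketch of how a rules object might be supplied by the caller, the
example below creates a <I>WWW::RobotRules</I> object, seeds it with a made-up
<I>robots.txt</I> for a placeholder host, and passes it to the constructor.
The host name and file content are assumptions for illustration only.
<P>
<PRE>
use strict;
use warnings;
use LWP::RobotUA;
use WWW::RobotRules;

my $agent = 'my-robot/0.1';
my $rules = WWW::RobotRules-&gt;new($agent);

# Seed the rules from an inline robots.txt (made-up content).
$rules-&gt;parse('<A HREF="http://www.example.com/robots.txt">http://www.example.com/robots.txt</A>',
              "User-agent: *\nDisallow: /private/\n");

my $ua = LWP::RobotUA-&gt;new($agent, 'me@foo.com', $rules);

# Ask the agent's rules object whether each URL may be fetched.
for my $url ('<A HREF="http://www.example.com/index.html">http://www.example.com/index.html</A>',
             '<A HREF="http://www.example.com/private/data.html">http://www.example.com/private/data.html</A>') {
    printf "%s: %s\n", $url,
        $ua-&gt;rules-&gt;allowed($url) ? 'allowed' : 'disallowed';
}
</PRE>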
<A NAME="lbAJ">&nbsp;</A>
<H3>no_visits</H3>
<PRE>
my $num = $ua-&gt;no_visits( $netloc )
</PRE>
<P>
Returns the number of documents fetched from this server host. Yeah I
know, this method should probably have been named <TT>&quot;num_visits&quot;</TT> or
something like that. :-(
<A NAME="lbAK">&nbsp;</A>
<H3>host_wait</H3>
<PRE>
my $num = $ua-&gt;host_wait( $netloc )
</PRE>
<P>
Returns the number of <I>seconds</I> (from now) you must wait before you can
make a new request to this host.
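<P>
The two bookkeeping methods above can be queried together, for instance
before scheduling the next request. This sketch assumes that <TT>&quot;$netloc&quot;</TT>
has the form host:port, as produced by the <TT>&quot;host_port&quot;</TT> method of a URI
object; the host name is a placeholder.
<P>
<PRE>
use strict;
use warnings;
use LWP::RobotUA;
use URI;

my $ua = LWP::RobotUA-&gt;new('my-robot/0.1', 'me@foo.com');

# Assumption: the netloc key is "host:port".
my $netloc = URI-&gt;new('<A HREF="http://www.example.com/index.html">http://www.example.com/index.html</A>')-&gt;host_port;

my $visits = $ua-&gt;no_visits($netloc) // 0;   # guard against undef
my $wait   = $ua-&gt;host_wait($netloc);        # seconds, not minutes

printf "%s has %d visit(s) so far; wait %d second(s) before the next request\n",
    $netloc, $visits, $wait;
</PRE>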
<A NAME="lbAL">&nbsp;</A>
<H3>as_string</H3>
<PRE>
my $string = $ua-&gt;as_string;
</PRE>
<P>
Returns a string that describes the state of the <FONT SIZE="-1">UA.</FONT>
Mainly useful for debugging.
<A NAME="lbAM">&nbsp;</A>
<H2>SEE ALSO</H2>
LWP::UserAgent, WWW::RobotRules
<A NAME="lbAN">&nbsp;</A>
<H2>COPYRIGHT</H2>
Copyright 1996-2004 Gisle Aas.
<P>
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
<P>
<HR>
<A NAME="index">&nbsp;</A><H2>Index</H2>
<DL>
<DT id="1"><A HREF="#lbAB">NAME</A><DD>
<DT id="2"><A HREF="#lbAC">SYNOPSIS</A><DD>
<DT id="3"><A HREF="#lbAD">DESCRIPTION</A><DD>
<DT id="4"><A HREF="#lbAE">METHODS</A><DD>
<DL>
<DT id="5"><A HREF="#lbAF">new</A><DD>
<DT id="6"><A HREF="#lbAG">delay</A><DD>
<DT id="7"><A HREF="#lbAH">use_sleep</A><DD>
<DT id="8"><A HREF="#lbAI">rules</A><DD>
<DT id="9"><A HREF="#lbAJ">no_visits</A><DD>
<DT id="10"><A HREF="#lbAK">host_wait</A><DD>
<DT id="11"><A HREF="#lbAL">as_string</A><DD>
</DL>
<DT id="12"><A HREF="#lbAM">SEE ALSO</A><DD>
<DT id="13"><A HREF="#lbAN">COPYRIGHT</A><DD>
</DL>
<HR>
This document was created by
<A HREF="/cgi-bin/man/man2html">man2html</A>,
using the manual pages.<BR>
Time: 00:05:47 GMT, March 31, 2021
</BODY>
</HTML>