racket/collects/net/scribblings/url.scrbl
Matthew Flatt b8b8260379 net/url: add `current-url-encode-mode'
Changing `current-url-encode-mode' from 'recommended to 'unreserved
causes `url->string' to encode !, *, ', (, and ) using %, which
can avoid confusing some parsers.

See also https://github.com/plt/racket/pull/198
2012-12-17 06:56:38 -07:00

#lang scribble/doc
@(require "common.rkt" scribble/bnf
(for-label net/url net/url-unit net/url-sig
net/head net/uri-codec net/tcp-sig
(only-in net/url-connect current-https-protocol)
openssl))
@title[#:tag "url"]{URLs and HTTP}
@defmodule[net/url]{The @racketmodname[net/url] library provides
utilities to parse and manipulate URIs, as specified in RFC 2396
@cite["RFC2396"], and to use the HTTP protocol.}
To access the text of a document from the web, first obtain its URL as
a string. Convert the address into a @racket[url] structure using
@racket[string->url]. Then, open the document using
@racket[get-pure-port] or @racket[get-impure-port], depending on
whether or not you wish to examine its MIME headers. At this point,
you have a regular input port with which to process the document, as with
any other file.
Currently the only supported protocols are @racket["http"],
@racket["https"], and sometimes @racket["file"].
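
The steps above can be sketched as follows. This is a minimal
illustration rather than a definitive recipe: the address
@racket["http://example.com/"] is a placeholder, the names
@racket[doc-url], @racket[in], and @racket[body] are hypothetical, and
@racket[port->string] comes from @racketmodname[racket/port] rather
than from @racketmodname[net/url].
@racketblock[
(require net/url racket/port)
(code:comment "placeholder address, chosen only for illustration")
(define doc-url (string->url "http://example.com/"))
(code:comment "a pure port: the MIME headers have already been removed")
(define in (get-pure-port doc-url))
(code:comment "read the whole body, then release the connection")
(define body (port->string in))
(close-input-port in)
]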
@section{URL Structure}
@declare-exporting[net/url-structs net/url]
@defmodule*/no-declare[(net/url-structs)]{The URL structure types are
provided by the @racketmodname[net/url-structs] library, and
re-exported by @racketmodname[net/url].}
@; ----------------------------------------
@defstruct[url ([scheme (or/c false/c string?)]
[user (or/c false/c string?)]
[host (or/c false/c string?)]
[port (or/c false/c exact-nonnegative-integer?)]
[path-absolute? boolean?]
[path (listof path/param?)]
[query (listof (cons/c symbol? (or/c false/c string?)))]
[fragment (or/c false/c string?)])]{
The basic structure for all URLs, which is explained in RFC 3986
@cite["RFC3986"]. The following diagram illustrates the parts:
@verbatim[#:indent 2]|{
http://sky@www:801/cgi-bin/finger;xyz?name=shriram;host=nw#top
{-1}   {2} {3} {4}{---5---------} {6} {----7-------------} {8}

1 = scheme, 2 = user, 3 = host, 4 = port,
5 = path (two elements), 6 = param (of second path element),
7 = query, 8 = fragment
}|
The strings inside the @racket[user], @racket[path], @racket[query],
and @racket[fragment] fields are represented directly as Racket
strings, without URL-syntax-specific quoting. The procedures
@racket[string->url] and @racket[url->string] translate encodings such
as @litchar{%20} into spaces and back again.
By default, query associations are parsed with either @litchar{;} or
@litchar{&} as a separator, and they are generated with @litchar{&} as
a separator. The @racket[current-alist-separator-mode] parameter
adjusts the behavior.
An empty string at the end of the @racket[path] list corresponds to a
URL that ends in a slash. For example, the result of
@racket[(string->url "http://racket-lang.org/a/")] has a
@racket[path] field with strings @racket["a"] and @racket[""], while
the result of @racket[(string->url "http://racket-lang.org/a")] has a
@racket[path] field with only the string @racket["a"].
When a @racket["file"] URL is represented by a @racket[url] structure,
the @racket[path] field is mostly a list of path elements. For Unix
paths, the root directory is not included in @racket[path]; its
presence or absence is implicit in the @racket[path-absolute?] flag.
For Windows paths, the first element typically represents a drive, but
a UNC path is represented by a first element that is @racket[""] and
then successive elements complete the drive components that are
separated by @litchar{/} or @litchar{\}.}
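
As a sketch of how the numbered parts in the diagram above map onto
the field accessors of @racket[url] and @racket[path/param], the
following parses the same URL and selects each part in turn. The name
@racket[u] is only illustrative.
@racketblock[
(require net/url)
(define u
  (string->url
   "http://sky@www:801/cgi-bin/finger;xyz?name=shriram;host=nw#top"))
(code:comment "parts 1 through 4: scheme, user, host, port")
(url-scheme u)
(url-user u)
(url-host u)
(url-port u)
(code:comment "part 5: the strings of the two path elements")
(map path/param-path (url-path u))
(code:comment "part 6: the params of the second path element")
(path/param-param (cadr (url-path u)))
(code:comment "parts 7 and 8: query associations and fragment")
(url-query u)
(url-fragment u)
]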
@defstruct[path/param ([path (or/c string? (or/c 'up 'same))]
[param (listof string?)])]{
A pair that joins a path segment with its params in a URL.}
@; ----------------------------------------
@section{URL Functions}
An HTTP connection is created as a @deftech{pure port} or a
@deftech{impure port}. A pure port is one from which the MIME headers
have been removed, so that what remains is purely the first content
fragment. An impure port is one that still has its MIME headers.
@defproc[(string->url [str (or/c (not/c #rx"^([^:/?#]*):")
#rx"^[a-zA-Z][a-zA-Z0-9+.-]*:")])
url?]{
Parses the URL specified by @racket[str] into a @racket[url]
struct. The @racket[string->url] procedure uses
@racket[form-urlencoded->alist] when parsing the query, so it is
sensitive to the @racket[current-alist-separator-mode] parameter for
determining the association separator.
The contract on @racket[str] insists that, if the URL has a scheme,
then the scheme begins with a letter and consists only of letters,
numbers, @litchar{+}, @litchar{-}, and @litchar{.} characters.
If @racket[str] starts with @racket["file:"], then the path is always
parsed as an absolute path, and the parsing details depend on
@racket[file-url-path-convention-type]:
@itemize[
@item{@racket['unix] : If @racket["file:"] is followed by
@litchar{//} and a non-@litchar{/}, then the first element
after the @litchar{//} is parsed as a host (and maybe port);
otherwise, the first element starts the path, and the host is
@racket[""].}
@item{@racket['windows] : If @racket["file:"] is followed by
@litchar{//}, then the @litchar{//} is stripped; the remainder
parsed as a Windows path. The host is always @racket[""] and
the port is always @racket[#f].}
]}
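
The sensitivity of @racket[string->url] to
@racket[current-alist-separator-mode] might be exercised as in the
sketch below. It assumes that the parameter is obtained from
@racketmodname[net/uri-codec] and that @racket['semi] is one of its
accepted modes; the address and query are made up for illustration.
@racketblock[
(require net/url net/uri-codec)
(code:comment "hypothetical URL whose query uses only ; as a separator")
(define s "http://example.com/find?name=shriram;host=nw")
(code:comment "parse the query with the default separator mode")
(url-query (string->url s))
(code:comment "parse again, treating only ; as a separator")
(parameterize ([current-alist-separator-mode 'semi])
  (url-query (string->url s)))
]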
@defproc[(combine-url/relative [base url?] [relative string?]) url?]{
Given a base URL and a relative path, combines the two and returns a
new URL. The two are combined according to the rules in RFC 3986
@cite["RFC3986"].
This function does not raise any exceptions.}
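
A small sketch of @racket[combine-url/relative], using a placeholder
base address and a hypothetical relative reference:
@racketblock[
(require net/url)
(code:comment "placeholder base URL, not taken from any real site")
(define base (string->url "http://example.com/a/b/c"))
(code:comment "resolve a relative reference against the base")
(define combined (combine-url/relative base "../d"))
(code:comment "render the combined URL back to a string")
(url->string combined)
]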
@defproc[(netscape/string->url [str string?]) url?]{
Turns a string into a URL, applying (what appear to be) Netscape's
conventions on automatically specifying the scheme: a string starting
with a slash gets the scheme @racket["file"], while all others get the
scheme @racket["http"].}
@defproc[(url->string [URL url?]) string?]{
Generates a string corresponding to the contents of a @racket[url]
struct. For a @racket["file:"] URL, the URL must not be relative, the
result always starts @litchar{file://}, and the interpretation of the
path depends on the value of @racket[file-url-path-convention-type]:
@itemize[
@item{@racket['unix] : Elements in @racket[URL] are treated as path
elements. Empty strings in the path list are treated like
@racket['same].}
@item{@racket['windows] : If the first element is @racket[""] then
the next two elements define the UNC root, and the rest of the
elements are treated as path elements. Empty strings in the
path list are treated like @racket['same].}
]
The @racket[url->string] procedure uses
@racket[alist->form-urlencoded] when formatting the query, so it is
sensitive to the @racket[current-alist-separator-mode] parameter for
determining the association separator. The default is to separate
associations with a @litchar{&}.
The encoding of path segments and fragment is sensitive to the
@racket[current-url-encode-mode] parameter.}
@defproc[(path->url [path (or/c path-string? path-for-some-system?)])
url?]{
Converts a path to a @racket[url].}
@defproc[(url->path [URL url?]
[kind (or/c 'unix 'windows) (system-path-convention-type)])
path-for-some-system?]{
Converts @racket[URL], which is assumed to be a @racket["file"] URL,
to a path.}
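
The two conversions can be combined into a round trip, sketched here
with a hypothetical file name under the system's temporary directory.
An absolute path is used because @racket[url->string] requires a
@racket["file"] URL that is not relative.
@racketblock[
(require net/url)
(code:comment "hypothetical file name; the file does not need to exist")
(define p (build-path (find-system-path 'temp-dir) "notes.txt"))
(code:comment "path to url structure, then to a string")
(define u (path->url p))
(url->string u)
(code:comment "and back to a path for the current platform")
(url->path u)
]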
@defparam[file-url-path-convention-type kind (or/c 'unix 'windows)]{
Determines the default conversion to and from strings for
@racket["file"] URLs. See @racket[string->url] and @racket[url->string].}
@defparam[current-url-encode-mode mode (or/c 'recommended 'unreserved)]{
Determines how @racket[url->string] encodes @litchar{!}, @litchar{*}, @litchar{'}, @litchar{(},
and @litchar{)} in path segments and fragments:
@racket['recommended] leaves them as-is, while @racket['unreserved]
encodes them using @litchar{%}. The @racket['recommended] mode corresponds
to the recommendations of RFC 2396 @cite["RFC2396"], but @racket['unreserved]
avoids characters that are in some contexts mistaken for delimiters around
URLs.
Internally, @racket['recommended] mode uses
@racket[uri-path-segment-encode] and @racket[uri-encode], while
@racket['unreserved] mode uses
@racket[uri-path-segment-unreserved-encode] and
@racket[uri-unreserved-encode].}
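
To compare the two modes, one might render the same URL under each
setting, as in this sketch. The address is a placeholder whose path
segment contains parentheses; under @racket['unreserved] they should
appear @litchar{%}-encoded, as described above.
@racketblock[
(require net/url)
(code:comment "placeholder URL with ( and ) in its last path segment")
(define u (string->url "http://example.com/a(b)"))
(code:comment "default mode, 'recommended")
(url->string u)
(code:comment "same URL with the stricter encoding")
(parameterize ([current-url-encode-mode 'unreserved])
  (url->string u))
]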
@deftogether[(
@defproc[(get-pure-port [URL url?]
[header (listof string?) null]
[#:redirections redirections exact-nonnegative-integer? 0])
input-port?]
@defproc[(head-pure-port [URL url?]
[header (listof string?) null])
input-port?]
@defproc[(delete-pure-port [URL url?]
[header (listof string?) null])
input-port?]
)]{
Initiates a GET/HEAD/DELETE request for @racket[URL] and returns a
@tech{pure port} corresponding to the body of the response. The
optional list of strings can be used to send header lines to the
server.
The GET method is used to retrieve whatever information is identified
by @racket[URL]. If @racket[redirections] is not @racket[0], then
@racket[get-pure-port] will follow redirections from the server,
up to the limit given by @racket[redirections].
The HEAD method is identical to GET, except the server must not return
a message body. The meta-information returned in a response to a HEAD
request should be identical to the information in a response to a GET
request.
The DELETE method is used to delete the entity identified by
@racket[URL].
@bold{Beware:} By default, @racket["https"] scheme handling does not
verify a server's certificate (i.e., it's the equivalent of clicking
through a browser's warnings), so communication is encrypted, but the
identity of the server is not verified. To validate the server's
certificate, set @racket[current-https-protocol] to a context created
with @racket[ssl-make-client-context], and enable certificate validation
in the context with @racket[ssl-set-verify!].
The @racket["file"] scheme for URLs is handled only by
@racket[get-pure-port], which uses @racket[open-input-file], does not
handle exceptions, and ignores the optional strings.}
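
The certificate-validation steps described above might look like the
following sketch. Two things here are assumptions that the text does
not state: that trusted root certificates must also be loaded into the
context with @racket[ssl-load-verify-root-certificates!] from
@racketmodname[openssl], and that they live at the Debian-style path
shown, which will differ on other systems. The address is a
placeholder.
@racketblock[
(require net/url net/url-connect openssl)
(code:comment "a fresh client context for https connections")
(define ctx (ssl-make-client-context))
(code:comment "ASSUMPTION: location of the trusted root certificates")
(ssl-load-verify-root-certificates!
 ctx "/etc/ssl/certs/ca-certificates.crt")
(code:comment "turn on certificate validation in the context")
(ssl-set-verify! ctx #t)
(code:comment "use the context only for this request")
(define in
  (parameterize ([current-https-protocol ctx])
    (get-pure-port (string->url "https://example.com/"))))
(close-input-port in)
]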
@deftogether[(
@defproc[(get-impure-port [URL url?]
[header (listof string?) null])
input-port?]
@defproc[(head-impure-port [URL url?]
[header (listof string?) null])
input-port?]
@defproc[(delete-impure-port [URL url?]
[header (listof string?) null])
input-port?]
)]{
Like @racket[get-pure-port], etc., but the resulting @tech{impure
port} contains both the returned headers and the body. The
@racket["file"] URL scheme is not handled by these functions.}
@deftogether[(
@defproc[(post-pure-port [URL url?]
[post bytes?]
[header (listof string?) null])
input-port?]
@defproc[(put-pure-port [URL url?]
[post bytes?]
[header (listof string?) null])
input-port?]
)]{
Initiates a POST/PUT request for @racket[URL] and sends the
@racket[post] byte string. The result is a @tech{pure port} that
contains the body of the response. The optional list of
strings can be used to send header lines to the server.
@bold{Beware:} See @racket[get-pure-port] for warnings about
@racket["https"] certificate validation.}
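
A minimal sketch of a POST request follows. The address, the form
data, and the choice of header line are all hypothetical, and
@racket[port->string] is taken from @racketmodname[racket/port].
@racketblock[
(require net/url racket/port)
(code:comment "hypothetical endpoint and form-encoded payload")
(define in
  (post-pure-port
   (string->url "http://example.com/submit")
   #"name=shriram&host=nw"
   (list "Content-Type: application/x-www-form-urlencoded")))
(code:comment "the pure port carries only the response body")
(define reply (port->string in))
(close-input-port in)
]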
@deftogether[(
@defproc[(post-impure-port [URL url?]
[post bytes?]
[header (listof string?) null])
input-port?]
@defproc[(put-impure-port [URL url?]
[post bytes?]
[header (listof string?) null])
input-port?]
)]{
Like @racket[post-pure-port] and @racket[put-pure-port], but the
resulting @tech{impure port} contains both the returned headers and
body.}
@defproc[(display-pure-port [in input-port?]) void?]{
Writes the content of a pure port to the current output port, which is useful for debugging purposes.}
@defproc[(purify-port [in input-port?]) string?]{
Purifies a port, returning the MIME headers, plus a leading line of
the form @litchar{HTTP/}@nonterm{vers}@litchar{
}@nonterm{code}@litchar{ }@nonterm{message}, where @nonterm{vers} is
something like @litchar{1.0} or @litchar{1.1}, @nonterm{code} is an
exact integer for the response code, and @nonterm{message} is
arbitrary text without a return or newline.
The @racketmodname[net/head] library provides procedures, such as
@racket[extract-field], for manipulating the header.
Since web servers sometimes return mis-formatted replies,
@racket[purify-port] is liberal in what it accepts as a header. As a
result, the returned string may be ill-formed, but it will either be the
empty string, or it will be a string matching the following regexp:
@racketblock[
#rx"^HTTP/.*?(\r\n\r\n|\n\n|\r\r)"
]}
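
One way to combine @racket[get-impure-port], @racket[purify-port],
and @racket[extract-field] is sketched below. The address is a
placeholder, the choice of the @litchar{Content-Type} field is
arbitrary, and @racket[port->string] comes from
@racketmodname[racket/port].
@racketblock[
(require net/url net/head racket/port)
(code:comment "an impure port still begins with the status line and headers")
(define in (get-impure-port (string->url "http://example.com/")))
(code:comment "consume the header from the port and keep it as a string")
(define header (purify-port in))
(code:comment "look up one field; the result is #f if the field is absent")
(extract-field "Content-Type" header)
(code:comment "what remains on the port is the body")
(define body (port->string in))
(close-input-port in)
]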
@defproc[(get-pure-port/headers
[url url?]
[headers (listof string?) '()]
[#:redirections redirections exact-nonnegative-integer? 0]
[#:status? status? boolean? #f])
(values input-port? string?)]{
This function is an alternative to calling @racket[get-impure-port] and
@racket[purify-port] when needing to follow redirections.
That is, it does a GET request on @racket[url], follows up to
@racket[redirections] redirections and returns a port containing
the data as well as the headers for the final connection. If
@racket[status?] is true, then the status line is included in the
result string.
}
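
A usage sketch, with a placeholder address and an arbitrary limit of
five redirections:
@racketblock[
(require net/url racket/port)
(code:comment "two results: the body port and the header string")
(define-values (in header)
  (get-pure-port/headers (string->url "http://example.com/")
                         #:redirections 5
                         #:status? #t))
(code:comment "read the body of the final response, then close the port")
(define body (port->string in))
(close-input-port in)
]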
@defproc*[([(call/input-url [URL url?]
[connect (url? . -> . input-port?)]
[handle (input-port? . -> . any)])
any]
[(call/input-url [URL url?]
[connect (url? (listof string?) . -> . input-port?)]
[handle (input-port? . -> . any)]
[header (listof string?)])
any])]{
Given a URL and a @racket[connect] procedure like
@racket[get-pure-port] to convert the URL to an input port (either a
@tech{pure port} or @tech{impure port}), calls the @racket[handle]
procedure on the port and closes the port on return. The result of the
@racket[handle] procedure is the result of @racket[call/input-url].
When a @racket[header] argument is supplied, it is passed along to the
@racket[connect] procedure.
The connection is made in such a way that the port is closed before
@racket[call/input-url] returns, no matter how it returns. In
particular, it is closed if @racket[handle] raises an exception, or if
the connection process is interrupted by an asynchronous break
exception.}
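
Both forms of @racket[call/input-url] are sketched below, using
@racket[port->string] from @racketmodname[racket/port] as the
@racket[handle] procedure. The address and the header line are
placeholders.
@racketblock[
(require net/url racket/port)
(define u (string->url "http://example.com/"))
(code:comment "three-argument form: connect, read everything, close")
(call/input-url u get-pure-port port->string)
(code:comment "four-argument form: the header list is passed to connect")
(call/input-url u get-pure-port port->string
                (list "Accept: text/html"))
]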
@defparam[current-proxy-servers mapping (listof (list/c string? string? (integer-in 0 65535)))]{
A parameter that determines a mapping of proxy servers used for
connections. Each mapping is a list of three elements:
@itemize[
@item{the URL scheme, such as @racket["http"];}
@item{the proxy server address; and}
@item{the proxy server port number.}
]
Currently, the only proxiable scheme is @racket["http"]. The default
mapping is the empty list (i.e., no proxies).}
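
A proxy mapping can be installed for the duration of a request with
@racket[parameterize], as in this sketch. The proxy host name and port
number are hypothetical, as is the target address.
@racketblock[
(require net/url racket/port)
(code:comment "hypothetical proxy: scheme, address, and port number")
(define proxies (list (list "http" "proxy.example.com" 3128)))
(code:comment "route this one request through the proxy")
(parameterize ([current-proxy-servers proxies])
  (call/input-url (string->url "http://example.com/")
                  get-pure-port
                  port->string))
]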
@defproc[(url-exception? [x any/c])
boolean?]{
Identifies an error thrown by URL functions.
}
@section{URL HTTPS mode}
@defmodule[net/url-connect]
These bindings are provided by the @racketmodname[net/url-connect] library, and
used by @racketmodname[net/url].
@defparam[current-https-protocol protocol (or/c ssl-client-context? symbol?)]{
A parameter that determines the connection mode for @racket["https"]
connections; the parameter value is passed as the third argument to
@racket[ssl-connect] when creating an @racket["https"] connection.
Set this parameter to validate a server's certificates, for example,
as described with @racket[get-pure-port].}
@; ----------------------------------------
@section{URL Unit}
@margin-note{@racket[url@], @racket[url^], and @racket[url+scheme^] are deprecated.
They exist for backward-compatibility and will likely be removed in
the future. New code should use the @racketmodname[net/url] module.}
@defmodule[net/url-unit]
@defthing[url@ unit?]{
Imports @racket[tcp^], exports @racket[url+scheme^].
The @racket[url+scheme^] signature contains
@racket[current-connect-scheme], which @racket[url@] binds to a
parameter. The parameter is set to the scheme of a URL when
@racket[tcp-connect] is called to create a connection. A
@racket[tcp-connect] variant linked to @racket[url@] can check this
parameter to choose the connection mode; in particular, @racket[net/url]
supplies a @racket[tcp-connect] that actually uses @racket[ssl-connect]
when @racket[(current-connect-scheme)] produces @racket["https"].
Note that @racket[net/url] does not provide the
@racket[current-connect-scheme] parameter.}
@; ----------------------------------------
@section{URL Signature}
@defmodule[net/url-sig]
@defsignature[url^ ()]{
Includes everything exported by the @racketmodname[net/url] module
except @racket[current-https-protocol] and @racket[current-url-encode-mode].
Note that the exports of
@racketmodname[net/url] and the @racket[url^] signature do not include
@racket[current-connect-scheme].}
@defsignature[url+scheme^ (url^)]{
Adds @racket[current-connect-scheme] to @racket[url^].}