377 lines
14 KiB
Scheme
377 lines
14 KiB
Scheme
;; 1/2/2006: Added a mapping for uri path segments
|
|
;; that allows more characters to remain decoded
|
|
;; -robby
|
|
|
|
|
|
#|
|
|
|
|
People often seem to wonder why semicolons are the default in this code,
|
|
and not ampersands. Here's are the best answers we have:
|
|
|
|
From: Doug Orleans <dougorleans@gmail.com>
|
|
To: plt-scheme@list.cs.brown.edu
|
|
Subject: Re: [plt-scheme] Problem fetching a URL
|
|
Date: Wed, 11 Oct 2006 16:18:40 -0400
|
|
X-Mailer: VM 7.19 under 21.4 (patch 19) "Constant Variable" XEmacs Lucid
|
|
|
|
Robby Findler writes:
|
|
> Do you (or does anyone else) have a reference to an rfc or similar that
|
|
> actually says what the syntax for queries is supposed to be?
|
|
> rfc3986.txt (the latest url syntax rfc I know of) doesn't seem to say.
|
|
|
|
The HTML 4.01 spec defines the MIME type application/x-www-form-urlencoded:
|
|
|
|
http://www.w3.org/TR/html401/interact/forms.html#form-content-type
|
|
|
|
See also XForms, which uses a "separator" attribute whose default
|
|
value is a semicolon:
|
|
|
|
http://www.w3.org/TR/xforms/slice11.html#serialize-urlencode
|
|
http://www.w3.org/TR/xforms/slice3.html#structure-model-submission
|
|
|
|
--dougorleans@gmail.com
|
|
_________________________________________________
|
|
For list-related administrative tasks:
|
|
http://list.cs.brown.edu/mailman/listinfo/plt-scheme
|
|
|
|
|
|
|
|
|
|
|
|
|
|
From: John David Stone <stone@math.grinnell.edu>
|
|
To: plt-scheme@list.cs.brown.edu
|
|
Subject: Re: [plt-scheme] Problem fetching a URL
|
|
Date: Wed, 11 Oct 2006 11:36:14 -0500
|
|
X-Mailer: VM 7.19 under Emacs 21.4.1
|
|
|
|
-----BEGIN PGP SIGNED MESSAGE-----
|
|
Hash: SHA1
|
|
|
|
Danny Yoo:
|
|
|
|
> > Just out of curiosity, why is current-alist-separator-mode using
|
|
> > semicolons by default rather than ampersands? I understand that
|
|
> > flexibility is nice, but this is the fifth time I've seen people hit this
|
|
> > as a roadblock; shouldn't the default be what's most commonly used?
|
|
|
|
Robby Findler:
|
|
|
|
> It is my understanding that semi-colons are more standards compliant.
|
|
> That's why it is the default.
|
|
|
|
According to the RFC1738 (http://www.ietf.org/rfc/rfc1738.txt),
|
|
semicolons and ampersands are equally acceptable in URLs that begin with
|
|
`http://' (section 5, page 17), but semicolons are ``reserved'' characters
|
|
(section 3.3, page 8) and so are allowed to appear unencoded in the URL
|
|
(section 2.2, page 3), whereas ampersands are not reserved in such URLs and
|
|
so must be encoded (as, say, &). RFC2141
|
|
(http://www.ietf.org/rfc/rfc2141.txt) extends this rule to Uniform Resource
|
|
Names generally, classifying ampersands as ``excluded'' characters that
|
|
must be encoded whenever used in URNs (section 2.4, page 3).
|
|
|
|
The explanation and rationale is given in Appendix B
|
|
(``Performance, implementation, and design notes,''
|
|
http://www.w3.org/TR/html401/appendix/notes.html) of the World Wide Web
|
|
Consortium's technical report defining HTML 4.01. Section B.2.2
|
|
(``Ampersands in URI attribute values'') of that appendix notes that the
|
|
use of ampersands in URLs to carry information derived from forms is, in
|
|
practice, a serious glitch and source of errors, since it tempts careless
|
|
implementers and authors of HTML documents to insert those ampersands in
|
|
URLs without encoding them. This practice conflicts with the simple and
|
|
otherwise standard convention, derived from SGML, that an ampersand is
|
|
always the opening delimiter of a character entity reference. So the World
|
|
Wide Web Consortium encourages implementers to use and recognize semicolons
|
|
rather than ampersands in URLs that carry information derived from forms.
|
|
|
|
-----BEGIN PGP SIGNATURE-----
|
|
Version: GnuPG v1.4.5 (GNU/Linux)
|
|
Comment: Processed by Mailcrypt 3.5.8+ <http://mailcrypt.sourceforge.net/>
|
|
|
|
iD4DBQFFLR16bBGsCPR0ElQRAizVAJddgT63LKc6UWqRyHh57aqWjSXGAJ4wyseS
|
|
JALQefhDMCATcl2/bZL0bw==
|
|
=W2uS
|
|
-----END PGP SIGNATURE-----
|
|
|
|
|#
|
|
|
|
|
|
;;;
|
|
;;; <uri-codec-unit.ss> ---- En/Decode URLs and form-urlencoded data
|
|
;;; Time-stamp: <03/04/25 10:31:31 noel>
|
|
;;;
|
|
;;; Copyright (C) 2002 by Noel Welsh.
|
|
;;;
|
|
;;; This file is part of Net.
|
|
|
|
;;; Net is free software; you can redistribute it and/or
|
|
;;; modify it under the terms of the GNU Lesser General Public
|
|
;;; License as published by the Free Software Foundation; either
|
|
;;; version 2.1 of the License, or (at your option) any later version.
|
|
|
|
;;; Net is distributed in the hope that it will be useful,
|
|
;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
|
;;; Lesser General Public License for more details.
|
|
|
|
;;; You should have received a copy of the GNU Lesser General Public
|
|
;;; License along with Net; if not, write to the Free Software
|
|
;;; Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
|
|
|
|
;;; Author: Noel Welsh <noelwelsh@yahoo.com>
|
|
;;
|
|
;;
|
|
;; Commentary:
|
|
|
|
;; The module provides functions to encode and decode strings using
|
|
;; the URI encoding rules given in RFC 2396, and to encode and decode
|
|
;; name/value pairs using the application/x-www-form-urlencoded
|
|
;; mimetype given the in HTML 4.0 specification. There are minor
|
|
;; differences between the two encodings.
|
|
|
|
;; The URI encoding uses allows a few characters to be represented `as
|
|
;; is': a-Z, A-Z, 0-9, -, _, ., !, ~, *, ', ( and ). The remaining
|
|
;; characters are encoded as %xx, where xx is the hex representation
|
|
;; of the integer value of the character (where the mapping
|
|
;; character<->integer is determined by US-ASCII if the integer is
|
|
;; <128).
|
|
|
|
;; The encoding, inline with RFC 2396's recommendation, represents a
|
|
;; character as is, if possible. The decoding allows any characters
|
|
;; to be represented by their hex values, and allows characters to be
|
|
;; incorrectly represented `as is'.
|
|
|
|
;; The rules for the application/x-www-form-urlencoded mimetype given
|
|
;; in the HTML 4.0 spec are:
|
|
|
|
;; 1. Control names and values are escaped. Space characters are
|
|
;; replaced by `+', and then reserved characters are escaped as
|
|
;; described in [RFC1738], section 2.2: Non-alphanumeric characters
|
|
;; are replaced by `%HH', a percent sign and two hexadecimal digits
|
|
;; representing the ASCII code of the character. Line breaks are
|
|
;; represented as "CR LF" pairs (i.e., `%0D%0A').
|
|
|
|
;; 2. The control names/values are listed in the order they appear
|
|
;; in the document. The name is separated from the value by `=' and
|
|
;; name/value pairs are separated from each other by `&'.
|
|
|
|
;; NB: RFC 2396 supersedes RFC 1738.
|
|
|
|
;; This differs slightly from the straight encoding in RFC 2396 in
|
|
;; that `+' is allowed, and represents a space. We follow this
|
|
;; convention, encoding a space as `+' and decoding `+' as a space.
|
|
;; There appear to be some brain-dead decoders on the web, so we also
|
|
;; encode `!', `~', `'', `(' and ')' using their hex representation.
|
|
;; This is the same choice as made by the Java URLEncoder.
|
|
|
|
;; Draws inspiration from encode-decode.scm by Kurt Normark and a code
|
|
;; sample provided by Eli Barzilay
|
|
|
|
(module uri-codec-unit (lib "a-unit.ss")
|
|
|
|
(require (lib "match.ss")
|
|
(lib "string.ss")
|
|
(lib "list.ss")
|
|
(lib "etc.ss")
|
|
"uri-codec-sig.ss")
|
|
|
|
(import)
|
|
(export uri-codec^)
|
|
|
|
(define (self-map-char ch) (cons ch ch))
|
|
(define (self-map-chars str) (map self-map-char (string->list str)))
|
|
|
|
;; The characters that always map to themselves
|
|
(define alphanumeric-mapping
|
|
(self-map-chars
|
|
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"))
|
|
|
|
;; Characters that sometimes map to themselves
|
|
(define safe-mapping (self-map-chars "-_.!~*'()"))
|
|
|
|
;; The strict URI mapping
|
|
(define uri-mapping (append alphanumeric-mapping safe-mapping))
|
|
|
|
;; The uri path segment mapping from RFC 3986
|
|
(define uri-path-segment-mapping
|
|
(append alphanumeric-mapping
|
|
safe-mapping
|
|
(map (λ (c) (cons c c)) (string->list "@+,=$&:"))))
|
|
|
|
;; The form-urlencoded mapping
|
|
(define form-urlencoded-mapping
|
|
`(,@(self-map-chars ".-*_") (#\space . #\+) ,@alphanumeric-mapping))
|
|
|
|
(define (number->hex-string number)
|
|
(define (hex n) (string-ref "0123456789ABCDEF" n))
|
|
(string #\% (hex (quotient number 16)) (hex (modulo number 16))))
|
|
|
|
(define (hex-string->number hex-string)
|
|
(string->number (substring hex-string 1 3) 16))
|
|
|
|
(define ascii-size 128)
|
|
|
|
;; (listof (cons char char)) -> (values (vectorof string) (vectorof string))
|
|
(define (make-codec-tables alist)
|
|
(let ([encoding-table (build-vector ascii-size number->hex-string)]
|
|
[decoding-table (build-vector ascii-size values)])
|
|
(for-each (match-lambda
|
|
[(orig . enc)
|
|
(vector-set! encoding-table
|
|
(char->integer orig)
|
|
(string enc))
|
|
(vector-set! decoding-table
|
|
(char->integer enc)
|
|
(char->integer orig))])
|
|
alist)
|
|
(values encoding-table decoding-table)))
|
|
|
|
(define-values (uri-encoding-vector uri-decoding-vector)
|
|
(make-codec-tables uri-mapping))
|
|
|
|
(define-values (uri-path-segment-encoding-vector
|
|
uri-path-segment-decoding-vector)
|
|
(make-codec-tables uri-path-segment-mapping))
|
|
|
|
(define-values (form-urlencoded-encoding-vector
|
|
form-urlencoded-decoding-vector)
|
|
(make-codec-tables form-urlencoded-mapping))
|
|
|
|
;; vector string -> string
|
|
(define (encode table str)
|
|
(apply string-append
|
|
(map (lambda (byte)
|
|
(cond
|
|
[(< byte ascii-size)
|
|
(vector-ref table byte)]
|
|
[else (number->hex-string byte)]))
|
|
(bytes->list (string->bytes/utf-8 str)))))
|
|
|
|
;; vector string -> string
|
|
(define (decode table str)
|
|
(define internal-decode
|
|
(match-lambda
|
|
[() (list)]
|
|
[(#\% (? hex-digit? char1) (? hex-digit? char2) . rest)
|
|
;; This used to consult the table again, but I think that's
|
|
;; wrong. For example %2b should produce +, not a space.
|
|
(cons (string->number (string char1 char2) 16)
|
|
(internal-decode rest))]
|
|
[((? ascii-char? char) . rest)
|
|
(cons
|
|
(vector-ref table (char->integer char))
|
|
(internal-decode rest))]
|
|
[(char . rest)
|
|
(append
|
|
(bytes->list (string->bytes/utf-8 (string char)))
|
|
(internal-decode rest))]))
|
|
(bytes->string/utf-8
|
|
(apply bytes (internal-decode (string->list str)))))
|
|
|
|
(define (ascii-char? c)
|
|
(< (char->integer c) ascii-size))
|
|
|
|
(define (hex-digit? c)
|
|
(or (char<=? #\0 c #\9)
|
|
(char<=? #\a c #\f)
|
|
(char<=? #\A c #\F)))
|
|
|
|
;; string -> string
|
|
(define (uri-encode str)
|
|
(encode uri-encoding-vector str))
|
|
|
|
;; string -> string
|
|
(define (uri-decode str)
|
|
(decode uri-decoding-vector str))
|
|
|
|
;; string -> string
|
|
(define (uri-path-segment-encode str)
|
|
(encode uri-path-segment-encoding-vector str))
|
|
|
|
;; string -> string
|
|
(define (uri-path-segment-decode str)
|
|
(decode uri-path-segment-decoding-vector str))
|
|
|
|
;; string -> string
|
|
(define (form-urlencoded-encode str)
|
|
(encode form-urlencoded-encoding-vector str))
|
|
|
|
;; string -> string
|
|
(define (form-urlencoded-decode str)
|
|
(decode form-urlencoded-decoding-vector str))
|
|
|
|
;; listof (cons string string) -> string
|
|
;; http://www.w3.org/TR/html401/appendix/notes.html#ampersands-in-uris
|
|
;; listof (cons symbol string) -> string
|
|
(define (alist->form-urlencoded args)
|
|
(let* ([mode (current-alist-separator-mode)]
|
|
[format-one
|
|
(lambda (arg)
|
|
(let* ([name (car arg)]
|
|
[value (cdr arg)])
|
|
(string-append (form-urlencoded-encode (symbol->string name))
|
|
"="
|
|
(form-urlencoded-encode value))))]
|
|
[strs (let loop ([args args])
|
|
(cond
|
|
[(null? args) null]
|
|
[(null? (cdr args)) (list (format-one (car args)))]
|
|
[else (list* (format-one (car args))
|
|
(if (eq? mode 'amp) "&" ";")
|
|
(loop (cdr args)))]))])
|
|
(apply string-append strs)))
|
|
|
|
;; string -> listof (cons string string)
|
|
;; http://www.w3.org/TR/html401/appendix/notes.html#ampersands-in-uris
|
|
(define (form-urlencoded->alist str)
|
|
(define key-regexp #rx"[^=]*")
|
|
(define value-regexp (case (current-alist-separator-mode)
|
|
[(semi) #rx"[^;]*"]
|
|
[(amp) #rx"[^&]*"]
|
|
[else #rx"[^&;]*"]))
|
|
(define (next-key str start)
|
|
(and (< start (string-length str))
|
|
(match (regexp-match-positions key-regexp str start)
|
|
[((start . end))
|
|
(vector (let ([s (form-urlencoded-decode
|
|
(substring str start end))])
|
|
(string->symbol s))
|
|
(add1 end))]
|
|
[#f #f])))
|
|
(define (next-value str start)
|
|
(and (< start (string-length str))
|
|
(match (regexp-match-positions value-regexp str start)
|
|
[((start . end))
|
|
(vector (form-urlencoded-decode (substring str start end))
|
|
(add1 end))]
|
|
[#f #f])))
|
|
(define (next-pair str start)
|
|
(match (next-key str start)
|
|
[#(key start)
|
|
(match (next-value str start)
|
|
[#(value start)
|
|
(vector (cons key value) start)]
|
|
[#f
|
|
(vector (cons key "") (string-length str))])]
|
|
[#f #f]))
|
|
(let loop ([start 0]
|
|
[end (string-length str)]
|
|
[make-alist (lambda (x) x)])
|
|
(if (>= start end)
|
|
(make-alist '())
|
|
(match (next-pair str start)
|
|
[#(pair next-start)
|
|
(loop next-start end (lambda (x) (make-alist (cons pair x))))]
|
|
[#f (make-alist '())]))))
|
|
|
|
(define current-alist-separator-mode
|
|
(make-parameter 'amp-or-semi
|
|
(lambda (s)
|
|
(unless (memq s '(amp semi amp-or-semi))
|
|
(raise-type-error 'current-alist-separator-mode
|
|
"'amp, 'semi, or 'amp-or-semi"
|
|
s))
|
|
s))))
|
|
|
|
;;; uri-codec-unit.ss ends here
|