Always convert string<->paths with UTF-8 on Windows

Also, document representation information on paths. In particular,
explain that Unix and Mac OS X paths are natively byte strings, while
Windows paths are natively UTF-16 code-unit sequences. The byte-string
representation of a Windows path is a UTF-8-like encoding of the UTF-16
code-unit sequence, which is why it makes no sense to convert it using
the current locale's encoding.
This commit is contained in:
Matthew Flatt 2014-10-10 11:30:28 -06:00
parent 039eac7440
commit d8793e5b8b
5 changed files with 95 additions and 34 deletions

View File

@ -463,10 +463,10 @@ Certain encoding combinations are always available:
include UTF-16 code units that are unpaired surrogates, and the include UTF-16 code units that are unpaired surrogates, and the
corresponding output includes an encoding of each surrogate in a corresponding output includes an encoding of each surrogate in a
natural extension of UTF-8. On @|AllUnix|, surrogates are natural extension of UTF-8. On @|AllUnix|, surrogates are
assumed to be paired: a pair of bytes with the bits @racket[#xD800] assumed to be paired: a pair of bytes with the bits @code{#xD800}
starts a surrogate pair, and the @racket[#x03FF] bits are used from starts a surrogate pair, and the @code{#x03FF} bits are used from
the pair and following pair (independent of the value of the the pair and following pair (independent of the value of the
@racket[#xDC00] bits). On all platforms, performance may be poor @code{#xDC00} bits). On all platforms, performance may be poor
when decoding from an odd offset within an input byte string.} when decoding from an odd offset within an input byte string.}
] ]

View File

@ -6,7 +6,10 @@
When a Racket procedure takes a filesystem path as an argument, the When a Racket procedure takes a filesystem path as an argument, the
path can be provided either as a string or as an instance of the path can be provided either as a string or as an instance of the
@deftech{path} datatype. If a string is provided, it is converted to a @deftech{path} datatype. If a string is provided, it is converted to a
path using @racket[string->path]. A Racket procedure that generates a path using @racket[string->path]. Beware that some paths may not
be representable as strings; see @secref["unixpathrep"] and
@secref["windowspathrep"] for more information.
A Racket procedure that generates a
filesystem path always generates a @tech{path} value. filesystem path always generates a @tech{path} value.
By default, paths are created and manipulated for the current By default, paths are created and manipulated for the current
@ -33,8 +36,8 @@ path before using it. Procedures that build paths or merely check the
form of a path do not cleanse paths, with the exceptions of form of a path do not cleanse paths, with the exceptions of
@racket[cleanse-path], @racket[expand-user-path], and @racket[cleanse-path], @racket[expand-user-path], and
@racket[simplify-path]. For more information about path cleansing and @racket[simplify-path]. For more information about path cleansing and
other platform-specific details, see @secref["unixpaths"] for other platform-specific details, see @secref["unixpaths"] and
@|AllUnix| paths and @secref["windowspaths"] for Windows paths. @secref["windowspaths"].
@;------------------------------------------------------------------------ @;------------------------------------------------------------------------
@section{Manipulating Paths} @section{Manipulating Paths}
@ -56,35 +59,49 @@ current platform or a non-empty string without nul characters,
Returns @racket[#t] if @racket[v] is a path value for some platform Returns @racket[#t] if @racket[v] is a path value for some platform
(not a string), @racket[#f] otherwise.} (not a string), @racket[#f] otherwise.}
@defproc[(string->path [str string?]) path?]{ @defproc[(string->path [str string?]) path?]{
Produces a path whose byte-string name is Produces a path whose byte-string encoding is
@racket[(string->bytes/locale string (char->integer #\?))]. @racket[(string->bytes/locale str (char->integer #\?))] on @|AllUnix|
or @racket[(string->bytes/utf-8 str)] on Windows.
Beware that the current locale might not encode every string, in which Beware that the current locale might not encode every string, in which
case @racket[string->path] can produce the same path for different case @racket[string->path] can produce the same path for different
@racket[str]s. See also @racket[string->path-element], which should be @racket[str]s. See also @racket[string->path-element], which should be
used instead of @racket[string->path] when a string represents a used instead of @racket[string->path] when a string represents a
single @tech{path element}. single @tech{path element}. For information on how strings and byte
strings encode paths, see @secref["unixpathrep"] and
@secref["windowspathrep"].
See also @racket[string->some-system-path], and see
@secref["unixpathrep"] and @secref["windowspathrep"] for information
on how strings encode paths.
@history[#:changed "6.1.1.1" @elem{Changed Windows conversion to always use UTF-8.}]}
See also @racket[string->some-system-path].}
@defproc[(bytes->path [bstr bytes?] @defproc[(bytes->path [bstr bytes?]
[type (or/c 'unix 'windows) (system-path-convention-type)]) [type (or/c 'unix 'windows) (system-path-convention-type)])
path?]{ path?]{
Produces a path (for some platform) whose byte-string name is Produces a path (for some platform) whose byte-string encoding is
@racket[bstr]. The optional @racket[type] specifies the convention to @racket[bstr]. The optional @racket[type] specifies the convention to
use for the path. use for the path.
For converting relative @tech{path elements} from literals, use instead For converting relative @tech{path elements} from literals, use instead
@racket[bytes->path-element], which applies a suitable encoding for @racket[bytes->path-element], which applies a suitable encoding for
individual elements.} individual elements.
For information on how byte strings encode paths, see
@secref["unixpathrep"] and @secref["windowspathrep"].}
@defproc[(path->string [path path?]) string?]{ @defproc[(path->string [path path?]) string?]{
Produces a string that represents @racket[path] by decoding Produces a string that represents @racket[path] by decoding
@racket[path]'s byte-string name using the current locale's encoding; @racket[path]'s byte-string encoding using the current locale
on @|AllUnix| and by using UTF-8 on Windows. In the former case,
@litchar{?} is used in the result string where encoding fails, and if @litchar{?} is used in the result string where encoding fails, and if
the encoding result is the empty string, then the result is the encoding result is the empty string, then the result is
@racket["?"]. @racket["?"].
@ -101,11 +118,14 @@ instead, to avoid special encodings use to represent some relative
paths. See @secref["windowspaths"] for specific information about paths. See @secref["windowspaths"] for specific information about
the conversion of Windows paths. the conversion of Windows paths.
See also @racket[some-system-path->string].} See also @racket[some-system-path->string].
@history[#:changed "6.1.1.1" @elem{Changed Windows conversion to always use UTF-8.}]}
@defproc[(path->bytes [path path-for-some-system?]) bytes?]{ @defproc[(path->bytes [path path-for-some-system?]) bytes?]{
Produces @racket[path]'s byte string representation. No information is Produces @racket[path]'s byte-string representation. No information is
lost in this translation, so that @racket[(bytes->path (path->bytes lost in this translation, so that @racket[(bytes->path (path->bytes
path) (path-convention-type path))] always produces a path that is path) (path-convention-type path))] always produces a path that is
@racket[equal?] to @racket[path]. The @racket[path] argument can be a @racket[equal?] to @racket[path]. The @racket[path] argument can be a
@ -116,23 +136,26 @@ unmarshaling paths, but manipulating the byte form of a path is
generally a mistake. In particular, the byte string may start with a generally a mistake. In particular, the byte string may start with a
@litchar{\\?\REL} encoding for Windows paths. Instead of @litchar{\\?\REL} encoding for Windows paths. Instead of
@racket[path->bytes], use @racket[split-path] and @racket[path->bytes], use @racket[split-path] and
@racket[path-element->bytes] to manipulate individual @tech{path elements}.} @racket[path-element->bytes] to manipulate individual @tech{path elements}.
For information on how byte strings encode paths, see
@secref["unixpathrep"] and @secref["windowspathrep"].}
@defproc[(string->path-element [str string?]) path?]{ @defproc[(string->path-element [str string?]) path?]{
Like @racket[string->path], except that @racket[str] corresponds to a Like @racket[string->path], except that @racket[str] corresponds to a
single relative element in a path, and it is encoded as necessary to single relative element in a path, and it is encoded as necessary to
convert it to a path. See @secref["unixpaths"] for more information convert it to a path. See @secref["unixpaths"] and
on the conversion for @|AllUnix| paths, and see @secref["windowspaths"] for more information on the conversion of
@secref["windowspaths"] for more information on the conversion for paths.
Windows paths.
If @racket[str] does not correspond to any @tech{path element} If @racket[str] does not correspond to any @tech{path element}
(e.g., it is an absolute path, or it can be split), or if it (e.g., it is an absolute path, or it can be split), or if it
corresponds to an up-directory or same-directory indicator on corresponds to an up-directory or same-directory indicator on
@|AllUnix|, then @exnraise[exn:fail:contract]. @|AllUnix|, then @exnraise[exn:fail:contract].
As for @racket[path->string], information can be lost from Like @racket[path->string], information can be lost from
@racket[str] in the locale-specific conversion to a path.} @racket[str] in the locale-specific conversion to a path.}
@ -157,7 +180,7 @@ other path is deconstructed with @racket[split-path] and
Like @racket[path->string], except that trailing path separators are Like @racket[path->string], except that trailing path separators are
removed (as by @racket[split-path]). On Windows, any removed (as by @racket[split-path]). On Windows, any
@litchar{\\?\REL} encoding prefix is also removed; see @litchar{\\?\REL} encoding prefix is also removed; see
@secref["windowspaths"] for more information on Windows paths. @secref["windowspaths"] for more information.
The @racket[path] argument must be such that @racket[split-path] The @racket[path] argument must be such that @racket[split-path]
applied to @racket[path] would return @racket['relative] as its first applied to @racket[path] would return @racket['relative] as its first
@ -245,9 +268,8 @@ is empty or contains a nul character), the
The @racket[build-path] procedure builds a path @italic{without} The @racket[build-path] procedure builds a path @italic{without}
checking the validity of the path or accessing the filesystem. checking the validity of the path or accessing the filesystem.
See @secref["unixpaths"] for more information on the construction See @secref["unixpaths"] and @secref["windowspaths"] for more
of @|AllUnix| paths, and see @secref["windowspaths"] for more information on the construction of paths.
information on the construction of Windows paths.
The following examples assume that the current directory is The following examples assume that the current directory is
@filepath{/home/joeuser} for Unix examples and @filepath{C:\Joe's Files} for @filepath{/home/joeuser} for Unix examples and @filepath{C:\Joe's Files} for
@ -420,9 +442,8 @@ true, but the source or simplified path might be a non-existent path. If
still involve a cycle of links if the cycle did not inhibit the still involve a cycle of links if the cycle did not inhibit the
simplification). simplification).
See @secref["unixpaths"] for more information on simplifying See @secref["unixpaths"] and @secref["windowspaths"] for more
@|AllUnix| paths, and see @secref["windowspaths"] for more information on simplifying paths.}
information on simplifying Windows paths.}
@defproc[(normal-case-path [path (or/c path-string? path-for-some-system?)]) @defproc[(normal-case-path [path (or/c path-string? path-for-some-system?)])
@ -489,9 +510,8 @@ platform, and resulting paths for the same platform.
This procedure does not access the filesystem. This procedure does not access the filesystem.
See @secref["unixpaths"] for more information on splitting See @secref["unixpaths"] and @secref["windowspaths"] for more
@|AllUnix| paths, and see @secref["windowspaths"] for more information on splitting paths.}
information on splitting Windows paths.}
@defproc[(explode-path [path (or/c path-string? path-for-some-system?)]) @defproc[(explode-path [path (or/c path-string? path-for-some-system?)])

View File

@ -3,7 +3,7 @@
@title[#:tag "unixpaths"]{@|AllUnix| Paths} @title[#:tag "unixpaths"]{@|AllUnix| Paths}
In @|AllUnix| paths, a @litchar{/} separates elements of the path, In a path on @|AllUnix|, a @litchar{/} separates elements of the path,
@litchar{.} as a path element always means the directory indicated by @litchar{.} as a path element always means the directory indicated by
preceding path, and @litchar{..} as a path element always means the preceding path, and @litchar{..} as a path element always means the
parent of the directory indicated by the preceding path. A leading parent of the directory indicated by the preceding path. A leading
@ -35,3 +35,13 @@ _path)]. Since that is not the case for other platforms, however,
be used when converting individual path elements. be used when converting individual path elements.
On Mac OS X, Finder aliases are zero-length files. On Mac OS X, Finder aliases are zero-length files.
@section[#:tag "unixpathrep"]{Unix Path Representation}
A path on @|AllUnix| is natively a byte string. For presentation to
users and for other string-based operations, a path is converted
to/from a string using the current locale's encoding with @litchar{?}
(encoding) or @code{#\uFFFD} (decoding) in place of errors. Beware
that the encoding may not accommodate all possible paths as
distinct strings.

View File

@ -3,7 +3,7 @@
@(define MzAdd (italic "Racket-specific:")) @(define MzAdd (italic "Racket-specific:"))
@title[#:tag "windowspaths"]{Windows Path Conventions} @title[#:tag "windowspaths"]{Windows Paths}
In general, a Windows pathname consists of an optional drive specifier In general, a Windows pathname consists of an optional drive specifier
and a drive-specific path. A Windows path can be @defterm{absolute} and a drive-specific path. A Windows path can be @defterm{absolute}
@ -101,7 +101,8 @@ include @litchar{\}.
@litchar{\\}@nonterm{machine}@litchar{\}@nonterm{volume} @litchar{\\}@nonterm{machine}@litchar{\}@nonterm{volume}
counts as the drive specifier.} counts as the drive specifier.}
@item{Normally, a path element cannot contain any of the following @item{Normally, a path element cannot contain a character in the
range @racket[#\x00] to @racket[#\x1F] nor any of the following
characters: characters:
@centerline{@litchar{<} @litchar{>} @litchar{:} @litchar{"} @litchar{/} @litchar{\} @litchar{|}} @centerline{@litchar{<} @litchar{>} @litchar{:} @litchar{"} @litchar{/} @litchar{\} @litchar{|}}
@ -314,3 +315,25 @@ produces @litchar{\\?\C:\x~\} and @litchar{\\?\REL\\aux};
the @litchar{\\?\} is needed in these cases to preserve a the @litchar{\\?\} is needed in these cases to preserve a
trailing space after @litchar{x} and to avoid referring to the AUX trailing space after @litchar{x} and to avoid referring to the AUX
device instead of an @filepath{aux} file. device instead of an @filepath{aux} file.
@section[#:tag "windowspathrep"]{Windows Path Representation}
A path on Windows is natively a sequence of UTF-16 code units, where
the sequence can include unpaired surrogates. This sequence is encoded
as a byte string through an extension of UTF-8, where unpaired
surrogates in the UTF-16 code-unit sequence are converted as if they
were non-surrogate values. The extended encodings are implemented on
Windows as the @racket["platform-UTF-16"] and
@racket["platform-UTF-8"] encodings for @racket[bytes-open-converter].
Racket's internal representation of a Windows path is a byte string,
so that @racket[path->bytes] and @racket[bytes->path] are always
inverses. When converting a path to a native UTF-16 code-unit
sequence, @racket[#\tab] is used in place of platform-UTF-8 decoding
errors (on the grounds that tab is normally disallowed as a character
in a Windows path, unlike @code{#\uFFFD}).
A Windows path is converted to a string by treating the platform-UTF-8
encoding as a UTF-8 encoding with @code{#\uFFFD} in place of
decoding errors. Similarly, a string is converted to a path by UTF-8
encoding (in which case no errors are possible).

View File

@ -813,7 +813,11 @@ static Scheme_Object *append_path(Scheme_Object *a, Scheme_Object *b)
Scheme_Object *scheme_char_string_to_path(Scheme_Object *p) Scheme_Object *scheme_char_string_to_path(Scheme_Object *p)
{ {
#ifdef DOS_FILE_SYSTEM
p = scheme_char_string_to_byte_string(p);
#else
p = scheme_char_string_to_byte_string_locale(p); p = scheme_char_string_to_byte_string_locale(p);
#endif
p->type = SCHEME_PLATFORM_PATH_KIND; p->type = SCHEME_PLATFORM_PATH_KIND;
return p; return p;
} }
@ -889,7 +893,11 @@ Scheme_Object *scheme_path_to_char_string(Scheme_Object *p)
{ {
Scheme_Object *s; Scheme_Object *s;
#ifdef DOS_FILE_SYSTEM
s = scheme_byte_string_to_char_string(p);
#else
s = scheme_byte_string_to_char_string_locale(p); s = scheme_byte_string_to_char_string_locale(p);
#endif
if (!SCHEME_CHAR_STRLEN_VAL(s)) if (!SCHEME_CHAR_STRLEN_VAL(s))
return scheme_make_utf8_string("?"); return scheme_make_utf8_string("?");