racket/collects/scribblings/reference/encodings.scrbl

#reader(lib "docreader.ss" "scribble")
@require["mz.ss"]

@title[#:tag "mz:encodings"]{Encodings and Locales}

When a port is provided to a character-based operation, such as
@scheme[read-char] or @scheme[read], the port's bytes are read and
interpreted as a UTF-8 encoding of characters. Thus, reading a single
character may require reading multiple bytes, and a procedure like
@scheme[char-ready?] may need to peek several bytes into the stream to
determine whether a character is available. In the case of a byte
stream that does not correspond to a valid UTF-8 encoding, functions
such as @scheme[read-char] may need to peek one byte ahead in the
stream to discover that the stream is not a valid encoding.

When an input port produces a sequence of bytes that is not a valid
UTF-8 encoding in a character-reading context, then bytes that
constitute an invalid sequence are converted to the character
@litchar{?}. Specifically, bytes 255 and 254 are always converted to
@litchar{?}, bytes in the range 192 to 253 produce @litchar{?} when
they are not followed by bytes that form a valid UTF-8 encoding, and
bytes in the range 128 to 191 are converted to @litchar{?} when they
are not part of a valid encoding that was started by a preceding byte
in the range 192 to 253. To put it another way, when reading a
sequence of bytes as characters, a minimal set of bytes are changed to
63 (which is the value that @scheme[(char->integer #\?)] produces) so
that the entire sequence of bytes is a valid UTF-8 encoding.

See @secref["mz:bytestrings"] for procedures that facilitate
conversions using UTF-8 or other encodings. See also
@scheme[reencode-input-port] and @scheme[reencode-output-port] for
obtaining a UTF-8-based port from one that uses a different encoding
of characters.


A @deftech{locale} captures information about a user's
culture-specific interpretation of character sequences. In particular,
a locale determines how strings are ``alphabetized,'' how a lowercase
character is converted to an uppercase character, and how strings are
compared without regard to case. String operations such as
@scheme[string-ci?] are @italic{not} sensitive to the current locale,
but operations such as @scheme[string-locale-ci?] (see
@secref["mz:strings"]) produce results consistent with the current
locale.

A locale also designates a particular encoding of code-point sequences
into byte sequences. Scheme generally ignores this aspect of the
locale, with a few notable exceptions: command-line arguments passed
to Scheme as byte strings are converted to character strings using the
locale's encoding; command-line strings passed as byte strings to
other processes (through @scheme[subprocess]) are converted to byte
strings using the locale's encoding; environment variables are
converted to and from strings using the locale's encoding; filesystem
paths are converted to and from strings (for display purposes) using
the locale's encoding; and, finally, Scheme provides functions such as
@scheme[string->bytes/locale] to specifically invoke a locale-specific
encoding.

A Unix user selects a locale by setting environment variables, such as
@envvar{LC_ALL}. Under Windows and Mac OS X, the operating system
provides other mechanisms for setting the locale. Within Scheme, the
current locale can be changed by setting the @scheme[current-locale]
parameter. The locale name within Scheme is a string, and the
available locale names depend on the platform and its configuration,
but the @scheme[""] locale means the current user's default locale;
under Windows and Mac OS X, the encoding for @scheme[""] is always
UTF-8, and locale-sensitive operations use the operating system's
native interface. (In particular, setting the @envvar{LC_ALL} and
@envvar{LC_CTYPE} environment variables do not affect the locale
@scheme[""] under Mac OS X. Use @scheme[getenv] and
@scheme[current-locale] to explicitly install the
environment-specified locale, if desired.) Setting the current locale
to @scheme[#f] makes locale-sensitive operations locale-insensitive,
which means using the Unicode mapping for case operations and using
UTF-8 for encoding.