racket/collects/scribblings/inside/strings.scrbl
Matthew Flatt b7583984d8 Scribble insidemz
svn: r7931
2007-12-09 22:59:08 +00:00


#lang scribble/doc
@(require "utils.ss")
@title[#:tag "im:encodings"]{String Encodings}
The @cpp{scheme_utf8_decode} function decodes a @cpp{char} array as
UTF-8 into either a UCS-4 @cpp{mzchar} array or a UTF-16 @cpp{short}
array. The @cpp{scheme_utf8_encode} function encodes either a UCS-4
@cpp{mzchar} array or a UTF-16 @cpp{short} array into a UTF-8
@cpp{char} array.
These functions can also be used to check or measure an encoding or
decoding without actually producing the resulting decoding or encoding,
and variations of the functions provide control over the handling of
decoding errors.
@function[(int scheme_utf8_decode
[const-unsigned-char* s]
[int start]
[int end]
[mzchar* us]
[int dstart]
[int dend]
[long* ipos]
[char utf16]
[int permissive])]{
Decodes a byte array as UTF-8 to produce either Unicode code points
into @var{us} (when @var{utf16} is zero) or UTF-16 code units into
@var{us} cast to @cpp{short*} (when @var{utf16} is non-zero). No nul
terminator is added to @var{us}.
The result is non-negative when all of the given bytes are decoded,
and the result is the length of the decoding (in @cpp{mzchar}s or
@cpp{short}s). A @cpp{-2} result indicates an invalid encoding
sequence in the given bytes (possibly because the range to decode
ended mid-encoding), and a @cpp{-3} result indicates that decoding
stopped because not enough room was available in the result string.
The @var{start} and @var{end} arguments specify a range of @var{s} to
be decoded. If @var{end} is negative, @cpp{strlen(@var{s})} is used
as the end.
If @var{us} is @cpp{NULL}, then decoded characters are not produced, but
the result is valid as if they had been written. The
@var{dstart} and @var{dend} arguments specify a target range in
@var{us} (in @cpp{mzchar} or @cpp{short} units) for the decoding; a
negative value for @var{dend} indicates that any number of units can
be written to @var{us}, which is normally sensible only when @var{us}
is @cpp{NULL} for measuring the length of the decoding.
If @var{ipos} is non-@cpp{NULL}, it is filled with the first undecoded
index within @var{s}. If the function result is non-negative, then
@cpp{*@var{ipos}} is set to the ending index (which is @var{end} if
non-negative, @cpp{strlen(@var{s})} otherwise). If the result is
@cpp{-1} or @cpp{-2}, then @cpp{*@var{ipos}} effectively indicates
how many bytes were decoded before decoding stopped.
If @var{permissive} is non-zero, it is used as the decoding of bytes
that are not part of a valid UTF-8 encoding. Thus, the function
result can be @cpp{-2} only if @var{permissive} is @cpp{0}.
This function does not allocate or trigger garbage collection.}
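The @cpp{NULL}-buffer behavior described above suggests a measure-then-decode pattern. The following is a minimal sketch of that pattern, not the runtime's implementation: it does not call the Racket runtime, and sketch_utf8_decode and sketch_mzchar are hypothetical, simplified stand-ins (no UTF-16 mode, no ipos, no permissive) that only mirror the documented result conventions so that the example compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned int sketch_mzchar; /* assumption: stand-in for mzchar */

/* Hypothetical model of the documented result conventions: returns the
   number of code points decoded, -2 for an invalid sequence, or -3 when
   the target range [dstart, dend) is too small. A NULL `us` measures
   without writing; a negative `dend` means "no limit". */
static int sketch_utf8_decode(const unsigned char *s, int start, int end,
                              sketch_mzchar *us, int dstart, int dend)
{
  int i = start, j = dstart;

  assert(s != NULL);
  if (end < 0)
    end = (int)strlen((const char *)s);

  while (i < end) {
    unsigned int c = s[i], v, min;
    int extra, k;

    if (c < 0x80)                { v = c;        extra = 0; min = 0; }
    else if ((c & 0xE0) == 0xC0) { v = c & 0x1F; extra = 1; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { v = c & 0x0F; extra = 2; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { v = c & 0x07; extra = 3; min = 0x10000; }
    else return -2;              /* stray continuation or invalid lead byte */

    if (i + extra >= end)
      return -2;                 /* range ended mid-encoding */
    for (k = 1; k <= extra; k++) {
      if ((s[i + k] & 0xC0) != 0x80)
        return -2;
      v = (v << 6) | (s[i + k] & 0x3Fu);
    }
    /* reject overlong forms, surrogates, and values past U+10FFFF */
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
      return -2;

    if (dend >= 0 && j >= dend)
      return -3;                 /* not enough room in the target range */
    if (us)
      us[j] = v;
    j++;
    i += extra + 1;
  }
  return j - dstart;
}

/* Measure with a NULL buffer, then decode into a buffer of that size. */
static int sketch_measure_then_decode(const char *bytes,
                                      sketch_mzchar *out, int out_len)
{
  const unsigned char *s = (const unsigned char *)bytes;
  int n = sketch_utf8_decode(s, 0, -1, NULL, 0, -1);
  if (n < 0)
    return n;
  if (n > out_len)
    return -3;
  return sketch_utf8_decode(s, 0, -1, out, 0, n);
}
```

With the real function, the same two-step shape would apply: one call with a @cpp{NULL} target and a negative @var{dend} to obtain the length, then a second call with an adequately sized buffer.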
@function[(int scheme_utf8_decode_as_prefix
[const-unsigned-char* s]
[int start]
[int end]
[mzchar* us]
[int dstart]
[int dend]
[long* ipos]
[char utf16]
[int permissive])]{
Like @cpp{scheme_utf8_decode}, but the result is always the number
of decoded @cpp{mzchar}s or @cpp{short}s. If a decoding error is
encountered, the result is still the size of the decoding up until
the error.}
@function[(int scheme_utf8_decode_all
[const-unsigned-char* s]
[int len]
[mzchar* us]
[int permissive])]{
Like @cpp{scheme_utf8_decode}, but with fewer arguments. The
decoding produces UCS-4 @cpp{mzchar}s. If the buffer @var{us} is
non-@cpp{NULL}, it is assumed to be long enough to hold the decoding
(which cannot be longer than the length of the input, though it may
be shorter). If @var{len} is negative, @cpp{strlen(@var{s})} is used
as the input length.}
@function[(int scheme_utf8_decode_prefix
[const-unsigned-char* s]
[int len]
[mzchar* us]
[int permissive])]{
Like @cpp{scheme_utf8_decode}, but with fewer arguments. The
decoding produces UCS-4 @cpp{mzchar}s. The buffer @var{us}
@bold{must} be non-@cpp{NULL}, and it is assumed to be long enough to hold the
decoding (which cannot be longer than the length of the input, though
it may be shorter). If @var{len} is negative, @cpp{strlen(@var{s})}
is used as the input length.
In addition to the result of @cpp{scheme_utf8_decode}, the result
can be @cpp{-1} to indicate that the input ended with a partial
(valid) encoding. A @cpp{-1} result is possible even when
@var{permissive} is non-zero.}
@function[(mzchar* scheme_utf8_decode_to_buffer
[const-unsigned-char* s]
[int len]
[mzchar* buf]
[int blen])]{
Like @cpp{scheme_utf8_decode_all} with @var{permissive} as @cpp{0},
but if @var{buf} is not large enough (as indicated by @var{blen}) to
hold the result, a new buffer is allocated. Unlike other functions,
this one adds a nul terminator to the decoding result. The function
result is either @var{buf} (if it was big enough) or a buffer
allocated with @cpp{scheme_malloc_atomic}.}
@function[(mzchar* scheme_utf8_decode_to_buffer_len
[const-unsigned-char* s]
[int len]
[mzchar* buf]
[int blen]
[long* ulen])]{
Like @cpp{scheme_utf8_decode_to_buffer}, but the length of the
result (not including the terminator) is placed into @var{ulen} if
@var{ulen} is non-@cpp{NULL}.}
@function[(int scheme_utf8_decode_count
[const-unsigned-char* s]
[int start]
[int end]
[int* state]
[int might_continue]
[int permissive])]{
Like @cpp{scheme_utf8_decode}, but without producing the decoded
@cpp{mzchar}s, and always returning the number of decoded
@cpp{mzchar}s up until a decoding error (if any). If
@var{might_continue} is non-zero, then a partial valid encoding at
the end of the input is not decoded when @var{permissive} is also
non-zero.
If @var{state} is non-@cpp{NULL}, it holds information about partial
encodings; it should be set to zero for an initial call, and then
passed back to @cpp{scheme_utf8_decode_count} along with bytes that
extend the given input (i.e., without any unused partial
encodings). Typically, this mode makes sense only when
@var{might_continue} and @var{permissive} are non-zero.}
@function[(int scheme_utf8_encode
[const-mzchar* us]
[int start]
[int end]
[unsigned-char* s]
[int dstart]
[char utf16])]{
Encodes the given UCS-4 array of @cpp{mzchar}s (if @var{utf16} is
zero) or UTF-16 array of @cpp{short}s (if @var{utf16} is non-zero)
into @var{s}. The @var{end} argument must be no less than
@var{start}.
The array @var{s} is assumed to be long enough to contain the
encoding, but no encoding is written if @var{s} is @cpp{NULL}. The
@var{dstart} argument indicates a starting place in @var{s} to hold
the encoding. No nul terminator is added to @var{s}.
The result is the number of bytes produced for the encoding (or that
would be produced if @var{s} were non-@cpp{NULL}). Encoding never
fails.
This function does not allocate or trigger garbage collection.}
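Because the result is the same whether or not the target is @cpp{NULL}, an encoding can be measured first and written second, with the caller adding any terminator. The sketch below illustrates this under stated assumptions: sketch_utf8_encode is a hypothetical stand-in for the UCS-4 mode only (it is not the runtime's code and has no utf16 argument), and sketch_encode_with_nul is an illustrative helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the UCS-4 mode: writes the UTF-8 encoding of
   us[start..end) starting at s[dstart], or only counts bytes when s is
   NULL. Returns the number of bytes produced (or that would be). */
static int sketch_utf8_encode(const unsigned int *us, int start, int end,
                              unsigned char *s, int dstart)
{
  int i, j = dstart;

  assert(end >= start);
  for (i = start; i < end; i++) {
    unsigned int c = us[i];
    if (c < 0x80) {
      if (s) s[j] = (unsigned char)c;
      j += 1;
    } else if (c < 0x800) {
      if (s) {
        s[j]     = (unsigned char)(0xC0 | (c >> 6));
        s[j + 1] = (unsigned char)(0x80 | (c & 0x3F));
      }
      j += 2;
    } else if (c < 0x10000) {
      if (s) {
        s[j]     = (unsigned char)(0xE0 | (c >> 12));
        s[j + 1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        s[j + 2] = (unsigned char)(0x80 | (c & 0x3F));
      }
      j += 3;
    } else {
      if (s) {
        s[j]     = (unsigned char)(0xF0 | (c >> 18));
        s[j + 1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        s[j + 2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        s[j + 3] = (unsigned char)(0x80 | (c & 0x3F));
      }
      j += 4;
    }
  }
  return j - dstart;
}

/* Measure first, check capacity (including one byte for the terminator
   that the encoder itself does not add), then encode and terminate.
   Returns the encoded length, or -1 if `out` is too small. */
static int sketch_encode_with_nul(const unsigned int *us, int len,
                                  unsigned char *out, int out_size)
{
  int n = sketch_utf8_encode(us, 0, len, NULL, 0);
  if (n + 1 > out_size)
    return -1;
  sketch_utf8_encode(us, 0, len, out, 0);
  out[n] = 0;
  return n;
}
```

The capacity check happens before any write, which matches the documented assumption that the target array is already long enough when it is non-@cpp{NULL}.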
@function[(int scheme_utf8_encode_all
[const-mzchar* us]
[int len]
[unsigned-char* s])]{
Like @cpp{scheme_utf8_encode} with @cpp{0} for @var{start},
@var{len} for @var{end}, @cpp{0} for @var{dstart} and @cpp{0} for
@var{utf16}.}
@function[(char* scheme_utf8_encode_to_buffer
[const-mzchar* s]
[int len]
[char* buf]
[int blen])]{
Like @cpp{scheme_utf8_encode_all}, but the length of @var{buf} is
given, and if it is not long enough to hold the encoding, a buffer is
allocated. A nul terminator is added to the encoded array. The result
is either @var{buf} or an array allocated with
@cpp{scheme_malloc_atomic}.}
@function[(char* scheme_utf8_encode_to_buffer_len
[const-mzchar* s]
[int len]
[char* buf]
[int blen]
[long* rlen])]{
Like @cpp{scheme_utf8_encode_to_buffer}, but the length of the
resulting encoding (not including a nul terminator) is reported in
@var{rlen} if it is non-@cpp{NULL}.}
@function[(unsigned-short* scheme_ucs4_to_utf16
[const-mzchar* text]
[int start]
[int end]
[unsigned-short* buf]
[int bufsize]
[long* ulen]
[int term_size])]{
Converts a UCS-4 encoding (the indicated range of @var{text}) to a
UTF-16 encoding. The @var{end} argument must be no less than
@var{start}.
A result buffer is allocated if @var{buf} is not long enough (as
indicated by @var{bufsize}). If @var{ulen} is non-@cpp{NULL}, it is
filled with the length of the UTF-16 encoding. The @var{term_size}
argument indicates a number of @cpp{short}s to reserve at the end of
the result buffer for a terminator (but no terminator is actually
written).}
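The conversion from UCS-4 to UTF-16 amounts to splitting each code point above U+FFFF into a surrogate pair. The sketch below shows that arithmetic together with the idea of reserving terminator space. It is an illustration only: sketch_ucs4_to_utf16 is a hypothetical name, and unlike the documented function it reports failure with -1 instead of allocating a larger buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: converts text[start..end) to UTF-16 in buf.
   Returns the number of 16-bit units written, or -1 when the units plus
   term_size reserved slots do not fit in bufsize (the documented
   function would allocate a new buffer in that case). No terminator is
   written; term_size only reserves room for one. */
static long sketch_ucs4_to_utf16(const unsigned int *text, int start, int end,
                                 unsigned short *buf, int bufsize,
                                 int term_size)
{
  long n = 0;
  int i;

  assert(end >= start);

  /* First pass: count units; code points above U+FFFF need two. */
  for (i = start; i < end; i++)
    n += (text[i] >= 0x10000) ? 2 : 1;
  if (n + term_size > bufsize)
    return -1;

  /* Second pass: write units, splitting into surrogate pairs. */
  n = 0;
  for (i = start; i < end; i++) {
    unsigned int c = text[i];
    if (c >= 0x10000) {
      c -= 0x10000;
      buf[n++] = (unsigned short)(0xD800 | (c >> 10));   /* high surrogate */
      buf[n++] = (unsigned short)(0xDC00 | (c & 0x3FF)); /* low surrogate */
    } else {
      buf[n++] = (unsigned short)c;
    }
  }
  return n;
}
```

A caller that wants a nul-terminated result would pass 1 for the reserved size and store the terminator itself after the returned length.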
@function[(mzchar* scheme_utf16_to_ucs4
[const-unsigned-short* text]
[int start]
[int end]
[mzchar* buf]
[int bufsize]
[long* ulen]
[int term_size])]{
Converts a UTF-16 encoding (the indicated range of @var{text}) to a
UCS-4 encoding. The @var{end} argument must be no less than
@var{start}.
A result buffer is allocated if @var{buf} is not long enough (as
indicated by @var{bufsize}). If @var{ulen} is non-@cpp{NULL}, it is
filled with the length of the UCS-4 encoding. The @var{term_size}
argument indicates a number of @cpp{mzchar}s to reserve at the end of
the result buffer for a terminator (but no terminator is actually
written).}
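The reverse conversion combines each high/low surrogate pair into a single code point. The sketch below shows one way to do that. It is an illustration only: sketch_utf16_to_ucs4 is a hypothetical name, it returns -1 instead of allocating when the buffer is too small, and passing unpaired surrogates through unchanged is an assumption of this sketch, not documented behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: converts text[start..end) from UTF-16 to UCS-4.
   Returns the number of code points written, or -1 when the result plus
   term_size reserved slots does not fit in bufsize. Unpaired surrogates
   are copied through as-is (an assumption made for this sketch). */
static long sketch_utf16_to_ucs4(const unsigned short *text, int start, int end,
                                 unsigned int *buf, int bufsize,
                                 int term_size)
{
  long n = 0;
  int i = start;

  assert(end >= start);
  while (i < end) {
    unsigned int c = text[i++];
    if (c >= 0xD800 && c <= 0xDBFF && i < end
        && text[i] >= 0xDC00 && text[i] <= 0xDFFF) {
      /* Combine the high surrogate in c with the low surrogate at i. */
      c = 0x10000 + ((c - 0xD800) << 10) + (text[i++] - 0xDC00u);
    }
    if (n + term_size >= bufsize)
      return -1; /* no room for this code point plus the reserved slots */
    buf[n++] = c;
  }
  return n;
}
```

Since a surrogate pair collapses to one code point, the result is never longer than the input range, so a buffer sized to the input length plus the reserved slots is always sufficient for this sketch.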