#lang scribble/doc
@(require "utils.ss")

@title[#:tag "im:encodings"]{String Encodings}

The @cpp{scheme_utf8_decode} function decodes a @cpp{char} array as
UTF-8 into either a UCS-4 @cpp{mzchar} array or a UTF-16 @cpp{short}
array. The @cpp{scheme_utf8_encode} function encodes either a UCS-4
@cpp{mzchar} array or a UTF-16 @cpp{short} array into a UTF-8
@cpp{char} array.

These functions can be used to check or measure an encoding or decoding
without actually producing the resulting decoding or encoding, and
variations of the functions provide control over the handling of
decoding errors.

@function[(int scheme_utf8_decode
           [const-unsigned-char* s]
           [int start]
           [int end]
           [mzchar* us]
           [int dstart]
           [int dend]
           [long* ipos]
           [char utf16]
           [int permissive])]{

Decodes a byte array as UTF-8 to produce either Unicode code points
into @var{us} (when @var{utf16} is zero) or UTF-16 code units into
@var{us} cast to @cpp{short*} (when @var{utf16} is non-zero). No nul
terminator is added to @var{us}.

The result is non-negative when all of the given bytes are decoded, and
the result is the length of the decoding (in @cpp{mzchar}s or
@cpp{short}s). A @cpp{-2} result indicates an invalid encoding sequence
in the given bytes (possibly because the range to decode ended
mid-encoding), and a @cpp{-3} result indicates that decoding stopped
because not enough room was available in the result string.

The @var{start} and @var{end} arguments specify a range of @var{s} to
be decoded. If @var{end} is negative, @cpp{strlen(@var{s})} is used as
the end.

If @var{us} is @cpp{NULL}, then the decoding is not written, but the
result is the same as if it had been written. The @var{dstart} and
@var{dend} arguments specify a target range in @var{us} (in
@cpp{mzchar} or @cpp{short} units) for the decoding; a negative value
for @var{dend} indicates that any number of units can be written to
@var{us}, which is normally sensible only when @var{us} is @cpp{NULL}
for measuring the length of the decoding.
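
As a rough illustration of what a measuring call (with @var{us} as
@cpp{NULL} and a negative @var{dend}) reports for well-formed input,
the following stand-alone C sketch counts the units of a decoding. The
name @cpp{sketch_decoded_length} is hypothetical and is not part of the
API; the sketch assumes that the input is valid UTF-8, so it does not
model the error results.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the documented API: reports how many
   units a decoding of the bytes from index start up to (not including)
   index end would occupy, ASSUMING the bytes are well-formed UTF-8.
   A non-zero utf16 counts 16-bit code units instead of code points. */
int sketch_decoded_length(const unsigned char *s, int start, int end,
                          int utf16)
{
  int i, n = 0;

  assert(s != NULL && start >= 0 && end >= start);
  for (i = start; i < end; i++) {
    if ((s[i] & 0xC0) == 0x80)
      continue;   /* continuation byte: belongs to the previous character */
    n++;          /* ASCII or lead byte: one more character */
    if (utf16 && s[i] >= 0xF0)
      n++;        /* 4-byte sequence: a surrogate pair in UTF-16 */
  }
  return n;
}
```

A caller could use such a count to size the @var{us} buffer before a
second, writing call. With real input, the result of the measuring call
should still be checked for a negative value first.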

If @var{ipos} is non-@cpp{NULL}, it is filled with the first undecoded
index within @var{s}. If the function result is non-negative, then
@cpp{*@var{ipos}} is set to the ending index (which is @var{end} if
non-negative, @cpp{strlen(@var{s})} otherwise). If the result is
@cpp{-1} or @cpp{-2}, then @cpp{*@var{ipos}} effectively indicates how
many bytes were decoded before decoding stopped.

If @var{permissive} is non-zero, its value is used as the decoding of
each byte that is not part of a valid UTF-8 encoding. Thus, the
function result can be @cpp{-2} only if @var{permissive} is @cpp{0}.

This function does not allocate or trigger garbage collection.}

@function[(int scheme_utf8_decode_as_prefix
           [const-unsigned-char* s]
           [int start]
           [int end]
           [mzchar* us]
           [int dstart]
           [int dend]
           [long* ipos]
           [char utf16]
           [int permissive])]{

Like @cpp{scheme_utf8_decode}, but the result is always the number of
decoded @cpp{mzchar}s or @cpp{short}s. If a decoding error is
encountered, the result is still the size of the decoding up until the
error.}

@function[(int scheme_utf8_decode_all
           [const-unsigned-char* s]
           [int len]
           [mzchar* us]
           [int permissive])]{

Like @cpp{scheme_utf8_decode}, but with fewer arguments. The decoding
produces UCS-4 @cpp{mzchar}s. If the buffer @var{us} is
non-@cpp{NULL}, it is assumed to be long enough to hold the decoding
(which cannot be longer than the length of the input, though it may be
shorter). If @var{len} is negative, @cpp{strlen(@var{s})} is used as
the input length.}

@function[(int scheme_utf8_decode_prefix
           [const-unsigned-char* s]
           [int len]
           [mzchar* us]
           [int permissive])]{

Like @cpp{scheme_utf8_decode}, but with fewer arguments. The decoding
produces UCS-4 @cpp{mzchar}s. The buffer @var{us} @bold{must} be
non-@cpp{NULL}, and it is assumed to be long enough to hold the
decoding (which cannot be longer than the length of the input, though
it may be shorter). If @var{len} is negative, @cpp{strlen(@var{s})} is
used as the input length.

In addition to the results of @cpp{scheme_utf8_decode}, the result can
be @cpp{-1} to indicate that the input ended with a partial (valid)
encoding. A @cpp{-1} result is possible even when @var{permissive} is
non-zero.}

@function[(mzchar* scheme_utf8_decode_to_buffer
           [const-unsigned-char* s]
           [int len]
           [mzchar* buf]
           [int blen])]{

Like @cpp{scheme_utf8_decode_all} with @var{permissive} as @cpp{0},
but if @var{buf} is not large enough (as indicated by @var{blen}) to
hold the result, a new buffer is allocated. Unlike other functions,
this one adds a nul terminator to the decoding result. The function
result is either @var{buf} (if it was big enough) or a buffer
allocated with @cpp{scheme_malloc_atomic}.}

@function[(mzchar* scheme_utf8_decode_to_buffer_len
           [const-unsigned-char* s]
           [int len]
           [mzchar* buf]
           [int blen]
           [long* ulen])]{

Like @cpp{scheme_utf8_decode_to_buffer}, but the length of the result
(not including the terminator) is placed into @var{ulen} if @var{ulen}
is non-@cpp{NULL}.}

@function[(int scheme_utf8_decode_count
           [const-unsigned-char* s]
           [int start]
           [int end]
           [int* state]
           [int might_continue]
           [int permissive])]{

Like @cpp{scheme_utf8_decode}, but without producing the decoded
@cpp{mzchar}s, and always returning the number of decoded
@cpp{mzchar}s up until a decoding error (if any). If
@var{might_continue} is non-zero, then a partial valid encoding at the
end of the input is not decoded when @var{permissive} is also
non-zero.

If @var{state} is non-@cpp{NULL}, it holds information about partial
encodings; @cpp{*@var{state}} should be set to zero for an initial
call, and then @var{state} should be passed back to
@cpp{scheme_utf8_decode_count} along with bytes that extend the given
input (i.e., without any unused partial encodings).
Typically, this mode makes sense only when @var{might_continue} and
@var{permissive} are non-zero.}

@function[(int scheme_utf8_encode
           [const-mzchar* us]
           [int start]
           [int end]
           [unsigned-char* s]
           [int dstart]
           [char utf16])]{

Encodes the given UCS-4 array of @cpp{mzchar}s (if @var{utf16} is
zero) or UTF-16 array of @cpp{short}s (if @var{utf16} is non-zero)
into @var{s}. The @var{end} argument must be no less than @var{start}.

The array @var{s} is assumed to be long enough to contain the
encoding, but no encoding is written if @var{s} is @cpp{NULL}. The
@var{dstart} argument indicates a starting place in @var{s} to hold
the encoding. No nul terminator is added to @var{s}.

The result is the number of bytes produced for the encoding (or that
would be produced if @var{s} were non-@cpp{NULL}). Encoding never
fails.

This function does not allocate or trigger garbage collection.}

@function[(int scheme_utf8_encode_all
           [const-mzchar* us]
           [int len]
           [unsigned-char* s])]{

Like @cpp{scheme_utf8_encode} with @cpp{0} for @var{start}, @var{len}
for @var{end}, @cpp{0} for @var{dstart}, and @cpp{0} for
@var{utf16}.}

@function[(char* scheme_utf8_encode_to_buffer
           [const-mzchar* s]
           [int len]
           [char* buf]
           [int blen])]{

Like @cpp{scheme_utf8_encode_all}, but the length of @var{buf} is
given, and if it is not long enough to hold the encoding, a buffer is
allocated. A nul terminator is added to the encoded array.
The result is either @var{buf} or an array allocated with
@cpp{scheme_malloc_atomic}.}

@function[(char* scheme_utf8_encode_to_buffer_len
           [const-mzchar* s]
           [int len]
           [char* buf]
           [int blen]
           [long* rlen])]{

Like @cpp{scheme_utf8_encode_to_buffer}, but the length of the
resulting encoding (not including a nul terminator) is reported in
@var{rlen} if it is non-@cpp{NULL}.}

@function[(unsigned-short* scheme_ucs4_to_utf16
           [const-mzchar* text]
           [int start]
           [int end]
           [unsigned-short* buf]
           [int bufsize]
           [long* ulen]
           [int term_size])]{

Converts a UCS-4 encoding (the indicated range of @var{text}) to a
UTF-16 encoding. The @var{end} argument must be no less than
@var{start}.

A result buffer is allocated if @var{buf} is not long enough (as
indicated by @var{bufsize}). If @var{ulen} is non-@cpp{NULL}, it is
filled with the length of the UTF-16 encoding. The @var{term_size}
argument indicates a number of @cpp{short}s to reserve at the end of
the result buffer for a terminator (but no terminator is actually
written).}

@function[(mzchar* scheme_utf16_to_ucs4
           [const-unsigned-short* text]
           [int start]
           [int end]
           [mzchar* buf]
           [int bufsize]
           [long* ulen]
           [int term_size])]{

Converts a UTF-16 encoding (the indicated range of @var{text}) to a
UCS-4 encoding. The @var{end} argument must be no less than
@var{start}.

A result buffer is allocated if @var{buf} is not long enough (as
indicated by @var{bufsize}). If @var{ulen} is non-@cpp{NULL}, it is
filled with the length of the UCS-4 encoding. The @var{term_size}
argument indicates a number of @cpp{mzchar}s to reserve at the end of
the result buffer for a terminator (but no terminator is actually
written).}
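
To make the relationship between the UCS-4 and UTF-16 lengths concrete,
the following stand-alone C sketch shows the standard surrogate-pair
conversion that a UCS-4 to UTF-16 conversion has to perform. It is an
illustration only, not the implementation behind
@cpp{scheme_ucs4_to_utf16}: the name @cpp{sketch_ucs4_to_utf16} is
hypothetical, @cpp{unsigned int} stands in for @cpp{mzchar} as an
assumption, no buffer is allocated, and the input is assumed to contain
only valid code points (at most @cpp{0x10FFFF}).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the documented API: converts the code points
   of text from index start up to (not including) index end to UTF-16
   and returns the number of 16-bit units produced. When buf is NULL,
   nothing is written and only the length is computed. ASSUMES that buf,
   if given, is large enough and that every code point is <= 0x10FFFF. */
long sketch_ucs4_to_utf16(const unsigned int *text, int start, int end,
                          unsigned short *buf)
{
  long n = 0;
  int i;

  assert(text != NULL && end >= start);
  for (i = start; i < end; i++) {
    unsigned int c = text[i];
    if (c < 0x10000) {
      /* Basic Multilingual Plane: a single code unit */
      if (buf) buf[n] = (unsigned short)c;
      n++;
    } else {
      /* Supplementary plane: split into a high and a low surrogate */
      c -= 0x10000;
      if (buf) {
        buf[n] = (unsigned short)(0xD800 | (c >> 10));
        buf[n + 1] = (unsigned short)(0xDC00 | (c & 0x3FF));
      }
      n += 2;
    }
  }
  return n;
}
```

Because each code point yields one or two code units, a UTF-16 result
needs at most twice as many units as the UCS-4 input has code points,
plus whatever space is reserved for a terminator.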