<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD><TITLE>Man page of UCONV</TITLE>
</HEAD><BODY>
<H1>UCONV</H1>
Section: ICU 66.1 Manual (1)<BR>Updated: 2005-jul-1<BR><A HREF="#index">Index</A>
<A HREF="/cgi-bin/man/man2html">Return to Main Contents</A><HR>
<A NAME="lbAB">&nbsp;</A>
<H2>NAME</H2>
<B>uconv</B>
- convert data from one encoding to another
<A NAME="lbAC">&nbsp;</A>
<H2>SYNOPSIS</H2>
<B>uconv</B>
[
<B>-h</B>, <B>-?</B>, <B>--help</B>
]
[
<B>-V</B>, <B>--version</B>
]
[
<B>-s</B>, <B>--silent</B>
]
[
<B>-v</B>, <B>--verbose</B>
]
[
<B>-l</B>, <B>--list</B>
|
<B>-l</B>, <B>--list-code</B><I> code</I>
|
<B>--default-code</B>
|
<B>-L</B>, <B>--list-transliterators</B>
]
[
<B>--canon</B>
]
[
<B>-x</B><I> transliteration</I>
]
[
<B>--to-callback</B><I> callback</I>
|
<B>-c</B>
]
[
<B>--from-callback</B><I> callback</I>
|
<B>-i</B>
]
[
<B>--callback</B><I> callback</I>
]
[
<B>--fallback</B>
|
<B>--no-fallback</B>
]
[
<B>-b</B>, <B>--block-size</B><I> size</I>
]
[
<B>-f</B>, <B>--from-code</B><I> encoding</I>
]
[
<B>-t</B>, <B>--to-code</B><I> encoding</I>
]
[
<B>--add-signature</B>
]
[
<B>--remove-signature</B>
]
[
<B>-o</B>, <B>--output</B><I> file</I>
]
[
<I>file</I>...
]
<A NAME="lbAD">&nbsp;</A>
<H2>DESCRIPTION</H2>
<B>uconv</B>
converts, or transcodes, each given
<I>file</I>
(or standard input if no
<I>file</I>
is specified) from one
<I>encoding</I>
to another.
The transcoding is done using Unicode as a pivot encoding
(i.e. the data are first transcoded from their original encoding to
Unicode, and then from Unicode to the destination encoding).
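<P>
The pivot step can be pictured with a small Python sketch. It is only an
illustration of the two-stage idea, not how
<B>uconv</B>
or ICU is implemented; the function name
<I>transcode</I>
and the sample bytes are made up for this example.
<PRE>
```python
def transcode(data, from_code, to_code):
    """Transcode bytes via a Unicode pivot: decode first, then encode."""
    # Step 1: original encoding to Unicode (a Python str plays the pivot).
    text = data.decode(from_code)
    # Step 2: Unicode to the destination encoding.
    return text.encode(to_code)


latin1_bytes = b"caf\xe9"  # "café" in ISO-8859-1
utf8_bytes = transcode(latin1_bytes, "iso-8859-1", "utf-8")
print(utf8_bytes)  # b'caf\xc3\xa9'
```
</PRE>
Because every conversion goes through Unicode, any source encoding can be
paired with any destination encoding without a dedicated converter for
each pair.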
<P>
If an
<I>encoding</I>
is not specified or is
<B>-</B>,
the default encoding is used. Thus, calling
<B>uconv</B>
with no
<I>encoding</I>
provides an easy way to validate and sanitize data files for
further consumption by tools requiring data in the default encoding.
<P>
When calling
<B>uconv</B>,
it is possible to specify callbacks that are used to handle invalid
characters in the input, or characters that cannot be transcoded to
the destination encoding. Some encodings, for example, offer a default
substitution character that can be used to represent the occurrence of
such characters in the input. Other callbacks offer a useful visual
representation of the invalid data.
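<P>
As a rough analogy, Python's codec error handlers behave much like the
<B>stop</B>, <B>skip</B> and <B>substitute</B>
callbacks, once for each direction of the conversion. The sketch below
assumes nothing about
<B>uconv</B>
itself; in particular, the substitution character Python writes for ASCII
may differ from the one ICU uses.
<PRE>
```python
raw = b"ok \xff end"  # 0xFF can never appear in valid UTF-8

# Decoding side (comparable to --from-callback):
skipped = raw.decode("utf-8", errors="ignore")    # like "skip"
replaced = raw.decode("utf-8", errors="replace")  # like "substitute": U+FFFD
try:
    raw.decode("utf-8", errors="strict")          # like "stop"
    stopped = False
except UnicodeDecodeError:
    stopped = True

# Encoding side (comparable to --to-callback):
text = "ka: \u30ab"  # U+30AB has no ASCII representation
ascii_skip = text.encode("ascii", errors="ignore")
ascii_subst = text.encode("ascii", errors="replace")  # Python writes "?"

print(repr(skipped), repr(replaced), stopped)
print(ascii_skip, ascii_subst)
```
</PRE>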
<P>
<B>uconv</B>
can also run the specified
<I>transliteration</I>
on the transcoded data,
in which case transliteration will happen as an intermediate step,
after the data have been transcoded to Unicode.
The
<I>transliteration</I>
can be either a list of semicolon-separated transliterator names,
or an arbitrarily complex set of rules in the ICU transliteration
rules format.
<P>
For transcoding purposes,
<B>uconv</B>
options are compatible with those of
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1),
so existing scripts can usually switch to <B>uconv</B> with few changes. It is not necessarily the case,
however, that the encoding names used by
<B>uconv</B>
and ICU are the same as the ones used by
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1).
Also, informational options such as
<B>-l</B>, <B>--list</B>,
which some
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1)
variants (GNU's, for example) also offer, produce output in a slightly
different, easier-to-parse format.
<A NAME="lbAE">&nbsp;</A>
<H2>OPTIONS</H2>
<DL COMPACT>
<DT id="1"><B>-h</B>, <B>-?</B>, <B>--help</B>
<DD>
Print help about usage and exit.
<DT id="2"><B>-V</B>, <B>--version</B>
<DD>
Print the version of
<B>uconv</B>
and exit.
<DT id="3"><B>-s</B>, <B>--silent</B>
<DD>
Suppress messages during execution.
<DT id="4"><B>-v</B>, <B>--verbose</B>
<DD>
Display extra informative messages during execution.
<DT id="5"><B>-l</B>, <B>--list</B>
<DD>
List all the available encodings and exit.
<DT id="6"><B>-l</B>, <B>--list-code</B><I> code</I>
<DD>
List only the
<I>code</I>
encoding and exit. If
<I>code</I>
is not a proper encoding, exit with an error.
<DT id="7"><B>--default-code</B>
<DD>
List only the name of the default encoding and exit.
<DT id="8"><B>-L</B>, <B>--list-transliterators</B>
<DD>
List all the available transliterators and exit.
<DT id="9"><B>--canon</B>
<DD>
If used with
<B>-l</B>, <B>--list</B>
or
<B>--default-code</B>,
the list of encodings is produced in a format compatible with
<B><A HREF="/cgi-bin/man/man2html?5+convrtrs.txt">convrtrs.txt</A></B>(5).
If used with
<B>-L</B>, <B>--list-transliterators</B>,
print only one transliterator name per line.
<DT id="10"><B>-x</B><I> transliteration</I>
<DD>
Run the given
<I>transliteration</I>
on the transcoded Unicode data,
and use the transliterated data as input for the transcoding to
the destination encoding.
<DT id="11"><B>--to-callback</B><I> callback</I>
<DD>
Use
<I>callback</I>
to handle characters that cannot be transcoded to the destination
encoding. See section
<B>CALLBACKS</B>
for details on valid callbacks.
<DT id="12"><B>-c</B>
<DD>
Omit invalid characters from the output.
Same as
<B>--to-callback skip</B>.
<DT id="13"><B>--from-callback</B><I> callback</I>
<DD>
Use
<I>callback</I>
to handle characters that cannot be transcoded from the original
encoding. See section
<B>CALLBACKS</B>
for details on valid callbacks.
<DT id="14"><B>-i</B>
<DD>
Ignore invalid sequences in the input.
Same as
<B>--from-callback skip</B>.
<DT id="15"><B>--callback</B><I> callback</I>
<DD>
Use
<I>callback</I>
to handle both characters that cannot be transcoded from the original
encoding and characters that cannot be transcoded to the destination
encoding. See section
<B>CALLBACKS</B>
for details on valid callbacks.
<DT id="16"><B>--fallback</B>
<DD>
Use the fallback mapping when transcoding from
Unicode to the destination encoding.
<DT id="17"><B>--no-fallback</B>
<DD>
Do not use the fallback mapping when transcoding from Unicode to the
destination encoding.
This is the default.
<DT id="18"><B>-b</B>, <B>--block-size</B><I> size</I>
<DD>
Read input in blocks of
<I>size</I>
bytes at a time. The default block size is
4096.
<DT id="19"><B>-f</B>, <B>--from-code</B><I> encoding</I>
<DD>
Set the original encoding of the data to
<I>encoding</I>.
<DT id="20"><B>-t</B>, <B>--to-code</B><I> encoding</I>
<DD>
Transcode the data to
<I>encoding</I>.
<DT id="21"><B>--add-signature</B>
<DD>
Add a U+FEFF Unicode signature character (BOM) if the output charset
supports it and does not already add one itself.
<DT id="22"><B>--remove-signature</B>
<DD>
Remove a U+FEFF Unicode signature character (BOM).
<DT id="23"><B>-o</B>, <B>--output</B><I> file</I>
<DD>
Write the transcoded data to
<I>file</I>.
</DL>
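<P>
The effect of
<B>--add-signature</B>
and
<B>--remove-signature</B>
described above can be sketched on Unicode text as follows. This is a
simplified illustration with hypothetical helper names; it does not model
the check for whether the output charset supports a signature.
<PRE>
```python
BOM = "\ufeff"  # U+FEFF, the Unicode signature character


def add_signature(text):
    """Prepend U+FEFF unless the text already starts with one."""
    return text if text.startswith(BOM) else BOM + text


def remove_signature(text):
    """Drop a single leading U+FEFF, if present."""
    return text[1:] if text.startswith(BOM) else text


signed = add_signature("hello")
print(signed.encode("utf-8"))    # b'\xef\xbb\xbfhello'
print(remove_signature(signed))  # hello
```
</PRE>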
<A NAME="lbAF">&nbsp;</A>
<H2>CALLBACKS</H2>
<B>uconv</B>
supports specifying callbacks to handle invalid data. Callbacks can be
set for both directions of transcoding: from the original encoding to
Unicode, with the
<B>--from-callback</B>
option, and from Unicode to the destination encoding, with the
<B>--to-callback</B>
option.
<P>
The following is a list of valid
<I>callback</I>
names, along with a description of their behavior. The list of
callbacks actually supported by
<B>uconv</B>
is displayed when it is called with
<B>-h</B>, <B>--help</B>.
<P>
<DL COMPACT>
<DT id="24"><B>substitute</B>
<DD>
Write the encoding's substitute sequence, or the Unicode
replacement character
<B>U+FFFD</B>
when transcoding to Unicode.
<DT id="25"><B>skip</B>
<DD>
Ignore the invalid data.
<DT id="26"><B>stop</B>
<DD>
Stop with an error when encountering invalid data.
This is the default callback.
<DT id="27"><B>escape</B>
<DD>
Same as
<B>escape-icu</B>.
<DT id="28"><B>escape-icu</B>
<DD>
Replace the missing characters with a string of the format
<B>%U</B><I>hhhh</I>
for plane 0 characters, and
<B>%U</B><I>hhhh</I>%U<I>hhhh</I>
for planes 1 and above characters,
where
<I>hhhh</I>
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
<DT id="29"><B>escape-java</B>
<DD>
Replace the missing characters with a string of the format
<B>\u</B><I>hhhh</I>
for plane 0 characters, and
<B>\u</B><I>hhhh</I>\u<I>hhhh</I>
for planes 1 and above characters,
where
<I>hhhh</I>
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
<DT id="30"><B>escape-c</B>
<DD>
Replace the missing characters with a string of the format
<B>\u</B><I>hhhh</I>
for plane 0 characters, and
<B>\U</B><I>hhhhhhhh</I>
for planes 1 and above characters,
where
<I>hhhh</I>
and
<I>hhhhhhhh</I>
are the hexadecimal values of the Unicode codepoint.
<DT id="31"><B>escape-xml</B>
<DD>
Same as
<B>escape-xml-hex</B>.
<DT id="32"><B>escape-xml-hex</B>
<DD>
Replace the missing characters with a string of the format
<B>&amp;#x</B><I>hhhh</I>;,
where
<I>hhhh</I>
is the hexadecimal value of the Unicode codepoint.
<DT id="33"><B>escape-xml-dec</B>
<DD>
Replace the missing characters with a string of the format
<B>&amp;#</B><I>nnnn</I>;,
where
<I>nnnn</I>
is the decimal value of the Unicode codepoint.
<DT id="34"><B>escape-unicode</B>
<DD>
Replace the missing characters with a string of the format
<B>{U+</B><I>hhhh</I>},
where
<I>hhhh</I>
is the hexadecimal value of the Unicode codepoint.
That hexadecimal string is of variable length and can use from 4 to
6 digits.
This is the notation commonly used to denote a Unicode codepoint in
the literature; the curly braces make those substitutions easy to
recognize in the output.
</DL>
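<P>
The escape formats listed above differ mainly in whether they write UTF-16
code units or the codepoint itself. The following sketch reproduces four
of them from the descriptions given here; the function names are
hypothetical, the use of uppercase hexadecimal digits is an assumption,
and the two XML forms are left out because they follow the same
codepoint-based pattern.
<PRE>
```python
def utf16_units(cp):
    """Return the UTF-16 code units for a code point (one unit or a pair)."""
    if cp in range(0x10000):
        return [cp]  # plane 0: a single code unit
    offset = cp - 0x10000
    return [0xD800 + (offset // 0x400), 0xDC00 + (offset % 0x400)]


def escape(cp, style):
    """Format code point cp in the style of the named callback."""
    if style == "escape-icu":
        return "".join("%%U%04X" % u for u in utf16_units(cp))
    if style == "escape-java":
        return "".join("\\u%04X" % u for u in utf16_units(cp))
    if style == "escape-c":
        return "\\u%04X" % cp if cp in range(0x10000) else "\\U%08X" % cp
    if style == "escape-unicode":
        return "{U+%04X}" % cp  # 4 to 6 digits, as needed
    raise ValueError("unknown style: " + style)


for style in ("escape-icu", "escape-java", "escape-c", "escape-unicode"):
    print(style, escape(0x30AB, style), escape(0x1F600, style))
```
</PRE>
For U+1F600, the first two styles print the surrogate pair D83D DE00,
while
<B>escape-c</B>
and
<B>escape-unicode</B>
print the codepoint value directly.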
<A NAME="lbAG">&nbsp;</A>
<H2>EXAMPLES</H2>
Convert data from a given
<I>encoding</I>
to the platform encoding:
<P>
<DL COMPACT><DT id="35"><DD>
<B></B>$ uconv -f <I>encoding</I>
</DL>
<P>
Check if a
<I>file</I>
contains valid data for a given
<I>encoding</I>:
<P>
<DL COMPACT><DT id="36"><DD>
<B></B>$ uconv -f <I>encoding</I> -c <I>file</I> &gt;/dev/null
</DL>
<P>
Convert a UTF-8
<I>file</I>
to a given
<I>encoding</I>
and write any character missing from that encoding as a decimal numeric
character reference, so that the result is usable in any version of HTML:
<P>
<DL COMPACT><DT id="37"><DD>
<B></B>$ uconv -f utf-8 -t <I>encoding</I> \
<BR>
<B> --callback escape-xml-dec </B><I>file</I>
</DL>
<P>
Display the names of the Unicode code points in a UTF-8 file:
<P>
<DL COMPACT><DT id="38"><DD>
<B></B>$ uconv -f utf-8 -x any-name <I>file</I>
</DL>
<P>
Print the name of a Unicode code point whose value is known (<B>U+30AB</B>
in this example):
<P>
<DL COMPACT><DT id="39"><DD>
<B></B>$ echo '\u30ab' | uconv -x 'hex-any; any-name'; echo
<BR>
{KATAKANA LETTER KA}{LINE FEED}
<BR>
$
</DL>
<P>
(The names are delimited by curly braces.
The name of the line terminator is displayed as well.)
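<P>
For comparison, a similar name lookup can be sketched with Python's
<I>unicodedata</I>
module. This is not the ICU transliterator; the
<I>names_of</I>
helper and its U+XXXX fallback for characters without a formal name are
assumptions of the sketch, so control characters are labelled differently
than in the output shown above.
<PRE>
```python
import unicodedata


def names_of(text):
    """Return each character's Unicode name wrapped in curly braces."""
    parts = []
    for ch in text:
        # Characters without a formal name (controls, for instance) fall
        # back to a U+XXXX label; this fallback is a choice of the sketch.
        label = unicodedata.name(ch, "U+%04X" % ord(ch))
        parts.append("{" + label + "}")
    return "".join(parts)


print(names_of("\u30ab"))  # {KATAKANA LETTER KA}
```
</PRE>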
<P>
Normalize UTF-8 data using Unicode NFKC, remove all control characters,
and map Katakana to Hiragana:
<P>
<DL COMPACT><DT id="40"><DD>
<B></B>$ uconv -f utf-8 -t utf-8 \
<BR>
<B> -x '::nfkc; [:Cc:] &gt;; ::katakana-hiragana;'</B>
</DL>
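<P>
The three rules in that transliteration can be approximated in Python as
shown below. It is a sketch under simplifying assumptions: the Katakana
step only shifts the block U+30A1 to U+30F6 down by 0x60, which covers far
fewer cases than the ICU transliterator, and the function name is
hypothetical.
<PRE>
```python
import unicodedata


def nfkc_strip_to_hiragana(text):
    """NFKC-normalize, drop control characters, map Katakana to Hiragana."""
    out = []
    for ch in unicodedata.normalize("NFKC", text):
        if unicodedata.category(ch) == "Cc":
            continue  # general category Cc: control characters
        cp = ord(ch)
        # Katakana U+30A1..U+30F6 sit exactly 0x60 above their Hiragana twins.
        if cp in range(0x30A1, 0x30F7):
            ch = chr(cp - 0x60)
        out.append(ch)
    return "".join(out)


# Half-width KA, a BEL control character, then full-width KA.
print(nfkc_strip_to_hiragana("\uff76\x07\u30ab"))  # かか
```
</PRE>
Note that line terminators are control characters too, so this rule set
removes them as well.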
<A NAME="lbAH">&nbsp;</A>
<H2>CAVEATS AND BUGS</H2>
<B>uconv</B>
reports errors as occurring at the first invalid byte
encountered. This may be confusing to users of GNU
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1),
which reports errors as occurring at the first byte of an invalid
sequence. For multi-byte character sets or encodings, this means that
<B>uconv</B>
error positions may be at a later offset in the input stream than
would be the case with GNU
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1).
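<P>
The difference between the two conventions can be seen with Python's UTF-8
decoder, which exposes both offsets for a broken multi-byte sequence.
Which of the two numbers either tool actually prints is not verified here;
the sketch only shows that the positions differ.
<PRE>
```python
# 0xE3 0x81 opens a 3-byte UTF-8 sequence; the following "x" breaks it.
data = b"abc\xe3\x81x"

try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    sequence_start = err.start  # first byte of the invalid sequence
    offending_byte = err.end    # first byte after the incomplete sequence
    print(sequence_start, offending_byte)  # 3 5
```
</PRE>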
<P>
The reporting of error positions when a transliterator is used may be
inaccurate or unavailable, in which case
<B>uconv</B>
will report the offset in the output stream at which the error
occurred.
<A NAME="lbAI">&nbsp;</A>
<H2>AUTHORS</H2>
Jonas Utterstroem
<BR>
Yves Arrouye
<A NAME="lbAJ">&nbsp;</A>
<H2>VERSION</H2>
66.1
<A NAME="lbAK">&nbsp;</A>
<H2>COPYRIGHT</H2>
Copyright (C) 2000-2005 IBM, Inc. and others.
<A NAME="lbAL">&nbsp;</A>
<H2>SEE ALSO</H2>
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1)
<P>
<HR>
<A NAME="index">&nbsp;</A><H2>Index</H2>
<DL>
<DT id="41"><A HREF="#lbAB">NAME</A><DD>
<DT id="42"><A HREF="#lbAC">SYNOPSIS</A><DD>
<DT id="43"><A HREF="#lbAD">DESCRIPTION</A><DD>
<DT id="44"><A HREF="#lbAE">OPTIONS</A><DD>
<DT id="45"><A HREF="#lbAF">CALLBACKS</A><DD>
<DT id="46"><A HREF="#lbAG">EXAMPLES</A><DD>
<DT id="47"><A HREF="#lbAH">CAVEATS AND BUGS</A><DD>
<DT id="48"><A HREF="#lbAI">AUTHORS</A><DD>
<DT id="49"><A HREF="#lbAJ">VERSION</A><DD>
<DT id="50"><A HREF="#lbAK">COPYRIGHT</A><DD>
<DT id="51"><A HREF="#lbAL">SEE ALSO</A><DD>
</DL>
<HR>
This document was created by
<A HREF="/cgi-bin/man/man2html">man2html</A>,
using the manual pages.<BR>
Time: 00:05:28 GMT, March 31, 2021
</BODY>
</HTML>