<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD><TITLE>Man page of UCONV</TITLE>
</HEAD><BODY>
<H1>UCONV</H1>
Section: ICU 66.1 Manual (1)<BR>Updated: 2005-jul-1<BR><A HREF="#index">Index</A>
<A HREF="/cgi-bin/man/man2html">Return to Main Contents</A><HR>

<A NAME="lbAB"> </A>
<H2>NAME</H2>

<B>uconv</B>

- convert data from one encoding to another
<A NAME="lbAC"> </A>
<H2>SYNOPSIS</H2>

<B>uconv</B>

[
<B>-h</B>, <B>-?</B>, <B>--help</B>

]
[
<B>-V</B>, <B>--version</B>

]
[
<B>-s</B>, <B>--silent</B>

]
[
<B>-v</B>, <B>--verbose</B>

]
[
<B>-l</B>, <B>--list</B>

|
<B>-l</B>, <B>--list-code</B><I> code</I>

|
<B>--default-code</B>

|
<B>-L</B>, <B>--list-transliterators</B>

]
[
<B>--canon</B>

]
[
<B>-x</B><I> transliteration</I>

]
[
<B>--to-callback</B><I> callback</I>

|
<B>-c</B>

]
[
<B>--from-callback</B><I> callback</I>

|
<B>-i</B>

]
[
<B>--callback</B><I> callback</I>

]
[
<B>--fallback</B>

|
<B>--no-fallback</B>

]
[
<B>-b</B>, <B>--block-size</B><I> size</I>

]
[
<B>-f</B>, <B>--from-code</B><I> encoding</I>

]
[
<B>-t</B>, <B>--to-code</B><I> encoding</I>

]
[
<B>--add-signature</B>

]
[
<B>--remove-signature</B>

]
[
<B>-o</B>, <B>--output</B><I> file</I>

]
[
<I>file</I>...

]
<A NAME="lbAD"> </A>
<H2>DESCRIPTION</H2>

<B>uconv</B>

converts, or transcodes, each given
<I>file</I>

(or standard input if no
<I>file</I>

is specified) from one
<I>encoding</I>

to another.
The transcoding is done using Unicode as a pivot encoding
(i.e. the data are first transcoded from their original encoding to
Unicode, and then from Unicode to the destination encoding).
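<P>
The pivot step can be pictured with a minimal Python sketch. It uses Python's built-in codecs rather than ICU, so it is only an analogy for what
<B>uconv</B>
does; the helper name
<I>transcode</I>
is hypothetical.

```python
def transcode(data: bytes, from_code: str, to_code: str) -> bytes:
    """Transcode bytes by pivoting through Unicode (a Python str)."""
    # Step 1: original encoding -> Unicode.
    text = data.decode(from_code)
    # Step 2: Unicode -> destination encoding.
    return text.encode(to_code)


# "café" stored as ISO-8859-1 becomes two bytes for the "é" in UTF-8.
latin1_bytes = b"caf\xe9"
print(transcode(latin1_bytes, "iso-8859-1", "utf-8"))  # b'caf\xc3\xa9'
```

Because every conversion passes through Unicode, any supported source encoding can be paired with any supported destination encoding without a dedicated table for each pair.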
<P>

If an
<I>encoding</I>

is not specified or is
<B>-</B>,

the default encoding is used. Thus, calling
<B>uconv</B>

with no
<I>encoding</I>

provides an easy way to validate and sanitize data files for
further consumption by tools requiring data in the default encoding.
<P>

When calling
<B>uconv</B>,

it is possible to specify callbacks that are used to handle invalid
characters in the input, or characters that cannot be transcoded to
the destination encoding. Some encodings, for example, offer a default
substitution character that can be used to represent the occurrence of
such characters in the input. Other callbacks offer a useful visual
representation of the invalid data.
<P>

<B>uconv</B>

can also run the specified
<I>transliteration</I>

on the transcoded data,
in which case transliteration happens as an intermediate step,
after the data have been transcoded to Unicode.
The
<I>transliteration</I>

can be either a list of semicolon-separated transliterator names,
or an arbitrarily complex set of rules in the ICU transliteration
rules format.
<P>

For transcoding purposes,
<B>uconv</B>

options are compatible with those of
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1),

making it easy to use as a replacement in scripts. It is not necessarily the case,
however, that the encoding names used by
<B>uconv</B>

and ICU are the same as the ones used by
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1).

Also, options that provide informational data, such as the
<B>-l</B>, <B>--list</B>

one offered by some
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1)

variants such as GNU's, produce data in a slightly different and
easier-to-parse format.
<A NAME="lbAE"> </A>
<H2>OPTIONS</H2>

<DL COMPACT>
<DT id="1"><B>-h</B>, <B>-?</B>, <B>--help</B>

<DD>
Print help about usage and exit.
<DT id="2"><B>-V</B>, <B>--version</B>

<DD>
Print the version of
<B>uconv</B>

and exit.
<DT id="3"><B>-s</B>, <B>--silent</B>

<DD>
Suppress messages during execution.
<DT id="4"><B>-v</B>, <B>--verbose</B>

<DD>
Display extra informative messages during execution.
<DT id="5"><B>-l</B>, <B>--list</B>

<DD>
List all the available encodings and exit.
<DT id="6"><B>-l</B>, <B>--list-code</B><I> code</I>

<DD>
List only the
<I>code</I>

encoding and exit. If
<I>code</I>

is not a valid encoding, exit with an error.
<DT id="7"><B>--default-code</B>

<DD>
List only the name of the default encoding and exit.
<DT id="8"><B>-L</B>, <B>--list-transliterators</B>

<DD>
List all the available transliterators and exit.
<DT id="9"><B>--canon</B>

<DD>
If used with
<B>-l</B>, <B>--list</B>

or
<B>--default-code</B>,

the list of encodings is produced in a format compatible with
<B><A HREF="/cgi-bin/man/man2html?5+convrtrs.txt">convrtrs.txt</A></B>(5).

If used with
<B>-L</B>, <B>--list-transliterators</B>,

print only one transliterator name per line.
<DT id="10"><B>-x</B><I> transliteration</I>

<DD>
Run the given
<I>transliteration</I>

on the transcoded Unicode data,
and use the transliterated data as input for the transcoding to
the destination encoding.
<DT id="11"><B>--to-callback</B><I> callback</I>

<DD>
Use
<I>callback</I>

to handle characters that cannot be transcoded to the destination
encoding. See section
<B>CALLBACKS</B>

for details on valid callbacks.
<DT id="12"><B>-c</B>

<DD>
Omit invalid characters from the output.
Same as
<B>--to-callback skip</B>.

<DT id="13"><B>--from-callback</B><I> callback</I>

<DD>
Use
<I>callback</I>

to handle characters that cannot be transcoded from the original
encoding. See section
<B>CALLBACKS</B>

for details on valid callbacks.
<DT id="14"><B>-i</B>

<DD>
Ignore invalid sequences in the input.
Same as
<B>--from-callback skip</B>.

<DT id="15"><B>--callback</B><I> callback</I>

<DD>
Use
<I>callback</I>

to handle both characters that cannot be transcoded from the original
encoding and characters that cannot be transcoded to the destination
encoding. See section
<B>CALLBACKS</B>

for details on valid callbacks.
<DT id="16"><B>--fallback</B>

<DD>
Use the fallback mapping when transcoding from
Unicode to the destination encoding.
<DT id="17"><B>--no-fallback</B>

<DD>
Do not use the fallback mapping when transcoding from Unicode to the
destination encoding.
This is the default.
<DT id="18"><B>-b</B>, <B>--block-size</B><I> size</I>

<DD>
Read input in blocks of
<I>size</I>

bytes at a time. The default block size is
4096.
<DT id="19"><B>-f</B>, <B>--from-code</B><I> encoding</I>

<DD>
Set the original encoding of the data to
<I>encoding</I>.

<DT id="20"><B>-t</B>, <B>--to-code</B><I> encoding</I>

<DD>
Transcode the data to
<I>encoding</I>.

<DT id="21"><B>--add-signature</B>

<DD>
Add a U+FEFF Unicode signature character (BOM) if the output charset
supports it and does not already add one.
<DT id="22"><B>--remove-signature</B>

<DD>
Remove a U+FEFF Unicode signature character (BOM).
<DT id="23"><B>-o</B>, <B>--output</B><I> file</I>

<DD>
Write the transcoded data to
<I>file</I>.

</DL>
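<P>
The effect of the two signature options on UTF-8 data can be sketched in Python. This is an illustration under the assumption that the output is UTF-8, where U+FEFF is encoded as the bytes EF BB BF; it is not how
<B>uconv</B>
is implemented, and the helper names are hypothetical.

```python
import codecs


def add_signature(utf8_data: bytes) -> bytes:
    """Prepend the UTF-8 encoding of U+FEFF unless one is already present."""
    if utf8_data.startswith(codecs.BOM_UTF8):
        return utf8_data
    return codecs.BOM_UTF8 + utf8_data


def remove_signature(utf8_data: bytes) -> bytes:
    """Drop a leading UTF-8 encoded U+FEFF, if there is one."""
    if utf8_data.startswith(codecs.BOM_UTF8):
        return utf8_data[len(codecs.BOM_UTF8):]
    return utf8_data


print(add_signature(b"abc"))                 # b'\xef\xbb\xbfabc'
print(remove_signature(b"\xef\xbb\xbfabc"))  # b'abc'
```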
<A NAME="lbAF"> </A>
<H2>CALLBACKS</H2>

<B>uconv</B>

supports specifying callbacks to handle invalid data. Callbacks can be
set for both directions of transcoding: from the original encoding to
Unicode, with the
<B>--from-callback</B>

option, and from Unicode to the destination encoding, with the
<B>--to-callback</B>

option.
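<P>
The two directions can be illustrated with Python's codec error handlers, which play a role loosely comparable to callbacks. This is an analogy only: the mapping of
<I>strict</I>,
<I>ignore</I>
and
<I>replace</I>
to
<B>stop</B>,
<B>skip</B>
and
<B>substitute</B>
is an assumption made for illustration, and the function name is hypothetical.

```python
def transcode_with_callbacks(data, from_code, to_code,
                             from_errors="strict", to_errors="strict"):
    """Pivot through Unicode with a separate error policy per direction."""
    # Direction 1 (compare --from-callback): invalid bytes in the input.
    text = data.decode(from_code, errors=from_errors)
    # Direction 2 (compare --to-callback): characters the target cannot hold.
    return text.encode(to_code, errors=to_errors)


bad_input = b"ab\xffcd"                 # 0xFF never occurs in valid UTF-8
unmappable = "caf\u00e9".encode("utf-8")  # U+00E9 does not exist in ASCII

print(transcode_with_callbacks(bad_input, "utf-8", "ascii", from_errors="ignore"))
print(transcode_with_callbacks(unmappable, "utf-8", "ascii", to_errors="replace"))
```

With the default
<I>strict</I>
policy, the first call raises UnicodeDecodeError and the second raises UnicodeEncodeError, which shows that the two failure modes are distinct and can be handled separately.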
<P>

The following is a list of valid
<I>callback</I>

names, along with a description of their behavior. The list of
callbacks actually supported by
<B>uconv</B>

is displayed when it is called with
<B>-h</B>, <B>--help</B>.

<P>

<DL COMPACT>
<DT id="24"><B>substitute</B>

<DD>
Write the encoding's substitute sequence, or the Unicode
replacement character
<B>U+FFFD</B>

when transcoding to Unicode.
<DT id="25"><B>skip</B>

<DD>
Ignore the invalid data.
<DT id="26"><B>stop</B>

<DD>
Stop with an error when encountering invalid data.
This is the default callback.
<DT id="27"><B>escape</B>

<DD>
Same as
<B>escape-icu</B>.

<DT id="28"><B>escape-icu</B>

<DD>
Replace the missing characters with a string of the format
<B>%U</B><I>hhhh</I>

for plane 0 characters, and
<B>%U</B><I>hhhh</I><B>%U</B><I>hhhh</I>

for characters in planes 1 and above,
where
<I>hhhh</I>

is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
<DT id="29"><B>escape-java</B>

<DD>
Replace the missing characters with a string of the format
<B>\u</B><I>hhhh</I>

for plane 0 characters, and
<B>\u</B><I>hhhh</I><B>\u</B><I>hhhh</I>

for characters in planes 1 and above,
where
<I>hhhh</I>

is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
<DT id="30"><B>escape-c</B>

<DD>
Replace the missing characters with a string of the format
<B>\u</B><I>hhhh</I>

for plane 0 characters, and
<B>\U</B><I>hhhhhhhh</I>

for characters in planes 1 and above,
where
<I>hhhh</I>

and
<I>hhhhhhhh</I>

are the hexadecimal values of the Unicode code point.
<DT id="31"><B>escape-xml</B>

<DD>
Same as
<B>escape-xml-hex</B>.

<DT id="32"><B>escape-xml-hex</B>

<DD>
Replace the missing characters with a string of the format
<B>&amp;#x</B><I>hhhh</I>;,

where
<I>hhhh</I>

is the hexadecimal value of the Unicode code point.
<DT id="33"><B>escape-xml-dec</B>

<DD>
Replace the missing characters with a string of the format
<B>&amp;#</B><I>nnnn</I>;,

where
<I>nnnn</I>

is the decimal value of the Unicode code point.
<DT id="34"><B>escape-unicode</B>

<DD>
Replace the missing characters with a string of the format
<B>{U+</B><I>hhhh</I>},

where
<I>hhhh</I>

is the hexadecimal value of the Unicode code point.
That hexadecimal string is of variable length and can use from 4 to
6 digits.
This is the format conventionally used to denote a Unicode code point in
the literature, delimited by curly braces for easy recognition of those
substitutions in the output.
</DL>
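<P>
The escape formats listed above can be worked through for a concrete code point. The following Python sketch formats a code point according to the descriptions in this section; it is not ICU's code, the function names are hypothetical, and the use of uppercase hexadecimal digits and the unpadded XML forms are assumptions.

```python
def utf16_units(cp):
    """Return the UTF-16 code units for a code point (one, or a surrogate pair)."""
    if cp < 0x10000:
        return [cp]                      # plane 0: a single code unit
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


def escape(cp, style):
    """Format code point cp as described for the given callback name."""
    if style == "escape-icu":
        return "".join("%U{:04X}".format(u) for u in utf16_units(cp))
    if style == "escape-java":
        return "".join("\\u{:04X}".format(u) for u in utf16_units(cp))
    if style == "escape-c":
        if cp < 0x10000:
            return "\\u{:04X}".format(cp)
        return "\\U{:08X}".format(cp)
    if style == "escape-xml-hex":
        return "&#x{:X};".format(cp)     # assumption: no zero padding
    if style == "escape-xml-dec":
        return "&#{};".format(cp)
    if style == "escape-unicode":
        return "{{U+{:04X}}}".format(cp)  # 4 to 6 hexadecimal digits
    raise ValueError("unknown style: " + style)


for name in ("escape-icu", "escape-java", "escape-c",
             "escape-xml-hex", "escape-xml-dec", "escape-unicode"):
    # U+1F600 lies in plane 1, so the UTF-16 based styles emit two units.
    print(name, escape(0x1F600, name))
```

The UTF-16 based styles
(<B>escape-icu</B>, <B>escape-java</B>)
split U+1F600 into the surrogate pair D83D DE00, while the code point based styles print the value 1F600 directly.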
<A NAME="lbAG"> </A>
<H2>EXAMPLES</H2>

Convert data from a given
<I>encoding</I>

to the platform encoding:
<P>
<DL COMPACT><DT id="35"><DD>
<B></B>$ uconv -f <I>encoding</I>

</DL>

<P>

Check if a
<I>file</I>

contains valid data for a given
<I>encoding</I>:

<P>
<DL COMPACT><DT id="36"><DD>
<B></B>$ uconv -f <I>encoding</I> -c <I>file</I> >/dev/null

</DL>
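<P>
The idea behind such a validity check, decoding strictly and treating any failure as invalid data, can be sketched in Python. This relies on Python's own decoders, which may accept or reject slightly different input than ICU's converters; the function name is hypothetical.

```python
def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly from the given encoding."""
    try:
        data.decode(encoding)  # strict mode: any invalid sequence raises
    except UnicodeDecodeError:
        return False
    return True


print(is_valid(b"caf\xc3\xa9", "utf-8"))  # True: well-formed UTF-8
print(is_valid(b"caf\xe9", "utf-8"))      # False: 0xE9 starts a truncated sequence
```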

<P>

Convert a UTF-8
<I>file</I>

to a given
<I>encoding</I>

and ensure that the resulting text can be used in any version of HTML:
<P>
<DL COMPACT><DT id="37"><DD>
<B></B>$ uconv -f utf-8 -t <I>encoding</I> \

<BR>

<B> --callback escape-xml-dec </B><I>file</I>

</DL>

<P>

Display the names of the Unicode code points in a UTF-8 file:
<P>
<DL COMPACT><DT id="38"><DD>
<B></B>$ uconv -f utf-8 -x any-name <I>file</I>

</DL>

<P>

Print the name of a Unicode code point whose value is known (<B>U+30AB</B>
in this example):
<P>
<DL COMPACT><DT id="39"><DD>
<B></B>$ echo '\u30ab' | uconv -x 'hex-any; any-name'; echo

<BR>

{KATAKANA LETTER KA}{LINE FEED}
<BR>

$
</DL>

<P>
(The names are delimited by curly braces.
The name of the line terminator is also displayed.)
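<P>
A rough Python counterpart of the name lookup uses the standard
<I>unicodedata</I>
module. It is a sketch, not a reimplementation of the ICU transliterator: Python's database has no name for control characters such as the line feed, so this version falls back to a U+hhhh label for them, and the function name is hypothetical.

```python
import unicodedata


def names(text: str) -> str:
    """Return the name of each character, delimited by curly braces."""
    parts = []
    for ch in text:
        # Fall back to a U+XXXX label when the database has no name.
        fallback = "U+{:04X}".format(ord(ch))
        parts.append("{" + unicodedata.name(ch, fallback) + "}")
    return "".join(parts)


print(names("\u30ab"))  # {KATAKANA LETTER KA}
```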
<P>

Normalize UTF-8 data using Unicode NFKC, remove all control characters,
and map Katakana to Hiragana:
<P>
<DL COMPACT><DT id="40"><DD>
<B></B>$ uconv -f utf-8 -t utf-8 \

<BR>

<B> -x '::nfkc; [:Cc:] >; ::katakana-hiragana;'</B>

</DL>
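<P>
The three steps of this transliteration can be approximated in Python to show what each rule contributes. This is a simplified sketch with a hypothetical function name: the Katakana step only shifts U+30A1 through U+30F6 down by 0x60 to the matching Hiragana letters, which is an assumption and covers less than the ICU transliterator does.

```python
import unicodedata


def normalize_kana(text: str) -> str:
    """NFKC-normalize, drop control characters, map Katakana to Hiragana."""
    # Step 1 (::nfkc): compatibility normalization, e.g. half-width Katakana
    # become their full-width equivalents.
    text = unicodedata.normalize("NFKC", text)
    out = []
    for ch in text:
        # Step 2 ([:Cc:] >): remove characters of general category Cc.
        if unicodedata.category(ch) == "Cc":
            continue
        # Step 3 (simplified Katakana-Hiragana): shift the basic letters.
        cp = ord(ch)
        if 0x30A1 <= cp <= 0x30F6:
            ch = chr(cp - 0x60)
        out.append(ch)
    return "".join(out)


# Half-width KA, full-width TA and a BEL control character.
print(normalize_kana("\uff76\u30bf\x07"))
```

Note that the line feed also belongs to category Cc, so line terminators are removed along with the other control characters.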
<A NAME="lbAH"> </A>
<H2>CAVEATS AND BUGS</H2>

<B>uconv</B>

reports errors as occurring at the first invalid byte
encountered. This may be confusing to users of GNU
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1),

which reports errors as occurring at the first byte of an invalid
sequence. For multi-byte character sets or encodings, this means that
<B>uconv</B>

error positions may be at a later offset in the input stream than
would be the case with GNU
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1).
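<P>
The difference between the two conventions can be made concrete with Python's UTF-8 decoder, whose error object exposes both offsets. This is only an illustration of the two ways of counting; it does not show what either tool actually prints, and the function name is hypothetical.

```python
def error_offsets(data: bytes, encoding: str = "utf-8"):
    """Return (sequence_start, offending_byte) for the first decoding error.

    sequence_start is the offset of the first byte of the invalid sequence;
    offending_byte is the offset just past the bytes that were rejected.
    Returns None when the data decodes cleanly.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, exc.end
    return None


# 0xE3 0x81 opens a three-byte sequence, but "z" is not a continuation byte.
print(error_offsets(b"ab\xe3\x81z"))  # (2, 4)
```

In this input the invalid sequence begins at offset 2, while the byte that reveals the problem sits at offset 4, so a tool reporting the latter gives a larger position for the same error.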

<P>

The reporting of error positions when a transliterator is used may be
inaccurate or unavailable, in which case
<B>uconv</B>

will report the offset in the output stream at which the error
occurred.
<A NAME="lbAI"> </A>
<H2>AUTHORS</H2>

Jonas Utterstroem
<BR>

Yves Arrouye
<A NAME="lbAJ"> </A>
<H2>VERSION</H2>

66.1
<A NAME="lbAK"> </A>
<H2>COPYRIGHT</H2>

Copyright (C) 2000-2005 IBM, Inc. and others.
<A NAME="lbAL"> </A>
<H2>SEE ALSO</H2>

<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1)

<P>

<HR>
<A NAME="index"> </A><H2>Index</H2>
<DL>
<DT id="41"><A HREF="#lbAB">NAME</A><DD>
<DT id="42"><A HREF="#lbAC">SYNOPSIS</A><DD>
<DT id="43"><A HREF="#lbAD">DESCRIPTION</A><DD>
<DT id="44"><A HREF="#lbAE">OPTIONS</A><DD>
<DT id="45"><A HREF="#lbAF">CALLBACKS</A><DD>
<DT id="46"><A HREF="#lbAG">EXAMPLES</A><DD>
<DT id="47"><A HREF="#lbAH">CAVEATS AND BUGS</A><DD>
<DT id="48"><A HREF="#lbAI">AUTHORS</A><DD>
<DT id="49"><A HREF="#lbAJ">VERSION</A><DD>
<DT id="50"><A HREF="#lbAK">COPYRIGHT</A><DD>
<DT id="51"><A HREF="#lbAL">SEE ALSO</A><DD>
</DL>
<HR>
This document was created by
<A HREF="/cgi-bin/man/man2html">man2html</A>,
using the manual pages.<BR>
Time: 00:05:28 GMT, March 31, 2021
</BODY>
</HTML>