man-pages/man1/uconv.1.html


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD><TITLE>Man page of UCONV</TITLE>
</HEAD><BODY>
<H1>UCONV</H1>
Section: ICU 66.1 Manual (1)<BR>Updated: 2005-jul-1<BR><A HREF="#index">Index</A>
<A HREF="/cgi-bin/man/man2html">Return to Main Contents</A><HR>

<A NAME="lbAB">&nbsp;</A>
<H2>NAME</H2>

<B>uconv</B>

- convert data from one encoding to another
<A NAME="lbAC">&nbsp;</A>
<H2>SYNOPSIS</H2>

<B>uconv</B>

[
<B>-h</B>, <B>-?</B>, <B>--help</B>

]
[
<B>-V</B>, <B>--version</B>

]
[
<B>-s</B>, <B>--silent</B>

]
[
<B>-v</B>, <B>--verbose</B>

]
[
<B>-l</B>, <B>--list</B>

|
<B>-l</B>, <B>--list-code</B><I> code</I>

|
<B>--default-code</B>

|
<B>-L</B>, <B>--list-transliterators</B>

]
[
<B>--canon</B>

]
[
<B>-x</B><I> transliteration</I>

]
[
<B>--to-callback</B><I> callback</I>

|
<B>-c</B>

]
[
<B>--from-callback</B><I> callback</I>

|
<B>-i</B>

]
[
<B>--callback</B><I> callback</I>

]
[
<B>--fallback</B>

|
<B>--no-fallback</B>

]
[
<B>-b</B>, <B>--block-size</B><I> size</I>

]
[
<B>-f</B>, <B>--from-code</B><I> encoding</I>

]
[
<B>-t</B>, <B>--to-code</B><I> encoding</I>

]
[
<B>--add-signature</B>

]
[
<B>--remove-signature</B>

]
[
<B>-o</B>, <B>--output</B><I> file</I>

]
[
<I>file</I>...

]
<A NAME="lbAD">&nbsp;</A>
<H2>DESCRIPTION</H2>

<B>uconv</B>

converts, or transcodes, each given
<I>file</I>

(or its standard input if no
<I>file</I>

is specified) from one
<I>encoding</I>

to another.
The transcoding is done using Unicode as a pivot encoding
(i.e. the data are first transcoded from their original encoding to
Unicode, and then from Unicode to the destination encoding).
<P>

If an
<I>encoding</I>

is not specified or is
<B>-</B>,

the default encoding is used. Thus, calling
<B>uconv</B>

with no
<I>encoding</I>

provides an easy way to validate and sanitize data files for
further consumption by tools requiring data in the default encoding.
<P>

When calling
<B>uconv</B>,

it is possible to specify callbacks that are used to handle invalid
characters in the input, or characters that cannot be transcoded to
the destination encoding. Some encodings, for example, offer a default
substitution character that can be used to represent the occurrence of
such characters in the input. Other callbacks offer a useful visual
representation of the invalid data.
<P>

<B>uconv</B>

can also run the specified
<I>transliteration</I>

on the transcoded data,
in which case transliteration will happen as an intermediate step,
after the data have been transcoded to Unicode.
The
<I>transliteration</I>

can be either a list of semicolon-separated transliterator names,
or an arbitrarily complex set of rules in the ICU transliteration
rules format.
<P>

For transcoding purposes,
<B>uconv</B>

options are compatible with those of
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1),

making it easy to replace it in scripts. It is not necessarily the case,
however, that the encoding names used by
<B>uconv</B>

and ICU are the same as the ones used by
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1).

Also, options that provide informational data, such as the
<B>-l</B>, <B>--list</B>

one offered by some
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1)

variants such as GNU's, produce data in a slightly different and
easier to parse format.
<A NAME="lbAE">&nbsp;</A>
<H2>OPTIONS</H2>

<DL COMPACT>
<DT id="1"><B>-h</B>, <B>-?</B>, <B>--help</B>

<DD>
Print help about usage and exit.
<DT id="2"><B>-V</B>, <B>--version</B>

<DD>
Print the version of
<B>uconv</B>

and exit.
<DT id="3"><B>-s</B>, <B>--silent</B>

<DD>
Suppress messages during execution.
<DT id="4"><B>-v</B>, <B>--verbose</B>

<DD>
Display extra informative messages during execution.
<DT id="5"><B>-l</B>, <B>--list</B>

<DD>
List all the available encodings and exit.
<DT id="6"><B>-l</B>, <B>--list-code</B><I> code</I>

<DD>
List only the
<I>code</I>

encoding and exit. If
<I>code</I>

is not a proper encoding, exit with an error.
<DT id="7"><B>--default-code</B>

<DD>
List only the name of the default encoding and exit.
<DT id="8"><B>-L</B>, <B>--list-transliterators</B>

<DD>
List all the available transliterators and exit.
<DT id="9"><B>--canon</B>

<DD>
If used with
<B>-l</B>, <B>--list</B>

or
<B>--default-code</B>,

the list of encodings is produced in a format compatible with
<B><A HREF="/cgi-bin/man/man2html?5+convrtrs.txt">convrtrs.txt</A></B>(5).

If used with
<B>-L</B>, <B>--list-transliterators</B>,

print only one transliterator name per line.
<DT id="10"><B>-x</B><I> transliteration</I>

<DD>
Run the given
<I>transliteration</I>

on the transcoded Unicode data,
and use the transliterated data as input for the transcoding to
the destination encoding.
<DT id="11"><B>--to-callback</B><I> callback</I>

<DD>
Use
<I>callback</I>

to handle characters that cannot be transcoded to the destination
encoding. See section
<B>CALLBACKS</B>

for details on valid callbacks.
<DT id="12"><B>-c</B>

<DD>
Omit invalid characters from the output.
Same as
<B>--to-callback skip</B>.

<DT id="13"><B>--from-callback</B><I> callback</I>

<DD>
Use
<I>callback</I>

to handle characters that cannot be transcoded from the original
encoding. See section
<B>CALLBACKS</B>

for details on valid callbacks.
<DT id="14"><B>-i</B>

<DD>
Ignore invalid sequences in the input.
Same as
<B>--from-callback skip</B>.

<DT id="15"><B>--callback</B><I> callback</I>

<DD>
Use
<I>callback</I>

to handle both characters that cannot be transcoded from the original
encoding and characters that cannot be transcoded to the destination
encoding. See section
<B>CALLBACKS</B>

for details on valid callbacks.
<DT id="16"><B>--fallback</B>

<DD>
Use the fallback mapping when transcoding from
Unicode to the destination encoding.
<DT id="17"><B>--no-fallback</B>

<DD>
Do not use the fallback mapping when transcoding from Unicode to the
destination encoding.
This is the default.
<DT id="18"><B>-b</B>, <B>--block-size</B><I> size</I>

<DD>
Read input in blocks of
<I>size</I>

bytes at a time. The default block size is
4096.
<DT id="19"><B>-f</B>, <B>--from-code</B><I> encoding</I>

<DD>
Set the original encoding of the data to
<I>encoding</I>.

<DT id="20"><B>-t</B>, <B>--to-code</B><I> encoding</I>

<DD>
Transcode the data to
<I>encoding</I>.

<DT id="21"><B>--add-signature</B>

<DD>
Add a U+FEFF Unicode signature character (BOM) if the output charset
supports it and does not add one anyway.
<DT id="22"><B>--remove-signature</B>

<DD>
Remove a U+FEFF Unicode signature character (BOM).
<DT id="23"><B>-o</B>, <B>--output</B><I> file</I>

<DD>
Write the transcoded data to
<I>file</I>.

</DL>
<A NAME="lbAF">&nbsp;</A>
<H2>CALLBACKS</H2>

<B>uconv</B>

supports specifying callbacks to handle invalid data. Callbacks can be
set for both directions of transcoding: from the original encoding to
Unicode, with the
<B>--from-callback</B>

option, and from Unicode to the destination encoding, with the
<B>--to-callback</B>

option.
<P>

The following is a list of valid
<I>callback</I>

names, along with a description of their behavior. The list of
callbacks actually supported by
<B>uconv</B>

is displayed when it is called with
<B>-h</B>, <B>--help</B>.

<P>

<DL COMPACT>
<DT id="24"><B>substitute</B>

<DD>
Write the encoding's substitute sequence, or the Unicode
replacement character
<B>U+FFFD</B>

when transcoding to Unicode.
<DT id="25"><B>skip</B>

<DD>
Ignore the invalid data.
<DT id="26"><B>stop</B>

<DD>
Stop with an error when encountering invalid data.
This is the default callback.
<DT id="27"><B>escape</B>

<DD>
Same as
<B>escape-icu</B>.

<DT id="28"><B>escape-icu</B>

<DD>
Replace the missing characters with a string of the format
<B>%U</B><I>hhhh</I>

for plane 0 characters, and
<B>%U</B><I>hhhh</I>%U<I>hhhh</I>

for planes 1 and above characters,
where
<I>hhhh</I>

is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
<DT id="29"><B>escape-java</B>

<DD>
Replace the missing characters with a string of the format
<B>\u</B><I>hhhh</I>

for plane 0 characters, and
<B>\u</B><I>hhhh</I>\u<I>hhhh</I>

for planes 1 and above characters,
where
<I>hhhh</I>

is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
<DT id="30"><B>escape-c</B>

<DD>
Replace the missing characters with a string of the format
<B>\u</B><I>hhhh</I>

for plane 0 characters, and
<B>\U</B><I>hhhhhhhh</I>

for planes 1 and above characters,
where
<I>hhhh</I>

and
<I>hhhhhhhh</I>

are the hexadecimal values of the Unicode codepoint.
<DT id="31"><B>escape-xml</B>

<DD>
Same as
<B>escape-xml-hex</B>.

<DT id="32"><B>escape-xml-hex</B>

<DD>
Replace the missing characters with a string of the format
<B>&amp;#x</B><I>hhhh</I>;,

where
<I>hhhh</I>

is the hexadecimal value of the Unicode codepoint.
<DT id="33"><B>escape-xml-dec</B>

<DD>
Replace the missing characters with a string of the format
<B>&amp;#</B><I>nnnn</I>;,

where
<I>nnnn</I>

is the decimal value of the Unicode codepoint.
<DT id="34"><B>escape-unicode</B>

<DD>
Replace the missing characters with a string of the format
<B>{U+</B><I>hhhh</I>},

where
<I>hhhh</I>

is the hexadecimal value of the Unicode codepoint.
That hexadecimal string is of variable length and can use from 4 to
6 digits.
This is the format universally used to denote a Unicode codepoint in
the literature, delimited by curly braces for easy recognition of those
substitutions in the output.
</DL>
<A NAME="lbAG">&nbsp;</A>
<H2>EXAMPLES</H2>

Convert data from a given
<I>encoding</I>

to the platform encoding:
<P>
<DL COMPACT><DT id="35"><DD>
<B></B>$ uconv -f <I>encoding</I>

</DL>

<P>

Check if a
<I>file</I>

contains valid data for a given
<I>encoding</I>:

<P>
<DL COMPACT><DT id="36"><DD>
<B></B>$ uconv -f <I>encoding</I> -c <I>file</I> &gt;/dev/null

</DL>

<P>

Convert a UTF-8
<I>file</I>

to a given
<I>encoding</I>

and ensure that the resulting text is good for any version of HTML:
<P>
<DL COMPACT><DT id="37"><DD>
<B></B>$ uconv -f utf-8 -t <I>encoding</I> \

<BR>

<B>    --callback escape-xml-dec </B><I>file</I>

</DL>

<P>

Display the names of the Unicode code points in a UTF-file:
<P>
<DL COMPACT><DT id="38"><DD>
<B></B>$ uconv -f utf-8 -x any-name <I>file</I>

</DL>

<P>

Print the name of a Unicode code point whose value is known (<B>U+30AB</B>
in this example):
<P>
<DL COMPACT><DT id="39"><DD>
<B></B>$ echo '\u30ab' | uconv -x 'hex-any; any-name'; echo

<BR>

{KATAKANA LETTER KA}{LINE FEED}
<BR>

$
</DL>

<P>
(The names are delimited by curly braces.
Also, the name of the line terminator is also displayed.)
<P>

Normalize UTF-8 data using Unicode NFKC, remove all control characters,
and map Katakana to Hiragana:
<P>
<DL COMPACT><DT id="40"><DD>
<B></B>$ uconv -f utf-8 -t utf-8 \

<BR>

<B>      -x '::nfkc; [:Cc:] &gt;; ::katakana-hiragana;'</B>

</DL>
<A NAME="lbAH">&nbsp;</A>
<H2>CAVEATS AND BUGS</H2>

<B>uconv</B>

does report errors as occurring at the first invalid byte
encountered. This may be confusing to users of GNU
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1),

which reports errors as occurring at the first byte of an invalid
sequence. For multi-byte character sets or encodings, this means that
<B>uconv</B>

error positions may be at a later offset in the input stream than
would be the case with GNU
<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1).

<P>

The reporting of error positions when a transliterator is used may be
inaccurate or unavailable, in which case
<B>uconv</B>

will report the offset in the output stream at which the error
occurred.
<A NAME="lbAI">&nbsp;</A>
<H2>AUTHORS</H2>

Jonas Utterstroem
<BR>

Yves Arrouye
<A NAME="lbAJ">&nbsp;</A>
<H2>VERSION</H2>

66.1
<A NAME="lbAK">&nbsp;</A>
<H2>COPYRIGHT</H2>

Copyright (C) 2000-2005 IBM, Inc. and others.
<A NAME="lbAL">&nbsp;</A>
<H2>SEE ALSO</H2>

<B><A HREF="/cgi-bin/man/man2html?1+iconv">iconv</A></B>(1)

<P>

<HR>
<A NAME="index">&nbsp;</A><H2>Index</H2>
<DL>
<DT id="41"><A HREF="#lbAB">NAME</A><DD>
<DT id="42"><A HREF="#lbAC">SYNOPSIS</A><DD>
<DT id="43"><A HREF="#lbAD">DESCRIPTION</A><DD>
<DT id="44"><A HREF="#lbAE">OPTIONS</A><DD>
<DT id="45"><A HREF="#lbAF">CALLBACKS</A><DD>
<DT id="46"><A HREF="#lbAG">EXAMPLES</A><DD>
<DT id="47"><A HREF="#lbAH">CAVEATS AND BUGS</A><DD>
<DT id="48"><A HREF="#lbAI">AUTHORS</A><DD>
<DT id="49"><A HREF="#lbAJ">VERSION</A><DD>
<DT id="50"><A HREF="#lbAK">COPYRIGHT</A><DD>
<DT id="51"><A HREF="#lbAL">SEE ALSO</A><DD>
</DL>
<HR>
This document was created by
<A HREF="/cgi-bin/man/man2html">man2html</A>,
using the manual pages.<BR>
Time: 00:05:28 GMT, March 31, 2021
</BODY>
</HTML>