334 lines
10 KiB
HTML
334 lines
10 KiB
HTML
|
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
|
<HTML><HEAD><TITLE>Man page of UNICODE</TITLE>
|
|
</HEAD><BODY>
|
|
<H1>UNICODE</H1>
|
|
Section: Linux Programmer's Manual (7)<BR>Updated: 2019-03-06<BR><A HREF="#index">Index</A>
|
|
<A HREF="/cgi-bin/man/man2html">Return to Main Contents</A><HR>
|
|
|
|
<A NAME="lbAB"> </A>
|
|
<H2>NAME</H2>
|
|
|
|
unicode - universal character set
|
|
<A NAME="lbAC"> </A>
|
|
<H2>DESCRIPTION</H2>
|
|
|
|
The international standard ISO 10646 defines the
|
|
Universal Character Set (UCS).
|
|
UCS contains all characters of all other character set standards.
|
|
It also guarantees "round-trip compatibility";
|
|
in other words,
|
|
conversion tables can be built such that no information is lost
|
|
when a string is converted from any other encoding to UCS and back.
|
|
<P>
|
|
|
|
UCS contains the characters required to represent practically all
|
|
known languages.
|
|
This includes not only the Latin, Greek, Cyrillic,
|
|
Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese,
|
|
Japanese and Korean Han ideographs as well as scripts such as
|
|
Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati,
|
|
Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo,
|
|
Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian,
|
|
Ogham, Myanmar, Sinhala, Thaana, Yi, and others.
|
|
For scripts not yet
|
|
covered, research on how to best encode them for computer usage is
|
|
still going on and they will be added eventually.
|
|
This might
|
|
eventually include not only Hieroglyphs and various historic
|
|
Indo-European languages, but even some selected artistic scripts such
|
|
as Tengwar, Cirth, and Klingon.
|
|
UCS also covers a large number of
|
|
graphical, typographical, mathematical, and scientific symbols,
|
|
including those provided by TeX, Postscript, APL, MS-DOS, MS-Windows,
|
|
Macintosh, OCR fonts, as well as many word processing and publishing
|
|
systems, and more are being added.
|
|
<P>
|
|
|
|
The UCS standard (ISO 10646) describes a
|
|
31-bit character set architecture
|
|
consisting of 128 24-bit
|
|
<I>groups</I>,
|
|
|
|
each divided into 256 16-bit
|
|
<I>planes</I>
|
|
|
|
made up of 256 8-bit
|
|
<I>rows</I>
|
|
|
|
with 256
|
|
<I>column</I>
|
|
|
|
positions, one for each character.
|
|
Part 1 of the standard (ISO 10646-1)
|
|
defines the first 65534 code positions (0x0000 to 0xfffd), which form
|
|
the
|
|
<I>Basic Multilingual Plane</I>
|
|
|
|
(BMP), that is plane 0 in group 0.
|
|
Part 2 of the standard (ISO 10646-2)
|
|
adds characters to group 0 outside the BMP in several
|
|
<I>supplementary planes</I>
|
|
|
|
in the range 0x10000 to 0x10ffff.
|
|
There are no plans to add characters
|
|
beyond 0x10ffff to the standard, therefore of the entire code space,
|
|
only a small fraction of group 0 will ever be actually used in the
|
|
foreseeable future.
|
|
The BMP contains all characters found in the
|
|
commonly used other character sets.
|
|
The supplemental planes added by
|
|
ISO 10646-2 cover only more exotic characters for special scientific,
|
|
dictionary printing, publishing industry, higher-level protocol and
|
|
enthusiast needs.
|
|
<P>
|
|
|
|
The representation of each UCS character as a 2-byte word is referred
|
|
to as the UCS-2 form (only for BMP characters),
|
|
whereas UCS-4 is the representation of each character by a 4-byte word.
|
|
In addition, there exist two encoding forms UTF-8
|
|
for backward compatibility with ASCII processing software and UTF-16
|
|
for the backward-compatible handling of non-BMP characters up to
|
|
0x10ffff by UCS-2 software.
|
|
<P>
|
|
|
|
The UCS characters 0x0000 to 0x007f are identical to those of the
|
|
classic US-ASCII
|
|
character set and the characters in the range 0x0000 to 0x00ff
|
|
are identical to those in
|
|
ISO 8859-1 (Latin-1).
|
|
<A NAME="lbAD"> </A>
|
|
<H3>Combining characters</H3>
|
|
|
|
Some code points in UCS
|
|
have been assigned to
|
|
<I>combining characters</I>.
|
|
|
|
These are similar to the nonspacing accent keys on a typewriter.
|
|
A combining character just adds an accent to the previous character.
|
|
The most important accented characters have codes of their own in UCS,
|
|
however, the combining character mechanism allows us to add accents
|
|
and other diacritical marks to any character.
|
|
The combining characters
|
|
always follow the character which they modify.
|
|
For example, the German
|
|
character Umlaut-A ("Latin capital letter A with diaeresis") can
|
|
either be represented by the precomposed UCS code 0x00c4, or
|
|
alternatively as the combination of a normal "Latin capital letter A"
|
|
followed by a "combining diaeresis": 0x0041 0x0308.
|
|
<P>
|
|
|
|
Combining characters are essential for instance for encoding the Thai
|
|
script or for mathematical typesetting and users of the International
|
|
Phonetic Alphabet.
|
|
<A NAME="lbAE"> </A>
|
|
<H3>Implementation levels</H3>
|
|
|
|
As not all systems are expected to support advanced mechanisms like
|
|
combining characters, ISO 10646-1 specifies the following three
|
|
<I>implementation levels</I>
|
|
|
|
of UCS:
|
|
<DL COMPACT>
|
|
<DT id="1">Level 1<DD>
|
|
Combining characters and Hangul Jamo
|
|
(a variant encoding of the Korean script, where a Hangul syllable
|
|
glyph is coded as a triplet or pair of vowel/consonant codes) are not
|
|
supported.
|
|
<DT id="2">Level 2<DD>
|
|
In addition to level 1, combining characters are now allowed for some
|
|
languages where they are essential (e.g., Thai, Lao, Hebrew,
|
|
Arabic, Devanagari, Malayalam).
|
|
<DT id="3">Level 3<DD>
|
|
All UCS characters are supported.
|
|
</DL>
|
|
<P>
|
|
|
|
The Unicode 3.0 Standard
|
|
published by the Unicode Consortium
|
|
contains exactly the UCS Basic Multilingual Plane
|
|
at implementation level 3, as described in ISO 10646-1:2000.
|
|
Unicode 3.1 added the supplemental planes of ISO 10646-2.
|
|
The Unicode standard and
|
|
technical reports published by the Unicode Consortium provide much
|
|
additional information on the semantics and recommended usages of
|
|
various characters.
|
|
They provide guidelines and algorithms for
|
|
editing, sorting, comparing, normalizing, converting, and displaying
|
|
Unicode strings.
|
|
<A NAME="lbAF"> </A>
|
|
<H3>Unicode under Linux</H3>
|
|
|
|
Under GNU/Linux, the C type
|
|
<I>wchar_t</I>
|
|
|
|
is a signed 32-bit integer type.
|
|
Its values are always interpreted
|
|
by the C library as UCS
|
|
code values (in all locales), a convention that is signaled by the GNU
|
|
C library to applications by defining the constant
|
|
<B>__STDC_ISO_10646__</B>
|
|
|
|
as specified in the ISO C99 standard.
|
|
<P>
|
|
|
|
UCS/Unicode can be used just like ASCII in input/output streams,
|
|
terminal communication, plaintext files, filenames, and environment
|
|
variables in the ASCII compatible UTF-8 multibyte encoding.
|
|
To signal the use of UTF-8 as the character
|
|
encoding to all applications, a suitable
|
|
<I>locale</I>
|
|
|
|
has to be selected via environment variables (e.g.,
|
|
"LANG=en_GB.UTF-8").
|
|
<P>
|
|
|
|
The
|
|
<B>nl_langinfo(CODESET)</B>
|
|
|
|
function returns the name of the selected encoding.
|
|
Library functions such as
|
|
<B><A HREF="/cgi-bin/man/man2html?3+wctomb">wctomb</A></B>(3)
|
|
|
|
and
|
|
<B><A HREF="/cgi-bin/man/man2html?3+mbsrtowcs">mbsrtowcs</A></B>(3)
|
|
|
|
can be used to transform the internal
|
|
<I>wchar_t</I>
|
|
|
|
characters and strings into the system character encoding and back
|
|
and
|
|
<B><A HREF="/cgi-bin/man/man2html?3+wcwidth">wcwidth</A></B>(3)
|
|
|
|
tells, how many positions (0-2) the cursor is advanced by the
|
|
output of a character.
|
|
<P>
|
|
|
|
<A NAME="lbAG"> </A>
|
|
<H3>Private Use Areas (PUA)</H3>
|
|
|
|
In the Basic Multilingual Plane,
|
|
the range 0xe000 to 0xf8ff will never be assigned to any characters by
|
|
the standard and is reserved for private usage.
|
|
For the Linux
|
|
community, this private area has been subdivided further into the
|
|
range 0xe000 to 0xefff which can be used individually by any end-user
|
|
and the Linux zone in the range 0xf000 to 0xf8ff where extensions are
|
|
coordinated among all Linux users.
|
|
The registry of the characters
|
|
assigned to the Linux zone is maintained by LANANA and the registry
|
|
itself is
|
|
<I>Documentation/admin-guide/unicode.rst</I>
|
|
|
|
in the Linux kernel sources
|
|
|
|
(or
|
|
<I>Documentation/unicode.txt</I>
|
|
|
|
before Linux 4.10).
|
|
<P>
|
|
|
|
Two other planes are reserved for private usage, plane 15
|
|
(Supplementary Private Use Area-A, range 0xf0000 to 0xffffd)
|
|
and plane 16 (Supplementary Private Use Area-B, range
|
|
0x100000 to 0x10fffd).
|
|
<A NAME="lbAH"> </A>
|
|
<H3>Literature</H3>
|
|
|
|
<DL COMPACT>
|
|
<DT id="4">*<DD>
|
|
Information technology --- Universal Multiple-Octet Coded Character
|
|
Set (UCS) --- Part 1: Architecture and Basic Multilingual Plane.
|
|
International Standard ISO/IEC 10646-1, International Organization
|
|
for Standardization, Geneva, 2000.
|
|
<DT id="5"><DD>
|
|
This is the official specification of UCS .
|
|
Available from
|
|
|
|
|
|
<DT id="6">*<DD>
|
|
The Unicode Standard, Version 3.0.
|
|
The Unicode Consortium, Addison-Wesley,
|
|
Reading, MA, 2000, ISBN 0-201-61633-5.
|
|
<DT id="7">*<DD>
|
|
S. Harbison, G. Steele. C: A Reference Manual. Fourth edition,
|
|
Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.
|
|
<DT id="8"><DD>
|
|
A good reference book about the C programming language.
|
|
The fourth
|
|
edition covers the 1994 Amendment 1 to the ISO C90 standard, which
|
|
adds a large number of new C library functions for handling wide and
|
|
multibyte character encodings, but it does not yet cover ISO C99,
|
|
which improved wide and multibyte character support even further.
|
|
<DT id="9">*<DD>
|
|
Unicode Technical Reports.
|
|
<DL COMPACT><DT id="10"><DD>
|
|
|
|
|
|
</DL>
|
|
|
|
<DT id="11">*<DD>
|
|
Markus Kuhn: UTF-8 and Unicode FAQ for UNIX/Linux.
|
|
<DL COMPACT><DT id="12"><DD>
|
|
|
|
|
|
</DL>
|
|
|
|
<DT id="13">*<DD>
|
|
Bruno Haible: Unicode HOWTO.
|
|
<DL COMPACT><DT id="14"><DD>
|
|
|
|
|
|
</DL>
|
|
|
|
|
|
|
|
</DL>
|
|
<A NAME="lbAI"> </A>
|
|
<H2>SEE ALSO</H2>
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?1+locale">locale</A></B>(1),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?3+setlocale">setlocale</A></B>(3),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?7+charsets">charsets</A></B>(7),
|
|
|
|
<B><A HREF="/cgi-bin/man/man2html?7+utf-8">utf-8</A></B>(7)
|
|
|
|
<A NAME="lbAJ"> </A>
|
|
<H2>COLOPHON</H2>
|
|
|
|
This page is part of release 5.05 of the Linux
|
|
<I>man-pages</I>
|
|
|
|
project.
|
|
A description of the project,
|
|
information about reporting bugs,
|
|
and the latest version of this page,
|
|
can be found at
|
|
<A HREF="https://www.kernel.org/doc/man-pages/.">https://www.kernel.org/doc/man-pages/.</A>
|
|
<P>
|
|
|
|
<HR>
|
|
<A NAME="index"> </A><H2>Index</H2>
|
|
<DL>
|
|
<DT id="15"><A HREF="#lbAB">NAME</A><DD>
|
|
<DT id="16"><A HREF="#lbAC">DESCRIPTION</A><DD>
|
|
<DL>
|
|
<DT id="17"><A HREF="#lbAD">Combining characters</A><DD>
|
|
<DT id="18"><A HREF="#lbAE">Implementation levels</A><DD>
|
|
<DT id="19"><A HREF="#lbAF">Unicode under Linux</A><DD>
|
|
<DT id="20"><A HREF="#lbAG">Private Use Areas (PUA)</A><DD>
|
|
<DT id="21"><A HREF="#lbAH">Literature</A><DD>
|
|
</DL>
|
|
<DT id="22"><A HREF="#lbAI">SEE ALSO</A><DD>
|
|
<DT id="23"><A HREF="#lbAJ">COLOPHON</A><DD>
|
|
</DL>
|
|
<HR>
|
|
This document was created by
|
|
<A HREF="/cgi-bin/man/man2html">man2html</A>,
|
|
using the manual pages.<BR>
|
|
Time: 00:06:10 GMT, March 31, 2021
|
|
</BODY>
|
|
</HTML>
|