<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD><TITLE>Man page of ENC2XS</TITLE>
</HEAD><BODY>
<H1>ENC2XS</H1>
Section: Perl Programmers Reference Guide (1)<BR>Updated: 2020-10-19<BR><A HREF="#index">Index</A>
<A HREF="/cgi-bin/man/man2html">Return to Main Contents</A><HR>
<A NAME="lbAB">&nbsp;</A>
<H2>NAME</H2>

enc2xs -- Perl Encode Module Generator
<A NAME="lbAC">&nbsp;</A>
<H2>SYNOPSIS</H2>

<PRE>
enc2xs -[options]
enc2xs -M ModName mapfiles...
enc2xs -C
</PRE>

<A NAME="lbAD">&nbsp;</A>
<H2>DESCRIPTION</H2>

<I>enc2xs</I> builds a Perl extension for use by Encode from either
Unicode Character Mapping files (.ucm) or Tcl Encoding Files (.enc).
Besides being used internally during the build process of the Encode
module, you can use <I>enc2xs</I> to add your own encoding to perl.
No knowledge of <FONT SIZE="-1">XS</FONT> is necessary.
<A NAME="lbAE">&nbsp;</A>
<H2>Quick Guide</H2>

If you want to know as little about Perl as possible but need to
add a new encoding, just read this chapter and forget the rest.
<DL COMPACT>
<DT id="1">0.<DD>
Have a .ucm file ready. You can obtain one elsewhere, write your own
from scratch, or take one from the Encode distribution and customize
it. For the <FONT SIZE="-1">UCM</FONT> format, see the next chapter. The
example below uses a hypothetical encoding called myascii, defined
in <I>my.ucm</I>. <TT>"$"</TT> is a shell prompt.
<P>
<PRE>
$ ls -F
my.ucm
</PRE>

<DT id="2">1.<DD>
Run the following command:
<P>
<PRE>
$ enc2xs -M My my.ucm
generating Makefile.PL
generating My.pm
generating README
generating Changes
</PRE>

<P>
Now take a look at your current directory. It should look like this:
<P>
<PRE>
$ ls -F
Makefile.PL My.pm my.ucm t/
</PRE>

<P>
The following files were created:
<P>
<PRE>
Makefile.PL - MakeMaker script
My.pm       - Encode submodule
t/My.t      - test file
</PRE>

<DL COMPACT><DT id="3"><DD>
<DL COMPACT>
<DT id="4">1.1.<DD>
If you want the *.ucm files installed together with the modules, do as follows:
<P>
<PRE>
$ mkdir Encode
$ mv *.ucm Encode
$ enc2xs -M My Encode/*ucm
</PRE>

</DL>
</DL>

<DL COMPACT><DT id="5"><DD>
</DL>

<DT id="6">2.<DD>
Edit the generated files. You can skip this step if you have no time <FONT SIZE="-1">AND</FONT> no
intention of giving the module to anyone else, but it is a good idea to edit
the pod and to add more tests.
<DT id="7">3.<DD>
Now issue a command all Perl Mongers love:
<P>
<PRE>
$ perl Makefile.PL
Writing Makefile for Encode::My
</PRE>

<DT id="8">4.<DD>
Now all you have to do is run make.
<P>
<PRE>
$ make
cp My.pm blib/lib/Encode/My.pm
/usr/local/bin/perl /usr/local/bin/enc2xs -Q -O \
  -o encode_t.c -f encode_t.fnm
Reading myascii (myascii)
Writing compiled form
128 bytes in string tables
384 bytes (75%) saved spotting duplicates
1 bytes (0.775%) saved using substrings
....
chmod 644 blib/arch/auto/Encode/My/My.bs
$
</PRE>

<P>
The time it takes depends on how fast your machine is and
how large your encoding is. Unless you are working on something big
like euc-tw, it won't take too long.
<DT id="9">5.<DD>
You can "make install" already but you should test first.
<P>
<PRE>
$ make test
PERL_DL_NONLAZY=1 /usr/local/bin/perl -Iblib/arch -Iblib/lib \
  -e 'use Test::Harness qw(&amp;runtests $verbose); \
  $verbose=0; runtests @ARGV;' t/*.t
t/My....ok
All tests successful.
Files=1, Tests=2, 0 wallclock secs
 ( 0.09 cusr + 0.01 csys = 0.09 CPU)
</PRE>

<DT id="10">6.<DD>
If you are content with the test result, just run "make install".
<DT id="11">7.<DD>
If you want to add your encoding to Encode's demand-loading list
(so you don't have to "use Encode::YourEncoding"), run
<P>
<PRE>
enc2xs -C
</PRE>

<P>
to update Encode::ConfigLocal, a module that controls local settings.
After that, "use Encode;" is enough to load your encodings on demand.
</DL>
<A NAME="lbAF">&nbsp;</A>
<H2>The Unicode Character Map</H2>

Encode uses the Unicode Character Map (<FONT SIZE="-1">UCM</FONT>) format for source character
mappings. This format is used by <FONT SIZE="-1">IBM</FONT>'s <FONT SIZE="-1">ICU</FONT> package and was adopted
by Nick Ing-Simmons for use with the Encode module. Since <FONT SIZE="-1">UCM</FONT> is
more flexible than Tcl's Encoding Map and far more user-friendly,
this is the recommended format for Encode now.
<P>
A <FONT SIZE="-1">UCM</FONT> file looks like this.
<P>
<PRE>
#
# Comments
#
&lt;code_set_name&gt; "US-ascii" # Required
&lt;code_set_alias&gt; "ascii" # Optional
&lt;mb_cur_min&gt; 1 # Required; usually 1
&lt;mb_cur_max&gt; 1 # Max. # of bytes/char
&lt;subchar&gt; \x3F # Substitution char
#
CHARMAP
&lt;U0000&gt; \x00 |0 # &lt;control&gt;
&lt;U0001&gt; \x01 |0 # &lt;control&gt;
&lt;U0002&gt; \x02 |0 # &lt;control&gt;
....
&lt;U007C&gt; \x7C |0 # VERTICAL LINE
&lt;U007D&gt; \x7D |0 # RIGHT CURLY BRACKET
&lt;U007E&gt; \x7E |0 # TILDE
&lt;U007F&gt; \x7F |0 # &lt;control&gt;
END CHARMAP
</PRE>

<DL COMPACT>
<DT id="12">•<DD>
Anything that follows <TT>"#"</TT> is treated as a comment.
<DT id="13">•<DD>
The header section continues until a line containing the word
<FONT SIZE="-1">CHARMAP.</FONT> This section has the form <I>&lt;keyword&gt; value</I>, one
pair per line. Strings used as values must be quoted. Barewords are
treated as numbers. <I>\xXX</I> represents a byte.
<P>
Most of the keywords are self-explanatory. <I>subchar</I> means
substitution character, not subcharacter. When you encode a Unicode
string into this encoding and no matching character is found, the byte
sequence defined here is used instead. In most cases the value is
\x3F; in <FONT SIZE="-1">ASCII,</FONT> this is a question mark.
<DT id="14">•<DD>
<FONT SIZE="-1">CHARMAP</FONT> starts the character map section. Each line has the
following form:
<P>
<PRE>
&lt;UXXXX&gt; \xXX.. |0 # comment
  ^     ^      ^
  |     |      +- Fallback flag
  |     +-------- Encoded byte sequence
  +-------------- Unicode Character ID in hex
</PRE>

<P>
The format is roughly the same as a header section except for the
fallback flag: | followed by 0..3. The meaning of the possible
values is as follows:
<DL COMPACT><DT id="15"><DD>
<DL COMPACT>
<DT id="16">|0<DD>
Round trip safe. A character decoded to Unicode encodes back to the
same byte sequence. Most characters have this flag.
<DT id="17">|1<DD>
Fallback for Unicode -&gt; encoding. When seen, enc2xs adds this
character to the encode map only.
<DT id="18">|2<DD>
Skip sub-char mapping should there be no code point.
<DT id="19">|3<DD>
Fallback for encoding -&gt; Unicode. When seen, enc2xs adds this
character to the decode map only.
</DL>
</DL>

<DL COMPACT><DT id="20"><DD>
</DL>

<DT id="21">•<DD>
And finally, <FONT SIZE="-1">END CHARMAP</FONT> ends the section.
</DL>
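<P>
The format described above is simple enough to read with a few lines of code. The following Python sketch is an illustration only: it is not part of Encode or enc2xs, and the names <TT>parse_ucm</TT>, <TT>parse_bytes</TT> and <TT>SAMPLE_UCM</TT> are hypothetical. It assumes that <TT>"#"</TT> never appears inside a quoted value and that every CHARMAP line carries a fallback flag.

```python
import re

# Hypothetical sample, abbreviated from the US-ascii example above.
SAMPLE_UCM = r'''
<code_set_name> "US-ascii" # Required
<code_set_alias> "ascii" # Optional
<mb_cur_min> 1 # Required; usually 1
<mb_cur_max> 1 # Max. # of bytes/char
<subchar> \x3F # Substitution char
#
CHARMAP
<U0041> \x41 |0 # LATIN CAPITAL LETTER A
<U007E> \x7E |0 # TILDE
END CHARMAP
'''

HEADER_RE = re.compile(r'^<([^>]+)>\s+(.+)$')
CHARMAP_RE = re.compile(
    r'^<U([0-9A-Fa-f]+)>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|([0-3])$')
BYTE_RE = re.compile(r'\\x([0-9A-Fa-f]{2})')


def parse_bytes(text):
    """Turn a run of \\xXX escapes into a bytes object."""
    return bytes(int(h, 16) for h in BYTE_RE.findall(text))


def parse_ucm(text):
    """Return (header dict, list of (codepoint, byte sequence, flag))."""
    header, entries = {}, []
    in_charmap = False
    for raw in text.splitlines():
        # Simplification: everything after the first "#" is a comment.
        line = raw.split('#', 1)[0].strip()
        if not line:
            continue
        if line == 'CHARMAP':
            in_charmap = True
        elif line == 'END CHARMAP':
            break
        elif in_charmap:
            m = CHARMAP_RE.match(line)
            if m is None:
                raise ValueError('bad CHARMAP line: ' + raw)
            entries.append(
                (int(m.group(1), 16), parse_bytes(m.group(2)), int(m.group(3))))
        else:
            m = HEADER_RE.match(line)
            if m is None:
                raise ValueError('bad header line: ' + raw)
            key, value = m.group(1), m.group(2).strip()
            if value.startswith('"') and value.endswith('"'):
                header[key] = value[1:-1]          # quoted string
            elif value.startswith('\\x'):
                header[key] = parse_bytes(value)   # byte sequence
            else:
                header[key] = int(value)           # bareword: a number
    return header, entries


header, entries = parse_ucm(SAMPLE_UCM)
print(header)
print(entries)
```

<P>
Each CHARMAP entry comes back as a tuple of code point, encoded byte sequence and fallback flag, in file order, which is the shape needed to reason about duplicate mappings.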
<P>
When you are manually creating a <FONT SIZE="-1">UCM</FONT> file, you should copy ascii.ucm
or an existing encoding which is close to yours, rather than write
your own from scratch.
<P>
When you do so, make sure you leave at least <B>U0000</B> to <B>U0020</B> as
is, unless your environment is <FONT SIZE="-1">EBCDIC.</FONT>
<P>
<FONT SIZE="-1"><B>CAVEAT</B></FONT>: not all features in <FONT SIZE="-1">UCM</FONT> are implemented. For example,
icu:state is not used. Because of that, you need to write a perl
module if you want to support algorithmic encodings, notably
the <FONT SIZE="-1">ISO-2022</FONT> series. Such modules include Encode::JP::2022_JP,
Encode::KR::2022_KR, and Encode::CN::HZ.
<A NAME="lbAG">&nbsp;</A>
<H3>Coping with duplicate mappings</H3>

When you create a map, you <FONT SIZE="-1">SHOULD</FONT> make your mappings round-trip safe.
That is, <TT>"encode('your-encoding', decode('your-encoding', $data)) eq
$data"</TT> holds for all characters that are marked as <TT>"|0"</TT>. Here is
how to make sure:
<DL COMPACT>
<DT id="22">•<DD>
Sort your map in Unicode order.
<DT id="23">•<DD>
When you have a duplicate entry, mark one of them with '|1' or '|3'.
<DT id="24">•<DD>
And make sure the '|1' or '|3' entry <FONT SIZE="-1">FOLLOWS</FONT> the '|0' entry.
</DL>
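<P>
The three rules above can be checked mechanically. The Python sketch below is a hypothetical helper, not the implementation of <I>ucmlint</I> or any other Encode tool. It assumes the map has already been read into a list of (code point, byte sequence, flag) tuples in file order, and it only looks at duplicates that share a code point.

```python
def check_round_trip_rules(entries):
    """Return a list of problems found in CHARMAP entries given in file order.

    entries: iterable of (codepoint, byte_sequence, fallback_flag) tuples.
    """
    entries = list(entries)
    problems = []

    # Rule 1: the map is sorted in Unicode order.
    codepoints = [cp for cp, _, _ in entries]
    if codepoints != sorted(codepoints):
        problems.append('entries are not sorted in Unicode order')

    seen_primary = set()    # code points that already have a |0 entry
    seen_fallback = set()   # code points that already have a |1 or |3 entry
    for cp, _, flag in entries:
        if flag == 0:
            # Rule 2: only one of the duplicates may stay |0.
            if cp in seen_primary:
                problems.append('U+%04X: more than one |0 entry' % cp)
            # Rule 3: the |1 or |3 entry must FOLLOW the |0 entry.
            if cp in seen_fallback:
                problems.append('U+%04X: |0 entry follows a fallback entry' % cp)
            seen_primary.add(cp)
        elif flag in (1, 3):
            seen_fallback.add(cp)
    return problems


# A |0 entry followed by a |3 entry for the same code point.
GOOD = [(0x2550, b'\xF9\xF9', 0), (0x2550, b'\xA2\xA4', 3)]
# The same two entries in the opposite order.
BAD = list(reversed(GOOD))

print(check_round_trip_rules(GOOD))
print(check_round_trip_rules(BAD))
```

<P>
An empty list means none of the three rules was violated. Any other result names the offending code point.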
<P>
Here is an example from big5-eten.
<P>
<PRE>
&lt;U2550&gt; \xF9\xF9 |0
&lt;U2550&gt; \xA2\xA4 |3
</PRE>

<P>
Internally, the Encoding -&gt; Unicode and Unicode -&gt; Encoding maps look like
this:
<P>
<PRE>
E to U               U to E
--------------------------------------
\xF9\xF9 =&gt; U2550    U2550 =&gt; \xF9\xF9
\xA2\xA4 =&gt; U2550
</PRE>

<P>
So it is round-trip safe for \xF9\xF9. But if the two lines above appear
in the opposite order, here is what happens.
<P>
<PRE>
E to U               U to E
--------------------------------------
\xA2\xA4 =&gt; U2550    U2550 =&gt; \xF9\xF9
(\xF9\xF9 =&gt; U2550 is now overwritten!)
</PRE>
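<P>
The first of the two tables above can be modelled with two dictionaries. The Python sketch below is a simplified model written for this page, not how enc2xs stores its maps: entries flagged |0 go into both maps, |1 into the encode map only, |3 into the decode map only, and |2 is ignored. It does not reproduce the overwriting shown in the second table, which depends on enc2xs internals. The function names and the default substitution byte are assumptions.

```python
def build_maps(entries):
    """Build (decode_map, encode_map) from (codepoint, bytes, flag) tuples."""
    decode_map = {}   # "E to U": byte sequence -> code point
    encode_map = {}   # "U to E": code point -> byte sequence
    for codepoint, seq, flag in entries:
        if flag in (0, 3):        # |0 and |3 take part in decoding
            decode_map[seq] = codepoint
        if flag in (0, 1):        # |0 and |1 take part in encoding
            encode_map[codepoint] = seq
    return decode_map, encode_map


def round_trips(seq, decode_map, encode_map):
    """True if decoding seq and encoding the result gives seq back."""
    codepoint = decode_map.get(seq)
    return codepoint is not None and encode_map.get(codepoint) == seq


def encode(text, encode_map, subchar=b'\x3F'):
    """Encode text, using subchar for characters that have no mapping."""
    return b''.join(encode_map.get(ord(ch), subchar) for ch in text)


# The big5-eten example from above, in the recommended order.
ENTRIES = [(0x2550, b'\xF9\xF9', 0), (0x2550, b'\xA2\xA4', 3)]
DECODE_MAP, ENCODE_MAP = build_maps(ENTRIES)

print(round_trips(b'\xF9\xF9', DECODE_MAP, ENCODE_MAP))
print(round_trips(b'\xA2\xA4', DECODE_MAP, ENCODE_MAP))
print(encode('\u2550A', ENCODE_MAP))
```

<P>
In this model only \xF9\xF9 round-trips. \xA2\xA4 still decodes to U+2550, but that character encodes back to \xF9\xF9, which is the behaviour the |3 flag asks for.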

<P>
The Encode package comes with <I>ucmlint</I>, a crude but sufficient
utility to check the integrity of a <FONT SIZE="-1">UCM</FONT> file. Check under the
Encode/bin directory for this.
<P>
When in doubt, you can use <I>ucmsort</I>, yet another utility under the
Encode/bin directory.
<A NAME="lbAH">&nbsp;</A>
<H2>Bookmarks</H2>

<DL COMPACT>
<DT id="25">•<DD>
<FONT SIZE="-1">ICU</FONT> Home Page
&lt;<A HREF="http://www.icu-project.org/">http://www.icu-project.org/</A>&gt;
<DT id="26">•<DD>
<FONT SIZE="-1">ICU</FONT> Character Mapping Tables
&lt;<A HREF="http://site.icu-project.org/charts/charset">http://site.icu-project.org/charts/charset</A>&gt;
<DT id="27">•<DD>
ICU:Conversion Data
&lt;<A HREF="http://www.icu-project.org/userguide/conversion-data.html">http://www.icu-project.org/userguide/conversion-data.html</A>&gt;
</DL>
<A NAME="lbAI">&nbsp;</A>
<H2>SEE ALSO</H2>

Encode,
perlmod,
perlpod
<P>

<HR>
<A NAME="index">&nbsp;</A><H2>Index</H2>
<DL>
<DT id="28"><A HREF="#lbAB">NAME</A><DD>
<DT id="29"><A HREF="#lbAC">SYNOPSIS</A><DD>
<DT id="30"><A HREF="#lbAD">DESCRIPTION</A><DD>
<DT id="31"><A HREF="#lbAE">Quick Guide</A><DD>
<DT id="32"><A HREF="#lbAF">The Unicode Character Map</A><DD>
<DL>
<DT id="33"><A HREF="#lbAG">Coping with duplicate mappings</A><DD>
</DL>
<DT id="34"><A HREF="#lbAH">Bookmarks</A><DD>
<DT id="35"><A HREF="#lbAI">SEE ALSO</A><DD>
</DL>
<HR>
This document was created by
<A HREF="/cgi-bin/man/man2html">man2html</A>,
using the manual pages.<BR>
Time: 00:05:11 GMT, March 31, 2021
</BODY>
</HTML>