Post by Eric Pozharski
*SKIP*
Post by Peter J. Holzer
Then I don't know what you meant by "utf8". Care to explain?
Do you know difference between utf-8 and utf8 for Perl?
UTF-8 is the "UCS Transformation Format, 8-bit form" as defined by the
Unicode consortium. It defines a mapping from unicode characters to
bytes and back. When you use it as an encoding in Perl, there will be
some checks that the input is actually a valid Unicode character. For
example, you can't encode a surrogate character:
$s2 = encode("utf-8", "\x{D812}");
results in the string "\xef\xbf\xbd", which is UTF-8 for U+FFFD (the
replacement character used to signal invalid characters).
utf8 may mean (at least) three different things in a Perl context:
* It is a perl-proprietary encoding (actually two encodings, but EBCDIC
support in perl has been dead for several years and I doubt it will
ever come back, so I'll ignore that) for storing strings. The
encoding is based on UTF-8, but it can represent code points with up
to 64 bits[1], while UTF-8 was limited to 31 bits (six-byte sequences)
by its original design and is now limited to values <= 0x10FFFF by
fiat. It also doesn't check for surrogates, so
$s2 = encode("utf8", "\x{D812}");
results in the string "\xed\xa0\x92", as one would naively expect.
You should never use this encoding when reading or writing files.
It's only for perl internal use and AFAIK it isn't documented
anywhere except possibly in the source code.
* Since the perl interpreter uses the format to store strings with
Unicode character semantics (marked with the UTF8 flag), such strings
are often called "utf8 strings" in the documentation. This is
somewhat unfortunate, because "utf8" looks very similar to "utf-8",
which can cause confusion, and because it exposes an implementation
detail (there are several other possible storage formats a perl
interpreter could reasonably use) to the user.
I avoid this usage. I usually talk about "byte strings" or "character
strings", or use even more verbose language to make clear what I am
talking about. For example, in this thread the distinction between
byte strings and character strings is almost irrelevant; it is only
important whether a string contains an element > 0xFF or not.
* There is also an I/O layer “:utf8”, which is subtly different from
both “:encoding(utf8)” and “:encoding(utf-8)”.
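The difference between the strict and the lax encoding described in the first bullet can be checked with a few lines. This is only a minimal sketch, assuming a perl with the core Encode module and the 'surrogate' warnings category (5.14 or later, as far as I know); the helper hex_bytes and the variable names are made up for illustration, and the expected bytes in the comments are the ones quoted above.

```perl
use strict;
use warnings;
no warnings 'surrogate';    # the lone surrogate below is deliberate
use Encode qw(encode);

# Render a byte string as space-separated hex (illustrative helper).
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my $surrogate = "\x{D812}";    # a surrogate, not a valid character

my $strict = encode('UTF-8', $surrogate);   # strict: replaced by U+FFFD
my $lax    = encode('utf8',  $surrogate);   # lax: encoded as is

print 'UTF-8: ', hex_bytes($strict), "\n";  # ef bf bd, per the text above
print 'utf8:  ', hex_bytes($lax),    "\n";  # ed a0 92, per the text above
```

No CHECK argument is passed to encode(), so the strict encoder substitutes the replacement character instead of dying.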
Post by Eric Pozharski
(For long time, up to yesterday, I believed that that utf-8 is
all-caps; I was wrong, it's caseless.)
Yes, the encoding names (as used in Encode::encode, Encode::decode and
the :encoding() I/O layers) are case-insensitive.
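A quick way to see both points at once (names are case-insensitive, but "utf-8" and "utf8" are different encodings) is to ask Encode for the canonical name behind each spelling. A small sketch, assuming only the core Encode module; the list of spellings is just an example.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map a few spellings to the canonical name Encode resolves them to.
my %canonical;
for my $spelling (qw(UTF-8 utf-8 Utf-8 utf8 UTF8)) {
    $canonical{$spelling} = find_encoding($spelling)->name;
}

printf "%-6s => %s\n", $_, $canonical{$_} for sort keys %canonical;
# The hyphenated spellings resolve to "utf-8-strict",
# the unhyphenated ones to "utf8".
```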
Post by Eric Pozharski
Post by Peter J. Holzer
* The encoding of the source code of the script
Wrong.
[quote perldoc encoding on]
* Internally converts all literals ("q//,qq//,qr//,qw///, qx//") from
the encoding specified to utf8. In Perl 5.8.1 and later, literals in
"tr///" and "DATA" pseudo-filehandle are also converted.
[quote off]
How is this proving me wrong? It confirms what I wrote.
If you use “use encoding 'KOI8-U';”, you can use KOI8 sequences (either
literally or via escape sequences) in your source code. For example, if
you store this program in KOI8-U encoding:
#!/usr/bin/perl
use warnings;
use strict;
use 5.010;
use encoding 'KOI8-U';

my $s1 = "Б";
say ord($s1);
my $s2 = "\x{E2}";
say ord($s2);
__END__
(i.e. the string literal on line 7 is stored as the byte sequence 0x22
0xE2 0x22), the program will print 1041 twice, because:
* The perl compiler knows that the source code is in KOI8-U, so a single
byte 0xE2 in the source code represents the character “U+0411
CYRILLIC CAPITAL LETTER BE”. Similarly, escape sequences of the form
\ooo and \xXX are taken to denote bytes in the source character set
and translated to Unicode. So both the literal Б on line 7 and the
\x{E2} on line 9 are translated to U+0411.
* At run time, the bytecode interpreter sees a string with the single
unicode character U+0411. How this character was represented in the
source code is irrelevant (and indeed, unknowable) to the byte code
interpreter at this stage. It just prints the decimal representation
of 0x0411, which happens to be 1041.
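The translation step can also be done by hand, without the pragma. The following is only a sketch of the same mapping using Encode::decode, not what “use encoding” does internally; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One byte, as it would be stored in a KOI8-U encoded source file.
my $byte = "\xE2";

# Interpret that byte as KOI8-U and get a character string back.
my $char = decode('KOI8-U', $byte);

printf "U+%04X %d\n", ord($char), ord($char);   # prints "U+0411 1041"
```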
Post by Eric Pozharski
In pre-all-utf8 times qr// was working on bytes without being told to
behave otherwise. That's different now.
Yes, I think I wrote that before. I don't know what this has to do with
the behaviour of “use encoding”, except that historically, “use
encoding” was intended to convert old byte-oriented scripts to the brave new
unicode-centered world with minimal effort. (I don't think it met that
goal: Over the years I have encountered a lot of people who had problems
with “use encoding”, but I don't remember ever reading from someone who
successfully converted their scripts by slapping “use encoding '...'”
at the beginning.)
Post by Eric Pozharski
Post by Peter J. Holzer
* The default encoding of some I/O streams
We here, in our barbaric world, had (and still have) to process any
binary encoding except latin1 (guess what, CP866 is still alive).
[quote perldoc encoding on]
* Changing PerlIO layers of "STDIN" and "STDOUT" to the encoding
specified.
[quote off]
That's not saying anything about 'default'. It's about 'encoding
specified'.
You misunderstood what I meant by "default". When the perl interpreter
creates the STDIN and STDOUT file handles, these have some I/O layers
applied to them without the user having to call binmode() explicitly.
These are applied by default, and hence I call them the
default layers. The list of default layers depends on the system
(Windows adds the :crlf layer, Linux doesn't), on command line settings
(-CS adds the :utf8 layer, IIRC), and of course it can also be
manipulated by modules like “encoding”. “use encoding 'CP866';” pushes
the layer “:encoding(CP866)” onto the STDIN and STDOUT handles. You can
still override them with binmode(), but they are there by default; you
don't have to call “binmode STDIN, ":encoding(CP866)"” explicitly
(but you do have to call it explicitly for STDERR, which IMNSHO is
inconsistent).
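To see which layers a handle carries, and what an explicit binmode() adds, something like the following can be used. It is a minimal sketch without the pragma: it inspects STDERR with the built-in PerlIO::get_layers and then pushes the CP866 layer by hand. The layer list printed for the default case depends on the platform and the command line, so no particular output is claimed.

```perl
use strict;
use warnings;

# Layers already on STDERR before the script changes anything.
my @default_layers = PerlIO::get_layers(\*STDERR);

# Push the encoding layer by hand, as is required for STDERR.
binmode(STDERR, ':encoding(CP866)') or die "binmode failed: $!";
my @layers_after = PerlIO::get_layers(\*STDERR);

print "default: @default_layers\n";
print "after:   @layers_after\n";
```

The second list should contain an additional "encoding(...)" entry; how exactly the name is spelled there is up to Encode.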
Post by Eric Pozharski
Post by Peter J. Holzer
and it does so even in an inconsistent manner (e.g. the encoding is
applied to STDOUT, but not to STDERR)
No problems with that here. STDERR is us-ascii, point.
If my scripts handle non-ascii characters, I want those characters also
in my error messages. If a script is intended for normal users (not
sysadmins), I might even want the error messages to be in their native
language instead of English. German can be expressed in pure US-ASCII,
although it's awkward. Russian or Chinese is harder.
Post by Eric Pozharski
Post by Peter J. Holzer
and finally, because it is too complex and that will lead to
surprising results.
In your elitist latin1 world -- may be so. But we, down here, are
barbarians, you know.
May I remind you that it was you who was surprised by the behaviour of
“use encoding” in this thread, not me?
In Message <***@orphan.zombinet> you wrote:
| {10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
| à
| {10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops
| �
| {10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hoora
| à
|
| Except the middle one (what I should think about), I think encoding.pm
| wins again.
You didn't understand why the middle one produced this particular
result. So you were surprised by the way “use encoding” translates
string literals. I wasn't surprised. I knew how it works and explained
it to you in my followup.
Still, although I think I understand “use encoding” fairly well (because
I spent a lot of time reading the docs and playing with it when I still
thought it would be a useful tool, and later because I spent a lot of
time arguing on usenet that it isn't useful) I think it is too complex.
I would be afraid of making stupid mistakes like writing "\x{E0}" when I
meant chr(0xE0), and even if I don't make them, the next guy who has to
maintain the scripts probably understands much less about “use encoding”
than I do and is likely to misunderstand my code and introduce errors.
hp
[1] I admit that I was surprised by this. It is documented that strings
consist of 64-bit elements on 64-bit machines, but I thought this
was an obvious documentation error until I actually tried it.
--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | ***@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel