Discussion:
Why "Wide character in print"?
tcgo
2012-09-30 17:57:37 UTC
Permalink
Hi!
I just made some test code with Perl, using the Pi symbol with Unicode/UTF-8. Here's the code:

#!/usr/bin/perl
use utf8;
my $cosa = "Here is my ☺ résúmé \x{2639}!";
print "$cosa\n";

And it gives me a "warning" message: "Wide character in print at ./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the warning disappears, but why was it showing before adding the binmode?

Thanks!
~tcgo~
Rainer Weikusat
2012-09-30 19:37:49 UTC
Permalink
Post by tcgo
I just made a test code with Perl, using the Pi symbol with
#!/usr/bin/perl
use utf8;
my $cosa = "Here is my ☺ résúmé \x{2639}!";
print "$cosa\n";
And it gives me a "warning" message: "Wide character in print at
./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the
warning disappears, but why was it showing before adding the
binmode?
Because the people who nowadays work on perl unicode support have
decided that it should behave as if the encoding used by it was some
super secret sauce shrouded in eternal mystery: All data flowing into
a Perl program is supposed to be converted to this super secret
internal mystery encoding before being used and all data flowing out
of a Perl program is supposed to be converted to something software
other than perl understands beforehand. De facto, the situation is
such that everything is fine when perl is used in an environment where
UTF-8 is the 'native' method for supporting wide characters because
this is also what perl uses itself, and anyone using something
else is essentially fucked. De jure, perl is supposed to be nasty to
everyone, or at least try as hard as possible without breaking
backwards compatibility.
Alan Curry
2012-09-30 20:00:23 UTC
Permalink
Post by tcgo
Hi!
I just made a test code with Perl, using the Pi symbol with
#!/usr/bin/perl
use utf8;
my $cosa = "Here is my ☺ résúmé \x{2639}!";
print "$cosa\n";
And it gives me a "warning" message: "Wide character in print at
./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the warning
disappears, but why was it showing before adding the binmode?
The binmode documents your assumption that nobody will ever run your program
on a non-UTF8-mode terminal.
--
Alan Curry
Peter J. Holzer
2012-09-30 20:25:38 UTC
Permalink
Post by tcgo
I just made a test code with Perl, using the Pi symbol with
#!/usr/bin/perl
use utf8;
my $cosa = "Here is my ☺ résúmé \x{2639}!";
print "$cosa\n";
And it gives me a "warning" message: "Wide character in print at
./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the
warning disappears, but why was it showing before adding the
binmode?
Because, unless you tell it with binmode, Perl doesn't know what
encoding it is supposed to use. It could get the encoding from the
locale settings, but that would only work for text written to a
terminal, not for arbitrary data written to a file, so perl doesn't
make assumptions and asks you to set the encoding explicitly.

(If you want to get the encoding from the locale, use I18N::Langinfo;
unfortunately this doesn't work on all platforms (at least it didn't
work on Windows last time I looked, but that was a few years ago).)
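
A minimal sketch of both approaches, the explicit binmode and deriving
the encoding from the locale via I18N::Langinfo; it assumes a POSIX-ish
platform where CODESET is available and a terminal that matches the
locale:

  #!/usr/bin/perl
  use utf8;                                  # the source of this snippet is UTF-8
  use POSIX qw(setlocale LC_CTYPE);
  use I18N::Langinfo qw(langinfo CODESET);
  setlocale(LC_CTYPE, "");                   # adopt the user's locale
  my $codeset = langinfo(CODESET);           # e.g. "UTF-8" under a UTF-8 locale
  binmode STDOUT, ":encoding($codeset)";     # or simply binmode STDOUT, ':encoding(UTF-8)'
  print "Here is my ☺ résúmé \x{2639}!\n";   # no "Wide character" warning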

hp
--
_ | Peter J. Holzer | Deprecating human carelessness and
|_|_) | Sysadmin WSR | ignorance has no successful track record.
| | | ***@hjp.at |
__/ | http://www.hjp.at/ | -- Bill Code on ***@irtf.org
j***@gmail.com
2012-10-23 22:50:52 UTC
Permalink
Post by tcgo
#!/usr/bin/perl
use utf8;
my $cosa = "Here is my ☺ résúmé \x{2639}!";
print "$cosa\n";
And it gives me a "warning" message: "Wide character in print at
./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the
warning disappears, but why was it showing before adding the
binmode?
“use utf8” means only that the script file itself is UTF-8-encoded;
it says nothing about how output to STDOUT should be encoded.
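
A sketch of that distinction with two hypothetical one-liners, assuming
a UTF-8 terminal:

  # "use utf8" decodes the literal ☹ in the source to one character, but
  # STDOUT is still a plain byte stream, so this warns "Wide character in say":
  perl -Mutf8 -E 'say "☹"'
  # adding an output encoding layer as well makes it clean:
  perl -Mutf8 -E 'binmode STDOUT, ":encoding(UTF-8)"; say "☹"'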

JD
C.DeRykus
2012-10-24 07:54:24 UTC
Permalink
Post by tcgo
Hi!
#!/usr/bin/perl
use utf8;
my $cosa = "Here is my ☺ résúmé \x{2639}!";
print "$cosa\n";
...
Here's a follow-on with an observation/question for someone more knowledgeable about Perl unicode.

I don't know how 'use locale' affects this but I
only see the OP's expected display of characters
by using the "\N{U+...}" notation to force character
semantics:

#use utf8;
my $cosa = "Here is my \N{U+263A} résúmé \N{U+03C0}!";

Output: Here is my ☺ résúmé π!
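
For comparison, a hedged variant that keeps the literal characters
instead of the \N{U+...} notation, declaring both the source encoding
and an output layer (assumes a UTF-8 terminal):

  use utf8;                               # literal ☺, é, π in the source are decoded
  binmode STDOUT, ':encoding(UTF-8)';     # and encoded again on the way out
  my $cosa = "Here is my ☺ résúmé π!";
  print "$cosa\n";                        # Output: Here is my ☺ résúmé π!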
--
Charles DeRykus
Eli the Bearded
2012-10-24 23:31:16 UTC
Permalink
Post by tcgo
And it gives me a "warning" message: "Wide character in print at ./unicode line 4". After
adding "binmode(STDOUT, ":utf8");" the warning disappears, but why was it showing before of
adding the binmode?
The question has been answered in other follow-ups, but I'm kinda curious
how come this has not made it into the FAQ yet? Or maybe it has in a newer
perl, but:

$ grep -ci wide.character /usr/share/perl/5.14.2/pod/perlfaq*pod
/usr/share/perl/5.14.2/pod/perlfaq.pod:0
/usr/share/perl/5.14.2/pod/perlfaq1.pod:0
/usr/share/perl/5.14.2/pod/perlfaq2.pod:0
/usr/share/perl/5.14.2/pod/perlfaq3.pod:0
/usr/share/perl/5.14.2/pod/perlfaq4.pod:0
/usr/share/perl/5.14.2/pod/perlfaq5.pod:0
/usr/share/perl/5.14.2/pod/perlfaq6.pod:0
/usr/share/perl/5.14.2/pod/perlfaq7.pod:0
/usr/share/perl/5.14.2/pod/perlfaq8.pod:0
/usr/share/perl/5.14.2/pod/perlfaq9.pod:0
$

The explanation in perldiag is a good start:

=item Wide character in %s

(S utf8) Perl met a wide character (>255) when it wasn't expecting
one. This warning is by default on for I/O (like print). The easiest
way to quiet this warning is simply to add the C<:utf8> layer to the
output, e.g. C<binmode STDOUT, ':utf8'>. Another way to turn off the
warning is to add C<no warnings 'utf8';> but that is often closer to
cheating. In general, you are supposed to explicitly mark the
filehandle with an encoding, see L<open> and L<perlfunc/binmode>.

For example:

=head2 Why is my code generating a "Wide character in <foo>" error?

You have tried to use a "wide character" (>255) when Perl was not
expecting one. Often what you want to do is add the C<:utf8> layer to
the output, e.g. C<binmode STDOUT, ':utf8'>. Another way to turn off the
warning is to add C<no warnings 'utf8';> but that is often closer to
cheating. In general, you are supposed to explicitly mark the
filehandle with an encoding, see L<open> and L<perlfunc/binmode>.

But that (and the docs for binmode()) doesn't address why the warning will still
happen for ":raw" streams:

echo "some binary stream with U+2639 in it" | \
perl -we 'binmode(STDOUT, ":raw");
binmode(STDIN, ":raw");
while(<>) { s/U\+2639/\x{2639}/g; print } '

I've used the "while(<>) { s///g; print; }" construct to patch binary
files in the past (rename functions in compiled programs, etc). I haven't
yet needed to sub-in wide characters, but it doesn't seem unreasonable.

I'm guessing that my binary stream situation is what "no warnings 'utf8';"
is intended to fix.

Elijah
------
javascript functions, once renamed, are not accessible by normal websites
Ben Morrow
2012-10-25 02:15:12 UTC
Permalink
Post by Eli the Bearded
Post by tcgo
And it gives me a "warning" message: "Wide character in print at
./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the
warning disappears, but why was it showing before adding the binmode?
<snip>
Post by Eli the Bearded
=item Wide character in %s
(S utf8) Perl met a wide character (>255) when it wasn't expecting
one. This warning is by default on for I/O (like print). The easiest
way to quiet this warning is simply to add the C<:utf8> layer to the
output, e.g. C<binmode STDOUT, ':utf8'>. Another way to turn off the
warning is to add C<no warnings 'utf8';> but that is often closer to
cheating. In general, you are supposed to explicitly mark the
filehandle with an encoding, see L<open> and L<perlfunc/binmode>.
<snip>
Post by Eli the Bearded
But that (and the docs for binmode()) doesn't address why the warning will still
happen for ":raw" streams:
echo "some binary stream with U+2639 in it" | \
perl -we 'binmode(STDOUT, ":raw");
binmode(STDIN, ":raw");
while(<>) { s/U\+2639/\x{2639}/g; print } '
[You should set $/ = \1024 or something else appropriate before using <>
on a binary file. By default <> reads newline-delimited lines, and there
is no particular reason for newlines to occur in sensible places in a
binary file. Of course, if the file is small enough it may be better to
read the whole thing and skip the while loop altogether.]
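
A sketch of that record-oriented reading, with a hypothetical file name
and an arbitrary chunk size; note that a pattern can still straddle a
chunk boundary, which this sketch does not handle:

  open my $in, '<:raw', 'patchme.bin' or die "open: $!";   # hypothetical file
  binmode STDOUT, ':raw';             # keep the output byte-for-byte too
  local $/ = \4096;                   # read 4096-byte records instead of lines
  while (my $chunk = <$in>) {
      $chunk =~ s/OLDNAME/NEWNAME/g;  # byte-oriented patching, as in the post
      print $chunk;
  }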

If you are dealing with :raw streams then your data needs to be in
bytes. That is, you should be using

use Encode "encode";

my $u2639 = encode "UTF-8", "\x{2639}";

s/U\+2639/$u2639/g;

Imagine you were trying to perform this replacement the other way
around; a substitution like

s/\x{2639}/U+2639/;

would never match, since the :raw layer would return a UTF8-encoded
U+2639 as three bytes. (It would also return a UTF-16 U+2639 as two
bytes, and a UCS-4 U+2639 as four bytes.) If you wanted it to match you
would need to use $u2639 defined as above, and deal with the possibility
of the character being split between chunks.
Post by Eli the Bearded
I've used the "while(<>) { s///g; print; }" construct to patch binary
files in the past (rename functions in compiled programs, etc). I haven't
yet needed to sub-in wide characters, but it doesn't seem unreasonable.
A binary file cannot contain 'wide characters' as such, instead it
contains some *encoding* of wide characters. Since Perl has no way to
guess which encoding you want you need to be explicit, either by using
Encode directly or by calling it indirectly using PerlIO::encoding.
Post by Eli the Bearded
I'm guessing that my binary stream situation is what "no warnings
'utf8';" is intended to fix.
No, not at all. If you review the (W utf8) warnings in perldiag, you
will see they all to do with performing character operations on Unicode
codepoints which are not valid characters (UTF-16 surrogates, codepoints
which haven't been allocated yet, explicit non-characters like U+FFFF).
They have nothing to do with ordinary Unicode IO.

Ben
Eli the Bearded
2012-10-25 20:56:46 UTC
Permalink
Post by Ben Morrow
Post by Eli the Bearded
But that (and the docs for binmode()) doesn't address why the warning will still
happen for ":raw" streams:
echo "some binary stream with U+2639 in it" | \
perl -we 'binmode(STDOUT, ":raw");
binmode(STDIN, ":raw");
while(<>) { s/U\+2639/\x{2639}/g; print } '
If you are dealing with :raw streams then your data needs to be in
bytes. That is, you should be using
use Encode "encode";
my $u2639 = encode "UTF-8", "\x{2639}";
s/U\+2639/$u2639/g;
Hmmmm. That looks awkward, but it does make sense. My example was all
about a UTF-8 string embedded in a file without a uniform encoding.
Post by Ben Morrow
Post by Eli the Bearded
I've used the "while(<>) { s///g; print; }" construct to patch binary
files in the past (rename functions in compiled programs, etc). I haven't
yet needed to sub-in wide characters, but it doesn't seem unreasonable.
A binary file cannot contain 'wide characters' as such, instead it
contains some *encoding* of wide characters.
I'm not the one who introduced the term "wide characters" here, Perl's
warning did. Outside of this error message, "wide characters" means
UTF-16 or UTF-32 to my mind.
Post by Ben Morrow
Post by Eli the Bearded
I'm guessing that my binary stream situation is what "no warnings
'utf8';" is intended to fix.
No, not at all. If you review the (W utf8) warnings in perldiag, you
will see they all to do with performing character operations on Unicode
codepoints which are not valid characters (UTF-16 surrogates, codepoints
which haven't been allocated yet, explicit non-characters like U+FFFF).
They have nothing to do with ordinary Unicode IO.
"Wide character in %s" is a "S utf8", not "W utf8" condition, and it is
not about "performing character operations on Unicode codepoints which are
not valid characters". It's a difficult error message for me to wrap my
head around what it *really* means, since perl is rather confusing about
how it handles character encodings. Any character over ord(127) is
two or more bytes in UTF-8. What makes 255 special?

=item Wide character in %s

(S utf8) Perl met a wide character (>255) when it wasn't expecting one.

My command line is UTF-8:

$ echo 'å' | od -xc
0000000 a5c3 000a
303 245 \n
0000003

So why does this happen:

$ echo 'a' | perl -Mutf8 -wne 's/a/å/;print' | od -xc
0000000 0ae5
345 \n
0000002

And why is it different for these two cases:

$ echo 'a' | perl -mutf8 -wne 's/a/å/;print' | od -xc
0000000 a5c3 000a
303 245 \n
0000003

$ echo 'a' | perl -wne 'BEGIN{use utf8;} s/a/å/;print' | od -xc
0000000 a5c3 000a
303 245 \n
0000003

$ perl -v

This is perl 5, version 14, subversion 2 (v5.14.2) built for
x86_64-linux-gnu-thread-multi (with 53 registered patches, see perl -V
for more detail)

Elijah
------
the difference between -M and -m is very slight according to perlrun
Ben Morrow
2012-10-26 02:02:39 UTC
Permalink
<snip>
Post by Eli the Bearded
Post by Ben Morrow
Post by Eli the Bearded
I've used the "while(<>) { s///g; print; }" construct to patch binary
files in the past (rename functions in compiled programs, etc). I haven't
yet needed to sub-in wide characters, but it doesn't seem unreasonable.
A binary file cannot contain 'wide characters' as such, instead it
contains some *encoding* of wide characters.
I'm not the one who introduced the term "wide characters" here, Perl's
warning did. Outside of this error message, "wide characters" means
UTF-16 or UTF-32 to my mind.
Well, that's four possible encodings right there; more if you include
UCS-2 and the various BOM options.

'Wide character', in this warning, means 'a character with ordinal
greater than 255' or alternatively 'a character which won't fit into a
single byte'.
Post by Eli the Bearded
Post by Ben Morrow
Post by Eli the Bearded
I'm guessing that my binary stream situation is what "no warnings
'utf8';" is intended to fix.
No, not at all. If you review the (W utf8) warnings in perldiag, you
will see they all to do with performing character operations on Unicode
codepoints which are not valid characters (UTF-16 surrogates, codepoints
which haven't been allocated yet, explicit non-characters like U+FFFF).
They have nothing to do with ordinary Unicode IO.
"Wide character in %s" is a "S utf8", not "W utf8" condition, and it is
not about "performing character operations on Unicode codepoints which are
not valid characters".
You're right; I missed that.
Post by Eli the Bearded
It's a difficult error message for me to wrap my
head around what it *really* means, since perl is rather confusing about
how it handles character encodings. Anything character over ord(127) is
two or more bytes in UTF-8. What makes 255 special?
You need to reread perlunicode and/or perlunitut. Basically, since perl
5.8, a string can contain one of two entirely different things:

- a sequence of Unicode characters, suitable for character
operations such as uc() and lc(), such that ord() returns a
Unicode codepoint (a 'character string');

- a sequence of uninterpreted bytes, suitable for binary operations
such as unpack(), such that ord() returns a byte value from 0 to
255 (a 'byte string').

It's up to the programmer to keep track of which of these two any given
string is supposed to represent; there have been a great many arguments
these past several years about whether or not this is a sensible model,
but it's the model we've got and we have to work with it.
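
A sketch of the two kinds of string and the explicit conversion between
them, using the Encode module:

  use Encode qw(decode encode);
  my $bytes = "\xc3\xa5";                    # byte string: two bytes
  my $chars = decode("UTF-8", $bytes);       # character string: one character, U+00E5
  print length($bytes), "\n";                # 2
  print length($chars), "\n";                # 1
  print encode("UTF-8", $chars) eq $bytes ? "round-trips\n" : "oops\n";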

Unix IO is defined in terms of strings of bytes. This means that any
string read from or written to a file must be a byte string: there is no
way to pass a 'byte' larger than 255 to write(2). Since Perl has no way
of knowing whether any given string was supposed to be a character
string or a byte string it can't enforce this; the best it can do is
warn if you pass it a string that was supposed to be bytes which
contains a value >255.

Since all IO occurs in terms of bytes, if you wish to read or write
characters you need to convert strings of bytes into strings of
characters and vice versa. This is the job of the Encode module, which
supports pretty-much every character encoding ('charset') you are likely
to need. The PerlIO :encoding layer is a shortcut which passes all reads
through Encode::decode and all writes through Encode::encode; it's
useful when you know all IO on a particular filehandle should use a
particular encoding. If you want to mix encodings, or deal with a
mixture of binary and encoded-text data, you need to handle the encoding
and decoding yourself.
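
A sketch of the :encoding layer as that per-handle shortcut, with a
hypothetical file name:

  # every print to $fh goes through Encode::encode("ISO-8859-1", ...)
  open my $fh, '>:encoding(ISO-8859-1)', 'latin1.txt' or die "open: $!";
  print {$fh} "r\x{e9}sum\x{e9}\n";          # characters out, Latin-1 bytes in the file
  close $fh or die "close: $!";
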
Post by Eli the Bearded
My command line is UTF-8:
No, your *terminal* uses UTF-8, and possibly your shell. The command
line passed to executed programs is a list of strings of bytes, because
that's how execve(2) is defined. This is an important distinction,
because what perl sees initially is bytes, and if you want it to convert
them to characters you have to ask.
Post by Eli the Bearded
$ echo 'å' | od -xc
0000000 a5c3 000a
303 245 \n
0000003
The command-line passed to 'echo' (ignoring for the moment the fact that
it's probably a builtin) looked like this:

65 63 68 6f 00 c3 a5 00 00
e c h o å

echo copied those two bytes to stdout, completely unaware they were
supposed to represent a single character, and added a byte of 0a. od,
similarly, read three bytes from stdin and printed out a representation
of them, again unaware that they were supposed to represent two
characters.
Post by Eli the Bearded
$ echo 'a' | perl -Mutf8 -wne 's/a/å/;print' | od -xc
0000000 0ae5
345 \n
0000002
'use utf8' tells Perl that the script it is running, whether it is from
-e or a file, should be decoded to characters using the UTF-8 encoding
before being passed to the Perl lexer. (In theory you can 'use encoding'
to specify a different source character encoding, but in practice that
pragma has always been buggy and is better avoided.)

[I'm going to look at this invocation, since it's a little easier to
explain

perl -Mutf8 -E'say "å"' | od -xc

but the output is exactly the same. Assume the equivalent substitutions
below.]

In this case, the complete command line passed to perl looks like

70 65 72 6c 00 2d 4d 75 74 66 38 00 2d 45 73 61 79 20 22 c3 a5 22 00 00
p e r l - M u t f 8 - E s a y " å "

The perl command-line handling code executes 'use utf8;' when it sees
the -M option, 'use feature ':5.14';' when it sees -E, then passes

73 61 79 20 22 c3 a5 22
s a y " å "

to the input-handling code. The 'use utf8' has installed a filter which
converts bytes to characters using UTF-8, so this gets translated to

0073 0061 0079 0020 0022 00e5 0022
s a y " å "

before being passed to the lexer. (I've written these with 4-digit
ordinals since they're now nominally Unicode codepoints, but as far as
perl is concerned there is no difference. In particular, notice that
they all still fit into a single byte.) The lexer converts the "å" into
a 1-character string which eventually gets passed to 'say', which
appends a newline (that is, a character with ordinal 0a) and passes it
to the STDOUT filehandle for writing.

Since this filehandle has no translation PerlIO layers on it, it assumes
that strings sent to it for writing are *byte* strings. This means that
it writes two bytes, e5 0a, to stdout. (The -x option to od then
reverses them, but that's just little-endian stupidity. -tx1 is usually
better.)
Post by Eli the Bearded
$ echo 'a' | perl -mutf8 -wne 's/a/å/;print' | od -xc
0000000 a5c3 000a
303 245 \n
0000003
-mutf8 is equivalent to 'use utf8 ()', which does precisely nothing
(well, it loads a little code into the utf8:: namespace, but it
*doesn't* affect the source encoding). This means the convert-from-UTF-8
step above is skipped, so say gets passed the 2-character string c3 a5.
Again it appends an 0a, and sends it to STDOUT; this time, the output is
three bytes.
Post by Eli the Bearded
$ echo 'a' | perl -wne 'BEGIN{use utf8;} s/a/å/;print' | od -xc
0000000 a5c3 000a
303 245 \n
0000003
The effect of 'use utf8', like 'strict' and 'warnings', is lexically
scoped. In this case the effect lasts until the end of the BEGIN block,
so the rest of the code is entirely unaffected.
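
A sketch of that lexical scoping; it assumes the script file itself is
saved as UTF-8 and run with a plain "perl script.pl":

  {
      use utf8;                    # in effect only inside this block
      print length("å"), "\n";     # 1 -- the two source bytes are decoded to one character
  }
  print length("å"), "\n";         # 2 -- outside the block the bytes stay bytes
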
Post by Eli the Bearded
the difference between -M and -m is very slight according to perlrun
It's exactly the difference between

use Foo;

and

use Foo ();

unless you include parameters (-mfoo=bar) in which case it is identical
to -M. That is: -m will not call the ->import method.

Ben
Eric Pozharski
2012-10-27 09:30:00 UTC
Permalink
with <vi7pl9-***@anubis.morrow.me.uk> Ben Morrow wrote:

*SKIP*
(In theory you can 'use encoding' to specify a different source
character encoding, but in practice that pragma has always been buggy
and is better avoided.)
Stop spreading FUD. They need

use encoding ENCNAME Filter => 1;

(what I<ENCNAME> could possibly be?) but

* "use utf8" is implicitly declared so you no longer have to "use
utf8" to "${"\x{4eba}"}++".

which pretty much defeats the purpose of C<use encoding;>.

*SKIP*
The lexer converts the "å" into a 1-character string which eventually
gets passed to 'say', which appends a newline (that is, a character
with ordinal 0a) and passes it to the STDOUT filehandle for writing.
That's not a whole story.

{2754:13} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "а" ; Dump $aa'
SV = PV(0x927a750) at 0x9295fac
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x9291a08 "\320\260"\0 [UTF8 "\x{430}"]
CUR = 2
LEN = 12
{2936:14} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "å" ; Dump $aa'
SV = PV(0x9af4750) at 0x9b0ffac
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x9b0ba08 "\303\245"\0 [UTF8 "\x{e5}"]
CUR = 2
LEN = 12

At first glance I wondered: what the heck is up with your
C<use warnings;>? Now I feel much better.

*CUT*
--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
Ben Morrow
2012-10-27 23:37:07 UTC
Permalink
Post by Eric Pozharski
(In theory you can 'use encoding' to specify a different source
character encoding, but in practice that pragma has always been buggy
and is better avoided.)
Stop spreading FUD.
That was certainly not my intention. My understanding is that 'use
encoding' is liable to cause incorrect behaviour and segfaults; see for
instance

https://rt.perl.org/rt3/Public/Bug/Display.html?id=31923
https://rt.perl.org/rt3/Public/Bug/Display.html?id=36248
https://rt.perl.org/rt3/Public/Bug/Display.html?id=37526
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2009-09/msg00669.html

Incidentally, while looking for those I also found

http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2011-03/msg00255.html

which suggests that 'use utf8' is also broken; I didn't know that until
just now, and I'm not sure I entirely believe it.

If you have newer information than me, I'd be happy to change my opinion.
Post by Eric Pozharski
They need
use encoding ENCNAME Filter => 1;
That installs a source filter; I'm not sure what the effects of that
are, but I wouldn't be surprised if you get the union of any bugs in
'use encoding' and any bugs in 'use utf8'.
Post by Eric Pozharski
(what I<ENCNAME> could possibly be?) but
* "use utf8" is implicitly declared so you no longer have to "use
utf8" to "${"\x{4eba}"}++".
I don't believe this is safe either. The pad code (which handles 'my'
variables) isn't utf8-safe, so you can't create 'my' variables with
Unicode names. (The above is a symref to a global; I don't know if the
code handling the names of globals is utf8-safe, but even if it is that
isn't terribly useful.)

Looking at the code in git, it's possible this has been fixed in 5.16; I
haven't been keeping up with core changes recently. However, this isn't
mentioned in perl5160delta, so I suspect that whatever core changes have
been made aren't considered sufficient for full utf8 identifier support.
Post by Eric Pozharski
The lexer converts the "å" into a 1-character string which eventually
gets passed to 'say', which appends a newline (that is, a character
with ordinal 0a) and passes it to the STDOUT filehandle for writing.
That's not a whole story.
{2754:13} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "а" ; Dump $aa'
SV = PV(0x927a750) at 0x9295fac
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x9291a08 "\320\260"\0 [UTF8 "\x{430}"]
CUR = 2
LEN = 12
(Note: that is not a latin lowercase 'a' in the source, but U+0430
CYRILLIC SMALL LETTER A. On my terminal they look identical, which
confused me for a moment.)

In any case, the result is exactly what I said: the string contains one
(logical) character. If you apply length() to that string it will return
1. (This character happens to be represented internally as two bytes;
that is none of your business.) What do you think I omitted from the
story?

Ben
Peter J. Holzer
2012-10-28 12:32:46 UTC
Permalink
Post by Ben Morrow
Post by Eric Pozharski
(In theory you can 'use encoding' to specify a different source
character encoding, but in practice that pragma has always been buggy
and is better avoided.)
Stop spreading FUD.
That was certainly not my intention. My understanding is that 'use
encoding' is liable to cause incorrect behaviour and segfaults; see for
instance
https://rt.perl.org/rt3/Public/Bug/Display.html?id=31923
https://rt.perl.org/rt3/Public/Bug/Display.html?id=36248
https://rt.perl.org/rt3/Public/Bug/Display.html?id=37526
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2009-09/msg00669.html
Incidentally, while looking for those I also found
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2011-03/msg00255.html
which suggests that 'use utf8' is also broken; I didn't know that until
just now, and I'm not sure I entirely believe it.
That doesn't look like a bug in "use utf8" to me, but like a bug in the
code which generates the warnings.

It doesn't help that Tom just dumped a load of gibberish into his mail
without specifying which encoding he was using. I had to guess that he
was using CP1252.

Anyway, with use utf8, the qw[] section of his program is parsed correctly as

("élite", "Ævar", "μῦθος", "mío")

In the error message each character (even those in the printable ASCII
range U+0020 ... U+007E) is "helpfully" given in hex which I agree is
... suboptimal.
Post by Ben Morrow
If you have newer information than me, I'd be happy to change my opinion.
Me too, although frankly I see no reason to use encoding even if it
works. It mixes up encoding of the source code and the I/O, which is not
a good idea, IMSHO, and my editor handles UTF-8 just fine, so I don't
see why I should write my perl scripts in a different encoding than
UTF-8. I/O can be handled explicitly by I/O layers or implicitly by
"use open".
Post by Ben Morrow
Post by Eric Pozharski
(what I<ENCNAME> could possibly be?) but
* "use utf8" is implicitly declared so you no longer have to "use
utf8" to "${"\x{4eba}"}++".
I don't believe this is safe either. The pad code (which handles 'my'
variables) isn't utf8-safe, so you can't create 'my' variables with
Unicode names. (The above is a symref to a global; I don't know if the
code handling the names of globals is utf8-safe, but even if it is that
isn't terribly useful.)
I'm puzzled about this part of the documentation, too. Why would anybody
want to use a variable ${"\x{4eba}"} ? I am guessing that the variable
is really supposed to be $人, i.e., there is a Han character in the
source code, not a symref.

Is this unsafe? I have occasionally used non-ascii characters in
variable names (mostly Greek characters in physical formulas) together
with use utf8 since 5.8.x and I never noticed a problem. (The only
"problem" I noticed is that the euro sign isn't a word character, so you
can't have a variable $amount_in_€. But then you can't have a variable
$amount_in_$ either, so I guess this is fair ;-))

hp
--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | ***@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
Eric Pozharski
2012-10-28 11:45:49 UTC
Permalink
Post by Ben Morrow
Post by Eric Pozharski
(In theory you can 'use encoding' to specify a different source
character encoding, but in practice that pragma has always been
buggy and is better avoided.)
Stop spreading FUD.
That was certainly not my intention. My understanding is that 'use
encoding' is liable to cause incorrect behaviour and segfaults; see
for instance
https://rt.perl.org/rt3/Public/Bug/Display.html?id=31923
C<use threads;> and C<use encoding 'utf8';>. Unexpected(?) edge case?
Post by Ben Morrow
https://rt.perl.org/rt3/Public/Bug/Display.html?id=36248
C<use utf8;>, C<use encoding 'utf8';>, and C<use Encode;>. Panic mode?
Post by Ben Morrow
https://rt.perl.org/rt3/Public/Bug/Display.html?id=37526
Double encoding.
Post by Ben Morrow
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2009-09/msg00669.html
Monkey wrench.
Post by Ben Morrow
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2011-03/msg00255.html
Works just as expected, see below.
Post by Ben Morrow
which suggests that 'use utf8' is also broken; I didn't know that
until just now, and I'm not sure I entirely believe it. If you have
newer information than me, I'd be happy to change my opinion.
Probably that's not safe to state things like this below unprivately,
but:

not perl->isa( 'fool-proof' ) or die

(I'm trying to speak Perl here). IOW, Perl has an entry level. And
it's quite high. And one of the steps to get over it is the ability to read. I
don't mean the ability to read code, I mean the ability to RTFM. The three former
examples are clearly (for me) of that type. I have a couple of scripts
that have C<use encoding 'utf8';> (I<STDIN>, I<STDOUT>, and quote-like
operators) and C<use open ':locale';> (other filehandles, quite risky,
but those scripts are not for distribution thus I'm safe here). Those
scripts were started 4.5 years ago (according to logs, I can't believe
it was sarge (thus 5.8.8?)). Anyway, 5.10.0, 5.10.1, 5.14.2 -- because
I've made those right. Because I've read carefully, all the unicode
documentation that comes with perl (namely perluniitro.pod,
perlunicode.pod, utf8.pod, encoding.pm, Encdoe.pm (perlunifaq.pod,
perlunitut, and perluniprops.pod weren't distributed five years ago,
should read them too)). I've found that I don't need utf8.pm (those
scripts and modules should be us-ascii anyway).

I feel utf8-safe because, first of all, I can read. If I can, they can
too, can't they? Apparently, they don't, maybe because they can't.
Post by Ben Morrow
Post by Eric Pozharski
They need
use encoding ENCNAME Filter => 1;
That installs a source filter; I'm not sure what the effects of that
are, but I wouldn't be surprised if you get the union of any bugs in
'use encoding' and any bugs in 'use utf8'.
Post by Eric Pozharski
(what I<ENCNAME> could possibly be?) but
* "use utf8" is implicitly declared so you no longer have to
"use utf8" to "${"\x{4eba}"}++".
BTW, I've checked. There's no C<use utf8>. It's B<require utf8> and no
import. A whole different story.
Post by Ben Morrow
I don't believe this is safe either. The pad code (which handles 'my'
variables) isn't utf8-safe, so you can't create 'my' variables with
Unicode names. (The above is a symref to a global; I don't know if the
code handling the names of globals is utf8-safe, but even if it is
that isn't terribly useful.)
Let me rephrase one famous proverb:

If an answer you've got is 'filter', you're probably asking the wrong
question.

*SKIP*
Post by Ben Morrow
In any case, the result is exactly what I said: the string contains
one (logical) character. If you apply length() to that string it will
return 1. (This character happens to be represented internally as two
bytes; that is none of your business.) What do you think I omitted
from the story?
Right. And that's closely related to your last example (the one about
utf8.pm being unsafe). I've tried to make a point that *characters*
from different *ranges* happen to be of different length in bytes.

{9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
SV = PV(0xa06f750) at 0xa08afac
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
CUR = 5
LEN = 12

*Characters* of latin1 aren't wide (even if they are characters, they
are still one byte long)

{10406:65} [0:0]% perl -Mutf8 -wle 'print "[à]"'
[à]
{10415:66} [0:0]% perl -Mutf8 -wle 'print "[а]"'
Wide character in print at -e line 1.
[а]

I must have added those brackets, because:

{10421:67} [0:0]% perl -wle 'print "à"' # no problems, just a byte
à
{10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops

{10520:69} [0:0]% perl -Mutf8 -wle 'print "à "' # stupid
à
{10522:70} [0:0]% perl -Mutf8 -wle 'print "\x{E0}"' # oops

{10532:71} [0:0]% perl -Mutf8 -wle 'print "\x{E0} "' # stupid
à
{10602:79} [0:0]% perl -Mutf8 -wle 'print "\N{U+00E0}"' # oops

{10608:80} [0:0]% perl -Mutf8 -wle 'print "\N{U+00E0} "' # stupid
à

But watch this:

{10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
à
{10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops

{10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hooray!
à

Except the middle one (which I should think about), I think encoding.pm
wins again.
--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
Peter J. Holzer
2012-10-28 20:06:52 UTC
Permalink
Post by Eric Pozharski
Post by Ben Morrow
In any case, the result is exactly what I said: the string contains
one (logical) character. If you apply length() to that string it will
return 1. (This character happens to be represented internally as two
bytes; that is none of your business.) What do you think I omitted
from the story?
Right. And that's closely related to your last example (the one about
utf8.pm being unsafe). I've tried to make a point that *characters*
from different *ranges* happen to be of different length in bytes.
Then maybe you shouldn't have chosen two examples which are both the
same length in bytes.
Post by Eric Pozharski
{9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
SV = PV(0xa06f750) at 0xa08afac
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
CUR = 5
LEN = 12
*Characters* of latin1 aren't wide (even if they are characters, they
are still one byte long)
In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as cyrillic
characters. Your example shows this: "à" (LATIN SMALL LETTER A WITH
GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A) is "\320\260".

But this isn't what "wide character" in the warning means. In the
warning, it means a string element with a code > 255. For string
elements <= 255, perl can assume that they are supposed to be bytes, not
characters, when you try to write them to a byte stream. It could be
argued that this assumption is a mistake, but for better or worse we are
stuck with that decision. But for string elements > 255, that just isn't
possible. It can't be a byte, it must be a character, and to convert a
character into bytes, the encoding needs to be known.
Post by Eric Pozharski
{10406:65} [0:0]% perl -Mutf8 -wle 'print "[à]"'
[à]
{10415:66} [0:0]% perl -Mutf8 -wle 'print "[а]"'
Wide character in print at -e line 1.
[а]
... as these examples demonstrate.
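
A sketch of that boundary with two bare one-liners, each writing to a
raw, layer-less STDOUT:

  perl -e 'print chr(0xE5)'       # 229 <= 255: written as a single byte, no warning
  perl -e 'print chr(0x2639)'     # > 255: "Wide character in print at -e line 1."
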
Post by Eric Pozharski
{10421:67} [0:0]% perl -wle 'print "à"' # no problmes, just a byte
à
Assuming you use a UTF-8 terminal here: No, this isn't one byte. These are
two bytes, \303\240.
Post by Eric Pozharski
{10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops
Now you have one character (because of -Mutf8, the two bytes \303\240
are decoded to the character U+00e0), but you are trying to write it to a byte
stream without specifying the encoding. Perl writes the single byte
0xE0, which your UTF-8 terminal cannot interpret. (Mine displays a
question mark in a dark circle)
Post by Eric Pozharski
{10520:69} [0:0]% perl -Mutf8 -wle 'print "à "' # stupid
à
Huh? What version of Perl on what platform is this? The string is
"\x{E0}\x{20}". All elements of the string are <= 255, so the string is
output as a byte string. This isn't valid UTF-8, and your terminal
shouldn't be able to interpret it as "à" anymore than it was able to
interpret "\x{E0}\x{0A}" above.

[more equivalent examples snipped]

If your program does character I/O, you *need* to specify the encoding
of the I/O channels. For one-liners, the -C option is sufficient:

hrunkner:~/tmp 20:40 :-) 195% perl -CS -Mutf8 -wle 'print "à"'
à

For scripts you would use binmode or 'use open'.
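
A hedged script-sized equivalent of -CS, using 'use open' (assuming the
terminal really does expect UTF-8):

  use utf8;                               # the source is UTF-8
  use open ':std', ':encoding(UTF-8)';    # STDIN/STDOUT/STDERR get the encoding layer
  print "à ☺\n";                          # both characters come out correctly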

(Didn't you pride yourself on your ability to read? This is documented
and it has been repeated by several people in this newsgroup for years)
Post by Eric Pozharski
{10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
à
{10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops

{10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hooray!
à
Except the middle one (what I should think about), I think encoding.pm
wins again.
Excellent example, it shows exactly one of the pitfalls of using "use
encoding". One would expect "\x{E0}" to result in a string with a single
element with code 0xE0. At least you seem to have expected it, and for a
moment I was confused, too. But 'use encoding' doesn't work that way. It
was designed to convert string constants from the specified encoding to
Unicode, so it tries to interpret "\x{E0}" as UTF-8, but of course this
isn't valid UTF-8. So you get "\x{FFFD}" instead (U+FFFD is the
REPLACEMENT CHARACTER used to mark invalid characters).

If you use a correct UTF-8 encoded string, it works as expected (well,
expected by somebody who's read the documentation and remembers that
little pitfall):

hrunkner:~/tmp 20:47 :-) 197% perl -Mencoding=utf8 -wle 'print "\303\240"'
à


For one-liners like this, using the same encoding for the script and the
I/O is useful ("-CS -Mutf8" is even shorter than "-Mencoding=utf8", but
maybe you don't have a UTF-8 capable terminal). However, for real
programs, I think tying the encoding of the source code to the encoding
of I/O-streams the script is supposed to handle is foolish. My scripts
are always encoded in UTF-8, but they frequently have to handle files in
CP-1252.

hp
--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | ***@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
Helmut Richter
2012-10-28 20:57:20 UTC
Permalink
Post by Peter J. Holzer
But this isn't what "wide character" in the warning means. In the
warning, it means a string element with a code > 255. For string
elements <= 255, perl can assume that they are supposed to be bytes, not
characters, when you try to write them to a byte stream.
You have to distinguish between what may happen to work sometimes or always,
and what is part of the interface and *should* work. If it does not work in
the latter case, it is an error; if it does not work in the former case you
have made a bad guess about how it is implemented. So do not rely on your
guesses but use the documented interface.

There are two ways to use the interface:

- You regard all strings, both during the run of the script and on
input/output, as bytes (=groups of 8 bits) without any meaning as
characters (=member of an alphabet for writing text). This will work if
all devices, and the script itself, use the same character code, which
must not have bytes with value >255. This *can* be a viable option if
you can either guarantee this restriction, or if your bytes do not
have a character meaning.

In this case, strings in the program text with characters that are not
contained in the common character code are meaningless, and will yield
errors.

- You regard the data during the run of the script as sequences of
characters, and the data on input and output as sequences of bytes. Then
you have to convert bytes into textstrings on input and textstrings into
bytes on output -- in both cases you can specify the conversion once and
for all for each file. This is the only working way when the restrictions
of the last item are not fulfilled.

In this case, strings in the program text may contain any characters
whether or not they are representable in the codes used in input/output.
The "use utf8" pragma tells perl to interpret the program text itself as a
sequence of UTF-8 characters which will make a difference only for literal
strings in the program.

A third way does *not* work:

- You do input and output on strings of bytes and assume that perl will guess
correctly what characters these bytes represent in your opinion.
Unfortunately that will *often* work (because perl assumes ISO-8859-1 on
many systems which may be what you are actually using), but it will also
often break (if you use other codes, or if you mix strings which happen to
contain only ISO-8859-1 characters with strings that also contain other
characters). But if it breaks, it is your fault: it is nowhere guaranteed
how text strings map to byte strings and vice versa, the sole exception
being the documented encode and decode functions.

This is fairly well explained in
http://search.cpan.org/~dom/perl-5.14.3/pod/perlunitut.pod
--
Helmut Richter
Rainer Weikusat
2012-10-28 21:39:53 UTC
Permalink
Helmut Richter <hhr-***@web.de> writes:

[...]
Post by Helmut Richter
- You regard the data during the run of the script as sequences of
characters, and the data on input and output as sequences of bytes. Then
you have to convert bytes into textstrings on input and textstrings into
bytes on output -- in both cases you can specify the conversion once and
for all for each file. This is the only working way when the restrictions
of the last item are not fulfilled.
This is the only 'working way' when the assumption that perl uses a
'secret mystery encoding' different from any other encoding known to
man is taken for granted. But this assumption is wrong and the concept
makes precious little sense since it requires an additional copy of
all input data and all output data (possibly, times the number of perl
processes in a 'long' pipeline since not even perl is supposed to be
able to talk to perl natively). Considering the way perl is
implemented, this is a real problem for users of Windows (and Mac OS
X, AFAIK) because in both cases, perl uses something other than the
native encoding. That some people would like to inflict the same
damage onto users of platforms where the problem doesn't exist is
certainly very laudable but IMNSHO, best ignored.
Peter J. Holzer
2012-10-29 09:43:44 UTC
Permalink
Post by Rainer Weikusat
Post by Helmut Richter
- You regard the data during the run of the script as sequences of
characters, and the data on input and output as sequences of bytes. Then
you have to convert bytes into textstrings on input and textstrings into
bytes on output -- in both cases you can specify the conversion once and
for all for each file. This is the only working way when the restrictions
of the last item are not fulfilled.
This is the only 'working way' when the assumption that perl uses a
'secret mystery encoding' different from any other encoding known to
man is taken for granted.
The encoding isn't a 'secret mystery'. It is well documented that it
is Unicode.

perl -CS -MEncode -E 'say ord(Encode::decode("utf-8", "\xE2\x82\xAC"))'

is defined to print "8364".

It is a 'secret mystery' (wink, wink, nudge, nudge) how this is
represented internally, just like the representation of numbers is a
'secret mystery'.

However, for most programs you don't have to know that Perl character
strings are Unicode strings. It is sufficient to know that Perl has the
concept of a "character" which is different from the concept of a
"byte", that a character has certain properties (e.g. it can be a letter
or an ideograph, it may have an associated uppercase or lowercase
letter, ...) and to convert a sequence of characters into a sequence of
bytes you have to encode them. Whether the Euro sign has the numeric
code 8364 or 4711 is rarely significant.
Post by Rainer Weikusat
But this assumption is wrong and the concept
makes precious little sense since it requires an additional copy of
all input data and all output data
This is an unsubstantiated claim. It is possible that the current
implementation of I/O layers does indeed perform an additional copy (I
haven't checked the code), but this is certainly not required.

And even if it is true, it is almost certainly lost in the noise as soon
as your script does something more complex than "cat" with your input -
almost any string operation in perl performs a copy.
Post by Rainer Weikusat
(possibly, times the number of perl processes in a 'long' pipeline
since not even perl is supposed to be able to talk to perl natively).
Considering the way perl is implemented, this is a real problem for
users of Windows (and Mac OS X, AFAIK) because in both cases, perl
uses something other than the native encoding.
Why is this a real problem?
Post by Rainer Weikusat
That some people would like to inflict the same damage onto users of
platforms where the problem doesn't exist is certainly very laudable
but IMNSHO, best ignored.
Whatever "the problem" may be. The problem that characters and bytes
aren't the same and that most programmers prefer to think of text as a
sequence of characters, not a sequence of bytes exists on every
platform.

hp
--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | ***@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
Helmut Richter
2012-10-29 10:47:32 UTC
Permalink
Post by Peter J. Holzer
However, for most programs you don't have to know that Perl character
strings are Unicode strings.
Are they? They are strings of characters that are contained in Unicode. They
are not necessarily internally encoded as Unicode. People run into problems
when they make assumptions about the way they are implemented. I would have
worded:

For all programs you must not pretend to know that Perl character strings
are Unicode strings.

It may be true, it may be false -- either way, it is not part of the
documented interface. Hence, it must not be used even if it be true.
--
Helmut Richter
Rainer Weikusat
2012-10-29 13:40:07 UTC
Permalink
Post by Helmut Richter
Post by Peter J. Holzer
However, for most programs you don't have to know that Perl character
strings are Unicode strings.
[...]
Post by Helmut Richter
For all programs you must not pretend to know that Perl character strings
are Unicode strings.
It may be true, it may be false -- either way, it is not part of the
documented interface. Hence, it must not be used even if it be true.
At best, that's a part of the interface which was meanwhile
'undocumented' because the implementation choices which were made
weren't the implementation choices that should have been made,
according to the opinions of some people who didn't make the
decision. But independently of that, inventing the 'Perl is an
island!' character encoding - no matter how hypothetical - remains a
stupid idea. Perl is not an island and it has to interact with code
written in other programming languages, although maybe not in the
fantasy universe of people who implement 'wepp fremmwuergs' and
'ohpscheckt suesstemms' who are generally not troubled by the minor
consideration of making their stuff do something actually useful in
the real world. Consequently, Perl should be compatible with some
existing convention, ideally, with all existing 'local'
conventions. If this isn't possible, the next best choice is not 'make
everyone bleed'.
Helmut Richter
2012-10-30 16:40:50 UTC
Permalink
Post by Rainer Weikusat
But indepedently of that, inventing the 'Perl is an
island!' character encoding - no matter how hypothetical - remains a
stupid idea.
Every program is an "island" within its code. No matter what I use, I do not
normally know the internals, and if I happen to know them I should not use my
knowledge because the internals may change at any time.

Perl is not an island as far as interaction with other programs is
concerned. It is documented how to read and write byte data, and how to read
and write character data whose code and encoding is known. If desired, it is
also not really difficult to write code that tries to guess an unknown code --
with all the pitfalls such a behaviour entails.

There is one interface decision perl has made: it does not by default use the
locale settings to determine the default code and encoding, rather it requires
that these be specified in the script. Opinions may be divided; I like this
decision because my experience is that often the locale settings appear to be
randomly uncorrelated to the codes actually used.

The implementation decisions that are not part of the interface, in particular
the internal representation of values of different types including strings,
concern future developers but not users. If perl decides to store characters
internally as a 37-bit EBCDIC enhancement, it does not really bother me as
long as the programm still interacts correctly with the outside world in
standardised codes.

--
Helmut Richter
Ben Morrow
2012-11-05 19:10:46 UTC
Permalink
Post by Helmut Richter
Post by Peter J. Holzer
However, for most programs you don't have to know that Perl character
strings are Unicode strings.
Are they? They are strings of characters that are contained in Unicode. They
are not necessarily internally encoded as Unicode. People run into problems
when they make assumptions about the way they are implemented. I would have worded:
For all programs you must not pretend to know that Perl character strings
are Unicode strings.
Perl strings are (nearly) always *Unicode* strings; this is not the same
as saying they are internally represented in any particular encoding of
Unicode. uc(chr(97)) eq chr(65), for instance, and uc(chr(0x450)) eq
chr(0x400). What you must (pretend to) not know is that they are
sometimes internally represented in UTF-8 and sometimes in ISO8859-1; in
principle the internal representation could be changed to UCS-4 or
something else sane without breaking anything.

(In practice it would break XS, so it probably won't happen, which is a
shame. UTF-8 was a very bad choice of internal representation, in
retrospect, though it seemed to make sense at the time. It makes a great
many internal operations much more complicated than they need to be,
because you can no longer index into an array to find a particular
character in the string.)

Where it gets a bit sketchy is when you are dealing with characters >127
in strings which happen to be internally encoded as ISO8859-1 rather
than UTF-8 (which you shouldn't need to know about). For versions of
perl <5.12, uc(chr(255)) returns chr(255), which is incorrect, because
the correct answer is chr(376) which would require changing the internal
representation. This applies to all characters >127, even those where
the case-change exists in ISO8859-1, but only if the string happened to
be internally represented in ISO8859-1.

This bug was fixed in 5.12, but for the sake of compatibility the fix is
(so far) only activated in the scope of 'use feature "unicode_strings"',
which is switched on by 'use 5.012'. As a result you may see code
mucking about with utf8::is_utf8 and so on, in an attempt to work around
this bug; a better fix is to upgrade to 5.12 and put 'use 5.012;' at the
top of each file.
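
A sketch of the difference, with the caveat that the exact behaviour
depends on the perl version (5.12 or later for the feature):

  perl -e 'print ord(uc(chr(255))), "\n"'    # old semantics: 255
  perl -e 'use feature "unicode_strings"; print ord(uc(chr(255))), "\n"'
                                             # 376, LATIN CAPITAL LETTER Y WITH DIAERESIS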

If you need to manipulate strings of bytes, for instance for IO, you
simply represent a byte $b by the character chr($b), where 0 <= $b <=
255. If you attempt to do IO to a raw filehandle with a character with
an ordinal >255, you get a warning and perl does something stupid; I
agree with Peter it would be better for it to die in this case.

Ben
Rainer Weikusat
2012-11-05 22:15:07 UTC
Permalink
Ben Morrow <***@morrow.me.uk> writes:

[...]
Post by Ben Morrow
(In practice it would break XS, so it probably won't happen, which is a
shame. UTF-8 was a very bad choice of internal representation, in
retrospect, though it seemed to make sense at the time. It makes a great
many internal operations much more complicated than they need to be,
because you can no longer index into an array to find a particular
character in the string.)
The only way to provide that is to store all characters as integer
values large enough to encompass all conceivably existing Unicode
codepoints. Otherwise, you're going to have multibyte characters and
consequently, 'indexing into the array to find a particular character
in the string' won't work anymore.

Independently of this, the UTF-8 encoding was designed to provide a
representation of the Unicode character set which was backwards
compatible with 'ASCII-based systems' and it is not only a widely
supported internet standard (http://tools.ietf.org/html/rfc3629) and
the method of choice for dealing with 'Unicode' for UNIX(*) and
similar system but formed the 'basic character encoding' of complete
operating systems as early as 1992
(http://plan9.bell-labs.com/plan9/about.html). As such, supporting it
natively in a programming language closely associated with UNIX(*), at
least at that time, should have been pretty much a no brainer. "But
Microsoft did it difffentely !!1" is the ultimate argument for some
people but - thankfully - these didn't get to piss into Perl until
very much later and thus, the damage they can still do is mostly
limited to 'propaganda'.
Rainer Weikusat
2012-11-05 22:22:07 UTC
Permalink
Post by Rainer Weikusat
[...]
Post by Ben Morrow
(In practice it would break XS, so it probably won't happen, which is a
shame. UTF-8 was a very bad choice of internal representation, in
retrospect, though it seemed to make sense at the time. It makes a great
many internal operations much more complicated than they need to be,
because you can no longer index into an array to find a particular
character in the string.)
The only way to provide that is to store all characters as integer
values large enough to encompass all conceivably existing Unicode
codepoints. Otherwise, you're going to have multibyte characters and
consequently, 'indexing into the array to find a particular character
in the string' won't work anymore.
I would also like to point out that this is an inherent deficiency of
the idea to represent all glyphs of all conceivable scripts with a
single encoding scheme, and that the practical consequences of that are
mostly 'anything which restricts itself to the US typewriter character
set is fine' (and everyone else is going to have no end of problems
because of that).

I actually stopped using German characters like a-umlaut years ago
exactly because of this.
Ben Morrow
2012-11-05 22:48:12 UTC
Permalink
Post by Rainer Weikusat
Post by Ben Morrow
(In practice it would break XS, so it probably won't happen, which is a
shame. UTF-8 was a very bad choice of internal representation, in
retrospect, though it seemed to make sense at the time. It makes a great
many internal operations much more complicated than they need to be,
because you can no longer index into an array to find a particular
character in the string.)
The only way to provide that is to store all characters as integer
values large enough to encompass all conceivably existing Unicode
codepoints. Otherwise, you're going to have multibyte characters and
consequently, 'indexing into the array to find a particular character
in the string' won't work anymore.
Yes. That's called 'a 32-bit int', and is the standard wchar_t C
representation of Unicode. A sensible alternative would be a 1/2/4-byte
upgrade scheme somewhat similar to the current Perl scheme, but with all
the alternatives being constant width; a smarter alternative would be to
represent a string as a series of pieces, each of which could make a
different choice (and, potentially, some of which could be shared or CoW
with other strings).
Post by Rainer Weikusat
Independently of this, the UTF-8 encoding was designed to provide a
representation of the Unicode character set which was backwards
compatible with 'ASCII-based systems' and it is not only a widely
supported internet standard (http://tools.ietf.org/html/rfc3629) and
the method of choice for dealing with 'Unicode' for UNIX(*) and
similar system but formed the 'basic character encoding' of complete
operating systems as early as 1992
(http://plan9.bell-labs.com/plan9/about.html).
There is a very big difference between a sensible *internal*
representation and a sensible *external* representation. UTF-8 was
designed as an external representation; it's extremely good (in my
narrow, Western, English-speaking opinion) for that purpose. It was
never intended to be used internally, except by applications which
didn't attempt to decode it to characters.

But then, you've never really understood the concept of abstraction,
have you?
Post by Rainer Weikusat
As such, supporting it
natively in a programming language closely associated with UNIX(*), at
least at that time, should have been pretty much a no-brainer. "But
Microsoft did it differently !!1" is the ultimate argument for some
people but - thankfully - these didn't get to piss into Perl until
very much later and thus, the damage they can still do is mostly
limited to 'propaganda'.
I don't know what Win32's internal representation is (I suspect 32bit
int, the same as Unix), but its default external representation is
UTF-16, which is about the most braindead concoction anyone has ever
come up with. The only possible justification for its existence is
backwards-compatibility with systems which started implementing Unicode
before it was finished, and even then I'm *certain* they could have made
it less grotesquely ugly if they'd tried (a UTF-8-like scheme, for
instance).

So no, my comments about the unsuitability of UTF-8 as an internal
encoding have nothing whatever to do with Win32, and everything to do
with actually understanding how string operations work at the machine
level.

Ben
Rainer Weikusat
2012-11-05 23:30:13 UTC
Permalink
[...]
Post by Ben Morrow
But then, you've never really understood the concept of abstraction,
have you?
This mostly means that I cannot possibly be a self-conscious human
being capable of interacting with the world in some kind of
'intelligent' (meaning, influencing it such that it changes according
to some desired outcome) way but must be some kind of lifeform below
the level of a dog or a bird. Yet, I'm capable of using written
language to communicate with you (with some difficulties), using a
computer connected to 'the internet' in order to run a program on a
completely different computer 9 miles away from my present location,
utilizing a server I have to pay for once a year from my bank account
which resides (AFAIK) in Berlin.

How can this possibly be?
Rainer Weikusat
2012-11-06 07:39:17 UTC
Permalink
[...]
Post by Ben Morrow
Post by Rainer Weikusat
The only way to provide that is to store all characters as integer
values large enough to encompass all conceivably existing Unicode
codepoints. Otherwise, you're going to have multibyte characters and
consequently, 'indexing into the array to find a particular character
in the string' won't work anymore.
Yes. That's called 'a 32-bit int', and is the standard wchar_t C
representation of Unicode. A sensible alternative would be a 1/2/4-byte
upgrade scheme somewhat similar to the current Perl scheme, but with all
the alternatives being constant width; a smarter alternative would be to
represent a string as a series of pieces, each of which could make a
different choice (and, potentially, some of which could be shared or CoW
with other strings).
With the most naive implementation, this would mean that moving 100G
of text data through Perl (and that's a small number for some jobs I'm
thinking of) requires copying 400G of data into Perl and 400G out of
it. What you consider 'smart' would only penalize people who actually
used non-ASCII-scripts to some (possibly serious) degree.
Post by Ben Morrow
Post by Rainer Weikusat
Independently of this, the UTF-8 encoding was designed to be a
representation of the Unicode character set which was backwards
compatible with 'ASCII-based systems', and it is not only a widely
supported internet standard (http://tools.ietf.org/html/rfc3629) and
the method of choice for dealing with 'Unicode' on UNIX(*) and
similar systems but formed the 'basic character encoding' of complete
operating systems as early as 1992
(http://plan9.bell-labs.com/plan9/about.html).
There is a very big difference between a sensible *internal*
representation and a sensible *external* representation.
This notion of 'internal' and 'external' representation is nonsense:
In order to cooperate sensibly, a number of different processes need
to use the same 'representation' for text data to avoid repeated
decoding and encoding whenever data needs to cross a process
boundary. And for 'external representation', using a proper
compression algorithm for data which doesn't need to be usable in its
stored form will yield better results than any 'encoding scheme'
biased towards making the important things (deal with US-english texts)
simple and resting comfortably on the notion that everything else is
someone else's problem.
Rainer Weikusat
2012-11-06 20:21:14 UTC
Permalink
Post by Rainer Weikusat
[...]
Post by Ben Morrow
Post by Rainer Weikusat
The only way to provide that is to store all characters as integer
values large enough to encompass all conceivably existing Unicode
codepoints. Otherwise, you're going to have multibyte characters and
consequently, 'indexing into the array to find a particular character
in the string' won't work anymore.
Yes. That's called 'a 32-bit int', and is the standard wchar_t C
representation of Unicode.
[...]
Post by Rainer Weikusat
With the most naive implementation, this would mean that moving 100G
of text data through Perl (and that's a small number for some jobs I'm
thinking of) requires copying 400G of data into Perl and 400G out of
it.
And - of course - this still wouldn't help since a 'character'
as it appears in some script doesn't necessarily map 1:1 to a Unicode
codepoint. E.g., the German a-umlaut can either be represented as the
ISO-8859-1 code for it (IIRC) or as 'a' followed by a 'combining
diaeresis' (and the policy of the Unicode consortium is actually to avoid
adding more 'precombined characters' in favor of 'grapheme
construction sequences', at least, that's what it was in 2005, when I
last had a closer look at this).
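For illustration, a minimal Perl sketch (using the core Unicode::Normalize
module) of the two representations of a-umlaut and how normalization
reconciles them; the values in the comments are the expected results:

#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{E4}";    # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
my $combining   = "a\x{308}";  # 'a' followed by U+0308 COMBINING DIAERESIS

print length($precomposed), "\n";                        # 1
print length($combining), "\n";                          # 2
print $precomposed eq $combining ? "eq" : "ne", "\n";    # ne
# after normalizing both to the composed form they compare equal
print NFC($precomposed) eq NFC($combining) ? "eq" : "ne", "\n";   # eq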
Peter J. Holzer
2012-11-06 21:27:21 UTC
Permalink
Post by Ben Morrow
As such, supporting it natively in a programming language closely
associated with UNIX(*), at least at that time, should have been
pretty much a no-brainer. "But Microsoft did it differently !!1" is
the ultimate argument for some people but - thankfully - these didn't
get to piss into Perl until very much later and thus, the damage they
can still do is mostly limited to 'propaganda'.
I don't know what Win32's internal representation is (I suspect 32bit
int, the same as Unix), but its default external representation is
UTF-16, which is about the most braindead concoction anyone has ever
come up with.
I guess you haven't seen Punycode ;-) [There seems to be no "barf"
emoticon in Unicode - I'm disappointed]
Post by Ben Morrow
The only possible justification for its existence is
backwards-compatibility with systems which started implementing
Unicode before it was finished,
What do you mean by "finished"? There is a new version of the Unicode
standard about once per year, so it probably won't be "finished" as long
as the unicode consortium exists.

Unicode was originally intended to be a 16 bit code, and Unicode 1.0
reflected this: It was 16 bit only and there was no intention to expand
it. That was only added in 2.0, about 4 years later (and at that time it
was theoretical: The first characters outside of the BMP were defined in
Unicode 3.1 in 2001, 9 years after the first release).

So of course anybody who implemented Unicode between 1992 and 1996
implemented it as a 16 bit code, because that was what the standard
said. Those early adopters include Plan 9, Windows NT, and Java.
Post by Ben Morrow
and even then I'm *certain* they could have made it less grotesquely
ugly if they'd tried (a UTF-8-like scheme, for instance).
UTF-16 has a few things in common with UTF-8:

* both are backward compatible with an existing shorter encoding
(UTF-8: US-ASCII, UTF-16: UCS-2)
* both are variable width
* both are self-terminating
* Both use some high bits to distinguish between a single unit (8 resp.
16 bits), the first unit and subsequent unit(s)

The main differences are

* UTF-16 is based on 16-bit units instead of bytes (well, duh!)
* There was no convenient free block at the top of the value range,
so the surrogate areas are somewhere in the middle.
* and therefore ordering isn't preserved (but that wouldn't be
meaningful anyway)

The main problem I have with UTF-16 is of a psychological nature: It is
extremely tempting to assume that it's a constant-width encoding because
"nobody uses those funky characters above U+FFFF anyway". Basically the
"all the world uses US-ASCII" trap reloaded.

hp
Ben Morrow
2012-11-06 23:26:12 UTC
Permalink
Post by Peter J. Holzer
Post by Ben Morrow
I don't know what Win32's internal representation is (I suspect 32bit
int, the same as Unix), but its default external representation is
UTF-16, which is about the most braindead concoction anyone has ever
come up with.
I guess you haven't seen Punycode ;-) [There seems to be no "barf"
emoticon in Unicode - I'm disappointed]
Oh, God, I'd forgotten about that. Thank you so very much for reminding
me. (And Google Translate says U+6D92 is Chinese for 'vomit'; will that
do?)
Post by Peter J. Holzer
Post by Ben Morrow
The only possible justification for its existence is
backwards-compatibility with systems which started implementing
Unicode before it was finished,
What do you mean by "finished"? There is a new version of the Unicode
standard about once per year, so it probably won't be "finished" as long
as the unicode consortium exists.
Unicode was originally intended to be a 16 bit code, and Unicode 1.0
reflected this: It was 16 bit only and there was no intention to expand
it. That was only added in 2.0, about 4 years later (and at that time it
was theoretical: The first characters outside of the BMP were defined in
Unicode 3.1 in 2001, 9 years after the first release).
So of course anybody who implemented Unicode between 1992 and 1996
implemented it as a 16 bit code, because that was what the standard
said. Those early adopters include Plan 9, Windows NT, and Java.
Yeah, fair enough, I suppose. It seems obvious in hindsight that 16 bits
weren't going to be enough, but maybe that isn't fair.
Post by Peter J. Holzer
Post by Ben Morrow
and even then I'm *certain* they could have made it less grotesquely
ugly if they'd tried (a UTF-8-like scheme, for instance).
* both are backward compatible with an existing shorter encoding
(UTF-8: US-ASCII, UTF-16: UCS-2)
* both are variable width
* both are self-terminating
* Both use some high bits to distinguish between a single unit (8 resp.
16 bits), the first unit and subsequent unit(s)
The main differences are
* UTF-16 is based on 16-bit units instead of bytes (well, duh!)
Which is one of its major problems: it has all the disadvantages of both
multibyte and wide encodings.
Post by Peter J. Holzer
* There was no convenient free block at the top of the value range,
so the surrogate areas are somewhere in the middle.
* and therefore ordering isn't preserved (but that wouldn't be
meaningful anyway)
The main problem I have with UTF-16 is of a psychological nature: It is
extremely tempting to assume that it's a constant-width encoding because
"nobody uses those funky characters above U+FFFF anyway". Basically the
"all the world uses US-ASCII" trap reloaded.
The main problem *I* have is the fact the surrogates are allocated out
of the Unicode character space, so everyone doing anything with Unicode
has to take account of them, even if they won't ever be touching UTF-16
data. UTF-8 doesn't do that: it has magic bits indicating the
variable-length sections, but they are kept away from the data bits
representing the actual characters encoded.

The same could have been done with UTF-16. If I'm reading the charts
right, Unicode 1.1.5 (the last version before the change) allocated
characters from 0000-9FA5 and from F900-FFFF, which leaves Axxx-Exxx
free to represent multi-word characters. So, for instance, they could
have used the following scheme: A word matching one of

0xxxxxxxxxxxxxxx
1001xxxxxxxxxxxx
1111xxxxxxxxxxxx

is a single-word character. Other characters are represented as two
words, encoded as

101ppppphhhhhhhh 110pppppllllllll

which represents the 26-bit character

pppppppppphhhhhhhhllllllll

I know that at that point they were intending to extend the character
set to 31 bits, but IMHO reducing that to 26 would have been a lesser
evil than stuffing a whole lot of encoding rubbish into the application-
visible character set. Especially given (hindsight, again) that they
were going to eventually reduce the character range to 21 bits anyway.
(The scheme above could be made more implementation-efficient by
reducing the plane by two more bits, leaving byte-shifts but no
bit-shifts.)
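To make the bit layout above concrete, a toy Perl encoder for this
hypothetical scheme (the layout is Ben's sketched proposal, not any real
encoding):

#!/usr/bin/perl
use strict;
use warnings;

# single word : 0xxxxxxxxxxxxxxx, 1001xxxxxxxxxxxx or 1111xxxxxxxxxxxx
# double word : 101ppppp hhhhhhhh  110ppppp llllllll
#               encoding the 26-bit value pppppppppp hhhhhhhh llllllll
sub encode_words {
    my ($cp) = @_;
    die "code point exceeds 26 bits\n" if $cp > 0x3FFFFFF;
    return ($cp) if $cp <= 0x7FFF
                 || ($cp >= 0x9000 && $cp <= 0x9FFF)
                 || ($cp >= 0xF000 && $cp <= 0xFFFF);
    my $p = ($cp >> 16) & 0x3FF;    # top 10 bits
    my $h = ($cp >>  8) & 0xFF;
    my $l =  $cp        & 0xFF;
    return (0xA000 | (($p >> 5)   << 8) | $h,
            0xC000 | (($p & 0x1F) << 8) | $l);
}

printf "%04X\n",      encode_words(0x0411);     # 0411 (one word)
printf "%04X %04X\n", encode_words(0x1F42A);    # A0F4 C12A (two words)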

Meh.

Ben
Rainer Weikusat
2012-11-07 10:51:40 UTC
Permalink
[...]
Post by Ben Morrow
Post by Peter J. Holzer
Unicode was originally intended to be a 16 bit code, and Unicode 1.0
reflected this: It was 16 bit only and there was no intention to expand
it. That was only added in 2.0, about 4 years later (and at that time it
was theoretical: The first characters outside of the BMP were defined in
Unicode 3.1 in 2001, 9 years after the first release).
So of course anybody who implemented Unicode between 1992 and 1996
implemented it as a 16 bit code, because that was what the standard
said. Those early adopters include Plan 9, Windows NT, and Java.
Yeah, fair enough, I suppose. It seems obvious in hindsight that 16 bits
weren't going to be enough, but maybe that isn't fair.
It should have been obvious 'in foresight' that the '16 bit code' of
today will turn into a 22 bit code tomorrow, a 56 bit code a fortnight
from now and then slip back to 18.5 bit two weeks later[*] (the 0.5 bit
introduced by some guy who used to work with MPEG who transferred to the
Unicode consortium), much in the same way the W3C keeps changing the
name of HTML 4.01 strict to give the impression of development beyond
aimlessly moving in circles in the hope that - some day - someone might
choose to adopt it (web developers have shown a remarkable common sense
in this respect).

BTW, there's another aspect of the "all the world is external to perl
and doesn't matter [to us]" nonsense: perl can be embedded. E.g., I
spent a sizable part of my day yesterday writing some Perl code
supposed to run inside of postgres, as part of a UTF-8-based
database. In practice, it is possible to choose a database encoding
which can represent everything which needs to be represented in this
database and which is also compatible with Perl, making it feasible to use
it for data manipulation. In theory, that's another "Thing which must
not be done", which - in this case - simply means that avoiding Perl
for such code in favour of a language which gives its users fewer
gratuitous headaches is preferable.

[*] I keep wondering why the letter T isn't defined as 'vertical
bar' + 'combining overline' (or why A isn't 'greek delta' + 'combining
hyphen' ...)
Peter J. Holzer
2012-11-11 10:59:12 UTC
Permalink
Post by Ben Morrow
Post by Peter J. Holzer
The main problem I have with UTF-16 is of a psychological nature: It is
extremely tempting to assume that it's a constant-width encoding because
"nobody uses those funky characters above U+FFFF anyway". Basically the
"all the world uses US-ASCII" trap reloaded.
The main problem *I* have is the fact the surrogates are allocated out
of the Unicode character space, so everyone doing anything with Unicode
has to take account of them, even if they won't ever be touching UTF-16
data. UTF-8 doesn't do that: it has magic bits indicating the
variable-length sections, but they are kept away from the data bits
representing the actual characters encoded.
The same could have been done with UTF-16. If I'm reading the charts
right, Unicode 1.1.5 (the last version before the change) allocated
characters from 0000-9FA5 and from F900-FFFF, which leaves Axxx-Exxx
free to represent multi-word characters. So, for instance, they could
have used the following scheme: A word matching one of
0xxxxxxxxxxxxxxx
1001xxxxxxxxxxxx
1111xxxxxxxxxxxx
is a single-word character. Other characters are represented as two
words, encoded as
101ppppphhhhhhhh 110pppppllllllll
which represents the 26-bit character
pppppppppphhhhhhhhllllllll
That takes a huge chunk (25%, or even 37.5% if you include the ranges
which you have omitted above) out of the BMP. These codepoints would
either not be assigned at all (same as with UTF-16) or have to be
represented as four bytes. By comparison, the UTF-16 scheme reduces the
number of codepoints representable in 16 bits only by 3.1%. So there was
a tradeoff: Number of characters representable in 16 bits (63488 :
40960 or 49152) versus total number of representable characters (1112064
: 67108864). Clearly they thought 1112064 ought to be enough for
everyone and opted for a denser representation of common characters.
(That doesn't mean that they considered exactly your encoding: But
surely they considered several different encodings before settling on
what is now known as UTF-16.)
Post by Ben Morrow
I know that at that point they were intending to extend the character
set to 31 bits,
Yes, but certainly not with UTF-16: That encoding is limited to ~ 20
bits (codepoints U+0000 .. U+10FFFF).
Post by Ben Morrow
but IMHO reducing that to 26 would have been a lesser evil than
stuffing a whole lot of encoding rubbish into the application- visible
character set.
The only thing that's visible in the character set is that there is a
chunk of 2048 reserved code points which will never be assigned. How is
that different from other chunks of unassigned code points which may or
may not be assigned in the future?

hp
Ben Morrow
2012-11-11 23:07:03 UTC
Permalink
Post by Peter J. Holzer
Post by Ben Morrow
Post by Peter J. Holzer
The main problem I have with UTF-16 is of a psychological nature: It is
extremely tempting to assume that it's a constant-width encoding because
"nobody uses those funky characters above U+FFFF anyway". Basically the
"all the world uses US-ASCII" trap reloaded.
The main problem *I* have is the fact the surrogates are allocated out
of the Unicode character space, so everyone doing anything with Unicode
has to take account of them, even if they won't ever be touching UTF-16
data. UTF-8 doesn't do that: it has magic bits indicating the
variable-length sections, but they are kept away from the data bits
representing the actual characters encoded.
The same could have been done with UTF-16. If I'm reading the charts
right, Unicode 1.1.5 (the last version before the change) allocated
characters from 0000-9FA5 and from F900-FFFF, which leaves Axxx-Exxx
free to represent multi-word characters. So, for instance, they could
have used the following scheme: A word matching one of
0xxxxxxxxxxxxxxx
1001xxxxxxxxxxxx
1111xxxxxxxxxxxx
is a single-word character. Other characters are represented as two
words, encoded as
101ppppphhhhhhhh 110pppppllllllll
which represents the 26-bit character
pppppppppphhhhhhhhllllllll
That takes a huge chunk (25%, or even 37.5% if you include the ranges
which you have omitted above) out of the BMP. These codepoints would
either not be assigned at all (same as with UTF-16) or have to be
represented as four bytes.
Obviously I meant the latter. My whole point is that the Unicode
character list should not contain references to any particular encoding
scheme.
Post by Peter J. Holzer
Post by Ben Morrow
but IMHO reducing that to 26 would have been a lesser evil than
stuffing a whole lot of encoding rubbish into the application- visible
character set.
The only thing that's visible in the character set is that there is a
chunk of 2048 reserved code points which will never be assigned. How is
that different from other chunks of unassigned code points which may or
may not be assigned in the future?
Surrogates in decoded Unicode text have to be handled differently from
other currently-unassigned characters. A program which processes
arbitrary Unicode text can reasonably pass through unrecognised
characters, on the grounds that the sender may be using a more recent
version of Unicode, but surrogates should always signal an error and so
have to be explicitly checked for. For instance, a grep for 'surrogate'
in the perl source reveals a whole lot of code for checking for and
warning about surrogates, which simply shouldn't have been necessary.

Ben
Dr.Ruud
2012-11-08 18:39:25 UTC
Permalink
Post by Ben Morrow
Post by Rainer Weikusat
Post by Ben Morrow
(In practice it would break XS, so it probably won't happen, which is a
shame. UTF-8 was a very bad choice of internal representation, in
retrospect, though it seemed to make sense at the time. It makes a great
many internal operations much more complicated than they need to be,
because you can no longer index into an array to find a particular
character in the string.)
The only way to provide that is to store all characters as integer
values large enough to encompass all conceivably existing Unicode
codepoints. Otherwise, you're going to have multibyte characters and
consequently, 'indexing into the array to find a particular character
in the string' won't work anymore.
Yes. That's called 'a 32-bit int', and is the standard wchar_t C
representation of Unicode. A sensible alternative would be a 1/2/4-byte
upgrade scheme somewhat similar to the current Perl scheme, but with all
the alternatives being constant width; a smarter alternative would be to
represent a string as a series of pieces, each of which could make a
different choice (and, potentially, some of which could be shared or CoW
with other strings).
Let's invent the byte-oriented utf-2d.

The bytes for the special (i.e. non-ASCII) characters have the high bit
on, and further still have a meaningful value, such that they can be
matched as a (cased) word-character / digit / whitespace, punctuation, etc.
Per special character position there can be an entry in the side table
that defines the real data for that position.

The 80-8F bytes are for future extensions. A 9E byte can prepend a data
part. A 9F byte (ends a data part and) starts a table part.

An ASCII buffer remains as is. A latin1 buffer also remains as is,
unless it contains a code point between 80 and 9F.


Possible usage of 90-9F, assuming " 0Aa." collation:

90: .... space
91: ...# digit
92: ..#. upper
93: ..## upper|digit
94: .#.. lower
95: .#.# lower|digit
96: .##. alpha
97: .### alnum
98: #... punct
99: #..# numeric?
9A: #.#. ...
9B: #.## ...
9C: ##.. ...
9D: ##.# ...
9E: ###. SOD (start-of-data)
9F: #### SOT (start-of-table)
--
Ruud
Peter J. Holzer
2012-11-06 20:51:25 UTC
Permalink
Post by Rainer Weikusat
[...]
Post by Ben Morrow
(In practice it would break XS, so it probably won't happen, which is a
shame. UTF-8 was a very bad choice of internal representation, in
retrospect, though it seemed to make sense at the time. It makes a great
many internal operations much more complicated than they need to be,
because you can no longer index into an array to find a particular
character in the string.)
The only way to provide that is to store all characters as integer
values large enough to encompass all conceivably existing Unicode
codepoints.
Not necessarily. As Ben already pointed out, not all strings have to
have the same representation. There is at least one programming language
(Pike) which uses 1, 2, or 4 bytes per character depending on the
"widest" character in the string. IIRC, Pike had Unicode code before
Perl, so Perl could have "stolen" that idea.
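A minimal sketch of that width-selection idea (purely illustrative; this is
not how Pike or perl actually implement it):

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

# pick the narrowest fixed width that can hold every character of a string
sub storage_width {
    my ($str) = @_;
    my $widest = max(0, map { ord } split //, $str);
    return $widest <= 0xFF ? 1 : $widest <= 0xFFFF ? 2 : 4;   # bytes per character
}

print storage_width("abc"), "\n";         # 1
print storage_width("K\x{E4}se"), "\n";   # 1
print storage_width("\x{411}"), "\n";     # 2
print storage_width("\x{1F42A}"), "\n";   # 4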
Post by Rainer Weikusat
Otherwise, you're going to have multibyte characters and
consequently, 'indexing into the array to find a particular character
in the string' won't work anymore.
There are other tradeoffs, too: UTF-8 is quite compact for latin text,
but it takes about 2 bytes per character for most other alphabetic
scripts (e.g. Cyrillic, Greek, Devanagari) and 3 for CJK and some other
alphabetic scripts (e.g. Hiragana and Katakana). So the size problem you
mentioned may be reversed if you are mainly processing Asian text.
Plus scanning a text may be quite a bit faster if you can do it in 16
bit quantities instead of 8 bit quantities.
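The size trade-off is easy to check with the core Encode module (a quick
sketch; the chosen code points are just representatives of each script):

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

for my $cp (0x61, 0xE0, 0x430, 0x3042, 0x6F22) {
    printf "U+%04X -> %d byte(s) in UTF-8\n",
        $cp, length encode('UTF-8', chr $cp);
}
# U+0061 -> 1 (latin 'a')      U+00E0 -> 2 (a grave)
# U+0430 -> 2 (cyrillic a)     U+3042 -> 3 (hiragana a)
# U+6F22 -> 3 (CJK ideograph)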
Post by Rainer Weikusat
Independently of this, the UTF-8 encoding was designed to be a
representation of the Unicode character set which was backwards
compatible with 'ASCII-based systems', and it is not only a widely
supported internet standard (http://tools.ietf.org/html/rfc3629) and
the method of choice for dealing with 'Unicode' on UNIX(*) and
similar systems but formed the 'basic character encoding' of complete
operating systems as early as 1992
(http://plan9.bell-labs.com/plan9/about.html).
However, the Plan 9 C API has exactly the distinction you are
criticizing: Internally, strings are arrays of 16-bit quantities,
externally, they are read and written as UTF-8.

From the well-known "Hello world" paper:

| All programs in Plan 9 now read and write text as UTF, not ASCII.
| This change breaks two deep-rooted symmetries implicit in most C
| programs:
|
| 1. A character is no longer a char.
|
| 2. The internal representation (Rune) of a character now differs from
| its external representation (UTF).

(The paper was written before Unicode 2.0, so all characters were 16
bit. I don't know the current state of Plan 9)

hp
Peter J. Holzer
2012-10-29 09:16:37 UTC
Permalink
Post by Helmut Richter
Post by Peter J. Holzer
But this isn't what "wide character" in the warning means. In the
warning, it means a string element with a code > 255. For string
elements <= 255, perl can assume that they are supposed to be bytes, not
characters, when you try to write them to a byte stream.
You have to distinguish what may work sometimes or always, and what is
part of the interface which *should* work. If it does nor work in the
latter case, it is an error; if it does not work in the former case you
have made a bad guess about how it is implemented. So do not rely on your
guesses but use the documented interface.
I was careful to use the term "string element" and avoid the terms
"byte" and "character" when talking about the things a string is
composed of.

Perl has two types of strings: Character strings (often called utf8
strings in the documentation) and byte strings. Character strings are
composed of 32-bit entities, each denoting a unicode code point. So
"\x{1f42a}" is a string with the single character DROMEDARY CAMEL.
Byte strings are just that: Strings of uninterpreted bytes. Any
semantics assigned to them is semantics of the program, not of the Perl
language (this isn't quite correct: character oriented functions like lc
or character classes in regexps do work on them, but only for ASCII).
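A short sketch of the distinction (the lengths are as perl sees them, not
byte counts in any file):

my $chars = "\x{1F42A}";           # character string: one element, DROMEDARY CAMEL
my $bytes = "\xF0\x9F\x90\xAA";    # byte string: the same camel's UTF-8 bytes
print length($chars), "\n";        # 1
print length($bytes), "\n";        # 4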

These differences are documented, and I consider them part of the
interface, although some members of p5p consider the distinction a bug
and try to remove it.

However, for the warning "Wide character in print" this is irrelevant.

Perl doesn't distinguish between character and byte strings when writing
them to a file handle. For both the strings "\x{E0}" (a byte string) and
"\N{U+00E0}" (a character string), if you write them to a raw file
handle, the single byte 0xE0 will be written. Both will be converted to
two bytes 0xC3 0xA0 if you write them the a file handle with the
":encoding(UTF-8)" layer. And so on. But for strings with elements >
255, it simply isn't possible to write a single byte with this value to
a byte stream, because a byte has only 8 bits (on the platforms we care
about). So Perl prints a warning and encodes the string in UTF-8 (or
just copies its internal representation, which happens to be the same
thing). I would argue that perl should die() instead, but this has been
the observed and documented behaviour since 5.8.0, so I doubt it will
change.
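A compact recap of the three cases, in the same one-liner style used
elsewhere in this thread (the byte values in the comments are the expected
results):

% perl -e 'print "\x{E0}"' | xxd
# one byte written: e0
% perl -e 'binmode STDOUT, ":encoding(UTF-8)"; print "\x{E0}"' | xxd
# two bytes written: c3 a0
% perl -e 'print "\x{430}"' | xxd
# warns "Wide character in print", then writes the bytes d0 b0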


[Rest snipped. All true, but IMHO not very relevant to this thread].

hp
Eric Pozharski
2012-10-29 12:52:06 UTC
Permalink
Post by Peter J. Holzer
Post by Eric Pozharski
Post by Ben Morrow
In any case, the result is exactly what I said: the string contains
one (logical) character. If you apply length() to that string it
will return 1. (This character happens to be represented internally
as two bytes; that is none of your business.) What do you think I
omitted from the story?
Right. And that's closely related to your last example (the one
about utf8.pm being unsafe). I've tried to make a point that
*characters* from different *ranges* happen to be of different length
in bytes.
Then maybe you shouldn't have chosen two examples which both are same
length in bytes.
(Last night I've reread loads of perlunicode and friends, I feel much
better now.) No, they are the same length *if* the encoding of the stream is set:

{7453:22} [0:0]% perl -CS -Mutf8 -wle 'print "à"' | xxd
0000000: c3a0 0a ...
{7459:23} [0:0]% perl -CS -Mutf8 -wle 'print "а"' | xxd
0000000: d0b0 0a ...
{7466:24} [0:0]%

But latin1 is special (I've reread perlunicode and friends): *if*
there's no reason (printing isn't a reason) to upgrade to utf8, then
*characters* of the latin1 script (and latin1 only) stay *bytes*:

{7466:24} [0:0]% perl -Mutf8 -wle 'print "à"' | xxd
0000000: e00a ..
{7795:25} [0:0]% perl -Mutf8 -wle 'print "а"' | xxd
Wide character in print at -e line 1.
0000000: d0b0 0a ...

But even if the encoding of the stream isn't set, concatenation with a
non-latin1 script upgrades latin1 too:

{7800:26} [0:0]% perl -Mutf8 -wle 'print "[à][а]"' | xxd
Wide character in print at -e line 1.
0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].

Please rewind the thread. That's exactly what happened a couple of posts
ago:
Post by Peter J. Holzer
Post by Eric Pozharski
{9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
SV = PV(0xa06f750) at 0xa08afac
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
CUR = 5
LEN = 12
*Characters* of latin1 aren't wide (even if they are characters, they
are still one byte long)
In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as cyrillic
characters. Your example shows this: "à" (LATIN SMALL LETTER A WITH
GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A) is "\320\260".
No. Because it's not UTF-8, it's utf8. As long as utf8 semantics isn't
set, any scalar stays plain bytes:

{2786:10} [0:0]% perl -MDevel::Peek -wle 'Dump "à"'
SV = PV(0x9d0e878) at 0x9d29f28
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK)
PV = 0x9d2ddc8 "\303\240"\0
CUR = 2
LEN = 12

However, when utf8 semantics is set, then those codepoints that fit
the latin1 script become special Perl-latin1:

{5930:11} [0:0]% perl -MDevel::Peek -Mutf8 -wle 'Dump "à"'
SV = PV(0x9b92880) at 0x9badf10
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
PV = 0x9bb1eb0 "\303\240"\0 [UTF8 "\x{e0}"]
CUR = 2
LEN = 12

Upgrading to the UTF-8 encoding or staying with the latin1 encoding depends on
concatenation with codepoints already upgraded to UTF-8 and/or the encoding of
the output stream.

*SKIP*
Post by Peter J. Holzer
Post by Eric Pozharski
{10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops
Now you have one character (because of -Mutf8, the two bytes \303\240
are decoded to the character U+00e0), but you are trying to write it
to a byte stream without specifying the encoding. Perl writes the
single byte 0xE0, which your UTF-8 terminal cannot interpret. (Mine
displays a question mark in a dark circle)
{42:1} [0:0]% perl -Mutf8 -wle 'print "à"'
à
{1903:2} [0:0]% perl -Mutf8 -wle 'print "à"'

{1933:3} [0:0]% perl -Mutf8 -wle 'print "à"' | xxd
0000000: e00a

Instead it does. Once. It wasn't typing, it was a search through
history. Now I'm bothered. Does anyone here know how to list the
extensions enabled in a running instance of urxvt?

*SKIP*
Post by Peter J. Holzer
For one-liners like this, using the same encoding for the script and
the I/O is useful ("-CS -Mutf8" is even shorter than
"-Mencoding=utf8", but maybe you don't have a UTF-8 capable terminal).
{14999:29} [0:0]% perl -mencoding -wle 'print "[à][а]"' | xxd
0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].
{15017:30} [0:0]% perl -CS -Mutf8 -wle 'print "[à][а]"' | xxd
0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].

Golf?
Post by Peter J. Holzer
However, for real programs, I think tying the encoding of the source
code to the encoding of I/O-streams the script is supposed to handle
is foolish. My scripts are always encoded in UTF-8, but they
frequently have to handle files in CP-1252.
Mine are us-ascii; I have open.pm for the rest.
Peter J. Holzer
2012-10-30 10:32:13 UTC
Permalink
Post by Eric Pozharski
Post by Peter J. Holzer
Post by Eric Pozharski
Post by Ben Morrow
In any case, the result is exactly what I said: the string contains
one (logical) character. If you apply length() to that string it
will return 1. (This character happens to be represented internally
as two bytes; that is none of your business.) What do you think I
omitted from the story?
Right. And that's closely related to your last example (the one
about utf8.pm being unsafe). I've tried to make a point that
*characters* from different *ranges* happen to be of different length
in bytes.
Then maybe you shouldn't have chosen two examples which both are same
length in bytes.
(Last night I've reread loads of perlunicode and friends, I feel much
You posted the output of Devel::Peek::Dump, so I thought you were
talking about the *internal* representation.

How many bytes they occupy in an I/O stream depends on the encoding.

LATIN SMALL LETTER A WITH GRAVE is one byte in ISO-8859-1, CP850, ...
LATIN SMALL LETTER A WITH GRAVE is two bytes in UTF-8, UTF-16, ...
LATIN SMALL LETTER A WITH GRAVE is four bytes in UTF-32, ...

CYRILLIC SMALL LETTER A is one byte in ISO-8859-5, KOI-8, ...
CYRILLIC SMALL LETTER A is two bytes in UTF-8, UTF-16, ...
CYRILLIC SMALL LETTER A is four bytes in UTF-32, ...

(And of course, both characters cannot be represented at all in some
encodings: There is no LATIN SMALL LETTER A WITH GRAVE in ISO-8859-5,
and no CYRILLIC SMALL LETTER A in ISO-8859-1)
Post by Eric Pozharski
{7453:22} [0:0]% perl -CS -Mutf8 -wle 'print "à"' | xxd
0000000: c3a0 0a ...
{7459:23} [0:0]% perl -CS -Mutf8 -wle 'print "а"' | xxd
0000000: d0b0 0a ...
{7466:24} [0:0]%
But latin1 is special (I've reread perlunicode and friends), *if*
there's no reason (printing isn't reason) to upgrade to utf8 then
I already explained that. When writing to a file handle, perl doesn't
care whether a string is composed of bytes or characters.

If the file handle has no :encoding() layer, it will try to write each
element of the string as a single byte.

If the file has an :encoding() layer, it will interpret each element of
the string as a character and convert that to a byte sequence according
to that encoding.

So without an encoding layer "\x{E0}" will always be written as the single byte
0xE0, regardless of whether the string is a byte string or a character
string. With an ":encoding(UTF-8)" layer it will always be written as
two bytes 0xC3 0xA0; and with an ":encoding(CP850)" layer, it will
always be written as a single byte 0x85.
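The same mapping, spelled out with the Encode module rather than I/O layers
(a sketch; the expected output is in the comments):

use Encode qw(encode);

my $s = "\x{E0}";                              # LATIN SMALL LETTER A WITH GRAVE
printf "%v02X\n", encode('ISO-8859-1', $s);    # E0
printf "%v02X\n", encode('UTF-8', $s);         # C3.A0
printf "%v02X\n", encode('CP850', $s);         # 85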

What is apparently confusing you is what happens if that fails.

Obviously you can't write a single byte with the value 0x430, you can't
encode CYRILLIC SMALL LETTER A in ISO-8859-1 and you can't encode LATIN
SMALL LETTER A WITH GRAVE in ISO-8859-5.

So what does perl do? It prints a warning to STDERR and writes
a more or less reasonable approximation to the stream. The details
depend on the I/O layer:

If there is no :encoding() layer, the warning is "Wide character in
print" and the utf-8 representation is sent to the stream. And to
confuse matters further, this is done for the whole string, not just
this particular string element:

% perl -Mutf8 -E 'say "->\x{E0}\x{430}<-"'
Wide character in say at -e line 1.
->àа<-

(one string: \x{E0} and \x{430} converted to UTF-8)

% perl -Mutf8 -E 'say "->\x{E0}<-", "->\x{430}<-"'
Wide character in say at -e line 1.
->�<-->а<-

(two strings: \x{E0} printed as a single byte, \x{430} converted to UTF-8)

If there is an :encoding() layer, the warning is "\x{....} does not map
to $charset" and a \x{....} escape sequence is sent to the stream:

% perl -Mutf8 -E 'binmode STDOUT, ":encoding(iso-8859-5)"; say "->\x{E0}<-"'
"\x{00e0}" does not map to iso-8859-5 at -e line 1.
->\x{00e0}<-

But these are responses to an *error* condition. You shouldn't try to
write codepoints > 255 to a byte stream (actually, you shouldn't write
any characters to a byte stream, a byte stream is for bytes), and you
shouldn't try to write latin accented characters to a cyrillic stream.
Or at least you shouldn't be terribly surprised if the result is a
little confusing - garbage in, garbage out.
Post by Eric Pozharski
But even if encoding of stream isn't set concatenation with non-latin1
The term "upgrade" has a rather specific meaning in Perl in context with
byte and character strings, and I don't think you are talking about
that.
Post by Eric Pozharski
{7800:26} [0:0]% perl -Mutf8 -wle 'print "[à][а]"' | xxd
Wide character in print at -e line 1.
0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].
You have a single string "[à][а]" here. As I wrote above, print treats
the string as unit and in the absence of an :encoding() layer just dumps
it in UTF-8 encoding. So, yes, both the "à" and the "а" within this
single string will be UTF-8-encoded (as will be the square brackets, but
for them the UTF-8 encoding is the same as for US-ASCII, so you don't
notice that).

And I repeat it again: You are doing something which just doesn't make
sense (writing characters to a byte stream), so don't be surprised if
the result is a little surprising. Do it right and the result will make
sense.
Post by Eric Pozharski
Please rewind the thread. That's exactly what happened couple of posts
I've read these postings but I don't know what you are referring to. If
you are referring to other postings (especially long ones), please cite
the relevant part.
Post by Eric Pozharski
Post by Peter J. Holzer
Post by Eric Pozharski
{9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
SV = PV(0xa06f750) at 0xa08afac
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
CUR = 5
LEN = 12
*Characters* of latin1 aren't wide (even if they are characters, they
are still one byte long)
In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as cyrillic
characters. Your example shows this: "à" (LATIN SMALL LETTER A WITH
GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A) is "\320\260".
No. Because it's not UTF-8, it's utf8.
I presume that by "utf8" you mean a string with the UTF8 bit set
(testable with the utf8::is_utf8() function). But as I've written
repeatedly, this is completely irrelevant for I/O. A string will be
treated completely identical, whether is has this bit set or not. It is
only the value of the string which is important, not its internal type
and representation.

(Also, I find it very confusing that you post the output of
Devel::Peek::Dump, but then apparently don't refer to it but talk about
something else. Please try to organize your postings in a way that one
can understand what you are talking about. It is very likely that this
exercise will also clear up the confusion in your mind)
Post by Eric Pozharski
As long as utf8 semantics isn't set, anything scalar stays plain
{2786:10} [0:0]% perl -MDevel::Peek -wle 'Dump "à"'
SV = PV(0x9d0e878) at 0x9d29f28
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK)
PV = 0x9d2ddc8 "\303\240"\0
CUR = 2
LEN = 12
However, when utf8 semantics is set, then those codepoints that fit
{5930:11} [0:0]% perl -MDevel::Peek -Mutf8 -wle 'Dump "à"'
SV = PV(0x9b92880) at 0x9badf10
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
PV = 0x9bb1eb0 "\303\240"\0 [UTF8 "\x{e0}"]
CUR = 2
LEN = 12
Yes. We've been through that. Ben explained it in excruciating detail.
What don't you understand here?
Post by Eric Pozharski
Post by Peter J. Holzer
However, for real programs, I think tying the encoding of the source
code to the encoding of I/O-streams the script is supposed to handle
is foolish. My scripts are always encoded in UTF-8, but they
frequently have to handle files in CP-1252.
Mine are us-ascii, I have open.pm for rest.
US-ASCII is a subset of UTF-8, so your files are UTF-8, too ;-). (Most
of mine don't contain non-ASCII characters either) What I meant is that
I don't use any other encoding (like ISO-8859-1 or ISO-8859-15) to
encode non-ASCII characters, so I don't have any need for "use
encoding". If your scripts are all in ASCII and you use open.pm for
"rest", what do you need "use encoding" for? Remember, this subthread
started when you berated Ben for discouraging the use "use encoding".

hp
Eric Pozharski
2012-10-31 18:37:14 UTC
Permalink
*SKIP*
Post by Peter J. Holzer
Post by Eric Pozharski
Please rewind the thread. That's exactly what happened couple of
I've read these postings but I don't know what you are referring to.
If you are referring to other postings (especially long ones), please
cite the relevant part.
[quoting <eli$***@qz.little-neck.ny.us> on]

$ echo 'a' | perl -Mutf8 -wne 's/a/å/;print' | od -xc
0000000 0ae5
345 \n
0000002

[quote off]

*SKIP*
Post by Peter J. Holzer
Post by Eric Pozharski
Post by Peter J. Holzer
In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as
cyrillic characters. Your example shows this: "à" (LATIN SMALL
LETTER A WITH GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A)
is "\320\260".
No. Because it's not UTF-8, it's utf8.
I presume that by "utf8" you mean a string with the UTF8 bit set
(testable with the utf8::is_utf8() function).
If "you" above refers to me then you're wrong.
Post by Peter J. Holzer
But as I've written repeatedly, this is completely irrelevant for I/O.
A string will be treated completely identical, whether is has this bit
set or not. It is only the value of the string which is important, not
its internal type and representation.
Try to read it again. Slowly.
Post by Peter J. Holzer
(Also, I find it very confusing that you post the output of
Devel::Peek::Dump, but then apparently don't refer to it but talk
about something else. Please try to organize your postings in a way
that one can understand what you are talking about.
Indeed, only FLAGS and PV are relevant. Sadly, Devel::Peek::Dump
doesn't provide a means to filter arbitrary parts of the output off (however,
that's not the purpose of D::P). And I consider editing copy-pastes
bad taste.

*SKIP*
Post by Peter J. Holzer
Yes. We've been through that. Ben explained it in excruciating detail.
What don't you understand here?
It's not about understanding. I'm trying to make a point that latin1 is
special.
Post by Peter J. Holzer
Post by Eric Pozharski
Post by Peter J. Holzer
However, for real programs, I think tying the encoding of the source
code to the encoding of I/O-streams the script is supposed to handle
is foolish. My scripts are always encoded in UTF-8, but they
frequently have to handle files in CP-1252.
Mine are us-ascii, I have open.pm for rest.
US-ASCII is a subset of UTF-8, so your files are UTF-8, too ;-). (Most
of mine don't contain non-ASCII characters either) What I meant is that
I don't use any other encoding (like ISO-8859-1 or ISO-8859-15) to
encode non-ASCII characters, so I don't have any need for "use
encoding". If your scripts are all in ASCII and you use open.pm for
"rest", what do you need "use encoding" for?
Many years ago, to get operations to work on characters instead of bytes,
some strings had to be pulled. encoding.pm pulled the right strings;
utf8.pm pulled irrelevant ones. In those days text-related operations
worked for you because they fitted in the latin1 script or you didn't hit
edge cases. However, I did (more years ago, in 5.6.0, B<lcfirst()>
worked *only* on bytes, no matter what).

Guess what? I've just figured out I don't need either any more:

{40710:255} [0:0]% xxd foo.koi8-u
0000000: c6d9 d7c1 0a .....
{40731:262} [0:0]% perl -wle '
open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
read $fh, $fh, -s $fh;
$fh =~ m{(\w\w)};
print $1
'
Wide character in print at -e line 5.
фы
Post by Peter J. Holzer
Remember, this subthread started when you berated Ben for discouraging
the use "use encoding".
It comes clear to me now what made you both (you and Ben) believe in
bugginess of F<encoding.pm>. I'm fine with that.
Peter J. Holzer
2012-11-01 11:16:06 UTC
Permalink
Post by Eric Pozharski
*SKIP*
Post by Peter J. Holzer
Post by Eric Pozharski
Please rewind the thread. That's exactly what happened couple of
^^^^
Post by Eric Pozharski
Post by Peter J. Holzer
I've read these postings but I don't know what you are referring to.
If you are referring to other postings (especially long ones), please
cite the relevant part.
$ echo 'a' | perl -Mutf8 -wne 's/a/å/;print' | od -xc
0000000 0ae5
345 \n
0000002
Then I don't understand what you meant by "that" in the quoted
paragraph, since that seemed to refer to something else.
Post by Eric Pozharski
Post by Peter J. Holzer
Post by Eric Pozharski
Post by Peter J. Holzer
In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as
cyrillic characters. Your example shows this: "à" (LATIN SMALL
LETTER A WITH GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A)
is "\320\260".
No. Because it's not UTF-8, it's utf8.
I presume that by "utf8" you mean a string with the UTF8 bit set
(testable with the utf8::is_utf8() function).
If "you" above refers to me
Yes, of course. You used to the term "utf8", so I was wondering what you
meant by it.
Post by Eric Pozharski
then you're wrong.
Then I don't know what you meant by "utf8". Care to explain?
Post by Eric Pozharski
Post by Peter J. Holzer
But as I've written repeatedly, this is completely irrelevant for I/O.
A string will be treated completely identical, whether is has this bit
set or not. It is only the value of the string which is important, not
its internal type and representation.
Try to read it again. Slowly.
Read *what* again? The paragraph you quoted is correct and explains the
behaviour you are seeing.
Post by Eric Pozharski
Post by Peter J. Holzer
(Also, I find it very confusing that you post the output of
Devel::Peek::Dump, but then apparently don't refer to it but talk
about something else. Please try to organize your postings in a way
that one can understand what you are talking about.
Indeed, only FLAGS and PV are relevant. Sadly that Devel::Peek::Dump
doesn't provide means to filter arbitrary parts of output off (however,
that's not the purpose of D::P). And I consider editing copypastes a
bad taste.
That's not the problem. The problem is that you gave the output of
Devel::Peek::Dump which clearly showed a latin-1 character occupying
*two* bytes and then claimed that it was only one byte long. Which it
clearly wasn't. What you probably meant was that the latin1 character
would be only 1 byte long if written to an output stream without an
encoding layer. But you didn't write that. You just made an assertion
which clearly contradicted the example you had just given and didn't
even give any indication that you had even noticed the contradiction.
Post by Eric Pozharski
Post by Peter J. Holzer
Yes. We've been through that. Ben explained it in excruciating detail.
What don't you understand here?
It's not about understanding. I'm trying to make a point that latin1 is
special.
It is only special in the sense that all its codepoints have a value <=
255. So if you are writing to a byte stream, it can be directly
interpreted as a string of bytes and written to the stream without
modification.

The point that *I* am trying to make is that an I/O stream without an
:encoding() layer isn't for I/O of *characters*, it is for I/O of
*bytes*.

Thus, when you write the string "Käse" to such a stream, you aren't
writing Upper Case K, lower case umlaut a, etc. You are writing 4 bytes
with the values 0x4B, 0xE4, 0x73, 0x65. The I/O-code doesn't care about
whether the string is character string (with the UTF8 bit set) or a byte
string, it just interprets every element of the string as a byte. Those
four bytes could be pixels in image, for all the Perl I/O code knows.

OTOH, if there is an :encoding() layer, the string is taken to be
composed of (unicode) characters. If there is an element with the
codepoint \x{E4} in the string, it is a interpreted as a lower case
umlaut a, and converted to the proper encoding (e.g. one byte 0x84 for
CP850, two bytes 0xC3 0xA4 for UTF-8 and one byte 0xE4 for latin-1). But
again, this happens *always*. The Perl I/O layer doesn't care whether
the string is a character string (with the UTF8 bit set) or not.
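A quick way to convince oneself of that last point (a sketch writing the
same value through the same layer with both internal representations;
utf8::upgrade/downgrade change only the representation, not the value):

use warnings;

my $str = "K\x{E4}se";
my ($up, $down) = ($str, $str);
utf8::upgrade($up);        # internal UTF8 representation, UTF8 flag set
utf8::downgrade($down);    # internal byte representation, UTF8 flag clear

for my $s ($up, $down) {
    open my $fh, '>:encoding(UTF-8)', \my $buf or die $!;
    print {$fh} $s;
    close $fh;
    printf "%v02X\n", $buf;    # 4B.C3.A4.73.65 both times
}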
Post by Eric Pozharski
Post by Peter J. Holzer
If your scripts are all in ASCII and you use open.pm for "rest", what
do you need "use encoding" for?
Many years ago to get operations to work on characters instead of bytes
some strings must have been pulled. encoding.pm pulled right strings.
utf8.pm pulled irrelevant strings. Those days text related operations
worked for you because they fitted in latin1 script or you didn't hit
edge cases. However I did (more years ago, in 5.6.0, B<lcfirst()>
worked *only* on bytes, no matter what).
Perl aquired unicode support in its current form only in 5.8.0. 5.6.0
did have some experimental support for UTF-8-encoded strings, but it was
different and widely regarded as broken (that's why it was changed for
5.8.0). So what Perl 5.6.0 did or didn't do is irrelevant for this
discussion.

With some luck I managed to skip the 5.6 days and went directly from the
<=5.005 "bytestrings only" era to the modern >=5.8.0 "character
strings" era. However, in the early days of 5.8.x, the documentation was
quite bad and it took a lot of reading, experimenting and thinking to
arrive at a consistent understanding of the Perl string model.

But once you have this understanding, it is really quite simple and
consistent.
Post by Eric Pozharski
{40710:255} [0:0]% xxd foo.koi8-u
0000000: c6d9 d7c1 0a .....
{40731:262} [0:0]% perl -wle '
open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
read $fh, $fh, -s $fh;
$fh =~ m{(\w\w)};
print $1
'
Wide character in print at -e line 5.
фы
This example doesn't have any non-ascii characters in the source code,
so of course it doesn't need 'use utf8'. The only effect of use utf8 is
to tell the perl compiler that the source code is encoded in UTF-8.

But you *do* need some indication of the encoding of STDOUT (did you
notice the warning "Wide character in print at -e line 5."? As long as
you get this warning, your code is wrong).

You could use "use encoding 'utf-8'":

% perl -wle '
use encoding "UTF-8";
open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
read $fh, $fh, -s $fh;
$fh =~ m{(\w\w)};
print $1
'
фы

Or you could use -C on the command line:

% perl -CS -wle '
open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
read $fh, $fh, -s $fh;
$fh =~ m{(\w\w)};
print $1
'
фы


Or could use "use open":

% perl -wle '
use open ":locale";
open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
read $fh, $fh, -s $fh;
$fh =~ m{(\w\w)};
print $1
'
фы


Note: No warning in all three cases. The latter takes the encoding from
the environment, which hopefully matches your terminal settings. So it
works on a UTF-8 or ISO-8859-5 or KOI-8 terminal. But of course it
doesn't work on a latin-1 terminal and you get an appropriate warning:

"\x{0444}" does not map to iso-8859-1 at -e line 6.
"\x{044b}" does not map to iso-8859-1 at -e line 6.
\x{0444}\x{044b}
Post by Eric Pozharski
Post by Peter J. Holzer
Remember, this subthread started when you berated Ben for discouraging
the use "use encoding".
It comes clear to me now what made you both (you and Ben) believe in
bugginess of F<encoding.pm>. I'm fine with that.
I don't know whether encoding.pm is broken in the sense that it doesn't
do what is documented to do (it was, but it is possible that all of
those bugs have been fixed). I do think that it is "broken as designed",
because it conflates two different things:

* The encoding of the source code of the script
* The default encoding of some I/O streams

and it does so even in an inconsistent manner (e.g. the encoding is
applied to STDOUT, but not to STDERR) and finally, because it is too
complex and that will lead to surprising results.

hp
Eric Pozharski
2012-11-02 15:49:31 UTC
Permalink
with <slrnk94mfm.5vl.hjp-***@hrunkner.hjp.at> Peter J. Holzer wrote:

*SKIP*
Post by Peter J. Holzer
Then I don't know what you meant by "utf8". Care to explain?
Do you know the difference between utf-8 and utf8 for Perl? (For a long time,
up to yesterday, I believed that utf-8 is all-caps; I was wrong,
it's caseless.)

*SKIP*
Post by Peter J. Holzer
* The encoding of the source code of the script
Wrong.

[quote perldoc encoding on]

* Internally converts all literals ("q//,qq//,qr//,qw///, qx//") from
the encoding specified to utf8. In Perl 5.8.1 and later, literals in
"tr///" and "DATA" pseudo-filehandle are also converted.

[quote off]

In pre-all-utf8 times qr// was working on bytes without being told to
behave otherwise. That's different now.
Post by Peter J. Holzer
* The default encoding of some I/O streams
We here, in our barbaric world, had (and still have) to process any
binary encoding except latin1 (guess what, CP866 is still alive).
However:

[quote perldoc encoding on]

* Changing PerlIO layers of "STDIN" and "STDOUT" to the encoding
specified.

[quote off]

That's not saying anything about 'default'. It's about 'encoding
specified'.
Post by Peter J. Holzer
and it does so even in an inconsistent manner (e.g. the encoding is
applied to STDOUT, but not to STDERR)
No problems with that here. STDERR is us-ascii, point.
Post by Peter J. Holzer
and finally, because it is too
complex and that will lead to surprising results.
In your elitist latin1 world -- maybe so. But we, down here, are
barbarians, you know.
Peter J. Holzer
2012-11-03 11:03:43 UTC
Permalink
Post by Eric Pozharski
*SKIP*
Post by Peter J. Holzer
Then I don't know what you meant by "utf8". Care to explain?
Do you know difference between utf-8 and utf8 for Perl?
UTF-8 is the "UCS Transformation Format, 8-bit form" as defined by the
Unicode consortium. It defines a mapping from unicode characters to
bytes and back. When you use it as an encoding in Perl, There will be
some checks that the input is actually a valid unicode character. For
example, you can't encode a surrogate character:

$s2 = encode("utf-8", "\x{D812}");

results in the string "\xef\xbf\xbd", which is UTF-8 for U+FFFD (the
replacement character used to signal invalid characters).


utf8 may mean (at least) three different things in a Perl context:

* It is a perl-proprietary encoding (actually two encodings, but EBCDIC
support in perl has been dead for several years and I doubt it will
ever come back, so I'll ignore that) for storing strings. The
encoding is based on UTF-8, but it can represent code points with up
to 64 bits[1], while UTF-8 is limited to 36 bits by design and to
values <= 0x10FFFF by fiat. It also doesn't check for surrogates, so

$s2 = encode("utf8", "\x{D812}");

results in the string "\xed\xa0\x92", as one would naively expect.

You should never use this encoding when reading or writing files.
It's only for perl internal use and AFAIK it isn't documented
anywhere except possibly in the source code.

* Since the perl interpreter uses the format to store strings with
Unicode character semantics (marked with the UTF8 flag), such strings
are often called "utf8 strings" in the documentation. This is
somewhat unfortunate, because "utf8" looks very similar to "utf-8",
which can cause confusion and because it exposes an implementation
detail (there are several other possible storage formats a perl
interpreter could reasonably use) to the user.

I avoid this usage. I usually talk about "byte strings" or "character
strings", or use even more verbose language to make clear what I am
talking about. For example, in this thread the distinction between
byte strings and character strings is almost irrelevant; it is only
important whether a string contains an element > 0xFF or not.

* There is also an I/O layer “:utf8”, which is subtly different from
both “:encoding(utf8)” and “:encoding(utf-8)”.
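
To make the difference concrete, here is a self-contained version of the
two encode() calls quoted above (a sketch; the variable names and the
hex-dump formatting are mine, and some perls may emit a 'surrogate'
warning for the literal):

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

my $surrogate = "\x{D812}";                  # a UTF-16 surrogate code point

my $strict = encode('utf-8', $surrogate);    # strict: replaced with U+FFFD
my $lax    = encode('utf8',  $surrogate);    # lax: encoded as-is, no checks

printf "utf-8: %s\n", join ' ', map { sprintf '%02x', ord } split //, $strict;
printf "utf8:  %s\n", join ' ', map { sprintf '%02x', ord } split //, $lax;

# Expected output, as described above:
# utf-8: ef bf bd
# utf8:  ed a0 92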
Post by Eric Pozharski
(For a long time, up to yesterday, I believed that utf-8 had to be
all-caps; I was wrong, it's case-insensitive.)
Yes, the encoding names (as used in Encode::encode, Encode::decode and
the :encoding() I/O-Layers) are case-insensitive.
Post by Eric Pozharski
Post by Peter J. Holzer
* The encoding of the source code of the script
Wrong.
[quote perldoc encoding on]
* Internally converts all literals ("q//,qq//,qr//,qw///, qx//") from
the encoding specified to utf8. In Perl 5.8.1 and later, literals in
"tr///" and "DATA" pseudo-filehandle are also converted.
[quote off]
How is this proving me wrong? It confirms what I wrote.

If you use “use encoding 'KOI8-U';”, you can use KOI8 sequences (either
literally or via escape sequences) in your source code. For example, if
you store this program in KOI8-U encoding:


#!/usr/bin/perl
use warnings;
use strict;
use 5.010;
use encoding 'KOI8-U';

my $s1 = "Б";
say ord($s1);
my $s2 = "\x{E2}";
say ord($s2);
__END__

(i.e. the string literal on line 7 is stored as the byte sequence 0x22
0xE2 0x22), the program will print 1041 twice, because:

* The perl compiler knows that the source code is in KOI-8, so a single
byte 0xE2 in the source code represents the character “U+0411
CYRILLIC CAPITAL LETTER BE”. Similarly, escape sequences of the form
\ooo and \xXX are taken to denote bytes in the source character set
and are translated to unicode. So both the literal Б on line 7 and the
\x{E2} on line 9 are translated to U+0411.

* At run time, the bytecode interpreter sees a string with the single
unicode character U+0411. How this character was represented in the
source code is irrelevant (and indeed, unknowable) to the byte code
interpreter at this stage. It just prints the decimal representation
of 0x0411, which happens to be 1041.
Post by Eric Pozharski
In pre-all-utf8 times qr// was working on bytes without being told to
behave otherwise. That's different now.
Yes, I think I wrote that before. I don't know what this has to do with
the behaviour of “use encoding”, except that historically, “use
encoding” was intended to convert old byte-oriented scripts to the brave new
unicode-centered world with minimal effort. (I don't think it met that
goal: Over the years I have encountered a lot of people who had problems
with “use encoding”, but I don't remember ever reading from someone who
successfully converted their scripts by slapping “use encoding '...'”
at the beginning.)
Post by Eric Pozharski
Post by Peter J. Holzer
* The default encoding of some I/O streams
We here, in our barbaric world, had (and still have) to process any
binary encoding except latin1 (guess what, CP866 is still alive).
[quote perldoc encoding on]
* Changing PerlIO layers of "STDIN" and "STDOUT" to the encoding
specified.
[quote off]
That's not saying anything about 'default'. It's about 'encoding
specified'.
You misunderstood what I meant by "default". When the perl interpreter
creates the STDIN and STDOUT file handles, these have some I/O layers
applied to them, without the user explicitly having to call
binmode(). These are applied by default, and hence I call them the
default layers. The list of default layers varies between systems
(Windows adds the :crlf layer, Linux doesn't), on command line settings
(-CS adds the :utf8 layer, IIRC), and of course it can also be
manipulated by modules like “encoding”. “use encoding 'CP866';” pushes
the layer “:encoding(CP866)” onto the STDIN and STDOUT handles. You can
still override them with binmode(), but they are there by default; you
don't have to call “binmode STDIN, ":encoding(CP866)"” explicitly
(but you do have to call it explicitly for STDERR, which IMNSHO is
inconsistent).
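
A minimal sketch of the explicit alternative (the choice of UTF-8 and the
sample messages are mine, not from the thread):

#!/usr/bin/perl
use strict;
use warnings;

# Push an encoding layer onto all three standard handles explicitly,
# including STDERR, which "use encoding" leaves alone.
binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';

print "r\x{E9}sum\x{E9} \x{263A}\n";            # no "Wide character" warning
warn  "Fehler: Datei \x{FC}berschrieben\n";     # non-ASCII on STDERR works too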
Post by Eric Pozharski
Post by Peter J. Holzer
and it does so even in an inconsistent manner (e.g. the encoding is
applied to STDOUT, but not to STDERR)
No problems with that here. STDERR is us-ascii, point.
If my scripts handle non-ascii characters, I want those characters also
in my error messages. If a script is intended for normal users (not
sysadmins), I might even want the error messages to be in their native
language instead of English. German can be expressed in pure US-ASCII,
although it's awkward. Russian or Chinese is harder.
Post by Eric Pozharski
Post by Peter J. Holzer
and finally, because it is too complex and that will lead to
surprising results.
In your elitist latin1 world -- may be so. But we, down here, are
barbarians, you know.
May I remind you that it was you who was surprised by the behaviour of
“use encoding” in this thread, not me?

In Message <***@orphan.zombinet> you wrote:

| {10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
| à
| {10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops
| �
| {10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hoora
| à
|
| Except the middle one (what I should think about), I think encoding.pm
| wins again.

You didn't understand why the middle one produced this particular
result. So you were surprised by the way “use encoding” translates
string literals. I wasn't surprised. I knew how it works and explained
it to you in my followup.

Still, although I think I understand “use encoding” fairly well (because
I spent a lot of time reading the docs and playing with it when I still
thought it would be a useful tool, and later because I spent a lot of
time arguing on usenet that it isn't useful) I think it is too complex.
I would be afraid of making stupid mistakes like writing "\x{E0}" when I
meant chr(0xE0), and even if I don't make them, the next guy who has to
maintain the scripts probably understands much less about “use encoding”
than I do and is likely to misunderstand my code and introduce errors.

hp


[1] I admit that I was surprised by this. It is documented that strings
consist of 64-bit elements on 64-bit machines, but I thought this
was an obvious documentation error until I actually tried it.
--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | ***@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
Eric Pozharski
2012-11-06 19:19:34 UTC
Permalink
*SKIP*
Post by Peter J. Holzer
If you use “use encoding 'KOI8-U';”, you can use KOI8 sequences
(either literally or via escape sequences) in your source code. For
#!/usr/bin/perl
use warnings;
use strict;
use 5.010;
use encoding 'KOI8-U';
my $s1 = "Б";
say ord($s1);
my $s2 = "\x{E2}";
say ord($s2);
__END__
(i.e. the string literal on line 7 is stored as the byte sequence 0x22
* The perl compiler knows that the source code is in KOI-8, so a
single byte 0xE2 in the source code represents the character “U+0411
CYRILLIC CAPITAL LETTER BE”. Similarly, escape sequences of the form
\ooo and \xXX are taken to denote bytes in the source character set
and are translated to unicode. So both the literal Б on line 7 and the
\x{E2} on line 9 are translated to U+0411.
* At run time, the bytecode interpreter sees a string with the single
unicode character U+0411. How this character was represented in the
source code is irrelevant (and indeed, unknowable) to the byte code
interpreter at this stage. It just prints the decimal representation
of 0x0411, which happens to be 1041.
Indeed, that renders perl somewhat lame. "They" could invent some
property, attached at will to any scalar, that would reflect some
byte-encoding connected with this scalar, and then make every other
operation pay attention to that property. However, that hasn't been
done, because on the way to all-utf8 Perl sacrifices have to be made.
Now, if that source were saved as UTF-8, the output wouldn't be any
different.

I had no use for ord() (and I still don't), but it wouldn't surprise
me if at some point in perl development ord() (in this script) started
to return 208. And the only thing that could be done to make it work
would be to upgrade, sometime later.

Look, *literals* are converted to utf8 with UTF8 flag on. Maybe that's
what made (and makes) qr// work, as expected:

{41393:56} [0:0]% perl -wlE '"фыва" =~ m{(\w)}; print $1'

{42187:57} [0:0]% perl -Mutf8 -wle '"фыва" =~ m{(\w)}; print $1'
Wide character in print at -e line 1.
ф
{42203:58} [0:0]% perl -Mencoding=utf8 -wle '"фыва" =~ m{(\w)}; print $1'
ф

For an explanation of what happens in the 1st example, see below. I may
be wrong here, but I think that in the 2nd and 3rd examples it all
revolves around $^H anyway.
Post by Peter J. Holzer
Post by Eric Pozharski
In pre-all-utf8 times qr// was working on bytes without being told to
behave otherwise. That's different now.
Yes, I think I wrote that before. I don't know what this has to do
with the behaviour of “use encoding”, except that historically, “use
encoding” was intended to convert old byte-oriented scripts to the
brave new unicode-centered world with minimal effort. (I don't think
it met that goal: Over the years I have encountered a lot of people
who had problems with “use encoding”, but I don't remember ever
reading from someone who successfully converted their scripts by
slapping “use encoding '...'” at the beginning.)
I didn't convert anything, so I don't pretend you can count me in.
Just now I've come to the conclusion that C<use encoding 'utf8';> (that's
what I've always used) has the effects of C<use utf8;> plus binmode() on
streams, minus the possibility to write non us-ascii literals. I've
always been told that I *must* C<use utf8;> and then manually do
binmode()s myself. Nobody ever explained why I can't do that with
C<use encoding 'utf8';>.

Now, C<use encoding 'binary-enc';> behaves as above (they get a fully
functional UTF-8 script, limited by perl's progress towards all-utf8),
except that the actual source isn't UTF-8. I can imagine reasons why
that could be necessary. Indeed, such circumstances would be rare. I
myself am in approximately full control of my environment, so it's not
a problem for me.

As for the 'lot of people', I'll tell you who I've met. I've seen loads of
13-year-old boys (those are called snowflakes these days) who don't know
how to deal with shit. For those who don't know how to deal with shit,
jobs.perl.org is the way.

*SKIP*
Post by Peter J. Holzer
(but you do have to call it explicitely for STDERR, which IMNSHO is
inconsistent).
Think about it. What the terminal presents (in fonts) is locale-dependent.
That locale could be 'POSIX'. There's no 'POSIX.UTF-8'. And see below.

*SKIP*
Post by Peter J. Holzer
Post by Eric Pozharski
Except the middle one (what I should think about), I think
encoding.pm wins again.
You didn't understand why the the middle one produced this particular
result. So you were surprised by the way “use encoding” translates
string literals. I wasn't surprised. I knew how it works and explained
it to you in my followup.
That's nice you brought that back. I've already figured it all out.

----
{0:1} [0:0]% perl -Mutf8 -wle 'print "à"'

{23:2} [0:0]% perl -Mutf8 -wle 'print "à "'

----
{36271:17} [0:0]% perl -Mutf8 -wle 'print "à"'

{36280:18} [0:0]% perl -Mutf8 -wle 'print "à "'
à
----

What's common in those two pairs: it's special Perl-latin1, with the UTF8
flag off, and no utf8-related layer set on output. What's different:
the former is xterm, the latter is urxvt. In either case, this is what
is actually output:

{36831:20} [0:1]% perl -Mutf8 -wle 'print "à"' | xxd
0000000: e00a ..
{37121:21} [0:0]% perl -Mutf8 -wle 'print "à "' | xxd
0000000: e020 0a . .

So, 0xe0 has no business being in utf-8 output. xterm replaces it with the
replacement character (which makes sense). In contrast, urxvt applies some
weird heuristic (and it's really weird)

{37657:28} [0:0]% perl -Mutf8 -wle 'print "àá"'
à
{37663:29} [0:0]% perl -Mutf8 -wle 'print "àáâ"'
àá
{37666:30} [0:0]% perl -Mutf8 -wle 'print "àáâã"'
àáâ

*If* it's xterm vs. urxvt then, I think, it's religious (which means it's
not going to change). However, it doesn't look configurable, or at least
documented, while it obviously could be useful (configurability
provided). Then again, it may be some weird interaction with fontconfig,
or xft, or some unnamed perl extension, or whatever else. If I don't
forget, I'll investigate it later, after upgrades.

As for your explanation: it's not precise. encoding.pm does what it
always does. It doesn't mangle scalars itself, it *hints* to Encode.pm
(and friends) to decode from the encoding specified to utf8. (How
Encode.pm comes into play is beyond my understanding for now.) In the
case of C<use encoding 'utf8';> it happens to be decoding from utf-8 to
utf8. Encode.pm tries to decode a byte with a value above 0x7F and
falls back to the replacement character.

That may be undesired. And considering this:

encoding - allows you to write your script in non-ascii or non-utf8

C<use encoding 'utf8';> may constitute abuse. What can I say? I'm
abusing it. Maybe that's why it works.

*CUT*
--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
Peter J. Holzer
2012-11-11 11:58:54 UTC
Permalink
Post by Eric Pozharski
Post by Peter J. Holzer
* At run time, the bytecode interpreter sees a string with the single
unicode character U+0411. How this character was represented in the
source code is irrelevant (and indeed, unknowable) to the byte code
interpreter at this stage. It just prints the decimal representation
of 0x0411, which happens to be 1041.
Indeed, that renders perl somewhat lame. "They" could invent some
property, attached at will to any scalar, that would reflect some
byte-encoding connected with this scalar, and then make every other
operation pay attention to that property.
Well, "they" could do all kinds of shit (to borrow your use of
language), but why should they?
Post by Eric Pozharski
Look, *literals* are converted to utf8 with UTF8 flag on. Maybe that's
{41393:56} [0:0]% perl -wlE '"фыва" =~ m{(\w)}; print $1'
{42187:57} [0:0]% perl -Mutf8 -wle '"фыва" =~ m{(\w)}; print $1'
Wide character in print at -e line 1.
ф
{42203:58} [0:0]% perl -Mencoding=utf8 -wle '"фыва" =~ m{(\w)}; print $1'
ф
For an explanation of what happens in the 1st example, see below. I may
be wrong here, but I think that in the 2nd and 3rd examples it all
revolves around $^H anyway.
You are making this way too complicated. You don't need to know about $^H
to understand this. It's really very simple.

In the first example, you are dealing with a string of 8 bytes
"\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0". Depending on the version of Perl you
are using, either none of them are word characters, or several of them
are. You don't get a warning, so I assume you use a perl >= 5.12, where
“use feature unicode_strings” exists and is turned on by -E. In this
case, the first byte of your string is a word character (U+00D1 LATIN
CAPITAL LETTER N WITH TILDE), so the script prints "\xd1\x0a".

In the second and third example, you have a string of 4 characters
"\x{0444}\x{044b}\x{0432}\x{0430}", all of which are word
characters, so the script prints "\x{0444}\x{0a}" (which then gets
encoded by the I/O layers, but I've explained that already and won't
explain it again).
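
The same distinction can be shown in a few lines (a sketch; the variable
names and output format are mine):

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0";   # the UTF-8 bytes of "фыва"
my $chars = decode('UTF-8', $bytes);              # a four-character string

if ($chars =~ /(\w)/) {
    printf "character string matched U+%04X\n", ord $1;   # U+0444
}
if ($bytes =~ /(\w)/) {
    printf "byte string matched 0x%02X\n", ord $1;        # version-dependent, see above
}
else {
    print "byte string: no \\w match on this perl\n";
}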
Post by Eric Pozharski
Post by Peter J. Holzer
Post by Eric Pozharski
In pre-all-utf8 times qr// was working on bytes without being told to
behave otherwise. That's different now.
Yes, I think I wrote that before. I don't know what this has to do
with the behaviour of “use encoding”, except that historically, “use
encoding” was intended to convert old byte-oriented scripts to the
brave new unicode-centered world with minimal effort. (I don't think
it met that goal: Over the years I have encountered a lot of people
who had problems with “use encoding”, but I don't remember ever
reading from someone who successfully converted their scripts by
slapping “use encoding '...'” at the beginning.)
I didn't convert anything, so I don't pretend you can count me in.
Just now I've come to the conclusion that C<use encoding 'utf8';> (that's
what I've always used) has the effects of C<use utf8;> plus binmode() on
streams, minus the possibility to write non us-ascii literals.
Congratulations on figuring that out (except the last one: you can make
non us-ascii literals with “use encoding” (that's one of the reasons why
it was written), the rules are just a bit different than with “use utf8”).
And of course I explicitly wrote that 10 days ago (and Ben possibly
wrote it before that, but I'm not going to reread the whole thread).
Post by Eric Pozharski
I've always been told that I *must* C<use utf8;> and then manually do
binmode()s myself. Nobody ever explained why I can't do that with
C<use encoding 'utf8';>.
I don't know who told you that and who didn't explain that. It wasn't
me, that's for sure ;-). I have explained (in this thread and various
others over the last 10 years) what use encoding does and why I think
it's a bad idea to use it. If you understand it and are aware of the
tradeoffs, feel free to use it. (And of course there is no reason to use
“use utf8” unless your *source code* contains non-ascii characters).
Post by Eric Pozharski
Post by Peter J. Holzer
(but you do have to call it explicitly for STDERR, which IMNSHO is
inconsistent).
Think about it. What the terminal presents (in fonts) is locale-dependent.
That locale could be 'POSIX'. There's no 'POSIX.UTF-8'. And see below.
And how is this only relevant for STDERR but not for STDIN and STDOUT?
Post by Eric Pozharski
Post by Peter J. Holzer
Post by Eric Pozharski
Except the middle one (what I should think about), I think
encoding.pm wins again.
You didn't understand why the middle one produced this particular
result. So you were surprised by the way “use encoding” translates
string literals. I wasn't surprised. I knew how it works and explained
it to you in my followup.
That's nice you brought that back. I've already figured it all out.
[...]

Uh, no. That was a completely different problem.
Post by Eric Pozharski
So, 0xe0 has no business being in utf-8 output. xterm replaces it with the
replacement character (which makes sense). In contrast, urxvt applies some
weird heuristic (and it's really weird)
Yes, we've been through that already.
Post by Eric Pozharski
As for your explanation: it's not precise. encoding.pm does what it
always does. It doesn't mangle scalars itself, it *hints* to Encode.pm
(and friends) to decode from the encoding specified to utf8. (How
Encode.pm comes into play is beyond my understanding for now.)
Maybe you should be less confident about stuff which is beyond your
understanding.

hp
--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | ***@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
Eric Pozharski
2012-11-12 16:58:04 UTC
Permalink
*SKIP*
Post by Peter J. Holzer
Post by Eric Pozharski
As for your explanation: it's not precise. encoding.pm does what it
always does. It doesn't mangle scalars itself, it *hints* to Encode.pm
(and friends) to decode from the encoding specified to utf8. (How
Encode.pm comes into play is beyond my understanding for now.)
Maybe you should be less confident about stuff which is beyond your
understanding.
Here's the deal. Explain to me what's complicated in this:

[quote encoding.pm on]
[producing $enc and $name goes above]
unless ( $arg{Filter} ) {
    DEBUG and warn "_exception($name) = ", _exception($name);
    _exception($name) or ${^ENCODING} = $enc;
    $HAS_PERLIO or return 1;
}
[dealing with Filter option and STDIN/STDOUT goes below]
[quote encoding.pm off]

and I grant you and Ben the unlimited right to spread FUD on encoding.pm
--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
Peter J. Holzer
2012-11-12 21:30:03 UTC
Permalink
Post by Eric Pozharski
Post by Peter J. Holzer
Post by Eric Pozharski
As for your explanation: it's not precise. encoding.pm does what it
always does. It doesn't mangle scalars itself, it *hints* to Encode.pm
(and friends) to decode from the encoding specified to utf8. (How
Encode.pm comes into play is beyond my understanding for now.)
Maybe you should be less confident about stuff which is beyond your
understanding.
[quote encoding.pm on]
[producing $enc and $name goes above]
unless ( $arg{Filter} ) {
    DEBUG and warn "_exception($name) = ", _exception($name);
    _exception($name) or ${^ENCODING} = $enc;
    $HAS_PERLIO or return 1;
}
[dealing with Filter option and STDIN/STDOUT goes below]
[quote encoding.pm off]
So after reading 400 lines of perldoc encoding (presumably not for the
first time) and a rather long discussion thread you are starting to read
the source code to find out what “use encoding” does?

I think you are proving my point that “use encoding” is too complicated
rather nicely.

You will have to read the source code of perl, however. AFAICS
encoding.pm is just a frontend which sets up stuff for the parser to
use. I'm not going to follow you there - the perl parser has too many
tentacles for my taste.

hp
--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | ***@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
Eric Pozharski
2012-11-13 07:09:46 UTC
Permalink
with <slrnka2qir.5cu.hjp-***@hrunkner.hjp.at> Peter J. Holzer wrote:
*SKIP*
Post by Peter J. Holzer
I'm not going to follow you there - the perl parser has too many
tentacles for my taste.
I don't feel any better. Pity.
--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
Ben Morrow
2012-11-05 19:40:55 UTC
Permalink
Post by Peter J. Holzer
whether the string is character string (with the UTF8 bit set) or a byte
string,
Careful. You're conflating the existing-only-in-the-programmer's-head
concept of 'do I consider this string to contain bytes for IO or
characters for manipulation' with the perl-internal SvUTF8 flag, which
is exactly the mistake we have been trying to stop people making since
5.8.0 was released and we realised the 3rd-Camel model where Perl keeps
track of the characters/bytes distinction isn't workable. It's entirely
possible and sensible for a 'byte string', that is, a string containing
only characters <256 intended for raw IO, to happen to have SvUTF8 set
internally, with byte values >127 represented as 2 bytes.
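
A short illustration of that point (a sketch; the example string is mine):

#!/usr/bin/perl
use strict;
use warnings;

my $plain    = "\xE0\x0A";     # byte string, SvUTF8 off
my $upgraded = "\xE0\x0A";
utf8::upgrade($upgraded);      # same characters, now stored internally as 0xC3 0xA0 0x0A

print "identical contents\n" if $plain eq $upgraded;       # true: eq compares characters
printf "is_utf8: plain=%d upgraded=%d\n",
    utf8::is_utf8($plain)    ? 1 : 0,
    utf8::is_utf8($upgraded) ? 1 : 0;                       # 0 vs 1: the flag is an internal detail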

Ben
Peter J. Holzer
2012-11-05 22:42:11 UTC
Permalink
Post by Ben Morrow
Post by Peter J. Holzer
whether the string is character string (with the UTF8 bit set) or a byte
string,
Careful. You're conflating the existing-only-in-the-programmer's-head
concept of 'do I consider this string to contain bytes for IO or
characters for manipulation' with the perl-internal SvUTF8 flag, which
is exactly the mistake we have been trying to stop people making since
5.8.0 was released
Who is "we"? Before 5.12, you had to make the distinction.
Strings without the SvUTF8 flag simply didn't have Unicode semantics.
Now there is the unicode_strings feature, but

1) it still isn't default
2) it will be years before I can rely on perl 5.12+ being installed on
a sufficient number of machines to use it. I'm not even sure if most
of our machines have 5.10 yet (the Debian machines have, but most of
the RHEL machines have 5.8.x)

So, that distinction has at least existed for 8 years (2002-07-18 to
2010-04-12) and for many of us it will exist for another few years.

So enforcing the concept I have in my head in the Perl code is simply
defensive programming.
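
For reference, a minimal sketch of what the unicode_strings feature
changes (requires perl 5.12 or later; the \xE0 example is mine):

#!/usr/bin/perl
use strict;
use warnings;
use feature 'unicode_strings';

my $s = "\xE0";     # a byte string, SvUTF8 flag not set
# With unicode_strings, \w uses Unicode rules regardless of the internal flag:
print $s =~ /\w/ ? "word character\n" : "not a word character\n";   # "word character"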
Post by Ben Morrow
and we realised the 3rd-Camel model where Perl keeps track of the
characters/bytes distinction isn't workable.
It worked for me ;-).
Post by Ben Morrow
It's entirely possible and sensible for a 'byte string', that is, a
string containing only characters <256 intended for raw IO, to happen
to have SvUTF8 set internally, with byte values >127 represented as 2
bytes.
Theoretically yes. In practice it almost always means that the
programmer forgot to call encode() somewhere.

And the other way around didn't work at all: you couldn't keep
characters > 127 but < 256 in a string without the SvUTF8 flag set
and expect it to work.

hp
--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | ***@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
Ben Morrow
2012-11-05 23:30:11 UTC
Permalink
Post by Peter J. Holzer
Post by Ben Morrow
Post by Peter J. Holzer
whether the string is character string (with the UTF8 bit set) or a byte
string,
Careful. You're conflating the existing-only-in-the-programmer's-head
concept of 'do I consider this string to contain bytes for IO or
characters for manipulation' with the perl-internal SvUTF8 flag, which
is exactly the mistake we have been trying to stop people making since
5.8.0 was released
Who is "we"?
TINW
Post by Peter J. Holzer
Before 5.12, you had to make the distinction.
Strings without the SvUTF8 flag simply didn't have Unicode semantics.
Now there is the unicode_strings feature, but
1) it still isn't default
2) it will be years before I can rely on perl 5.12+ being installed on
a sufficient number of machines to use it. I'm not even sure if most
of our machines have 5.10 yet (the Debian machines have, but most of
the RHEL machines have 5.8.x)
So, that distinction has at least existed for 8 years (2002-07-18 to
2010-04-12) and for many of us it will exist for another few years.
So enforcing the concept I have in my head in the Perl code is simply
defensive programming.
That is all true, and was and is a major problem to those who cared.
However, I was referring to the other half of the problem: Perl no
longer attempts to make any guarantees about the state of the SvUTF8
flag. Any operation might in principle up- or downgrade a string, even
if it wasn't obvious it would need to. This means it isn't safe to store
user data like 'this string is supposed to represent bytes' in that flag
and expect that it will be preserved, and it isn't safe to assume
strings returned from arbitrary functions will have the flag set the way
you expect.

If you want reliable Unicode semantics for Latin-1 characters before
5.12 you have to explicitly utf8::upgrade before each potentially
Unicode-aware operation.
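
A sketch of that workaround for pre-5.12 perls (the example string is mine):

#!/usr/bin/perl
use strict;
use warnings;

my $s = "\xE0";     # LATIN SMALL LETTER A WITH GRAVE, stored as a byte string

# Without unicode_strings in effect, \w on a byte string uses the old byte semantics:
print $s =~ /\w/ ? "matches\n" : "no match\n";   # typically "no match" here

utf8::upgrade($s);  # force the internal UTF-8 representation

# After the upgrade, the same regex applies Unicode rules:
print $s =~ /\w/ ? "matches\n" : "no match\n";   # "matches"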
Post by Peter J. Holzer
Post by Ben Morrow
and we realised the 3rd-Camel model where Perl keeps track of the
characters/bytes distinction isn't workable.
It worked for me ;-).
There are just too many cases where a string gets upgraded by mistake,
and too many weird corner cases, particularly with pattern-matching.

Ben
Shmuel (Seymour J.) Metz
2012-11-07 01:52:54 UTC
Permalink
Who is "we"? Before 5.12, you had to make the distinction. Strings
without the SvUTF8 flag simply didn't have Unicode semantics. Now
there is the unicode_strings feature, but
3. 5.8.7 is the last Perl release available on IBM's EBCDIC
operating systems, e.g., z/OS. I don't know whether there
is a similar issue with Unisys.
--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to ***@library.lspace.org
Peter J. Holzer
2012-11-11 12:09:57 UTC
Permalink
Post by Shmuel (Seymour J.) Metz
Who is "we"? Before 5.12, you had to make the distinction. Strings
without the SvUTF8 flag simply didn't have Unicode semantics. Now
there is the unicode_strings feature, but
3. 5.8.7 is the last Perl release available on IBM's EBCDIC
operating systems, e.g., z/OS.
True. But what does that have to do with the paragraph you quoted?
Post by Shmuel (Seymour J.) Metz
I don't know whether there is a similar issue with Unisys.
It is my understanding that modern perl versions don't work on any
EBCDIC-based platform, so that would include Unisys[1], HP/MPE and other
EBCDIC-based platforms. Especially since these platforms are quite dead,
unlike z/OS which is still maintained.

hp

[1] Not all Unisys systems used EBCDIC. I think at least the 1100 series
used ASCII.
--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | ***@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
Shmuel (Seymour J.) Metz
2012-11-11 14:44:21 UTC
Permalink
Post by Peter J. Holzer
True. But what does that have to do with the paragraph you quoted?
That paragraph appears to suggest upgrading to 5.12; I was pointing
out that that is not always an option.
Post by Peter J. Holzer
unlike z/OS which is still maintained.
As are z/TPF, z/VM, z/VSE and iOS[1].
Post by Peter J. Holzer
[1] Not all Unisys systems used EBCDIC. I think at least the 1100
series used ASCII.
I don't know whether it was ASCII or FieldData, but definitely not
EBCDIC. AFAIK the only Unisys systems to use EBCDIC are the ones
descended from the Burroughs B6500 with MCP.

[1] Or whatever the current name is.
--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to ***@library.lspace.org
Peter J. Holzer
2012-11-11 20:26:24 UTC
Permalink
Post by Shmuel (Seymour J.) Metz
Post by Peter J. Holzer
True. But what does that have to do with the paragraph you quoted?
That paragraph appears to suggest upgrading to 5.12;
No that wasn't the intention. I was questioning Ben's assertion that
"we've been trying to stop people making this mistake since 5.8.0",
because before 5.12.0 it wasn't a mistake, it was a correct
understanding of how perl/Perl worked.

Unless of course by "people" he didn't mean Perl programmers but the
p5p team and by "stop making this mistake" he meant "introducing the
unicode_strings feature and including it in 'use v5.12'". It is indeed
possible that the so-called "Unicode bug" was identified shortly after
5.8.0 and that Ben and others were trying to fix it since then.
Post by Shmuel (Seymour J.) Metz
I was pointing out that that is not always an option.
I mentioned that it wasn't an option for me just a few lines further
down. Of course in my case "not an option" just means "more hassle than
it's worth", not "impossible", I could install and maintain a current
Perl version on the 40+ servers I administer. But part of the appeal of
Perl is that it's part of the normal Linux infrastructure. Rolling my
own subverts that.

So, I hope I'll get rid of perl 5.8.x in 2017 (when the support for RHEL
5.x ends) and of perl 5.10.x in 2020 (EOL for RHEL 6.x). Then I can
write "use v5.12" into my scripts and enjoy a world without the Unicode
bug.

hp
--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | ***@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel