end-of-line conventions

kj

2009-08-13 19:26:34 UTC

There are three major conventions for the end-of-line marker:
"\n", "\r\n", and "\r".

In a variety of situations, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.

These three issues are tested by the following simple script:

my $lines = my $matches = 0;
while (<>) {
    $lines++;
    if (/z$/) {
        $matches++;
        chomp;
        print ">$_<";
    }
}

print "$/$matches matches out of $lines lines$/";
__END__

I have three files, unix.txt, dos.txt, and mac.txt, each containing
four lines. Disregarding the end-of-line character(s) these lines
are "foo", "bar", "baz", "frobozz".

The file unix.txt uses "\n" to separate the lines. The output that
I get when I pass it as the argument to the script is this:

% demo.pl unix.txt

>baz<>frobozz<

2 matches out of 4 lines

The file dos.txt uses "\r\n" to separate lines, and the file mac.txt
uses "\r". Here's the output I get when I pass these files to the
script:

% demo.pl dos.txt

0 matches out of 4 lines
% demo.pl mac.txt

0 matches out of 1 lines

How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

(Mucking with the value of $/ I was able to get <> to split the
input stream at the right places, but it had no impact on the result
of the regular expression match.)

TIA!

kynn

Ben Morrow

2009-08-13 20:03:39 UTC

Post by kj
"\n", "\r\n", and "\r".
In a variety of situation, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.
my $lines = my $matches = 0;
while (<>) {
$lines++;
if (/z$/) {
$matches++;
chomp;
print ">$_<";
}
}
print "$/$matches matches out of $lines lines$/";
__END__
I have three files, unix.txt, dos.txt, and mac.txt, each containing
four lines. Disregarding the end-of-line character(s) these lines
are "foo", "bar", "baz", "frobozz".
The file unix.txt uses "\n" to separate the lines. The output that
% demo.pl unix.txt

>baz<>frobozz<

2 matches out of 4 lines
The file dos.txt uses "\r\n" to separate lines, and the file mac.txt
uses "\r". Here's the output I get when I pass these files to the
% demo.pl dos.txt
0 matches out of 4 lines
% demo.pl mac.txt
0 matches out of 1 lines
How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

I would use PerlIO::eol, but I'm not sure how to integrate that into a
script using magic <>. It's possible that something like

BEGIN { binmode ARGV, ":raw:eol" }

will work; if not, you will need to loop over @ARGV and open the files
with the :eol layer yourself. (You could, I suppose, use

use open ":std", ":raw:eol";

but that will affect all filehandles in your program.)

Ben

Tad J McClellan

2009-08-13 20:27:30 UTC

Subject: end-of-line conventions

Have you read the "Newlines" section in

perldoc perlport

??

"\n", "\r\n", and "\r".
In a variety of situation, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries.

perl detects its platform when it is *compiled*.

That is, perl decides what line ending to use when it is built.

The file dos.txt uses "\r\n" to separate lines, and the file mac.txt
uses "\r".
How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

You can't.

--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

kj

2009-08-13 20:53:43 UTC

Post by Tad J McClellan

Subject: end-of-line conventions

Have you read the "Newlines" section in
perldoc perlport
??

"\n", "\r\n", and "\r".
In a variety of situation, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries.

perl detects its platform when it is *compiled*.
That is, perl decides what line ending to use when it is built.

The file dos.txt uses "\r\n" to separate lines, and the file mac.txt
uses "\r".
How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

You can't.

Mind-blowing, to say the least...

Oh, well. Live and lurn. Thanks. And to Ben too.

kynn

Nathan Keel

2009-08-13 23:33:26 UTC

Post by kj
Mind-blowing, to say the least...
Oh, well. Live and lurn. Thanks. And to Ben too.
kynn

Don't worry, use a real OS (not Windows) and you'll not have to think
about these things, though they are easily dealt with, and you'll have
a lot more benefits as well.

Ben Morrow

2009-08-13 20:59:50 UTC

Post by Tad J McClellan

Post by kj
"\n", "\r\n", and "\r".
In a variety of situation, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries.

perl detects its platform when it is *compiled*.
That is, perl decides what line ending to use when it is built.

This isn't strictly true. The C compiler used determines what numeric
values to associate with the characters "\n" and "\r"; on all non-EBCDIC
non-Mac-OS-Classic systems (including Win32 and Mac OS X) they are 10
and 13 respectively.

With modern perls (certainly since 5.8.0; I'm not sure what happened
with 5.6) perl decides at build time what default PerlIO layers to use;
on Win32 (and some other systems) this will include :crlf, which
translates "\r\n" newlines into "\n" on input and vice-versa on output.
Internally perl always considers a newline to be whatever the C compiler
calls "\n". (Presumably this means Perl on OS X can't read Mac-native
"\r"-separated files without help.)

This default can be changed, in several ways. It can be changed for
individual filehandles with binmode; for a given lexical scope with the
'open' pragma; and for the whole process by running perl with the PERLIO
environment variable set. I would *always* recommend that anyone wanting
to read text files either sets all filehandles to :raw mode and handles
newlines manually or uses the :eol PerlIO layer. IMHO the :crlf layer is
not useful, and trying to do anything clever with it can have very
surprising results.

Ben

Heiko Eißfeldt

2009-08-13 21:13:17 UTC

Post by kj
"\n", "\r\n", and "\r".

These notations are not unambiguous! See the "Newlines" section of the
perlport documentation for details.

Post by kj
In a variety of situation, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.

<> and chomp use the $/ variable for line endings. Since $/ does not
support regular expressions, you cannot use this mechanism for all
types of line endings.

The $ anchor normally matches at the end of the string, or just before
a "\n" at the end of the string.
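A minimal sketch of that point, assuming the separator of each input is already known and using in-memory filehandles as stand-ins for the three files (the helper name count_z is made up for illustration, and a Unix-like perl with no default :crlf layer is assumed). Setting $/ makes both <> and chomp follow the chosen convention, and chomping before the match lets /z$/ see a bare line:

```perl
use strict;
use warnings;

# Hypothetical helper: count lines ending in "z" when the separator is known.
sub count_z {
    my ($data, $sep) = @_;
    local $/ = $sep;                      # readline and chomp both honour $/
    open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
    my ($lines, $matches) = (0, 0);
    while (my $line = <$fh>) {
        $lines++;
        chomp $line;                      # removes one trailing copy of $/
        $matches++ if $line =~ /z$/;      # no CR or LF is left at the end now
    }
    close $fh;
    return ($lines, $matches);
}

my @words = qw(foo bar baz frobozz);
for my $sep ("\012", "\015\012", "\015") {
    my $data = join '', map { $_ . $sep } @words;
    my ($lines, $matches) = count_z($data, $sep);
    print "$matches matches out of $lines lines\n";
}
```

This only helps when the convention is known up front; it does not detect the convention, which is the limitation described above.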

Post by kj
How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

use strict;
use warnings;

my $lines = my $matches = 0;
{
    local $/ = undef;
    for (<> =~ m{\G([^\012\015]*) \015?\012?}xmsg) {
        $lines++;
        if (/z$/) {
            $matches++;
            print ">$_<";
        }
    }
}
print "\n$matches matches out of $lines lines\n";
__END__

This uses <> with no line end definition, and iterates with a regular
expression suitable for three types of line endings. The line ending is
not included in $_, so chomp is omitted.
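One caveat worth checking: as far as I can tell, a global match whose pattern can match the empty string may return one extra empty element at the very end of the input, which would inflate the line count by one. The following is a sketch of a variant that avoids this, not a drop-in replacement; the helper name split_lines is hypothetical, and it still cannot tell whether a mixed sequence such as CR CR LF is meant as one, two, or three line endings.

```perl
use strict;
use warnings;

# Hypothetical helper: split a slurped string on LF, CRLF or CR.
# Call it in list context.
sub split_lines {
    my ($text) = @_;
    # (?!\z) refuses to start a match at the very end of the string, so no
    # extra empty "line" is produced after the final terminator.
    # The alternation consumes exactly one terminator; \z covers a last
    # line that has no terminator at all.
    return $text =~ m{\G(?!\z)([^\012\015]*)(?:\015\012?|\012|\z)}g;
}

for my $sep ("\012", "\015\012", "\015") {
    my @lines   = split_lines(join '', map { $_ . $sep } qw(foo bar baz frobozz));
    my $matches = grep { /z$/ } @lines;
    printf "%d matches out of %d lines\n", $matches, scalar @lines;
}
```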

If you need the line endings in $_, use the following loop instead:

for (<> =~ m{\G([^\012\015]* \015?\012?)}xmsg) {
    $lines++;
    if (/z\s*$/) {
        $matches++;
        s{[\015\012][\015\012]?}{}xms; # chomp replacement
        print ">$_<";
    }
}

Hope that helps, heiko

s***@netherlands.com

2009-08-17 21:32:45 UTC

Post by Heiko Eißfeldt

Post by kj
"\n", "\r\n", and "\r".

These notations are not unambiguous! See the "Newlines" section of the
perlport documentation for details.

Post by kj
In a variety of situation, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.

<> and chomp use the $/ variable for line endings. Since $/ does not
support regular expressions, you cannot use this mechanism for all
types of line endings.
The $ anchor normally is just the end of the string (with or without an
line ending).

Post by kj
How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

use strict;
use warnings;
my $lines = my $matches = 0;
{
local $/ = undef;
for (<> =~ m{\G([^\012\015]*) \015?\012?}xmsg) {

^^^^^^^^
This won't work, depending on the translation mode opened or
appended to before, opened now, etc.., 0d 0d 0a could be one, two
or 3 eol's.
In fact you don't even have, or couldn't create a reference anchor
to tell the difference.

-sln

Steve C

2009-08-13 22:24:22 UTC

Post by kj
"\n", "\r\n", and "\r".
In a variety of situation, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.
my $lines = my $matches = 0;
while (<>) {
$lines++;
if (/z$/) {
$matches++;
chomp;
print ">$_<";
}
}
print "$/$matches matches out of $lines lines$/";
__END__
I have three files, unix.txt, dos.txt, and mac.txt, each containing
four lines. Disregarding the end-of-line character(s) these lines
are "foo", "bar", "baz", "frobozz".
The file unix.txt uses "\n" to separate the lines. The output that
% demo.pl unix.txt

>baz<>frobozz<

2 matches out of 4 lines
The file dos.txt uses "\r\n" to separate lines, and the file mac.txt
uses "\r". Here's the output I get when I pass these files to the
% demo.pl dos.txt
0 matches out of 4 lines
% demo.pl mac.txt
0 matches out of 1 lines
How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

Since "\n" eq "\012" on unix, you ought to be able to
do something like this to be the same on all platforms:

my $lines = my $matches = 0;

$/ = "\012";
binmode STDIN;
binmode STDOUT;

while (<>) {
    $lines++;
    if (/z\012/) {
        $matches++;
        s/\012//g;
        print ">$_<";
    }
}

print "$/$matches matches out of $lines lines$/";
__END__

Ben Morrow

2009-08-13 23:19:12 UTC

Post by Steve C
Since "\n" eq "\012" on unix, you ought to be able to
my $lines = my $matches = 0;
$/ = "\012";
binmode STDIN;
binmode STDOUT;
while (<>) {
$lines++;
if (/z\012/) {
$matches++;
s/\012//g;
print ">$_<";
}
}
print "$/$matches matches out of $lines lines$/";
__END__

Did you try it? This completely fails with "\r"-separated files, and
fails to match any lines with "\r\n"-separated files.

Ben

Steve C

2009-08-14 15:18:58 UTC

Post by Ben Morrow

Post by Steve C
Since "\n" eq "\012" on unix, you ought to be able to
my $lines = my $matches = 0;
$/ = "\012";
binmode STDIN;
binmode STDOUT;
while (<>) {
$lines++;
if (/z\012/) {
$matches++;
s/\012//g;
print ">$_<";
}
}
print "$/$matches matches out of $lines lines$/";
__END__

Did you try it? This completely fails with "\r"-separated files, and
fails to match any lines with "\r\n"-separated files.
Ben

I misread the question.

chris

2009-08-14 14:58:04 UTC

Post by kj
"\n", "\r\n", and "\r".
In a variety of situation, Perl must split strings into "lines",
and must therefore follow a particular convention to identify line
boundaries. There are three situations that interest me in
particular: 1. the splitting into lines that happens when one
iterates over a file using the <> operator; 2. the meaning of the
operation performed by chomp; and 3. the meaning of the $ anchor
in regular expressions.
my $lines = my $matches = 0;
while (<>) {
$lines++;
if (/z$/) {
$matches++;
chomp;
print ">$_<";
}
}
print "$/$matches matches out of $lines lines$/";
__END__
I have three files, unix.txt, dos.txt, and mac.txt, each containing
four lines. Disregarding the end-of-line character(s) these lines
are "foo", "bar", "baz", "frobozz".

If you're on linux (it seems you are) I would pass any files of dubious
origin through 'mac2unix' and 'dos2unix' first to ensure that your perl
will parse them correctly.

Jürgen Exner

2009-08-15 13:58:16 UTC

Yes.

Post by kj
"\n", "\r\n", and "\r".

No. The end-of-line markers are "\010", "\013\010", and "\013".

"\n" is Perl's short-hand notation for whatever end-of-line marker
combination is used on the current platform, thus it can be any of the
three.

Post by kj
How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

If you have to deal with cross-platform files then your best bet is to
explicitly check for each combination individually and not to use the
short-hand "\n".

jue

Ben Morrow

2009-08-15 22:39:45 UTC

Post by Jürgen Exner
Yes.

Post by kj
"\n", "\r\n", and "\r".

No. The end-of-line markers are "\010", "\013\010", and "\013".

ITYM \012 and \015 there. \0-escapes are in octal.

Post by Jürgen Exner
"\n" is Perl's short-hand notation for whatever end-of-line marker
combination is used on the current platform, thus it can be any of the
three.

"\n" can *never* mean "\015\012": on Win32 it means "\012", just as on
Unix. If "\n" was more than one byte/character (take your pick) long,
practically everything would break. AFAIK the only platforms where "\n"
ne "\012" are Mac OS Classic and the EBCDIC platforms, both of which are
obsolete as far as Perl is concerned.

Post by Jürgen Exner

Post by kj
How can I change the script so that the output for unix.txt, dos.txt,
and mac.txt will be the same as the one shown above for unix.txt?

If you have to deal with cross-platform files then your best bet is to
explicitly check for each combination individually and not to use the
short-hand "\n".

IMHO your best bet is to normalize the newlines before looking for
"\n"s.
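A minimal sketch of the normalize-first approach, assuming each file is small enough to slurp and is opened with the :raw layer so that no translation happens behind the script's back. The helper name normalize_newlines is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: convert CRLF and lone CR to LF.
sub normalize_newlines {
    my ($text) = @_;
    # The optional LF is consumed together with its CR, so a CRLF pair
    # becomes a single LF rather than two.
    $text =~ s/\015\012?/\012/g;
    return $text;
}

my ($lines, $matches) = (0, 0);
for my $file (@ARGV) {
    open my $fh, '<:raw', $file or die "cannot open $file: $!";
    my $text = do { local $/; <$fh> };        # slurp the whole file
    close $fh;
    $text = '' unless defined $text;          # empty file
    # split drops trailing empty fields, so the final terminator does
    # not produce an extra line.
    for my $line (split /\012/, normalize_newlines($text)) {
        $lines++;
        $matches++ if $line =~ /z$/;
    }
}
print "$matches matches out of $lines lines\n";
```

After normalization the rest of the script only ever has to think about "\012", which is what "\n" means on current non-EBCDIC platforms.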

Ben

s***@netherlands.com

2009-08-15 23:24:07 UTC

Post by Ben Morrow

Post by Jürgen Exner
Yes.

Post by kj
"\n", "\r\n", and "\r".

No. The end-of-line markers are "\010", "\013\010", and "\013".

ITYM \012 and \015 there. \0-escapes are in octal.

<snip>

Post by Ben Morrow
Ben

He meant 10/13 respectively.
Let's get this table going just for grins:

      lf    crlf      cr
dec   10    13,10     13
hex   0a    0d,0a     0d
oct   012   015,012   015

But how should binary intended be interpreted if opened for translation?
Even if ascii and invalidness.

The recovery of a applies to all regexp valid regex cannot create a mixed
mode platform with append. Either all is converted OR invalid, or
none is converted.

No 0a0a0d0d0a0a. Naw, invalid. At best, recover what is possible,
rewrite file, right the ship, destroy old. Don't tell anybody about it.
Delete file, exit with success, or reformat hd, send it to deep magnetic
disk recovery for partial recovery, tracks wiped clean.

-sln

Jürgen Exner

2009-08-16 04:16:32 UTC

Post by Ben Morrow

Post by Jürgen Exner
No. The end-of-line markers are "\010", "\013\010", and "\013".

ITYM \012 and \015 there. \0-escapes are in octal.

Yes, sorry.

Post by Ben Morrow

Post by Jürgen Exner
"\n" is Perl's short-hand notation for whatever end-of-line marker
combination is used on the current platform, thus it can be any of the
three.

"\n" can *never* mean "\015\012": on Win32 it means "\012", just as on
Unix.

But then how come that the file created by this little program

open FOO, ">" , "foo";
print FOO "k\n" x 20;
close FOO;

is 60 bytes long instead of 40, as would be expected if the 'k' and
the "\n" were each only one byte long?

C:\tmp>dir foo
15-Aug-09 21:13 60 foo

jue

s***@netherlands.com

2009-08-16 04:40:21 UTC

Post by Jürgen Exner

Post by Ben Morrow

Post by Jürgen Exner
No. The end-of-line markers are "\010", "\013\010", and "\013".

ITYM \012 and \015 there. \0-escapes are in octal.

Yes, sorry.

Post by Ben Morrow

Post by Jürgen Exner
"\n" is Perl's short-hand notation for whatever end-of-line marker
combination is used on the current platform, thus it can be any of the
three.

"\n" can *never* mean "\015\012": on Win32 it means "\012", just as on
Unix.

But then how come that the file created by this little program
open FOO, ">" , "foo";
print FOO "k\n" x 20;
close FOO;
is 60 bytes long instead of 40 as would to be expected if the 'k' and
the "\n" each were only one byte long?
C:\tmp>dir foo
15-Aug-09 21:13 60 foo
jue

Depends on what has edited it and how it is written out.
Open a file with 0d-only eols in Word on Windows and it edits each line
as 0d 0a. Modify and save it and, I think, it keeps only 0d.
But Word jacks a lot of stuff, especially encoding.
-sln

Willem

2009-08-16 07:59:33 UTC

Jürgen Exner wrote:
) But then how come that the file created by this little program
)
) open FOO, ">" , "foo";
) print FOO "k\n" x 20;
) close FOO;
)
) is 60 bytes long instead of 40 as would to be expected if the 'k' and
) the "\n" each were only one byte long?

Because the I/O routine translates the newlines. Just like in C.
Perl probably even uses the C I/O library to write to the file.

SaSW, Willem

--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

s***@netherlands.com

2009-08-16 09:06:22 UTC

Post by Willem
) But then how come that the file created by this little program
)
) open FOO, ">" , "foo";
) print FOO "k\n" x 20;
) close FOO;
)
) is 60 bytes long instead of 40 as would to be expected if the 'k' and
) the "\n" each were only one byte long?
Because the I/O routine translates the newlines. Just like in C.
Perl probably even uses the C I/O library to write to the file.
SaSW, Willem

There are heuristics in Windows programs. Just look at Word, a Microsoft
offering.

-sln

Jürgen Exner

2009-08-16 16:24:12 UTC

Post by Willem
) But then how come that the file created by this little program
)
) open FOO, ">" , "foo";
) print FOO "k\n" x 20;
) close FOO;
)
) is 60 bytes long instead of 40 as would to be expected if the 'k' and
) the "\n" each were only one byte long?
Because the I/O routine translates the newlines.

So, I guess you are saying that there is a context where "\n" does mean
two characters, contrary to Ben's statement:
"\n" can *never* mean "\015\012"

jue

Peter J. Holzer

2009-08-16 17:15:56 UTC

Post by Jürgen Exner

Post by Willem
) But then how come that the file created by this little program
)
) open FOO, ">" , "foo";
) print FOO "k\n" x 20;
) close FOO;
)
) is 60 bytes long instead of 40 as would to be expected if the 'k' and
) the "\n" each were only one byte long?
Because the I/O routine translates the newlines.

So, I guess you are saying that there is a context where "\n" does mean
"\n" can *never* mean "\015\012"

"\n" is *always* a string containing one character (\x{000A} on most
platforms including Windows). However, when this character is written to
a file handle, an I/O layer may convert this in any way it pleases. It
may just pass it through unchanged, it may convert it into a sequence of
two bytes (e.g. "\x0D\x0A"), or it might even pad all lines to a fixed
length with spaces and not write any new line characters at all.

On input the reverse transformation should be performed.

hp

Ben Morrow

2009-08-16 17:18:42 UTC

Post by Jürgen Exner

Post by Willem
) But then how come that the file created by this little program
)
) open FOO, ">" , "foo";
) print FOO "k\n" x 20;
) close FOO;
)
) is 60 bytes long instead of 40 as would to be expected if the 'k' and
) the "\n" each were only one byte long?
Because the I/O routine translates the newlines.

So, I guess you are saying that there is a context where "\n" does mean
"\n" can *never* mean "\015\012"

No. "\n" means "\012". Your FOO filehandle has a :crlf layer on it,
which translates "\012" to "\015\012" on output. You will get exactly
the same result if you run

open FOO, ">", "foo";
print FOO "k\012" x 20;
close FOO;

so the CRLF newlines have nothing to do with your use of "\n".
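The effect of the layer can be checked without a Windows box. The sketch below assumes a Unix-like perl, where :crlf has to be requested explicitly, and a hypothetical helper written_size that writes a string through the given layers to a temporary file and reports the file size. With :raw the 40 characters should arrive as 40 bytes; with :crlf they should become 60.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given PerlIO layers and
# return the size in bytes of the resulting file.
sub written_size {
    my ($layers, $text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);   # removed at program exit
    close $tmp;
    open my $out, ">$layers", $path or die "cannot open $path: $!";
    print {$out} $text;
    close $out or die "cannot close $path: $!";
    return -s $path;
}

my $text = "k\n" x 20;                          # 40 characters inside perl
printf "length in perl: %d\n", length $text;
printf ":raw  wrote %d bytes\n", written_size(':raw',  $text);
printf ":crlf wrote %d bytes\n", written_size(':crlf', $text);
```

The string is identical in both calls; only the layer on the filehandle differs.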

Ben

RedGrittyBrick

2009-08-23 12:28:23 UTC

Post by Ben Morrow

Post by Jürgen Exner

Post by Willem
) But then how come that the file created by this little program
)
) open FOO, ">" , "foo";
) print FOO "k\n" x 20;
) close FOO;
)
) is 60 bytes long instead of 40 as would to be expected if the 'k' and
) the "\n" each were only one byte long?
Because the I/O routine translates the newlines.

So, I guess you are saying that there is a context where "\n" does mean
"\n" can *never* mean "\015\012"

No. "\n" means "\012". Your FOO filehandle has a :crlf layer on it,
which translates "\012" to "\015\012" on output. You will get exactly
the same result if you run
open FOO, ">", "foo";
print FOO "k\012" x 20;
close FOO;
so the CRLF newlines have nothing to do with your use of "\n".

So it seems.

C:\> perl -e "print length qq(A\n)"
2

C:\> perl -e "print unpack 'H*', qq(A\n)"
410a

C:\> perl -e "print qq(A\n)" | od -t x1
0000000 41 0d 0a
0000003

C:\> perl -e "print qq(A\n)" | od -a
0000000 A cr nl
0000003

For most purposes you can get away with thinking that "\n" represents
whatever line-ending sequence is appropriate for the current platform. I
did for a long time. Obviously there are cases where it becomes apparent
that the truth is a little more complicated. But only a little.

--
RGB

Shmuel (Seymour J.) Metz

2009-08-16 13:30:45 UTC

Are you a betting man.

Post by Ben Morrow
"\n" can *never* mean "\015\012": on Win32 it means "\012", just as on
Unix. If "\n" was more than one byte/character (take your pick) long,
practically everything would break.

Wrong; \n is two bytes on DOS and OS/2, and AFAIK nothing breaks except
cgi.pm.

Post by Ben Morrow
AFAIK the only platforms where "\n" ne "\012" are Mac OS Classic and
the EBCDIC platforms, both of which are obsolete as far as Perl is
concerned.

When there's ongoing maintenance then it's a rather lively corpse.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to ***@library.lspace.org

Ben Morrow

2009-08-16 17:24:22 UTC

Post by Shmuel (Seymour J.) Metz
Are you a betting man.

Not usually :).

Post by Shmuel (Seymour J.) Metz

Post by Ben Morrow
"\n" can *never* mean "\015\012": on Win32 it means "\012", just as on
Unix. If "\n" was more than one byte/character (take your pick) long,
practically everything would break.

Wrong; \n is two bytes on DOS and OS/2, and AFAIK nothing breaks except
cgi.pm.

No it's not, at least not as far as Perl is concerned. Files have CRLF
line endings, but they are (by default) translated into LF line endings
when the file is read. If you have a file containing

fooCRLF

and you read a line with

open my $FOO, "<", "foo";
my $foo = <$FOO>;

then $foo will be four bytes long, not five.
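A sketch of the same experiment with the layers spelled out, so that it does not depend on the platform default. It assumes a temporary file holding the five bytes "foo", CR, LF; the helper name first_line is made up for illustration. Read through :crlf the line should be four characters long, and read through :raw it should be five.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the five bytes "foo", CR, LF with no translation on the way out.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "foo\015\012";
close $tmp or die "cannot close $path: $!";

# Hypothetical helper: read the first line of the file through the given layers.
sub first_line {
    my ($layers) = @_;
    open my $fh, "<$layers", $path or die "cannot open $path: $!";
    my $line = <$fh>;
    close $fh;
    return $line;
}

# :crlf folds the CRLF pair into a single LF on input.
printf ":crlf sees %d characters\n", length first_line(':crlf');
# :raw hands over the bytes exactly as stored, CR included.
printf ":raw  sees %d characters\n", length first_line(':raw');
```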

Post by Shmuel (Seymour J.) Metz

Post by Ben Morrow
AFAIK the only platforms where "\n" ne "\012" are Mac OS Classic and
the EBCDIC platforms, both of which are obsolete as far as Perl is
concerned.

When there's ongoing maintenance then it's a rather lively corpse.

Perl is no longer maintained for Mac OS Classic or the EBCDIC platforms.
I did not mean to imply they were obsolete for other purposes.

Ben

s***@netherlands.com

2009-08-17 22:23:32 UTC

Post by Ben Morrow

Post by Shmuel (Seymour J.) Metz
Are you a betting man.

Not usually :).

Post by Shmuel (Seymour J.) Metz

Post by Ben Morrow
"\n" can *never* mean "\015\012": on Win32 it means "\012", just as on
Unix. If "\n" was more than one byte/character (take your pick) long,
practically everything would break.

Wrong; \n is two bytes on DOS and OS/2, and AFAIK nothing breaks except
cgi.pm.

No it's not, at least not as far as Perl is concerned. Files have CRLF
line endings, but they are (by default) translated into LF line endings
when the file is read. If you have a file containing
fooCRLF
and you read a line with
open my $FOO, "<", "foo";
my $foo = <$FOO>;
then $foo will be four bytes long, not five.

Post by Shmuel (Seymour J.) Metz

Post by Ben Morrow
AFAIK the only platforms where "\n" ne "\012" are Mac OS Classic and
the EBCDIC platforms, both of which are obsolete as far as Perl is
concerned.

When there's ongoing maintenance then it's a rather lively corpse.

Perl is no longer maintained for Mac OS Classic or the EBCDIC platforms.
I did not mean to imply they were obsolete for other purposes.
Ben

Yes, these are fairly standard ANSI translations.

This phrase from the fopen API documentation sums it up:
"Carriage return-line feed (CR-LF) combinations are translated
into a single line feed character on input.
Line feed characters are translated into CR-LF combinations on output."

Opening in text mode, translated is the default, and these things happen:

- On reads: CRLF are converted to LF's
- On writes: LF is converted to CRLF's
- EOL character is the LF
- binmode(STDOUT,':raw') is not good for viewing because the console does real \r and \n

Finally, I don't believe there is a clear-cut solution to the OP's problem.

If one platform can append CRs and another LFs as eols, then it can't be
determined whether these are separate eols. Of course, another comes along and
adds the CRLF pair as eol.

Either way, opening a file in ':raw' mode and doing your own eol
translations would make this, by definition: if /\015?\012?/ ++$linecnt,
invalid.

I guess there is the C fmode and setmode to read, turn on/off translations.
Unless there is a convergence of platform meanings that don't step on each
other when files are appended in translated mode (if it is supported), opening
a file un-translated and doing your own eol translation would not seem to be
100% reliable.

-sln

Raw Data:
(18) = 54 d a 57 a a d d 58 65 64 66 d d 59 d a 5a
(18) = T..W....Xedf..Y..Z

Writing translated text file
--------------------

Reading translated text file
--------------------
tran (18) = 54 d a 57 a a d d 58 65 64 66 d d 59 d a 5a
(18) = T..W....Xedf..Y..Z
( 3) = T ( d a )
( 2) = W ( a )
( 1) = ( a )
(11) = ( d d ) Xedf ( d d ) Y ( d a )
( 1) = Z

Reading un-translated text file
--------------------
raw (22) = 54 d d a 57 d a d a d d 58 65 64 66 d d 59 d d a 5a
(22) = T ( d d a ) W ( d a d a d d ) Xedf ( d d ) Y ( d d a ) Z
(22) = T...W......Xedf..Y...Z
( 4) = T ( d d a )
( 3) = W ( d a )
( 2) = ( d a )
(12) = ( d d ) Xedf ( d d ) Y ( d d a )
( 1) = Z

=============================================

Writing RAW text file
--------------------

Reading translated text file
--------------------
tran (16) = 54 a 57 a a d d 58 65 64 66 d d 59 a 5a
(16) = T.W....Xedf..Y.Z
( 2) = T ( a )
( 2) = W ( a )
( 1) = ( a )
(10) = ( d d ) Xedf ( d d ) Y ( a )
( 1) = Z

Reading un-translated text file
--------------------
raw (18) = 54 d a 57 a a d d 58 65 64 66 d d 59 d a 5a
(18) = T ( d a ) W ( a a d d ) Xedf ( d d ) Y ( d a ) Z
(18) = T..W....Xedf..Y..Z
( 3) = T ( d a )
( 2) = W ( a )
( 1) = ( a )
(11) = ( d d ) Xedf ( d d ) Y ( d a )
( 1) = Z

Shmuel (Seymour J.) Metz

2009-08-17 14:56:42 UTC

Post by Ben Morrow
No it's not, at least not as far as Perl is concerned.

Perldoc says otherwise. So does the code running on my machine.

Post by Ben Morrow
If you have a file containing
fooCRLF
and you read a line with
open my $FOO, "<", "foo";
my $foo = <$FOO>;
then $foo will be four bytes long, not five.

Perhaps on your machine. On mine the CR and LF are both there.

Post by Ben Morrow
Perl is no longer maintained for Mac OS Classic or the EBCDIC platforms.

There are still people working on the EBCDIC code.

Ben Morrow

2009-08-18 20:48:35 UTC

Post by Shmuel (Seymour J.) Metz

Post by Ben Morrow
No it's not, at least not as far as Perl is concerned.

Perldoc says otherwise. So does the code running on my machine.

Where in perldoc?

Post by Shmuel (Seymour J.) Metz

Post by Ben Morrow
If you have a file containing
fooCRLF
and you read a line with
open my $FOO, "<", "foo";
my $foo = <$FOO>;
then $foo will be four bytes long, not five.

Perhaps on your machine. On mine the CR and LF are both there.

OK, now I'm really confused, since I didn't think perl worked like that
anywhere. What do you get from

perl -le"print for unpack "C*", qq{\n}"

What do you get if you open a file in the normal way and print
"a\012b\n" to it?

Post by Shmuel (Seymour J.) Metz

Post by Ben Morrow
Perl is no longer maintained for Mac OS Classic or the EBCDIC platforms.

There are still people working on the EBCDIC code.

Really? AFAIR last time it was brought up on p5p the conclusion was that
nobody was testing it or (apparently) using it, so while none of it
would be broken on purpose no particular effort would be made to keep it
working. I would be pleased to find out I'm wrong about that :).

Ben

Shmuel (Seymour J.) Metz

2009-08-19 13:29:09 UTC

Post by Ben Morrow
OK, now I'm really confused, since I didn't think perl worked like that
anywhere. What do you get from
perl -le"print for unpack "C*", qq{\n}"

syntax error at -e line 1, near "*,"
Execution of -e aborted due to compilation errors.

Presumably due to differences in shell processing between OS/2 and *ix[1].

I'm going to hack out a quick script to print "a\012b\n" and to convert
"\n" to hex.

Post by Ben Morrow
Really? AFAIR last time it was brought up on p5p the conclusion was that
nobody was testing it or (apparently) using it, so while none of it
would be broken on purpose no particular effort would be made to keep it
working. I would be pleased to find out I'm wrong about that :).

What I can say for sure is that while 5.8.7 is the last z/OS build
available from the IBM tools site, there have been comments about EBCDIC
bug fixes in later versions of Perl.

[1] Yes, I know that there are multiple shells in, e.g., Linux, but
AFAIK they all treat quotes the same.

Ben Morrow

2009-08-19 15:55:54 UTC

Post by Shmuel (Seymour J.) Metz

Post by Ben Morrow
OK, now I'm really confused, since I didn't think perl worked like that
anywhere. What do you get from
perl -le"print for unpack "C*", qq{\n}"

syntax error at -e line 1, near "*,"
Execution of -e aborted due to compilation errors.
Presumably due to differences in shell processing between OS/2 and *ix[1].

More due to my inability to convert quoting styles on the fly. Sorry,
what I meant was of course

perl -le"print for unpack q{C*}, qq{\n}"

Ben

Shmuel (Seymour J.) Metz

2009-08-24 11:20:42 UTC

Post by Ben Morrow
More due to my inability to convert quoting styles on the fly. Sorry,
what I meant was of course
perl -le"print for unpack q{C*}, qq{\n}"

Now I'm really confused. That gave the result you were expecting (10
decimal), but way back when I tried using \n in a regex and had to change
it to \x0a to get it to work. It doesn't seem reasonable that the meaning
of \n would differ between string constants and literals, and it also
doesn't seem reasonable that either would have changed since 5.6, so what
is or was going on?


Ben Morrow

2009-08-30 10:14:20 UTC

Post by Shmuel (Seymour J.) Metz

Post by Ben Morrow
More due to my inability to convert quoting styles on the fly. Sorry,
what I meant was of course
perl -le"print for unpack q{C*}, qq{\n}"

Now I'm really confused. That gave the result you were expecting (10
decimal), but way back when I tried using \n in a regex and had to change
it to \x0a to get it to work. It doesn't seem reasonable that the meaning
of \n would differ between string constants and literals, and it also
doesn't seem reasonable that either would have changed since 5.6, so what
is or was going on?

I have no idea, unless there was some sort of shell substitution going
on that was causing perl not to see the \n at all. Presumably this was
on OS/2, not some EBCDIC platform (where the numerical values *are*
different)? Does

perl -le"print "match" if qq{\x0a} =~ /\n/"

work correctly now?

Ben

Shmuel (Seymour J.) Metz

2009-08-31 13:06:55 UTC

Post by Ben Morrow
I have no idea, unless there was some sort of shell substitution going on
that was causing perl not to see the \n at all.

The code in question was inside a Perl file, not in the command line.

Post by Ben Morrow
Presumably this was on OS/2, not some EBCDIC platform (where the
numerical values *are* different)?

Yes; in fact, at the moment the only systems I have access to are Linux
and OS/2.

Post by Ben Morrow
perl -le"print "match" if qq{\x0a} =~ /\n/"

[H:\] perl -le"print "match" if qq{\x0a} =~ /\n/"

[H:\] perl -le"print 'match' if qq{\x0a} =~ /\n/"
match

[H:\]

Note: I don't have a 5.6 to test that against; I'm running 5.10. But I
find it hard to believe that it would have changed, so I don't understand
why the original match against embedded \n didn't work.
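
Since the original code lived in a Perl file, the same check can be sketched as a small script, which sidesteps shell quoting entirely. This is only an illustration, assuming an ASCII-based platform; `matches_newline` is a hypothetical helper, not anything from the original program:

```perl
use strict;
use warnings;

# Hypothetical helper: does a regex \n match somewhere in the given string?
sub matches_newline {
    my ($str) = @_;
    return $str =~ /\n/ ? 1 : 0;
}

# Compare explicit hex escapes against the "\n" string escape.
for my $case ( [ 'LF \x0a' => "\x0a" ], [ 'CR \x0d' => "\x0d" ], [ 'qq{\n}' => "\n" ] ) {
    my ($label, $str) = @$case;
    printf "%-8s %s\n", $label, matches_newline($str) ? 'match' : 'no match';
}
```

On an ASCII-based perl the LF and qq{\n} rows should report a match and the CR row should not, which is consistent with the 5.10 result above.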


Shmuel (Seymour J.) Metz

2009-08-16 13:38:37 UTC

Post by Jürgen Exner
No. The end-of-line markers are "\010", "\013\010", and "\013".

ITYM "\012", "\015\012", and "\015"; 10 and 13 are the decimal values for
LF and CR, but the leading 0 is Perl notation for octal. IAC, those are
not the only possible values, although AFAIK they are the only values for
ASCII-based platforms.
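
The octal/decimal distinction can be checked directly. A small sketch (the `code_point` helper is just an illustrative name) that prints the decimal value behind each backslash-digit escape discussed here:

```perl
use strict;
use warnings;

# Return the decimal character code of a one-character string.
sub code_point {
    my ($char) = @_;
    return ord $char;
}

# Left: the escape as written in source; right: the character it produces.
my @escapes = (
    [ '\010' => "\010" ],   # octal 010
    [ '\012' => "\012" ],   # octal 012
    [ '\013' => "\013" ],   # octal 013
    [ '\015' => "\015" ],   # octal 015
);

for my $e (@escapes) {
    my ($written, $char) = @$e;
    printf "%s is decimal %d\n", $written, code_point($char);
}
```

This reports 8, 10, 11 and 13 respectively, so "\010" is backspace and "\013" is vertical tab, while LF and CR are "\012" and "\015".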


Shmuel (Seymour J.) Metz

2009-08-13 22:08:58 UTC

Post by kj
There are three major conventions for the end-of-line marker: "\n",
"\r\n", and "\r".

No; \n is the end-of-line indicator and may[1] be one of CR, CR/LF or LF.
The escape \n is *not* a synonym for LF and the end-of-line marker is
never \r\n.

[1] Or may not.
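
For Perl specifically, here is a rough sketch of where the translation happens, assuming an ASCII-based platform: in memory "\n" is a single character, and CR/LF pairs in a file are converted to it only when the handle has the `:crlf` I/O layer. The temporary file and the `read_lines` helper are illustrative and not part of anything posted in this thread:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a file through the given PerlIO layer and return its lines.
sub read_lines {
    my ($path, $layer) = @_;
    open my $in, "<$layer", $path or die "open $path: $!";
    my @lines = <$in>;
    close $in;
    return @lines;
}

# Write two CR/LF-terminated lines as raw bytes, with no translation.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "foo\015\012bar\015\012";
close $out or die "close: $!";

my @raw  = read_lines($path, ':raw');    # CR stays in front of each LF
my @crlf = read_lines($path, ':crlf');   # CR/LF pairs arrive as "\n"

printf "raw:  %d lines, first is %d chars\n", scalar @raw,  length $raw[0];
printf "crlf: %d lines, first is %d chars\n", scalar @crlf, length $crlf[0];
```

Both reads see two lines, but only the `:crlf` read delivers lines that end in a bare "\n", which is the form chomp and the $ anchor expect.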

