Discussion:
utf8 and HTML Entities
Nick Gerber
2007-09-19 12:59:02 UTC
Hi

I'm lost :-(

I have a string encoded in UTF-8 that contains a mix of HTML entities and
literal UTF-8 characters.

How do I translate the HTML Entities into proper utf-8?

Thanks
Ben Bullock
2007-09-19 13:58:40 UTC
Post by Nick Gerber
I have a string encoded in utf8 with part HTML Entities and part
characters in utf-8.
How do I translate the HTML Entities into proper utf-8?
Since this must be a commonly encountered problem, my first guess would be
to try CPAN rather than write it myself. I quickly found:

http://search.cpan.org/~gaas/HTML-Parser-3.56/lib/HTML/Entities.pm

Please note that I can't vouch for this software since I have not tried it.

As far as UTF-8 goes, you need the "Encode" module to convert between
bytes and Perl character strings.
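
A minimal sketch of how the two modules might be combined, assuming the input arrives as raw UTF-8 bytes: decode the bytes to a Perl character string first, expand the entities, then encode back to UTF-8. The helper name entities_to_utf8 is hypothetical and not part of either module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# Hypothetical helper: takes raw UTF-8 bytes that also contain HTML
# entities and returns UTF-8 bytes with the entities expanded.
sub entities_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # bytes -> Perl character string
    $chars = decode_entities($chars);      # scalar context returns a decoded copy
    return encode('UTF-8', $chars);        # characters -> UTF-8 bytes
}

# "\xC3\xA4" is the UTF-8 byte sequence for a-umlaut; &ouml; is an entity.
my $input = "M\xC3\xA4dchen h&ouml;ren";
print entities_to_utf8($input), "\n";
```

Decoding the bytes first matters: if decode_entities is handed undecoded UTF-8 bytes, Perl treats each byte as a separate character, and the literal UTF-8 parts of the string can come out garbled once they are mixed with the expanded entities.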
Nick Gerber
2007-09-20 14:49:29 UTC
I tried HTML/Entities.pm, but it didn't do the trick for me. It was probably
my own fault that I couldn't get it to do the conversion. I'll try again.

Thanks
Helmut Wollmersdorfer
2007-09-21 05:27:16 UTC
Post by Nick Gerber
I tried HTML/Entities.pm, but it didn't do the trick for me. But, it was
me that could not make it to do the conversion for me. I'll try again.
This is the approach I use, and it has worked for millions of HTML (or XML) files:

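use HTML::Entities;

my $ENCODING = 'utf8'; # or iso-8859-7, CP1250 etc.

# $DIR and $file are set elsewhere
open (my $html, "<:encoding($ENCODING)", "$DIR/$file")
    or die "Can't open $DIR/$file: $!";

# Slurp the whole file rather than only its first line
my $data = do { local $/; <$html> };
close $html;

my $content = decode_entities($data);

binmode(STDOUT, ":utf8");

print "$content\n";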

It is also safe (in most cases) to use

my $content = decode_entities(decode_entities($data));

which also decodes doubly encoded input such as

&amp;amp;
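
To illustrate the "in most cases" caveat, here is a small sketch comparing one and two passes. The sample strings are made up for illustration. The second one shows the case where a double pass is probably not what you want, because text that was meant to stay escaped turns into markup.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

binmode(STDOUT, ':utf8');

# An entity that was escaped a second time somewhere upstream.
my $double = '&amp;eacute;';
my $once   = decode_entities($double);   # one pass leaves '&eacute;'
my $twice  = decode_entities($once);     # second pass yields e-acute

# Text that was meant to display the literal characters "&lt;b&gt;".
my $literal       = '&amp;lt;b&amp;gt;';
my $literal_once  = decode_entities($literal);       # '&lt;b&gt;' (as intended)
my $literal_twice = decode_entities($literal_once);  # '<b>' (no longer escaped)

print "$once\n$twice\n$literal_once\n$literal_twice\n";
```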



| $ perl -version
| This is perl, v5.8.8 built for i486-linux-gnu-thread-multi

Helmut Wollmersdorfer
s***@netherlands.co
2007-09-21 01:31:44 UTC
Post by Nick Gerber
Hi
I'm lost :-(
I have a string encoded in utf8 with part HTML Entities and part
characters in utf-8.
How do I translate the HTML Entities into proper utf-8?
Thanks
Should be enough here to get you going:



sub convertEntities
{
    my ($self, $str_ref, $opts) = @_;
    my $alt_str = '';
    my $res = 0;
    my ($entchr);

    # Usage info:
    # Option bitmask: 1=char reference, 2=general reference, 4=parameter reference
    # Default option is char and general references (&)
    # Ignore Parameter references (%) in Attvalue and Content
    # Process PE's in DTD and Entity decls

    $opts = 3 unless defined $opts;

    while ($$str_ref =~ /$self->{'RxEntConv'}/gc)
    {
        # Unicode character reference
        if (defined $4) {
            # decimal
            if (($opts & 1) && defined ($entchr = getEntityUchar($self, $4))) {
                $alt_str .= "$1$entchr";
                $res = 1;
            } else {
                $alt_str .= "$1$2#$4;";
            }
        } elsif (defined $5) {
            # hex
            if (($opts & 1) && length($5) < 9 && defined ($entchr = getEntityUchar($self, hex($5)))) {
                $alt_str .= "$1$entchr";
                $res = 1;
            } else {
                $alt_str .= "$1$2#$5;";
            }
        }
        else {
            # General reference
            if ($2 eq '&') {
                if (($opts & 2) && exists $self->{'general_ent_subst'}->{$3}) {
                    $alt_str .= $1;

                    # expand general references,
                    # bypass if seen in the recursion ring
                    # ----
                    if (defined $self->{'ring_ent_subst'}->{$3}) {
                        # $1 was already appended above, so do not repeat it
                        $alt_str .= "$2$3;";
                    } else {
                        # recurse expansion
                        # ----
                        my ($entname, $alt_entval) = ($3, undef);
                        my $entval = $self->{'general_ent_subst'}->{$entname};
                        $self->{'ring_ent_subst'}->{$entname} = 1;

                        if (defined ($alt_entval = convertEntities ($self, \$entval, 2))) {
                            $alt_str .= $$alt_entval;
                        } else {
                            $alt_str .= $self->{'general_ent_subst'}->{$entname};
                        }
                        $self->{'ring_ent_subst'}->{$entname} = undef;
                        $res = 1;
                    }
                } else {
                    $alt_str .= "$1$2$3;";
                }
            } else {
                # Parameter reference
                if (($opts & 4) && exists $self->{'parameter_ent_subst'}->{$3}) {
                    $alt_str .= "$1$self->{'parameter_ent_subst'}->{$3}";
                    $res = 1;
                } else {
                    $alt_str .= "$1$2$3;";
                }
            }
        }
    }
    if ($res) {
        $alt_str .= substr $$str_ref, pos($$str_ref);
        return \$alt_str;
    }
    return undef;
}

sub getEntityUchar
{
    my ($self, $code) = @_;
    if (($code >= 0x01 && $code <= 0xD7FF) ||
        ($code >= 0xE000 && $code <= 0xFFFD) ||
        ($code >= 0x10000 && $code <= 0x10FFFF)) {
        return chr($code);
    }
    return undef;
}

sub addEntity
{
    my ($self, $peflag, $entname, $entval) = @_;

    # Non-normalized, internal entities only
    # (no external defs yet, ie:SYSTEM/PUBLIC/NDATA)
    return undef unless
        ($entval =~ s/^\s*'([^']*?)'\s*$/$1/s || $entval =~ s/^\s*"([^"]*?)"\s*$/$1/s);

    # Replacement text: convert parameter and character references only
    my ($alt_entval);
    if (defined ($alt_entval = convertEntities ($self, \$entval, 5))) {
        $entval = $$alt_entval;
    }
    my $enttype = 'general_ent_subst';
    $enttype = 'parameter_ent_subst' if ($peflag);

    # Look up by the value of $enttype (single quotes would not interpolate)
    if (exists $self->{$enttype}->{$entname}) {
        # warn, pre-existing ent name
        return undef;
    }
    $self->{$enttype}->{$entname} = $entval;
    $self->{'Entities'} .= "|(?:$entname)";
    # recompile regexp
    $self->{'RxEntConv'} = qr/(.*?)(&|%)($self->{'Entities'});/s;
    return \$entval;
}



@UC_Nstart = (
    "\\x{C0}-\\x{D6}",
    "\\x{D8}-\\x{F6}",
    "\\x{F8}-\\x{2FF}",
    "\\x{370}-\\x{37D}",
    "\\x{37F}-\\x{1FFF}",
    "\\x{200C}-\\x{200D}",
    "\\x{2070}-\\x{218F}",
    "\\x{2C00}-\\x{2FEF}",
    "\\x{3001}-\\x{D7FF}",
    "\\x{F900}-\\x{FDCF}",
    "\\x{FDF0}-\\x{FFFD}",
    "\\x{10000}-\\x{EFFFF}",
);
@UC_Nchar = (
    "\\x{B7}",
    "\\x{0300}-\\x{036F}",
    "\\x{203F}-\\x{2040}",
);
$Nstrt = "[A-Za-z_:".join ('',@UC_Nstart)."]";
$Nchar = "[-\\w:\\.".join ('',@UC_Nchar).join ('',@UC_Nstart)."]";
$Name = "(?:$Nstrt$Nchar*?)";

$RxENTITY = qr/^\s+(?:($Name)|(?:%\s+($Name)))\s+(.*?)$/s;
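
Since the code above depends on object state that isn't shown (RxEntConv, Entities), here is a much smaller stand-alone sketch of just the character-reference part. It is an illustration only, not the code above: the names is_valid_code_point and decode_numeric_refs are hypothetical, and it handles only numeric references (&#NNN; and &#xHHH;), reusing the code point ranges from getEntityUchar.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same code point ranges as getEntityUchar: rejects 0, the surrogate
# range and 0xFFFE/0xFFFF.
sub is_valid_code_point {
    my ($code) = @_;
    return (   ($code >= 0x01    && $code <= 0xD7FF)
            || ($code >= 0xE000  && $code <= 0xFFFD)
            || ($code >= 0x10000 && $code <= 0x10FFFF));
}

# Expand numeric character references; leave anything invalid or
# non-numeric (such as &amp;) exactly as it was.
sub decode_numeric_refs {
    my ($str) = @_;
    $str =~ s{(&#(?:[xX]([0-9A-Fa-f]{1,8})|([0-9]{1,8}));)}{
        my $code = defined $2 ? hex($2) : $3;
        is_valid_code_point($code) ? chr($code) : $1;
    }ge;
    return $str;
}

binmode(STDOUT, ':utf8');
print decode_numeric_refs('&#225; &#xE9; &#xD800; &amp;'), "\n";
```

Named entities would still need a lookup table, which is what HTML::Entities already provides.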
Mumia W.
2007-09-21 06:36:05 UTC
Post by Nick Gerber
Hi
I'm lost :-(
I have a string encoded in utf8 with part HTML Entities and part
characters in utf-8.
How do I translate the HTML Entities into proper utf-8?
Thanks
[ long program snipped ]
No, that's too much.

Mr. Gerber didn't post any code or data, and so he didn't get many
responses because no one knew exactly what he was talking about.

As Mr. Bullock said, HTML::Entities should do it. Here is an example:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;

binmode(STDOUT, ':utf8');
local $/;
my $data = <DATA>;

$data = decode_entities($data);

print $data, "\n";

__DATA__
&#x8184; &#x8185; &#x8186;
&aacute; &eacute; &iacute; &oacute; &uacute;
&auml; &euml; &iuml; &ouml; &uuml;
Nick Gerber
2007-09-25 10:14:14 UTC
Thanks all.

Nick