LibXML UTF8 - Input is not proper UTF-8, indicate encoding !
Vlajko Knezic
2005-03-05 04:37:14 UTC
I'm not sure what is going on here, but it seems to have something to do with
the way UTF-8 is handled in Perl and/or LibXML.



The script below:

- accepts a value from a form text field;

- builds an XML document around it;

- serializes the document to a string using toString();

- parses the string back into an XML document using parse_string();

- transforms the XML document into an HTML document using an XSL
transformation.



Everything works well until a UTF-8 character is entered in the text field
(for example, é). In that case the call to parse_string() crashes with the
message:

=====================================================================
:2: parser error : Input is not proper UTF-8, indicate encoding !
<test><test_text>abcé</test_text></test>
^
:2: error: Bytes: 0xE9 0x3C 0x2F 0x74
<test><test_text>abcé</test_text></test>
^
 at C:/_work/vsurvey/site/test1.cgi line 24
=====================================================================



I know that the code below does not make much sense, but it is an
abstraction of much more complex code. The environment is Perl 5.8, Apache,
and Windows XP.



Hints and/or explanations of what was coded wrong and how it should be fixed
are very much appreciated.



Vlajko Knezic,

Toronto, Ontario



---------------------------------------------------------------------------------------------------------------------

test.cgi



#! c:/Perl/bin/Perl.exe

use CGI;
use XML::LibXML;
use XML::LibXSLT;
use CGI::Carp qw( fatalsToBrowser );
use Encode;

my $mDocument = XML::LibXML::Document->new();
my $parser = XML::LibXML->new();

$mDocument->setEncoding("UTF8");
my $mCGI = new CGI;
print $mCGI->header;
my $mTest_text = $mCGI->param('test');

my $mTest = $mDocument->createElement("test");
my $mTestText = $mDocument->createElement("test_text");
$mTestText->appendTextNode($mTest_text);
$mTest->appendChild($mTestText);
$mDocument->setDocumentElement( $mTest );
$mDocument->setEncoding("UTF8");
my $mTestXML = $mDocument->toString();
my $mParsedTestXML = $parser->parse_string($mTestXML);

my $mParsedXMLXSL = $parser->parse_file('test.xsl');
my $mParserXSL = XML::LibXSLT->new();
my $mParsedXSL = $mParserXSL->parse_stylesheet($mParsedXMLXSL);
my $mPageHTML = $mParsedXSL->transform($mParsedTestXML);
my $mPrintPageHTML = $mParsedXSL->output_string($mPageHTML);
print $mPrintPageHTML;



test.xsl



<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html" encoding="UTF-8" indent="yes" omit-xml-declaration="yes"/>
  <xsl:template match="//test">
    <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    </head>
    <html>
      <body>
        <xsl:value-of select="test_text"/>
        <form name="test" type="post" target="_self">
          <input type="text" name="test" /><input type="submit" name="button"/>
        </form>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
Brian McCauley
2005-03-05 14:24:15 UTC
Post by Vlajko Knezic
Not so sure what is going on here but is something to do with the way UTF8
is handled in Perl and/or LibXML
I've seen something similar - I fixed it by performing an explicit
utf8::upgrade() on every string before passing it to any of the LibXML
library methods.
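
For illustration, here is a minimal sketch of what utf8::upgrade() does to a
string. The value in $text is a made-up stand-in for the form input (it is not
taken from the original script), and whether this alone cures the LibXML error
in your setup is something you would have to test.

```perl
use strict;
use warnings;

# Hypothetical input: "abc" followed by e-acute stored as the single byte 0xE9
my $text = "abc\xE9";

# Before the upgrade Perl holds the string as plain bytes (no UTF8 flag)
my $was_flagged = utf8::is_utf8($text) ? 1 : 0;

# Convert the internal representation to UTF-8 in place;
# the characters themselves do not change
utf8::upgrade($text);

my $is_flagged = utf8::is_utf8($text) ? 1 : 0;

printf "length=%d flagged before=%d after=%d\n",
    length($text), $was_flagged, $is_flagged;
```

The string still has four characters afterwards; only Perl's internal storage
changes.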

If that doesn't help then it looks to me like this could be an issue
with CGI.pm and/or your web browser.
Post by Vlajko Knezic
Everything works well until UTF8 character is entered in the text field (for
example é) .
I suspect this is not what is happening. It appears that the browser is
sending the form submission data using some other encoding (e.g. Latin1)
and Perl's CGI.pm is assuming it's UTF8, thus generating an invalid utf8
string.
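
One way to test that theory is to decode the raw bytes explicitly with the
Encode module. The sketch below assumes the only two candidates are UTF-8 and
Latin1 (ISO-8859-1); $raw is a hypothetical stand-in for the value returned by
param(), and the fallback order is my assumption, not something CGI.pm does
for you.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical raw form value: "abc", the byte 0xE9, then " def",
# which is what a browser submitting in Latin1 would send
my $raw = "abc\xE9 def";

# Try strict UTF-8 first. FB_CROAK makes decode() die on malformed input
# and may modify its argument, so work on a copy.
my $copy  = $raw;
my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };

# Assumption: anything that is not valid UTF-8 is treated as Latin1
$chars = decode('ISO-8859-1', $raw) unless defined $chars;

# Re-encode as real UTF-8 bytes, e.g. for output or for a byte-oriented parser
my $utf8_bytes = encode('UTF-8', $chars);

printf "%d characters, %d UTF-8 bytes\n", length($chars), length($utf8_bytes);
```

Here the lone 0xE9 byte is not valid UTF-8, so the Latin1 fallback is taken and
the é ends up as the two bytes 0xC3 0xA9 in $utf8_bytes.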
Post by Vlajko Knezic
<input type="text" name="test" />
Since your text field does not specify an encoding, the browser is free
to choose any it likes, but the recommendation is that it should choose
the one used to encode the document containing the form. I notice your
Post by Vlajko Knezic
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8"/>
I would not expect Content-Type to be settable via <meta> but then again
I may be wrong.
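
The alternative to <meta> is to name the charset in the real HTTP header. As
far as I know CGI.pm's header() accepts a -charset argument for this and
otherwise defaults to ISO-8859-1 for text types, but check the documentation
for your version. The sketch below builds the header by hand so that it runs
without CGI.pm; content_type_header() is a hypothetical helper, not part of
any module.

```perl
use strict;
use warnings;

# Hypothetical helper: build a CGI response header that names the charset
sub content_type_header {
    my ($charset) = @_;
    # A header line ends in CRLF; an empty line ends the header block
    return "Content-Type: text/html; charset=$charset\r\n\r\n";
}

my $header = content_type_header('utf-8');
print $header;
```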

For more informed discussion about this I suggest you
go to a group where discussion of how browsers handle HTML forms is
on-topic.
Wes Groleau
2005-03-05 17:19:55 UTC
Post by Brian McCauley
Post by Vlajko Knezic
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8"/>
I would not expect Content-Type to be setable via <meta> but then again
I may be wrong.
That tag works on all my web pages, on numerous Mac and windoze browsers.
--
Wes Groleau

Truth often suffers more from the heat of its defenders
than from the arguments of its opposers.
-- William Penn