
[ale] OT: Illegal XML Characters



I think I have figured out what is happening. As Dan pointed out, it seems
that in the absence of an encoding declaration, the parser assumes UTF-8. That
was the case here. I checked the response header, and no encoding was
specified.

The first offending characters were the sequence 0xC0 0x80. I happen to know
that these characters corresponded to null characters in a SQL Server
database. Since the null character is an ASCII character, it should be sent
as 0x00. For some reason it was being mapped to 0xC0 0x80 in the process
that creates the XML.
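
To make the point about 0xC0 0x80 concrete, here is a minimal Python sketch (the helper names are made up for illustration). It shows that NUL encodes to the single byte 0x00, that applying the two-byte UTF-8 bit layout to C0 80 also yields code point 0 (an "overlong" form), and that a strict decoder rejects that form. One possible source, which is only a guess since nothing in this thread confirms it, is Java's "modified UTF-8", which writes NUL as C0 80. Separately, NUL is not a legal character in XML 1.0 even when sent as 0x00.

```python
# Sketch only: helper names are hypothetical, not from any tool in this thread.

def decode_two_byte_form(lead: int, cont: int) -> int:
    """Apply the 2-byte UTF-8 layout (110xxxxx 10xxxxxx) with no validity checks."""
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def is_strict_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# NUL (U+0000) is ASCII, so its only valid UTF-8 form is one byte, 0x00.
nul_utf8 = "\x00".encode("utf-8")

# C0 80 carries the same code point in two bytes: an overlong encoding.
overlong_value = decode_two_byte_form(0xC0, 0x80)

print(nul_utf8.hex(), overlong_value, is_strict_utf8(b"\xc0\x80"))
```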

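The second offending character was the single byte 0xB7. Per UTF-8, any
non-ASCII character has its bit values distributed across multiple bytes, with
the first byte indicating how many bytes follow. A lone 0xB7 has the bit
pattern of a continuation byte, so it cannot start a character, which matches
the "illegal start byte" error. The character 0xB7 (U+00B7 in Latin-1) should
map to the UTF-8 sequence 0xC2 0xB7.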

From what I understand, the company producing the data uses IIS + JRUN +
SQL Server. Apparently somewhere along the way the data isn't being
translated properly into UTF-8, or perhaps they are using another encoding
scheme and are not declaring it as they should, so the parser assumes UTF-8
and finds the XML invalid.
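
For showing the client exactly where the stream breaks, something like the following Python sketch could list every byte a strict UTF-8 decoder refuses. The function name and the sample payload are made up for illustration; the real feed was not posted here.

```python
# Sketch only: report the offsets of bytes that are not valid UTF-8.

def find_invalid_utf8(data: bytes) -> list:
    """Return (offset, hex) pairs for each span the strict decoder rejects."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as exc:
            start = pos + exc.start
            end = max(pos + exc.end, start + 1)  # always move forward
            problems.append((start, data[start:end].hex()))
            pos = end
    return problems


# Hypothetical payload containing the three bytes discussed in this thread.
sample = b"<a>\xc0\x80 x \xb7</a>"
report = find_invalid_utf8(sample)
print(report)
```

Running this over a saved copy of the response body would give byte offsets to hand back to the supplier.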

Mike

-----Original Message-----
From: David Corbin [mailto:dcorbin at imperitek.com]
To: ale at ale.org
Sent: Friday, March 22, 2002 6:47 AM
To: mgm at atsga.com; ale at ale.org
Subject: Re: [ale] OT: Illegal XML Characters


I am not an XML expert, but might it be the supplier is using UTF-8
without declaring it?  I'm reasonably sure that XML supports a way to
declare the character set in use.  If that is the case, then it would be
simple enough to add the declaration yourself.
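
As a rough illustration of what adding the declaration does, here is a Python sketch using the standard library's ElementTree. It assumes the data is really ISO-8859-1, which is only a guess, and the `<note>` document is invented. Note that a declaration would not help with a NUL character, which XML 1.0 forbids in any encoding.

```python
# Sketch only: the same bytes with and without an encoding declaration.
import xml.etree.ElementTree as ET


def parses(doc: bytes) -> bool:
    """True if ElementTree accepts the document as well-formed."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Hypothetical body containing a raw 0xB7 byte.
body = b"<note>a\xb7b</note>"

# With no declaration the parser assumes UTF-8 and rejects the lone 0xB7.
undeclared = body

# Declaring ISO-8859-1 tells the parser how to read that byte.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body

print(parses(undeclared), parses(declared))
```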

Of course, that doesn't change the fact that the XML is still
(technically) invalid.  To answer your question more directly, look to
see what the default character set is if none is declared.

David Corbin


Mike Millson wrote:

>I am in a situation where I am having to parse the xml that another company
>is passing to a client of mine. The data often contains illegal characters.
>The most usual culprits are hex C0, 80, and b7. Having these bytes in the
>xml stream causes my parser to die. I have run the xml through an
>independent xml validator on the web, and the validator  says the xml is
>bad. I have forgotten the error with C0 and 80. The message for b7 is
>"Error: Input error: Illegal UTF-8 start byte <0xb7>."
>
>I need to be able to clearly explain to the client that it is the other
>company passing invalid xml, not my parsing that is at issue, and they
>should validate their xml before sending it out the door. I'm going through
>the W3C documentation and haven't found anything to clearly explain (at least to
>me) why the above characters are not legal.
>
>Anyone have any ideas how to explain why C0, 80, and b7 are not valid xml
>characters?
>
>Thank you,
>Mike
>
>
>---
>This message has been sent through the ALE general discussion list.
>See http://www.ale.org/mailing-lists.shtml for more info. Problems should be
>sent to listmaster at ale dot org.
>
>



