xml, character encoding, asp question
I've been doing a lot of work both creating and consuming web services, and I notice there seems to be a discontinuity between a number of the different cogs in the wheel, centering around windows-1252 and the fact that it is not equivalent to iso-8859-1.

Looking in the registry under HKEY_CLASSES_ROOT\MIME\Database\Charset and \Codepage, it seems that all variations on iso-8859-1 (latin1, etc) are mapped to code page 1252, which I'm assuming is windows-1252 in execution terms. So if I set the codepage=1252 and Response.Charset=iso-8859-1 in ASP, it seems that I'm *really* going to get out windows-1252, not iso-8859-1. This becomes somewhat noticeable in html since a lot of commonly used elements (like the free-floating bullet •), which *aren't* really 8859-1, get interpreted as such in browsers.

I occasionally run into problems, however, because MSXML doesn't appear to be using the mime database to determine how to process the encoding declaration (or at least it's got some different mapping hidden somewhere). MSXML appears to treat the range 128-159 the way the ansi standard defines them - undefined control sequences. As such, when you're processing xml (either xml to xml or xml to html via xsl), if you get what is *intended* to be a bullet (149) or curly quotes or any of those other extensions that are really windows-1252 in your xml, msxml won't make the association and translate the characters properly going between character sets. And unfortunately a lot of web services don't accept or generate "windows-1252" as an encoding declaration.

So...
1) Am I correct in assuming that MSXML is using different encoding routines than IIS/ASP?
2) Is there a @Codepage I can specify that will produce real latin 1 in asp?
3) Will ASP.Net be more standards compliant? and/or does ASP.Net use the mime database under the covers too?
4) Just as an aside, does anybody have a clue why output via xsl with encoding utf-8 doesn't display properly in IE?

Thanks
-Mark

Hello Mark,
MSXML has two methods to load XML: the LoadXML method and the Load method. The LoadXML method always takes a Unicode BSTR that is encoded in UCS-2 or UTF-16 only. If you pass in anything other than a valid Unicode BSTR to LoadXML, it will fail to load.

The Load method implements the following algorithm for determining the character encoding or character set of the XML document:

1. If the Content-Type HTTP header defines a character set, this character set overrides anything in the XML document itself. This obviously doesn't apply to SAFEARRAY and IStream mechanisms because there is no HTTP header.
2. If there is a 2-byte Unicode byte-order mark, it assumes the encoding is UTF-16. It can handle both big endian and little endian.
3. If there is a 4-byte Unicode byte order mark (0xFF 0xFE 0xFF 0xFE), it assumes the encoding is UTF-32. It can handle both big endian and little endian.
4. Otherwise, it assumes the encoding is UTF-8 unless it finds an XML declaration with an encoding attribute that specifies some other character set (such as ISO-8859-1, Windows-1252, Shift-JIS, and so on).

"Windows-1252" should be the right thing to use to produce latin 1. ASP.NET also has a codepage property, similar to ASP; however, characters will be Unicode in its code-behind.

Luke

Hi Luke...
Thanks for responding, but the response is a little too narrow to address any of the questions I asked. We're using the Load() method to load the response from web services, so the detection of the encoding is not the issue. The issue is that the mappings between character sets that MSXML uses don't appear to be the same as those used by other apis available to ASP (like Server.HTMLEncode() and Server.UrlEncode()) and other C++ apis (like WideCharToMultiByte() and MultiByteToWideChar()).

Near as I can tell, everything other than MSXML that does encoding conversion seems to be working from the HKEY_CLASSES_ROOT\MIME\Database\Charset & CodePage system. Also near as I can tell, that system doesn't differentiate between windows-1252 and iso-8859-1, even though they are *not* equivalent (1252 is a superset of 8859-1). I probably wouldn't be running into as many annoying inconsistencies if MSXML was standards-noncompliant in the same way, but MSXML *does* recognize the difference between windows-1252 and iso-8859-1 and does process/output things differently. And since many of the web services we consume come from other vendors, we don't have the option of just telling them to use "windows-1252" instead of "iso-8859-1" in their xml encoding headers.

First, I'm looking for ways to get MSXML and ASP to work together consistently, if possible. If not, at least try to define what to avoid. It's also of parenthetical interest whether ASP.Net has fixed any of these inconsistencies; I haven't done trial cases myself to test it yet.

Take the small bullet as a good example. Putting &#149; in your html gets you a small bullet in IE, though this is only a legitimate interpretation if your encoding is windows-1252 - not iso-8859-1 or any other non-windows-12* encoding. 149 is a legal character in unicode, just not the bullet character. In unicode the bullet character is 8226.
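[Editorial illustration] The 149 versus 8226 distinction can be checked outside of ASP and MSXML. The sketch below is Python, not anything from the thread, and it assumes Python's built-in "cp1252" and "latin-1" codecs are reasonable stand-ins for windows-1252 and iso-8859-1; it says nothing about how MSXML itself is implemented.

```python
# One byte, value 149 (0x95), as it might appear in a web service response.
raw = bytes([149])

# Decoded as windows-1252, the byte is the bullet character.
as_cp1252 = raw.decode("cp1252")

# Decoded as iso-8859-1, the byte is a C1 control character with the same number.
as_latin1 = raw.decode("latin-1")

print(ord(as_cp1252))                   # 8226
print(ord(as_latin1))                   # 149

# Re-encoding each result as UTF-8 shows what would reach the browser.
print(as_cp1252.encode("utf-8").hex())  # e280a2
print(as_latin1.encode("utf-8").hex())  # c295
```

The same input byte therefore yields two different UTF-8 sequences depending only on which label the decoder trusts.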
If I have a literal 149 character in an xml document with a declared encoding of windows-1252, MSXML will interpret that up to 8226 as part of the character set mapping when the xml is parsed; how it gets represented when I spiel it out via xsl or Response.Write depends on the output encoding I use. If that same xml document, however, has a declared encoding of iso-8859-1, MSXML doesn't map the 149 to anything at all - it doesn't recognize that it has any particular meaning. So if my xsl stylesheet applied to that dom outputs utf8, what comes out is a two byte representation of 149 - c2 95. IE doesn't recognize those characters as meaning anything in particular and what it displays is garbage. Hence the reason for my posting.

Ironically, there are some web services out there which have the same misunderstanding of the difference between windows-1252 and iso-8859-1 that you do. They generate xml with an encoding of "iso-8859-1" when they are including 1252 characters between 128-159. It's frustrating that while MSXML is more standards compliant in recognizing the difference, that standards compliance causes garbage to come out the back end of the meat grinder.

Thanks
Mark

Hi Mark,
I think we can specify the encoding in xsl, for example:

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html" encoding="iso-8859-1" />
    <xsl:template match="Books">

I tested the above code in IE and it displays char 149 correctly.

Luke

Hi Luke...
Again, thanks for responding. We're getting closer to an understanding of the problem, but not yet any resolution.

Yes, you can change the output encoding designation in xsl, and yes you can use "iso-8859-1" and it will output a literal 149 and yes IE will display it - usually. But this only delivers us to the doorstep of understanding the inconsistencies that make this difficult to work with in ASP.

If you want to have any good support for internationalization on your website, you really can't use windows-1252 OR iso-8859-1 (same thing as far as ASP goes) as your ASP page's code page, because the output encoding from IIS (or the encoding IE receives, depending on how you look at it) will influence how IE encodes form elements for resubmission. The big problem is that an IE page with 1252 encoding lets you copy/paste, say, Chinese into the form element and it looks good in the form element, but IE does a terrible job encoding those inputs on a url. It uses a non-standard encoding format to construct the url and the tools in ASP for interpreting it are marginal.

To get really *good* support for url encoding from IE (or other browsers), you have to set your page encoding to utf-8. If you do that, IE will use utf-8 to stream international user input in the url encoding, and it does it in a standard way.

But if you use utf-8 encoding and you're working with xml in your asp page, then the *real* difference between windows-1252 and iso-8859-1 *does* become a problem. Because, as I've been saying, MSXML is standards-compliant and does recognize the difference while the rest of ASP is *not* standards compliant in how it handles the two.

So these inconsistencies really put a web developer in a bind. Which feature do you want to drop - internationalization? Use of web services? Use of xml?
Or do you just have to bend over backward as a developer trying to develop all of your own tools to work around the fact that the MS tools for this are inconsistent? Seems like the last one to me, but I thought I would ask to see if these sorts of things were on the MS radar screen.

Thanks
-mark

Hi Mark,
I understand your complaint on this issue. It is really tough to take care of all this stuff. The best thing I can suggest is to migrate to ASP.NET. It has better support for internationalization and web services. You can handle the web service with the XML classes in .NET, convert the result to utf-8 and send it to the client side.

Luke

I can't help you much here Mark, but I can sympathise. We're going to be
hitting this problem ourselves soon so I'm especially interested in this thread. I know all too well that 'Windows Latin-1' (code page 1252) is *not* the same as the ISO latin-1 set (iso 8859/1). There are some subtle differences where MS have tried to make better use of some of the lesser-used parts of the ISO set.

Tony Proctor

Re: question (2)

Mark, I've found a reference to a code page that I didn't
know existed: 28591. This is supposed to be exactly equivalent to ISO 8859/1. If this works (I haven't tried it) it still won't solve all problems though. The Euro symbol, for instance, is a very important character in Windows Latin-1, but it isn't present in ISO Latin-1. I believe ISO cope with it using a newer ISO 8859/15 (Latin-9). The code page equivalent for this, apparently, is 28605.

Tony Proctor
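[Editorial illustration] The Euro point can be checked with a small sketch. It is written in Python rather than ASP, and it assumes Python's "cp1252", "latin-1" and "iso8859-15" codecs correspond to Windows Latin-1, ISO 8859/1 and ISO 8859/15; the helper name try_encode is made up for this example.

```python
EURO = "\u20ac"  # the Euro sign as a Unicode character

def try_encode(text, codec):
    """Return the encoded bytes as hex, or None if the codec has no such character."""
    try:
        return text.encode(codec).hex()
    except UnicodeEncodeError:
        return None

# Encode the Euro sign under each of the three character sets discussed above.
results = {codec: try_encode(EURO, codec)
           for codec in ("cp1252", "iso8859-15", "latin-1")}

for codec, encoded in results.items():
    print(codec, encoded)
# cp1252 80
# iso8859-15 a4
# latin-1 None
```

So a page that is switched to a true ISO 8859/1 code page would lose the Euro sign, while ISO 8859/15 keeps it but at a different byte value than Windows Latin-1 uses.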
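[Editorial illustration] For documents that declare iso-8859-1 but really carry windows-1252 bytes in the 128-159 range, one possible workaround is to remap those characters after parsing and before output. The following is only a sketch under that assumption, written in Python; repair_c1_controls is a hypothetical helper, not an MSXML or ASP facility, and it would also rewrite any genuine C1 control characters, which may not be wanted for every feed.

```python
def repair_c1_controls(text: str) -> str:
    """Reinterpret characters 128-159 as the windows-1252 characters they likely meant."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                # Treat the code point as a windows-1252 byte and decode it.
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # A few byte values in this range are unassigned in Python's
                # cp1252 codec; those characters are left untouched.
                pass
        out.append(ch)
    return "".join(out)

# A byte string labelled iso-8859-1 that actually contains a 1252 bullet (149).
mislabelled = b"\x95 first item".decode("latin-1")
repaired = repair_c1_controls(mislabelled)

print(repaired.encode("utf-8").hex()[:6])  # e280a2
```

Doing the remapping on the decoded text, rather than on the raw bytes, means it can be applied regardless of which output encoding the stylesheet or page later selects.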