Bug #1538
closed
Entity/Character Reference Conversion Problems
0%
Description
In Morpho, we have run into some problems with the use of special, non-ASCII
characters (like the 'degree' symbol or Greek 'mu'). [Any character represented
by a byte with a decimal value > 127 is in this class of special characters.]
These characters have been copied from Word or PDF documents into Morpho fields
and then put into EML XML docs. Unfortunately, they are not necessarily in
the correct format for XML documents and have caused parser problems.
The solution that was implemented in Morpho was to use entity/character
references. Any character with a value greater than 127 is written as '&#xxx;'
where 'xxx' is the decimal value of the character. On a Windows machine, the
'deg' symbol (°) becomes '&#176;', and 'mu' (μ) is written the same way. XML parsers
automatically convert these character references to the character for display, but
the conversion depends on the assumed character set.
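
To make the conversion concrete, here is a minimal sketch in Python (not Morpho's actual code, which is not shown in this report). The function name to_char_refs is hypothetical; it replaces every character whose value is greater than 127 with a decimal character reference.

```python
def to_char_refs(text: str) -> str:
    """Replace every character above code point 127 with a decimal reference."""
    return "".join(
        ch if ord(ch) <= 127 else f"&#{ord(ch)};"
        for ch in text
    )


# The degree sign (U+00B0) has the decimal value 176.
print(to_char_refs("25 \u00b0C"))  # prints: 25 &#176;C
```

Python's built-in error handler gives the same result as bytes: "25 \u00b0C".encode("ascii", "xmlcharrefreplace").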
The Metacat problem is that when one submits a document containing such
character references (&#xxx;) and then reads the document back, one does not get
the character reference, but rather the character itself! I assume this is due
to the XML parser. This is a violation of the idea that Metacat should return
exactly the same data it was given.
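
The parser behavior described here can be reproduced with any conforming XML parser. The sketch below uses Python's standard xml.etree.ElementTree rather than the parser Metacat uses, and the element name temp and the helper round_trip are made up for illustration. It shows that parsing resolves the reference to the character, and that the reference only reappears if the serializer is told to write ASCII-only output.

```python
import xml.etree.ElementTree as ET


def round_trip(xml_text: str) -> str:
    """Parse a document and serialize it again as ASCII-only XML."""
    root = ET.fromstring(xml_text)
    # With the default encoding ("us-ascii"), ElementTree writes every
    # non-ASCII character as a decimal character reference.
    return ET.tostring(root).decode("ascii")


submitted = "<temp>25 &#176;C</temp>"

# After parsing, the reference is gone and only the character remains.
parsed_text = ET.fromstring(submitted).text

# Serializing as ASCII restores the reference form.
returned = round_trip(submitted)
```

Here parsed_text holds the degree character itself, while returned matches the submitted string again. This suggests that the place to restore the references is the serialization step, not the parser.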
Morpho already handles this by converting any characters with values greater
than 127 in the data sent by Metacat back to character references. But Metacat actually
sends back the wrong character for some symbols! (E.g., a 'mu' becomes a '1/4'
symbol.) I assume this is due to different character set assumptions under Linux
and Windows. In any case, there is some data corruption here that we should
figure out how to avoid.
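
The actual cause inside Metacat is not established in this report. As an illustration only, the Python sketch below shows two common encoding mistakes that would turn a 'mu' into a '1/4' symbol, assuming the 'mu' in question is the Greek small letter mu (U+03BC, decimal 956). The variable names are made up for this example.

```python
MU = "\u03bc"  # GREEK SMALL LETTER MU, decimal value 956 (assumed)

# Possibility 1: only the low 8 bits of the character value are kept,
# as happens when a 16-bit character is written out as a single byte.
low_byte = ord(MU) & 0xFF  # 0x3BC & 0xFF == 0xBC == 188
truncated = chr(low_byte)  # U+00BC, the 'one quarter' symbol

# Possibility 2: the text is written as UTF-8 (bytes 0xCE 0xBC) but read
# back as ISO-8859-1, so each byte becomes its own character.
misdecoded = MU.encode("utf-8").decode("iso-8859-1")  # U+00CE, U+00BC
```

In both cases the byte 0xBC ends up being shown as the 'one quarter' symbol (U+00BC), which matches the symptom. Reading and writing with one explicitly declared encoding on both platforms, rather than each platform's default, would rule out the second possibility.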