Bug #1538

closed

Entity/Character Reference Conversion Problems

Added by Dan Higgins almost 20 years ago. Updated over 19 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: metacat
Target version:
Start date: 04/28/2004
Due date:
% Done: 0%
Estimated time:
Bugzilla-Id: 1538

Description

In Morpho, we have run into problems with the use of special, non-ASCII
characters (like the 'degree' symbol or Greek 'mu'). [Any character represented
by a byte with a decimal value > 127 is in this class of special characters.]
These characters have been copied from Word or PDF documents into Morpho fields
and then put into EML XML docs. Unfortunately, they are not necessarily in
the correct format for XML documents and have caused parser problems.

The solution that was implemented in Morpho was to use entity/character
references. Any character with a value greater than 127 is written as '&#xxx;'
where 'xxx' is the decimal value of the character. On a Windows machine, the
'deg' symbol becomes '&#176;' and 'mu' becomes '&#956;'. XML parsers
automatically convert these character references to the character for display,
but the conversion depends on the assumed character set.
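
A minimal sketch of the conversion described above, written in Python purely
for illustration. This is not Morpho's actual code (Morpho is a Java
application), and the function name `to_char_refs` is hypothetical.

```python
def to_char_refs(text: str) -> str:
    """Write every character above code point 127 as a decimal '&#xxx;' reference.

    Hypothetical helper for illustration only; not taken from Morpho.
    """
    return "".join(ch if ord(ch) <= 127 else f"&#{ord(ch)};" for ch in text)


# The degree sign is U+00B0, which is 176 in decimal.
encoded = to_char_refs("25 \u00b0C")
print(encoded)
```

In Python the standard codec error handler gives the same result:
`text.encode("ascii", "xmlcharrefreplace").decode("ascii")`.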

The metacat problem is that when one submits a document containing such
character references (&#xxx;) and then reads the document back, one does not get
the character reference, but rather the character itself! I assume this is due
to the XML parser. This is a violation of the idea that metacat should return
exactly the same data it was given.
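
The round-trip behavior described above can be reproduced with any conforming
XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`
rather than the parser metacat uses, so it only illustrates the general effect;
the `<note>` element is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document fragment containing a decimal character reference.
submitted = "<note>25 &#176;C</note>"

elem = ET.fromstring(submitted)

# The parser has already resolved the reference: the text holds the real
# degree sign (U+00B0), and the original '&#176;' spelling is gone.
parsed_text = elem.text

# Serializing as text writes the character itself, not the reference.
as_unicode = ET.tostring(elem, encoding="unicode")

# Serializing to US-ASCII (the default for ET.tostring) forces every
# non-ASCII character back into a decimal character reference.
as_ascii = ET.tostring(elem)
```

Whether the reference survives therefore depends on the output encoding chosen
at serialization time, not on how the document was originally written.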

Morpho already handles this by converting back to character references any info
sent to it by Metacat with character values greater than 127. But metacat
actually sends back the wrong character for some symbols! (e.g. a 'mu' becomes a
'1/4' symbol). I assume this is due to different character set assumptions under
Linux and Windows. In any case, there is some data corruption here that we
should figure out how to avoid.
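
The report does not identify the cause of the wrong character, so the following
is only a sketch of two ways this kind of corruption can arise. It assumes the
'mu' was the Greek small letter mu (U+03BC, decimal 956); the helper names are
hypothetical.

```python
MU = "\u03bc"  # Greek small letter mu, decimal 956 (assumed, see above)


def keep_low_byte(ch: str) -> str:
    """Simulate squeezing a character into one byte by dropping the high bits."""
    return chr(ord(ch) & 0xFF)


def misdecode(text: str, written_as: str, read_as: str) -> str:
    """Simulate writing text in one character set and reading it in another."""
    return text.encode(written_as).decode(read_as)


# 956 is 0x3BC; keeping only the low byte leaves 0xBC (188), which is the
# 'one quarter' sign in Latin-1 and in Unicode.
truncated = keep_low_byte(MU)

# A degree sign written as UTF-8 (bytes C2 B0) but read as Latin-1 comes
# back as two characters instead of one.
mojibake = misdecode("\u00b0", "utf-8", "latin-1")
```

Either effect goes away when both sides use the same explicitly stated
encoding instead of the platform default.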
