Bug #2564: escaped "less than" in inlinedata causes invalid eml output - Metacat - Ecoinformatics Redmine


Bug #2564

closed

escaped "less than" in inlinedata causes invalid eml output

Added by Matthew Perry over 18 years ago. Updated over 16 years ago.

Status:

Resolved

Priority:

Immediate

Assignee:

Michael Daigle

Category:

metacat

Target version:

1.9

Start date:

10/12/2006

Due date:

% Done:

Estimated time:

Bugzilla-Id:

2564

Description

From: inigo san gil <isangil@lternet.edu>
My valid EML file (http://www.cedarcreek.umn.edu/data/emlFiles/pl00e001.xml) has the following line (content):

|Field|Plot|Ntrt|NitrAdd|Date|Taxon|Species |Biomass |Prop &lt;== Labels

However, once harvested, the metacat link to the document
http://metacat.lternet.edu/knb/metacat?action=read&qformat=xml&docid=knb-lter-cdr.7901001.2
has that same line slightly changed to:

|Field|Plot|Ntrt|NitrAdd|Date|Taxon|Species |Biomass |Prop <== Labels

I noticed that the evil < sign appeared in the
inlinedata element. Inline content is handled differently than the
rest of the document - it is stored on the file system (the
metacat_inline_data folder) rather than in the relational db.

It is interesting to note that a &lt; sign anywhere else in the eml
document will be handled correctly (well... it will be displayed as
< at least ... see bug 2517). Only in the inlinedata section will this
cause the eml output from metacat to be invalid.

This is most likely related to the DocumentImpl.toXml() function, specifically around line 1158 of DocumentImpl.java

Reader reader = Eml200SAXHandler.readInlineDataFromFileSystem(fileName);
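
The failure mode described above can be reproduced outside Metacat. The following is a minimal sketch that assumes only the JDK's built-in SAX parser; the class name `InlineEscapeDemo`, its helper methods, and the bare `<inline>` wrapper are illustrative and are not taken from Metacat. It shows that the parser hands back the unescaped character, and that writing that text out verbatim no longer parses.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class InlineEscapeDemo {

    /** Returns the character data of the document as a SAX handler sees it (entities resolved). */
    static String parseText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalArgumentException("XML could not be parsed", e);
        }
        return text.toString();
    }

    /** True if the string parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            parseText(xml);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String original = "<inline>|Prop &lt;== Labels</inline>";
        // What would end up in the inline data file if the parsed text is stored as-is.
        String stored = parseText(original);
        // What is emitted if the stored text is copied back without re-escaping.
        String emitted = "<inline>" + stored + "</inline>";
        System.out.println(stored);
        System.out.println("original well-formed: " + isWellFormed(original));
        System.out.println("emitted well-formed:  " + isWellFormed(emitted));
    }
}
```

Whether Metacat's own read path behaves exactly like this is not confirmed here; the sketch only illustrates why an unescaped < in the stored inline data would make the output invalid.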


Updated by Jing Tao over 16 years ago

I inserted an eml document into metacat, and here is the content before
xerces parses it:

<eml:eml
xmlns:eml="eml://ecoinformatics.org/eml-2.0.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" packageId="eml.1.1"
system="knb" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.0.1
eml.xsd" scope="system"><dataset scope="document">
<title>Checking &#62; &#60; &#34; &#39; &#38;</title><creator scope="document">
<individualName>
<surName>Smith</surName>
</individualName>
</creator>
<contact scope="document">
<individualName>
<surName>Jackson</surName>
</individualName>
</contact>
<access authSystem="ldap://ldap.ecoinformatics.org:389/dc=ecoinformatics,dc=org"
order="allowFirst"
scope="document"><allow><principal>public</principal><permission>read</permission></allow></access></dataset></eml:eml>

You can tell that in the "title" element they are encoded as numeric
character references:
Checking &#62; &#60; &#34; &#39; &#38;

However, after xerces parses the eml, the characters became (xerces
did this automatically):
Checking > < " ' &

So we had to add another method, normalize, in Metacat. After
normalizing, they changed back:
Checking &#62; &#60; &#34; &#39; &#38;

In the "inline" element the same thing happens: numeric character
references are changed to characters. However, we purposely did not add
the normalize method to handle characters in the "inline" element (the
inline element is handled differently from other elements in eml; its
content is stored in an external file rather than in the database). The
reason is that the data in an "inline" element could be an xml segment.
If we normalize it, it can become a mess. But if we don't normalize it,
it can cause trouble like this.

Do you have any good suggestions?
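
The dilemma in the comment above can be made concrete. This is a minimal sketch, not Metacat's actual normalize code: the class name `NormalizeSketch`, the method signature, and the choice to emit decimal numeric character references are all assumptions made for illustration.

```java
final class NormalizeSketch {

    /** Replaces the five XML-special characters with decimal numeric character references. */
    static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<':
                case '>':
                case '"':
                case '\'':
                case '&':
                    // e.g. '<' (code point 60) becomes "&#60;"
                    out.append("&#").append((int) c).append(';');
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Plain text content: escaping restores a form that is safe to emit as XML.
        System.out.println(normalize("Checking > < \" ' &"));
        // prints: Checking &#62; &#60; &#34; &#39; &#38;

        // Inline data that is itself an XML segment: the markup is escaped too,
        // so the segment would no longer be read back as elements.
        System.out.println(normalize("<row>1</row>"));
        // prints: &#60;row&#62;1&#60;/row&#62;
    }
}
```

The second call is the "mess" case: a blanket normalize cannot tell a literal < in tabular text from the < that opens a tag in an embedded xml segment.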


Updated by Jing Tao over 16 years ago

Here are some thoughts from Duane:

Hi Jing,

Thanks for looking into this further. This seems like a complex problem.

I'm not sure this is a good suggestion, but here is the only solution I can think of:

1. Prior to running Xerces parser, replace all instances of '&lt;' with '&amp;lt;' in inline data. (Same for other character entities.)
2. During insertion, Xerces parser converts '&amp;lt;' back to '&lt;'.
3. Inline data file after insertion contains the original '&lt;'.

Seems kind of crazy to do this, but maybe it would work.

We'll continue to think about this more.

Thanks,
Duane
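
The three steps Duane proposes can be sketched as follows. This is only an illustration under stated assumptions, not a proposed patch: the class `PreEscapeSketch`, the method `preEscapeInline`, and the regular expression (which assumes a bare `<inline>` element with no attributes and no nesting) are hypothetical, and the parsing helper uses the JDK's built-in SAX parser rather than Metacat's handlers.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class PreEscapeSketch {

    // Assumption: inline data sits in a plain <inline>...</inline> element.
    private static final Pattern INLINE =
            Pattern.compile("(<inline>)(.*?)(</inline>)", Pattern.DOTALL);

    /** Step 1: escape every '&' inside inline data, so "&lt;" becomes "&amp;lt;". */
    static String preEscapeInline(String xml) {
        Matcher m = INLINE.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(2).replace("&", "&amp;");
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + body + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Step 2: parse, letting the parser resolve one level of escaping. */
    static String parseText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalArgumentException("XML could not be parsed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String submitted = "<inline>|Prop &lt;== Labels</inline>";
        String preEscaped = preEscapeInline(submitted);
        // Step 3: the text that would be written to the inline data file.
        String stored = parseText(preEscaped);
        System.out.println(preEscaped);
        System.out.println(stored);
    }
}
```

In this sketch the stored text keeps the escaped form, so copying it back into the output unchanged stays well-formed. It covers only escaped characters in text content and does not address inline data that is itself an xml segment, which is the case Jing raised.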
