Bug #371
closed
problems with wizard-generated eml-physical documents
Added by Matt Jones almost 23 years ago.
Updated over 22 years ago.
Category:
morpho - general
Description
The text import wizard generates eml-physical documents. They have some issues:
1) fieldDelimiter and recordDelimiter are written out in English instead of
using the actual ASCII symbol (e.g., "comma" instead of ","). We need to decide
on a convention. I would suggest using the ASCII character itself, or its octal
(\054) or hex (0x2C) equivalent for non-printing characters. Thus, a newline
would be "0x0A" and a carriage return would be "0x0D", but a comma would be ",".
2) Wrong number of header lines entered into metadata. The file says there was
1 header line, when in fact there were none.
3) The "size" element contains the string "bytes" which is not supposed to be
there because it is defined in the "unit" attribute of the size element. So, it
should be:
<size unit="bytes">26</size>
4) The SYSTEMID in the DOCTYPE line is wrongly set to "eml-entity" when it
should be set to "eml-physical".
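A minimal sketch of the convention suggested in item 1, for illustration only. The function name `delimiter_notation` is hypothetical and not part of Morpho or EML. Writing the space character as hex is an assumption made here, since the description does not say how a space should be handled.

```python
def delimiter_notation(ch: str) -> str:
    """Return the notation suggested in item 1 for one delimiter character.

    Assumption: a space is written as hex, because a bare space is easy
    to lose inside an XML element.
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # Printable, non-whitespace ASCII (such as a comma) is written as-is.
    if ch.isascii() and ch.isprintable() and not ch.isspace():
        return ch
    # Everything else is written as a hex code, e.g. newline -> 0x0A.
    return f"0x{ord(ch):02X}"


comma = delimiter_notation(",")
newline = delimiter_notation("\n")
carriage_return = delimiter_notation("\r")
```

With this sketch a comma stays ",", while a newline and a carriage return become "0x0A" and "0x0D", matching the examples in item 1.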
Concerning item #1: The use of words to describe delimiters seems appropriate
(and will happen) as long as we allow the user to enter eml-physical information
by hand. Only a computer geek will understand octal or hex (though such notation
may be required for machine parsing). How do we specify a 'space' other than by
the word, or by octal or hex?
Concerning the number of header lines: Is the line containing column titles (if
present) considered a header line?
Items #2, #3, and #4 in the original comments have been fixed.
Yes, the line containing column names is a header line, as are any other lines
that precede the actual data. In the case I described, there were no column
headings (I unchecked the "use first row as headings" box), and so the number of
header lines should have been 0.
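To illustrate why the header line count matters, here is a rough sketch of a reader that skips the stated number of header lines before parsing data. The function `read_data_rows` and its parameters are hypothetical and are not taken from Morpho.

```python
import csv


def read_data_rows(text, num_header_lines, field_delimiter=","):
    """Skip num_header_lines lines, then parse the remainder as data rows."""
    lines = text.splitlines()
    # Every line before the data (column names included) counts as a header.
    reader = csv.reader(lines[num_header_lines:], delimiter=field_delimiter)
    return list(reader)


sample = "1,2\n3,4\n"  # no column headings, so the correct count is 0
correct = read_data_rows(sample, 0)
wrong = read_data_rows(sample, 1)  # a count of 1 silently drops a data row
```

With a count of 1 recorded for a file that has no headings, the first data row is treated as a header and lost.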
As far as symbols for non-printing characters go -- I think the convention
should be something that is universal, and therefore an octal or hex value. It's
not pretty, but I think it is better than making up word symbols for all of the
non-printing characters that exist in Unicode. User interfaces can always
translate the non-printing characters into a symbol (like MS Word does for
spaces, tabs, and hard returns) or into a word like "space", but the actual XML
should contain a more universal representation. Interestingly, XML provides a
mechanism to use the hex value for Unicode characters, as does URL encoding for
HTTP, so maybe we should just go with that (hex). It's really just a matter of
defining it in the EML documentation for eml-physical and then using that
convention in Morpho.
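As a rough illustration of the XML mechanism mentioned above, the sketch below writes a delimiter as a hexadecimal character reference and lets a standard XML parser decode it. The helper `to_char_ref` and the simplified `<physical>` wrapper are assumptions for this example, not the actual eml-physical structure.

```python
import xml.etree.ElementTree as ET


def to_char_ref(ch):
    """Format one character as an XML hexadecimal character reference."""
    return f"&#x{ord(ch):02X};"


tab_ref = to_char_ref("\t")

# Simplified, hypothetical fragment; real eml-physical has more elements.
doc = f"<physical><fieldDelimiter>{tab_ref}</fieldDelimiter></physical>"
root = ET.fromstring(doc)

# The parser turns the reference back into the character itself.
decoded = root.findtext("fieldDelimiter")
```

One design consequence is that any conforming XML parser does the decoding, so the reading application needs no custom lookup table. Note that XML 1.0 allows only tab, line feed, and carriage return among the control characters below 0x20, even as character references.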
Field delimiter notation changed to XML hex notation; i.e. "&#x09;" for a tab.
Original Bugzilla ID was 371