Bug #2512

require text content in elements to be non-empty

Added by Matt Jones about 15 years ago. Updated almost 13 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
eml - general bugs
Target version:
Start date:
08/15/2006
Due date:
% Done:

0%

Estimated time:
Bugzilla-Id:
2512

Description

Current EML schemas allow text content to be empty, which defeats validation rules by allowing users to provide content such as:
<attributeName> </attributeName>
I propose that these uses of empty strings should not be valid. We can achieve this by redefining the datatype we use for strings to have a minimum length of 1 and a pattern that requires at least one non-whitespace character.

In XML Schema, we can declare the element to be of type eml:nonemptystring
where eml:nonemptystring is a simple type derived from xs:string like this:
<simpleType name='nonemptystring'>
<restriction base='string'>
<minLength value='1'/>
<pattern value='\s*[^\s]+.*'/>
</restriction>
</simpleType>

I'm not sure that regular expression quite gets what we want, but it is close and would need some testing. It is intended to select (zero or more whitespace characters) followed by (one or more non-whitespace characters) followed by (any additional characters). We could probably remove the plus symbol, as it is redundant with the subsequent .*
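
The intent described above (optional leading whitespace, at least one non-whitespace character, then anything) can be tried out quickly outside a schema validator. The following is only a rough Python sketch, not part of EML: Python's `re` is not the XML Schema regex dialect, so the character classes are spelled out to approximate XSD's `\s` and `.`, and `fullmatch` stands in for XSD's rule that a pattern must match the whole value. The names `WITH_PLUS`, `WITHOUT_PLUS` and `accepts` are made up for illustration.

```python
import re

# XSD's \s covers space, tab, newline and carriage return; it is spelled out
# here because Python's \s also matches other Unicode whitespace.
WS = r"[ \t\n\r]"
NOT_WS = r"[^ \t\n\r]"
# XSD's "." matches any character except newline and carriage return.
DOT = r"[^\n\r]"

# The pattern as described in the text, with and without the plus symbol.
WITH_PLUS = re.compile(f"{WS}*{NOT_WS}+{DOT}*")
WITHOUT_PLUS = re.compile(f"{WS}*{NOT_WS}{DOT}*")


def accepts(pattern, text):
    """Return True if the whole string matches, as an XSD pattern facet requires."""
    return pattern.fullmatch(text) is not None


samples = ["", " ", "\t\n", "depth", "  depth", "depth  ", "a b c"]

# Map each sample to (accepted with plus, accepted without plus).
results = {s: (accepts(WITH_PLUS, s), accepts(WITHOUT_PLUS, s)) for s in samples}

for text, outcome in results.items():
    print(repr(text), outcome)
```

On these samples both variants agree, which is consistent with the plus symbol being redundant: every non-whitespace character is also matched by the trailing "any character" part.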

History

#1 Updated by Margaret O'Brien almost 13 years ago

Targeting 2.1.0, although this may drop back to unspecified.

#2 Updated by Margaret O'Brien almost 13 years ago

The pattern for this type will be something resembling:
<xs:pattern value="[\s]*[\S][\s\S]*"/>

I am assuming that we still want to allow newlines in strings, and the dot (.) specifically does not match these. At least some current xs:strings have these (e.g. <title> in test/eml-datasetWithCitation.xml).
We also need to test against some documents with \r\n line endings.
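
The difference between a dot-based pattern and the `[\s\S]` form can be illustrated with a small Python sketch. This is an approximation under stated assumptions: Python's `re` is used in place of an XML Schema validator, `fullmatch` emulates XSD's whole-value matching, and the sample strings are invented rather than taken from test/eml-datasetWithCitation.xml. In Python the dot also stops at a newline by default, which is enough to show the effect.

```python
import re

# A dot-based pattern: "." does not cross a newline.
DOT_BASED = re.compile(r"\s*\S.*")
# The pattern proposed in this comment: [\s\S] matches any character at all.
NEWLINE_SAFE = re.compile(r"[\s]*[\S][\s\S]*")


def valid(pattern, text):
    """Return True if the entire string matches the pattern."""
    return pattern.fullmatch(text) is not None


# Invented sample values with Unix and Windows line endings.
title_lf = "Long-term data\n  continued on a second line"
title_crlf = "Long-term data\r\n  continued"
blank_crlf = " \r\n "

for label, text in [("LF", title_lf), ("CRLF", title_crlf), ("blank", blank_crlf)]:
    print(label, valid(DOT_BASED, text), valid(NEWLINE_SAFE, text))
```

In this sketch the multi-line titles are rejected by the dot-based pattern and accepted by the `[\s\S]` form, while the whitespace-only value is rejected by both.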

#3 Updated by Margaret O'Brien almost 13 years ago

We need to look at the effect on instance documents of switching all xs:string to NonEmptyStringType. This type-switch will probably have a bigger effect on the ability of authors to migrate their documents than the changes to the document structure itself. Structure changes will be accomplished by the xsl stylesheet, but retyping all strings means that content could now be required where none previously existed.

To start, I considered just the anonymous simple type elements that are
required by EML and are type="xs:string". It seemed reasonable that if an element is optional, its content could also be optional. In all, there are 81 of these, which are generally easy to retype with a statement like:
sed -e '/\<xs:element\ name/{
/minOccurs=\"0\"/!s/xs:string/res:NonEmptyStringType/
}
'
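
For readers less familiar with sed, the same line-based rule can be sketched in Python: on lines that declare an element by name, and that do not carry minOccurs="0", the first xs:string is replaced. This is an illustrative equivalent, not the script that was actually run; the function name `retype_required_strings` and the sample declarations are invented for the example.

```python
def retype_required_strings(schema_text):
    """Retype xs:string on required element declarations, line by line."""
    out = []
    for line in schema_text.splitlines():
        is_element = "<xs:element name" in line
        is_optional = 'minOccurs="0"' in line
        if is_element and not is_optional:
            # Like sed's s command without the g flag: first occurrence only.
            line = line.replace("xs:string", "res:NonEmptyStringType", 1)
        out.append(line)
    return "\n".join(out)


# Invented declarations: one required element, one optional, one attribute.
sample = "\n".join([
    '<xs:element name="title" type="xs:string"/>',
    '<xs:element name="shortName" type="xs:string" minOccurs="0"/>',
    '<xs:attribute name="scope" type="xs:string"/>',
])

converted = retype_required_strings(sample)
print(converted)
```

Only the first sample line changes; the optional element and the attribute keep xs:string. Like the sed statement, this approach assumes each element declaration sits on a single line.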

There are other elements which could be examined and retyped manually, or would
be caught by a general s/xs:string/res:NonEmptyStringType/. For example, see <keyword>
(eml-resource.xsd) -- a complexType/simpleContent, so the reference to
xs:string occurs below the element declaration. Other elements (and many
attributes) use xs:restriction base="xs:string" as the start of an enumeration
list, but changing these to base="NonEmptyStringType" seems superfluous.

So to start, only one schema file, "eml-resource.xsd", has been checked into
CVS, so that others can try out the effect of NonEmptyStringType while
its scope is small. Particularly, I was thinking about Morpho. 7 element
declarations occur in this file that were formerly of xs:string, and now are
NonEmptyStringType. See the list below. I think that Morpho wizards deal with
only title, references and keyword, although any are available in the tree
editor. My local copy has all 81 (anonymous, simple) element declarations
retyped (in 17 schema docs), plus the 5 anonymous attributes. I am testing a
variety of EML201 documents from the LTER metacat against this schema as I
convert them -- basically while I work on the XSL stylesheet.

title
distribution/connectionDefinition/parameterDefinition/name
distribution/connectionDefinition/parameterDefinition/description
distribution/connection/parameter/name
distribution/connection/parameter/description
distribution/offline/MediumName
references (multiple paths)
keyword (a named type)

#4 Updated by Jing Tao almost 13 years ago

I checked the Morpho code, and we use these three paths in the new package wizard:
title
distribution/offline/MediumName
keyword (a named type)

Morpho also checks whether the input is an empty string. If it is, Morpho asks the user to enter something there.

#5 Updated by Margaret O'Brien almost 13 years ago

The optional elements have had their xs:strings retyped to res:NonEmptyStringType.

#6 Updated by Redmine Admin over 8 years ago

Original Bugzilla ID was 2512
