Ecoinformatics Redmine: Peter McCartney https://projects.ecoinformatics.org/ecoinfo/ 2005-08-24T21:51:52Z
EML - Bug #2079: Move project element to top-level https://projects.ecoinformatics.org/ecoinfo/issues/2079#change-7109 2005-08-24T21:51:52Z Peter McCartney peter.mccartney@asu.edu
<p>I don't object, as I think this would be a backward-compatible change. However, <br />I think it's a patch over what is really desired: the ability to manage and <br />exchange metadata in more atomic units than whole EML documents. I can easily <br />foresee similar requests to treat spatial reference, party, attributes, etc. as <br />"root" elements (they aren't really root, are they?). I think we need to begin <br />some longer-term thinking about eliminating the <eml> tag itself, which would <br />let the content of each xsd file in EML potentially be a root element.</p>
<p>If we continued to support the eml tag but just agreed it was OK to make XML <br />documents that validate only against eml-project or eml-party, I think even that is <br />technically still backward compatible.</p>
EML - Bug #1662: id key definitions in EML https://projects.ecoinformatics.org/ecoinfo/issues/1662#change-5786 2004-09-01T21:34:38Z Peter McCartney peter.mccartney@asu.edu
<p>I checked in changes to eml-resource.xsd and buildDocbook.xsl as a proposed fix<br />for part of this bug. The fix introduces an optional system attribute on the<br />references element, which corresponds to the existing system attribute<br />accompanying the id attribute throughout EML. If used, it specifies that the<br />matching id pointed to by the references tag has a scope defined by the system<br />attribute. Thus, EML documents can contain more than one id attribute with the<br />same value, provided that they are scoped with different system attributes.<br />Changes to the "reusable content" section reflect this. If system is defined, it<br />must be defined for both the target and the referencing element, and the values must<br />match. Otherwise it must be absent from both.</p>
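<p>The matching rule described above can be sketched as a small check. This is only an illustration, not the schema's actual key/keyref mechanism: the function name <code>check_references</code> and the sample document are hypothetical, and the sample is a simplified stand-in for real EML. An id is treated as unique per (system, id) pair, and a references element resolves only if its system attribute matches the target's (or both are absent).</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for an EML fragment (not schema-valid EML).
SAMPLE = """
<dataset>
  <creator id="42" system="asupersonnel"><surName>Smith</surName></creator>
  <project id="42" system="cesdataset"><title>Survey</title></project>
  <contact><references system="asupersonnel">42</references></contact>
  <metadataProvider><references>42</references></metadataProvider>
</dataset>
"""


def check_references(xml_text):
    """Return a list of problems with id/references scoping."""
    root = ET.fromstring(xml_text)
    problems = []
    seen = set()

    # Pass 1: collect ids, keyed by (system, id). A missing system is None,
    # so an unscoped id can only be matched by an unscoped reference.
    for elem in root.iter():
        if "id" in elem.attrib:
            key = (elem.get("system"), elem.get("id"))
            if key in seen:
                problems.append(f"duplicate id {key[1]!r} in system {key[0]!r}")
            seen.add(key)

    # Pass 2: every references element must point at a collected key.
    for ref in root.iter("references"):
        key = (ref.get("system"), (ref.text or "").strip())
        if key not in seen:
            problems.append(
                f"unresolved reference {key[1]!r} in system {key[0]!r}"
            )
    return problems


problems = check_references(SAMPLE)
print(problems)
```

<p>In the sample, the two ids with value 42 do not collide because their systems differ, while the unscoped reference to 42 is reported because no unscoped id 42 exists.</p>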
<p>I think there is still some ambiguity in the wording of the docs that could<br />benefit from additional input. The eml root has an attribute called scope,<br />which is "document" by default and "system" by option. We now allow<br />more than one system to be referenced. If system attributes are used once, does this mean<br />they must be provided in all cases? Or can I write an EML document where the ids for<br />attributes are document scope, but the ids supplied for literature cited are<br />scoped to a particular system? Do we really need the "scope" attribute? If it is<br />left as default, is it invalid to use the system attribute anywhere in the<br />document?</p>
EML - Bug #1662 (In Progress): id key definitions in EML https://projects.ecoinformatics.org/ecoinfo/issues/1662 2004-08-27T00:24:56Z Peter McCartney peter.mccartney@asu.edu
<p>There are several problems emerging with the unique key definitions in EML. In<br />eml.xsd, there is a key definition that requires all instances of the @id<br />attribute to be unique within the document. When content is to be duplicated,<br />there is to be one instance of the content with an id assigned, and all other<br />instances are to use the <references> tag to point to that id. Ids in a<br />document may be declared as document or system scope, meaning that they are<br />declared to be unique only within the document or within a broader naming<br />authority that is identified in the @system attribute.</p>
<p>Here are some of the problems:<br />1. There is no enforced unique constraint on the @system attribute, although it is<br />implicit. Thus it is possible to create a dataset element using an id with<br />system=cesdataset and a creator with system=asupersonnel. When those systems<br />have each assigned the same value, you get conflicts. To avoid them, users are<br />forced to change identifiers and break the pointers back to the original source<br />of the content.</p>
<p>2. The spirit of the id and references tags was to introduce some degree of<br />normalization that XML inherently lacks. However, it can be a rather arbitrary<br />choice within the document which instance is the one that gets the content and<br />which ones get the reference pointer. This makes it very difficult for people<br />trying to write tools to edit EML documents, since one could easily drop an<br />element whose descendants hold content that other elements are<br />pointing to. This needlessly complicates programming for EML and is likely to<br />discourage potential contributors of tools for working with the standard.</p>
<p>3. eml-methods allows you to embed the EML of related datasets that were used to<br />produce the current one in the methods discussion. Conflicts can arise between<br />the identifiers of the embedded datasets. Attempting to resolve conflicts between<br />documents using references could mean you have to edit those documents rather<br />than embed them.</p>
EML - Bug #1197: dictionary needed for externallyDefinedFormat https://projects.ecoinformatics.org/ecoinfo/issues/1197#change-4421 2003-12-17T20:45:21Z Peter McCartney peter.mccartney@asu.edu
<p>OK, the issue seems to be:</p>
<p>1) We need a controlled enumeration for externallyDefinedFormat that is both<br />recognized by users and parsable by applications.</p>
<p>2) MIME types were created to serve this purpose. Project Alexandria<br />investigated this and decided that a combination of both format name and MIME<br />type was needed, since the appropriate MIME type is not always adequate. Read <br /><a class="external" href="http://www.alexandria.ucsb.edu/middleware/dtds/ADL-access-report.dtd">http://www.alexandria.ucsb.edu/middleware/dtds/ADL-access-report.dtd</a> to see<br />their discussion. Basically they provide three elements in their metadata schema<br />for downloads: format, mime, and encoding.</p>
<p>3) Vendors are slowly adding MIME types, but very few scientific data formats<br />have been added. If we define MIME types for these formats, we could register them<br />only by putting an x- in front of the name. And of course these definitions would be<br />deprecated when the owner puts in a definition.</p>
<p>4) dataFormat is required, so if the data are in Oracle, we need to have<br />SOMETHING to put here, even if the information is superfluous once<br />connectionDefinition is filled out. It's not clear to me if MIME types even apply to<br />connections. Perhaps these are all octet-streams?</p>
<p>5) Going beyond the enumeration issue, if we were to adopt a dictionary, we have<br />the option of storing other metadata on a format that could be useful. The<br />example I show here lists each part of a multipart format and its MIME type. We<br />use a file similar to this in our Xylopia data service to determine what parts<br />of a file format need to be gathered up into the zip package. My example<br />lists extensions, which works fine for dealing with shapefiles, dbf, MapInfo,<br />GeoTIFF, and so on. The only multipart type that does not use extensions<br />to identify its parts is ArcInfo coverages. In this case the rules rely on<br />folder names and file names under those folders to handle the different parts.<br />Because coverages within one folder share a common metadata folder, you cannot<br />move coverages by zipping up the files. You must open the coverage and save it as some<br />other format for transport.</p>
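<p>As a rough sketch of how an application might use such a dictionary to decide which files belong in a package: the code below is not the Xylopia implementation, and the function names <code>load_formats</code> and <code>files_to_package</code>, the base name "plots", and the trimmed dictionary are all hypothetical. It assumes the extension-based layout described above and looks format names up case-insensitively, which handles "Shapefile" versus "shapefile" but not variants such as "shape file".</p>

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical dictionary in the extension-based layout.
DICTIONARY = """<dataFormats>
  <externallyDefinedFormat name="Shapefile" description="ESRI shapefile">
    <part extension="shp" mime="application/octet-stream"/>
    <part extension="dbf" mime="application/octet-stream"/>
    <part extension="shx" mime="application/octet-stream"/>
  </externallyDefinedFormat>
  <archiveFormat name="zip" description="pkzip compressed archive format">
    <part extension="zip" mime="application/zip"/>
  </archiveFormat>
</dataFormats>"""


def load_formats(xml_text):
    """Map each lower-cased format name to its list of part extensions."""
    root = ET.fromstring(xml_text)
    formats = {}
    for fmt in root:
        extensions = [part.get("extension") for part in fmt.findall("part")]
        formats[fmt.get("name").lower()] = extensions
    return formats


def files_to_package(formats, format_name, base_name):
    """List the file names that make up one dataset in the given format."""
    extensions = formats[format_name.lower()]  # KeyError for unknown formats
    return [f"{base_name}.{ext}" for ext in extensions]


formats = load_formats(DICTIONARY)
parts = files_to_package(formats, "shapefile", "plots")
print(parts)
```

<p>A format that identifies its parts by folder and file names rather than extensions, such as the coverages mentioned above, would need a different rule and is not covered by this sketch.</p>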
<p>There was some debate about the utility of this multipart info, so I'm willing to<br />table that part of the issue and continue to do it internally ourselves. But it<br />would be really nice if we could agree on how to ensure that shapefile will always<br />be shapefile and not Shapefile, shape file, shape, esrishapefile, etc.</p>
<p>The attachment I put in (and edited) was an example of such a dictionary,<br />showing how multipart, single-part and service formats could all be handled using<br />a strategy similar to ADA, where we define format types and then list the MIME types<br />for each of the parts. Matt felt this was inappropriate, as there is in fact a<br />multipart MIME type. <br />So a variant on this would be to put the mime attribute in the<br />externallyDefinedFormat tag rather than in the part tag (or both). The nice<br />thing about this is that, like stmml.xsd, it shields users from complicated<br />terminology yet does enable machine processing through MIME types when they<br />exist. If we leave the mime element out of EML, then the dictionary can have the<br />most up-to-date MIME type for any given format and we don't have to edit EML files when<br />new MIME types appear.</p>
EML - Bug #1197: dictionary needed for externallyDefinedFormat https://projects.ecoinformatics.org/ecoinfo/issues/1197#change-4420 2003-12-17T20:19:09Z Peter McCartney peter.mccartney@asu.edu
<blockquote>
<p><?xml version="1.0" encoding="UTF-8"?><br /><dataFormats><br /><externallyDefinedFormat name="Shapefile" description="ESRI shapefile"><br /><part extension="shp" mime="application/octet-stream"/><br /><part extension="dbf" mime="application/octet-stream"/><br /><part extension="shx" mime="application/octet-stream"/><br /><part extension="prj" mime="application/octet-stream"/><br /><part extension="idx" mime="application/octet-stream"/><br /></externallyDefinedFormat><br /><externallyDefinedFormat name="dBase4" description="dBase file format"><br /><part extension="dbf" mime="application/octet-stream"/><br /><part extension="idx" mime="application/octet-stream"/><br /></externallyDefinedFormat><br /><externallyDefinedFormat name="MSSQLServer7.0" description="MS SQL server version 7.0"/><br /><archiveFormat name="zip" description="pkzip compressed archive format"><br /><part extension="zip" mime="application/zip"/><br /></archiveFormat><br /></dataFormats></p>
</blockquote>
EML - Bug #1197: dictionary needed for externallyDefinedFormat https://projects.ecoinformatics.org/ecoinfo/issues/1197#change-4419 2003-12-17T18:36:51Z Peter McCartney peter.mccartney@asu.edu
<p>Here is a possible format for a dictionary file to provide an authority and<br />reference for data formats (and archive formats).</p>
EML - Bug #1197 (New): dictionary needed for externallyDefinedFormat https://projects.ecoinformatics.org/ecoinfo/issues/1197 2003-10-31T17:17:21Z Peter McCartney peter.mccartney@asu.edu
<p>Externally defined format is useless for automatic processing unless you have<br />some idea what to look for. This is a step backwards from FGDC, which at least<br />provided enumerations for the common file formats at the time.</p>
SEEK - Bug #1056: XQuery research: processors, SQL translators, XQuery examples https://projects.ecoinformatics.org/ecoinfo/issues/1056#change-4005 2003-06-16T19:18:37Z Peter McCartney peter.mccartney@asu.edu
<p>OK, based on the conference call of 6/16/03, I think we've agreed to put<br />this off until later in the project. The argument was that by the time we<br />reduced the XQuery syntax down to a level of complexity for which we can<br />reasonably write processors now, we don't have much more than what we had with<br />the filter query syntax worked out in Seattle (query.xsd).</p>
<p>The implication is that we give up any syntax for defining the return object as<br />part of the query request. This functionality might be provided in a separate<br />retrieve request.</p>
EML - Bug #918: example tag in eml:documentation https://projects.ecoinformatics.org/ecoinfo/issues/918#change-3572 2002-11-21T20:28:42Z Peter McCartney peter.mccartney@asu.edu
<p>Oops... the example I saw was actually for a non-leaf node, in which it looks<br />like someone was trying to show examples for both of the leaf nodes it<br />contains. So let me rewrite this bug to say:</p>
<p>It seems like we have inconsistent methods for examples for non-leaf nodes. Some<br />are empty, some have text that says in so many words "see examples for child<br />nodes", and at least one shows an XML snippet. This might make automated<br />processing of examples difficult if we can't predict what's going to be in there.</p>
EML - Bug #918 (Resolved): example tag in eml:documentation https://projects.ecoinformatics.org/ecoinfo/issues/918 2002-11-21T20:19:32Z Peter McCartney peter.mccartney@asu.edu
<p>I've noticed at least one place in eml-attribute where the example tag in the<br />documentation inlines the element name as part of the text using > and <<br />symbols. It seems this practice has generally been dropped (for good reason), so we might<br />want to scan for the odd case where it still exists. I didn't fix it in<br />eml-attribute because Matt currently has that in a branch and I don't know where<br />the best place to put it is.</p>
EML - Bug #655: need better model for numeric domains for attributes https://projects.ecoinformatics.org/ecoinfo/issues/655#change-2208 2002-10-25T19:58:08Z Peter McCartney peter.mccartney@asu.edu
<p>Does this belong in numeric domain or in measurement scale? It seems like it<br />qualifies measurement scale rather than domain.</p>
EML - Bug #654 (Resolved): scope of the unit element https://projects.ecoinformatics.org/ecoinfo/issues/654 2002-10-25T17:02:33Z Peter McCartney peter.mccartney@asu.edu
<p>Discussion of stmml has revealed several issues, one of which is the fact that <br />units, as expressed by stmml, are applicable only to measurable quantities. <br />Many variables that ecologists put in an EML dataset and that might intuitively <br />appear to have "units" (geologic age, sex, or species names, for example) do <br />not have units and thus must be declared "dimensionless" or "undefined" <br />because unit is required for all attributes. I don't think it's intuitively <br />apparent to users that these are domains, not units, and that they should be <br />described as such.</p>
<p>In the required element <measurementScale> we class all attributes as nominal, <br />ordinal, interval or ratio. Strictly speaking, only interval scales have units; <br />the rest are dimensionless. In practice, there is still some value in knowing <br />the units of the denominator and/or numerator in ratios of two dimensions, so <br />we probably don't want to throw out the baby with the bath water there.</p>
<p>To help clarify this, we might consider merging units into measurementScale <br />so that things may be made required when relevant. An example might be:</p>
<p><measurementScale><br /> <interval><br /> <standardUnit><br /> metersPerSecond<br /> </standardUnit><br /> </interval><br /></measurementScale><br />A variant that does away with embedding custom units in additionalMetadata would be:<br /><measurementScale><br /> <interval><br /> <unit library="http://ecoinformatics.org/emlUnitDictionary.xml"><br /> metersPerSecond<br /> </unit><br /> </interval><br /></measurementScale></p>
<p>This would mean any custom unit definitions would need to be published online.</p>
<p>The content model for measurement scale might look like:<br />element measurementScale(nominal | ordinal | interval | ratio)<br />element nominal <br />element ordinal<br />element interval (unit)<br />element ratio (I'm not sure what would go here. It seems like we're hacking <br />unit definitions in emlUnitDictionary for ratios already, but maybe that should <br />be pulled out and we provide a structured ratio definition here that references <br />two (or more?) true dimensions)</p>
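<p>One way to read the content model above is as a simple structural check. The following is only a sketch of that reading, not an XSD: the function name <code>check_measurement_scale</code> is hypothetical, it assumes that nominal and ordinal carry no unit because the model lists them without content, and it deliberately leaves ratio unconstrained since its content is still an open question.</p>

```python
import xml.etree.ElementTree as ET

SCALES = {"nominal", "ordinal", "interval", "ratio"}


def check_measurement_scale(xml_text):
    """Return None if the fragment fits the sketched model, else an error."""
    scale = ET.fromstring(xml_text)
    if scale.tag != "measurementScale":
        return "root must be measurementScale"

    # measurementScale(nominal | ordinal | interval | ratio): exactly one child.
    children = list(scale)
    if len(children) != 1 or children[0].tag not in SCALES:
        return "exactly one of nominal | ordinal | interval | ratio is required"

    kind = children[0]
    units = kind.findall("unit")
    # interval (unit): the unit is required here.
    if kind.tag == "interval" and len(units) != 1:
        return "interval requires exactly one unit"
    # Assumption: nominal and ordinal are listed without content, so no unit.
    if kind.tag in {"nominal", "ordinal"} and units:
        return f"{kind.tag} must not carry a unit"
    # ratio is left unchecked; its content model is undecided.
    return None


ok = check_measurement_scale(
    "<measurementScale><interval><unit>metersPerSecond</unit></interval>"
    "</measurementScale>"
)
bad = check_measurement_scale(
    "<measurementScale><nominal><unit>meter</unit></nominal></measurementScale>"
)
print(ok, bad)
```

<p>With this shape, a nominal attribute such as a site name never has to supply a placeholder unit, while an interval attribute cannot omit one.</p>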
<p>All attributes would still have a domain element. The existing bug on that <br />still applies.</p>
EML - Bug #638: request for id/ref in attributeDomain https://projects.ecoinformatics.org/ecoinfo/issues/638#change-2161 2002-10-18T15:21:20Z Peter McCartney peter.mccartney@asu.edu
<p>I agree that it makes sense to do it at attributeDomain rather than each of the<br />options below it.</p>
<p>I can't think of too many datasets we have where a particularly lengthy<br />enumeration list is used more than once without creating a table as part of the<br />dataset to contain the enumeration, but I suppose there are people who never<br />build such tables in their datasets, or for some reason don't include them in the<br />release.</p>
EML - Bug #637: attributeDomain should be required https://projects.ecoinformatics.org/ecoinfo/issues/637#change-2158 2002-10-18T15:09:55Z Peter McCartney peter.mccartney@asu.edu
<p>In hindsight, I wish we had thought to put all of these things (domain,<br />precision, unit, etc.) nested under the appropriate measurement scale element,<br />since that is the one property that is truly relevant for all attributes. That<br />way, if the data were nominal (e.g. site name), we wouldn't have to force<br />users to put in non-answers for things like unit and precision. If they are<br />required, then we have to have a clear option for when the element is not<br />relevant. What is most useless is a required field that has some uncontrolled<br />text in it that means "not relevant" but can't be interpreted without reading<br />it. My gut feeling is that few people really define a domain for an attribute, and<br />if you make it required you will get 10 - 20 entries where someone put "no<br />domain defined" in the textDomain element for every one where someone actually thinks<br />about their data and puts in something meaningful.</p>
<p>Why don't you contact the person to see if they simply missed the point, or<br />whether they would have simply answered "none" had they been forced to fill in a domain<br />field? Do we have any sense of what proportion of attributes have any meaningful<br />domain restriction beyond what's implied by the scale, units and storage type?</p>
EML - Bug #624: eml-methods/methodsType needs clarification on choice/sequence https://projects.ecoinformatics.org/ecoinfo/issues/624#change-2122 2002-10-11T19:50:41Z Peter McCartney peter.mccartney@asu.edu
<p>I'm on IRC right now with David and have realized that I made an error in my<br />files that I never caught. The intended design was to be:</p>
<p>methodsType(methodStep+,sampling?,qualityControl*)</p>
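<p>The intended sequence can be illustrated with a small check on the order of child elements. This is a sketch under assumptions, not the schema itself: the function name <code>matches_methods_type</code> and the wrapping <code>methods</code> element in the samples are hypothetical, and namespaces are ignored. The regular expression mirrors the model directly: one or more methodStep, then an optional sampling, then any number of qualityControl.</p>

```python
import re
import xml.etree.ElementTree as ET

# methodsType(methodStep+, sampling?, qualityControl*), matched against the
# space-terminated sequence of child tag names.
PATTERN = re.compile(r"(methodStep )+(sampling )?(qualityControl )*")


def matches_methods_type(xml_text):
    """True if the element's children follow the intended sequence."""
    root = ET.fromstring(xml_text)
    sequence = "".join(child.tag + " " for child in root)
    return PATTERN.fullmatch(sequence) is not None


valid = matches_methods_type(
    "<methods><methodStep/><methodStep/><sampling/><qualityControl/></methods>"
)
invalid = matches_methods_type("<methods><sampling/><methodStep/></methods>")
print(valid, invalid)
```

<p>Because this is a sequence rather than a choice, a sampling element that appears before the first methodStep is rejected, as is an element with no methodStep at all.</p>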