
Revision 8798

Added more description for the model. Cleaned up a few formatting issues.

View differences:

docs/user/metacat/source/semantic-annotation.rst
67 67
   * `oa:hasSelector <http://www.openannotation.org/spec/core/specific.html#FragmentSelector>`_ : Specifies the part of a resource to which an annotation applies. An XPath FragmentSelector will commonly be used for annotating XML-based metadata resources
68 68
   * `oa:annotatedBy <http://www.openannotation.org/spec/core/core.html#Provenance>`_ : [subProperty of prov:wasAttributedTo] The object of the relationship is a resource that identifies the agent responsible for creating the Annotation. 
69 69

  
70
   * `oboe:Measurement <http://www.w3.org/ns/prov#wasInformedBy>`_ : The primary body of semantic annotations on attributes.
71
   * `oboe:ofCharacteristic <http://www.w3.org/ns/prov#used>`_ : Specifies which Characteristic (subclass) is measured
72
   * `oboe:usesStandard <http://www.w3.org/ns/prov#wasInformedBy>`_ : Specifies in which Standard (Unit subclass) the measurement is recorded.
70
   * `oboe:Measurement <http://ecoinformatics.org/oboe/oboe.1.0/oboe-core.owl#Measurement>`_ : The primary body of semantic annotations on attributes.
71
   * `oboe:ofCharacteristic <http://ecoinformatics.org/oboe/oboe.1.0/oboe-core.owl#ofCharacteristic>`_ : Specifies which Characteristic (subclass) is measured
72
   * `oboe:usesStandard <http://ecoinformatics.org/oboe/oboe.1.0/oboe-core.owl#usesStandard>`_ : Specifies in which Standard (Unit subclass) the measurement is recorded.
73 73
   
74 74

  
75 75
::
76 76

  
77

  
78
Model details
79
--------------
80
Using the ``weight`` column in our example data package, we can illustrate the annotation model's use of concepts from OBOE, OA, and PROV.
81
The primary entry point for the annotation is ``#ann.4.1``, which was asserted by Ben Leinfelder (foaf:name), identified with his ORCID URI (oa:annotatedBy).
82
The body of the annotation (oa:hasBody) is comprised of an oboe:Measurement instance, ``#weight``, that measures ``Mass`` (oboe:ofCharacteristic) in ``Gram`` (oboe:usesStandard).
83
The target of the annotation (oa:hasTarget) points to the EML metadata resource (oa:hasSource) that documents the data table and selects a particular part of the metadata that describes 
84
the specific ``weight`` data attribute (oa:hasSelector). Because the EML metadata is serialized as XML, we can use an XPath oa:FragmentSelector to identify the data column being annotated.
85
Note that our XPath expression identifies ``weight`` as the second column in the first data table in the data package: #xpointer(/eml/dataSet/dataTable[1]/attributeList/attribute[2]).
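The selector above can be resolved against a metadata document with any XPath-capable library. The following is a minimal sketch in Python, assuming a simplified, un-namespaced stand-in for the EML document. The sample XML, the column names other than ``weight``, and the ``select_fragment`` helper are hypothetical and are not part of Metacat.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for an EML document describing a
# three-column table; only the "weight" column name comes from the docs.
EML_SAMPLE = """
<eml>
  <dataSet>
    <dataTable>
      <attributeList>
        <attribute><attributeName>site</attributeName></attribute>
        <attribute><attributeName>weight</attributeName></attribute>
        <attribute><attributeName>height</attributeName></attribute>
      </attributeList>
    </dataTable>
  </dataSet>
</eml>
"""


def select_fragment(xml_text, xpointer):
    """Return the element addressed by an '#xpointer(/root/...)' selector."""
    prefix = "#xpointer("
    if not (xpointer.startswith(prefix) and xpointer.endswith(")")):
        raise ValueError("expected a selector of the form #xpointer(...)")
    path = xpointer[len(prefix):-1]

    root = ET.fromstring(xml_text)
    # ElementTree paths are relative to the root element, so the leading
    # "/eml" step is checked against the root tag and then dropped.
    steps = path.lstrip("/").split("/", 1)
    if steps[0] != root.tag:
        raise ValueError("selector does not start at the document root")
    if len(steps) == 1:
        return root
    return root.find(steps[1])


column = select_fragment(
    EML_SAMPLE,
    "#xpointer(/eml/dataSet/dataTable[1]/attributeList/attribute[2])",
)
print(column.findtext("attributeName"))
```

ElementTree only supports a subset of XPath, which is enough for the positional predicates used here. A real EML document is namespaced, so a fuller XPath implementation would likely be needed in practice.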
86

  
87
In order to bind the column annotation of the metadata to the physical data object (the three-column CSV file), we need to traverse the packaging model where an additional annotation expresses the relationship 
88
between the data and metadata objects. The annotation, ``#ann.1.1``, asserts that the Metadata file (#eml.1.1) describes (cito:documents) the data file (#data.1.1). More specifically, the annotation target specifies 
89
where in the EML the #data.1.1 object is described by using an oa:FragmentSelector with an XPath pointer to the first data file documented in the EML: #xpointer(/eml/dataSet/dataTable[1]).
90

  
91
Note that the annotation model uses a slightly different model than the original ORE resource map model recommended by DataONE. While it is more complicated to include pointers to data documentation within the metadata,
92
we have found that the current ORE maps are not sufficiently descriptive on their own and any consumers must also consult the metadata to figure out which object is the csv, which is the pdf, which is the script, etc...
93
By incorporating the metadata pointer within the annotation model, we hope to be able to handle data packages that use many different metadata serializations without having to write custom handlers for each formatId.
94

  
77 95
Indexing
78 96
--------
79
The Metacat Index component has been enhanced to parse semantic models provided as RDF. The general purpose RdfXmlSubprocessor can be used with SparqlFields to extract key concepts from any given model that is added to the document store.
80
The processor assumes that the identifier of the RDF document is the name of the graph being inserted into the triple store.
97
The Metacat Index component has been enhanced to parse semantic models provided as RDF. 
98
The general purpose RdfXmlSubprocessor can be used with SparqlFields to extract key concepts from any given model that is added to the Metacat MN document store.
99

  
100
The processor assumes that the identifier of the RDF document is the name of the graph being inserted into the triple store and provides that graph name to the query engine for substitution in any query syntax ($GRAPH_NAME).
81 101
The SPARQL requirements are that the solution[s] return the identifier (pid) of the object being annotated, and the index field being populated with the given value[s].
82 102
If multiple fields are to be extracted from the model for indexing, a distinct SPARQL query should be used for each field.
103

  
83 104
The query can (and is largely expected to) be constrained to the named graph that contains only that set of annotation triples. While the infrastructure can (and likely will) share the same triple store, 
84 105
we should not assume other models have been loaded when processing any given graph. This means that any solutions will rely on only the named graph being processed during indexing.
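The graph-name substitution and the named-graph constraint can be pictured with a minimal Python sketch, assuming a plain textual replacement of ``$GRAPH_NAME``. The query template, the example graph IRI, and the ``bind_graph_name`` helper are illustrative assumptions; they are not the query or the code that Metacat Index uses.

```python
# Hypothetical query template: the FROM clause restricts matching to the
# single named graph that holds the annotation triples being indexed.
QUERY_TEMPLATE = """
PREFIX oboe-core: <http://ecoinformatics.org/oboe/oboe.1.0/oboe-core.owl#>
SELECT ?measurement ?characteristic
FROM <$GRAPH_NAME>
WHERE {
  ?measurement oboe-core:ofCharacteristic ?characteristic .
}
"""


def bind_graph_name(template, graph_name):
    """Replace the $GRAPH_NAME placeholder with the RDF document identifier."""
    if "$GRAPH_NAME" not in template:
        raise ValueError("template has no $GRAPH_NAME placeholder")
    # Plain replacement is used instead of string.Template because "$"
    # can also introduce SPARQL variables.
    return template.replace("$GRAPH_NAME", graph_name)


# Hypothetical graph IRI standing in for the RDF document identifier.
query = bind_graph_name(QUERY_TEMPLATE, "https://example.org/annotations/ann.4.1")
print(query)
```

Because the query names exactly one graph, its solutions do not depend on whatever other models happen to be loaded in the shared triple store.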
85 106

  
86
New Index Fields. Currently these are dynamic, multi-valued string fields which allow us to index the new semantic content without changing the SOLR schema. 
87
They are multi-valued because they will store the entire class subsumption hierarchy (up) for any matching concepts
88
and because they will store annotations from the same metadata resources for different attributes.
89
	* ``characteristic_sm``
90
	* ``standard_sm``
91

  
92

  
93

  
94 107
The SPARQL query used to determine the Characteristics measured in a dataset is shown below. Note that the query includes superclasses in the returned solutions so that 
95 108
the index returns a match for both general and specific criteria.
96 109

  
......
122 135
	 	}
123 136
	
124 137
::
138

  
139
Index Fields 
140
_________________
141

  
142
Currently, these dynamic, multi-valued string fields allow us to index the new semantic content without changing the SOLR schema. 
143
They are multi-valued because they will store the entire class subsumption hierarchy (up) for any matching concepts
144
and because they will store annotations from the same metadata resources for different attributes.
145
	* ``characteristic_sm`` - indexes the oboe:Characteristic[s] for oboe:Measurement[s] in the datapackage
146
	* ``standard_sm`` - indexes the oboe:Standard[s] for oboe:Measurement[s] in the datapackage
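These fields can be queried like any other SOLR string field. The following is a minimal Python sketch of building such a query string, assuming a placeholder concept URI. The URI and the helper names ``concept_filter`` and ``query_string`` are hypothetical, and no particular Metacat query endpoint is implied.

```python
from urllib.parse import urlencode


def concept_filter(field, concept_uri):
    """Build an exact-match filter such as characteristic_sm:"<uri>"."""
    # Quote the URI as a phrase so its ':' and '/' characters are not
    # parsed as query syntax; escape embedded backslashes and quotes.
    escaped = concept_uri.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def query_string(field, concept_uri, rows=10):
    """URL-encode the filter as the 'q' parameter of a select request."""
    params = {"q": concept_filter(field, concept_uri), "rows": rows, "wt": "json"}
    return urlencode(params)


# Placeholder URI; the real concept URI comes from the ontology in use.
MASS_URI = "http://example.org/characteristics#Mass"
print(concept_filter("characteristic_sm", MASS_URI))
print(query_string("characteristic_sm", MASS_URI))
```

Because the field stores the superclasses of each annotated concept, an exact match on a general concept also finds datasets annotated with its more specific subclasses.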
147

  
148

  
125 149
	
126 150
Example
127 151
_______
128 152

  
129
Continuing with example model, these concepts would be indexed for the data attributes.
153
Continuing with the example model, these concepts would be indexed for the data attributes described in the datapackage metadata.
130 154

  
131 155
+---------------------------+-------------------+---------------------+-------------------------------------------------------------------------------------+
132 156
| Object                    |  Field Name       | Field Type          |                                                Value                                |
......
153 177
+---------------------------+-------------------+---------------------+-------------------------------------------------------------------------------------+
154 178

  
155 179
Queries
156
_______-
180
_______
157 181
These indexed fields will be used primarily by MetacatUI to enhance discovery, both in terms of recall (concept hierarchies are exploited) and precision (concepts like Mass do not result in false positives for "Massachusetts"). 
158 182
As more aspects of the annotation model (e.g., observation Entity) are included in the index, the queries can incorporate them for greater query precision. Unfortunately, the flat nature of the SOLR index will prevent us from 
159 183
constructing queries that take full advantage of the underlying semantic annotation. We can filter results so that only those that measured Length Characteristics and Tree Entities, 
