Revision 8798
Added by ben leinfelder over 10 years ago
docs/user/metacat/source/semantic-annotation.rst | ||
---|---|---|
67 | 67 |
* `oa:hasSelector <http://www.openannotation.org/spec/core/specific.html#FragmentSelector>`_ : Specifies the part of a resource to which an annotation applies. An XPath FragmentSelector will commonly be used for annotating XML-based metadata resources |
68 | 68 |
* `oa:annotatedBy <http://www.openannotation.org/spec/core/core.html#Provenance>`_ : [subProperty of prov:wasAttributedTo] The object of the relationship is a resource that identifies the agent responsible for creating the Annotation. |
69 | 69 |
|
70 |
* `oboe:Measurement <http://www.w3.org/ns/prov#wasInformedBy>`_ : The primary body of semantic annotations on attributes.
|
|
71 |
* `oboe:ofCharacteristic <http://www.w3.org/ns/prov#used>`_ : Specifies which Characteristic (subclass) is measured
|
|
72 |
* `oboe:usesStandard <http://www.w3.org/ns/prov#wasInformedBy>`_ : Specifies in which Standard (Unit subclass) the measurement is recorded.
|
|
70 |
* `oboe:Measurement <http://ecoinformatics.org/oboe/oboe.1.0/oboe-core.owl#Measurement>`_ : The primary body of semantic annotations on attributes.
|
|
71 |
* `oboe:ofCharacteristic <http://ecoinformatics.org/oboe/oboe.1.0/oboe-core.owl#ofCharacteristic>`_ : Specifies which Characteristic (subclass) is measured
|
|
72 |
* `oboe:usesStandard <http://ecoinformatics.org/oboe/oboe.1.0/oboe-core.owl#usesStandard>`_ : Specifies in which Standard (Unit subclass) the measurement is recorded.
|
|
73 | 73 |
|
74 | 74 |
|
75 | 75 |
:: |
76 | 76 |
|
77 |
|
|
78 |
Model details |
|
79 |
-------------- |
|
80 |
Using the ``weight`` column in our example data package, we can illustrate the annotation model's use of concepts from OBOE, OA, and PROV. |
|
81 |
The primary entry point for the annotation is ``#ann.4.1``, which was asserted by Ben Leinfelder (foaf:name), identified by his ORCID URI (oa:annotatedBy). |
|
82 |
The body of the annotation (oa:hasBody) consists of an oboe:Measurement instance, ``#weight``, that measures ``Mass`` (oboe:ofCharacteristic) in ``Gram`` (oboe:usesStandard). |
|
83 |
The target of the annotation (oa:hasTarget) points to the EML metadata resource (oa:hasSource) that documents the data table and selects a particular part of the metadata that describes |
|
84 |
the specific ``weight`` data attribute (oa:hasSelector). Because the EML metadata is serialized as XML, we can use an XPath oa:FragmentSelector to identify the data column being annotated. |
|
85 |
Note that our XPath expression identifies ``weight`` as the second column in the first data table in the data package: ``#xpointer(/eml/dataSet/dataTable[1]/attributeList/attribute[2])``. |
|
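A minimal sketch of how such a selector resolves, using only Python's standard library. The EML fragment is a simplified, hypothetical stand-in (no namespaces; the ``site`` and ``date`` columns are invented, only ``weight`` comes from the example), and ``select_attribute`` is an illustrative helper, not part of Metacat. ElementTree paths are relative to the root element, so the leading ``/eml`` step is dropped:

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified EML document; real EML uses namespaces
# and many more elements. Only the 'weight' column comes from the example.
EML = """
<eml>
  <dataSet>
    <dataTable>
      <attributeList>
        <attribute><attributeName>site</attributeName></attribute>
        <attribute><attributeName>weight</attributeName></attribute>
        <attribute><attributeName>date</attributeName></attribute>
      </attributeList>
    </dataTable>
  </dataSet>
</eml>
"""


def select_attribute(xml_text, table_index, attribute_index):
    """Return the attribute element at the given 1-based positions, or None."""
    root = ET.fromstring(xml_text)
    # Equivalent to /eml/dataSet/dataTable[t]/attributeList/attribute[a],
    # written relative to the <eml> root element.
    path = (
        f"dataSet/dataTable[{table_index}]"
        f"/attributeList/attribute[{attribute_index}]"
    )
    return root.find(path)


selected = select_attribute(EML, 1, 2)
print(selected.findtext("attributeName"))
```

Positions in the selector are 1-based, so ``attribute[2]`` is the second column of the first table.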
86 |
|
|
87 |
In order to bind the column annotation of the metadata to the physical data object (the three-column CSV file), we need to traverse the packaging model where an additional annotation expresses the relationship |
|
88 |
between the data and metadata objects. The annotation, ``#ann.1.1``, asserts that the metadata file (#eml.1.1) describes (cito:documents) the data file (#data.1.1). More specifically, the annotation target specifies |
|
89 |
where in the EML the #data.1.1 object is described by using an oa:FragmentSelector with an XPath pointer to the first data file documented in the EML: ``#xpointer(/eml/dataSet/dataTable[1])``. |
|
90 |
|
|
91 |
Note that the annotation model differs slightly from the original ORE resource map model recommended by DataONE. While it is more complicated to include pointers to data documentation within the metadata, |
|
92 |
we have found that the current ORE maps are not sufficiently descriptive on their own and any consumers must also consult the metadata to figure out which object is the csv, which is the pdf, which is the script, etc. |
|
93 |
By incorporating the metadata pointer within the annotation model, we hope to be able to handle data packages that use many different metadata serializations without having to write custom handlers for each formatId. |
|
94 |
|
|
77 | 95 |
Indexing |
78 | 96 |
-------- |
79 |
The Metacat Index component has been enhanced to parse semantic models provided as RDF. The general purpose RdfXmlSubprocessor can be used with SparqlFields to extract key concepts from any given model that is added to the document store. |
|
80 |
The processor assumes that the identifier of the RDF document is the name of the graph being inserted into the triple store. |
|
97 |
The Metacat Index component has been enhanced to parse semantic models provided as RDF. |
|
98 |
The general purpose RdfXmlSubprocessor can be used with SparqlFields to extract key concepts from any given model that is added to the Metacat MN document store. |
|
99 |
|
|
100 |
The processor assumes that the identifier of the RDF document is the name of the graph being inserted into the triple store and provides that graph name to the query engine for substitution in any query syntax ($GRAPH_NAME). |
|
81 | 101 |
The SPARQL requirements are that the solution[s] return the identifier (pid) of the object being annotated, and the index field being populated with the given value[s]. |
82 | 102 |
If multiple fields are to be extracted from the model for indexing, a distinct SPARQL query should be used for each field. |
103 |
|
|
83 | 104 |
The query can (and is largely expected to) be constrained to the named graph that contains only that set of annotation triples. While the infrastructure can (and likely will) share the same triple store, |
84 | 105 |
we should not assume other models have been loaded when processing any given graph. This means that any solutions will rely on only the named graph being processed during indexing. |
85 | 106 |
|
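A minimal sketch of the graph-name substitution described above, under stated assumptions: the query text, the ``dcterms:identifier`` pattern, the ``bind_graph_name`` helper, and the example graph IRI are all hypothetical and are not the queries or code shipped with Metacat. It shows one query per index field, each constrained to a single named graph and returning the pid plus the field value:

```python
# Hypothetical query template for one index field. The triple patterns are
# illustrative only; $GRAPH_NAME is the placeholder named in the text.
CHARACTERISTIC_QUERY = """
PREFIX oboe: <http://ecoinformatics.org/oboe/oboe.1.0/oboe-core.owl#>
PREFIX oa: <http://www.w3.org/ns/oa#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?pid ?characteristic_sm
FROM <$GRAPH_NAME>
WHERE {
  ?annotation oa:hasBody ?measurement ;
              oa:hasTarget ?target .
  ?target oa:hasSource ?metadata .
  ?metadata dcterms:identifier ?pid .
  ?measurement oboe:ofCharacteristic ?characteristic_sm .
}
"""

# One distinct query per field to be extracted (field name -> template).
FIELD_QUERIES = {"characteristic_sm": CHARACTERISTIC_QUERY}


def bind_graph_name(template, graph_name):
    """Substitute the named graph into a query template."""
    return template.replace("$GRAPH_NAME", graph_name)


# Assumed graph name: the identifier of the RDF document being indexed.
graph_name = "https://example.org/graphs/annotation.1.1"
queries = {
    field: bind_graph_name(template, graph_name)
    for field, template in FIELD_QUERIES.items()
}
print(queries["characteristic_sm"])
```

Because the ``FROM`` clause names only the graph being processed, the solutions cannot depend on triples loaded from other models.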
86 |
New Index Fields. Currently these are dynamic, multi-valued string fields which allow us to index the new semantic content without changing the SOLR schema. |
|
87 |
They are multi-valued because they will store the entire class subsumption hierarchy (up) for any matching concepts |
|
88 |
and because they will store annotations from the same metadata resources for different attributes. |
|
89 |
* ``characteristic_sm`` |
|
90 |
* ``standard_sm`` |
|
91 |
|
|
92 |
|
|
93 |
|
|
94 | 107 |
The SPARQL query used to determine the Characteristics measured in a dataset is shown below. Note that the query includes superclasses in the returned solutions so that |
95 | 108 |
the index returns a match for both general and specific criteria. |
96 | 109 |
|
... | ... | |
122 | 135 |
} |
123 | 136 |
|
124 | 137 |
:: |
138 |
|
|
139 |
Index Fields |
|
140 |
_________________ |
|
141 |
|
|
142 |
Currently, these dynamic, multi-valued string fields allow us to index the new semantic content without changing the SOLR schema. |
|
143 |
They are multi-valued because they will store the entire class subsumption hierarchy (upward, that is, all superclasses) for any matching concepts |
|
144 |
and because they will store annotations from the same metadata resources for different attributes. |
|
145 |
* ``characteristic_sm`` - indexes the oboe:Characteristic[s] for oboe:Measurement[s] in the datapackage |
|
146 |
* ``standard_sm`` - indexes the oboe:Standard[s] for oboe:Measurement[s] in the datapackage |
|
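To illustrate why these fields are multi-valued, here is a minimal sketch of expanding a concept into itself plus all of its superclasses. The class names other than ``Mass`` and the ``with_superclasses`` helper are hypothetical; in practice the hierarchy would come from the ontology via the SPARQL query, not from a hand-written dictionary:

```python
# Hypothetical subclass -> direct superclasses map, for illustration only.
SUPERCLASSES = {
    "Mass": ["PhysicalCharacteristic"],
    "PhysicalCharacteristic": ["Characteristic"],
    "Characteristic": [],
}


def with_superclasses(concept, superclasses):
    """Return the concept followed by all of its ancestors, without duplicates."""
    ordered, seen, stack = [], set(), [concept]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against diamonds or cycles in the hierarchy
        seen.add(current)
        ordered.append(current)
        stack.extend(superclasses.get(current, []))
    return ordered


# Values that would populate the multi-valued field for a Mass measurement.
characteristic_sm = with_superclasses("Mass", SUPERCLASSES)
print(characteristic_sm)
```

Storing every ancestor means a search for the general concept also matches datasets annotated with the more specific one.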
147 |
|
|
148 |
|
|
125 | 149 |
|
126 | 150 |
Example |
127 | 151 |
_______ |
128 | 152 |
|
129 |
Continuing with example model, these concepts would be indexed for the data attributes. |
|
153 |
Continuing with the example model, these concepts would be indexed for the data attributes described in the datapackage metadata.
|
|
130 | 154 |
|
131 | 155 |
+---------------------------+-------------------+---------------------+-------------------------------------------------------------------------------------+ |
132 | 156 |
| Object | Field Name | Field Type | Value | |
... | ... | |
153 | 177 |
+---------------------------+-------------------+---------------------+-------------------------------------------------------------------------------------+ |
154 | 178 |
|
155 | 179 |
Queries |
156 |
_______-
|
|
180 |
_______ |
|
157 | 181 |
These indexed fields will be used primarily by MetacatUI to enhance discovery, both in terms of recall (concept hierarchies are exploited) and precision (concepts like Mass do not result in false positives for "Massachusetts"). |
158 | 182 |
As more aspects of the annotation model (e.g., observation Entity) are included in the index, the queries can incorporate them for greater query precision. Unfortunately, the flat nature of the SOLR index will prevent us from |
159 | 183 |
constructing queries that take full advantage of the underlying semantic annotation. We can filter results so that only those that measured Length Characteristics and Tree Entities, |
Added more description for the model. cleaned up a few formatting issues.