.. raw:: latex

  \newpage

Metacat Indexing
===========================
Metacat v2.1 introduces support for building a SOLR index of Metacat content.
The "pathquery" search mechanism is still supported, but it will be phased out
in favor of the more efficient SOLR query interface.

Metacat deployments that opt to use the Metacat SOLR index will be able to take advantage
of:

* fast search performance
* built-in paging features
* customizable return formats (for advanced admins)

Indexed documents and fields
-----------------------------
Metacat integrates the existing DataONE index library, which supports many common metadata formats
out of the box:

1. EML
2. FGDC
3. Dryad*

Default indexed fields
-----------------------
For a complete listing of the indexed fields, please see the DataONE documentation.

http://mule1.dataone.org/ArchitectureDocs-current/design/SearchMetadata.html

Metacat also reports the currently indexed fields through the query engine description service:

http://mule1.dataone.org/ArchitectureDocs-current/apis/MN_APIs.html#MNQuery.getQueryEngineDescription

with "solr" as the engine.

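As a rough illustration, the address of the query engine description can be assembled as below. This is a minimal sketch that assumes the DataONE REST convention of ``GET /query/{queryEngine}`` and a hypothetical member node base URL of the form ``<host>/<metacat_context>/d1/mn/v1``; check both against your own deployment.

```python
from urllib.parse import quote


def query_engine_description_url(member_node_base_url, engine="solr"):
    """Build the address of the query engine description.

    Assumption: the member node exposes MNQuery.getQueryEngineDescription
    at <base>/query/<engine>, following the DataONE REST convention.
    """
    base = member_node_base_url.rstrip("/")
    # Percent-encode the engine name so it is safe as a single path segment.
    return f"{base}/query/{quote(engine, safe='')}"


if __name__ == "__main__":
    # Hypothetical base URL, used only for illustration.
    print(query_engine_description_url("https://example.org/metacat/d1/mn/v1"))
```

Opening the resulting address in a browser (or with any HTTP client) should return the description of the "solr" engine, including its field list, if the deployment follows these conventions.
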
Index configuration
----------------------------
Metacat-index is a separate web application (metacat-index.war) and should be deployed
as a sibling of the Metacat webapp (metacat.war). Deploying metacat-index.war is only required when SOLR support
is desired (e.g., for MetacatUI); it can safely be omitted from any Metacat deployment that will not use it.

During the initial installation/upgrade, an empty index will be initialized in the configured "solr-home" location.
Metacat-index will index all the existing Metacat content when the webapp next initializes.
Note: the configured solr-home directory should not exist before configuring Metacat with indexing for the first time,
otherwise the blank index will not be created for metacat-index to utilize.

Additional advanced configuration options are available in the metacat.properties file (shared between Metacat and Metacat-index).

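Because an existing solr-home directory prevents the blank index from being created, it can help to check the configured path before the first configuration run. The following is only a sketch: the property key ``solr.homeDir`` is an assumption about how the solr-home location is named in metacat.properties, and the parser handles plain ``key=value`` lines only.

```python
from pathlib import Path


def read_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def solr_home_is_clean(props, key="solr.homeDir"):
    """Return True when the configured solr-home path does not exist yet.

    Assumption: `key` is the property that holds the solr-home location.
    """
    return not Path(props[key]).exists()


if __name__ == "__main__":
    # Hypothetical properties content, used only for illustration.
    sample = "# indexing\nsolr.homeDir=/var/metacat/solr-home\n"
    print(solr_home_is_clean(read_properties(sample)))
```

If the check returns ``False``, the directory already exists and the blank index would not be created on first configuration.
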
Adding additional document types and fields
--------------------------------------------
TBD: Step-by-step guide for adding new documents and indexed fields.

Querying the index
--------------------
The SOLR index can be queried using standard SOLR syntax and return options.
The DataONE query interface exposes the SOLR query engine.

http://mule1.dataone.org/ArchitectureDocs-current/apis/MN_APIs.html#MNQuery.query

Please see the SOLR documentation for examples and exhaustive syntax information.

http://lucene.apache.org/solr/

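To make the paging and return-format options concrete, here is a small sketch that builds a query address from standard SOLR parameters (``q``, ``fl``, ``start``, ``rows``, ``wt``). The ``/query/solr/`` path and the member node base URL are assumptions based on the DataONE REST convention, and ``id`` and ``title`` are used only as example field names.

```python
from urllib.parse import urlencode


def solr_query_url(member_node_base_url, q="*:*", fields=("id", "title"),
                   start=0, rows=10, wt="json"):
    """Build a SOLR query address with paging and a return format.

    Assumption: the query engine is reachable at <base>/query/solr/.
    """
    params = urlencode({
        "q": q,                   # standard SOLR query string
        "fl": ",".join(fields),   # fields to return
        "start": start,           # offset of the first result (paging)
        "rows": rows,             # page size
        "wt": wt,                 # response writer, e.g. json or xml
    })
    return f"{member_node_base_url.rstrip('/')}/query/solr/?{params}"


if __name__ == "__main__":
    # Third page of ten results for a title search (hypothetical host).
    print(solr_query_url("https://example.org/metacat/d1/mn/v1",
                         q="title:lake", start=20))
```

Stepping ``start`` by ``rows`` walks through successive result pages.
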
Access Policy enforcement
-------------------------
Access control is enforced by the index such that only records that are readable by the
user performing the query are returned to the user. Any SOLR query submitted will be
augmented with access control criteria that reflect whether and how the user is currently
authenticated. Both certificate-based (DataONE API) and JSESSIONID-based (Metacat API)
authentication are simultaneously supported.

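The idea of augmenting a query can be pictured with the sketch below. It is not Metacat's actual implementation: it assumes access is expressed through a ``readPermission`` index field, that anonymous users are represented by the subject ``public``, and that the restriction is added as a SOLR filter query (``fq``).

```python
def _quote(subject):
    """Quote a subject for use inside a SOLR phrase, escaping \\ and "."""
    return '"' + subject.replace("\\", "\\\\").replace('"', '\\"') + '"'


def access_filter(subjects):
    """Build a filter that matches records readable by any given subject.

    Assumption: readable subjects are stored in a `readPermission` field
    and every user, authenticated or not, also acts as "public".
    """
    allowed = ["public", *subjects]
    return "readPermission:(" + " OR ".join(_quote(s) for s in allowed) + ")"


def augment_query(params, subjects):
    """Return a copy of the SOLR params with the access filter appended."""
    augmented = dict(params)
    augmented["fq"] = list(params.get("fq", [])) + [access_filter(subjects)]
    return augmented


if __name__ == "__main__":
    # Hypothetical authenticated subject, used only for illustration.
    print(augment_query({"q": "title:lake"}, ["CN=Jane Doe,DC=example,DC=org"]))
```

An unauthenticated request would pass an empty subject list and therefore only match publicly readable records.
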
Regenerating the index from scratch
-----------------------------------
When the SOLR index has been drastically modified, a complete regeneration of the
index may be necessary. To regenerate it:

1. Entirely remove the solr-home directory.
2. Step through the Metacat admin interface main properties screen, specifying the solr-home directory you wish to use.
3. Restart the webapp container (Tomcat).

Content can also be submitted for index regeneration by using the Metacat API:

1. Log in as the Metacat administrator.
2. Navigate to: <host>/<metacat_context>/metacat?action=reindex[&pid={pid}]

If the pid parameter is omitted, all objects in Metacat will be submitted for reindexing.

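The reindex address described above can be assembled programmatically, for example when reindexing a list of identifiers. This is a minimal sketch: the host and context values are placeholders, and the request itself must still be sent from an authenticated administrator session.

```python
from urllib.parse import urlencode


def reindex_url(host, metacat_context, pid=None):
    """Build <host>/<metacat_context>/metacat?action=reindex[&pid={pid}].

    When pid is None the parameter is left out, which submits all
    objects in Metacat for reindexing.
    """
    params = {"action": "reindex"}
    if pid is not None:
        params["pid"] = pid  # urlencode percent-encodes special characters
    base = f"{host.rstrip('/')}/{metacat_context.strip('/')}"
    return f"{base}/metacat?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, context, and identifier.
    print(reindex_url("https://example.org", "metacat", pid="doc.1.1"))
    print(reindex_url("https://example.org", "metacat"))
```

Encoding the identifier matters because pids may contain characters such as ``/`` or spaces that are not safe in a query string.
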
Class design overview
----------------------

.. figure:: images/indexing-class-diagram.png

   Figure 1. Class design overview.

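The flow in the diagram (a subprocessor checks whether it can handle a document, then each configured field extracts values into the index document) can be pictured with the simplified Python analogue below. The real classes are Java classes in the DataONE index library; the names here mirror the diagram loosely, ``match_root`` is a hypothetical stand-in for the match expression, and only the limited XPath subset of ``xml.etree.ElementTree`` is used.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class SolrElementField:
    name: str
    value: str


@dataclass
class SolrField:
    """Maps one index field name to a path expression."""
    name: str
    xpath: str
    multivalue: bool = False

    def get_fields(self, doc):
        values = [el.text.strip() for el in doc.findall(self.xpath)
                  if el.text and el.text.strip()]
        if not self.multivalue:
            values = values[:1]  # single-valued fields keep the first match
        return [SolrElementField(self.name, v) for v in values]


@dataclass
class DocumentSubprocessor:
    match_root: str  # hypothetical: root tag this subprocessor accepts
    fields: list

    def can_process(self, doc):
        return doc.tag == self.match_root

    def process_document(self, identifier, doc):
        out = [SolrElementField("id", identifier)]
        for solr_field in self.fields:
            out.extend(solr_field.get_fields(doc))
        return out


def index_document(identifier, xml_text, subprocessors):
    """Run the first subprocessor that accepts the document."""
    doc = ET.fromstring(xml_text)
    for subprocessor in subprocessors:
        if subprocessor.can_process(doc):
            return subprocessor.process_document(identifier, doc)
    return []
```

In the actual design the list of subprocessors and their fields is assembled through Spring bean configuration rather than constructed in code.
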
..
  @startuml images/indexing-class-diagram.png

    package "Current cn-index-processor (library)" {

        interface IDocumentSubprocessor {
            + boolean canProcess(Document doc)
            + initExpression(XPath xpath)
            + Map<String, SolrDoc> processDocument(String identifier, Map<String, SolrDoc> docs, Document doc)
        }
        class AbstractDocumentSubprocessor {
            - List<SolrField> fields
            + setMatchDocument(String matchDocument)
            + setFieldList(List<SolrField> fieldList)
        }
        class ResourceMapSubprocessor {
        }
        class ScienceMetadataDocumentSubprocessor {
        }

        interface ISolrField {
            + initExpression(XPath xpathObject)
            + List<SolrElementField> getFields(Document doc, String identifier)
        }
        class SolrField {
            - String name
            - String xpath
            - boolean multivalue
        }
        class CommonRootSolrField {
        }
        class RootElement {
        }
        class LeafElement {
        }
        class FullTextSolrField {
        }
        class MergeSolrField {
        }
        class ResolveSolrField {
        }
        class SolrFieldResourceMap {
        }

        class SolrDoc {
            - List<SolrElementField> fieldList
        }

        class SolrElementField {
            - String name
            - String value
        }

    }

    IDocumentSubprocessor <|-- AbstractDocumentSubprocessor
    AbstractDocumentSubprocessor <|-- ResourceMapSubprocessor
    AbstractDocumentSubprocessor <|-- ScienceMetadataDocumentSubprocessor

    ISolrField <|-- SolrField
    SolrField <|-- CommonRootSolrField
    CommonRootSolrField o--"1" RootElement
    RootElement o--"*" LeafElement
    SolrField <|-- FullTextSolrField
    SolrField <|-- MergeSolrField
    SolrField <|-- ResolveSolrField
    SolrField <|-- SolrFieldResourceMap

    AbstractDocumentSubprocessor o--"*" ISolrField

    IDocumentSubprocessor --> SolrDoc

    SolrDoc o--"*" SolrElementField

    package "SOLR (library)" {

        abstract class SolrServer {
            + add(SolrInputDocument doc)
            + deleteByQuery(String id)
            + query(SolrQuery query)
        }
        class EmbeddedSolrServer {
        }
        class HttpSolrServer {
        }

    }

    SolrServer <|-- EmbeddedSolrServer
    SolrServer <|-- HttpSolrServer

    package "Metacat-index (webapp)" {

        class ApplicationController {
            - List<SolrIndex> solrIndex
            + regenerateIndex()
        }

        class SolrIndex {
            - List<IDocumentSubprocessor> subprocessors
            - SolrServer solrServer
            + insert(String pid, InputStream data)
            + update(String pid, InputStream data)
            + remove(String pid)
        }

        class SystemMetadataEventListener {
            - SolrIndex solrIndex
            + itemAdded(ItemEvent<SystemMetadata>)
            + itemRemoved(ItemEvent<SystemMetadata>)
        }

    }

    package "Metacat (webapp)" {

        class MetacatSolrIndex {
            - SolrServer solrServer
            + InputStream query(SolrQuery)
        }

        class HazelcastService {
            - IMap hzIndexQueue
            - IMap hzSystemMetadata
            - IMap hzObjectPath
        }

    }

    MetacatSolrIndex o--"1" SolrServer
    HazelcastService .. SystemMetadataEventListener

    ApplicationController o--"*" SolrIndex
    SolrIndex o--"1" SolrServer
    SolrIndex "1"--o SystemMetadataEventListener
    SolrIndex o--"*" IDocumentSubprocessor: Assembled using Spring bean configuration

  @enduml