.. raw:: latex
    
  \newpage
    
Metacat Indexing
===========================
Metacat v2.1 introduces support for building a SOLR index of Metacat content.
The "pathquery" search mechanism is still supported, but it will be phased out
in favor of the more efficient SOLR query interface.
    
Metacat deployments that opt to use the Metacat SOLR index can take advantage
of:
    
* fast search performance
* built-in paging features
* customizable return formats (for advanced admins)
    
Indexed documents and fields
-----------------------------
Metacat integrates the existing DataONE index library, which supports many common metadata formats
out of the box:
    
1. EML
2. FGDC
3. Dryad*
    
Default indexed fields
-----------------------
For a complete listing of the indexed fields, please see the DataONE documentation:
    
http://mule1.dataone.org/ArchitectureDocs-current/design/SearchMetadata.html
    
Metacat also reports the currently indexed fields. Call the service described at:
    
http://mule1.dataone.org/ArchitectureDocs-current/apis/MN_APIs.html#MNQuery.getQueryEngineDescription
    
with "solr" as the engine.
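
As a rough illustration, the request for the field listing can be assembled as follows. This sketch assumes the DataONE REST mapping of ``getQueryEngineDescription`` to ``GET /query/{engine}`` under the Member Node base URL. The host name and base path are hypothetical placeholders, and ``engine_description_url`` is an illustrative helper, not part of Metacat.

```python
from urllib.parse import quote


def engine_description_url(base_url: str, engine: str = "solr") -> str:
    """Return the URL that describes a query engine on a Member Node.

    Assumes the REST layout GET {base_url}/query/{engine}.
    """
    return f"{base_url.rstrip('/')}/query/{quote(engine, safe='')}"


# Hypothetical Member Node base URL; substitute your own deployment.
print(engine_description_url("https://metacat.example.org/metacat/d1/mn/v1"))
```

Fetching the resulting URL with any HTTP client should return the engine description, including the indexed field names.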
    
Index configuration
----------------------------
Metacat-index is a separate web application (metacat-index.war) that should be deployed
as a sibling of the Metacat webapp (metacat.war). Deploying metacat-index.war is only required when SOLR support
is desired; it can safely be omitted from Metacat deployments that will not use it.
    
During the initial installation or upgrade, an empty index is initialized in the configured "solr-home" location.
Metacat-index then indexes all existing Metacat content the next time the webapp initializes.
Note: the configured solr-home directory should not exist before Metacat is configured with indexing for the first time;
otherwise the blank index will not be created for metacat-index to utilize.
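
The note above can be turned into a small pre-flight check before running the configuration for the first time. This is only a sketch: the function name is hypothetical and the path is whatever you intend to enter as solr-home.

```python
from pathlib import Path


def solr_home_is_unused(solr_home: str) -> bool:
    """Return True when the intended solr-home path does not exist yet.

    Per the note above, the blank index is only created when the
    directory is absent at first-time configuration.
    """
    return not Path(solr_home).exists()


# Hypothetical location; use the solr-home you plan to configure.
if not solr_home_is_unused("/var/metacat/solr-home"):
    print("solr-home already exists; the blank index will not be created")
```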
    
Additional advanced configuration options are available in the metacat.properties file, which is shared by Metacat and Metacat-index.
    
Adding additional document types and fields
--------------------------------------------
TBD: Step-by-step guide for adding new documents and indexed fields.
    
Querying the index
--------------------
The SOLR index can be queried using standard SOLR syntax and return options.
The DataONE query interface exposes the SOLR query engine:
    
http://mule1.dataone.org/ArchitectureDocs-current/apis/MN_APIs.html#MNQuery.query
    
Please see the SOLR documentation for examples and exhaustive syntax information:
    
http://lucene.apache.org/solr/
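
To make the paging and return-format features concrete, here is a minimal sketch that builds a query URL from standard SOLR parameters (``q``, ``fl``, ``start``, ``rows``, ``wt``). It assumes the DataONE query interface is reachable at ``{base}/query/solr/`` and that ``id`` and ``title`` are among the indexed fields. The base URL is a hypothetical placeholder and ``solr_query_url`` is an illustrative helper.

```python
from urllib.parse import urlencode


def solr_query_url(base_url, q="*:*", fields=("id", "title"),
                   start=0, rows=10, wt="json"):
    """Build a query URL for the SOLR engine behind the DataONE query interface."""
    params = [
        ("q", q),                  # standard SOLR query string
        ("fl", ",".join(fields)),  # fields to return
        ("start", start),          # paging offset
        ("rows", rows),            # page size
        ("wt", wt),                # response writer, i.e. the return format
    ]
    return f"{base_url.rstrip('/')}/query/solr/?{urlencode(params)}"


# Third page of ten results (hypothetical base URL).
print(solr_query_url("https://metacat.example.org/metacat/d1/mn/v1", start=20))
```

Increasing ``start`` by ``rows`` on each request walks through the result set one page at a time.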
    
Access Policy enforcement
-------------------------
Access control is enforced by the index, so a query returns only the records that the
user performing it is allowed to read. Every SOLR query submitted is
augmented with access control criteria that reflect whether and how the user is currently
authenticated. Both certificate-based (DataONE API) and JSESSIONID-based (Metacat API)
authentication are simultaneously supported.
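
The following sketch shows one way such an augmentation could work in principle. It is not Metacat's actual implementation: it assumes the index stores readers in a ``readPermission`` field, that anonymous users act as the subject ``public``, and that the criteria are appended as an extra SOLR ``fq`` filter. The function names are hypothetical.

```python
def access_filter(subjects):
    """Return a SOLR filter matching records readable by any given subject."""
    def quoted(subject):
        # Escape backslashes and double quotes so the subject stays one phrase.
        return '"' + subject.replace("\\", "\\\\").replace('"', '\\"') + '"'

    # Every user, authenticated or not, is assumed to also act as "public".
    terms = [quoted(s) for s in ["public", *subjects]]
    return "readPermission:(" + " OR ".join(terms) + ")"


def augment(params, subjects):
    """Append the access filter as an fq parameter, leaving q untouched."""
    return [*params, ("fq", access_filter(subjects))]


# Anonymous user versus an authenticated (hypothetical) subject.
print(augment([("q", "title:soil")], []))
print(augment([("q", "title:soil")], ["CN=Jane Doe,O=Example"]))
```

Because the filter is added on the server side, a client cannot widen its own access by editing the query it submits.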
    
Regenerating the index from scratch
-----------------------------------
When the SOLR index has been drastically modified, a complete regeneration of the
index may be necessary. To regenerate it:
    
1. Entirely remove the solr-home directory.
2. Step through the Metacat admin interface main properties screen, specifying the solr-home directory you wish to use.
3. Restart the webapp container (Tomcat).
    
Content can also be submitted for index regeneration by using the Metacat API:
    
1. Log in as the Metacat administrator.
2. Navigate to: <host>/<metacat_context>/metacat?action=reindex[&pid={pid}]
3. If the pid parameter is omitted, all objects in Metacat will be submitted for reindexing.
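
The URL pattern in step 2 can be built programmatically, as in this minimal sketch. The host, context, and pid values are hypothetical, and ``reindex_url`` is an illustrative helper. The request itself still has to be sent from an authenticated administrator session, which this code does not handle.

```python
from urllib.parse import urlencode


def reindex_url(host, context, pid=None):
    """Build the reindex URL; omit pid to submit every object for reindexing."""
    params = {"action": "reindex"}
    if pid is not None:
        params["pid"] = pid  # urlencode percent-escapes reserved characters
    return f"{host.rstrip('/')}/{context.strip('/')}/metacat?{urlencode(params)}"


# Reindex everything, then a single (hypothetical) object.
print(reindex_url("https://metacat.example.org", "metacat"))
print(reindex_url("https://metacat.example.org", "metacat", pid="example.1.1"))
```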
    
Class design overview
----------------------
    
.. figure:: images/indexing-class-diagram.png
    
   Figure 1. Class design overview.
   
..
  @startuml images/indexing-class-diagram.png
	package "Current cn-index-processor (library)" {

		interface IDocumentSubprocessor {
			+ boolean canProcess(Document doc)
			+ initExpression(XPath xpath)
			+ Map<String, SolrDoc> processDocument(String identifier, Map<String, SolrDoc> docs, Document doc)
		}
		class AbstractDocumentSubprocessor {
			- List<SolrField> fields
			+ setMatchDocument(String matchDocument)
			+ setFieldList(List<SolrField> fieldList)
		}
		class ResourceMapSubprocessor {
		}
		class ScienceMetadataDocumentSubprocessor {
		}

		interface ISolrField {
			+ initExpression(XPath xpathObject)
			+ List<SolrElementField> getFields(Document doc, String identifier)
		}
		class SolrField {
			- String name
			- String xpath
			- boolean multivalue
		}
		class CommonRootSolrField {
		}
		class RootElement {
		}
		class LeafElement {
		}
		class FullTextSolrField {
		}
		class MergeSolrField {
		}
		class ResolveSolrField {
		}
		class SolrFieldResourceMap {
		}

		class SolrDoc {
			- List<SolrElementField> fieldList
		}

		class SolrElementField {
			- String name
			- String value
		}

	}
	
	IDocumentSubprocessor <|-- AbstractDocumentSubprocessor
	AbstractDocumentSubprocessor <|-- ResourceMapSubprocessor
	AbstractDocumentSubprocessor <|-- ScienceMetadataDocumentSubprocessor
    
	ISolrField <|-- SolrField
	SolrField <|-- CommonRootSolrField
	CommonRootSolrField o--"1" RootElement
	RootElement o--"*" LeafElement
	SolrField <|-- FullTextSolrField
	SolrField <|-- MergeSolrField
	SolrField <|-- ResolveSolrField
	SolrField <|-- SolrFieldResourceMap

	AbstractDocumentSubprocessor o--"*" ISolrField

	IDocumentSubprocessor --> SolrDoc

	SolrDoc o--"*" SolrElementField
	
	package "SOLR (library)" {

		abstract class SolrServer {
			+ add(SolrInputDocument doc)
			+ deleteByQuery(String id)
			+ query(SolrQuery query)
		}
		class EmbeddedSolrServer {
		}
		class HttpSolrServer {
		}

	}

	SolrServer <|-- EmbeddedSolrServer
	SolrServer <|-- HttpSolrServer
	
	package "Metacat-index (webapp)" {

		class ApplicationController {
			- List<SolrIndex> solrIndex
			+ regenerateIndex()
		}

		class SolrIndex {
			- List<IDocumentSubprocessor> subprocessors
			- SolrServer solrServer
			+ insert(String pid, InputStream data)
			+ update(String pid, InputStream data)
			+ remove(String pid)
		}

		class SystemMetadataEventListener {
			- SolrIndex solrIndex
			+ itemAdded(ItemEvent<SystemMetadata>)
			+ itemRemoved(ItemEvent<SystemMetadata>)
		}

	}
	
	package "Metacat (webapp)" {

		class MetacatSolrIndex {
			- SolrServer solrServer
			+ InputStream query(SolrQuery)
		}

		class HazelcastService {
			- IMap hzIndexQueue
			- IMap hzSystemMetadata
			- IMap hzObjectPath
		}

	}
	
	MetacatSolrIndex o--"1" SolrServer
	HazelcastService .. SystemMetadataEventListener

	ApplicationController o--"*" SolrIndex
	SolrIndex o--"1" SolrServer
	SolrIndex "1"--o SystemMetadataEventListener
	SolrIndex o--"*" IDocumentSubprocessor: Assembled using Spring bean configuration
  @enduml
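
To illustrate how the pieces in the diagram relate, the sketch below mimics the ``SolrIndex`` and ``IDocumentSubprocessor`` collaboration in Python. It is a simplified analogy, not the Java implementation: a dictionary stands in for ``SolrServer``, ``TitleSubprocessor`` is a made-up subprocessor, and the Spring bean wiring is replaced by a plain constructor argument.

```python
import xml.etree.ElementTree as ET


class TitleSubprocessor:
    """Toy subprocessor: extracts a title from documents that contain one."""

    def can_process(self, doc):
        return doc.find(".//title") is not None

    def process_document(self, identifier, fields, doc):
        return {**fields, "title": doc.findtext(".//title")}


class SolrIndex:
    """Runs every applicable subprocessor, then stores the resulting fields."""

    def __init__(self, subprocessors):
        self.subprocessors = list(subprocessors)
        self.server = {}  # stand-in for a SolrServer instance

    def insert(self, pid, data):
        doc = ET.fromstring(data)
        fields = {"id": pid}
        for subprocessor in self.subprocessors:
            if subprocessor.can_process(doc):
                fields = subprocessor.process_document(pid, fields, doc)
        self.server[pid] = fields

    def remove(self, pid):
        self.server.pop(pid, None)


index = SolrIndex([TitleSubprocessor()])
index.insert("example.1.1", "<eml><title>Soil cores</title></eml>")
print(index.server["example.1.1"])
```

Supporting a new document type in this model means adding another subprocessor to the list, which mirrors the role of the bean configuration noted in the diagram.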