Revision 6853
Added by Matt Jones almost 13 years ago
docs/dev/metacat/source/harvester.rst

Harvester and Harvest List Editor
=================================

Metacat's Harvester is an optional feature that can be used to automatically
retrieve EML documents from one or more custom data management systems (e.g.,
SRB or PostgreSQL) and to insert (or update) those documents into the home
repository. The local sites control when they are harvested and which documents
are harvested.

For example, the Long Term Ecological Research Network (LTER) uses the Metacat
Harvester to create a centralized repository of data stored at twenty-six
different sites that all store EML metadata but use different data management
systems. Once the data have been harvested and placed into a centralized
repository, they are replicated to the KNB network, exposing the information
to an even larger scientific community.

Once the Harvester is properly configured, listed documents are retrieved and
uploaded on a regularly scheduled basis. You must configure both the home
Metacat and the remote sites (also known as the "harvest sites") before using
this feature. Local sites must also provide the Metacat server with a list of
documents that should be harvested.

Configuring Harvester
---------------------
Before you can use the Harvester to retrieve documents, you must configure the
feature using the settings in the metacat.properties file. Note that you must
also configure each site that the Harvester will connect to and retrieve
documents from (see section 7.2 for details).

The Harvester configuration information is managed in the metacat.properties
file, which is located at::

  <CONTEXT_DIR>/WEB-INF/metacat.properties

The Harvester properties are grouped together and begin after the comment line::

  # Harvester properties

To configure Harvester, edit the metacat.properties file and set appropriate
values for the harvesterAdministrator and smtpServer properties. You may also
wish to customize the other Harvester parameters, each discussed in the table
below.

Harvester Properties and their Functions
----------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Property
     - Description and Values
   * - connectToMetacat
     - Determines whether Harvester should connect to Metacat to upload
       retrieved documents. Set to true (the default) under most
       circumstances. To test whether Harvester can retrieve documents from a
       site without actually connecting to Metacat to upload the documents,
       set the value to false.

       Values: true/false
   * - delay
     - The number of hours that Harvester will wait before beginning its first
       harvest. For example, if Harvester is run at 1:00 p.m., and the delay
       is set to 12, Harvester will begin its first harvest at 1:00 a.m.

       Default: 0
   * - harvesterAdministrator
     - The email address of the Harvester Administrator. Harvester will send
       email reports to this address after every harvest. Enter multiple email
       addresses by separating each address with a comma or semicolon
       (e.g., name1@abc.edu,name2@abc.edu).

       Values: An email address, or multiple email addresses separated by
       commas or semicolons
   * - logPeriod
     - The number of days to retain Harvester log entries. Harvester log
       entries record information such as which documents were harvested,
       from which sites, and whether any errors were encountered during the
       harvest. Log entries older than logPeriod number of days are purged
       from the database at the end of each harvest.

       Default: 90
   * - maxHarvests
     - The maximum number of harvests that Harvester should execute before
       shutting down. If the value of maxHarvests is set to 0 or a negative
       number, Harvester will execute indefinitely.

       Default: 0
   * - period
     - The number of hours between harvests. Harvester will run a new harvest
       every specified period of hours (either indefinitely or until the
       maximum number of harvests have run, depending on the value of
       maxHarvests).

       Default: 24
   * - smtpServer
     - The SMTP server that Harvester uses for sending email messages to the
       Harvester Administrator and Site Contacts (e.g.,
       somehost.institution.edu). Note that the default value only works if
       the Harvester host machine is configured as an SMTP server.

       Default: localhost
   * - Harvester Operation Properties (GetDocError, GetDocSuccess, etc.)
     - The Harvester Operation properties are used by Harvester to report
       information about performed operations for inclusion in log entries
       and email messages. Under most circumstances the values of these
       properties should not be modified.

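The delay, period, and maxHarvests properties together determine when harvests
run. The following is a minimal Python sketch of that arithmetic, assuming (as
the table describes) that the first harvest starts ``delay`` hours after
Harvester is launched and that later harvests follow every ``period`` hours.
The function name ``harvest_times`` and its ``preview`` argument are
hypothetical and exist only for this illustration; they are not part of
Metacat.

```python
from datetime import datetime, timedelta
from itertools import count, islice


def harvest_times(start, delay=0, period=24, max_harvests=0, preview=5):
    """Return the expected start times of the first harvests.

    delay and period are in hours. A max_harvests of 0 or less means
    "run indefinitely", so only the first `preview` times are listed.
    """
    limit = max_harvests if max_harvests > 0 else preview
    times = (start + timedelta(hours=delay + n * period) for n in count())
    return list(islice(times, limit))


# Harvester launched at 1:00 p.m. with a 12-hour delay and a 24-hour period.
started = datetime(2012, 1, 10, 13, 0)
schedule = harvest_times(started, delay=12, period=24, max_harvests=3)
for when in schedule:
    print(when.isoformat(sep=" "))
```

With these values the first harvest falls at 1:00 a.m. the following day,
matching the example given for the delay property.
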
Configuring a Harvest Site (Instructions for Site Contact)
----------------------------------------------------------

After Metacat's Harvester has been configured, remote sites can register and
send information about which files should be retrieved. Each remote site must
have a site contact who is responsible for registering the site and creating a
list of EML files to harvest (the "Harvest List"), as well as for reviewing
harvest reports. The site contact can unregister the site from the Harvester
at any time.

To use Harvester:

1. Register with Harvester
2. Compose a Harvest List (you will likely wish to use the Harvest List Editor)
3. Prepare your EML Documents for Harvest
4. Review the Harvester Reports

Register with Harvester
~~~~~~~~~~~~~~~~~~~~~~~

To register a remote site with Harvester, the Site Contact should log in to
Metacat's Harvester Registration page and enter information about the site and
how it should be harvested.

1. Using a Web browser, log in to Metacat's Harvester Registration page.
   The Harvester Registration page is inside the skins directory. For example,
   if the Metacat server that you wish to register with resides at the following URL:

   ::

     http://somehost.somelocation.edu:8080/knb/index.jsp

   then the Harvester Registration page would be accessed at:

   ::

     http://somehost.somelocation.edu:8080/knb/style/skins/knb/harvesterRegistrationLogin.jsp

   .. figure:: images/screenshots/image065.jpg
      :align: center

      Metacat's Harvester Registration page.

2. Enter your Metacat account information and click Submit to log in to your
   Metacat from the Harvester Registration page.

   Note: In some cases, you may need to log in to an anonymous "site" account
   rather than your personal account so that the registered data will not appear
   to have been registered by a single user. For example, an information
   manager (jones) who is registering data created by a team of scientists
   (jones, smith, and barney) from the Georgia Coastal Ecosystems site might
   log in to a dedicated account (named with the site's acronym, "GCE") to
   indicate that the registered data is from the entire site rather than "jones".

3. Enter information about your site and how often you want to schedule harvests
   and then click the Register button (Figure 7.2). The Harvest List URL should
   point to the location of the Harvest List, which is an XML file that lists
   the documents to harvest. If you do not yet have a Harvest List, please see
   the next section for more information about creating one.

   .. figure:: images/screenshots/image067.jpg
      :align: center

      Enter information about your site and how often you want to schedule harvests.

   The example settings in the previous figure instruct Harvester to harvest
   documents from the site once every two weeks. The Harvester will access the
   site's Harvest List at URL "http://somehost.institution.edu/~myname/harvestList.xml",
   and will send email reports to the Site Contact at email address
   "myname@institution.edu". Note that you can enter multiple email addresses by
   separating each address with a comma or a semi-colon. For example,
   "myname@institution.edu,anothername@institution.edu"

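Because the contact email field accepts either commas or semicolons as
separators, it can help to check the value before entering it. The following
Python sketch shows one way to split such a string; how Harvester itself parses
the field is not described here, so the helper ``split_addresses`` is purely
illustrative.

```python
import re


def split_addresses(value):
    """Split a comma- or semicolon-separated address list into a clean list."""
    parts = re.split(r"[;,]", value)
    # Drop surrounding whitespace and any empty entries left by stray separators.
    return [part.strip() for part in parts if part.strip()]


addresses = split_addresses("myname@institution.edu,anothername@institution.edu")
print(addresses)
```
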
Compose a Harvest List (The Harvest List Editor)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Harvest List is an XML file that contains a list of documents to be harvested.
The list is created by the site contact and stored on the site contact's site
at the location specified during the Harvester registration process (see
previous section for details). The list can be generated by hand, or you can
use Metacat's Harvest List Editor to automatically generate and structure the
list to conform to the required XML schema (displayed at the end of this
section). In this section we will look at what information is required when
building a Harvest List, and how to configure and use the Harvest List Editor.
Note that you must have a source distribution of Metacat in order to use the
Harvest List Editor.

The Harvest List contains information that helps Metacat identify and retrieve
each specified EML file. Each document in the list must be described with a
docid, documentType, and documentURL (see table).

.. list-table:: Information that must be included in the Harvest List about each EML file
   :header-rows: 1
   :widths: 15 85

   * - Item
     - Description
   * - docid
     - The docid uniquely identifies each EML document. Each docid consists of
       three elements:

       * ``scope``: The document group to which the document belongs.
       * ``identifier``: A number that uniquely identifies the document within
         the scope.
       * ``revision``: A number that indicates the current revision.

       For example, a valid docid could be: demoDocument.1.5, where
       demoDocument represents the scope, 1 the identifier, and 5 the
       revision number.
   * - documentType
     - The documentType identifies the type of document as EML,
       e.g., "eml://ecoinformatics.org/eml-2.0.0".
   * - documentURL
     - The documentURL specifies a place where Harvester can locate and
       retrieve the document via HTTP. The Metacat Harvester must be given
       read access to the contents at this URL,
       e.g., "http://www.lternet.edu/~dcosta/document1.xml".

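To make the three-part structure of a docid concrete, here is a small Python
sketch that splits a docid string such as ``demoDocument.1.5`` into its scope,
identifier, and revision. The ``DocId`` type and ``parse_docid`` function are
hypothetical names for this illustration, and splitting from the right (so that
a scope containing dots would be kept whole) is an assumption rather than
documented Metacat behavior.

```python
from typing import NamedTuple


class DocId(NamedTuple):
    scope: str
    identifier: int
    revision: int


def parse_docid(docid):
    """Split 'scope.identifier.revision' into its three parts."""
    # Split from the right so that only the last two dots act as separators.
    scope, identifier, revision = docid.rsplit(".", 2)
    return DocId(scope, int(identifier), int(revision))


parsed = parse_docid("demoDocument.1.5")
print(parsed.scope, parsed.identifier, parsed.revision)
```

A string with fewer than three parts, or with a non-numeric identifier or
revision, raises ``ValueError`` in this sketch.
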
The example Harvest List below contains two <document> elements that specify the
information that Harvester needs to retrieve a pair of EML documents and
upload them to Metacat.

::

  <?xml version="1.0" encoding="UTF-8" ?>
  <!-- Example Harvest List -->
  <hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList" >
    <document>
      <docid>
        <scope>demoDocument</scope>
        <identifier>1</identifier>
        <revision>5</revision>
      </docid>
      <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
      <documentURL>http://www.lternet.edu/~dcosta/document1.xml</documentURL>
    </document>
    <document>
      <docid>
        <scope>demoDocument</scope>
        <identifier>2</identifier>
        <revision>1</revision>
      </docid>
      <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
      <documentURL>http://www.lternet.edu/~dcosta/document2.xml</documentURL>
    </document>
  </hrv:harvestList>

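A Harvest List with this shape can also be generated by a script instead of
being typed by hand. The sketch below uses Python's standard
``xml.etree.ElementTree`` module to build the same two entries; the function
name ``build_harvest_list`` and the tuple layout are assumptions made for the
example, and the script is not a replacement for the Harvest List Editor.

```python
import xml.etree.ElementTree as ET

HRV_NS = "eml://ecoinformatics.org/harvestList"
EML_TYPE = "eml://ecoinformatics.org/eml-2.0.0"


def build_harvest_list(entries):
    """Build a Harvest List from (scope, identifier, revision, url) tuples."""
    ET.register_namespace("hrv", HRV_NS)
    # Only the root element is namespace-qualified; child elements are not.
    root = ET.Element(f"{{{HRV_NS}}}harvestList")
    for scope, identifier, revision, url in entries:
        document = ET.SubElement(root, "document")
        docid = ET.SubElement(document, "docid")
        ET.SubElement(docid, "scope").text = scope
        ET.SubElement(docid, "identifier").text = str(identifier)
        ET.SubElement(docid, "revision").text = str(revision)
        ET.SubElement(document, "documentType").text = EML_TYPE
        ET.SubElement(document, "documentURL").text = url
    return root


harvest_list = build_harvest_list([
    ("demoDocument", 1, 5, "http://www.lternet.edu/~dcosta/document1.xml"),
    ("demoDocument", 2, 1, "http://www.lternet.edu/~dcosta/document2.xml"),
])
ET.indent(harvest_list)  # pretty-print; requires Python 3.9 or later
print(ET.tostring(harvest_list, encoding="unicode"))
```
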
Rather than formatting the list by hand, you may wish to use Metacat's Harvest
List Editor to compose and edit it. The Harvest List Editor displays a Harvest
List as a table of rows and fields. Each table row corresponds to
a single <document> element in the corresponding Harvest List file (i.e., one
EML document). The row numbers are used only for visual reference and are
not editable.

To add a new document to the Harvest List, enter values for all five editable
fields (all fields except the "Row #" field). Partially filled-in rows will
cause errors that will result in an invalid Harvest List.

The buttons at the bottom of the Editor can be used to Cut, Copy, and Paste
rows from one location to another. Select a row and click the desired button,
or paste the default values (which are specified in the Editor's configuration
file, discussed later in this section) into the currently selected row by
clicking the Paste Defaults button. Note: Only one row can be selected at any
given time: all cut, copy, and paste operations work on only a single row
rather than on a range of rows.

To run the Harvest List Editor on the machine where the Metacat
source code is installed:

1. Open a system command window or terminal window.
2. Set the METACAT_HOME environment variable to the value of the Metacat
   installation directory. Some examples follow:

   On Windows:

   ::

     set METACAT_HOME=C:\somePath\knb

   On Linux/Unix (bash shell):

   ::

     export METACAT_HOME=/home/somePath/metacat

3. cd to the following directory:

   On Windows:

   ::

     cd %METACAT_HOME%\lib\harvester

   On Linux/Unix:

   ::

     cd $METACAT_HOME/lib/harvester

4. Run the appropriate Harvest List Editor script, as determined by the operating system:

   On Windows:

   ::

     runHarvestListEditor.bat

   On Linux/Unix:

   ::

     sh runHarvestListEditor.sh

The Harvest List Editor will open.

If you would like to customize the Harvest List Editor (e.g., specify a
default list to open automatically whenever the editor is opened and/or
default values), create a file called .harvestListEditor (note the leading
dot character). Use a plain text editor to create the file and place the file
in the Site Contact's home directory. To determine the home directory, open a
system command window or terminal window and type the following:

On Windows:

::

  echo %USERPROFILE%

On Linux/Unix:

::

  echo $HOME

The configuration file contains a number of optional properties that can make
using the Editor more convenient. A sample configuration file is displayed
below, and more information about each configuration property is contained in
the table.

A sample .harvestListEditor configuration file

::

  defaultHarvestList=C:/temp/harvestList.xml
  defaultScope=demo_document
  defaultIdentifier=1
  defaultRevision=1
  defaultDocumentURL=http://www.lternet.edu/~dcosta/
  defaultDocumentType=eml://ecoinformatics.org/eml-2.0.0

.. list-table:: Harvest List Editor Configuration Properties
   :header-rows: 1
   :widths: 25 75

   * - Property
     - Description
   * - defaultHarvestList
     - The location of a Harvest List file that the Editor will automatically
       open for editing on startup. Set this property to the path to the
       Harvest List file that you expect to edit most frequently.

       Examples: ``/home/jdoe/public_html/harvestList.xml``,
       ``C:/temp/harvestList.xml``
   * - defaultScope
     - The value pasted into the Editor's Scope field when the Paste Defaults
       button is clicked. The Scope field should contain a symbolic
       identifier that indicates the family of documents to which the EML
       document belongs.

       Example: xyz_dataset

       Default: dataset
   * - defaultIdentifier
     - The value pasted into the Editor's Identifier field when the Paste
       Defaults button is clicked. The Identifier field should contain a
       numeric value indicating the identifier for this particular EML
       document within the Scope.
   * - defaultRevision
     - The value pasted into the Editor's Revision field when the Paste
       Defaults button is clicked. The Revision field should contain a
       numeric value indicating the revision number of this EML document
       within the Scope and Identifier.

       Example: 2

       Default: 1
   * - defaultDocumentType
     - The document type specification pasted into the Editor's DocumentType
       field when the Paste Defaults button is clicked.

       Default: ``eml://ecoinformatics.org/eml-2.0.0``
   * - defaultDocumentURL
     - The URL or partial URL pasted into the Editor's URL field when the
       Paste Defaults button is clicked. Typically, this value is set to the
       portion of the URL shared by all harvested EML documents.

       Example: ``http://somehost.institution.edu/somepath/``

       Default: ``http://``

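The way the optional properties combine with their defaults can be sketched in
a few lines of Python. This example assumes the file uses plain ``key=value``
lines, as in the sample above, and treats lines starting with ``#`` as comments
(an assumption borrowed from the usual properties-file convention); the
function ``read_editor_config`` is hypothetical, and the fallback values are
the defaults listed in the table.

```python
DEFAULTS = {
    "defaultScope": "dataset",
    "defaultRevision": "1",
    "defaultDocumentType": "eml://ecoinformatics.org/eml-2.0.0",
    "defaultDocumentURL": "http://",
}


def read_editor_config(text):
    """Parse key=value lines, falling back to the documented defaults."""
    settings = dict(DEFAULTS)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


config = read_editor_config("defaultScope=demo_document\ndefaultIdentifier=1\n")
print(config["defaultScope"], config["defaultRevision"])
```
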
XML Schema for Harvest Lists

::

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- edited with XMLSPY v5 rel. 4 U (http://www.xmlspy.com) by Matt Jones (NCEAS) -->
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:hrv="eml://ecoinformatics.org/harvestList" xmlns="eml://ecoinformatics.org/harvestList" targetNamespace="eml://ecoinformatics.org/harvestList" elementFormDefault="unqualified" attributeFormDefault="unqualified">
    <xs:annotation>
      <xs:documentation>This module defines the required information for the harvester to collect documents from the local site. The local system containing this document must give the Metacat Harvester read access to this document.</xs:documentation>
    </xs:annotation>
    <xs:annotation>
      <xs:appinfo>
        <tooltip/>
        <summary/>
        <description/>
      </xs:appinfo>
    </xs:annotation>
    <xs:element name="harvestList">
      <xs:annotation>
        <xs:documentation>This represents the local document information that is used to inform the Harvester of the docid, document type, and location of the document to be harvested.</xs:documentation>
      </xs:annotation>
      <xs:complexType>
        <xs:sequence>
          <xs:element name="document" maxOccurs="unbounded">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="docid">
                  <xs:annotation>
                    <xs:documentation>The complete document identifier to be used by metacat. The docid is a compound element that gives a scope for the identifier, an integer local identifier that is unique within that scope, and a revision. Each revision is assumed to specify a unique, non-changing document, so once a particular revision is harvested, there is no need for it to be harvested again. To trigger a harvest of a document that has been updated, increment the revision number for that identifier.</xs:documentation>
                  </xs:annotation>
                  <xs:complexType>
                    <xs:sequence>
                      <xs:element name="scope" type="xs:string">
                        <xs:annotation>
                          <xs:documentation>The system prefix of a metacat docid that defines the scope within which the identifier is unique.</xs:documentation>
                        </xs:annotation>
                      </xs:element>
                      <xs:element name="identifier" type="xs:long">
                        <xs:annotation>
                          <xs:documentation>The local (site specific) portion of the identifier (docid) that is unique within the context of the scope.</xs:documentation>
                        </xs:annotation>
                      </xs:element>
                      <xs:element name="revision" type="xs:long">
                        <xs:annotation>
                          <xs:documentation>The revision identifier for this document, indicating a unique document version.</xs:documentation>
                        </xs:annotation>
                      </xs:element>
                    </xs:sequence>
                  </xs:complexType>
                </xs:element>
                <xs:element name="documentType" type="xs:string">
                  <xs:annotation>
                    <xs:documentation>The type of document to be harvested, indicated by a namespace string, formal public identifier, mime type, or other type indicator. </xs:documentation>
                  </xs:annotation>
                </xs:element>
                <xs:element name="documentURL" type="xs:anyURI">
                  <xs:annotation>
                    <xs:documentation>The documentURL field contains the URL of the document to be harvested. The Metacat Harvester must be given read access to the contents at this URL.</xs:documentation>
                  </xs:annotation>
                </xs:element>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>

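The schema documentation notes that a given revision is harvested only once and
that incrementing the revision number triggers a new harvest. The Python sketch
below models that rule in the simplest possible way. It assumes a lookup table
of the highest revision already harvested per (scope, identifier) pair; the
table and the ``needs_harvest`` function are stand-ins for whatever record the
Harvester actually keeps, which is not described here.

```python
def needs_harvest(scope, identifier, revision, harvested):
    """Return True if this revision has not been harvested yet.

    `harvested` maps (scope, identifier) to the highest revision already
    harvested; a missing key means the document is new.
    """
    last = harvested.get((scope, identifier))
    return last is None or revision > last


harvested = {("demoDocument", 1): 5}
print(needs_harvest("demoDocument", 1, 5, harvested))  # same revision as before
print(needs_harvest("demoDocument", 1, 6, harvested))  # revision incremented
print(needs_harvest("demoDocument", 2, 1, harvested))  # document not seen before
```
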
Prepare EML Documents for Harvest
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To prepare a set of EML documents for harvest, ensure that the following is true for each document:

* The document contains valid EML
* The document is specified in a ``<document>`` element in the site's Harvest List
* The file resides at the location specified by its URL in the Harvest List

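Part of this checklist can be automated before a harvest is scheduled. The
following Python sketch parses a Harvest List and reports any ``<document>``
entry that lacks one of the required fields. It is a rough pre-flight check
under the element names shown earlier, not a substitute for validating against
the harvestList schema or for checking that each file contains valid EML; the
function ``check_harvest_list`` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

HRV_NS = "eml://ecoinformatics.org/harvestList"
REQUIRED = ("docid/scope", "docid/identifier", "docid/revision",
            "documentType", "documentURL")


def check_harvest_list(xml_text):
    """Return a list of problems found in a Harvest List (empty if none)."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != f"{{{HRV_NS}}}harvestList":
        problems.append(f"unexpected root element: {root.tag}")
    for number, document in enumerate(root.findall("document"), start=1):
        for path in REQUIRED:
            node = document.find(path)
            if node is None or not (node.text or "").strip():
                problems.append(f"document {number}: missing {path}")
    return problems


# A deliberately incomplete entry: the <revision> element is missing.
sample = """<hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList">
  <document>
    <docid><scope>demoDocument</scope><identifier>1</identifier></docid>
    <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
    <documentURL>http://www.lternet.edu/~dcosta/document1.xml</documentURL>
  </document>
</hrv:harvestList>"""
problems = check_harvest_list(sample)
print(problems)
```
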
Review Harvester Reports
~~~~~~~~~~~~~~~~~~~~~~~~
Harvester sends an email report to the Site Contact after every scheduled site
harvest. The report contains information about the performed operations, such
as which EML documents were harvested and whether any errors were encountered.
Errors are indicated by operations that display a status value of 1; a status
value of 0 indicates that the operation completed successfully.

When errors are reported, the Site Contact should try to determine whether the
source of the error is something that can be corrected at the site. Common
causes of errors include:

* a document URL specified in the Harvest List does not match the location of the actual EML file on the disk
* the Harvest List does not contain valid XML as specified in the harvestList.xsd schema
* the URL to the Harvest List (specified during registration) does not match the actual location of the Harvest List on the disk
* an EML document that Harvester attempted to upload to Metacat does not contain valid EML

If the Site Contact is unable to determine the cause of the error and its
resolution, he or she should contact the Harvester Administrator for assistance.

Unregister with Harvester
~~~~~~~~~~~~~~~~~~~~~~~~~
To discontinue harvests, the Site Contact must unregister with Harvester.
To unregister:

1. Using a Web browser, log in to Metacat's Harvester Registration page.
   The Harvester Registration page is inside the skins directory. For example,
   if the Metacat server that your site is registered with resides at the
   following URL:

   ::

     http://somehost.somelocation.edu:8080/knb/index.jsp

   then the Harvester Registration page would be accessed at:

   ::

     http://somehost.somelocation.edu:8080/knb/style/skins/knb/harvesterRegistrationLogin.html

2. Enter and submit your Metacat account information. On the subsequent screen,
   click Unregister to remove your site and discontinue harvests.

Running Harvester
-----------------
The Harvester can be run as a servlet or in a command window. Under most
circumstances, Harvester is best run continuously as a background servlet
process. However, if you expect to use Harvester infrequently, or if you wish
only to test that Harvester is functioning, it may be desirable to run it from
a command window.

Running Harvester as a Servlet
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To run Harvester as a servlet (from a source code installation):

1. Remove the comment symbols around the HarvesterServlet entry in the source
   code. The HarvesterServlet entry is located in the ``lib/web.xml.tomcatN``
   file, where tomcatN corresponds to the version of Tomcat you are running.
   For example, if you are running Tomcat 6, edit the file
   ``lib/web.xml.tomcat6``.

   ::

     <!--
     <servlet>
       <servlet-name>HarvesterServlet</servlet-name>
       <servlet-class>edu.ucsb.nceas.metacat.harvesterClient.HarvesterServlet</servlet-class>
       <init-param>
         <param-name>debug</param-name>
         <param-value>1</param-value>
       </init-param>
       <init-param>
         <param-name>listings</param-name>
         <param-value>true</param-value>
       </init-param>
       <load-on-startup>1</load-on-startup>
     </servlet>
     -->

2. Save the edited file.
3. Shut down Tomcat.
4. Redeploy Metacat by running the following two Ant commands from the
   top-level directory of your Metacat installation:

   ::

     ant cleanweb
     ant install

5. Restart Tomcat. Note that you will have to edit the ``metacat.properties``
   file to specify harvester settings.

About thirty seconds after you restart Tomcat, the Harvester servlet will
start executing. The first harvest will occur after the number of hours
specified by the delay property in the metacat.properties file. The servlet
will continue running new harvests until the maximum number of harvests have
been completed, or until Tomcat shuts down (harvest frequency and maximum
number of harvests are also set in the Harvester properties).

Running Harvester in a Command Window
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To run Harvester in a Command Window:

1. Open a system command window or terminal window.
2. Set the ``METACAT_HOME`` environment variable to the value of the
   Metacat installation directory.

   On Windows:

   ::

     set METACAT_HOME=C:\somePath\metacat

   On Linux/Unix (bash shell):

   ::

     export METACAT_HOME=/home/somePath/metacat

3. cd to the following directory:

   On Windows:

   ::

     cd %METACAT_HOME%\lib\harvester

   On Linux/Unix:

   ::

     cd $METACAT_HOME/lib/harvester

4. Run the appropriate Harvester shell script, as determined by the operating system:

   On Windows:

   ::

     runHarvester.bat

   On Linux/Unix:

   ::

     sh runHarvester.sh

The Harvester application will start executing. The first harvest will occur
after the number of hours specified by the delay property in the
``metacat.properties`` file. The application will continue running new harvests
until the maximum number of harvests have been completed, or until you
interrupt the process by pressing Ctrl+C in the command window (harvest
frequency and maximum number of harvests are also set in the Harvester
properties).

Reviewing Harvest Reports
-------------------------
Harvester sends an email report to the Harvester Administrator after every
harvest. The report contains information about the performed operations, such
as which sites were harvested as well as which EML documents were harvested
and whether any errors were encountered. Errors are indicated by operations
that display a status value of 1; a status value of 0 indicates that the
operation completed successfully.

The Harvester Administrator should review the report, paying particularly
close attention to any reported errors and accompanying error messages. When
errors are reported at a particular site, the Harvester Administrator should
contact the Site Contact to determine the source of the error and its
resolution. Common causes of errors include:

* a document URL specified in the Harvest List does not match the location of the actual EML file on the disk
* the Harvest List does not contain valid XML as specified in the harvestList.xsd schema
* the URL to the Harvest List (specified during registration) does not match the actual location of the Harvest List on the disk
* an EML document that Harvester attempted to upload to Metacat does not contain valid EML

Errors that are independent of a particular site may indicate a problem with
Harvester itself, Metacat, or the database connection. Refer to the error
message to determine the source of the error and its resolution.

Converted Harvester chapter to RST.