Harvester and Harvest List Editor
=================================

    
Metacat's Harvester is an optional feature that can be used to automatically
retrieve EML documents from one or more custom data management systems (e.g.,
SRB or PostgreSQL) and to insert (or update) those documents into the home
repository. The local sites control when they are harvested and which documents
are harvested.

    
For example, the Long Term Ecological Research Network (LTER) uses the Metacat
Harvester to create a centralized repository of data from twenty-six
different sites that all store EML metadata but use different data management
systems. Once the data have been harvested and placed into a centralized
repository, they are replicated to the KNB network, exposing the information
to an even larger scientific community.

    
Once the Harvester is properly configured, listed documents are retrieved and
uploaded on a regularly scheduled basis. You must configure both the home
Metacat and the remote sites (the "harvest sites") before using this
feature. Each harvest site must also provide the Metacat server with a list of
documents that should be harvested.

    
Configuring Harvester
---------------------
Before you can use the Harvester to retrieve documents, you must configure the
feature using the settings in the metacat.properties file. Note that you must
also configure each site that the Harvester will connect to and retrieve
documents from (see "Configuring a Harvest Site" below for details).

    
The Harvester configuration information is managed in the metacat.properties
file, which is located at::

  <CONTEXT_DIR>/WEB-INF/metacat.properties

The Harvester properties are grouped together and begin after the comment line::

  # Harvester properties

    
To configure Harvester, edit the metacat.properties file and set appropriate values
for the harvesterAdministrator and smtpServer properties. You may also wish to
customize the other Harvester parameters, each discussed in the table below.
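
As a rough illustration of this configuration step, the following Python sketch
checks a properties excerpt for the two settings that must be set. The property
names are the ones used in this document; the sample values, the helper names,
and the simplified ``key=value`` parsing are assumptions, and the keys in your
own metacat.properties file may be spelled differently::

  REQUIRED = ("harvesterAdministrator", "smtpServer")


  def parse_properties(text):
      """Parse simple key=value lines, skipping blank lines and # comments."""
      props = {}
      for raw in text.splitlines():
          line = raw.strip()
          if not line or line.startswith("#"):
              continue
          key, sep, value = line.partition("=")
          if sep:
              props[key.strip()] = value.strip()
      return props


  def missing_required(props, required=REQUIRED):
      """Return the required property names that are absent or empty."""
      return [name for name in required if not props.get(name)]


  # Hypothetical excerpt; real values depend on your installation.
  sample = """
  # Harvester properties
  connectToMetacat=true
  harvesterAdministrator=name1@abc.edu,name2@abc.edu
  smtpServer=
  """

  print(missing_required(parse_properties(sample)))  # ['smtpServer']

The sample excerpt leaves ``smtpServer`` empty, so the sketch reports it as missing.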

    
Harvester Properties and their Functions
----------------------------------------

+------------------------------------+-------------------------------------------------------------------------------------------------+-+
| Property                           | Description and Values                                                                          | |
+====================================+=================================================================================================+=+
| connectToMetacat                   | Determine whether Harvester should connect to Metacat to upload retrieved documents.            | |
|                                    | Set to true (the default) under most circumstances. To test whether Harvester can               | |
|                                    | retrieve documents from a site without actually connecting to Metacat                           | |
|                                    | to upload the documents, set the value to false.                                                | |
|                                    |                                                                                                 | |
|                                    | Values: true/false                                                                              | |
+------------------------------------+-------------------------------------------------------------------------------------------------+-+
| delay                              | The number of hours that Harvester will wait before beginning its first harvest.                | |
|                                    | For example, if Harvester is run at 1:00 p.m., and the delay is set to 12,                      | |
|                                    | Harvester will begin its first harvest at 1:00 a.m.                                             | |
|                                    |                                                                                                 | |
|                                    | Default: 0                                                                                      | |
+------------------------------------+-------------------------------------------------------------------------------------------------+-+
| harvesterAdministrator             | The email address of the Harvester Administrator. Harvester will send                           | |
|                                    | email reports to this address after every harvest. Enter multiple email addresses by separating | |
|                                    | each address with a comma or semicolon (e.g., name1@abc.edu,name2@abc.edu).                     | |
|                                    |                                                                                                 | |
|                                    | Values: An email address, or multiple email addresses separated by commas or semi-colons        | |
+------------------------------------+-------------------------------------------------------------------------------------------------+-+
| logPeriod                          | The number of days to retain Harvester log entries. Harvester log entries                       | |
|                                    | record information such as which documents were harvested, from which sites,                    | |
|                                    | and whether any errors were encountered during the harvest. Log entries older                   | |
|                                    | than logPeriod number of days are purged from the database at the end of each harvest.          | |
|                                    |                                                                                                 | |
|                                    | Default: 90                                                                                     | |
+------------------------------------+-------------------------------------------------------------------------------------------------+-+
| maxHarvests                        | The maximum number of harvests that Harvester should execute before                             | |
|                                    | shutting down. If the value of maxHarvests is set to 0 or a                                     | |
|                                    | negative number, Harvester will execute indefinitely.                                           | |
|                                    |                                                                                                 | |
|                                    | Default: 0                                                                                      | |
+------------------------------------+-------------------------------------------------------------------------------------------------+-+
| period                             | The number of hours between harvests. Harvester will run a new harvest                          | |
|                                    | every specified period of hours (either indefinitely or until the maximum                       | |
|                                    | number of harvests have run, depending on the value of maxHarvests).                            | |
|                                    |                                                                                                 | |
|                                    | Default: 24                                                                                     | |
+------------------------------------+-------------------------------------------------------------------------------------------------+-+
| smtpServer                         | The SMTP server that Harvester uses for sending email messages to the                           | |
|                                    | Harvester Administrator and Site Contacts.                                                      | |
|                                    | (e.g., somehost.institution.edu). Note that the default value only works                        | |
|                                    | if the Harvester host machine is configured as a SMTP server.                                   | |
|                                    |                                                                                                 | |
|                                    | Default: localhost                                                                              | |
+------------------------------------+-------------------------------------------------------------------------------------------------+-+
| Harvester Operation Properties     | The Harvester Operation properties are used by Harvester to report information                  | |
| (GetDocError, GetDocSuccess, etc.) | about performed operations for inclusion in log entries and email messages.                     | |
|                                    | Under most circumstances the values of these properties should not be modified.                 | |
+------------------------------------+-------------------------------------------------------------------------------------------------+-+
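
The way delay, period, and maxHarvests interact can be sketched in Python as
follows. This only illustrates the schedule described in the table and is not
Harvester's actual scheduler; the function name, the parameter names, and the
start time are assumptions::

  from datetime import datetime, timedelta
  from itertools import count, islice


  def harvest_times(started, delay=0, period=24, max_harvests=0):
      """Return an iterator of planned harvest times.

      delay and period are given in hours; a max_harvests value of 0 or
      less means the schedule never ends.
      """
      first = started + timedelta(hours=delay)
      times = (first + timedelta(hours=period * n) for n in count())
      return islice(times, max_harvests) if max_harvests > 0 else times


  started = datetime(2024, 1, 1, 13, 0)  # Harvester started at 1:00 p.m.
  for when in harvest_times(started, delay=12, period=24, max_harvests=3):
      print(when.isoformat())

With a start time of 1:00 p.m. and a delay of 12, the first planned harvest
falls at 1:00 a.m. on the following day, which matches the example given for
the delay property.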

    
Configuring a Harvest Site (Instructions for Site Contact)
----------------------------------------------------------

After Metacat's Harvester has been configured, remote sites can register and
send information about which files should be retrieved. Each remote site must
have a site contact who is responsible for registering the site and creating a
list of EML files to harvest (the "Harvest List"), as well as for reviewing
harvest reports. The site contact can unregister the site from the Harvester
at any time.

    
To use Harvester:

1. Register with Harvester
2. Compose a Harvest List (you will likely wish to use the Harvest List Editor)
3. Prepare your EML Documents for Harvest
4. Review the Harvester Reports

    
Register with Harvester
~~~~~~~~~~~~~~~~~~~~~~~

To register a remote site with Harvester, the Site Contact should log in to
Metacat's Harvester Registration page and enter information about the site and
how it should be harvested.

    
1. Using a Web browser, open Metacat's Harvester Registration page.
   The Harvester Registration page is inside the skins directory. For example,
   if the Metacat server that you wish to register with resides at the following URL:

   ::

     http://somehost.somelocation.edu:8080/metacat/index.jsp

   then the Harvester Registration page would be accessed at:

   ::

     http://somehost.somelocation.edu:8080/metacat/style/skins/default/harvesterRegistrationLogin.jsp

    
.. figure:: images/screenshots/image065.jpg
   :align: center

   Metacat's Harvester Registration page.

    
2. Enter your Metacat account information and click Submit to log in to
   Metacat from the Harvester Registration page.

   Note: In some cases, you may need to log in to an anonymous "site" account
   rather than your personal account so that the registered data will not appear
   to have been registered by a single user. For example, an information
   manager (jones) who is registering data created by a team of scientists
   (jones, smith, and barney) from the Georgia Coastal Ecosystems site might
   log in to a dedicated account (named with the site's acronym, "GCE") to
   indicate that the registered data is from the entire site rather than "jones".

    
3. Enter information about your site and how often you want to schedule harvests
   and then click the Register button (see the following figure). The Harvest List URL should
   point to the location of the Harvest List, which is an XML file that lists
   the documents to harvest. If you do not yet have a Harvest List, please see
   the next section for more information about creating one.

.. figure:: images/screenshots/image067.jpg
   :align: center

   Enter information about your site and how often you want to schedule harvests.

    
The example settings in the previous figure instruct Harvester to harvest
documents from the site once every two weeks. The Harvester will access the
site's Harvest List at URL "http://somehost.institution.edu/~myname/harvestList.xml",
and will send email reports to the Site Contact at email address
"myname@institution.edu". Note that you can enter multiple email addresses by
separating each address with a comma or a semicolon, for example
"myname@institution.edu,anothername@institution.edu".
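
How Harvester itself parses the address list is not described here. Purely as an
illustration, and assuming that surrounding spaces and empty entries should be
ignored, a comma- or semicolon-separated list can be split like this in Python
(the function name is hypothetical)::

  import re


  def split_addresses(value):
      """Split a comma- or semicolon-separated address list, dropping blanks."""
      return [part.strip() for part in re.split(r"[;,]", value) if part.strip()]


  print(split_addresses("myname@institution.edu,anothername@institution.edu"))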

    
Compose a Harvest List (The Harvest List Editor)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Harvest List is an XML file that contains a list of documents to be harvested.
The list is created by the site contact and stored at the harvest site,
at the location specified during the Harvester registration process (see
previous section for details). The list can be generated by hand, or you can
use Metacat's Harvest List Editor to automatically generate and structure the
list to conform to the required XML schema (shown at the end of
this section). In this section we will look at what information is required when
building a Harvest List, and how to configure and use the Harvest List Editor.
Note that you must have a source distribution of Metacat in order to use the
Harvest List Editor.

    
The Harvest List contains information that helps Metacat identify and retrieve
each specified EML file. Each document in the list must be described with a
docid, documentType, and documentURL (see table).

    
Table: Information that must be included in the Harvest List about each EML file

+--------------+-------------------------------------------------------------------------------------------------+
| Item         | Description                                                                                     |
+==============+=================================================================================================+
| docid        | The docid uniquely identifies each EML document. Each docid consists of three elements:         |
|              |                                                                                                 |
|              | ``scope`` The document group to which the document belongs                                      |
|              |                                                                                                 |
|              | ``identifier``  A number that uniquely identifies the document within the scope.                |
|              |                                                                                                 |
|              | ``revision`` A number that indicates the current revision.                                      |
|              |                                                                                                 |
|              | For example, a valid docid could be: demoDocument.1.5, where demoDocument represents            |
|              | the scope, 1 the identifier, and 5 the revision number.                                         |
+--------------+-------------------------------------------------------------------------------------------------+
| documentType | The documentType identifies the type of document as EML                                         |
|              | e.g., "eml://ecoinformatics.org/eml-2.0.0".                                                     |
+--------------+-------------------------------------------------------------------------------------------------+
| documentURL  | The documentURL specifies a place where Harvester can locate and retrieve the                   |
|              | document via HTTP. The Metacat Harvester must be given read access to the contents at this URL. |
|              | e.g. "http://www.lternet.edu/~dcosta/document1.xml".                                            |
+--------------+-------------------------------------------------------------------------------------------------+

    
The example Harvest List below contains two ``<document>`` elements that specify the
information that Harvester needs to retrieve a pair of EML documents and
upload them to Metacat.

::

  <?xml version="1.0" encoding="UTF-8" ?>
  <!-- Example Harvest List -->
  <hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList" >
    <document>
        <docid>
            <scope>demoDocument</scope>
            <identifier>1</identifier>
            <revision>5</revision>
        </docid>
        <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
        <documentURL>http://www.lternet.edu/~dcosta/document1.xml</documentURL>
    </document>
    <document>
        <docid>
            <scope>demoDocument</scope>
            <identifier>2</identifier>
            <revision>1</revision>
        </docid>
        <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
        <documentURL>http://www.lternet.edu/~dcosta/document2.xml</documentURL>
    </document>
  </hrv:harvestList>
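
A Harvest List with this structure could also be generated by a script. The
following Python sketch builds the same two entries with the standard library.
The function name and the ``scope.identifier.revision`` input format are
assumptions made for illustration, and the output is not validated against the
Harvest List schema::

  import xml.etree.ElementTree as ET

  HRV_NS = "eml://ecoinformatics.org/harvestList"
  EML_TYPE = "eml://ecoinformatics.org/eml-2.0.0"


  def build_harvest_list(entries, document_type=EML_TYPE):
      """Build a harvestList element from (docid, url) pairs.

      Each docid is a "scope.identifier.revision" string.
      """
      ET.register_namespace("hrv", HRV_NS)
      root = ET.Element(f"{{{HRV_NS}}}harvestList")
      for docid, url in entries:
          # Split from the right so that a scope containing dots stays intact.
          scope, identifier, revision = docid.rsplit(".", 2)
          document = ET.SubElement(root, "document")
          docid_el = ET.SubElement(document, "docid")
          ET.SubElement(docid_el, "scope").text = scope
          # int() rejects identifiers and revisions that are not whole numbers.
          ET.SubElement(docid_el, "identifier").text = str(int(identifier))
          ET.SubElement(docid_el, "revision").text = str(int(revision))
          ET.SubElement(document, "documentType").text = document_type
          ET.SubElement(document, "documentURL").text = url
      return root


  root = build_harvest_list([
      ("demoDocument.1.5", "http://www.lternet.edu/~dcosta/document1.xml"),
      ("demoDocument.2.1", "http://www.lternet.edu/~dcosta/document2.xml"),
  ])
  ET.indent(root)  # pretty-printing; requires Python 3.9 or later
  print(ET.tostring(root, encoding="unicode"))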

    
Rather than formatting the list by hand, you may wish to use Metacat's Harvest
List Editor to compose and edit it. The Harvest List Editor displays a Harvest
List as a table of rows and fields. Each table row corresponds to
a single ``<document>`` element in the corresponding Harvest List file (i.e., one
EML document). The row numbers are used only for visual reference and are
not editable.

    
To add a new document to the Harvest List, enter values for all five editable
fields (all fields except the "Row #" field). Partially filled-in rows will
cause errors that will result in an invalid Harvest List.

    
The buttons at the bottom of the Editor can be used to Cut, Copy, and Paste
rows from one location to another. Select a row and click the desired button,
or paste the default values (which are specified in the Editor's configuration
file, discussed later in this section) into the currently selected row by
clicking the Paste Defaults button. Note: Only one row can be selected at any
given time, so all cut, copy, and paste operations work on a single row
rather than on a range of rows.

    
To run the Harvest List Editor on the machine where the Metacat
source code is installed:
      
1. Open a system command window or terminal window.
2. Set the METACAT_HOME environment variable to the value of the Metacat
   installation directory. Some examples follow:

   On Windows:

   ::

     set METACAT_HOME=C:\somePath\metacat

   On Linux/Unix (bash shell):

   ::

     export METACAT_HOME=/home/somePath/metacat

    
3. Change to the following directory:

   On Windows:

   ::

     cd %METACAT_HOME%\lib\harvester

   On Linux/Unix:

   ::

     cd $METACAT_HOME/lib/harvester

    
4. Run the Harvest List Editor script that matches your operating system:

   On Windows:

   ::

     runHarvestListEditor.bat

   On Linux/Unix:

   ::

     sh runHarvestListEditor.sh

   The Harvest List Editor will open.

    
If you would like to customize the Harvest List Editor (e.g., specify a
default list to open automatically whenever the editor is opened, default
field values, or both), create a file called .harvestListEditor (note the leading
dot character). Use a plain text editor to create the file and place the file
in the Site Contact's home directory. To determine the home directory, open a
system command window or terminal window and type the following:

On Windows:

::

  echo %USERPROFILE%

On Linux/Unix:

::

  echo $HOME

    
The configuration file contains a number of optional properties that can make
using the Editor more convenient. A sample configuration file is displayed below, and
more information about each configuration property is contained in the table.

A sample .harvestListEditor configuration file

::

  defaultHarvestList=C:/temp/harvestList.xml
  defaultScope=demo_document
  defaultIdentifier=1
  defaultRevision=1
  defaultDocumentURL=http://www.lternet.edu/~dcosta/
  defaultDocumentType=eml://ecoinformatics.org/eml-2.0.0

    
Harvest List Editor Configuration Properties

+---------------------+----------------------------------------------------------------------------------------------+
| Property            | Description                                                                                  |
+=====================+==============================================================================================+
| defaultHarvestList  | The location of a Harvest List file that the Editor will                                     |
|                     | automatically open for editing on startup. Set this property                                 |
|                     | to the path to the Harvest List file that you expect to edit most frequently.                |
|                     |                                                                                              |
|                     | Examples:                                                                                    |
|                     | ``/home/jdoe/public_html/harvestList.xml``                                                   |
|                     | ``C:/temp/harvestList.xml``                                                                  |
+---------------------+----------------------------------------------------------------------------------------------+
| defaultScope        | The value pasted into the Editor's Scope field when the Paste                                |
|                     | Defaults button is clicked. The Scope field should contain                                   |
|                     | a symbolic identifier that indicates the family of documents                                 |
|                     | to which the EML document belongs.                                                           |
|                     |                                                                                              |
|                     | Example:   xyz_dataset                                                                       |
|                     | Default:    dataset                                                                          |
+---------------------+----------------------------------------------------------------------------------------------+
| defaultIdentifier   | The value pasted into the Editor's Identifier field when the                                 |
|                     | Paste Defaults button is clicked. The Identifier field should contain                        |
|                     | a numeric value indicating the identifier for this particular EML document within the Scope. |
+---------------------+----------------------------------------------------------------------------------------------+
| defaultRevision     | The value pasted into the Editor's Revision field when the Paste Defaults button             |
|                     | is clicked. The Revision field should contain a numeric value indicating the                 |
|                     | revision number of this EML document within the Scope and Identifier.                        |
|                     |                                                                                              |
|                     | Example:   2                                                                                 |
|                     | Default:    1                                                                                |
+---------------------+----------------------------------------------------------------------------------------------+
| defaultDocumentType | The document type specification pasted into the                                              |
|                     | Editor's DocumentType field when the Paste Defaults button is clicked.                       |
|                     |                                                                                              |
|                     | Default: ``eml://ecoinformatics.org/eml-2.0.0``                                              |
+---------------------+----------------------------------------------------------------------------------------------+
| defaultDocumentURL  | The URL or partial URL pasted into the Editor's URL field                                    |
|                     | when the Paste Defaults button is clicked. Typically, this                                   |
|                     | value is set to the portion of the URL shared by all harvested EML documents.                |
|                     |                                                                                              |
|                     | Example:                                                                                     |
|                     | ``http://somehost.institution.edu/somepath/``                                                |
|                     | Default: ``http://``                                                                         |
+---------------------+----------------------------------------------------------------------------------------------+

    
XML Schema for Harvest Lists

::

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- edited with XMLSPY v5 rel. 4 U (http://www.xmlspy.com) by Matt Jones (NCEAS) -->
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:hrv="eml://ecoinformatics.org/harvestList" xmlns="eml://ecoinformatics.org/harvestList" targetNamespace="eml://ecoinformatics.org/harvestList" elementFormDefault="unqualified" attributeFormDefault="unqualified">
  <xs:annotation>
    <xs:documentation>This module defines the required information for the harvester to collect documents from the local site. The local system containing this document must give the Metacat Harvester read access to this document.</xs:documentation>
  </xs:annotation>
  <xs:annotation>
    <xs:appinfo>
      <tooltip/>
      <summary/>
      <description/>
    </xs:appinfo>
  </xs:annotation>
  <xs:element name="harvestList">
    <xs:annotation>
      <xs:documentation>This represents the local document information that is used to inform the Harvester of the docid, document type, and location of the document to be harvested.</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element name="document" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="docid">
                <xs:annotation>
                  <xs:documentation>The complete document identifier to be used by metacat.  The docid is a compound element that gives a scope for the identifier, an integer local identifer that is unique within that scope, and a revision.  Each revision is assumed to specify a unique, non-changing document, so once a particular revision is harvested, there is no need for it to be harvested again.  To trigger a harvest of a document that has been updated, increment the revision number for that identifier.</xs:documentation>
                </xs:annotation>
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="scope" type="xs:string">
                      <xs:annotation>
                        <xs:documentation>The system prefix of a metacat docid that defines the scope within which the identifier is unique.</xs:documentation>
                      </xs:annotation>
                    </xs:element>
                    <xs:element name="identifier" type="xs:long">
                      <xs:annotation>
                        <xs:documentation>The local (site specific) portion of the identifier (docid) that is unique within the context of the scope.</xs:documentation>
                      </xs:annotation>
                    </xs:element>
                    <xs:element name="revision" type="xs:long">
                      <xs:annotation>
                        <xs:documentation>The revision identifier for this document, indicating a unique document version.</xs:documentation>
                      </xs:annotation>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
              <xs:element name="documentType" type="xs:string">
                <xs:annotation>
                  <xs:documentation>The type of document to be harvested, indicated by a namespace string, formal public identifier, mime type, or other type indicator.   </xs:documentation>
                </xs:annotation>
              </xs:element>
              <xs:element name="documentURL" type="xs:anyURI">
                <xs:annotation>
                  <xs:documentation>The documentURL field contains the URL of the document to be harvested. The Metacat Harvester must be given read access to the contents at this URL.</xs:documentation>
                </xs:annotation>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  </xs:schema>

    
Prepare EML Documents for Harvest
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To prepare a set of EML documents for harvest, ensure that the following is true for each document:

* The document contains valid EML
* The document is specified in a ``<document>`` element in the site's Harvest List
* The file resides at the location specified by its URL in the Harvest List
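
Some of these checks can be partly automated. The following Python sketch lists
the entries of a Harvest List and offers a helper that tests whether a document
URL can be retrieved over HTTP. The function names are assumptions, the sample
list is abbreviated to a single document, and the sketch does not validate the
EML itself::

  import urllib.request
  import xml.etree.ElementTree as ET


  def read_harvest_list(xml_text):
      """Return (docid, documentType, documentURL) for each <document>."""
      root = ET.fromstring(xml_text)
      entries = []
      for document in root.findall("document"):
          docid = ".".join(
              document.findtext(f"docid/{part}", default="").strip()
              for part in ("scope", "identifier", "revision")
          )
          entries.append((
              docid,
              document.findtext("documentType", default="").strip(),
              document.findtext("documentURL", default="").strip(),
          ))
      return entries


  def is_reachable(url, timeout=10):
      """Return True if an HTTP GET of url succeeds (needs network access)."""
      try:
          with urllib.request.urlopen(url, timeout=timeout) as response:
              return 200 <= response.status < 300
      except OSError:  # URLError and HTTPError are subclasses of OSError
          return False


  sample = """<hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList">
    <document>
      <docid>
        <scope>demoDocument</scope>
        <identifier>1</identifier>
        <revision>5</revision>
      </docid>
      <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
      <documentURL>http://www.lternet.edu/~dcosta/document1.xml</documentURL>
    </document>
  </hrv:harvestList>"""

  for docid, doc_type, url in read_harvest_list(sample):
      print(docid, doc_type, url)

Calling ``is_reachable(url)`` for each listed URL from a machine outside the
site gives a rough indication of whether Harvester will be able to read the file.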

    
Review Harvester Reports
~~~~~~~~~~~~~~~~~~~~~~~~
Harvester sends an email report to the Site Contact after every scheduled site
harvest. The report contains information about the performed operations, such
as which EML documents were harvested and whether any errors were encountered.
Errors are indicated by operations that display a status value of 1; a status
value of 0 indicates that the operation completed successfully.

    
473
When errors are reported, the Site Contact should try to determine whether the 
474
source of the error is something that can be corrected at the site. Common 
475
causes of errors include:
476

    
477
* a document URL specified in the Harvest List does not match the location of the actual EML file on the disk 
478
* the Harvest List does not contain valid XML as specified in the harvestList.xsd schema 
479
* the URL to the Harvest List (specified during registration) does not match the actual location of the Harvest List on the disk 
480
* an EML document that Harvester attempted to upload to Metacat does not contain valid EML 
481

    
482
If the Site Contact is unable to determine the cause of the error and its 
483
resolution, he or she should contact the Harvester Administrator for assistance. 
484

    
Unregister with Harvester
~~~~~~~~~~~~~~~~~~~~~~~~~
To discontinue harvests, the Site Contact must unregister with Harvester.
To unregister:

1. Using a Web browser, log in to Metacat's Harvester Registration page.
   The Harvester Registration page is inside the skins directory. For example,
   if the Metacat server that your site is registered with resides at the
   following URL:

   ::

     http://somehost.somelocation.edu:8080/metacat/index.jsp

   then the Harvester Registration page would be accessed at:

   ::

     http://somehost.somelocation.edu:8080/metacat/style/skins/default/harvesterRegistrationLogin.jsp

2. Enter and submit your Metacat account information. On the subsequent screen,
   click Unregister to remove your site and discontinue harvests.
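
The relationship between the two URLs in step 1 can be expressed as a relative
URL resolution. The following Python sketch is only an illustration; the
function name ``registration_url`` is hypothetical, and the relative path is
taken from the example above, which assumes the default skin.

```python
from urllib.parse import urljoin

# Path of the login page relative to the Metacat context, as in the example
# above (assumes the "default" skin).
REGISTRATION_PAGE = "style/skins/default/harvesterRegistrationLogin.jsp"


def registration_url(metacat_url):
    """Resolve the Harvester Registration page against a Metacat page URL.

    The input must point at a page inside the context (for example
    .../metacat/index.jsp) or end with a slash; otherwise urljoin would
    replace the last path segment instead of appending to it.
    """
    return urljoin(metacat_url, REGISTRATION_PAGE)


print(registration_url("http://somehost.somelocation.edu:8080/metacat/index.jsp"))
```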

    
Running Harvester
-----------------
The Harvester can be run as a servlet or in a command window. Under most
circumstances, Harvester is best run continuously as a background servlet
process. However, if you expect to use Harvester infrequently, or if you wish
only to test that Harvester is functioning, it may be desirable to run it from
a command window.

    
Running Harvester as a Servlet
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To run Harvester as a servlet:

1. Remove the comment symbols (``<!--`` and ``-->``) around the
   HarvesterServlet entry in the deployed Metacat ``web.xml`` file
   (``$TOMCAT_HOME/webapps/<context>/WEB-INF``). Before editing, the
   commented-out entry looks like this:

   ::

     <!--
     <servlet>
       <servlet-name>HarvesterServlet</servlet-name>
       <servlet-class>edu.ucsb.nceas.metacat.harvesterClient.HarvesterServlet</servlet-class>
       <init-param>
         <param-name>debug</param-name>
         <param-value>1</param-value>
       </init-param>
       <init-param>
         <param-name>listings</param-name>
         <param-value>true</param-value>
       </init-param>
       <load-on-startup>1</load-on-startup>
     </servlet>
     -->

2. Save the edited file.
3. Shut down Tomcat.
4. Redeploy Metacat by running the following two Ant commands from the
   top-level directory of your Metacat installation:

   ::

     ant cleanweb
     ant install

5. Restart Tomcat. Note that the Harvester settings must also be specified in
   the ``metacat.properties`` file.

    
About thirty seconds after you restart Tomcat, the Harvester servlet will
start executing. The first harvest will occur after the number of hours
specified in the ``metacat.properties`` file. The servlet will continue running
new harvests until the maximum number of harvests has been completed, or until
Tomcat shuts down (harvest frequency and maximum number of harvests are also
set in the Harvester properties).
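
Because a servlet entry that is still wrapped in ``<!-- -->`` is silently
ignored, it can be useful to confirm that the edit took effect before
restarting Tomcat. The Python sketch below is an illustration and not a
Metacat tool; the function name ``servlet_enabled`` and the trimmed-down
``web.xml`` samples are hypothetical. It relies on the standard-library XML
parser discarding comments, so a commented-out entry is not found.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def servlet_enabled(web_xml_text, servlet_name="HarvesterServlet"):
    """Return True if an uncommented <servlet> entry with this name exists."""
    root = ET.fromstring(web_xml_text)  # comments are dropped while parsing
    for servlet in root.iter():
        if local_name(servlet.tag) != "servlet":
            continue
        for child in servlet:
            if (local_name(child.tag) == "servlet-name"
                    and (child.text or "").strip() == servlet_name):
                return True
    return False


# Hypothetical, heavily shortened web.xml contents for illustration.
COMMENTED = """<web-app>
  <!--
  <servlet>
    <servlet-name>HarvesterServlet</servlet-name>
  </servlet>
  -->
</web-app>"""

ACTIVE = """<web-app>
  <servlet>
    <servlet-name>HarvesterServlet</servlet-name>
  </servlet>
</web-app>"""

print(servlet_enabled(COMMENTED))
print(servlet_enabled(ACTIVE))
```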

    
Running Harvester in a Command Window
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To run Harvester in a command window:

1. Open a system command window or terminal window.
2. Set the ``METACAT_HOME`` environment variable to the value of the
   Metacat webapp deployment directory.

   On Windows:

   ::

     set METACAT_HOME=C:\somePath\metacat

   On Linux/Unix (bash shell):

   ::

     export METACAT_HOME=/home/somePath/metacat

    
3. Change to the following directory:

   On Windows:

   ::

     cd %METACAT_HOME%\lib\harvester

   On Linux/Unix:

   ::

     cd $METACAT_HOME/lib/harvester

4. Run the appropriate Harvester script, as determined by the operating system:

   On Windows:

   ::

     runHarvester.bat %METACAT_HOME%

   On Linux/Unix:

   ::

     sh runHarvester.sh $METACAT_HOME

    
The Harvester application will start executing. The first harvest will occur
after the number of hours specified in the ``metacat.properties`` file. The
application will continue running new harvests until the maximum number of
harvests has been completed, or until you interrupt the process by pressing
Ctrl+C in the command window (harvest frequency and maximum number of harvests
are also set in the Harvester properties).

    
Reviewing Harvest Reports
-------------------------
Harvester sends an email report to the Harvester Administrator after every
harvest. The report describes the operations that were performed, including
which sites and which EML documents were harvested and whether any errors were
encountered. A status value of 0 indicates that an operation completed
successfully; a status value of 1 indicates an error.

The Harvester Administrator should review the report, paying particularly
close attention to any reported errors and accompanying error messages. When
errors are reported at a particular site, the Harvester Administrator should
contact the Site Contact to determine the source of the error and its
resolution. Common causes of errors include:

* a document URL specified in the Harvest List does not match the location of the actual EML file on the disk
* the Harvest List does not contain valid XML as specified in the harvestList.xsd schema
* the URL to the Harvest List (specified during registration) does not match the actual location of the Harvest List on the disk
* an EML document that Harvester attempted to upload to Metacat does not contain valid EML

Errors that are independent of a particular site may indicate a problem with
Harvester itself, Metacat, or the database connection. Refer to the error
message to determine the source of the error and its resolution.
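
When a report covers many sites, grouping the failed operations by site makes
it easier to decide which Site Contacts to notify. The layout of the email
report is not described here, so the Python sketch below assumes the report
has already been transcribed into ``(site, operation, status)`` tuples; that
tuple layout, the function name ``failures_by_site``, and the sample rows are
all hypothetical. Only the meaning of the status values (0 for success, 1 for
error) comes from the description above.

```python
from collections import defaultdict


def failures_by_site(operations):
    """Group operations whose status is 1 (error) by site name."""
    failed = defaultdict(list)
    for site, operation, status in operations:
        if status == 1:
            failed[site].append(operation)
    return dict(failed)


# Hypothetical rows; not copied from a real Harvester report.
sample_operations = [
    ("SITE_A", "retrieve harvest list", 0),
    ("SITE_A", "upload document", 1),
    ("SITE_B", "upload document", 0),
]

for site, operations in failures_by_site(sample_operations).items():
    print(site, operations)
```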