Harvester and Harvest List Editor
=================================

Metacat's Harvester is an optional feature that can be used to automatically
retrieve EML documents from one or more custom data management systems (e.g.,
SRB or PostgreSQL) and to insert (or update) those documents into the home
repository. The local sites control when they are harvested, and which documents
are harvested.

For example, the Long Term Ecological Research Network (LTER) uses the Metacat
Harvester to create a centralized repository of data stored at twenty-six
different sites that store EML metadata, but that use different data management
systems. Once the data have been harvested and placed into a centralized
repository, they are replicated to the KNB network, exposing the information
to an even larger scientific community.

Once the Harvester is properly configured, listed documents are retrieved and
uploaded on a regularly scheduled basis. You must configure both the home
Metacat and the remote sites (aka the "harvest sites") before using this
feature. Local sites must also provide the Metacat server with a list of
documents that should be harvested.

Configuring Harvester
---------------------
Before you can use the Harvester to retrieve documents, you must configure the
feature using the settings in the metacat.properties file. Note that you must
also configure each site that the Harvester will connect to and retrieve
documents from (see section 7.2 for details).

The Harvester configuration information is managed in the metacat.properties
file, which is located at::

  <CONTEXT_DIR>/WEB-INF/metacat.properties

The Harvester properties are grouped together and begin after the comment line::

  # Harvester properties

To configure Harvester, edit the metacat.properties file and set appropriate
values for the harvesterAdministrator and smtpServer properties. You may also
wish to customize the other Harvester parameters, each discussed in the table
below.

Harvester Properties and their Functions
----------------------------------------

+------------------------------------+-------------------------------------------------------------------------------------------------+
| Property                           | Description and Values                                                                          |
+====================================+=================================================================================================+
| connectToMetacat                   | Determine whether Harvester should connect to Metacat to upload retrieved documents.            |
|                                    | Set to true (the default) under most circumstances. To test whether Harvester can               |
|                                    | retrieve documents from a site without actually connecting to Metacat                           |
|                                    | to upload the documents, set the value to false.                                                |
|                                    |                                                                                                 |
|                                    | Values: true/false                                                                              |
+------------------------------------+-------------------------------------------------------------------------------------------------+
| delay                              | The number of hours that Harvester will wait before beginning its first harvest.                |
|                                    | For example, if Harvester is run at 1:00 p.m., and the delay is set to 12,                      |
|                                    | Harvester will begin its first harvest at 1:00 a.m.                                             |
|                                    |                                                                                                 |
|                                    | Default: 0                                                                                      |
+------------------------------------+-------------------------------------------------------------------------------------------------+
| harvesterAdministrator             | The email address of the Harvester Administrator. Harvester will send                           |
|                                    | email reports to this address after every harvest. Enter multiple email addresses by separating |
|                                    | each address with a comma or semicolon (e.g., name1@abc.edu,name2@abc.edu).                     |
|                                    |                                                                                                 |
|                                    | Values: An email address, or multiple email addresses separated by commas or semi-colons        |
+------------------------------------+-------------------------------------------------------------------------------------------------+
| logPeriod                          | The number of days to retain Harvester log entries. Harvester log entries                       |
|                                    | record information such as which documents were harvested, from which sites,                    |
|                                    | and whether any errors were encountered during the harvest. Log entries older                   |
|                                    | than logPeriod number of days are purged from the database at the end of each harvest.          |
|                                    |                                                                                                 |
|                                    | Default: 90                                                                                     |
+------------------------------------+-------------------------------------------------------------------------------------------------+
| maxHarvests                        | The maximum number of harvests that Harvester should execute before                             |
|                                    | shutting down. If the value of maxHarvests is set to 0 or a                                     |
|                                    | negative number, Harvester will execute indefinitely.                                           |
|                                    |                                                                                                 |
|                                    | Default: 0                                                                                      |
+------------------------------------+-------------------------------------------------------------------------------------------------+
| period                             | The number of hours between harvests. Harvester will run a new harvest                          |
|                                    | every specified period of hours (either indefinitely or until the maximum                       |
|                                    | number of harvests have run, depending on the value of maxHarvests).                            |
|                                    |                                                                                                 |
|                                    | Default: 24                                                                                     |
+------------------------------------+-------------------------------------------------------------------------------------------------+
| smtpServer                         | The SMTP server that Harvester uses for sending email messages to the                           |
|                                    | Harvester Administrator and Site Contacts                                                       |
|                                    | (e.g., somehost.institution.edu). Note that the default value only works                        |
|                                    | if the Harvester host machine is configured as a SMTP server.                                   |
|                                    |                                                                                                 |
|                                    | Default: localhost                                                                              |
+------------------------------------+-------------------------------------------------------------------------------------------------+
| Harvester Operation Properties     | The Harvester Operation properties are used by Harvester to report information                  |
| (GetDocError, GetDocSuccess, etc.) | about performed operations for inclusion in log entries and email messages.                     |
|                                    | Under most circumstances the values of these properties should not be modified.                 |
+------------------------------------+-------------------------------------------------------------------------------------------------+
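
The way ``delay``, ``period``, and ``maxHarvests`` interact can be illustrated
with a short Python sketch. This is not Harvester code: the function name
``harvest_times`` is hypothetical, and the sketch assumes that the first harvest
starts ``delay`` hours after startup and that each later harvest starts
``period`` hours after the previous one::

  from datetime import datetime, timedelta
  from itertools import count

  def harvest_times(start, delay=0, period=24, max_harvests=0):
      """Yield planned harvest start times (illustration only).

      The first harvest begins `delay` hours after `start`; later ones
      follow every `period` hours. A `max_harvests` of 0 or less means
      "run indefinitely", mirroring the maxHarvests description.
      """
      first = start + timedelta(hours=delay)
      numbers = count() if max_harvests <= 0 else range(max_harvests)
      for n in numbers:
          yield first + timedelta(hours=n * period)

  # Harvester started at 1:00 p.m. with delay=12, period=24, maxHarvests=3
  started = datetime(2024, 1, 1, 13, 0)
  for when in harvest_times(started, delay=12, period=24, max_harvests=3):
      print(when.isoformat(sep=" "))

With these example values the first planned harvest falls at 1:00 a.m. on the
following day, matching the ``delay`` example in the table.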

Configuring a Harvest Site (Instructions for Site Contact)
----------------------------------------------------------

After Metacat's Harvester has been configured, remote sites can register and
send information about which files should be retrieved. Each remote site must
have a site contact who is responsible for registering the site and creating a
list of EML files to harvest (the "Harvest List"), as well as for reviewing
harvest reports. The site contact can unregister the site from the Harvester
at any time.

To use Harvester:

1. Register with Harvester
2. Compose a Harvest List (you will likely wish to use the Harvest List Editor)
3. Prepare your EML Documents for Harvest
4. Review the Harvester Reports

Register with Harvester
~~~~~~~~~~~~~~~~~~~~~~~

To register a remote site with Harvester, the Site Contact should log in to
Metacat's Harvester Registration page and enter information about the site and
how it should be harvested.

1. Using a Web browser, log in to Metacat's Harvester Registration page.
   The Harvester Registration page is inside the skins directory. For example,
   if the Metacat server that you wish to register with resides at the following URL:

   ::

     http://somehost.somelocation.edu:8080/knb/index.jsp

   then the Harvester Registration page would be accessed at:

   ::

     http://somehost.somelocation.edu:8080/knb/style/skins/knb/harvesterRegistrationLogin.jsp

.. figure:: images/screenshots/image065.jpg
   :align: center

   Metacat's Harvester Registration page.

2. Enter your Metacat account information and click Submit to log in to
   Metacat from the Harvester Registration page.

   Note: In some cases, you may need to log in to an anonymous "site" account
   rather than your personal account so that the registered data will not appear
   to have been registered by a single user. For example, an information
   manager (jones) who is registering data created by a team of scientists
   (jones, smith, and barney) from the Georgia Coastal Ecosystems site might
   log in to a dedicated account (named with the site's acronym, "GCE") to
   indicate that the registered data is from the entire site rather than "jones".

3. Enter information about your site and how often you want to schedule harvests
   and then click the Register button (Figure 7.2). The Harvest List URL should
   point to the location of the Harvest List, which is an XML file that lists
   the documents to harvest. If you do not yet have a Harvest List, please see
   the next section for more information about creating one.

.. figure:: images/screenshots/image067.jpg
   :align: center

   Enter information about your site and how often you want to schedule harvests.

The example settings in the previous figure instruct Harvester to harvest
documents from the site once every two weeks. The Harvester will access the
site's Harvest List at URL "http://somehost.institution.edu/~myname/harvestList.xml",
and will send email reports to the Site Contact at email address
"myname@institution.edu". Note that you can enter multiple email addresses by
separating each address with a comma or a semicolon. For example,
"myname@institution.edu,anothername@institution.edu".

Compose a Harvest List (The Harvest List Editor)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Harvest List is an XML file that contains a list of documents to be harvested.
The list is created by the site contact and stored on the site contact's site
at the location specified during the Harvester registration process (see
previous section for details). The list can be generated by hand, or you can
use Metacat's Harvest List Editor to automatically generate and structure the
list to conform to the required XML schema (displayed at the end of
this section). In this section we will look at what information is required when
building a Harvest List, and how to configure and use the Harvest List Editor.
Note that you must have a source distribution of Metacat in order to use the
Harvest List Editor.

The Harvest List contains information that helps Metacat identify and retrieve
each specified EML file. Each document in the list must be described with a
docid, documentType, and documentURL (see table).

Table: Information that must be included in the Harvest List about each EML file

+--------------+-------------------------------------------------------------------------------------------------+
| Item         | Description                                                                                     |
+==============+=================================================================================================+
| docid        | The docid uniquely identifies each EML document. Each docid consists of three elements:         |
|              |                                                                                                 |
|              | * ``scope``: The document group to which the document belongs.                                  |
|              | * ``identifier``: A number that uniquely identifies the document within the scope.              |
|              | * ``revision``: A number that indicates the current revision.                                   |
|              |                                                                                                 |
|              | For example, a valid docid could be: demoDocument.1.5, where demoDocument represents            |
|              | the scope, 1 the identifier, and 5 the revision number.                                         |
+--------------+-------------------------------------------------------------------------------------------------+
| documentType | The documentType identifies the type of document as EML,                                        |
|              | e.g., "eml://ecoinformatics.org/eml-2.0.0".                                                     |
+--------------+-------------------------------------------------------------------------------------------------+
| documentURL  | The documentURL specifies a place where Harvester can locate and retrieve the                   |
|              | document via HTTP. The Metacat Harvester must be given read access to the contents at this URL. |
|              | e.g. "http://www.lternet.edu/~dcosta/document1.xml".                                            |
+--------------+-------------------------------------------------------------------------------------------------+
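
The three-part docid structure can be made concrete with a small Python
sketch. The names ``DocId`` and ``parse_docid`` are hypothetical helpers for
illustration, not part of Metacat, and splitting from the right is an
assumption made so that a scope containing dots would stay intact::

  from typing import NamedTuple

  class DocId(NamedTuple):
      scope: str
      identifier: int
      revision: int

  def parse_docid(docid: str) -> DocId:
      """Split 'scope.identifier.revision' into its three parts."""
      # rsplit raises ValueError on unpacking if fewer than three parts exist.
      scope, identifier, revision = docid.rsplit(".", 2)
      if not scope:
          raise ValueError(f"missing scope in docid: {docid!r}")
      # int() rejects non-numeric identifiers and revisions.
      return DocId(scope, int(identifier), int(revision))

  print(parse_docid("demoDocument.1.5"))

For the docid from the table, the sketch yields the scope ``demoDocument``,
the identifier 1, and the revision 5.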

The example Harvest List below contains two ``<document>`` elements that specify the
information that Harvester needs to retrieve a pair of EML documents and
upload them to Metacat.

::

  <?xml version="1.0" encoding="UTF-8" ?>
  <!-- Example Harvest List -->
  <hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList" >
    <document>
        <docid>
            <scope>demoDocument</scope>
            <identifier>1</identifier>
            <revision>5</revision>
        </docid>
        <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
        <documentURL>http://www.lternet.edu/~dcosta/document1.xml</documentURL>
    </document>
    <document>
        <docid>
            <scope>demoDocument</scope>
            <identifier>2</identifier>
            <revision>1</revision>
        </docid>
        <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
        <documentURL>http://www.lternet.edu/~dcosta/document2.xml</documentURL>
    </document>
  </hrv:harvestList>
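
A site that keeps its document inventory elsewhere could also generate the
list with a script. The following Python sketch uses only the standard
library; the function name ``build_harvest_list`` and the tuple layout of its
input are assumptions made for illustration, and the output mirrors the
structure of the example above (namespace-qualified root, unqualified
children)::

  import xml.etree.ElementTree as ET

  HRV_NS = "eml://ecoinformatics.org/harvestList"
  EML_TYPE = "eml://ecoinformatics.org/eml-2.0.0"

  def build_harvest_list(documents):
      """Return Harvest List XML (bytes) for (scope, id, revision, url) tuples."""
      ET.register_namespace("hrv", HRV_NS)
      # Only the root element is namespace-qualified, as in the example list.
      root = ET.Element(f"{{{HRV_NS}}}harvestList")
      for scope, identifier, revision, url in documents:
          document = ET.SubElement(root, "document")
          docid = ET.SubElement(document, "docid")
          ET.SubElement(docid, "scope").text = scope
          ET.SubElement(docid, "identifier").text = str(identifier)
          ET.SubElement(docid, "revision").text = str(revision)
          ET.SubElement(document, "documentType").text = EML_TYPE
          ET.SubElement(document, "documentURL").text = url
      if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
          ET.indent(root, space="    ")
      return ET.tostring(root, encoding="UTF-8", xml_declaration=True)

  xml_bytes = build_harvest_list([
      ("demoDocument", 1, 5, "http://www.lternet.edu/~dcosta/document1.xml"),
      ("demoDocument", 2, 1, "http://www.lternet.edu/~dcosta/document2.xml"),
  ])
  print(xml_bytes.decode("utf-8"))

The returned bytes can be written to the file that the registered Harvest
List URL points to.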

Rather than formatting the list by hand, you may wish to use Metacat's Harvest
List Editor to compose and edit it. The Harvest List Editor displays a Harvest
List as a table of rows and fields. Each table row corresponds to
a single ``<document>`` element in the corresponding Harvest List file (i.e., one
EML document). The row numbers are used only for visual reference and are
not editable.

To add a new document to the Harvest List, enter values for all five editable
fields (all fields except the "Row #" field). Partially filled-in rows will
cause errors that will result in an invalid Harvest List.

The buttons at the bottom of the Editor can be used to Cut, Copy, and Paste
rows from one location to another. Select a row and click the desired button,
or paste the default values (which are specified in the Editor's configuration
file, discussed later in this section) into the currently selected row by
clicking the Paste Defaults button. Note: Only one row can be selected at any
given time: all cut, copy, and paste operations work on only a single row
rather than on a range of rows.

To run the Harvest List Editor from a terminal on the machine where the Metacat
source code is installed:

1. Open a system command window or terminal window.
2. Set the METACAT_HOME environment variable to the value of the Metacat
   installation directory. Some examples follow:

   On Windows:

   ::

     set METACAT_HOME=C:\somePath\knb

   On Linux/Unix (bash shell):

   ::

     export METACAT_HOME=/home/somePath/metacat

3. cd to the following directory:

   On Windows:

   ::

     cd %METACAT_HOME%\lib\harvester

   On Linux/Unix:

   ::

     cd $METACAT_HOME/lib/harvester

4. Run the appropriate Harvester shell script, as determined by the operating system:

   On Windows:

   ::

     runHarvestListEditor.bat

   On Linux/Unix:

   ::

     sh runHarvestListEditor.sh

   The Harvest List Editor will open.

If you would like to customize the Harvest List Editor (e.g., specify a
default list to open automatically whenever the editor is opened and/or
default values), create a file called .harvestListEditor (note the leading
dot character). Use a plain text editor to create the file and place the file
in the Site Contact's home directory. To determine the home directory, open a
system command window or terminal window and type the following:

On Windows:

::

  echo %USERPROFILE%

On Linux/Unix:

::

  echo $HOME

The configuration file contains a number of optional properties that can make
using the Editor more convenient. A sample configuration file is displayed
below, and each configuration property is described in the table that follows.

A sample .harvestListEditor configuration file

::

  defaultHarvestList=C:/temp/harvestList.xml
  defaultScope=demo_document
  defaultIdentifier=1
  defaultRevision=1
  defaultDocumentURL=http://www.lternet.edu/~dcosta/
  defaultDocumentType=eml://ecoinformatics.org/eml-2.0.0

Harvest List Editor Configuration Properties

+---------------------+----------------------------------------------------------------------------------------------+
| Property            | Description                                                                                  |
+=====================+==============================================================================================+
| defaultHarvestList  | The location of a Harvest List file that the Editor will                                     |
|                     | automatically open for editing on startup. Set this property                                 |
|                     | to the path to the Harvest List file that you expect to edit most frequently.                |
|                     |                                                                                              |
|                     | Examples:                                                                                    |
|                     | ``/home/jdoe/public_html/harvestList.xml``                                                   |
|                     | ``C:/temp/harvestList.xml``                                                                  |
+---------------------+----------------------------------------------------------------------------------------------+
| defaultScope        | The value pasted into the Editor's Scope field when the Paste                                |
|                     | Defaults button is clicked. The Scope field should contain                                   |
|                     | a symbolic identifier that indicates the family of documents                                 |
|                     | to which the EML document belongs.                                                           |
|                     |                                                                                              |
|                     | Example:   xyz_dataset                                                                       |
|                     | Default:    dataset                                                                          |
+---------------------+----------------------------------------------------------------------------------------------+
| defaultIdentifier   | The value pasted into the Editor's Identifier field when the                                 |
|                     | Paste Defaults button is clicked. The Identifier field should contain                        |
|                     | a numeric value indicating the identifier for this particular EML document within the Scope. |
+---------------------+----------------------------------------------------------------------------------------------+
| defaultRevision     | The value pasted into the Editor's Revision field when the Paste Defaults button             |
|                     | is clicked. The Revision field should contain a numeric value indicating the                 |
|                     | revision number of this EML document within the Scope and Identifier.                        |
|                     |                                                                                              |
|                     | Example:   2                                                                                 |
|                     | Default:    1                                                                                |
+---------------------+----------------------------------------------------------------------------------------------+
| defaultDocumentType | The document type specification pasted into the                                              |
|                     | Editor's DocumentType field when the Paste Defaults button is clicked.                       |
|                     |                                                                                              |
|                     | Default: ``eml://ecoinformatics.org/eml-2.0.0``                                              |
+---------------------+----------------------------------------------------------------------------------------------+
| defaultDocumentURL  | The URL or partial URL pasted into the Editor's URL field                                    |
|                     | when the Paste Defaults button is clicked. Typically, this                                   |
|                     | value is set to the portion of the URL shared by all harvested EML documents.                |
|                     |                                                                                              |
|                     | Example:                                                                                     |
|                     | ``http://somehost.institution.edu/somepath/``                                                |
|                     | Default: ``http://``                                                                         |
+---------------------+----------------------------------------------------------------------------------------------+
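
How the documented defaults combine with the values in a configuration file
can be sketched in Python. This is an illustration only, not the Editor's
actual loading code: the function names are hypothetical, and the parser
assumes plain ``key=value`` lines like those in the sample file rather than
any richer properties syntax::

  from pathlib import Path

  # Defaults copied from the table above; properties without a documented
  # default are absent unless the file sets them.
  DOCUMENTED_DEFAULTS = {
      "defaultScope": "dataset",
      "defaultRevision": "1",
      "defaultDocumentType": "eml://ecoinformatics.org/eml-2.0.0",
      "defaultDocumentURL": "http://",
  }

  def parse_editor_config(text):
      """Parse simple key=value lines, ignoring blanks and # comments."""
      settings = dict(DOCUMENTED_DEFAULTS)
      for line in text.splitlines():
          line = line.strip()
          if not line or line.startswith("#") or "=" not in line:
              continue
          key, value = line.split("=", 1)
          settings[key.strip()] = value.strip()
      return settings

  def load_editor_config(home=None):
      """Read .harvestListEditor from the home directory if it exists."""
      path = Path(home or Path.home()) / ".harvestListEditor"
      text = path.read_text(encoding="utf-8") if path.is_file() else ""
      return parse_editor_config(text)

  sample = "defaultScope=demo_document\ndefaultIdentifier=1\n"
  print(parse_editor_config(sample))

Values set in the file override the documented defaults, and any property the
file omits keeps its default.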

XML Schema for Harvest Lists

::

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- edited with XMLSPY v5 rel. 4 U (http://www.xmlspy.com) by Matt Jones (NCEAS) -->
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:hrv="eml://ecoinformatics.org/harvestList" xmlns="eml://ecoinformatics.org/harvestList" targetNamespace="eml://ecoinformatics.org/harvestList" elementFormDefault="unqualified" attributeFormDefault="unqualified">
  <xs:annotation>
    <xs:documentation>This module defines the required information for the harvester to collect documents from the local site. The local system containing this document must give the Metacat Harvester read access to this document.</xs:documentation>
  </xs:annotation>
  <xs:annotation>
    <xs:appinfo>
      <tooltip/>
      <summary/>
      <description/>
    </xs:appinfo>
  </xs:annotation>
  <xs:element name="harvestList">
    <xs:annotation>
      <xs:documentation>This represents the local document information that is used to inform the Harvester of the docid, document type, and location of the document to be harvested.</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element name="document" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="docid">
                <xs:annotation>
                  <xs:documentation>The complete document identifier to be used by metacat.  The docid is a compound element that gives a scope for the identifier, an integer local identifier that is unique within that scope, and a revision.  Each revision is assumed to specify a unique, non-changing document, so once a particular revision is harvested, there is no need for it to be harvested again.  To trigger a harvest of a document that has been updated, increment the revision number for that identifier.</xs:documentation>
                </xs:annotation>
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="scope" type="xs:string">
                      <xs:annotation>
                        <xs:documentation>The system prefix of a metacat docid that defines the scope within which the identifier is unique.</xs:documentation>
                      </xs:annotation>
                    </xs:element>
                    <xs:element name="identifier" type="xs:long">
                      <xs:annotation>
                        <xs:documentation>The local (site specific) portion of the identifier (docid) that is unique within the context of the scope.</xs:documentation>
                      </xs:annotation>
                    </xs:element>
                    <xs:element name="revision" type="xs:long">
                      <xs:annotation>
                        <xs:documentation>The revision identifier for this document, indicating a unique document version.</xs:documentation>
                      </xs:annotation>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
              <xs:element name="documentType" type="xs:string">
                <xs:annotation>
                  <xs:documentation>The type of document to be harvested, indicated by a namespace string, formal public identifier, mime type, or other type indicator.   </xs:documentation>
                </xs:annotation>
              </xs:element>
              <xs:element name="documentURL" type="xs:anyURI">
                <xs:annotation>
                  <xs:documentation>The documentURL field contains the URL of the document to be harvested. The Metacat Harvester must be given read access to the contents at this URL.</xs:documentation>
                </xs:annotation>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  </xs:schema>

Prepare EML Documents for Harvest
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To prepare a set of EML documents for harvest, ensure that the following is true for each document:

* The document contains valid EML
* The document is specified in a ``<document>`` element in the site's Harvest List
* The file resides at the location specified by its URL in the Harvest List
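
Part of this checklist can be automated before a harvest runs. The Python
sketch below is a hypothetical pre-flight check, not a Metacat tool: the
function name ``check_harvest_list`` and the ``fetch`` callback are
assumptions, and it tests only that each entry is complete and that the
retrieved file is well-formed XML. It does not validate the file against the
EML schema::

  import xml.etree.ElementTree as ET

  REQUIRED = ("docid/scope", "docid/identifier", "docid/revision",
              "documentType", "documentURL")

  def check_harvest_list(xml_text, fetch=None):
      """Return a list of problems found in a Harvest List.

      `fetch` is an optional callable that takes a URL and returns the
      document bytes; pass one to also confirm each file is retrievable.
      """
      problems = []
      root = ET.fromstring(xml_text)
      for number, document in enumerate(root.findall("document"), start=1):
          for path in REQUIRED:
              if not (document.findtext(path) or "").strip():
                  problems.append(f"document {number}: missing {path}")
          url = (document.findtext("documentURL") or "").strip()
          if fetch is not None and url:
              try:
                  ET.fromstring(fetch(url))  # well-formedness only
              except Exception as error:
                  problems.append(f"document {number}: {url}: {error}")
      return problems

An empty result means no problems were found. A ``fetch`` callback built on
``urllib.request.urlopen`` would be one way to retrieve each document over
HTTP, much as Harvester itself must be able to do.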

Review Harvester Reports
~~~~~~~~~~~~~~~~~~~~~~~~
Harvester sends an email report to the Site Contact after every scheduled site
harvest. The report contains information about the performed operations, such
as which EML documents were harvested and whether any errors were encountered.
Errors are indicated by operations that display a status value of 1; a status
value of 0 indicates that the operation completed successfully.

When errors are reported, the Site Contact should try to determine whether the
source of the error is something that can be corrected at the site. Common
causes of errors include:

* a document URL specified in the Harvest List does not match the location of the actual EML file on the disk
* the Harvest List does not contain valid XML as specified in the harvestList.xsd schema
* the URL to the Harvest List (specified during registration) does not match the actual location of the Harvest List on the disk
* an EML document that Harvester attempted to upload to Metacat does not contain valid EML
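
Some of these causes can be caught at the site before the next harvest. The following Python sketch is an illustration only, under stated assumptions: it checks that a Harvest List is well-formed XML and that every ``documentURL`` is an absolute URL. It does not validate against ``harvestList.xsd`` (the Python standard library has no XML Schema validator), and it does not fetch the URLs. ``precheck_harvest_list`` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def precheck_harvest_list(xml_text):
    """Return a list of problems found; an empty list means none were found."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Malformed XML cannot be inspected any further.
        return [f"Harvest List is not well-formed XML: {err}"]

    problems = []
    # Match on the local name so the check works with or without a namespace.
    url_elements = [
        el for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "documentURL"
    ]
    if not url_elements:
        problems.append("Harvest List contains no documentURL elements")
    for el in url_elements:
        url = (el.text or "").strip()
        # A URL without a scheme (http, https, ...) is relative or empty.
        if not urlparse(url).scheme:
            problems.append(f"documentURL is not an absolute URL: {url!r}")
    return problems
```

An empty result does not guarantee a successful harvest. It only rules out the checks performed here.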

If the Site Contact is unable to determine the cause of the error and its
resolution, he or she should contact the Harvester Administrator for assistance.

Unregister with Harvester
~~~~~~~~~~~~~~~~~~~~~~~~~
To discontinue harvests, the Site Contact must unregister with Harvester.
To unregister:

1. Using a Web browser, log in to Metacat's Harvester Registration page.
   The Harvester Registration page is inside the skins directory. For example,
   if the Metacat server that you registered with resides at the
   following URL:

   ::

     http://somehost.somelocation.edu:8080/knb/index.jsp

   then the Harvester Registration page would be accessed at:

   ::

     http://somehost.somelocation.edu:8080/knb/style/skins/knb/harvesterRegistrationLogin.html

2. Enter and submit your Metacat account information. On the subsequent screen,
   click Unregister to remove your site and discontinue harvests.

Running Harvester
-----------------
The Harvester can be run as a servlet or in a command window. Under most
circumstances, Harvester is best run continuously as a background servlet
process. However, if you expect to use Harvester infrequently, or if you wish
only to test that Harvester is functioning, it may be desirable to run it from
a command window.

Running Harvester as a Servlet
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To run Harvester as a servlet (from a source code installation):

1. Remove the comment symbols around the HarvesterServlet entry in the source
   code. The HarvesterServlet entry is located in the ``lib/web.xml.tomcatN``
   file, where ``tomcatN`` corresponds to the version of Tomcat you are running.
   For example, if you are running Tomcat 6, edit the file
   ``lib/web.xml.tomcat6``.

   ::

     <!--
     <servlet>
       <servlet-name>HarvesterServlet</servlet-name>
       <servlet-class>edu.ucsb.nceas.metacat.harvesterClient.HarvesterServlet</servlet-class>
       <init-param>
       <param-name>debug</param-name>
       <param-value>1</param-value>
       </init-param>
       <init-param>
       <param-name>listings</param-name>
       <param-value>true</param-value>
       </init-param>
       <load-on-startup>1</load-on-startup>
     </servlet>
     -->

2. Save the edited file.
3. Shut down Tomcat.
4. Redeploy Metacat by running the following two Ant commands from the
   top-level directory of your Metacat installation:

   ::

     ant cleanweb
     ant install

5. Restart Tomcat. Note that you will have to edit the ``metacat.properties``
   file to specify harvester settings.

About thirty seconds after you restart Tomcat, the Harvester servlet will
start executing. The first harvest will occur after the number of hours
specified in the ``metacat.properties`` file. The servlet will continue running
new harvests until the maximum number of harvests has been completed, or until
Tomcat shuts down (harvest frequency and maximum number of harvests are also
set in the Harvester properties).
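
To see how the initial delay, harvest frequency, and maximum number of harvests interact, the following Python sketch computes the expected harvest times. It is an illustration only: the parameter names ``delay_hours``, ``period_hours``, and ``max_harvests`` are hypothetical stand-ins for the corresponding Harvester properties, and it assumes that harvests are evenly spaced from the first one. It ignores the short servlet start-up delay.

```python
from datetime import datetime, timedelta


def harvest_times(start, delay_hours, period_hours, max_harvests):
    """Return the expected start time of each harvest.

    start         -- when Harvester began executing
    delay_hours   -- hours to wait before the first harvest (hypothetical name)
    period_hours  -- hours between consecutive harvests (hypothetical name)
    max_harvests  -- number of harvests to run before stopping (hypothetical name)
    """
    first = start + timedelta(hours=delay_hours)
    return [first + timedelta(hours=period_hours * i) for i in range(max_harvests)]


# Example: start at midnight, wait 1 hour, then harvest daily, 3 times in total.
for when in harvest_times(datetime(2024, 1, 1), 1, 24, 3):
    print(when.isoformat())
```

A schedule that ends sooner than expected usually means that the maximum number of harvests was reached or that Tomcat was shut down.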

Running Harvester in a Command Window
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To run Harvester in a Command Window:

1. Open a system command window or terminal window.
2. Set the ``METACAT_HOME`` environment variable to the value of the
   Metacat installation directory.

   On Windows:

   ::

     set METACAT_HOME=C:\somePath\metacat

   On Linux/Unix (bash shell):

   ::

     export METACAT_HOME=/home/somePath/metacat

3. Change (``cd``) to the following directory:

   On Windows:

   ::

     cd %METACAT_HOME%\lib\harvester

   On Linux/Unix:

   ::

     cd $METACAT_HOME/lib/harvester

4. Run the Harvester script for your operating system:

   On Windows:

   ::

     runHarvester.bat

   On Linux/Unix:

   ::

     sh runHarvester.sh

The Harvester application will start executing. The first harvest will occur
after the number of hours specified in the ``metacat.properties`` file. The
application will continue running new harvests until the maximum number of
harvests has been completed, or until you interrupt the process by pressing
Ctrl+C in the command window (harvest frequency and maximum number of harvests
are also set in the Harvester properties).
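
The command-window steps above can also be wrapped in a small launcher. The following Python sketch is an illustration, not a script shipped with Metacat: it sets ``METACAT_HOME``, uses ``lib/harvester`` as the working directory, and chooses the script by operating system. The function names ``harvester_command`` and ``run_harvester`` are hypothetical.

```python
import os
import subprocess


def harvester_command(metacat_home, windows):
    """Return (working_directory, argv) for launching Harvester."""
    sep = "\\" if windows else "/"
    # Equivalent of step 3: METACAT_HOME/lib/harvester.
    workdir = sep.join([metacat_home.rstrip("/\\"), "lib", "harvester"])
    # Equivalent of step 4: pick the script for the operating system.
    if windows:
        argv = ["cmd", "/c", "runHarvester.bat"]
    else:
        argv = ["sh", "runHarvester.sh"]
    return workdir, argv


def run_harvester(metacat_home):
    """Launch Harvester and wait until it exits or is interrupted."""
    workdir, argv = harvester_command(metacat_home, windows=(os.name == "nt"))
    # Equivalent of step 2: export METACAT_HOME to the child process.
    env = dict(os.environ, METACAT_HOME=metacat_home)
    return subprocess.run(argv, cwd=workdir, env=env, check=True)


if __name__ == "__main__":
    print(harvester_command("/home/somePath/metacat", windows=False))
```

Calling ``run_harvester("/home/somePath/metacat")`` blocks until Harvester finishes its final harvest or the process is interrupted, just as when the script is started by hand.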

Reviewing Harvest Reports
-------------------------
Harvester sends an email report to the Harvester Administrator after every
harvest. The report contains information about the performed operations, such
as which sites and EML documents were harvested and whether any errors were
encountered. Errors are indicated by operations that display a status value
of 1; a status value of 0 indicates that the operation completed successfully.

The Harvester Administrator should review the report, paying particularly
close attention to any reported errors and accompanying error messages. When
errors are reported at a particular site, the Harvester Administrator should
contact the Site Contact to determine the source of the error and its
resolution. Common causes of errors include:

* a document URL specified in the Harvest List does not match the location of the actual EML file on the disk
* the Harvest List does not contain valid XML as specified in the harvestList.xsd schema
* the URL to the Harvest List (specified during registration) does not match the actual location of the Harvest List on the disk
* an EML document that Harvester attempted to upload to Metacat does not contain valid EML

Errors that are independent of a particular site may indicate a problem with
Harvester itself, Metacat, or the database connection. Refer to the error
message to determine the source of the error and its resolution.