<!--
  * harvester.html
  *
  *      Authors: Duane Costa
  *    Copyright: 2004 Regents of the University of California and the
  *               National Center for Ecological Analysis and Synthesis,
  *               and the University of New Mexico.
  *  For Details: http://www.nceas.ucsb.edu/
  *      Created: 2004 April 9
  *      Version: 
  *    File Info: '$ '
  * 
  * 
-->
<HTML>
<HEAD>
<TITLE>Metacat Harvester</TITLE>
<link rel="stylesheet" type="text/css" href="@docrooturl@default.css">
</HEAD> 
<BODY>
  <table width="100%">
    <tr>
      <td class="tablehead" colspan="2">
        <p class="label">Metacat Harvester</p>
      </td>
      <td class="tablehead" colspan="2" align="right">
        <a href="./properties.html">Back</a> | 
        <a href="./metacattour.html">Home</a> | 
        <a href="./unimplem.html">Next</a>
      </td>
    </tr>
  </table>
  <h4>Introduction</h4>
The Metacat Harvester (henceforth referred to as "Harvester") is a
program that automates the retrieval of EML documents from one or more sites
and their subsequent upload (insert or update) to Metacat. Harvester uses pull
technology to retrieve and upload documents to Metacat on a regularly
scheduled basis.
<P>
Although Harvester is included with a Metacat installation (beginning with
Metacat version 1.4.0), it is an optional extension to Metacat's
functionality.
</P>
  <h4>Definitions</h4>
The following table defines a number of terms that are useful in discussing
Harvester and its features.
  <br><br>
  <table border="1">
    <tr>
      <td><b>Term</b></td>
      <td><b>Definition</b></td>
    </tr>
    <tr>
      <td>Harvester</td>
      <td>The Harvester program, a Java application that is bundled with the
          Metacat distribution. When a user installs Metacat on a system, 
          the Harvester program is automatically included in the 
          installation.
      </td>
    </tr>
    <tr>
      <td>Harvester Administrator</td>
      <td>The individual who installs and manages Harvester. Typically, this
          would be the same individual who installs and manages Metacat at a
          given installation.
      </td>
    </tr>
    <tr>
      <td>Harvest Site</td>
      <td>A location from which Harvester can retrieve EML documents. A given 
          Harvester can retrieve documents from any number of different 
          Harvest Sites.
      </td>
    </tr>
    <tr>
      <td>Harvest</td>
      <td>The act (by Harvester) of visiting a Harvest Site, retrieving a
          number of EML documents, and inserting or updating the documents to 
          Metacat.
      </td>
    </tr>
    <tr>
      <td>Harvest List</td>
      <td>An XML document that lists a set of EML documents to be harvested. The
          Harvest List must conform to an XML Schema, 
          <a href="../../lib/harvester/harvestList.xsd">harvestList.xsd</a>.
      </td>
    </tr>
    <tr>
      <td>Site Contact</td>
      <td>The individual at a particular Harvest Site who registers with 
          Harvester, composes a Harvest List, and periodically prepares
          the site's EML documents for retrieval and upload to Metacat.
      </td>
    </tr>
    <tr>
      <td>Harvest List URL</td>
      <td>A URL to the Harvest List, as specified by the Site Contact. 
          Each Harvest Site corresponds to a Harvest List URL. Harvester 
          uses the URL to locate and read a site's Harvest List.
      </td>
    </tr>
    <tr>
      <td>Document URL</td>
      <td>A URL to an EML document, as specified in the Harvest List.
          The Harvest List may contain any number of Document URLs. Each
          Document URL provides a locator to a document to be harvested.
      </td>
    </tr>
    <tr>
      <td>Harvester Registration Page</td>
      <td>A web page that provides a means for a Site Contact
          to register with Harvester to schedule regular harvests from the
          site. Registration involves logging in and then specifying various
          settings for the Harvest Site, such as the Harvest List URL, the 
          harvest frequency, and the email address of the Site Contact.
      </td>
    </tr>
  </table>
  <h4>Managing Harvester</h4>
  Harvester is managed by the Harvester Administrator. Typically, the same
  individual who manages a Metacat server would also act as the Harvester
  Administrator. The responsibilities of the Harvester Administrator include:
    <ul>
      <li><a href="#Configuring Harvester">Configuring Harvester</a></li>
      <li><a href="#Running Harvester">Running Harvester</a></li>
      <li><a href="#Reviewing Harvester">Reviewing Harvester reports to 
      the Harvester Administrator</a></li>
    </ul>
  <h5><a name="Configuring Harvester">Configuring Harvester</a></h5>
  <p>Harvester must be configured to interact with a working Metacat
     installation. Thus, a Metacat installation that has been properly
     configured and installed is a prerequisite to running Harvester.
     Additionally, Harvester has a number of settable properties that
     control its behavior. All Harvester configuration information is managed 
     in a single file, 
     <a href=../../lib/harvester/harvester.properties>harvester.properties</a>, 
     located at:
  <pre>      METACAT_HOME/lib/harvester/harvester.properties</pre>
     where METACAT_HOME is the top-level directory in which Metacat is 
     installed.
  </p>
  <p>The Harvester Administrator should edit 
     <a href=../../lib/harvester/harvester.properties>harvester.properties</a>, 
     setting appropriate values for the Metacat URL, database driver,
     database connection, and other settings. The 
     following table summarizes each property and its function.
  </p>
  <table border="1">
    <tr>
      <td><b>Property</b></td>
      <td><b>Description</b></td>
      <td><b>Possible or default value</b></td>
    </tr>
    <tr>
      <td>connectToMetacat</td>
      <td>This property determines whether Harvester should connect to
          Metacat to upload documents. It should be set to <code>true</code>
          under most circumstances. Setting this property to <code>false</code>
          can be useful for testing whether Harvester is able to retrieve 
          documents from a site without actually connecting to Metacat to 
          upload the documents.</td>
      <td><code>true</code> | <code>false</code><br>
          Default: <code>true</code>
      </td>
    </tr>
    <tr>
      <td>dbDriver</td>
      <td>The JDBC driver to be used to access the backend database. This
          setting should match the value of the dbDriver property as set 
          in the <a href=../../build.xml>build.xml</a> file as appropriate 
          to the database being used (Oracle, PostgreSQL, or SQL Server).
      </td>
      <td>Examples:<br>
          <code>oracle.jdbc.driver.OracleDriver</code><br>
          <code>org.postgresql.Driver</code><br>
          <code>com.microsoft.jdbc.sqlserver.SQLServerDriver</code>
      </td>
    </tr>
    <tr>
      <td>defaultDB</td>
      <td>The JDBC connection string that Metacat uses to connect to the 
          backend database. This setting should match the value of
          the <code>jdbc-connect</code> property as set in the
          <a href=../../build.properties>build.properties</a>
          file in the associated Metacat installation.</td>
      <td>Example:<br>
          <code>jdbc:oracle:thin:@server.domain.com:1521:Metacat</code></td>
    </tr>
    <tr>
      <td>delay</td>
      <td>The number of hours that Harvester will wait before beginning its
          first harvest. For example, if Harvester is run at 1:00 p.m., and
          the delay is set to 12, Harvester will begin its first harvest at 
          1:00 a.m. the following day.</td>
      <td>Default: 0</td>
    </tr>
    <tr>
      <td>harvesterAdministrator</td>
      <td>The email address of the Harvester Administrator. Harvester will
          send email reports to this address after every harvest.
      </td>
      <td>An email address</td>
    </tr>
    <tr>
      <td>logPeriod</td>
      <td>The number of days that Harvester should retain log entries of harvest
          operations in the database. Harvester log entries record information
          such as which documents were harvested, from which sites, and
          whether any errors were encountered during the harvest. Log entries
          older than <code>logPeriod</code> days are purged from the 
          database at the end of each harvest.</td>
      <td>Default: 90</td>
    </tr>
    <tr>
      <td>maxHarvests</td>
      <td>The maximum number of harvests that Harvester should execute before
          shutting down. When the Harvester program is executed, it will
          continue running until it has executed <code>maxHarvests</code>
          harvests, and then the program will terminate.</td>
      <td>Default: 30</td>
    </tr>
    <tr>
      <td>metacatURL</td>
      <td>The URL of the Metacat servlet to which Harvester should connect
          for uploading documents.</td>
      <td>Example:<br>
               http://somehost.institution.edu:8080/knb/servlet/metacat</td>
    </tr>
    <tr>
      <td>password</td>
      <td>The password that Harvester uses to access the backend database.
          This setting should match the value of the <code>password</code>
          property as set in the 
          <a href=../../build.properties>build.properties</a>
          file in the associated Metacat installation.
      </td>
      <td>&nbsp;</td>
    </tr>
    <tr>
      <td>period</td>
      <td>The number of hours between harvests. Harvester will run a new
          harvest every <code>period</code> hours, until 
          <code>maxHarvests</code> harvests have been run.</td>
      <td>Default: 24</td>
    </tr>
    <tr>
      <td>smtpServer</td>
      <td>The SMTP server that Harvester uses for sending email messages
          to the Harvester Administrator and to Site Contacts.</td>
      <td>A host name, for example: <code>somehost.institution.edu</code>
          <br><br>
          Default: <code>localhost</code>
          <br><br>
          Note that the default value will only work if the Harvester 
          host machine has been configured as an SMTP server.
      </td>
    </tr>
    <tr>
      <td>user</td>
      <td>The username that Metacat uses to access the backend database.
          This setting should match the <code>user</code> value as set in the 
          <a href=../../build.properties>build.properties</a>
          file in the associated Metacat installation.
      </td>
      <td>&nbsp;</td>
    </tr>
    <tr>
      <td>Harvester Operation Properties (GetDocError, GetDocSuccess, etc.)</td>
      <td>This group of properties is used by Harvester to report information
          about the operations it performs for inclusion in log
          entries and email messages. Under most circumstances the values 
          of these properties should not be modified.</td>
      <td>&nbsp;</td>
    </tr>
  </table>
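  <p>As an illustration only, the settings for a PostgreSQL-backed
     installation might look something like the fragment below. The property
     names are those summarized in the table above, and the fragment assumes
     the usual Java <code>key=value</code> properties syntax. Every value shown
     (host names, database name, credentials, email address) is a hypothetical
     placeholder, so this is a sketch to adapt rather than a file to copy
     verbatim.
  </p>
  <pre>
# Hypothetical example values; replace each one with your own settings
connectToMetacat=true
dbDriver=org.postgresql.Driver
defaultDB=jdbc:postgresql://dbhost.institution.edu/metacat
user=metacatuser
password=changeme
metacatURL=http://somehost.institution.edu:8080/knb/servlet/metacat
harvesterAdministrator=admin@institution.edu
smtpServer=localhost
# Wait 0 hours before the first harvest, then harvest every 24 hours
delay=0
period=24
# Stop after 30 harvests; purge log entries older than 90 days
maxHarvests=30
logPeriod=90
  </pre>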
  <br>
  <h5><a name="Running Harvester">Running Harvester</a></h5>
  After Harvester has been appropriately 
  <a href="#Configuring Harvester">configured</a>, 
  it can be run as follows:
  <ol>
  <li>Open a system command window or terminal window.</li>
  <li>Set the METACAT_HOME environment variable to the value of the Metacat
      installation directory. Some examples follow:
      <ul>
        <li>On Windows:
        <pre>set METACAT_HOME=C:\somePath\metacat</pre></li>
        <li>On Linux/Unix (bash shell):
        <pre>export METACAT_HOME=/home/somePath/metacat</pre></li>
      </ul>
  </li>
  <li>Change (<code>cd</code>) to the following directory:
      <ul>
        <li>On Windows:
        <pre>cd %METACAT_HOME%\lib\harvester</pre></li>
        <li>On Linux/Unix:
        <pre>cd $METACAT_HOME/lib/harvester</pre></li>
      </ul>
  </li>
  <li>Run the appropriate Harvester shell script, as determined by the
      operating system:
      <ul>
        <li>On Windows:
        <pre>runHarvester.bat</pre></li>
        <li>On Linux/Unix:
        <pre>sh runHarvester.sh</pre></li>
      </ul>
  </li>
  </ol>
  <p>The Harvester application will start executing. It will begin its first
  harvest after <code><b>delay</b></code> hours (as specified in the
  <a href=../../lib/harvester/harvester.properties>harvester.properties</a>
  file). The application will continue running a new harvest every
  <code><b>period</b></code> hours until <code><b>maxHarvests</b></code>
  harvests have been completed.
  </p>
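  <p>As a worked illustration of how these three settings interact, suppose
  Harvester is started on a Monday at 1:00 p.m. with the hypothetical values
  shown below. The arithmetic assumes that <code>period</code> is measured
  from the start of one harvest to the start of the next.
  </p>
  <pre>
delay=12
period=24
maxHarvests=3
  </pre>
  <p>Under those assumptions, the first harvest would begin on Tuesday at
  1:00 a.m. (12 hours after startup), the second on Wednesday at 1:00 a.m.,
  and the third on Thursday at 1:00 a.m., after which Harvester would
  terminate.
  </p>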
  <h5><a name="Reviewing Harvester">
  Reviewing Harvester Reports to the Harvester Administrator</a></h5>
  <P>
  After every harvest, Harvester will send an email report to the Harvester
  Administrator detailing the operations that were performed during the
  harvest. The report will contain information about each of the Harvest Sites
  that were harvested from, such as which EML documents were
  harvested and whether any errors were encountered.
  </P>
  <p>
  The harvest report will contain a list of log entries, where each log entry
  describes an operation that was performed by Harvester. Log entries that
  show a status value of 1 indicate that an error occurred during the
  operation, while those that show a status value of 0 indicate that the
  operation was completed successfully.
  </p>
  <P>The Harvester Administrator should review the report, paying particularly 
  close attention to any errors that are reported and to the accompanying error
  messages that are displayed. When errors are reported at
  a particular site, the Harvester Administrator should contact the Site
  Contact to determine the source of the error and its resolution. See 
  <a href=#Reviewing>Reviewing Harvester Reports to the Site Contact</a> for a
  description of common sources of errors at a Harvest Site.
  </P>
  <p>Errors that are independent of a particular site may indicate a problem 
  with Harvester itself, Metacat, or the database connection. Refer to the
  error message to determine the source of the error and its resolution.
  </p>
  <h4>Managing a Harvest Site</h4>
  A Harvest Site is managed by a Site Contact.
  The responsibilities of a Site Contact fall into the following categories:
    <ul>
      <li><a href=#Registering>Registering with Harvester</a></li>
      <li><a href=#Composing>Composing a Harvest List</a></li>
      <li><a href=#Preparing>Preparing EML Documents for harvest</a></li>
      <li><a href=#Reviewing>Reviewing Harvester reports to the Site Contact</a></li>
    </ul>
    <h5><a name="Registering">Registering with Harvester</a></h5>
  <p>
  A Site Contact registers a site with Harvester by logging in to the
  Harvester Registration page and entering several items of information
  that Harvester needs to know about the site.
  </p>
  <ol>
    <li>Logging in to the Harvester Registration Page
  <p>
  The Harvester Registration page is accessed from Metacat. For example, if
  the Metacat server that you wish to register with resides at the following 
  URL:
  <pre>  http://somehost.somelocation.edu:8080/knb/index.jsp</pre>
  then the Harvester Registration page would be accessed at:
  <pre>  http://somehost.somelocation.edu:8080/knb/style/skins/dev/harvesterRegistrationLogin.html</pre>
  </p>
  <p>
  After bringing up this page in your browser, log in to your Metacat account 
  by entering your username and password.
  The username should include the full LDAP specification, for example:
  <pre>
  Username:   uid=jdoe,o=lter,dc=ecoinformatics,dc=org
  Password:   *******
  </pre>
  In some cases, a Site Contact may need to log in to an anonymous account
  rather than his or her personal account. For example, an LTER Information 
  Manager may need to log in to a dedicated account, named with a three-letter 
  acronym, that has been set up for the LTER site. For example:
  <pre>
  Username:   uid=GCE,o=lter,dc=ecoinformatics,dc=org
  Password:   *******
  </pre>
  is the account login that would be used by the LTER Information Manager
  at the GCE (Georgia Coastal Ecosystems) site.
  </p>
    </li>
    <li>Registering with Harvester
  <p>
  After logging in, you will be presented with a web form that prompts you
  to enter information about your site and how often you want to schedule
  harvests at your site. For example:
  </p>
  <pre>
  Email address:            myname@institution.edu
  Harvest List URL:         http://somehost.institution.edu/~myname/harvestList.xml
  Harvest Frequency (1-99): 2
  Unit:                     ( ) day(s)    (*) week(s)   ( ) month(s)
  </pre>
  <p>
  After values have been entered for each of these fields, click the Register 
  button to register your site with Harvester.
  </p>
  <P>
  In the example shown above, Harvester will attempt to harvest documents from 
  the site once every 2 weeks. It will access the site's Harvest List at URL
  "http://somehost.institution.edu/~myname/harvestList.xml", and it will send
  email reports to the Site Contact at email address "myname@institution.edu".
  </P>
    </li>
    <li>Unregistering with Harvester
  <p>
  At any time after you have registered with Harvester, you may discontinue 
  harvests at your site by unregistering. Simply log in as described above and 
  then click the Unregister button. After you do so, Harvester will discontinue 
  harvests at the site.
  </p>
    </li>
  </ol>
  <h5><a name="Composing">Composing a Harvest List</a></h5>
  <p>
  A Harvest List is an XML file that holds a list of EML documents to be
  harvested. For each EML document in the list, the following information
  must be specified:
  <ul>
    <li><code>docid</code>, which consists of the:
      <ul>
        <li><code>scope</code>, e.g. "demoDocument". The scope is an identifier 
            that indicates which group of documents this document belongs to.
        </li>
        <li><code>identifier</code>, e.g. "1". The identifier is a number that 
            uniquely identifies this document within the scope.
        </li>
        <li><code>revision</code>, e.g. "5". The revision is a number that 
            indicates the current revision of this document.
        </li>
      </ul>
    </li>
    <li><code>documentType</code>, e.g. "eml://ecoinformatics.org/eml-2.0.0".
        The documentType identifies the document as an EML document.</li>
    <li><code>documentURL</code>, e.g. "http://www.lternet.edu/~dcosta/document1.xml".
        The documentURL specifies a place where Harvester can locate 
        and retrieve the document via HTTP.</li>
  </ul>
  </p>
  <p>
  The contents of a Harvest List XML file must conform to a particular
  XML Schema, as defined in file <a href="../../lib/harvester/harvestList.xsd">
  harvestList.xsd</a>. The contents of a valid Harvest List 
  can best be illustrated by example. The sample Harvest List
  below contains two &lt;<code>document</code>&gt; elements that specify the 
  information that Harvester needs to retrieve a pair of EML documents and 
  upload them to Metacat:
  <pre>
&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList" &gt;
    &lt;document&gt;
        &lt;docid&gt;
            &lt;scope&gt;demoDocument&lt;/scope&gt;
            &lt;identifier&gt;1&lt;/identifier&gt;
            &lt;revision&gt;5&lt;/revision&gt;
        &lt;/docid&gt;
        &lt;documentType&gt;eml://ecoinformatics.org/eml-2.0.0&lt;/documentType&gt;
        &lt;documentURL&gt;http://www.lternet.edu/~dcosta/document1.xml&lt;/documentURL&gt;
    &lt;/document&gt;
    &lt;document&gt;
        &lt;docid&gt;
            &lt;scope&gt;demoDocument&lt;/scope&gt;
            &lt;identifier&gt;2&lt;/identifier&gt;
            &lt;revision&gt;1&lt;/revision&gt;
        &lt;/docid&gt;
        &lt;documentType&gt;eml://ecoinformatics.org/eml-2.0.0&lt;/documentType&gt;
        &lt;documentURL&gt;http://www.lternet.edu/~dcosta/document2.xml&lt;/documentURL&gt;
    &lt;/document&gt;
&lt;/hrv:harvestList&gt;
  </pre>
  <p>
  After editing the Harvest List, ensure that the Harvest List XML file resides
  at the appropriate location on disk as specified by the URL that was entered
  during the <a href=#Registering>registration</a> process.
  </p>
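  <p>
  One possible way to check a Harvest List before publishing it is to validate
  it against the schema with a general-purpose XML tool. The sketch below
  assumes that the <code>xmllint</code> utility (part of libxml2) is installed
  at the site, and that local copies of the schema and the list are named
  <code>harvestList.xsd</code> and <code>harvestList.xml</code>. It is not part
  of Harvester, and any schema-aware XML validator could be used instead.
  </p>
  <pre>
# Validate the Harvest List against the schema; any validation errors are reported
xmllint --noout --schema harvestList.xsd harvestList.xml
  </pre>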
    <h5><a name="Preparing">Preparing EML Documents for harvest</a></h5>
  <p>
  To prepare a set of EML documents for harvest, ensure that the following is 
  true for each document:
  <ul>
    <li>The document contains valid EML</li>
    <li>The document is specified in a &lt;document&gt; element in the 
        site's Harvest List, as described above</li>
    <li>The file resides at the appropriate location on disk as specified
        by its URL in the Harvest List</li>
  </ul>
  </p>
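  <p>
  The last of these conditions can be spot-checked from the command line. The
  sketch below assumes that the <code>curl</code> utility is available and
  uses the first Document URL from the sample Harvest List shown earlier;
  substitute each URL from your own Harvest List. This is a convenience check
  only, not a step that Harvester itself requires.
  </p>
  <pre>
# Request only the response headers for a Document URL.
# With --fail, curl exits with a non-zero status if the server returns an HTTP error.
curl --head --fail http://www.lternet.edu/~dcosta/document1.xml
  </pre>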
    <h5><a name="Reviewing" >Reviewing Harvester Reports to the Site Contact</a></h5>
  <P>
  After every scheduled harvest that takes place at a particular Harvest
  Site, Harvester will send an email report to the Site Contact detailing the 
  operations that were performed during the harvest.
  The report will contain information about the operations that were
  performed by Harvester at that site, such as 
  which EML documents were harvested and whether any errors were encountered.
  </P>
  <P>
  The Site Contact should review the report, paying particularly 
  close attention to any errors that are reported. Errors are indicated
  by operations that display a status value of 1, while operations that
  display a status value of 0 indicate that the operation completed
  successfully.
  </P>
  <p>
  When errors are reported,
  the Site Contact should try to determine whether the source of the error
  is something that can be corrected at the site. Common causes of errors 
  might be:
  <ul>
    <li>A document URL specified in the Harvest List does not match
        the location of the actual EML file on the disk</li>
    <li>The Harvest List does not contain valid XML as specified in
        the <a href=../../lib/harvester/harvestList.xsd>harvestList.xsd</a> schema</li>
    <li>The URL to the Harvest List that was specified during
        registration with Harvester does not match the actual location of
        the Harvest List on the disk</li>
    <li>An EML document that Harvester attempted to upload to Metacat does
        not contain valid EML</li>
  </ul>
  </P> 
  <p>
  If the Site Contact is unable to determine the cause of the error and its
  resolution, he or she should contact the Harvester Administrator for assistance.
  </p>
  <a href="./properties.html">Back</a> | 
  <a href="./metacattour.html">Home</a> | 
  <a href="./unimplem.html">Next</a>
</BODY>
</HTML>