<!--
  * harvester.html
  *
  *      Authors: Duane Costa
  *    Copyright: 2004 Regents of the University of California and the
  *               National Center for Ecological Analysis and Synthesis,
  *               and the University of New Mexico.
  *  For Details: http://www.nceas.ucsb.edu/
  *      Created: 2004 April 9
  *      Version:
  *    File Info: '$ '
  *
  *
-->
<HTML>
<HEAD>
<TITLE>Metacat Harvester</TITLE>
<link rel="stylesheet" type="text/css" href="@docrooturl@default.css">
</HEAD>
<BODY>
  <table width="100%">
    <tr>
      <td class="tablehead" colspan="2">
        <p class="label">Metacat Harvester</p>
      </td>
      <td class="tablehead" colspan="2" align="right">
        <a href="./properties.html">Back</a> |
        <a href="./metacattour.html">Home</a> |
        <a href="./unimplem.html">Next</a>
      </td>
    </tr>
  </table>
  <h4>Introduction</h4>
The Metacat Harvester (henceforth referred to as "Harvester") is a
program that automates the retrieval of EML documents from one or more sites
and their subsequent upload (insert or update) to Metacat. Harvester uses pull
technology: it retrieves documents from each site and uploads them to Metacat
on a regularly scheduled basis.
<P>
Although Harvester is included with a Metacat installation (beginning with
Metacat version 1.4.0), it is an optional extension to Metacat's
functionality.
</P>
  <h4>Definitions</h4>
The following table defines a number of terms that are useful in discussing
Harvester and its features.
  <br><br>
  <table border="1">
    <tr>
      <td><b>Term</b></td>
      <td><b>Definition</b></td>
    </tr>
    <tr>
      <td>Harvester</td>
      <td>The Harvester program, a Java application that is bundled with the
          Metacat distribution. When a user installs Metacat on a system,
          the Harvester program is automatically included in the
          installation.
      </td>
    </tr>
    <tr>
      <td>Harvester Administrator</td>
      <td>The individual who installs and manages Harvester. Typically, this
          would be the same individual who installs and manages Metacat at a
          given installation.
      </td>
    </tr>
    <tr>
      <td>Harvest Site</td>
      <td>A location from which Harvester can retrieve EML documents. A given
          Harvester can retrieve documents from any number of different
          Harvest Sites.
      </td>
    </tr>
    <tr>
      <td>Harvest</td>
      <td>The act (by Harvester) of visiting a Harvest Site, retrieving a
          number of EML documents, and inserting or updating the documents to
          Metacat.
      </td>
    </tr>
    <tr>
      <td><a name="HarvestList" >Harvest List</a></td>
      <td>An XML document that lists a set of EML documents to be harvested. The
          Harvest List must conform to an XML Schema,
          <a href="../../lib/harvester/harvestList.xsd">harvestList.xsd</a>.
      </td>
    </tr>
    <tr>
      <td>Site Contact</td>
      <td>The individual at a particular Harvest Site who registers with
          Harvester, composes a Harvest List, and periodically prepares
          the site's EML documents for retrieval and upload to Metacat.
      </td>
    </tr>
    <tr>
      <td>Harvest List URL</td>
      <td>A URL to the Harvest List, as specified by the Site Contact.
          Each Harvest Site corresponds to a Harvest List URL. Harvester
          uses the URL to locate and read a site's Harvest List.
      </td>
    </tr>
    <tr>
      <td>Document URL</td>
      <td>A URL to an EML document, as specified in the Harvest List.
          The Harvest List may contain any number of Document URLs. Each
          Document URL provides a locator to a document to be harvested.
      </td>
    </tr>
    <tr>
      <td>Harvester Registration Page</td>
      <td>A web page that provides a means for a Site Contact
          to register with Harvester to schedule regular harvests from the
          site. Registration involves logging in and then specifying various
          settings for the Harvest Site, such as the Harvest List URL, the
          harvest frequency, and the email address of the Site Contact.
      </td>
    </tr>
  </table>
  <h4>Managing Harvester</h4>
  Harvester is managed by the Harvester Administrator. Typically, the same
  individual who manages a Metacat server would also act as the Harvester
  Administrator. The responsibilities of the Harvester Administrator include:
    <ul>
      <li><a href="#Configuring Harvester">Configuring Harvester</a></li>
      <li><a href="#Running Harvester">Running Harvester</a></li>
      <li><a href="#Reviewing Harvester">Reviewing Harvester reports to
      the Harvester Administrator</a></li>
    </ul>
  <h5><a name="Configuring Harvester">Configuring Harvester</a></h5>
  <p>Harvester must be configured to interact with a working Metacat
     installation. Thus, a Metacat installation that has been properly
     configured and installed is a prerequisite to running Harvester.
     Additionally, Harvester has a number of settable properties that
     control its behavior. All Harvester configuration information is managed
     in a single file,
     <a href=../../lib/metacat.properties>metacat.properties</a>,
     located at:
  <pre>      METACAT_HOME/lib/metacat.properties</pre>
     where METACAT_HOME is the top-level directory in which Metacat is
     installed.
  </p>
  <p>Harvester properties are grouped together in
     <a href=../../lib/metacat.properties>metacat.properties</a>, beginning
     after the comment line:
  <pre><code>      # Harvester properties</code></pre>
  </p>
  <p>The Harvester Administrator should edit
     <a href=../../lib/metacat.properties>metacat.properties</a>,
     setting appropriate values for the <code><b>harvesterAdministrator</b></code>
     property, the <code><b>smtpServer</b></code> property, and possibly other
     properties. The following table summarizes each property and its function.
  </p>
  <table border="1">
    <tr>
      <td><b>Property</b></td>
      <td><b>Description</b></td>
      <td><b>Possible or default value</b></td>
    </tr>
    <tr>
      <td>connectToMetacat</td>
      <td>This property determines whether Harvester should connect to
          Metacat to upload documents. It should be set to <code>true</code>
          under most circumstances. Setting this property to <code>false</code>
          can be useful for testing whether Harvester is able to retrieve
          documents from a site without actually connecting to Metacat to
          upload the documents.</td>
      <td><code>true</code> | <code>false</code><br>
          Default: <code>true</code></td>
    </tr>
    <tr>
      <td>delay</td>
      <td>The number of hours that Harvester will wait before beginning its
          first harvest. For example, if Harvester is run at 1:00 p.m., and
          the delay is set to 12, Harvester will begin its first harvest at
          1:00 a.m. the following day.</td>
      <td>Default: 0</td>
    </tr>
    <tr>
      <td>harvesterAdministrator</td>
      <td>The email address of the Harvester Administrator. Harvester will
          send email reports to this address after every harvest. You may
          enter multiple email addresses by separating each address with
          a comma or semicolon, for example, "name1@abc.edu,name2@abc.edu".
      </td>
      <td>An email address, or multiple email addresses separated by commas
          or semicolons</td>
    </tr>
    <tr>
      <td>logPeriod</td>
      <td>The number of days that Harvester should retain log entries of harvest
          operations in the database. Harvester log entries record information
          such as which documents were harvested, from which sites, and
          whether any errors were encountered during the harvest. Log entries
          older than <code>logPeriod</code> days are purged from the
          database at the end of each harvest.</td>
      <td>Default: 90</td>
    </tr>
    <tr>
      <td>maxHarvests</td>
      <td>The maximum number of harvests that Harvester should execute before
          shutting down. When the Harvester program is executed, it will
          continue running until it has executed <code>maxHarvests</code>
          harvests, and then the program will terminate. If
          the value of <code>maxHarvests</code> is set to 0 or a negative
          number, it will be ignored and Harvester will execute indefinitely.
          </td>
      <td>Default: 0</td>
    </tr>
    <tr>
      <td>period</td>
      <td>The number of hours between harvests. Harvester will run a new
          harvest every <code>period</code> hours, until
          <code>maxHarvests</code> harvests have been run, or
          indefinitely if <code>maxHarvests</code> is set to a value of
          0 or a negative number.</td>
      <td>Default: 24</td>
    </tr>
    <tr>
      <td>smtpServer</td>
      <td>The SMTP server that Harvester uses for sending email messages
          to the Harvester Administrator and to Site Contacts.</td>
      <td>A host name, for example: <code>somehost.institution.edu</code>
          <br><br>
          Default: <code>localhost</code>
          <br><br>
          Note that the default value will only work if the Harvester
          host machine has been configured as an SMTP server.
      </td>
    </tr>
    <tr>
      <td>Harvester Operation Properties (GetDocError, GetDocSuccess, etc.)</td>
      <td>This group of properties is used by Harvester to report information
          about the operations it performs for inclusion in log
          entries and email messages. Under most circumstances the values
          of these properties should not be modified.</td>
      <td>&nbsp;</td>
    </tr>
  </table>
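  <p>The properties in the table above can be read with the standard
  <code>java.util.Properties</code> class. The following is a minimal sketch,
  not Harvester's own code: the class name <code>HarvesterSettings</code> and
  its field names are hypothetical, the sample values are placeholders, and the
  fallback values simply mirror the defaults listed in the table.
  </p>

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical holder for the Harvester properties listed in the table above. */
final class HarvesterSettings {
    final boolean connectToMetacat;
    final int delayHours;
    final String harvesterAdministrator;
    final int logPeriodDays;
    final int maxHarvests;
    final int periodHours;
    final String smtpServer;

    private HarvesterSettings(Properties p) {
        // Fallbacks mirror the "Possible or default value" column of the table.
        connectToMetacat = Boolean.parseBoolean(p.getProperty("connectToMetacat", "true").trim());
        delayHours = intValue(p, "delay", 0);
        harvesterAdministrator = p.getProperty("harvesterAdministrator", "").trim();
        logPeriodDays = intValue(p, "logPeriod", 90);
        maxHarvests = intValue(p, "maxHarvests", 0);
        periodHours = intValue(p, "period", 24);
        smtpServer = p.getProperty("smtpServer", "localhost").trim();
    }

    private static int intValue(Properties p, String key, int fallback) {
        String raw = p.getProperty(key);
        return (raw == null || raw.trim().isEmpty()) ? fallback : Integer.parseInt(raw.trim());
    }

    /** Parses properties-file text, e.g. the contents of METACAT_HOME/lib/metacat.properties. */
    static HarvesterSettings parse(String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new HarvesterSettings(p);
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        String sample = String.join("\n",
            "# Harvester properties",
            "connectToMetacat=true",
            "delay=0",
            "harvesterAdministrator=name1@abc.edu,name2@abc.edu",
            "logPeriod=90",
            "maxHarvests=0",
            "period=24",
            "smtpServer=somehost.institution.edu");
        HarvesterSettings s = parse(sample);
        System.out.println("Harvest every " + s.periodHours + " hours; reports go to "
            + s.harvesterAdministrator + " via " + s.smtpServer);
    }
}
```

  <p>Reading the file this way before starting Harvester is one way to confirm
  that <code>harvesterAdministrator</code> and <code>smtpServer</code> have been
  set to something other than their defaults.
  </p>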
  <br>
  <h5><a name="Running Harvester">Running Harvester</a></h5>
  After Harvester has been appropriately
  <a href="#Configuring Harvester">configured</a>,
  it can be run in either of two ways: (A) in a command window, or (B)
  as a servlet. If you wish only to test that Harvester is functioning,
  or if you expect to use Harvester infrequently, it may be desirable to run it from a
  command window. However, under most circumstances you will want Harvester to
  run continuously as a background servlet process. This eliminates the
  need to keep a command window continuously open while Harvester is running.
  Both of these procedures are described below.
  <ul>
  <li> (A) Running Harvester in a Command Window
  <ol>
  <li>Open a system command window or terminal window.</li>
  <li>Set the METACAT_HOME environment variable to the value of the Metacat
      installation directory. Some examples follow:
      <ul>
        <li>On Windows:
        <pre>set METACAT_HOME=C:\somePath\metacat</pre></li>
        <li>On Linux/Unix (bash shell):
        <pre>export METACAT_HOME=/home/somePath/metacat</pre></li>
      </ul>
  </li>
  <li>cd to the following directory:
      <ul>
        <li>On Windows:
        <pre>cd %METACAT_HOME%\lib\harvester</pre></li>
        <li>On Linux/Unix:
        <pre>cd $METACAT_HOME/lib/harvester</pre></li>
      </ul>
  </li>
  <li>Run the appropriate Harvester shell script, as determined by the
      operating system:
      <ul>
        <li>On Windows:
        <pre>runHarvester.bat</pre></li>
        <li>On Linux/Unix:
        <pre>sh runHarvester.sh</pre></li>
      </ul>
  </li>
 </ol>
  <p>The Harvester application will start executing. It will begin its first
  harvest after <code><b>delay</b></code> hours (as specified in the
  <a href=../../lib/metacat.properties>metacat.properties</a>
  file). The application will continue running a new harvest every
  <code><b>period</b></code> hours until <code><b>maxHarvests</b></code>
  harvests have been completed (if <code><b>maxHarvests</b></code> is set
  to a value greater than 0), or until you interrupt the process by pressing Ctrl+C
  in the command window.
  </p>
  </li>
  <li> (B) Running Harvester as a Servlet
  <ol>
  <li>Edit the file <code>lib/web.xml.<em>tomcatN</em></code> in your Metacat
  installation, where <em>tomcatN</em> corresponds to the
  version of Tomcat you are running. For example, if you are running Tomcat 5,
  edit the file <code>lib/web.xml.tomcat5</code>.</li>
  <li>Remove the comment symbols around the HarvesterServlet entry, so that:
  <pre><code>
  &lt;!--
  &lt;servlet>
  &lt;servlet-name>HarvesterServlet&lt;/servlet-name>
  &lt;servlet-class>edu.ucsb.nceas.metacat.harvesterClient.HarvesterServlet&lt;/servlet-class>
  &lt;init-param>
    &lt;param-name>debug&lt;/param-name>
    &lt;param-value>1&lt;/param-value>
  &lt;/init-param>
  &lt;init-param>
    &lt;param-name>listings&lt;/param-name>
    &lt;param-value>true&lt;/param-value>
  &lt;/init-param>
  &lt;load-on-startup>1&lt;/load-on-startup>
  &lt;/servlet>
  --&gt;
  </code></pre>
  is changed to:
  <pre><code>
  &lt;servlet>
  &lt;servlet-name>HarvesterServlet&lt;/servlet-name>
  &lt;servlet-class>edu.ucsb.nceas.metacat.harvesterClient.HarvesterServlet&lt;/servlet-class>
  &lt;init-param>
    &lt;param-name>debug&lt;/param-name>
    &lt;param-value>1&lt;/param-value>
  &lt;/init-param>
  &lt;init-param>
    &lt;param-name>listings&lt;/param-name>
    &lt;param-value>true&lt;/param-value>
  &lt;/init-param>
  &lt;load-on-startup>1&lt;/load-on-startup>
  &lt;/servlet>
  </code></pre>
  Save the edited file.
  </li>
  <li>Shut down Tomcat.</li>
  <li>Redeploy Metacat by running the following two ant commands from the top-level
  directory of your Metacat installation:
  <pre><code>
  ant cleanweb
  ant install</code></pre>
  </li>
  <li>Restart Tomcat.</li>
 </ol>
  <p>About thirty seconds after you restart Tomcat, the Harvester servlet will
  start executing. It will begin its first
  harvest after <code><b>delay</b></code> hours (as specified in the
  <a href=../../lib/metacat.properties>metacat.properties</a>
  file). The servlet will continue running a new harvest every
  <code><b>period</b></code> hours until <code><b>maxHarvests</b></code>
  harvests have been completed (if <code><b>maxHarvests</b></code> is set
  to a value greater than 0), or until Tomcat shuts down.
  </p>
  </li>
  </ul>
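  <p>The way <code>delay</code>, <code>period</code>, and
  <code>maxHarvests</code> combine can be sketched as a small calculation. This
  is an illustration only, not Harvester's scheduler: the class and method names
  are hypothetical, and it assumes that <code>period</code> is measured from the
  start of one harvest to the start of the next, which this document does not
  state explicitly.
  </p>

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that previews when harvests would start. */
final class HarvestSchedule {

    /**
     * Returns the start times of the upcoming harvests.
     *
     * @param launched     when Harvester was started
     * @param delayHours   the "delay" property
     * @param periodHours  the "period" property
     * @param maxHarvests  the "maxHarvests" property; 0 or negative means no limit
     * @param preview      how many start times to list when there is no limit
     */
    static List<LocalDateTime> startTimes(LocalDateTime launched, int delayHours,
                                          int periodHours, int maxHarvests, int preview) {
        // With maxHarvests <= 0 Harvester runs indefinitely, so only the preview size applies.
        int count = (maxHarvests > 0) ? Math.min(maxHarvests, preview) : preview;
        List<LocalDateTime> times = new ArrayList<>();
        LocalDateTime next = launched.plusHours(delayHours);
        for (int i = 0; i < count; i++) {
            times.add(next);
            next = next.plusHours(periodHours);
        }
        return times;
    }

    public static void main(String[] args) {
        // Started at 1:00 p.m. with delay=12 and period=24, no harvest limit.
        LocalDateTime launched = LocalDateTime.of(2004, 4, 9, 13, 0);
        for (LocalDateTime t : startTimes(launched, 12, 24, 0, 3)) {
            System.out.println(t);
        }
    }
}
```

  <p>With a start at 1:00 p.m., a <code>delay</code> of 12, and a
  <code>period</code> of 24, the first harvest falls at 1:00 a.m. the next day
  and each later harvest follows 24 hours after the previous one.
  </p>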
   <h5><a name="Reviewing Harvester">
  Reviewing Harvester Reports to the Harvester Administrator</a></h5>
  <P>
  After every harvest, Harvester will send an email report to the Harvester
  Administrator detailing the operations that were performed during the
  harvest. The report will contain information about each Harvest Site
  that was harvested, such as which EML documents were
  harvested and whether any errors were encountered.
  </P>
  <p>
  The harvest report will contain a list of log entries, where each log entry
  describes an operation that was performed by Harvester. Log entries that
  show a status value of 1 indicate that an error occurred during the
  operation, while those that show a status value of 0 indicate that the
  operation was completed successfully.
  </p>
  <P>The Harvester Administrator should review the report, paying particularly
  close attention to any errors that are reported and to the accompanying error
  messages. When errors are reported at
  a particular site, the Harvester Administrator should contact the Site
  Contact to determine the source of the error and its resolution. See
  <a href=#Reviewing>Reviewing Harvester Reports to the Site Contact</a> for a
  description of common sources of errors at a Harvest Site.
  </P>
  <p>Errors that are independent of a particular site may indicate a problem
  with Harvester itself, Metacat, or the database connection. Refer to the
  error message to determine the source of the error and its resolution.
  </p>
  <h4>Managing a Harvest Site</h4>
  A Harvest Site is managed by a Site Contact.
  The responsibilities of a Site Contact fall into the following categories:
    <ul>
      <li><a href=#Registering>Registering with Harvester</a></li>
      <li><a href=#Composing>Composing a Harvest List</a></li>
      <li><a href=#Preparing>Preparing EML Documents for harvest</a></li>
      <li><a href=#Reviewing>Reviewing Harvester reports to the Site Contact</a></li>
    </ul>
    <h5><a name="Registering">Registering with Harvester</a></h5>
  <p>
  A Site Contact registers a site with Harvester by logging in to the
  Harvester Registration page and entering several items of information
  that Harvester needs to know about the site.
  </p>
  <ol>
    <li>Logging in to the Harvester Registration Page
  <p>
  The Harvester Registration page is accessed from Metacat. For example, if
  the Metacat server that you wish to register with resides at the following
  URL:
  <pre>  http://somehost.somelocation.edu:8080/knb/index.jsp</pre>
  then the Harvester Registration page would be accessed at:
  <pre>  http://somehost.somelocation.edu:8080/knb/style/skins/knb/harvesterRegistrationLogin.html</pre>
  </p>
  <p>
  After bringing up this page in your browser, log in to your Metacat account
  by entering your username, organization, and password. For example:
      <table bgcolor="#ffffff" border="0" cellpadding="2" width='100%' >
        <tr >
          <td colspan=3 align=center >&nbsp;</td>
        </tr>
        <tr >
          <td colspan=3 align=center >
            <font face=verdana size=1%>
              <b>Please  Enter Username, Organization, and Password </b>
            </font>
          </td>
        </tr>
        <tr>
          <td width='10%'> &nbsp;</td>
          <td width="25%" bgcolor="#4682b4">
            <p align="center">
            <font color="white" face=verdana size=2%>
            <b>Username</b>
            </font>
          </td>
          <td><p><input type="text" name="uid" value="jdoe" maxlength="100" size="28"></td>
        </tr>
        <tr>
          <td width='10%'> &nbsp;</td>
          <td width="25%" bgcolor="#4682b4">
            <p align="center">
            <font color="white" face=verdana size=2%>
            <b>Organization</b>
            </font>
          </td>
          <td>
            <input type="radio" name="o" value="NCEAS" checked>NCEAS
            <input type="radio" name="o" value="LTER">LTER
            <input type="radio" name="o" value="NRS">NRS
            <br>
            <input type="radio" name="o" value="PISCO">PISCO
            <input type="radio" name="o" value="OBFS">OBFS
            <input type="radio" name="o" value="Unaffiliated">Unaffiliated
          </td>
        </tr>
        <tr>
          <td width='10%'> &nbsp;</td>
          <td bgcolor="#4682b4">
            <p align="center">
            <font color="white" face=verdana size=2%>
            <b>Password</b>
            </font>
          </td>
          <td><p><input type="password" name="passwd" value="*******" maxlength="60" size="28">
          </td>
        </tr>
        <tr>
          <td colspan=3 align=center >&nbsp;</td>
        </tr>
      </table>
  In some cases, a Site Contact may need to log in to an anonymous account
  rather than his or her personal account. For example, an LTER Information
  Manager may need to log in to a dedicated account, named with a three-letter
  acronym, that has been set up for the LTER site. The username
  "GCE" would be used by the LTER Information Manager at the GCE (Georgia
  Coastal Ecosystems) site.
  </p>
    </li>
    <li>Registering with Harvester
  <p>
  After logging in, you will be presented with a web form that prompts you
  to enter information about your site and how often you want to schedule
  harvests at your site. For example:
      <table bgcolor="#ffffff" border="0" cellpadding="2" width='100%' >
        <tr >
          <td colspan=3 align=center >&nbsp;</td>
        </tr>
        <tr >
          <td colspan=3 align=center >
            <font face=verdana size=1%>
              <b>Metacat Harvester Registration </b>
            </font>
          </td>
        </tr>
        <tr>
          <td width='10%'> &nbsp;</td>
          <td width="25%" bgcolor="#4682b4">
            <p align="center">
            <font color="white" face=verdana size=2%>
            <b>Email address:</b>
            </font>
          </td>
          <td><p><input type="text" size="55" name="uid" value="myname@institution.edu" maxlength="100" size="28"></td>
        </tr>
        <tr>
          <td width='10%'> &nbsp;</td>
          <td bgcolor="#4682b4">
            <p align="center">
            <font color="white" face=verdana size=2%>
            <b>Harvest List URL:</b>
            </font>
          </td>
          <td><p><input type="text" size="55" name="passwd" value="http://somehost.institution.edu/~myname/harvestList.xml" maxlength="60" size="28">
          </td>
        </tr>
        <tr>
          <td colspan=3 align=center >&nbsp;</td>
        </tr>
        <tr>
          <td width='10%'> &nbsp;</td>
          <td bgcolor="#4682b4">
            <p align="center">
            <font color="white" face=verdana size=2%>
            <b>Harvest Frequency (1-99):</b>
            </font>
          </td>
          <td><p><input type="text" size="3" name="passwd" value="2" maxlength="60" size="28">
          </td>
        </tr>
        <tr>
          <td colspan=3 align=center >&nbsp;</td>
        </tr>
        <tr>
          <td width='10%'> &nbsp;</td>
          <td width="25%" bgcolor="#4682b4">
            <p align="center">
            <font color="white" face=verdana size=2%>
            <b>Unit:</b>
            </font>
          </td>
          <td>
            <input type="radio" name="o" value="days" >day(s)
            <input type="radio" name="o" value="weeks" checked>week(s)
            <input type="radio" name="o" value="months">month(s)
          </td>
        </tr>
      </table>
  <p>
  After values have been entered for each of these fields, click the Register
  button to register your site with Harvester.
  </p>
  <P>
  In the example shown above, Harvester will attempt to harvest documents from
  the site once every 2 weeks; it will access the site's Harvest List at URL
  "http://somehost.institution.edu/~myname/harvestList.xml"; and it will send
  email reports to the Site Contact at email address "myname@institution.edu".
  </P>
  <P>
  Note that you may enter multiple email addresses by separating each
  address with a comma or a semicolon. For example,
  "myname@institution.edu,anothername@institution.edu".
  </P>
    </li>
    <li>Unregistering with Harvester
  <p>
  At any time after you have registered with Harvester, you may discontinue
  harvests at your site by unregistering. Simply log in as described above and
  then click the Unregister button. Once you have done so, Harvester will
  discontinue harvests at the site.
  </p>
    </li>
  </ol>
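  <p>The registration form accepts several email addresses in one field,
  separated by commas or semicolons. How Harvester itself parses the field is
  not shown in this document; the sketch below is one plausible way to split
  such a value, using a hypothetical class name and ignoring surrounding
  whitespace and empty entries.
  </p>

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for an email field that may hold several addresses. */
final class EmailAddressList {

    /** Splits on commas or semicolons, trimming whitespace and skipping empty entries. */
    static List<String> split(String field) {
        List<String> addresses = new ArrayList<>();
        for (String part : field.split("[,;]")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                addresses.add(trimmed);
            }
        }
        return addresses;
    }

    public static void main(String[] args) {
        System.out.println(split("myname@institution.edu,anothername@institution.edu"));
    }
}
```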
  <h5><a name="Composing">Composing a Harvest List</a></h5>
  <p>
  A Harvest List is an XML file that holds a list of EML documents to be
  harvested. For each EML document in the list, the following information
  must be specified:
  <ul>
    <li><code>docid</code>, which consists of the:
      <ul>
        <li><code>scope</code>, e.g. "demoDocument". The scope is an identifier
            that indicates which group of documents this document belongs to.
        </li>
        <li><code>identifier</code>, e.g. "1". The identifier is a number that
            uniquely identifies this document within the scope.
        </li>
        <li><code>revision</code>, e.g. "5". The revision is a number that
            indicates the current revision of this document.
        </li>
      </ul>
    </li>
    <li><code>documentType</code>, e.g. "eml://ecoinformatics.org/eml-2.0.0".
        The documentType identifies the document as an EML document.</li>
    <li><code>documentURL</code>, e.g. "http://www.lternet.edu/~dcosta/document1.xml".
        The documentURL specifies a place where Harvester can locate
        and retrieve the document via HTTP.</li>
  </ul>
  </p>
  <p>
  The contents of a Harvest List XML file must conform to a particular
  XML Schema, as defined in file <a href="../../lib/harvester/harvestList.xsd">
  harvestList.xsd</a>. The contents of a valid Harvest List
  can best be illustrated by example. The sample Harvest List
  below contains two &lt;<code>document</code>&gt; elements that specify the
  information that Harvester needs to retrieve a pair of EML documents and
  upload them to Metacat:
  <pre>
&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList" &gt;
    &lt;document&gt;
        &lt;docid&gt;
            &lt;scope&gt;demoDocument&lt;/scope&gt;
            &lt;identifier&gt;1&lt;/identifier&gt;
            &lt;revision&gt;5&lt;/revision&gt;
        &lt;/docid&gt;
        &lt;documentType&gt;eml://ecoinformatics.org/eml-2.0.0&lt;/documentType&gt;
        &lt;documentURL&gt;http://www.lternet.edu/~dcosta/document1.xml&lt;/documentURL&gt;
    &lt;/document&gt;
    &lt;document&gt;
        &lt;docid&gt;
            &lt;scope&gt;demoDocument&lt;/scope&gt;
            &lt;identifier&gt;2&lt;/identifier&gt;
            &lt;revision&gt;1&lt;/revision&gt;
        &lt;/docid&gt;
        &lt;documentType&gt;eml://ecoinformatics.org/eml-2.0.0&lt;/documentType&gt;
        &lt;documentURL&gt;http://www.lternet.edu/~dcosta/document2.xml&lt;/documentURL&gt;
    &lt;/document&gt;
&lt;/hrv:harvestList&gt;
  </pre>
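  <p>A Site Contact who wants to check a Harvest List before publishing it
  could read it back with the JDK's built-in XML parser. The sketch below is
  illustrative only: <code>HarvestListReader</code> and <code>Entry</code> are
  hypothetical names, not part of Harvester, and the code only confirms that the
  file is well-formed and that each &lt;document&gt; element carries the fields
  listed above. It does not validate against harvestList.xsd.
  </p>

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical reader that lists the entries of a Harvest List. */
final class HarvestListReader {

    /** One <document> element of the Harvest List. */
    static final class Entry {
        final String scope;
        final int identifier;
        final int revision;
        final String documentType;
        final String documentURL;

        Entry(String scope, int identifier, int revision, String documentType, String documentURL) {
            this.scope = scope;
            this.identifier = identifier;
            this.revision = revision;
            this.documentType = documentType;
            this.documentURL = documentURL;
        }
    }

    /** Parses Harvest List XML text and returns its entries in document order. */
    static List<Entry> read(String harvestListXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document dom = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(harvestListXml.getBytes(StandardCharsets.UTF_8)));
            NodeList documents = dom.getElementsByTagName("document");
            List<Entry> entries = new ArrayList<>();
            for (int i = 0; i < documents.getLength(); i++) {
                Element d = (Element) documents.item(i);
                entries.add(new Entry(
                    text(d, "scope"),
                    Integer.parseInt(text(d, "identifier")),
                    Integer.parseInt(text(d, "revision")),
                    text(d, "documentType"),
                    text(d, "documentURL")));
            }
            return entries;
        } catch (Exception e) {
            // Covers malformed XML as well as missing or non-numeric fields.
            throw new IllegalArgumentException("Not a readable Harvest List", e);
        }
    }

    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}
```

  <p>Calling <code>HarvestListReader.read</code> on the sample above would yield
  two entries, one per &lt;document&gt; element; a missing
  &lt;revision&gt; or a non-numeric &lt;identifier&gt; surfaces as an
  <code>IllegalArgumentException</code>.
  </p>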
  <p>
  After editing the Harvest List, ensure that the Harvest List XML file resides
  at the appropriate location on disk as specified by the URL that was entered
  during the <a href=#Registering>registration</a> process.
  </p>
  <p>
  The <a href=./harvestListEditor.html>Harvest List Editor</a> is a tool that
  assists in composing and editing a Harvest List. (Click
  <a href=./harvestListEditor.html>here</a> for additional details.)
  </p>
    <h5><a name="Preparing">Preparing EML Documents for harvest</a></h5>
  <p>
  To prepare a set of EML documents for harvest, ensure that the following is
  true for each document:
  <ul>
    <li>The document contains valid EML</li>
    <li>The document is specified in a &lt;document&gt; element in the
        site's Harvest List, as described above</li>
    <li>The file resides at the appropriate location on disk as specified
        by its URL in the Harvest List</li>
  </ul>
  </p>
    <h5><a name="Reviewing" >Reviewing Harvester Reports to the Site Contact</a></h5>
  <P>
  After every scheduled harvest that takes place at a particular Harvest
  Site, Harvester will send an email report to the Site Contact detailing the
  operations that were performed at that site during the harvest, such as
  which EML documents were harvested and whether any errors were encountered.
  </P>
  <P>
  The Site Contact should review the report, paying particularly
  close attention to any errors that are reported. Errors are indicated
  by operations that display a status value of 1, while operations that
  display a status value of 0 completed successfully.
  </P>
  <p>
  When errors are reported,
  the Site Contact should try to determine whether the source of the error
  is something that can be corrected at the site. Common causes of errors
  include:
  <ul>
    <li>A document URL specified in the Harvest List does not match
        the location of the actual EML file on the disk</li>
    <li>The Harvest List does not contain valid XML as specified in
        the <a href=../../lib/harvester/harvestList.xsd>harvestList.xsd</a> schema</li>
    <li>The URL to the Harvest List that was specified during
        registration with Harvester does not match the actual location of
        the Harvest List on the disk</li>
    <li>An EML document that Harvester attempted to upload to Metacat does
        not contain valid EML</li>
  </ul>
  </P>
  <p>
  If the Site Contact is unable to determine the cause of the error and its
  resolution, he or she should contact the Harvester Administrator for assistance.
  </p>
  <a href="./properties.html">Back</a> |
  <a href="./metacattour.html">Home</a> |
  <a href="./unimplem.html">Next</a>
</BODY>
</HTML>