<!--
  * harvester.html
  *
  *      Authors: Duane Costa
  *    Copyright: 2004 Regents of the University of California and the
  *               National Center for Ecological Analysis and Synthesis,
  *               and the University of New Mexico.
  *  For Details: http://www.nceas.ucsb.edu/
  *      Created: 2004 April 9
  *      Version:
  *    File Info: '$ '
  *
  *
-->
<HTML>
<HEAD>
<TITLE>Metacat Harvester</TITLE>
<link rel="stylesheet" type="text/css" href="@docrooturl@default.css">
</HEAD>
<BODY>
  <table width="100%">
    <tr>
      <td class="tablehead" colspan="2">
        <p class="label">Metacat Harvester</p>
      </td>
      <td class="tablehead" colspan="2" align="right">
        <a href="./properties.html">Back</a> |
        <a href="./metacattour.html">Home</a> |
        <a href="./unimplem.html">Next</a>
      </td>
    </tr>
  </table>
  <h4>Introduction</h4>
The Metacat Harvester (henceforth referred to as "Harvester") is a
program that automates the retrieval of EML documents from one or more sites
and their subsequent upload (insert or update) to Metacat. Harvester uses pull
technology to retrieve and upload documents to Metacat on a regularly
scheduled basis.
<P>
Although Harvester is included with a Metacat installation (beginning with
Metacat version 1.4.0), it is an optional extension to Metacat's
functionality.
</P>
  <h4>Definitions</h4>
The following table defines a number of terms that are useful in discussing
Harvester and its features.
  <br><br>
  <table border="1">
    <tr>
      <td><b>Term</b></td>
      <td><b>Definition</b></td>
    </tr>
    <tr>
      <td>Harvester</td>
      <td>The Harvester program, a Java application that is bundled with the
          Metacat distribution. When a user installs Metacat on a system,
          the Harvester program is automatically included in the
          installation.
      </td>
    </tr>
    <tr>
      <td>Harvester Administrator</td>
      <td>The individual who installs and manages Harvester. Typically, this
          would be the same individual who installs and manages Metacat at a
          given installation.
      </td>
    </tr>
    <tr>
      <td>Harvest Site</td>
      <td>A location from which Harvester can retrieve EML documents. A given
          Harvester can retrieve documents from any number of different
          Harvest Sites.
      </td>
    </tr>
    <tr>
      <td>Harvest</td>
      <td>The act (by Harvester) of visiting a Harvest Site, retrieving a
          number of EML documents, and inserting or updating the documents to
          Metacat.
      </td>
    </tr>
    <tr>
      <td><a name="HarvestList" >Harvest List</a></td>
      <td>An XML document that lists a set of EML documents to be harvested. The
          Harvest List must conform to an XML Schema,
          <a href="../../lib/harvester/harvestList.xsd">harvestList.xsd</a>.
      </td>
    </tr>
    <tr>
      <td>Site Contact</td>
      <td>The individual at a particular Harvest Site who registers with
          Harvester, composes a Harvest List, and periodically prepares
          the site's EML documents for retrieval and upload to Metacat.
      </td>
    </tr>
    <tr>
      <td>Harvest List URL</td>
      <td>A URL to the Harvest List, as specified by the Site Contact.
          Each Harvest Site corresponds to a Harvest List URL. Harvester
          uses the URL to locate and read a site's Harvest List.
      </td>
    </tr>
    <tr>
      <td>Document URL</td>
      <td>A URL to an EML document, as specified in the Harvest List.
          The Harvest List may contain any number of Document URLs. Each
          Document URL provides a locator to a document to be harvested.
      </td>
    </tr>
    <tr>
      <td>Harvester Registration Page</td>
      <td>A web page that provides a means for a Site Contact
          to register with Harvester to schedule regular harvests from the
          site. Registration involves logging in and then specifying various
          settings for the Harvest Site, such as the Harvest List URL, the
          harvest frequency, and the email address of the Site Contact.
      </td>
    </tr>
  </table>
  <h4>Managing Harvester</h4>
  Harvester is managed by the Harvester Administrator. Typically, the same
  individual who manages a Metacat server would also act as the Harvester
  Administrator. The responsibilities of the Harvester Administrator include:
    <ul>
      <li><a href="#Configuring Harvester">Configuring Harvester</a></li>
      <li><a href="#Running Harvester">Running Harvester</a></li>
      <li><a href="#Reviewing Harvester">Reviewing Harvester reports to
      the Harvester Administrator</a></li>
    </ul>
  <h5><a name="Configuring Harvester">Configuring Harvester</a></h5>
  <p>Harvester must be configured to interact with a working Metacat
     installation. Thus, a Metacat installation that has been properly
     configured and installed is a prerequisite to running Harvester.
     Additionally, Harvester has a number of settable properties that
     control its behavior. All Harvester configuration information is managed
     in a single file,
     <a href=../../lib/metacat.properties>metacat.properties</a>,
     located at:
  <pre>      METACAT_HOME/lib/metacat.properties</pre>
     where METACAT_HOME is the top-level directory in which Metacat is
     installed.
  </p>
  <p>Harvester properties are grouped together in
     <a href=../../lib/metacat.properties>metacat.properties</a>, beginning
     after the comment line:
  <pre><code>      # Harvester properties</code></pre>
  </p>
  <p>The Harvester Administrator should edit
     <a href=../../lib/metacat.properties>metacat.properties</a>,
     setting appropriate values for the <code><b>harvesterAdministrator</b></code>
     property, the <code><b>smtpServer</b></code> property, and possibly other
     properties. The following table summarizes each property and its function.
  </p>
  <table border="1">
    <tr>
      <td><b>Property</b></td>
      <td><b>Description</b></td>
      <td><b>Possible or default value</b></td>
    </tr>
    <tr>
      <td>connectToMetacat</td>
      <td>This property determines whether Harvester should connect to
          Metacat to upload documents. It should be set to <code>true</code>
          under most circumstances. Setting this property to <code>false</code>
          can be useful for testing whether Harvester is able to retrieve
          documents from a site without actually connecting to Metacat to
          upload the documents.</td>
      <td><code>true</code> | <code>false</code><br>
          Default: <code>true</code>
      </td>
    </tr>
    <tr>
      <td>delay</td>
      <td>The number of hours that Harvester will wait before beginning its
          first harvest. For example, if Harvester is started at 1:00 p.m. and
          the delay is set to 12, Harvester will begin its first harvest at
          1:00 a.m. the following day.</td>
      <td>Default: 0</td>
    </tr>
    <tr>
      <td>harvesterAdministrator</td>
      <td>The email address of the Harvester Administrator. Harvester will
          send email reports to this address after every harvest. You may
          enter multiple email addresses by separating each address with
          a comma or semicolon, for example, "name1@abc.edu,name2@abc.edu".
      </td>
      <td>An email address, or multiple email addresses separated by commas
          or semicolons</td>
    </tr>
    <tr>
      <td>logPeriod</td>
      <td>The number of days that Harvester should retain log entries of harvest
          operations in the database. Harvester log entries record information
          such as which documents were harvested, from which sites, and
          whether any errors were encountered during the harvest. Log entries
          older than <code>logPeriod</code> days are purged from the
          database at the end of each harvest.</td>
      <td>Default: 90</td>
    </tr>
    <tr>
      <td>maxHarvests</td>
      <td>The maximum number of harvests that Harvester should execute before
          shutting down. When the Harvester program is executed, it will
          continue running until it has executed <code>maxHarvests</code>
          harvests, and then the program will terminate.</td>
      <td>Default: 30</td>
    </tr>
    <tr>
      <td>period</td>
      <td>The number of hours between harvests. Harvester will run a new
          harvest every <code>period</code> hours, until
          <code>maxHarvests</code> harvests have been run.</td>
      <td>Default: 24</td>
    </tr>
    <tr>
      <td>smtpServer</td>
      <td>The SMTP server that Harvester uses for sending email messages
          to the Harvester Administrator and to Site Contacts.</td>
      <td>A host name, for example: <code>somehost.institution.edu</code>
          <br><br>
          Default: <code>localhost</code>
          <br><br>
          Note that the default value will only work if the Harvester
          host machine has been configured as an SMTP server.
      </td>
    </tr>
    <tr>
      <td>Harvester Operation Properties (GetDocError, GetDocSuccess, etc.)</td>
      <td>This group of properties is used by Harvester to report information
          about the operations it performs for inclusion in log
          entries and email messages. Under most circumstances the values
          of these properties should not be modified.</td>
      <td>&nbsp;</td>
    </tr>
  </table>
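  <p>As an illustration of how the properties in the table fit together, the
  following Python sketch reads a fragment of properties text, fills in the
  defaults listed above, and splits the <code>harvesterAdministrator</code>
  value into individual addresses. It is not part of Harvester. The helper
  names <code>parse_properties</code> and <code>administrator_addresses</code>
  are hypothetical, and the sketch assumes plain <code>key=value</code> lines,
  which is only a subset of what a Java properties file allows.</p>

```python
import re

# Defaults copied from the property table above.
DEFAULTS = {
    "connectToMetacat": "true",
    "delay": "0",
    "logPeriod": "90",
    "maxHarvests": "30",
    "period": "24",
    "smtpServer": "localhost",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments.

    Assumption: no line continuations or escapes, unlike a full
    Java properties parser.
    """
    props = dict(DEFAULTS)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def administrator_addresses(props):
    """Split harvesterAdministrator on commas or semicolons."""
    raw = props.get("harvesterAdministrator", "")
    return [addr.strip() for addr in re.split(r"[,;]", raw) if addr.strip()]


sample = """
# Harvester properties
harvesterAdministrator=name1@abc.edu;name2@abc.edu
smtpServer=somehost.institution.edu
period=12
"""

props = parse_properties(sample)
print(administrator_addresses(props))
print(props["period"], props["maxHarvests"])
```

  <p>Properties that are absent from the sample text, such as
  <code>maxHarvests</code>, keep the default shown in the table.</p>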
  <br>
  <h5><a name="Running Harvester">Running Harvester</a></h5>
  After Harvester has been appropriately
  <a href="#Configuring Harvester">configured</a>,
  it can be run in either of two ways: (A) in a command window, or (B)
  as a servlet. If you wish only to test that Harvester is functioning,
  or if you expect to use Harvester infrequently, it may be desirable to run it from a
  command window. However, under most circumstances you will want Harvester to
  run continuously as a background servlet process. This eliminates the
  need to keep a command window continuously open while Harvester is running.
  Both of these procedures are described below.
  <ul>
  <li> (A) Running Harvester in a Command Window
  <ol>
  <li>Open a system command window or terminal window.</li>
  <li>Set the METACAT_HOME environment variable to the value of the Metacat
      installation directory. Some examples follow:
      <ul>
        <li>On Windows:
        <pre>set METACAT_HOME=C:\somePath\metacat</pre></li>
        <li>On Linux/Unix (bash shell):
        <pre>export METACAT_HOME=/home/somePath/metacat</pre></li>
      </ul>
  </li>
  <li>cd to the following directory:
      <ul>
        <li>On Windows:
        <pre>cd %METACAT_HOME%\lib\harvester</pre></li>
        <li>On Linux/Unix:
        <pre>cd $METACAT_HOME/lib/harvester</pre></li>
      </ul>
  </li>
  <li>Run the appropriate Harvester shell script, as determined by the
      operating system:
      <ul>
        <li>On Windows:
        <pre>runHarvester.bat</pre></li>
        <li>On Linux/Unix:
        <pre>sh runHarvester.sh</pre></li>
      </ul>
  </li>
 </ol>
  <p>The Harvester application will start executing. It will begin its first
  harvest after <code><b>delay</b></code> hours (as specified in the
  <a href=../../lib/metacat.properties>metacat.properties</a>
  file). The application will continue running a new harvest every
  <code><b>period</b></code> hours until <code><b>maxHarvests</b></code>
  harvests have been completed, or until you interrupt the process by
  pressing Ctrl+C in the command window.
  </p>
  </li>
  <li> (B) Running Harvester as a Servlet
  <ol>
  <li>In your Metacat installation, edit the file <code>lib/web.xml.<em>tomcatN</em></code>, where <em>tomcatN</em> corresponds to the
  version of Tomcat you are running. For example, if you are running Tomcat 5,
  edit <code>lib/web.xml.tomcat5</code>.</li>
  <li>Remove the comment symbols around the HarvesterServlet entry, so that:
  <pre><code>
  &lt;!--
  &lt;servlet>
  &lt;servlet-name>HarvesterServlet&lt;/servlet-name>
  &lt;servlet-class>edu.ucsb.nceas.metacat.harvesterClient.HarvesterServlet&lt;/servlet-class>
  &lt;init-param>
    &lt;param-name>debug&lt;/param-name>
    &lt;param-value>1&lt;/param-value>
  &lt;/init-param>
  &lt;init-param>
    &lt;param-name>listings&lt;/param-name>
    &lt;param-value>true&lt;/param-value>
  &lt;/init-param>
  &lt;load-on-startup>1&lt;/load-on-startup>
  &lt;/servlet>
  --&gt;
  </code></pre>
  is changed to:
  <pre><code>
  &lt;servlet>
  &lt;servlet-name>HarvesterServlet&lt;/servlet-name>
  &lt;servlet-class>edu.ucsb.nceas.metacat.harvesterClient.HarvesterServlet&lt;/servlet-class>
  &lt;init-param>
    &lt;param-name>debug&lt;/param-name>
    &lt;param-value>1&lt;/param-value>
  &lt;/init-param>
  &lt;init-param>
    &lt;param-name>listings&lt;/param-name>
    &lt;param-value>true&lt;/param-value>
  &lt;/init-param>
  &lt;load-on-startup>1&lt;/load-on-startup>
  &lt;/servlet>
  </code></pre>
  Save the edited file.
  </li>
  <li>Shut down Tomcat.</li>
  <li>Redeploy Metacat by running the following two ant commands from the top-level
  directory of your Metacat installation:
  <pre><code>
  ant cleanweb
  ant install</code></pre>
  </li>
  <li>Restart Tomcat.</li>
 </ol>
  <p>About thirty seconds after you restart Tomcat, the Harvester servlet will
  start executing. It will begin its first
  harvest after <code><b>delay</b></code> hours (as specified in the
  <a href=../../lib/metacat.properties>metacat.properties</a>
  file). The servlet will continue running a new harvest every
  <code><b>period</b></code> hours until <code><b>maxHarvests</b></code>
  harvests have been completed, or until Tomcat shuts down.
  </p>
  </li>
  </ul>
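  <p>In both run modes the schedule follows from the <code>delay</code>,
  <code>period</code>, and <code>maxHarvests</code> properties. The Python
  sketch below computes the expected harvest start times under the assumption
  that <code>period</code> is measured from the start of one harvest to the
  start of the next; the real Harvester may time this differently. The
  function name <code>harvest_times</code> is hypothetical.</p>

```python
from datetime import datetime, timedelta


def harvest_times(start, delay_hours, period_hours, max_harvests):
    """Return the expected start time of each harvest.

    Assumption: the first harvest begins delay_hours after start-up and
    each later harvest begins period_hours after the previous one began.
    """
    first = start + timedelta(hours=delay_hours)
    return [first + timedelta(hours=period_hours * i)
            for i in range(max_harvests)]


# Harvester started at 1:00 p.m. with delay=12 and period=24,
# limited here to three harvests for brevity.
started = datetime(2004, 4, 9, 13, 0)
times = harvest_times(started, delay_hours=12, period_hours=24, max_harvests=3)
for t in times:
    print(t.isoformat())
```

  <p>With these values the first harvest falls at 1:00 a.m. on the day after
  start-up, which matches the example given for the <code>delay</code>
  property.</p>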
   <h5><a name="Reviewing Harvester">
  Reviewing Harvester Reports to the Harvester Administrator</a></h5>
  <P>
  After every harvest, Harvester will send an email report to the Harvester
  Administrator detailing the operations that were performed during the
  harvest. The report will contain information about each of the Harvest Sites
  that were harvested from, such as which EML documents were
  harvested and whether any errors were encountered.
  </P>
  <p>
  The harvest report will contain a list of log entries, where each log entry
  describes an operation that was performed by Harvester. Log entries that
  show a status value of 1 indicate that an error occurred during the
  operation, while those that show a status value of 0 indicate that the
  operation was completed successfully.
  </p>
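  <p>The status convention above (1 for an error, 0 for success) lends itself
  to a simple triage step. The following Python sketch groups failed
  operations by site. The dictionary layout of each entry and the site names
  are assumptions made for illustration only; the actual report is an email
  message whose format is not described here.</p>

```python
from collections import defaultdict


def errors_by_site(entries):
    """Group the operations that reported status 1 (error) by site."""
    grouped = defaultdict(list)
    for entry in entries:
        if entry["status"] == 1:
            grouped[entry["site"]].append(entry["operation"])
    return dict(grouped)


# Hypothetical log entries; only the 0/1 status meaning and the
# operation names GetDocSuccess and GetDocError come from this document.
entries = [
    {"site": "siteA", "operation": "GetDocSuccess", "status": 0},
    {"site": "siteA", "operation": "GetDocError", "status": 1},
    {"site": "siteB", "operation": "GetDocSuccess", "status": 0},
]

print(errors_by_site(entries))
```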
  <P>The Harvester Administrator should review the report, paying particularly
  close attention to any errors that are reported and to the accompanying error
  messages that are displayed. When errors are reported at
  a particular site, the Harvester Administrator should contact the Site
  Contact to determine the source of the error and its resolution. See
  <a href=#Reviewing>Reviewing Harvester Reports to the Site Contact</a> for a
  description of common sources of errors at a Harvest Site.
  </P>
  <p>Errors that are independent of a particular site may indicate a problem
  with Harvester itself, Metacat, or the database connection. Refer to the
  error message to determine the source of the error and its resolution.
  </p>
  <h4>Managing a Harvest Site</h4>
  A Harvest Site is managed by a Site Contact.
  The responsibilities of a Site Contact fall into the following categories:
    <ul>
      <li><a href=#Registering>Registering with Harvester</a></li>
      <li><a href=#Composing>Composing a Harvest List</a></li>
      <li><a href=#Preparing>Preparing EML Documents for harvest</a></li>
      <li><a href=#Reviewing>Reviewing Harvester reports to the Site Contact</a></li>
    </ul>
    <h5><a name="Registering">Registering with Harvester</a></h5>
  <p>
  A Site Contact registers a site with Harvester by logging in to the
  Harvester Registration page and entering several items of information
  that Harvester needs to know about the site.
  </p>
  <ol>
    <li>Logging in to the Harvester Registration Page
  <p>
  The Harvester Registration page is accessed from Metacat. For example, if
  the Metacat server that you wish to register with resides at the following
  URL:
  <pre>  http://somehost.somelocation.edu:8080/knb/index.jsp</pre>
  then the Harvester Registration page would be accessed at:
  <pre>  http://somehost.somelocation.edu:8080/knb/style/skins/knb/harvesterRegistrationLogin.html</pre>
  </p>
  <p>
  After bringing up this page in your browser, log in to your Metacat account
  by entering your username, organization, and password. For example:
      <table bgcolor="#ffffff" border="0" cellpadding="2" width='100%' >
        <tr >
          <td colspan=3 align=center >&nbsp;</td>
        </tr>
        <tr >
          <td colspan=3 align=center >
            <font face=verdana size=1%>
              <b>Please Enter Username, Organization, and Password </b>
            </font>
          </td>
        </tr>
        <tr>
          <td width='10%'> &nbsp;</td>
          <td width="25%" bgcolor="#4682b4">
            <p align="center">
            <font color="white" face=verdana size=2%>
            <b>Username</b>
            </font>
          </td>
          <td><p><input type="text" name="uid" value="jdoe" maxlength="100" size="28"></td>
        </tr>
        <tr>
          <td width='10%'> &nbsp;</td>
          <td width="25%" bgcolor="#4682b4">
            <p align="center">
            <font color="white" face=verdana size=2%>
            <b>Organization</b>
            </font>
          </td>
          <td>
            <input type="radio" name="o" value="NCEAS" checked>NCEAS
            <input type="radio" name="o" value="LTER">LTER
            <input type="radio" name="o" value="NRS">NRS
            <br>
            <input type="radio" name="o" value="PISCO">PISCO
            <input type="radio" name="o" value="OBFS">OBFS
            <input type="radio" name="o" value="Unaffiliated">Unaffiliated
          </td>
        </tr>
        <tr>
          <td width='10%'> &nbsp;</td>
          <td bgcolor="#4682b4">
            <p align="center">
            <font color="white" face=verdana size=2%>
            <b>Password</b>
            </font>
          </td>
          <td><p><input type="password" name="passwd" value="*******" maxlength="60" size="28">
          </td>
        </tr>
        <tr>
          <td colspan=3 align=center >&nbsp;</td>
        </tr>
      </table>
  In some cases, a Site Contact may need to log in to an anonymous account
  rather than his or her personal account. For example, an LTER Information
  Manager may need to log in to a dedicated account, named with a three-letter
  acronym, that has been set up for the LTER site. The username
  "GCE" would be used by the LTER Information Manager at the GCE (Georgia
  Coastal Ecosystems) site.
  </p>
    </li>
    <li>Registering with Harvester
  <p>
  After logging in, you will be presented with a web form that prompts you
  to enter information about your site and how often you want to schedule
  harvests at your site. For example:
      <table bgcolor="#ffffff" border="0" cellpadding="2" width='100%' >
        <tr >
          <td colspan=3 align=center >&nbsp;</td>
        </tr>
        <tr >
          <td colspan=3 align=center >
            <font face=verdana size=1%>
              <b>Metacat Harvester Registration </b>
            </font>
          </td>
        </tr>
        <tr>
          <td width='10%'> &nbsp;</td>
          <td width="25%" bgcolor="#4682b4">
            <p align="center">
            <font color="white" face=verdana size=2%>
            <b>Email address:</b>
            </font>
          </td>
          <td><p><input type="text" size="55" name="uid" value="myname@institution.edu" maxlength="100"></td>
        </tr>
        <tr>
          <td width='10%'> &nbsp;</td>
          <td bgcolor="#4682b4">
            <p align="center">
            <font color="white" face=verdana size=2%>
            <b>Harvest List URL:</b>
            </font>
          </td>
          <td><p><input type="text" size="55" name="passwd" value="http://somehost.institution.edu/~myname/harvestList.xml" maxlength="60">
          </td>
        </tr>
        <tr>
          <td colspan=3 align=center >&nbsp;</td>
        </tr>
        <tr>
          <td width='10%'> &nbsp;</td>
          <td bgcolor="#4682b4">
            <p align="center">
            <font color="white" face=verdana size=2%>
            <b>Harvest Frequency (1-99):</b>
            </font>
          </td>
          <td><p><input type="text" size="3" name="passwd" value="2" maxlength="60">
          </td>
        </tr>
        <tr>
          <td colspan=3 align=center >&nbsp;</td>
        </tr>
        <tr>
          <td width='10%'> &nbsp;</td>
          <td width="25%" bgcolor="#4682b4">
            <p align="center">
            <font color="white" face=verdana size=2%>
            <b>Unit:</b>
            </font>
          </td>
          <td>
            <input type="radio" name="o" value="days" >day(s)
            <input type="radio" name="o" value="weeks" checked>week(s)
            <input type="radio" name="o" value="months">month(s)
          </td>
        </tr>
      </table>
  <p>
  After values have been entered for each of these fields, click the Register
  button to register your site with Harvester.
  </p>
  <P>
  In the example shown above, Harvester will attempt to harvest documents from
  the site once every 2 weeks. It will access the site's Harvest List at URL
  "http://somehost.institution.edu/~myname/harvestList.xml" and send
  email reports to the Site Contact at email address "myname@institution.edu".
  </P>
  <P>
  Note that you may enter multiple email addresses by separating each
  address with a comma or a semicolon. For example,
  "myname@institution.edu,anothername@institution.edu".
  </P>
    </li>
    <li>Unregistering with Harvester
  <p>
  At any time after you have registered with Harvester, you may discontinue
  harvests at your site by unregistering. Simply log in as described above and
  then click the Unregister button. After doing so, Harvester will discontinue
  harvests at the site.
  </p>
    </li>
  </ol>
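  <p>The registration steps above involve two small pieces of logic that can
  be sketched in Python: deriving the Registration page address from the
  Metacat address, and checking the schedule values that the form accepts.
  This is a sketch under assumptions, not Harvester code. The names
  <code>registration_url</code> and <code>check_schedule</code> are
  hypothetical, and the path used is the one from the example in step 1,
  which may differ at installations that use a different skin.</p>

```python
from urllib.parse import urljoin

# Path taken from the example in step 1; assumed to be relative to the
# directory that holds index.jsp.
REGISTRATION_PATH = "style/skins/knb/harvesterRegistrationLogin.html"

# Units offered by the registration form shown above.
VALID_UNITS = ("days", "weeks", "months")


def registration_url(metacat_index_url):
    """Resolve the registration page relative to the Metacat index page."""
    return urljoin(metacat_index_url, REGISTRATION_PATH)


def check_schedule(frequency, unit):
    """Return True if frequency is 1-99 and unit is one the form offers."""
    return 1 <= frequency <= 99 and unit in VALID_UNITS


url = registration_url("http://somehost.somelocation.edu:8080/knb/index.jsp")
print(url)
print(check_schedule(2, "weeks"))
```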
  <h5><a name="Composing">Composing a Harvest List</a></h5>
  <p>
  A Harvest List is an XML file that holds a list of EML documents to be
  harvested. For each EML document in the list, the following information
  must be specified:
  <ul>
    <li><code>docid</code>, which consists of the:
      <ul>
        <li><code>scope</code>, e.g. "demoDocument". The scope is an identifier
            that indicates which group of documents this document belongs to.
        </li>
        <li><code>identifier</code>, e.g. "1". The identifier is a number that
            uniquely identifies this document within the scope.
        </li>
        <li><code>revision</code>, e.g. "5". The revision is a number that
            indicates the current revision of this document.
        </li>
      </ul>
    </li>
    <li><code>documentType</code>, e.g. "eml://ecoinformatics.org/eml-2.0.0".
        The documentType identifies the document as an EML document.</li>
    <li><code>documentURL</code>, e.g. "http://www.lternet.edu/~dcosta/document1.xml".
        The documentURL specifies a place where Harvester can locate
        and retrieve the document via HTTP.</li>
  </ul>
  </p>
  <p>
  The contents of a Harvest List XML file must conform to a particular
  XML Schema, as defined in file <a href="../../lib/harvester/harvestList.xsd">
  harvestList.xsd</a>. The contents of a valid Harvest List
  can best be illustrated by example. The sample Harvest List
  below contains two &lt;<code>document</code>&gt; elements that specify the
  information that Harvester needs to retrieve a pair of EML documents and
  upload them to Metacat:
  <pre>
&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList" &gt;
    &lt;document&gt;
        &lt;docid&gt;
            &lt;scope&gt;demoDocument&lt;/scope&gt;
            &lt;identifier&gt;1&lt;/identifier&gt;
            &lt;revision&gt;5&lt;/revision&gt;
        &lt;/docid&gt;
        &lt;documentType&gt;eml://ecoinformatics.org/eml-2.0.0&lt;/documentType&gt;
        &lt;documentURL&gt;http://www.lternet.edu/~dcosta/document1.xml&lt;/documentURL&gt;
    &lt;/document&gt;
    &lt;document&gt;
        &lt;docid&gt;
            &lt;scope&gt;demoDocument&lt;/scope&gt;
            &lt;identifier&gt;2&lt;/identifier&gt;
            &lt;revision&gt;1&lt;/revision&gt;
        &lt;/docid&gt;
        &lt;documentType&gt;eml://ecoinformatics.org/eml-2.0.0&lt;/documentType&gt;
        &lt;documentURL&gt;http://www.lternet.edu/~dcosta/document2.xml&lt;/documentURL&gt;
    &lt;/document&gt;
&lt;/hrv:harvestList&gt;
  </pre>
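  <p>To make the structure of the sample concrete, the Python sketch below
  reads a Harvest List with the standard library XML parser and collects the
  fields of each &lt;<code>document</code>&gt; element. It only illustrates
  the layout described above: the function name
  <code>read_harvest_list</code> is hypothetical, the embedded list is a
  one-document excerpt of the sample, and the sketch does not validate
  against <code>harvestList.xsd</code>.</p>

```python
import xml.etree.ElementTree as ET

# One-document excerpt of the sample Harvest List shown above.
HARVEST_LIST = b"""<?xml version="1.0" encoding="UTF-8" ?>
<hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList" >
    <document>
        <docid>
            <scope>demoDocument</scope>
            <identifier>1</identifier>
            <revision>5</revision>
        </docid>
        <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
        <documentURL>http://www.lternet.edu/~dcosta/document1.xml</documentURL>
    </document>
</hrv:harvestList>
"""


def read_harvest_list(xml_bytes):
    """Return one dict per <document> element in a Harvest List."""
    root = ET.fromstring(xml_bytes)
    documents = []
    # Only the root element carries the hrv namespace prefix; the child
    # elements are unqualified, so plain tag names match them.
    for doc in root.findall("document"):
        docid = doc.find("docid")
        documents.append({
            "scope": docid.findtext("scope"),
            "identifier": int(docid.findtext("identifier")),
            "revision": int(docid.findtext("revision")),
            "documentType": doc.findtext("documentType"),
            "documentURL": doc.findtext("documentURL"),
        })
    return documents


documents = read_harvest_list(HARVEST_LIST)
for d in documents:
    print(d["scope"], d["identifier"], d["revision"], d["documentURL"])
```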
  <p>
  After editing the Harvest List, ensure that the Harvest List XML file resides
  at the appropriate location on disk as specified by the URL that was entered
  during the <a href=#Registering>registration</a> process.
  </p>
  <p>
  The <a href=./harvestListEditor.html>Harvest List Editor</a> is a tool that
  assists in composing and editing a Harvest List.
  </p>
    <h5><a name="Preparing">Preparing EML Documents for harvest</a></h5>
  <p>
  To prepare a set of EML documents for harvest, ensure that the following is
  true for each document:
  <ul>
    <li>The document contains valid EML</li>
    <li>The document is specified in a &lt;document&gt; element in the
        site's Harvest List, as described above</li>
    <li>The file resides at the appropriate location on disk as specified
        by its URL in the Harvest List</li>
  </ul>
  </p>
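  <p>A Site Contact could automate part of this checklist before a scheduled
  harvest. The Python sketch below checks that each document URL can be
  retrieved and that the content is well-formed XML. Well-formedness is a
  weaker condition than EML validity, so this catches only some problems.
  The function <code>check_documents</code> and the <code>fetch</code>
  parameter are hypothetical; the demonstration uses an in-memory stand-in
  and example.org addresses instead of a real web server.</p>

```python
import xml.etree.ElementTree as ET


def check_documents(document_urls, fetch):
    """Return {url: problem} for documents that fail a basic check.

    fetch is any callable that takes a URL and returns the document
    bytes, raising OSError when the document cannot be retrieved.
    """
    problems = {}
    for url in document_urls:
        try:
            content = fetch(url)
        except OSError as exc:
            problems[url] = f"not retrievable: {exc}"
            continue
        try:
            ET.fromstring(content)
        except ET.ParseError as exc:
            problems[url] = f"not well-formed XML: {exc}"
    return problems


# In-memory stand-in for a Harvest Site, used only for this demonstration.
fake_site = {
    "http://example.org/doc1.xml": b"<eml/>",
    "http://example.org/doc2.xml": b"<eml>",
}


def fake_fetch(url):
    if url not in fake_site:
        raise OSError("document not found")
    return fake_site[url]


urls = [
    "http://example.org/doc1.xml",
    "http://example.org/doc2.xml",
    "http://example.org/doc3.xml",
]
problems = check_documents(urls, fake_fetch)
for url, problem in problems.items():
    print(url, "->", problem)
```

  <p>For a live site, <code>fetch</code> could be
  <code>lambda url: urllib.request.urlopen(url).read()</code>, since the
  errors raised by <code>urllib.request</code> are subclasses of
  <code>OSError</code>.</p>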
    <h5><a name="Reviewing" >Reviewing Harvester Reports to the Site Contact</a></h5>
  <P>
  After every scheduled harvest that takes place at a particular Harvest
  Site, Harvester will send an email report to the Site Contact detailing the
  operations that were performed during the harvest.
  The report will contain information about the operations that were
  performed by Harvester at that site, such as
  which EML documents were harvested and whether any errors were encountered.
  </P>
  <P>
  The Site Contact should review the report, paying particularly
  close attention to any errors that are reported. Errors are indicated
  by operations that display a status value of 1, while operations that
  display a status value of 0 indicate that the operation completed
  successfully.
  </P>
  <p>
  When errors are reported,
  the Site Contact should try to determine whether the source of the error
  is something that can be corrected at the site. Common causes of errors
  might be:
  <ul>
    <li>A document URL specified in the Harvest List does not match
        the location of the actual EML file on the disk</li>
    <li>The Harvest List does not contain valid XML as specified in
        the <a href=../../lib/harvester/harvestList.xsd>harvestList.xsd</a> schema</li>
    <li>The URL to the Harvest List that was specified during
        registration with Harvester does not match the actual location of
        the Harvest List on the disk</li>
    <li>An EML document that Harvester attempted to upload to Metacat does
        not contain valid EML</li>
  </ul>
  </P>
  <p>
  If the Site Contact is unable to determine the cause of the error and its
  resolution, he or she should contact the Harvester Administrator for assistance.
  </p>
  <a href="./properties.html">Back</a> |
  <a href="./metacattour.html">Home</a> |
  <a href="./unimplem.html">Next</a>
</BODY>
</HTML>