<!--
  * harvester.html
  *
  * Authors: Duane Costa
  * Copyright: 2004 Regents of the University of California and the
  * National Center for Ecological Analysis and Synthesis,
  * and the University of New Mexico.
  * For Details: http://www.nceas.ucsb.edu/
  * Created: 2004 April 9
  * Version:
  * File Info: '$ '
  *
  *
-->
<HTML>
<HEAD>
<TITLE>Metacat Harvester</TITLE>
<link rel="stylesheet" type="text/css" href="@docrooturl@default.css">
</HEAD>
<BODY>
<table width="100%">
<tr>
<td class="tablehead" colspan="2">
<p class="label">Metacat Harvester</p>
</td>
<td class="tablehead" colspan="2" align="right">
<a href="./properties.html">Back</a> |
<a href="./metacattour.html">Home</a> |
<a href="./unimplem.html">Next</a>
</td>
</tr>
</table>
<h4>Introduction</h4>
The Metacat Harvester (henceforth referred to as "Harvester") is a
program that automates the retrieval of EML documents from one or more sites
and their subsequent upload (insert or update) to Metacat. Harvester uses pull
technology: on a regular schedule, it retrieves documents from each site and
uploads them to Metacat.
<P>
Although Harvester is included with a Metacat installation (beginning with
Metacat version 1.4.0), it is an optional extension to Metacat's
functionality.
</P>
<h4>Definitions</h4>
The following table defines a number of terms that are useful in discussing
Harvester and its features.
<br><br>
<table border="1">
<tr>
<td><b>Term</b></td>
<td><b>Definition</b></td>
</tr>
<tr>
<td>Harvester</td>
<td>The Harvester program, a Java application that is bundled with the
Metacat distribution. When a user installs Metacat on a system,
the Harvester program is automatically included in the
installation.
</td>
</tr>
<tr>
<td>Harvester Administrator</td>
<td>The individual who installs and manages Harvester. Typically, this
would be the same individual who installs and manages Metacat at a
given installation.
</td>
</tr>
<tr>
<td>Harvest Site</td>
<td>A location from which Harvester can retrieve EML documents. A given
Harvester can retrieve documents from any number of different
Harvest Sites.
</td>
</tr>
<tr>
<td>Harvest</td>
<td>The act (by Harvester) of visiting a Harvest Site, retrieving a
number of EML documents, and inserting or updating the documents to
Metacat.
</td>
</tr>
<tr>
<td><a name="HarvestList" >Harvest List</a></td>
<td>An XML document that lists a set of EML documents to be harvested. The
Harvest List must conform to an XML Schema,
<a href="../../lib/harvester/harvestList.xsd">harvestList.xsd</a>.
</td>
</tr>
<tr>
<td>Site Contact</td>
<td>The individual at a particular Harvest Site who registers with
Harvester, composes a Harvest List, and periodically prepares
the site's EML documents for retrieval and upload to Metacat.
</td>
</tr>
<tr>
<td>Harvest List URL</td>
<td>A URL to the Harvest List, as specified by the Site Contact.
Each Harvest Site corresponds to a Harvest List URL. Harvester
uses the URL to locate and read a site's Harvest List.
</td>
</tr>
<tr>
<td>Document URL</td>
<td>A URL to an EML document, as specified in the Harvest List.
The Harvest List may contain any number of Document URLs. Each
Document URL provides a locator to a document to be harvested.
</td>
</tr>
<tr>
<td>Harvester Registration Page</td>
<td>A web page that provides a means for a Site Contact
to register with Harvester to schedule regular harvests from the
site. Registration involves logging in and then specifying various
settings for the Harvest Site, such as the Harvest List URL, the
harvest frequency, and the email address of the Site Contact.
</td>
</tr>
</table>
<h4>Managing Harvester</h4>
Harvester is managed by the Harvester Administrator. Typically, the same
individual who manages a Metacat server would also act as the Harvester
Administrator. The responsibilities of the Harvester Administrator include:
<ul>
<li><a href="#Configuring Harvester">Configuring Harvester</a></li>
<li><a href="#Running Harvester">Running Harvester</a></li>
<li><a href="#Reviewing Harvester">Reviewing Harvester reports to
the Harvester Administrator</a></li>
</ul>
<h5><a name="Configuring Harvester">Configuring Harvester</a></h5>
<p>Harvester must be configured to interact with a working Metacat
installation. Thus, a properly configured and installed Metacat is a
prerequisite to running Harvester.
Additionally, Harvester has a number of settable properties that
control its behavior. All Harvester configuration information is managed
in a single file,
<a href="../../lib/metacat.properties">metacat.properties</a>,
located at:
<pre> METACAT_HOME/lib/metacat.properties</pre>
where METACAT_HOME is the top-level directory in which Metacat is
installed.
</p>
<p>Harvester properties are grouped together in
<a href="../../lib/metacat.properties">metacat.properties</a>, beginning
after the comment line:
<pre><code> # Harvester properties</code></pre>
</p>
<p>The Harvester Administrator should edit
<a href="../../lib/metacat.properties">metacat.properties</a>,
setting appropriate values for the <code><b>harvesterAdministrator</b></code>
property, the <code><b>smtpServer</b></code> property, and possibly other
properties. The following table summarizes each property and its function.
</p>
<table border="1">
<tr>
<td><b>Property</b></td>
<td><b>Description</b></td>
<td><b>Possible or default value</b></td>
</tr>
<tr>
<td>connectToMetacat</td>
<td>This property determines whether Harvester should connect to
Metacat to upload documents. It should be set to <code>true</code>
under most circumstances. Setting this property to <code>false</code>
can be useful for testing whether Harvester is able to retrieve
documents from a site without actually connecting to Metacat to
upload the documents.</td>
<td><code>true</code> | <code>false</code><br>
Default: <code>true</code>
</td>
</tr>
<tr>
<td>delay</td>
<td>The number of hours that Harvester will wait before beginning its
first harvest. For example, if Harvester is run at 1:00 p.m. and
the delay is set to 12, Harvester will begin its first harvest at
1:00 a.m. the next day.</td>
<td>Default: 0</td>
</tr>
<tr>
<td>harvesterAdministrator</td>
<td>The email address of the Harvester Administrator. Harvester will
send email reports to this address after every harvest. You may
enter multiple email addresses by separating each address with
a comma or semicolon, for example, "name1@abc.edu,name2@abc.edu".
</td>
<td>An email address, or multiple email addresses separated by commas
or semicolons</td>
</tr>
<tr>
<td>logPeriod</td>
<td>The number of days that Harvester should retain log entries of harvest
operations in the database. Harvester log entries record information
such as which documents were harvested, from which sites, and
whether any errors were encountered during the harvest. Log entries
older than <code>logPeriod</code> days are purged from the
database at the end of each harvest.</td>
<td>Default: 90</td>
</tr>
<tr>
<td>maxHarvests</td>
<td>The maximum number of harvests that Harvester should execute before
shutting down. When the Harvester program is executed, it will
continue running until it has executed <code>maxHarvests</code>
harvests, and then the program will terminate.</td>
<td>Default: 30</td>
</tr>
<tr>
<td>period</td>
<td>The number of hours between harvests. Harvester will run a new
harvest every <code>period</code> hours, until
<code>maxHarvests</code> harvests have been run.</td>
<td>Default: 24</td>
</tr>
<tr>
<td>smtpServer</td>
<td>The SMTP server that Harvester uses for sending email messages
to the Harvester Administrator and to Site Contacts.</td>
<td>A host name, for example: <code>somehost.institution.edu</code>
<br><br>
Default: <code>localhost</code>
<br><br>
Note that the default value will only work if the Harvester
host machine has been configured as an SMTP server.
</td>
</tr>
<tr>
<td>Harvester Operation Properties (GetDocError, GetDocSuccess, etc.)</td>
<td>This group of properties is used by Harvester to report information
about the operations it performs for inclusion in log
entries and email messages. Under most circumstances the values
of these properties should not be modified.</td>
<td> </td>
</tr>
</table>
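<p>To make the defaults above concrete, the following is a minimal sketch of
how the listed properties could be read with the standard
<code>java.util.Properties</code> class. The class name
<code>HarvesterSettings</code> and its fields are hypothetical and chosen
only for illustration; this is not Harvester's own configuration code. The
property names and default values are the ones listed in the table.</p>

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative sketch only; this is not Harvester's own configuration code.
class HarvesterSettings {
    final boolean connectToMetacat;
    final int delayHours;
    final int logPeriodDays;
    final int maxHarvests;
    final int periodHours;
    final String smtpServer;

    HarvesterSettings(Properties p) {
        // Each fallback below is the default value listed in the property table.
        connectToMetacat = Boolean.parseBoolean(get(p, "connectToMetacat", "true"));
        delayHours = Integer.parseInt(get(p, "delay", "0"));
        logPeriodDays = Integer.parseInt(get(p, "logPeriod", "90"));
        maxHarvests = Integer.parseInt(get(p, "maxHarvests", "30"));
        periodHours = Integer.parseInt(get(p, "period", "24"));
        smtpServer = get(p, "smtpServer", "localhost");
    }

    // Returns the property value, or the fallback if the property is absent.
    private static String get(Properties p, String key, String fallback) {
        return p.getProperty(key, fallback).trim();
    }

    // Parses properties-file text, such as the contents of metacat.properties.
    static HarvesterSettings parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new HarvesterSettings(p);
    }

    public static void main(String[] args) {
        String sample = "# Harvester properties\n"
                + "delay=12\n"
                + "smtpServer=somehost.institution.edu\n";
        HarvesterSettings s = parse(sample);
        System.out.println("delay=" + s.delayHours + " period=" + s.periodHours
                + " smtpServer=" + s.smtpServer);
    }
}
```

<p>In this sketch, any property that is missing from the file falls back to
its documented default, so only the values that differ need to be set.</p>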
<br>
<h5><a name="Running Harvester">Running Harvester</a></h5>
After Harvester has been appropriately
<a href="#Configuring Harvester">configured</a>,
it can be run in either of two ways: (A) in a command window, or (B)
as a servlet. If you wish only to test that Harvester is functioning,
or if you expect to use Harvester infrequently, it may be desirable to run it from a
command window. However, under most circumstances you will want Harvester to
run continuously as a background servlet process. This eliminates the
need to keep a command window continuously open while Harvester is running.
Both of these procedures are described below.
<ul>
<li> (A) Running Harvester in a Command Window
<ol>
<li>Open a system command window or terminal window.</li>
<li>Set the METACAT_HOME environment variable to the value of the Metacat
installation directory. Some examples follow:
<ul>
<li>On Windows:
<pre>set METACAT_HOME=C:\somePath\metacat</pre></li>
<li>On Linux/Unix (bash shell):
<pre>export METACAT_HOME=/home/somePath/metacat</pre></li>
</ul>
</li>
<li>Change to the following directory:
<ul>
<li>On Windows:
<pre>cd %METACAT_HOME%\lib\harvester</pre></li>
<li>On Linux/Unix:
<pre>cd $METACAT_HOME/lib/harvester</pre></li>
</ul>
</li>
<li>Run the appropriate Harvester script, as determined by the
operating system:
<ul>
<li>On Windows:
<pre>runHarvester.bat</pre></li>
<li>On Linux/Unix:
<pre>sh runHarvester.sh</pre></li>
</ul>
</li>
</ol>
<p>The Harvester application will start executing. It will begin its first
harvest after <code><b>delay</b></code> hours (as specified in the
<a href="../../lib/metacat.properties">metacat.properties</a>
file). The application will continue running a new harvest every
<code><b>period</b></code> hours until <code><b>maxHarvests</b></code>
harvests have been completed, or until you interrupt the process by
pressing Ctrl+C in the command window.
</p>
</li>
<li> (B) Running Harvester as a Servlet
<ol>
<li>In your Metacat installation, edit the file <code>lib/web.xml.<em>tomcatN</em></code>, where <em>tomcatN</em> corresponds to the
version of Tomcat you are running. For example, if you are running Tomcat 5,
edit the file <code>lib/web.xml.tomcat5</code>.</li>
<li>Remove the comment symbols around the HarvesterServlet entry, so that:
<pre><code>
&lt;!--
&lt;servlet&gt;
  &lt;servlet-name&gt;HarvesterServlet&lt;/servlet-name&gt;
  &lt;servlet-class&gt;edu.ucsb.nceas.metacat.harvesterClient.HarvesterServlet&lt;/servlet-class&gt;
  &lt;init-param&gt;
    &lt;param-name&gt;debug&lt;/param-name&gt;
    &lt;param-value&gt;1&lt;/param-value&gt;
  &lt;/init-param&gt;
  &lt;init-param&gt;
    &lt;param-name&gt;listings&lt;/param-name&gt;
    &lt;param-value&gt;true&lt;/param-value&gt;
  &lt;/init-param&gt;
  &lt;load-on-startup&gt;1&lt;/load-on-startup&gt;
&lt;/servlet&gt;
--&gt;
</code></pre>
is changed to:
<pre><code>
&lt;servlet&gt;
  &lt;servlet-name&gt;HarvesterServlet&lt;/servlet-name&gt;
  &lt;servlet-class&gt;edu.ucsb.nceas.metacat.harvesterClient.HarvesterServlet&lt;/servlet-class&gt;
  &lt;init-param&gt;
    &lt;param-name&gt;debug&lt;/param-name&gt;
    &lt;param-value&gt;1&lt;/param-value&gt;
  &lt;/init-param&gt;
  &lt;init-param&gt;
    &lt;param-name&gt;listings&lt;/param-name&gt;
    &lt;param-value&gt;true&lt;/param-value&gt;
  &lt;/init-param&gt;
  &lt;load-on-startup&gt;1&lt;/load-on-startup&gt;
&lt;/servlet&gt;
</code></pre>
Save the edited file.
</li>
<li>Shut down Tomcat.</li>
<li>Redeploy Metacat by running the following two ant commands from the top-level
directory of your Metacat installation:
<pre><code>
ant cleanweb
ant install</code></pre>
</li>
<li>Restart Tomcat.</li>
</ol>
<p>About thirty seconds after you restart Tomcat, the Harvester servlet will
start executing. It will begin its first
harvest after <code><b>delay</b></code> hours (as specified in the
<a href="../../lib/metacat.properties">metacat.properties</a>
file). The servlet will continue running a new harvest every
<code><b>period</b></code> hours until <code><b>maxHarvests</b></code>
harvests have been completed, or until Tomcat shuts down.
</p>
</li>
</ul>
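<p>The interaction of <code>delay</code>, <code>period</code>, and
<code>maxHarvests</code> described above can be sketched as a small
calculation. The class and method names below are hypothetical, and the
sketch assumes that each harvest starts exactly on schedule, ignoring the
time a harvest takes to run; Harvester's actual timing may differ.</p>

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; assumes harvests start exactly on schedule.
class HarvestSchedule {

    // Returns the planned start time of every harvest for one Harvester run.
    static List<LocalDateTime> plannedHarvests(LocalDateTime programStart,
                                               int delayHours,
                                               int periodHours,
                                               int maxHarvests) {
        List<LocalDateTime> times = new ArrayList<>();
        // The first harvest begins "delay" hours after the program starts.
        LocalDateTime first = programStart.plusHours(delayHours);
        // Each later harvest begins "period" hours after the previous one.
        for (int i = 0; i < maxHarvests; i++) {
            times.add(first.plusHours((long) i * periodHours));
        }
        return times;
    }

    public static void main(String[] args) {
        // Harvester started at 1:00 p.m. with delay=12, period=24, maxHarvests=3.
        LocalDateTime start = LocalDateTime.of(2004, 4, 9, 13, 0);
        for (LocalDateTime t : plannedHarvests(start, 12, 24, 3)) {
            System.out.println(t);
        }
    }
}
```

<p>With a start time of 1:00 p.m. and a delay of 12, the first planned
harvest falls at 1:00 a.m. the following day, matching the example given
for the <code>delay</code> property.</p>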
<h5><a name="Reviewing Harvester">
Reviewing Harvester Reports to the Harvester Administrator</a></h5>
<P>
After every harvest, Harvester will send an email report to the Harvester
Administrator detailing the operations that were performed during the
harvest. The report will contain information about each Harvest Site
that was harvested, such as which EML documents were
harvested and whether any errors were encountered.
</P>
<p>
The harvest report will contain a list of log entries, where each log entry
describes an operation that was performed by Harvester. Log entries that
show a status value of 1 indicate that an error occurred during the
operation, while those that show a status value of 0 indicate that the
operation was completed successfully.
</p>
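<p>As a rough illustration of the status convention, the sketch below picks
out the failed operations from a list of log entries. The
<code>LogEntry</code> class, its fields, and the sample messages are
hypothetical stand-ins; the actual layout of a harvest report is not
specified here. Only the meaning of the status values (1 for an error, 0 for
success) and the operation names <code>GetDocSuccess</code> and
<code>GetDocError</code> are taken from this page.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; LogEntry is a hypothetical stand-in for one report entry.
class HarvestReportCheck {

    static final class LogEntry {
        final String operation;
        final int status;
        final String message;

        LogEntry(String operation, int status, String message) {
            this.operation = operation;
            this.status = status;
            this.message = message;
        }
    }

    // Keeps only the entries whose status is 1, which marks an error.
    static List<LogEntry> errors(List<LogEntry> entries) {
        List<LogEntry> failed = new ArrayList<>();
        for (LogEntry e : entries) {
            if (e.status == 1) {
                failed.add(e);
            }
        }
        return failed;
    }

    // Made-up sample data, used only to exercise the filter.
    static List<LogEntry> sampleReport() {
        List<LogEntry> entries = new ArrayList<>();
        entries.add(new LogEntry("GetDocSuccess", 0, "Retrieved document1.xml"));
        entries.add(new LogEntry("GetDocError", 1, "Could not retrieve document2.xml"));
        return entries;
    }

    public static void main(String[] args) {
        for (LogEntry e : errors(sampleReport())) {
            System.out.println(e.operation + ": " + e.message);
        }
    }
}
```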
<P>The Harvester Administrator should review the report, paying
close attention to any errors that are reported and to the accompanying error
messages. When errors are reported at
a particular site, the Harvester Administrator should contact the Site
Contact to determine the source of the error and its resolution. See
<a href="#Reviewing">Reviewing Harvester Reports to the Site Contact</a> for a
description of common sources of errors at a Harvest Site.
</P>
<p>Errors that are independent of a particular site may indicate a problem
with Harvester itself, Metacat, or the database connection. Refer to the
error message to determine the source of the error and its resolution.
</p>
<h4>Managing a Harvest Site</h4>
A Harvest Site is managed by a Site Contact.
The responsibilities of a Site Contact fall into the following categories:
<ul>
<li><a href="#Registering">Registering with Harvester</a></li>
<li><a href="#Composing">Composing a Harvest List</a></li>
<li><a href="#Preparing">Preparing EML Documents for harvest</a></li>
<li><a href="#Reviewing">Reviewing Harvester reports to the Site Contact</a></li>
</ul>
<h5><a name="Registering">Registering with Harvester</a></h5>
<p>
A Site Contact registers a site with Harvester by logging in to the
Harvester Registration page and entering several items of information
that Harvester needs to know about the site.
</p>
<ol>
<li>Logging in to the Harvester Registration Page
<p>
The Harvester Registration page is accessed from Metacat. For example, if
the Metacat server that you wish to register with resides at the following
URL:
<pre> http://somehost.somelocation.edu:8080/knb/index.jsp</pre>
then the Harvester Registration page would be accessed at:
<pre> http://somehost.somelocation.edu:8080/knb/style/skins/knb/harvesterRegistrationLogin.html</pre>
</p>
<p>
After bringing up this page in your browser, log in to your Metacat account
by entering your username, organization, and password. For example:
<table bgcolor="#ffffff" border="0" cellpadding="2" width='100%' >
<tr >
<td colspan=3 align=center > </td>
</tr>
<tr >
<td colspan=3 align=center >
<font face=verdana size=1%>
<b>Please Enter Username, Organization, and Password </b>
</font>
</td>
</tr>
<tr>
<td width='10%'> </td>
<td width="25%" bgcolor="#4682b4">
<p align="center">
<font color="white" face=verdana size=2%>
<b>Username</b>
</font>
</td>
<td><p><input type="text" name="uid" value="jdoe" maxlength="100" size="28"></td>
</tr>
<tr>
<td width='10%'> </td>
<td width="25%" bgcolor="#4682b4">
<p align="center">
<font color="white" face=verdana size=2%>
<b>Organization</b>
</font>
</td>
<td>
<input type="radio" name="o" value="NCEAS" checked>NCEAS
<input type="radio" name="o" value="LTER">LTER
<input type="radio" name="o" value="NRS">NRS
<br>
<input type="radio" name="o" value="PISCO">PISCO
<input type="radio" name="o" value="OBFS">OBFS
<input type="radio" name="o" value="Unaffiliated">Unaffiliated
</td>
</tr>
<tr>
<td width='10%'> </td>
<td bgcolor="#4682b4">
<p align="center">
<font color="white" face=verdana size=2%>
<b>Password</b>
</font>
</td>
<td><p><input type="password" name="passwd" value="*******" maxlength="60" size="28">
</td>
</tr>
<tr>
<td colspan=3 align=center > </td>
</tr>
</table>
In some cases, a Site Contact may need to log in to an anonymous account
rather than his or her personal account. For example, an LTER Information
Manager may need to log in to a dedicated account, named with a three-letter
acronym, that has been set up for the LTER site. For instance, the
LTER Information Manager at the GCE (Georgia Coastal Ecosystems) site
would use the username "GCE".
</p>
</li>
<li>Registering with Harvester
<p>
After logging in, you will be presented with a web form that prompts you
to enter information about your site and how often you want to schedule
harvests at your site. For example:
<table bgcolor="#ffffff" border="0" cellpadding="2" width='100%' >
<tr >
<td colspan=3 align=center > </td>
</tr>
<tr >
<td colspan=3 align=center >
<font face=verdana size=1%>
<b>Metacat Harvester Registration </b>
</font>
</td>
</tr>
<tr>
<td width='10%'> </td>
<td width="25%" bgcolor="#4682b4">
<p align="center">
<font color="white" face=verdana size=2%>
<b>Email address:</b>
</font>
</td>
<td><p><input type="text" size="55" name="uid" value="myname@institution.edu" maxlength="100" size="28"></td>
</tr>
<tr>
<td width='10%'> </td>
<td bgcolor="#4682b4">
<p align="center">
<font color="white" face=verdana size=2%>
<b>Harvest List URL:</b>
</font>
</td>
<td><p><input type="text" size="55" name="passwd" value="http://somehost.institution.edu/~myname/harvestList.xml" maxlength="60" size="28">
</td>
</tr>
<tr>
<td colspan=3 align=center > </td>
</tr>
<tr>
<td width='10%'> </td>
<td bgcolor="#4682b4">
<p align="center">
<font color="white" face=verdana size=2%>
<b>Harvest Frequency (1-99):</b>
</font>
</td>
<td><p><input type="text" size="3" name="passwd" value="2" maxlength="60" size="28">
</td>
</tr>
<tr>
<td colspan=3 align=center > </td>
</tr>
<tr>
<td width='10%'> </td>
<td width="25%" bgcolor="#4682b4">
<p align="center">
<font color="white" face=verdana size=2%>
<b>Unit:</b>
</font>
</td>
<td>
<input type="radio" name="o" value="days" >day(s)
<input type="radio" name="o" value="weeks" checked>week(s)
<input type="radio" name="o" value="months">month(s)
</td>
</tr>
</table>
<p>
After values have been entered for each of these fields, click the Register
button to register your site with Harvester.
</p>
<P>
In the example shown above, Harvester will attempt to harvest documents from
the site once every 2 weeks. It will access the site's Harvest List at the URL
"http://somehost.institution.edu/~myname/harvestList.xml", and it will send
email reports to the Site Contact at the email address "myname@institution.edu".
</P>
<P>
Note that you may enter multiple email addresses by separating each
address with a comma or a semicolon. For example,
"myname@institution.edu,anothername@institution.edu".
</P>
</li>
<li>Unregistering with Harvester
<p>
At any time after you have registered with Harvester, you may discontinue
harvests at your site by unregistering. Simply log in as described above and
then click the Unregister button. Harvester will then discontinue
harvests at the site.
</p>
</li>
</ol>
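<p>The registration form accepts several email addresses in one field,
separated by commas or semicolons. The sketch below shows one way such a
field could be split into individual addresses. The class and method names
are hypothetical, and ignoring surrounding whitespace and empty entries is
an assumption of this sketch; it is not taken from Harvester's own parsing
code.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not Harvester's own address parsing.
class EmailAddressList {

    // Splits a field on commas or semicolons and drops blank entries.
    static List<String> split(String field) {
        List<String> addresses = new ArrayList<>();
        for (String part : field.split("[,;]")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                addresses.add(trimmed);
            }
        }
        return addresses;
    }

    public static void main(String[] args) {
        String field = "myname@institution.edu,anothername@institution.edu";
        for (String address : split(field)) {
            System.out.println(address);
        }
    }
}
```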
<h5><a name="Composing">Composing a Harvest List</a></h5>
<p>
A Harvest List is an XML file that holds a list of EML documents to be
harvested. For each EML document in the list, the following information
must be specified:
<ul>
<li><code>docid</code>, which consists of the:
<ul>
<li><code>scope</code>, e.g. "demoDocument". The scope is an identifier
that indicates which group of documents this document belongs to.
</li>
<li><code>identifier</code>, e.g. "1". The identifier is a number that
uniquely identifies this document within the scope.
</li>
<li><code>revision</code>, e.g. "5". The revision is a number that
indicates the current revision of this document.
</li>
</ul>
</li>
<li><code>documentType</code>, e.g. "eml://ecoinformatics.org/eml-2.0.0".
The documentType identifies the document as an EML document.</li>
<li><code>documentURL</code>, e.g. "http://www.lternet.edu/~dcosta/document1.xml".
The documentURL specifies a place where Harvester can locate
and retrieve the document via HTTP.</li>
</ul>
</p>
<p>
The contents of a Harvest List XML file must conform to a particular
XML Schema, as defined in the file <a href="../../lib/harvester/harvestList.xsd">
harvestList.xsd</a>. The contents of a valid Harvest List
can best be illustrated by example. The sample Harvest List
below contains two &lt;<code>document</code>&gt; elements that specify the
information that Harvester needs to retrieve a pair of EML documents and
upload them to Metacat:
<pre>
&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList" &gt;
  &lt;document&gt;
    &lt;docid&gt;
      &lt;scope&gt;demoDocument&lt;/scope&gt;
      &lt;identifier&gt;1&lt;/identifier&gt;
      &lt;revision&gt;5&lt;/revision&gt;
    &lt;/docid&gt;
    &lt;documentType&gt;eml://ecoinformatics.org/eml-2.0.0&lt;/documentType&gt;
    &lt;documentURL&gt;http://www.lternet.edu/~dcosta/document1.xml&lt;/documentURL&gt;
  &lt;/document&gt;
  &lt;document&gt;
    &lt;docid&gt;
      &lt;scope&gt;demoDocument&lt;/scope&gt;
      &lt;identifier&gt;2&lt;/identifier&gt;
      &lt;revision&gt;1&lt;/revision&gt;
    &lt;/docid&gt;
    &lt;documentType&gt;eml://ecoinformatics.org/eml-2.0.0&lt;/documentType&gt;
    &lt;documentURL&gt;http://www.lternet.edu/~dcosta/document2.xml&lt;/documentURL&gt;
  &lt;/document&gt;
&lt;/hrv:harvestList&gt;
</pre>
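<p>Before publishing a Harvest List, it can help to confirm that the file is
readable and lists the documents you expect. The following is a minimal
sketch that reads a Harvest List with the standard Java DOM parser and
prints each document's docid and URL. The class name
<code>HarvestListReader</code> is hypothetical, and joining the scope,
identifier, and revision with dots is a display choice made for this sketch.
The sketch only checks that the XML is well-formed; it does not validate
the file against <code>harvestList.xsd</code>.</p>

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch only; checks well-formedness, not schema validity.
class HarvestListReader {

    // A shortened Harvest List with one document, modeled on the sample above.
    static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>"
            + "<hrv:harvestList xmlns:hrv=\"eml://ecoinformatics.org/harvestList\">"
            + "<document><docid><scope>demoDocument</scope>"
            + "<identifier>1</identifier><revision>5</revision></docid>"
            + "<documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>"
            + "<documentURL>http://www.lternet.edu/~dcosta/document1.xml</documentURL>"
            + "</document></hrv:harvestList>";

    // Returns one "scope.identifier.revision -> URL" line per document element.
    static List<String> describe(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList documents = doc.getElementsByTagName("document");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < documents.getLength(); i++) {
                Element d = (Element) documents.item(i);
                lines.add(text(d, "scope") + "." + text(d, "identifier") + "."
                        + text(d, "revision") + " -> " + text(d, "documentURL"));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable Harvest List", e);
        }
    }

    // Returns the trimmed text of the first descendant element with this name.
    private static String text(Element parent, String name) {
        return parent.getElementsByTagName(name).item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        for (String line : describe(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```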
<p>
After editing the Harvest List, ensure that the Harvest List XML file resides
at the appropriate location on disk as specified by the URL that was entered
during the <a href="#Registering">registration</a> process.
</p>
<p>
The <a href="./harvestListEditor.html">Harvest List Editor</a> is a tool that
assists in composing and editing a Harvest List. See the
<a href="./harvestListEditor.html">Harvest List Editor</a> page for additional details.
</p>
<h5><a name="Preparing">Preparing EML Documents for harvest</a></h5>
<p>
To prepare a set of EML documents for harvest, ensure that the following is
true for each document:
<ul>
<li>The document contains valid EML</li>
<li>The document is specified in a &lt;document&gt; element in the
site's Harvest List, as described above</li>
<li>The file resides at the appropriate location on disk as specified
by its URL in the Harvest List</li>
</ul>
</p>
<h5><a name="Reviewing" >Reviewing Harvester Reports to the Site Contact</a></h5>
<P>
After every scheduled harvest that takes place at a particular Harvest
Site, Harvester will send an email report to the Site Contact detailing the
operations that were performed at that site during the harvest, such as
which EML documents were harvested and whether any errors were encountered.
</P>
<P>
The Site Contact should review the report, paying
close attention to any errors that are reported. Operations that
display a status value of 1 encountered an error, while operations that
display a status value of 0 completed
successfully.
</P>
<p>
When errors are reported,
the Site Contact should try to determine whether the source of the error
is something that can be corrected at the site. Common causes of errors
include:
<ul>
<li>A document URL specified in the Harvest List does not match
the location of the actual EML file on the disk</li>
<li>The Harvest List does not contain valid XML as specified in
the <a href="../../lib/harvester/harvestList.xsd">harvestList.xsd</a> schema</li>
<li>The URL to the Harvest List that was specified during
registration with Harvester does not match the actual location of
the Harvest List on the disk</li>
<li>An EML document that Harvester attempted to upload to Metacat does
not contain valid EML</li>
</ul>
</P>
<p>
If the Site Contact is unable to determine the cause of the error and its
resolution, he or she should contact the Harvester Administrator for assistance.
</p>
<a href="./properties.html">Back</a> |
<a href="./metacattour.html">Home</a> |
<a href="./unimplem.html">Next</a>
</BODY>
</HTML>