<!--
  * harvester.html
  *
  *      Authors: Duane Costa
  *    Copyright: 2004 Regents of the University of California and the
  *               National Center for Ecological Analysis and Synthesis,
  *               and the University of New Mexico.
  *  For Details: http://www.nceas.ucsb.edu/
  *      Created: 2004 April 9
  *      Version:
  *    File Info: '$ '
  *
  *
-->
<HTML>
<HEAD>
<TITLE>Metacat Spatial Option</TITLE>
<link rel="stylesheet" type="text/css" href="./default.css">
</HEAD>
<BODY>
  <table width="100%">
    <tr>
      <td class="tablehead" colspan="2">
        <p class="label">Metacat Spatial Option</p>
      </td>
      <td class="tablehead" colspan="2" align="right">
        <a href="./properties.html">Back</a> |
        <a href="./metacattour.html">Home</a> |
        <a href="./unimplem.html">Next</a>
      </td>
    </tr>
  </table>
  <h4>Introduction</h4>
The Metacat spatial option enables you to query and visualize the geographic coverage of Metacat documents.
This document provides a high-level overview of the Metacat spatial functionality. It is primarily a resource for users and developers who want to understand the architecture before digging into the code to extend the existing functionality.
<P>
Although the spatial option is included with a Metacat installation (beginning with
Metacat version 1.7.0), it is an extension to Metacat's functionality
that may be used optionally.
</P>
  <h4>Outline</h4>
 <ul>
   <li> <a href="#definitions">Definition of Terms</a> </li>
   <li> <a href="#overview"> Overview of the major components </a>
      <ul>
        <li> <a href="#spatial_harvester"> Spatial Harvester </a> </li>
        <li> <a href="#wms"> Web Mapping Service </a> </li>
        <li> <a href="#spatial_query"> Spatial Query </a> </li>
        <li> <a href="#html_client"> HTML Mapping Client</a> </li>
      </ul>
   </li>
   <li> <a href="#install">Installing and Configuring the Spatial Option</a> </li>
   <li> <a href="#adding_data">Adding Other Spatial Datasets to the Web Map</a> </li>
   <li> <a href="#dev">Developers Notes</a> </li>
   <li> <a href="#future">Future Directions</a> </li>
 </ul>

  <h4> <a name="definitions">Definitions</a></h4>
The following table defines a number of terms that are useful in discussing
the spatial option and its features.
  <br><br>
  <table border="1">
    <tr>
      <td><b>Term</b></td>
      <td><b>Definition</b></td>
    </tr>
    <tr>
      <td>Spatial Cache</td>
      <td>
          A cached version of the Metacat documents representing their geographic coverages in a GIS-compatible data format, the ESRI Shapefile.
      </td>
    </tr>
    <tr>
      <td>Web Mapping Service (WMS)</td>
      <td>
         A standard interface specification for requesting spatial data as a web-deliverable map image. WMS servers accept a common set of parameters via HTTP, render the spatial dataset into an appropriate image and deliver it back to the client. The <a href="http://www.opengeospatial.org/standards/wms">WMS spec</a> was developed by the <a href="http://www.opengeospatial.org/">Open Geospatial Consortium</a>.
      </td>
    </tr>
    <tr>
      <td>Bounding Box (BBOX)</td>
      <td>
          A bounding box is two pairs of geographic coordinates representing the full geographic extent of an entity: the minimum lat/long (the lower-left corner) and the maximum lat/long (the upper-right corner).
      </td>
    </tr>
    <tr>
      <td>Spatial Dataset</td>
      <td>
        A collection of <em>spatial features</em> in a common datastore.
      </td>
    </tr>
    <tr>
      <td>Spatial Features</td>
      <td>
        Analogous to a "row" in a tabular dataset, a feature is an entity composed of both tabular attributes and a <em>spatial geometry</em>.
      </td>
    </tr>
    <tr>
      <td>Spatial Geometry</td>
      <td>
        The geometry is a vector representation of an entity's geographic location. This can be a point (a single vertex), a line (a series of vertices) or a polygon (a series of vertices forming a closed area).
      </td>
    </tr>
    <tr>
      <td>Multi-Geometry</td>
      <td>
       A <em>spatial geometry</em> represented by one or more geometry primitives (points, lines and polygons). For example, a single species census (the <em>spatial feature</em>) might have multiple sample sites and could be represented as a multi-point geometry.
      </td>
    </tr>
  </table>

  <h4><a name="overview">Overview of the Major Components</a></h4>

        <h5> <a name="spatial_harvester"> Spatial Harvester </a> </h5>
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	The Spatial Harvester component syncs the metacat database with the spatial cache (an ESRI shapefile which contains the geographic coverages of the documents). </span></p>
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">The Spatial Harvester is implemented entirely in Java using the Geotools library, which allows manipulation of spatial datasets. In rough terms, a spatial dataset is a collection of Features, each composed of a geometry (e.g. the geographic coverage) and associated attributes (e.g. the document's title).</span></p>
121
   <p dir="ltr" style="text-align:left;margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	There are a number of Java classes which, collectively, make up the spatial harvester functionality. They are found in the </span><span style="font-style:italic" lang="en-US">edu.ucsb.nceas.metacat.spatial</span><span lang="en-US"> package:</span></p>
   <ul>
    <li style="margin-bottom:12pt"><span lang="en-US">&nbsp;</span><span style="font-weight:bold" lang="en-US">SpatialDataset</span><span lang="en-US"> : Provides read/write access to the spatial cache.</span></li>
    <li style="margin-bottom:12pt"><span lang="en-US">&nbsp;</span><span style="font-weight:bold" lang="en-US">SpatialDocument</span><span lang="en-US"> : Represents the geographic coverage of a document as a Geotools Feature. </span></li>
    <li style="margin-bottom:12pt"><span lang="en-US">&nbsp;</span><span style="font-weight:bold" lang="en-US">SpatialFeatureSchema</span><span lang="en-US"> : A class of static members defining the properties of the spatial cache (location on the file system, attribute and geometry schemas, etc.)</span></li>
    <li style="margin-bottom:12pt"><span lang="en-US">&nbsp;</span><span style="font-weight:bold" lang="en-US">SpatialHarvester</span><span lang="en-US"> : The high-level interface for manipulating the spatial cache and initiating the harvesting process.</span>    </li>
   </ul>
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"></p>
129
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	The spatial cache currently represents the geographic coverage of XML documents based on a bounding box. The four bounding coordinates (either latitudes or longitudes) can be specified in the metacat.properties file by their xpaths. For example, the geographic coverage of EML documents is defined as:</span></p>

<pre>
westBoundingCoordinatePath=geographicCoverage/boundingCoordinates/westBoundingCoordinate
eastBoundingCoordinatePath=geographicCoverage/boundingCoordinates/eastBoundingCoordinate
southBoundingCoordinatePath=geographicCoverage/boundingCoordinates/southBoundingCoordinate
northBoundingCoordinatePath=geographicCoverage/boundingCoordinates/northBoundingCoordinate
</pre>

   <p dir="ltr" style="text-align:left;margin-bottom:12pt;margin-left:0.1250in;text-indent:0in"></p>
   <p dir="ltr" style="text-align:left;margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">It is important to note that, at the moment, only one set of xpaths is defined in metacat.properties, meaning that only documents of the chosen schema can be accessed by the spatial harvester. Also note that, for performance reasons, the xpaths to the bounding coordinates must also appear in your indexPath (defined in build.properties).</span></p>
   <p dir="ltr" style="text-align:left;margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">The bounding coordinates are spatially cached in two ways: the centroid(s) of the bounding box(es) and the actual bounding box(es). These are stored as two separate shapefiles with multi-point and multi-polygon geometry types respectively. By default, </span><span style="font-size:8pt;font-family:'Courier New'" lang="en-US">${tomcat_webapps_directory}/${context}/data/metacat_shps/</span><span style="font-size:8pt;font-family:'Courier New'" lang="en-US">data_points.shp</span><span lang="en-US"> is the storage location of the point cache while </span><span style="font-size:8pt;font-family:'Courier New'" lang="en-US">data_bounds.shp</span><span lang="en-US"> represents the polygon cache. </span></p>
   <p dir="ltr" style="text-align:left;margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">The bounding polygon is not relevant to every document, as bounding coordinates are allowed to have zero area (i.e. west == east and north == south). In this case they are represented only as a point. In cases where no bounding coordinates are defined, the document is not represented at all in the spatial cache. Note that special care has been taken to account for cases where the bounding box crosses the international dateline or polar regions (at which point Cartesian calculations are invalid).</span></p>
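   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">The point-versus-polygon rule described above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class name <em>CoverageBox</em> and its methods are hypothetical and are not part of the <em>edu.ucsb.nceas.metacat.spatial</em> package, no Geotools types are used, and boxes that cross the dateline or a pole are deliberately not handled.</span></p>

```java
// Illustrative sketch only; not the Metacat SpatialDocument implementation.
// Assumes a box that does not cross the international dateline or a pole.
final class CoverageBox {
    final double west, south, east, north;

    CoverageBox(double west, double south, double east, double north) {
        this.west = west;
        this.south = south;
        this.east = east;
        this.north = north;
    }

    // A zero-area box (west == east and north == south) collapses to a point.
    boolean isPoint() {
        return west == east && north == south;
    }

    // Centroid as {x, y}: the value a point cache entry would carry.
    double[] centroid() {
        return new double[] { (west + east) / 2.0, (south + north) / 2.0 };
    }

    // Only boxes with a real extent also belong in the polygon cache.
    boolean belongsInPolygonCache() {
        return !isPoint();
    }
}
```

   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">Under these assumptions every coverage contributes a centroid to the point cache, and only coverages with a non-zero extent contribute a polygon as well.</span></p>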
142
   <p dir="ltr" style="text-align:left;margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	Because documents may have more than one geographic coverage, it is necessary to define the two spatial caches as </span><span style="font-style:italic" lang="en-US">multi</span><span lang="en-US">-point and </span><span style="font-style:italic" lang="en-US">multi</span><span lang="en-US">-polygon geometry types. This means that each feature's geometry field can contain a collection of one </span><span style="font-style:italic" lang="en-US">or more </span><span lang="en-US">primitive geometries.</span></p>
143
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	&nbsp;With the spatial option properly installed, the default metacat.properties setting is to set </span><span style="font-style:italic" lang="en-US">regenerateCacheOnRestart=true.</span><span lang="en-US"> This is very useful the first time you install metacat since it will generate the spatial cache from scratch when your servlet container is restarted. Depending on how many documents you have in your metacat database, this can take a considerable amount of time; several minutes in the case of a few thousand documents. For this reason, Metacat sets this property to <em>false</em> after the spatial cache has been generated the first time. This prevents the regeneration of the spatial cache every time you restart your servlet container. Note that if you upgrade or reinstall metacat, the spatial cache will be regenerated again.  </span></p>
144
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	Once the spatial cache has been generated, the Metacat servlet will keep the spatial cache in sync with the metacat database by triggering the spatial harvester on every insert, update or delete. This does not regenerate the whole spatial cache, instead simply updating features in the cache as needed. It is fairly quick and should not add more than 1/2 second to any given transaction. As mentioned earlier, all high-level interactions with the spatial cache are handled through the SpatialHarvester class. </span></p>
145
   <ul>
146
    <li style="margin-bottom:12pt"><span lang="en-US">	W</span><span lang="en-US">hen a document is deleted, the SpatialHarvester.addToDeleteQue() method is called directly from the Metacat Servlet. </span></li>
147
    <li style="margin-bottom:12pt"><span lang="en-US">	Inserts and updates are handled during the indexing process; after BuildIndex for a document has completed successfully (see </span><span style="font-style:italic" lang="en-US">DocumentImpl.java </span><span lang="en-US">) the SpatialHarvester.addToUpdateQue() method is invoked. The document is purged from the spatial cache (if updating) and the new document (or document revision) is added.</span>    </li>
148
   </ul>
   <p dir="ltr" style="text-align:left;margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">There is one very important note about document authentication. While Metacat provides very fine-grained permissions control at the document level, the Web Mapping Server component does not. For this reason, only documents that are publicly readable (i.e. documents which match the following SQL query: </span><span style="font-size:9pt;font-family:'Courier New'" lang="en-US">select distinct docid from xml_access where principal_name = 'public' and perm_type = 'allow'</span><span lang="en-US">) will be added to the spatial cache. </span><span lang="en-US">The Future Directions section of this document discusses the potential for adding feature-level permissions to the WMS server.</span></p>
        <h5> <a name="wms"> Web Mapping Service </a> </h5>
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">The primary function of the Web Mapping Server component is to render the spatial cache as a web-deliverable map image. It is also responsible for rendering other geographic data to provide base maps or other auxiliary map layers.</span></p>
158
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	The OpenGIS consortium has defined a standard for requesting maps, the Web Mapping Service or WMS standard. WMS servers accept a common set of parameters via http, render the spatial dataset into an appropriate image and deliver it back to the client. For Metacat, we chose to go with GeoServer, a WMS-capable application written in Java. </span></p>
159
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	Integration with Metacat Context</span></p>
160
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	Configuration</span></p>
161
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	Web Interface</span></p>
162
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	Supports many vector data formats</span></p>
163
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US"> &nbsp;Outputs Images (mainly) but can be used to output GML and KML.</span></p>
164
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	Known issues (rasters, startup procedure, difficult for novice to configure, maven build, size)</span></p>
165
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"></p>
        <h5> <a name="spatial_query"> Spatial Query </a> </h5>
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	Displaying the spatial cache as a map is important but users also need to query the spatial cache in order to answer the question "What documents lie in this geographic region?". The functionality is invoked through the metacat servlet itself; there is a </span><span style="font-style:italic" lang="en-US">spatial_query </span><span lang="en-US">action for this purpose. An example spatial query would be:</span></p>
174
   <p dir="ltr" style="text-align:left;margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span style="font-size:8pt;font-family:'Courier New'" lang="en-US">http://localhost/knb/metacat?action=spatial_query&amp;xmin=-117.5&amp;xmax=-64&amp;ymin=3&amp;ymax=46&amp;skin=default</span></p>
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">Here xmin, xmax, ymin and ymax represent the west, east, south and north bounding coordinates respectively. This returns an HTML document listing (in the style of the specified skin) all documents whose geographic coverage intersects the given bounding box.</span></p>
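   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">As a rough client-side illustration, a request of this form could be assembled as follows. The helper class <em>SpatialQueryUrl</em> is hypothetical; only the parameter names (action, xmin, xmax, ymin, ymax, skin) come from the example above, and the servlet URL is whatever your installation uses.</span></p>

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; only the URL parameter names come from the document.
final class SpatialQueryUrl {
    // metacatUrl is the servlet address, e.g. http://localhost/knb/metacat
    static String build(String metacatUrl, double xmin, double ymin,
                        double xmax, double ymax, String skin) {
        return metacatUrl + "?action=spatial_query"
                + "&xmin=" + xmin + "&xmax=" + xmax
                + "&ymin=" + ymin + "&ymax=" + ymax
                // Encode the skin name in case it contains reserved characters.
                + "&skin=" + URLEncoder.encode(skin, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(
                build("http://localhost/knb/metacat", -117.5, 3.0, -64.0, 46.0, "default"));
    }
}
```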
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">The core functionality of the spatial query mechanism is found in the </span><span style="font-style:italic" lang="en-US">edu.ucsb.nceas.metacat.spatial.SpatialQuery</span><span lang="en-US"> class and, like the spatial harvester, relies heavily on the Geotools library. This class has a single method,</span><span style="font-style:italic" lang="en-US"> filterByBbox()</span><span lang="en-US">, which compares the bounding box to both the point and polygon cache. For each shapefile, the process requires two steps: first, filter the spatial cache for features whose bounding box overlaps the specified bounding coordinates; second, iterate through the remaining features and perform an actual geometric intersection. The second step, though more costly than comparing the bbox, is necessary because the feature's geometry may be a multi-geometry whose bounding box is large but whose component primitive geometries are scattered over that area. The end result is a vector of docids matching the spatial query.</span></p>
177
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	This docid list is then sent to DBQuery. Using a special constructor that takes a vector of docids, the DBQuery class is able to use the Docid override mechanism to perform an optimized query (for cases where the list of docids is already known).</span></p>
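   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">The two-step filter can be illustrated with plain rectangles. The sketch below is not the SpatialQuery code: it uses no Geotools classes, every name in it (including this stand-in <em>filterByBbox</em>) is hypothetical, each primitive geometry is simplified to a box of the form {xmin, ymin, xmax, ymax}, and dateline crossing is ignored. It only shows why the envelope test alone is not sufficient for multi-geometries.</span></p>

```java
import java.util.Arrays;

// Illustrative sketch of the two-step test using plain rectangles.
// Boxes are {xmin, ymin, xmax, ymax}; all names here are hypothetical.
final class BboxFilterSketch {

    static final class Feature {
        final String docid;
        final double[][] parts; // one box per primitive geometry

        Feature(String docid, double[][] parts) {
            this.docid = docid;
            this.parts = parts;
        }

        // Envelope of all parts: the cheap first-pass bounding box.
        double[] envelope() {
            double[] env = parts[0].clone();
            for (double[] p : parts) {
                env[0] = Math.min(env[0], p[0]);
                env[1] = Math.min(env[1], p[1]);
                env[2] = Math.max(env[2], p[2]);
                env[3] = Math.max(env[3], p[3]);
            }
            return env;
        }
    }

    // True when two boxes share at least one point.
    static boolean overlaps(double[] a, double[] b) {
        return a[0] <= b[2] && b[0] <= a[2] && a[1] <= b[3] && b[1] <= a[3];
    }

    static boolean matches(Feature f, double[] query) {
        // Step 1: reject features whose envelope misses the query box.
        if (!overlaps(f.envelope(), query)) {
            return false;
        }
        // Step 2: require at least one primitive to actually intersect.
        return Arrays.stream(f.parts).anyMatch(p -> overlaps(p, query));
    }

    // Returns the docids of all features that pass both steps.
    static String[] filterByBbox(Feature[] cache, double[] query) {
        return Arrays.stream(cache)
                .filter(f -> matches(f, query))
                .map(f -> f.docid)
                .toArray(String[]::new);
    }
}
```

   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">A feature with one part near the origin and another far away has a large envelope that passes step 1 for a query box lying between them, but it is correctly dropped in step 2 because neither part touches the query box.</span></p>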
        <h5> <a name="html_client"> HTML Mapping Client</a> </h5>
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	In order to provide a web-based user interface to the WMS and the spatial query functionality, Metacat relies on Community Mapbuilder. Mapbuilder is a pure HTML/javascript application which uses AJAX and XSLT on the client side to create a desktop-GIS-like environment for interacting with geographic data through a web browser.</span></p>
188
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	Consumes WMS services, defined through WMC</span></p>
189
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	Map interface components (map, box zooms, layer list, "select location" dropdown, scalebar, coordinates, info query)</span></p>
190
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	Widget Architecture, Metacat config files, html divs</span></p>
191
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	AOIMetacat Query</span></p>
192
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"><span lang="en-US">	Integration with skins</span></p>
<h4> <a name="install">Installing and Configuring the Spatial Option</a> </h4>
 <p style="margin-bottom:12pt"><span lang="en-US">When first installing a version of metacat with the spatial option, you'll want to ensure that </span><span style="font-style:italic" lang="en-US">install.spatial</span><span lang="en-US"> is </span><span style="font-style:italic" lang="en-US">true</span><span lang="en-US"> in build.properties. You'll also want to ensure that </span><span style="font-style:italic" lang="en-US">runSpatialOption </span><span lang="en-US">is </span><span style="font-style:italic" lang="en-US">true</span><span lang="en-US"> in metacat.properties. Both of these values are the default so, unless you explicitly set them to be false, the spatial option should install and run automatically.</span></p>
201
202
   <p style="margin-bottom:12pt">How do I configure the layout of the html mapping interface?</p>
203
   <p style="margin-bottom:12pt">How can I configure the initial extent of the map?</p>
204
   <p style="margin-bottom:12pt">How do I configure the "select location" dropdown to contain different predefined locations?</p>
   <p dir="ltr" style="text-align:left;margin-bottom:12pt;margin-top:0.0000in;margin-right:0.0000in">Can I use a different web mapping interface?</p>
206
   <p dir="ltr" style="text-align:left;margin-bottom:12pt;margin-top:0.0000in;margin-right:0.0000in">How do I configure the styling and classification of the data?</p>
207
   <p style="margin-bottom:12pt">How can I upgrade/change the version of geoserver?</p>
208
   <p style="margin-bottom:12pt">What versions of tomcat are supported?</p>
209
   <p style="margin-bottom:12pt">The spatial functionality has only been tested on tomcat 5. </p>
210
   <p style="margin-bottom:12pt">How do I add the spatial functionality to a metacat skin?</p>
<h4> <a name="adding_data">Adding Other Spatial Datasets to the Web Map</a> </h4>
215
216
  <h5> WMS data </h5>
217
  <h5> Spatial datasets already registered with metacat </h5>
218
  <h5> Raster Images </h5>
<h4> <a name="dev">Developers Notes</a></h4>
223
224
<h5>web.xml</h5>
225
<h5>upgrading geoserver</h5>
226
<h5>tomcat</h5>
227
<h5>geotools versions</h5>
<h4> <a name="future">Future Directions</a> </h4>
232
233
   <h5>Automatically handle spatial datasets</h5>
   <p style="margin-bottom:12pt">When users put spatial data into the Morpho system, it would be nice if we could automatically pull all the available metadata from the spatial dataset itself. </p>
   <p style="margin-bottom:12pt">On the Metacat side, it might be worth trying to auto-detect spatial datasets and add them to the WMS service so that they could be displayed along with the metadata coverages. This is tricky since the styling of spatial data is intentionally separated from the data itself; we would need an easy way to prompt the user for the classification and styling info and construct the appropriate SLDs. </p>
236
   <p style="margin-bottom:12pt">It's worth noting that, currently, one could do this manually. There is nothing, aside from editing a few configuration files, to prevent any Geotools-supported dataset from being displayed through the WMS map interface.</p>
237
   <p style="margin-bottom:12pt">For vector datasets, it would be possible to store the data directly in the database itself (This is a logical extension of the future work to put tabular data directly in a relational database). Postgresql has the PostGIS extensions to handle this so we would have to require postgresql if we went this route.</p>
238
239
   <h5>WMS bypass</h5>
   <p style="margin-bottom:12pt"> Filter which spatial cache features are displayed by access constraints, skin constraints and the current non-spatial query set. This would involve intercepting incoming WMS requests and appending a styled layer descriptor (SLD) with an OGC filter to prevent/allow certain docids. </p>
241
242
   <h5>SLD factory</h5>
   <p style="margin-bottom:12pt">Closely related to the WMS bypass implementation, the SLD factory would be in charge of constructing the filter based on the constraints mentioned above. Because it would have to generate this list of docids on every WMS request, performance is a big concern. We will likely need to cache docid lists as session variables.</p>
244
245
   <h5>Map configuration interface </h5>
   <p style="margin-bottom:12pt">Geoserver currently offers a nice web-based configuration interface, but it lacks a few key features and may be difficult for a novice GIS user. We may want to build a custom geoserver configuration interface to:</p>
   <ul>
    <li style="margin-bottom:12pt"><span lang="en-US">Define</span> the available datasets (i.e. editing the geoserver xml config files)</li>
    <li style="margin-bottom:12pt"><span lang="en-US">Define</span> classification and styling (i.e. editing the SLDs)</li>
    <li style="margin-bottom:12pt"><span lang="en-US">Define which layers get displayed in which map (i.e. editing the Mapbuilder WMCs)</span></li>
    <li style="margin-bottom:12pt"><span lang="en-US">Pick a layout (from a list of pre-configured Mapbuilder templates)</span>    </li>
   </ul>
   <p style="margin-bottom:12pt;margin-left:0pt;text-indent:0in"> Ideally we could pull as much information as possible from the metadata and make the UI very intuitive. This does bring up issues of web-based admin access constraints and developing a subsystem to handle who has edit access to the map configuration.</p>
<br/><br/>
267
<hr/><hr/>
268
<br/><br/>
269
  Harvester is managed by the Harvester Administrator. Typically, the same
270
  individual who manages a Metacat server would also act as the Harvester
271
  Administrator. The responsibilities of the Harvester Administrator include:
272
    <ul>
273
      <li><a href="#Configuring Harvester">Configuring Harvester</a></li>
274
      <li><a href="#Running Harvester">Running Harvester</a></li>
275
      <li><a href="#Reviewing Harvester">Reviewing Harvester reports to
276
      the Harvester Administrator</a></li>
277
    </ul>
278
  <h5><a name="Configuring Harvester">Configuring Harvester</a></h5>
279
  <p>Harvester must be configured to interact with a working Metacat
280
     installation. Thus, a Metacat installation that has been properly
281
     configured and installed is a pre-requisite to running Harvester.
282
     Additionally, Harvester has a number of settable properties that
283
     control its behavior. All Harvester configuration information is managed
284
     in a single file,
285
     <a href=../../lib/metacat.properties>metacat.properties</a>,
286
     located at:
287
  <pre>      METACAT_HOME/lib/metacat.properties</pre>
288
     where METACAT_HOME is the top-level directory that Metacat is
289
     installed in.
290
  </p>
291
  <p>Harvester properties are grouped together in
292
     <a href=../../lib/metacat.properties>metacat.properties</a>, beginning
293
     after the comment line:
294
  <pre><code>      # Harvester properties</code></pre>
295
  </p>
296
  <p>The Harvester Administrator should edit
297
     <a href=../../lib/metacat.properties>metacat.properties</a>,
298
     setting appropriate values for the <code><b>harvesterAdministrator</b></code>
299
     property, the <code><b>smtpServer</b></code> property, and possibly other
300
     properties. The following table is a summary of each property and its function.
301
  </p>
302
  <table border="1">
303
    <tr>
304
      <td><b>Property</b></td>
305
      <td><b>Description</b></td>
306
      <td><b>Possible or default value</b></td>
307
    </tr>
308
    <tr>
309
      <td>connectToMetacat</td>
310
      <td>This property determines whether Harvester should connect to
311
          Metacat to upload documents. It should be set to <code>true</code>
312
          under most circumstances. Setting this property to <code>false</code>
313
          can be useful for testing whether Harvester is able to retrieve
314
          documents from a site without actually connecting to Metacat to
315
          upload the documents.</td>
316
      <td><code>true</code> | <code>false</code><br>
          Default: <code>true</code></td>
318
    </tr>
319
    <tr>
320
      <td>delay</td>
321
      <td>The number of hours that Harvester will wait before beginning its
322
          first harvest. For example, if Harvester is run at  1:00 p.m., and
323
          the delay is set to 12, Harvester will begin its first harvest at
324
          1:00 a.m.</td>
325
      <td>Default: 0</td>
326
    </tr>
327
    <tr>
328
      <td>harvesterAdministrator</td>
329
      <td>The email address of the Harvester Administrator. Harvester will
330
          send email reports to this address after every harvest. You may
331
          enter multiple email addresses by separating each address with
332
          a comma or semicolon, for example, "name1@abc.edu,name2@abc.edu".
333
      </td>
334
      <td>An email address, or multiple email addresses separated by commas
335
          or semi-colons</td>
336
    </tr>
337
    <tr>
338
      <td>logPeriod</td>
339
      <td>The number of days that Harvester should retain log entries of harvest
340
          operations in the database. Harvester log entries record information
341
          such as which documents were harvested, from which sites, and
342
          whether any errors were encountered during the harvest. Log entries
343
          older than <code>logPeriod</code> number of days are purged from the
344
          database at the end of each harvest.</td>
345
      <td>Default: 90</td>
346
    </tr>
347
    <tr>
348
      <td>maxHarvests</td>
349
      <td>The maximum number of harvests that Harvester should execute before
350
          shutting down. When the Harvester program is executed, it will
351
          continue running until it has executed <code>maxHarvests</code>
352
          number of harvests and then the program will terminate. If
353
          the value of <code>maxHarvests</code> is set to 0 or a negative
354
          number, it will be ignored and Harvester will execute indefinitely.
355
          </td>
356
      <td>Default: 0</td>
357
    </tr>
358
    <tr>
359
      <td>period</td>
360
      <td>The number of hours between harvests. Harvester will run a new
361
          harvest every <code>period</code> number of hours, until the
362
          <code>maxHarvests</code> number of harvests have been run, or
363
          indefinitely if <code>maxHarvests</code> is set to a value of
          0 or a negative number.</td>
365
      <td>Default: 24</td>
366
    </tr>
367
    <tr>
368
      <td>smtpServer</td>
369
      <td>The SMTP server that Harvester uses for sending email messages
370
          to the Harvester Administrator and to Site Contacts.</td>
371
      <td>A host name, for example: <code>somehost.institution.edu</code>
372
          <br><br>
373
          Default: <code>localhost</code>
374
          <br><br>
375
          Note that the default value will only work if the Harvester
376
          host machine has been configured as a SMTP server.
377
      </td>
378
    </tr>
379
    <tr>
380
      <td>Harvester Operation Properties (GetDocError, GetDocSuccess, etc.)</td>
381
      <td>This group of properties is used by Harvester to report information
382
          about the operations it performs for inclusion in log
383
          entries and email messages. Under most circumstances the values
384
          of these properties should not be modified.</td>
385
      <td>&nbsp;</td>
386
    </tr>
387
  </table>
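  <p>To make the interaction between <code>delay</code>, <code>period</code> and
     <code>maxHarvests</code> concrete, the following sketch computes when harvests
     would start, in hours after Harvester is launched. It is an illustration only and
     not Harvester's scheduling code: the class and method names are hypothetical, and it
     assumes that <code>period</code> is measured from the start of the previous harvest.
  </p>

```java
// Illustrative sketch; not Harvester's actual scheduling code.
// Assumes each harvest starts `period` hours after the previous one started.
final class HarvestScheduleSketch {
    // Hours after start-up at which each harvest begins. When maxHarvests is
    // 0 or negative Harvester runs indefinitely, so only `preview` entries
    // are listed in that case.
    static long[] startHours(int delay, int period, int maxHarvests, int preview) {
        int limit = maxHarvests > 0 ? Math.min(maxHarvests, preview) : preview;
        int count = Math.max(0, limit);
        long[] hours = new long[count];
        for (int i = 0; i < count; i++) {
            hours[i] = delay + (long) i * period;
        }
        return hours;
    }

    public static void main(String[] args) {
        // delay=12, period=24, unlimited harvests: show the first three.
        System.out.println(java.util.Arrays.toString(startHours(12, 24, 0, 3)));
    }
}
```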
388
  <br>
389
  <h5><a name="Running Harvester">Running Harvester</a></h5>
390
  After Harvester has been appropriately
391
  <a href="#Configuring Harvester">configured</a>,
392
  it can be run in either of two ways: (A) in a command window, or, (B)
393
  as a servlet. If you wish only to test that Harvester is functioning,
  or if you expect to use Harvester infrequently, it may be desirable to run it from a
395
  command window. However, under most circumstances you will want Harvester to
396
  run continuously as a background servlet process. This eliminates the
397
  need to keep a command window continuously open while Harvester is running.
398
  Both of these procedures are described below.
399
  <ul>
400
  <li> (A) Running Harvester in a Command Window
401
  <ol>
402
  <li>Open a system command window or terminal window.</li>
403
  <li>Set the METACAT_HOME environment variable to the value of the Metacat
      installation directory. Some examples follow:
      <ul>
        <li>On Windows:
        <pre>set METACAT_HOME=C:\somePath\metacat</pre></li>
        <li>On Linux/Unix (bash shell):
        <pre>export METACAT_HOME=/home/somePath/metacat</pre></li>
      </ul>
  </li>
  <li>Change to the following directory:
      <ul>
        <li>On Windows:
        <pre>cd %METACAT_HOME%\lib\harvester</pre></li>
        <li>On Linux/Unix:
        <pre>cd $METACAT_HOME/lib/harvester</pre></li>
      </ul>
  </li>
418
  <li>Run the appropriate Harvester shell script, as determined by the
419
      operating system:
420
      <ul>
421
        <li>On Windows:
422
        <pre>runHarvester.bat</pre></li>
423
        <li>On Linux/Unix:
424
        <pre>sh runHarvester.sh</pre></li>
425
      </ul>
426
  </li>
427
 </ol>
428
  <p>The Harvester application will start executing. It will begin its first
  harvest after <code><b>delay</b></code> hours (as specified in the
  <a href="../../lib/metacat.properties">metacat.properties</a>
  file). The application will then run a new harvest every
  <code><b>period</b></code> hours until <code><b>maxHarvests</b></code>
  harvests have been completed (if <code><b>maxHarvests</b></code> is set
  to a value greater than 0), or until you interrupt the process by pressing Ctrl+C
  in the command window.
  </p>
437
  </li>
438
  <li> (B) Running Harvester as a Servlet
439
  <ol>
440
  <li>Edit the file <code>lib/web.xml.<em>tomcatN</em></code> in your Metacat installation, where <em>tomcatN</em> corresponds to the
  version of Tomcat you are running. For example, if you are running Tomcat 5,
  edit the file <code>lib/web.xml.tomcat5</code>.</li>
443
  <li>Remove the comment symbols around the HarvesterServlet entry, so that:
444
  <pre><code>
  &lt;!--
  &lt;servlet>
  &lt;servlet-name>HarvesterServlet&lt;/servlet-name>
  &lt;servlet-class>edu.ucsb.nceas.metacat.harvesterClient.HarvesterServlet&lt;/servlet-class>
  &lt;init-param>
    &lt;param-name>debug&lt;/param-name>
    &lt;param-value>1&lt;/param-value>
  &lt;/init-param>
  &lt;init-param>
    &lt;param-name>listings&lt;/param-name>
    &lt;param-value>true&lt;/param-value>
  &lt;/init-param>
  &lt;load-on-startup>1&lt;/load-on-startup>
  &lt;/servlet>
  --&gt;
  </code></pre>
461
  is changed to:
462
  <pre><code>
  &lt;servlet>
  &lt;servlet-name>HarvesterServlet&lt;/servlet-name>
  &lt;servlet-class>edu.ucsb.nceas.metacat.harvesterClient.HarvesterServlet&lt;/servlet-class>
  &lt;init-param>
    &lt;param-name>debug&lt;/param-name>
    &lt;param-value>1&lt;/param-value>
  &lt;/init-param>
  &lt;init-param>
    &lt;param-name>listings&lt;/param-name>
    &lt;param-value>true&lt;/param-value>
  &lt;/init-param>
  &lt;load-on-startup>1&lt;/load-on-startup>
  &lt;/servlet>
  </code></pre>
477
  Save the edited file.
478
  </li>
479
  <li>Shut down Tomcat.</li>
480
  <li>Redeploy Metacat by running the following two ant commands from the top-level
481
  directory of your Metacat installation:
482
  <pre><code>
  ant cleanweb
  ant install</code></pre>
485
  </li>
486
  <li>Restart Tomcat.</li>
487
 </ol>
488
  <p>About thirty seconds after you restart Tomcat, the Harvester servlet will
  start executing. It will begin its first
  harvest after <code><b>delay</b></code> hours (as specified in the
  <a href="../../lib/metacat.properties">metacat.properties</a>
  file). The servlet will then run a new harvest every
  <code><b>period</b></code> hours until <code><b>maxHarvests</b></code>
  harvests have been completed (if <code><b>maxHarvests</b></code> is set
  to a value greater than 0), or until Tomcat shuts down.
  </p>
  </li>
  </ul>
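  <p>The scheduling rule described for both run modes can be sketched in a few
  lines of Python. This is only an illustration of how <code>delay</code>,
  <code>period</code>, and <code>maxHarvests</code> interact, not Harvester's
  actual implementation. The function name <code>harvest_offsets</code> is
  hypothetical, and the sketch assumes that each harvest starts exactly
  <code>period</code> hours after the previous one, ignoring the time a
  harvest takes to run.</p>

```python
from itertools import count, islice


def harvest_offsets(delay, period, max_harvests):
    """Yield the number of hours after startup at which each harvest begins.

    Hypothetical helper: a max_harvests of 0 or less means "run indefinitely",
    mirroring the documented meaning of the maxHarvests property.
    """
    runs = count() if max_harvests <= 0 else range(max_harvests)
    for n in runs:
        yield delay + n * period


# Three harvests, the first after 1 hour, then one every 24 hours.
print(list(harvest_offsets(1, 24, 3)))            # [1, 25, 49]

# maxHarvests = 0 never stops, so only the first four offsets are taken here.
print(list(islice(harvest_offsets(0, 24, 0), 4)))  # [0, 24, 48, 72]
```

  <p>A generator is used so that the unbounded case (<code>maxHarvests</code>
  of 0 or less) does not need special handling by the caller.</p>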
498
   <h5><a name="Reviewing Harvester">
499
  Reviewing Harvester Reports to the Harvester Administrator</a></h5>
500
  <P>
501
  After every harvest, Harvester will send an email report to the Harvester
502
  Administrator detailing the operations that were performed during the
503
  harvest. The report will contain information about each of the Harvest Sites
504
  that were harvested from, such as which EML documents were
505
  harvested and whether any errors were encountered.
506
  </P>
507
  <p>
508
  The harvest report will contain a list of log entries, where each log entry
509
  describes an operation that was performed by Harvester. Log entries that
510
  show a status value of 1 indicate that an error occurred during the
511
  operation, while those that show a status value of 0 indicate that the
512
  operation was completed successfully.
513
  </p>
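  <p>When a report is long, it can help to separate the failed operations from
  the successful ones. The sketch below assumes that the log entries have
  already been copied into a list of Python dictionaries with
  <code>operation</code>, <code>status</code>, and <code>message</code> keys.
  That layout is an assumption made for illustration; the format of the email
  report itself is not specified here, and the function name
  <code>split_by_status</code> is hypothetical.</p>

```python
def split_by_status(entries):
    """Split log entries into (succeeded, failed) lists.

    Follows the documented convention: status 0 means the operation
    completed successfully, status 1 means an error occurred.
    """
    succeeded = [e for e in entries if e["status"] == 0]
    failed = [e for e in entries if e["status"] == 1]
    return succeeded, failed


# Hypothetical entries; the operation names follow the Harvester
# operation properties (GetDocSuccess, GetDocError), the messages are made up.
entries = [
    {"operation": "GetDocSuccess", "status": 0, "message": "retrieved document"},
    {"operation": "GetDocError", "status": 1, "message": "could not retrieve document"},
]

succeeded, failed = split_by_status(entries)
for entry in failed:
    print(entry["operation"], "-", entry["message"])
```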
514
  <P>The Harvester Administrator should review the report, paying particularly
515
  close attention to any errors that are reported and to the accompanying error
516
  messages that are displayed. When errors are reported at
517
  a particular site, the Harvester Administrator should contact the Site
518
  Contact to determine the source of the error and its resolution. See
519
  <a href=#Reviewing>Reviewing Harvester Reports to the Site Contact</a> for a
520
  description of common sources of errors at a Harvest Site.
521
  </P>
522
  <p>Errors that are independent of a particular site may indicate a problem
523
  with Harvester itself, Metacat, or the database connection. Refer to the
524
  error message to determine the source of the error and its resolution.
525
  </p>
526
  <h4>Managing a Harvest Site</h4>
527
  A Harvest Site is managed by a Site Contact.
528
  The responsibilities of a Site Contact fall into the following categories:
529
    <ul>
530
      <li><a href=#Registering>Registering with Harvester</a></li>
531
      <li><a href=#Composing>Composing a Harvest List</a></li>
532
      <li><a href=#Preparing>Preparing EML Documents for harvest</a></li>
533
      <li><a href=#Reviewing>Reviewing Harvester reports to the Site Contact</a></li>
534
    </ul>
535
    <h5><a name="Registering">Registering with Harvester</a></h5>
536
  <p>
537
  A Site Contact registers a site with Harvester by logging in to the
538
  Harvester Registration page and entering several items of information
539
  that Harvester needs to know about the site.
540
  </p>
541
  <ol>
542
    <li>Logging in to the Harvester Registration Page
543
  <p>
544
  The Harvester Registration page is accessed from Metacat. For example, if
545
  the Metacat server that you wish to register with resides at the following
546
  URL:
547
  <pre>  http://somehost.somelocation.edu:8080/knb/index.jsp</pre>
548
  then the Harvester Registration page would be accessed at:
549
  <pre>  http://somehost.somelocation.edu:8080/knb/style/skins/knb/harvesterRegistrationLogin.html</pre>
550
  </p>
551
  <p>
552
  After bringing up this page in your browser, log in to your Metacat account
  by entering your username, organization, and password. For example:
554
      <table bgcolor="#ffffff" border="0" cellpadding="2" width='100%' >
555
        <tr >
556
          <td colspan=3 align=center >&nbsp;</td>
557
        </tr>
558
        <tr >
559
          <td colspan=3 align=center >
560
            <font face=verdana size=1%>
561
              <b>Please  Enter Username, Organization, and Password </b>
562
            </font>
563
          </td>
564
        </tr>
565
        <tr>
566
          <td width='10%'> &nbsp;</td>
567
          <td width="25%" bgcolor="#4682b4">
568
            <p align="center">
569
            <font color="white" face=verdana size=2%>
570
            <b>Username</b>
571
            </font>
572
          </td>
573
          <td><p><input type="text" name="uid" value="jdoe" maxlength="100" size="28"></td>
574
        </tr>
575
        <tr>
576
          <td width='10%'> &nbsp;</td>
577
          <td width="25%" bgcolor="#4682b4">
578
            <p align="center">
579
            <font color="white" face=verdana size=2%>
580
            <b>Organization</b>
581
            </font>
582
          </td>
583
          <td>
            <input type="radio" name="o" value="NCEAS" checked>NCEAS
            <input type="radio" name="o" value="LTER">LTER
            <input type="radio" name="o" value="NRS">NRS
            <br>
            <input type="radio" name="o" value="PISCO">PISCO
            <input type="radio" name="o" value="OBFS">OBFS
            <input type="radio" name="o" value="Unaffiliated">Unaffiliated
          </td>
        </tr>
592
        <tr>
593
          <td width='10%'> &nbsp;</td>
594
          <td bgcolor="#4682b4">
595
            <p align="center">
596
            <font color="white" face=verdana size=2%>
597
            <b>Password</b>
598
            </font>
599
          </td>
600
          <td><p><input type="password" name="passwd" value="*******" maxlength="60" size="28">
601
          </td>
602
        </tr>
603
        <tr>
604
          <td colspan=3 align=center >&nbsp;</td>
605
        </tr>
606
      </table>
607
  In some cases, a Site Contact may need to log in to an anonymous account
  rather than his or her personal account. For example, an LTER Information
  Manager may need to log in to a dedicated account, named with a three-letter
  acronym, that has been set up for the LTER site. The username
  "GCE" would be used by the LTER Information Manager at the GCE (Georgia
  Coastal Ecosystems) site.
613
  </p>
614
    </li>
615
    <li>Registering with Harvester
616
  <p>
617
  After logging in, you will be presented with a web form that prompts you
618
  to enter information about your site and how often you want to schedule
619
  harvests at your site. For example:
620
      <table bgcolor="#ffffff" border="0" cellpadding="2" width='100%' >
621
        <tr >
622
          <td colspan=3 align=center >&nbsp;</td>
623
        </tr>
624
        <tr >
625
          <td colspan=3 align=center >
626
            <font face=verdana size=1%>
627
              <b>Metacat Harvester Registration </b>
628
            </font>
629
          </td>
630
        </tr>
631
        <tr>
632
          <td width='10%'> &nbsp;</td>
633
          <td width="25%" bgcolor="#4682b4">
634
            <p align="center">
635
            <font color="white" face=verdana size=2%>
636
            <b>Email address:</b>
637
            </font>
638
          </td>
639
          <td><p><input type="text" size="55" name="uid" value="myname@institution.edu" maxlength="100" size="28"></td>
640
        </tr>
641
        <tr>
642
          <td width='10%'> &nbsp;</td>
643
          <td bgcolor="#4682b4">
644
            <p align="center">
645
            <font color="white" face=verdana size=2%>
646
            <b>Harvest List URL:</b>
647
            </font>
648
          </td>
649
          <td><p><input type="text" size="55" name="passwd" value="http://somehost.institution.edu/~myname/harvestList.xml" maxlength="60" size="28">
650
          </td>
651
        </tr>
652
        <tr>
653
          <td colspan=3 align=center >&nbsp;</td>
654
        </tr>
655
        <tr>
656
          <td width='10%'> &nbsp;</td>
657
          <td bgcolor="#4682b4">
658
            <p align="center">
659
            <font color="white" face=verdana size=2%>
660
            <b>Harvest Frequency (1-99):</b>
661
            </font>
662
          </td>
663
          <td><p><input type="text" size="3" name="passwd" value="2" maxlength="60" size="28">
664
          </td>
665
        </tr>
666
        <tr>
667
          <td colspan=3 align=center >&nbsp;</td>
668
        </tr>
669
        <tr>
670
          <td width='10%'> &nbsp;</td>
671
          <td width="25%" bgcolor="#4682b4">
672
            <p align="center">
673
            <font color="white" face=verdana size=2%>
674
            <b>Unit:</b>
675
            </font>
676
          </td>
677
          <td>
            <input type="radio" name="o" value="days" >day(s)
            <input type="radio" name="o" value="weeks" checked>week(s)
            <input type="radio" name="o" value="months">month(s)
          </td>
        </tr>
682
      </table>
683
  <p>
684
  After values have been entered for each of these fields, click the Register
685
  button to register your site with Harvester.
686
  </p>
687
  <P>
688
  In the example shown above, Harvester will attempt to harvest documents from
  the site once every 2 weeks. It will access the site's Harvest List at the URL
  "http://somehost.institution.edu/~myname/harvestList.xml", and it will send
  email reports to the Site Contact at the email address "myname@institution.edu".
692
  </P>
693
  <P>
694
  Note that you may enter multiple email addresses by separating each
  address with a comma or a semicolon. For example,
  "myname@institution.edu,anothername@institution.edu".
697
  </P>
698
    </li>
699
    <li>Unregistering with Harvester
700
  <p>
701
  At any time after you have registered with Harvester, you may discontinue
  harvests at your site by unregistering. Simply log in as described above and
  then click the Unregister button. Harvester will then discontinue
  harvests at the site.
705
  </p>
706
    </li>
707
  </ol>
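  <p>The multiple-address format accepted by the registration form (addresses
  separated by a comma or a semicolon) can be illustrated with a short Python
  sketch. This shows one way such a field could be split; it is not
  Harvester's own parsing code, and the function name
  <code>split_addresses</code> is hypothetical.</p>

```python
import re


def split_addresses(field):
    """Split an email field on commas or semicolons, dropping empty pieces."""
    return [part.strip() for part in re.split(r"[,;]", field) if part.strip()]


print(split_addresses("myname@institution.edu,anothername@institution.edu"))
# ['myname@institution.edu', 'anothername@institution.edu']
```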
708
  <h5><a name="Composing">Composing a Harvest List</a></h5>
709
  <p>
710
  A Harvest List is an XML file that holds a list of EML documents to be
711
  harvested. For each EML document in the list, the following information
712
  must be specified:
713
  <ul>
714
    <li><code>docid</code>, which consists of the:
715
      <ul>
716
        <li><code>scope</code>, e.g. "demoDocument". The scope is an identifier
717
            that indicates which group of documents this document belongs to.
718
        </li>
719
        <li><code>identifier</code>, e.g. "1". The identifier is a number that
720
            uniquely identifies this document within the scope.
721
        </li>
722
        <li><code>revision</code>, e.g. "5". The revision is a number that
723
            indicates the current revision of this document.
724
        </li>
725
      </ul>
726
    </li>
727
    <li><code>documentType</code>, e.g. "eml://ecoinformatics.org/eml-2.0.0".
728
        The documentType identifies the document as an EML document.</li>
729
    <li><code>documentURL</code>, e.g. "http://www.lternet.edu/~dcosta/document1.xml".
730
        The documentURL specifies a place where Harvester can locate
731
        and retrieve the document via HTTP.</li>
732
  </ul>
733
  </p>
734
  <p>
735
  The contents of a Harvest List XML file must conform to a particular
736
  XML Schema, as defined in file <a href="../../lib/harvester/harvestList.xsd">
737
  harvestList.xsd</a>. The contents of a valid Harvest List
738
  can best be illustrated by example. The sample Harvest List
739
  below contains two &lt;<code>document</code>&gt; elements that specify the
740
  information that Harvester needs to retrieve a pair of EML documents and
741
  upload them to Metacat:
742
  <pre>
&lt;?xml version="1.0" encoding="UTF-8" ?&gt;
&lt;hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList" &gt;
    &lt;document&gt;
        &lt;docid&gt;
            &lt;scope&gt;demoDocument&lt;/scope&gt;
            &lt;identifier&gt;1&lt;/identifier&gt;
            &lt;revision&gt;5&lt;/revision&gt;
        &lt;/docid&gt;
        &lt;documentType&gt;eml://ecoinformatics.org/eml-2.0.0&lt;/documentType&gt;
        &lt;documentURL&gt;http://www.lternet.edu/~dcosta/document1.xml&lt;/documentURL&gt;
    &lt;/document&gt;
    &lt;document&gt;
        &lt;docid&gt;
            &lt;scope&gt;demoDocument&lt;/scope&gt;
            &lt;identifier&gt;2&lt;/identifier&gt;
            &lt;revision&gt;1&lt;/revision&gt;
        &lt;/docid&gt;
        &lt;documentType&gt;eml://ecoinformatics.org/eml-2.0.0&lt;/documentType&gt;
        &lt;documentURL&gt;http://www.lternet.edu/~dcosta/document2.xml&lt;/documentURL&gt;
    &lt;/document&gt;
&lt;/hrv:harvestList&gt;
  </pre>
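  <p>A Site Contact who wants to check a Harvest List before publishing it
  could read it back with a short script. The Python sketch below uses the
  standard <code>xml.etree.ElementTree</code> module to list the entries of a
  Harvest List shaped like the sample above. It is an illustration only: the
  function name <code>read_harvest_list</code> is hypothetical, the sketch
  does not validate against <code>harvestList.xsd</code>, and it assumes
  that <code>identifier</code> and <code>revision</code> are always whole
  numbers.</p>

```python
import xml.etree.ElementTree as ET

# A one-document Harvest List in the same shape as the sample above.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" ?>
<hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList">
    <document>
        <docid>
            <scope>demoDocument</scope>
            <identifier>1</identifier>
            <revision>5</revision>
        </docid>
        <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
        <documentURL>http://www.lternet.edu/~dcosta/document1.xml</documentURL>
    </document>
</hrv:harvestList>
"""


def read_harvest_list(xml_text):
    """Return one dictionary per <document> element in a Harvest List."""
    root = ET.fromstring(xml_text)
    documents = []
    # Only the root element carries the hrv: prefix; its children are unqualified.
    for doc in root.findall("document"):
        docid = doc.find("docid")
        documents.append({
            "scope": docid.findtext("scope"),
            "identifier": int(docid.findtext("identifier")),
            "revision": int(docid.findtext("revision")),
            "documentType": doc.findtext("documentType"),
            "documentURL": doc.findtext("documentURL"),
        })
    return documents


for entry in read_harvest_list(SAMPLE):
    print(entry["scope"], entry["identifier"], entry["revision"], entry["documentURL"])
```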
765
  <p>
766
  After editing the Harvest List, ensure that the Harvest List XML file resides
767
  at the appropriate location on disk as specified by the URL that was entered
768
  during the <a href=#Registering>registration</a> process.
769
  </p>
770
  <p>
771
  The <a href=./harvestListEditor.html>Harvest List Editor</a> is a tool that
772
  assists in composing and editing a Harvest List. (Click
773
  <a href=./harvestListEditor.html>here</a> for additional details.)
774
  </p>
775
    <h5><a name="Preparing">Preparing EML Documents for harvest</a></h5>
776
  <p>
777
  To prepare a set of EML documents for harvest, ensure that the following is
778
  true for each document:
779
  <ul>
780
    <li>The document contains valid EML</li>
781
    <li>The document is specified in a &lt;document&gt; element in the
782
        site's Harvest List, as described above</li>
783
    <li>The file resides at the appropriate location on disk as specified
784
        by its URL in the Harvest List</li>
785
  </ul>
786
  </p>
787
    <h5><a name="Reviewing" >Reviewing Harvester Reports to the Site Contact</a></h5>
788
  <P>
789
  After every scheduled harvest that takes place at a particular Harvest
790
  Site, Harvester will send an email report to the Site Contact detailing the
791
  operations that were performed during the harvest.
792
  The report will contain information about the operations that were
793
  performed by Harvester at that site, such as
794
  which EML documents were harvested and whether any errors were encountered.
795
  </P>
796
  <P>
797
  The Site Contact should review the report, paying particularly
  close attention to any errors that are reported. An operation that
  displays a status value of 1 indicates an error, while an operation that
  displays a status value of 0 completed
  successfully.
802
  </P>
803
  <p>
804
  When errors are reported,
805
  the Site Contact should try to determine whether the source of the error
806
  is something that can be corrected at the site. Common causes of errors
807
  might be:
808
  <ul>
809
    <li>A document URL specified in the Harvest List does not match
810
        the location of the actual EML file on the disk</li>
811
    <li>The Harvest List does not contain valid XML as specified in
812
        the <a href=../../lib/harvester/harvestList.xsd>harvestList.xsd</a> schema</li>
813
    <li>The URL to the Harvest List that was specified during
814
        registration with Harvester does not match the actual location of
815
        the Harvest List on the disk</li>
816
    <li>An EML document that Harvester attempted to upload to Metacat does
817
        not contain valid EML</li>
818
  </ul>
819
  </P>
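  <p>Some of the causes listed above can be caught at the site before the next
  scheduled harvest. The Python sketch below checks that each document URL can
  be retrieved and that the response is well-formed XML. It is a rough
  pre-flight check under stated assumptions, not part of Harvester: the
  function name <code>preflight</code> and its <code>fetch</code> parameter
  are hypothetical, and well-formed XML is a weaker condition than valid EML,
  so a document that passes here may still be rejected by Metacat.</p>

```python
import xml.etree.ElementTree as ET


def preflight(documents, fetch):
    """Return a list of (docid, problem) pairs for documents that fail a check.

    documents: iterable of (docid, url) pairs taken from the Harvest List.
    fetch: callable that takes a URL and returns the response body as bytes,
           raising OSError if the document cannot be retrieved.
    """
    problems = []
    for docid, url in documents:
        try:
            body = fetch(url)
        except OSError as exc:
            problems.append((docid, f"cannot retrieve {url}: {exc}"))
            continue
        try:
            ET.fromstring(body)
        except ET.ParseError as exc:
            problems.append((docid, f"not well-formed XML: {exc}"))
    return problems


# Stand-in for real HTTP access, so the example runs without a network.
FAKE_SITE = {
    "http://www.lternet.edu/~dcosta/document1.xml": b"<eml/>",
    "http://www.lternet.edu/~dcosta/document2.xml": b"<eml>",  # unclosed tag
}


def fake_fetch(url):
    if url not in FAKE_SITE:
        raise OSError("not found")
    return FAKE_SITE[url]


docs = [
    ("demoDocument.1.5", "http://www.lternet.edu/~dcosta/document1.xml"),
    ("demoDocument.2.1", "http://www.lternet.edu/~dcosta/document2.xml"),
    ("demoDocument.3.1", "http://www.lternet.edu/~dcosta/missing.xml"),
]
for docid, problem in preflight(docs, fake_fetch):
    print(docid, "->", problem)
```

  <p>For a real check, <code>fetch</code> could be
  <code>lambda url: urllib.request.urlopen(url).read()</code>, since the
  errors raised by <code>urllib.request.urlopen</code> are subclasses of
  <code>OSError</code>. The dotted docid strings in the example are only
  labels chosen for readability.</p>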
820
  <p>
821
  If the Site Contact is unable to determine the cause of the error and its
822
  resolution, he or she should contact the Harvester Administrator for assistance.
823
  </p>
824
  <a href="./properties.html">Back</a> |
825
  <a href="./metacattour.html">Home</a> |
826
  <a href="./unimplem.html">Next</a>
827
</BODY>
828
</HTML>