Metacat Harvester


Introduction

The Metacat Harvester (henceforth referred to as "Harvester") is a program that automates the retrieval of EML documents from one or more sites and their subsequent upload (insert or update) to Metacat. Harvester uses pull technology to retrieve and upload documents to Metacat on a regularly scheduled basis.

Although Harvester is included with a Metacat installation (beginning with Metacat version 1.4.0), it is an optional extension to Metacat's functionality.

Definitions

The following table defines a number of terms that are useful in discussing Harvester and its features.

Term	Definition
Harvester	The Harvester program, a Java application that is bundled with the Metacat distribution. When a user installs Metacat on a system, the Harvester program is automatically included in the installation.
Harvester Administrator	The individual who installs and manages Harvester. Typically, this would be the same individual who installs and manages Metacat at a given installation.
Harvest Site	A location from which Harvester can retrieve EML documents. A given Harvester can retrieve documents from any number of different Harvest Sites.
Harvest	The act (by Harvester) of visiting a Harvest Site, retrieving a number of EML documents, and inserting or updating the documents to Metacat.
Harvest List	An XML document that lists a set of EML documents to be harvested. The Harvest List must conform to an XML Schema, harvestList.xsd.
Site Contact	The individual at a particular Harvest Site who registers with Harvester, composes a Harvest List, and periodically prepares the site's EML documents for retrieval and upload to Metacat.
Harvest List URL	A URL to the Harvest List, as specified by the Site Contact. Each Harvest Site corresponds to a Harvest List URL. Harvester uses the URL to locate and read a site's Harvest List.
Document URL	A URL to an EML document, as specified in the Harvest List. The Harvest List may contain any number of Document URLs. Each Document URL provides a locator to a document to be harvested.
Harvester Registration Page	A web page that provides a means for a Site Contact to register with Harvester to schedule regular harvests from the site. Registration involves logging in and then specifying various settings for the Harvest Site, such as the Harvest List URL, the harvest frequency, and the email address of the Site Contact.

Managing Harvester

Harvester is managed by the Harvester Administrator. Typically, the same individual who manages a Metacat server would also act as the Harvester Administrator. The responsibilities of the Harvester Administrator include:

Configuring Harvester
Running Harvester
Reviewing Harvester reports to the Harvester Administrator

Configuring Harvester

Harvester must be configured to interact with a working Metacat installation. Thus, a properly installed and configured Metacat is a prerequisite to running Harvester. Additionally, Harvester has a number of settable properties that control its behavior. All Harvester configuration information is managed in a single file, harvester.properties, located at:

      METACAT_HOME/lib/harvester/harvester.properties

where METACAT_HOME is the top-level directory that Metacat is installed in.

The Harvester Administrator should edit harvester.properties, setting appropriate values for the Metacat URL, database driver, database connection, and other settings. The following table is a summary of each property and its function.

Property	Description	Possible or default value
connectToMetacat	This property determines whether Harvester should connect to Metacat to upload documents. It should be set to `true` under most circumstances. Setting this property to `false` can be useful for testing whether Harvester is able to retrieve documents from a site without actually connecting to Metacat to upload the documents.	`true` \| `false` Default: `true`
dbDriver	The JDBC driver to be used to access the backend database. This setting should match the value of the dbDriver property as set in the build.xml file as appropriate to the database being used (Oracle, PostgreSQL, or SQL Server).	Examples: `oracle.jdbc.driver.OracleDriver` `org.postgresql.Driver` `com.microsoft.jdbc.sqlserver.SQLServerDriver`
defaultDB	The JDBC connection string that Metacat uses to connect to the backend database. This setting should match the value of the `jdbc-connect` property as set in the build.properties file in the associated Metacat installation.	Example: `jdbc:oracle:thin:@server.domain.com:1521:Metacat`
delay	The number of hours that Harvester will wait before beginning its first harvest. For example, if Harvester is run at 1:00 p.m., and the delay is set to 12, Harvester will begin its first harvest at 1:00 a.m. the following day.	Default: 0
harvesterAdministrator	The email address of the Harvester Administrator. Harvester will send email reports to this address after every harvest.	An email address
logPeriod	The number of days that Harvester should retain log entries of harvest operations in the database. Harvester log entries record information such as which documents were harvested, from which sites, and whether any errors were encountered during the harvest. Log entries older than `logPeriod` number of days are purged from the database at the end of each harvest.	Default: 90
maxHarvests	The maximum number of harvests that Harvester should execute before shutting down. When the Harvester program is executed, it will continue running until it has executed `maxHarvests` number of harvests and then the program will terminate.	Default: 30
metacatURL	The URL of the Metacat servlet to which Harvester should connect for uploading documents.	Example: http://somehost.institution.edu:8080/knb/servlet/metacat
password	The password that Harvester uses to access the backend database. This setting should match the value of the `password` property as set in the build.properties file in the associated Metacat installation.
period	The number of hours between harvests. Harvester will run a new harvest every `period` number of hours, until the `maxHarvests` number of harvests have been run.	Default: 24
smtpServer	The SMTP server that Harvester uses for sending email messages to the Harvester Administrator and to Site Contacts.	A host name, for example: `somehost.institution.edu` Default: `localhost` Note that the default value will only work if the Harvester host machine has been configured as an SMTP server.
user	The username that Metacat uses to access the backend database. This setting should match the `user` value as set in the build.properties file in the associated Metacat installation.
Harvester Operation Properties (GetDocError, GetDocSuccess, etc.)	This group of properties is used by Harvester to report information about the operations it performs for inclusion in log entries and email messages. Under most circumstances the values of these properties should not be modified.
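
As a rough illustration of the settings in the table above, the following Python sketch parses a small harvester.properties-style text and reports which of the documented properties are missing. It assumes the file uses the usual Java `key=value` properties syntax with `#` comments and no line continuations. The sample values are placeholders based on the table's examples, and `parse_properties` and `missing_properties` are hypothetical helper names, not part of Metacat.

```python
# Minimal sketch: read key=value pairs and check for the documented properties.
# Assumes plain Java-style properties syntax (no line continuations or escapes).

DOCUMENTED = [
    "connectToMetacat", "dbDriver", "defaultDB", "delay",
    "harvesterAdministrator", "logPeriod", "maxHarvests",
    "metacatURL", "password", "period", "smtpServer", "user",
]

def parse_properties(text):
    """Return a dict of the key=value pairs found in text."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key=value
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props

def missing_properties(props):
    """Return the documented property names that are absent or empty."""
    return [name for name in DOCUMENTED if not props.get(name)]

SAMPLE = """
# Placeholder values based on the examples in the table above
connectToMetacat=true
dbDriver=org.postgresql.Driver
delay=0
logPeriod=90
maxHarvests=30
metacatURL=http://somehost.institution.edu:8080/knb/servlet/metacat
period=24
smtpServer=localhost
"""

props = parse_properties(SAMPLE)
print(missing_properties(props))
```

For this sample, the check lists `defaultDB`, `harvesterAdministrator`, `password`, and `user`, which the Harvester Administrator would still need to set.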

Running Harvester

After Harvester has been appropriately configured, it can be run as follows:

Open a system command window or terminal window.
Set the METACAT_HOME environment variable to the value of the Metacat installation directory. Some examples follow:
- On Windows:
```
set METACAT_HOME=C:\somePath\metacat
```
- On Linux/Unix (bash shell):
```
export METACAT_HOME=/home/somePath/metacat
```

cd to the following directory:

On Windows:
```
cd %METACAT_HOME%\lib\harvester
```
On Linux/Unix:
```
cd $METACAT_HOME/lib/harvester
```

Run the appropriate Harvester script, as determined by the operating system:
- On Windows:
```
runHarvester.bat
```
- On Linux/Unix:
```
sh runHarvester.sh
```

The Harvester application will start executing. It will begin its first harvest after `delay` hours (as specified in the harvester.properties file). The application will then run a new harvest every `period` hours until `maxHarvests` harvests have been completed.
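
The interaction of `delay`, `period`, and `maxHarvests` can be worked through with a short Python sketch. It assumes that `period` is measured from the start of one harvest to the start of the next and ignores the time a harvest takes to run; Harvester's actual timing may differ. The function name `harvest_times` and the launch date are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta

def harvest_times(start, delay_hours, period_hours, max_harvests):
    """Return the planned start time of each harvest."""
    first = start + timedelta(hours=delay_hours)
    return [first + timedelta(hours=period_hours * i) for i in range(max_harvests)]

# Harvester launched at 1:00 p.m. with delay=12, period=24, maxHarvests=3
start = datetime(2024, 1, 1, 13, 0)
for t in harvest_times(start, 12, 24, 3):
    print(t.isoformat(sep=" ", timespec="minutes"))
```

Under these assumptions the three harvests start at 1:00 a.m. on each of the three days following the launch, after which the program terminates.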

Reviewing Harvester Reports to the Harvester Administrator

After every harvest, Harvester will send an email report to the Harvester Administrator detailing the operations that were performed during the harvest. The report will contain information about each of the Harvest Sites that were harvested from, such as which EML documents were harvested and whether any errors were encountered.

The harvest report will contain a list of log entries, where each log entry describes an operation that was performed by Harvester. Log entries that show a status value of 1 indicate that an error occurred during the operation, while those that show a status value of 0 indicate that the operation was completed successfully.

The Harvester Administrator should review the report, paying particularly close attention to any errors that are reported and to the accompanying error messages that are displayed. When errors are reported at a particular site, the Harvester Administrator should contact the Site Contact to determine the source of the error and its resolution. See Reviewing Harvester Reports to the Site Contact for a description of common sources of errors at a Harvest Site.

Errors that are independent of a particular site may indicate a problem with Harvester itself, Metacat, or the database connection. Refer to the error message to determine the source of the error and its resolution.

Managing a Harvest Site

A Harvest Site is managed by a Site Contact. The responsibilities of a Site Contact fall into the following categories:

Registering with Harvester
Composing a Harvest List
Preparing EML Documents for harvest
Reviewing Harvester reports to the Site Contact

Registering with Harvester

A Site Contact registers a site with Harvester by logging in to the Harvester Registration page and entering several items of information that Harvester needs to know about the site.

Logging in to the Harvester Registration Page
The Harvester Registration page is accessed from Metacat. For example, if the Metacat server that you wish to register with resides at the following URL:
```
  http://somehost.somelocation.edu:8080/knb/index.jsp
```
then the Harvester Registration page would be accessed at:
```
  http://somehost.somelocation.edu:8080/knb/style/skins/dev/harvesterRegistrationLogin.html
```
After bringing up this page in your browser, log in to your Metacat account by entering your username and password. The username should include the full LDAP specification, for example:
```
  Username:   uid=jdoe,o=lter,dc=ecoinformatics,dc=org
  Password:   *******
  
```
In some cases, a Site Contact may need to log in to an anonymous account rather than his or her personal account. For example, an LTER Information Manager may need to log in to a dedicated account, named with a three-letter acronym, that has been set up for the LTER site:
```
  Username:   uid=GCE,o=lter,dc=ecoinformatics,dc=org
  Password:   *******
  
```
is the account login that would be used by the LTER Information Manager at the GCE (Georgia Coastal Ecosystems) site.
Registering with Harvester
After logging in, you will be presented with a web form that prompts you to enter information about your site and how often you want to schedule harvests at your site. For example:
```
  Email address:            myname@institution.edu
  Harvest List URL:         http://somehost.institution.edu/~myname/harvestList.xml
  Harvest Frequency (1-99): 2
  Unit:                     ( ) day(s)    (*) week(s)   ( ) month(s)
  
```
After values have been entered for each of these fields, click the Register button to register your site with Harvester.

In the example shown above, Harvester will attempt to harvest documents from the site once every 2 weeks. It will access the site's Harvest List at the URL "http://somehost.institution.edu/~myname/harvestList.xml" and send email reports to the Site Contact at the email address "myname@institution.edu".
Unregistering with Harvester
At any time after you have registered with Harvester, you may discontinue harvests at your site by unregistering. Simply log in as described above and then click the Unregister button. Harvester will then discontinue harvests at the site.

Composing a Harvest List

A Harvest List is an XML file that holds a list of EML documents to be harvested. For each EML document in the list, the following information must be specified:

docid, which consists of the:
- scope, e.g. "demoDocument". The scope is an identifier that indicates which group of documents this document belongs to.
- identifier, e.g. "1". The identifier is a number that uniquely identifies this document within the scope.
- revision, e.g. "5". The revision is a number that indicates the current revision of this document.
documentType, e.g. "eml://ecoinformatics.org/eml-2.0.0". The documentType identifies the document as an EML document.
documentURL, e.g. "http://www.lternet.edu/~dcosta/document1.xml". The documentURL specifies a place where Harvester can locate and retrieve the document via HTTP.

The contents of a Harvest List XML file must conform to a particular XML Schema, as defined in the file harvestList.xsd. A valid Harvest List is best illustrated by example. The sample Harvest List below contains two <document> elements that specify the information that Harvester needs to retrieve a pair of EML documents and upload them to Metacat:

<?xml version="1.0" encoding="UTF-8" ?>
<hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList" >
    <document>
        <docid>
            <scope>demoDocument</scope>
            <identifier>1</identifier>
            <revision>5</revision>
        </docid>
        <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
        <documentURL>http://www.lternet.edu/~dcosta/document1.xml</documentURL>
    </document>
    <document>
        <docid>
            <scope>demoDocument</scope>
            <identifier>2</identifier>
            <revision>1</revision>
        </docid>
        <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
        <documentURL>http://www.lternet.edu/~dcosta/document2.xml</documentURL>
    </document>
</hrv:harvestList>

After editing the Harvest List, ensure that the Harvest List XML file resides at the appropriate location on disk as specified by the URL that was entered during the registration process.
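
For sites with many documents, a Harvest List of the shape shown in the sample could be generated by a script rather than edited by hand. The Python sketch below uses the standard library's `xml.etree.ElementTree` and reuses the namespace, element names, and sample values from the example above. The function name `build_harvest_list` and the tuple layout of its input are assumptions made for illustration, and the output is not checked against harvestList.xsd.

```python
import xml.etree.ElementTree as ET

HRV_NS = "eml://ecoinformatics.org/harvestList"
EML_TYPE = "eml://ecoinformatics.org/eml-2.0.0"

def build_harvest_list(entries):
    """Build a Harvest List from (scope, identifier, revision, url) tuples."""
    ET.register_namespace("hrv", HRV_NS)
    # Only the root element is namespace-qualified, as in the sample above
    root = ET.Element(f"{{{HRV_NS}}}harvestList")
    for scope, identifier, revision, url in entries:
        doc = ET.SubElement(root, "document")
        docid = ET.SubElement(doc, "docid")
        ET.SubElement(docid, "scope").text = scope
        ET.SubElement(docid, "identifier").text = str(identifier)
        ET.SubElement(docid, "revision").text = str(revision)
        ET.SubElement(doc, "documentType").text = EML_TYPE
        ET.SubElement(doc, "documentURL").text = url
    if hasattr(ET, "indent"):  # pretty-printing is available in Python 3.9+
        ET.indent(root, space="    ")
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

xml_text = build_harvest_list([
    ("demoDocument", 1, 5, "http://www.lternet.edu/~dcosta/document1.xml"),
    ("demoDocument", 2, 1, "http://www.lternet.edu/~dcosta/document2.xml"),
])
print(xml_text)
```

The resulting text would then be saved as UTF-8 to the file that the registered Harvest List URL points to.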

Preparing EML Documents for harvest

To prepare a set of EML documents for harvest, ensure that the following is true for each document:

The document contains valid EML
The document is specified in a <document> element in the site's Harvest List, as described above
The file resides at the appropriate location on disk as specified by its URL in the Harvest List
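
Part of this checklist can be automated before a scheduled harvest. The Python sketch below is a structural pre-check only: it confirms that a Harvest List is well-formed XML and that each <document> element carries the fields described earlier. It does not validate against harvestList.xsd, does not check that the EML itself is valid, and does not fetch the document URLs. The function name `check_harvest_list` is hypothetical, and the sample deliberately omits a revision to show a reported problem.

```python
import xml.etree.ElementTree as ET

FIELDS = ("docid/scope", "docid/identifier", "docid/revision",
          "documentType", "documentURL")

def check_harvest_list(xml_text):
    """Return a list of problems found; an empty list means none were detected."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    problems = []
    docs = root.findall("document")
    if not docs:
        problems.append("no <document> elements found")
    for n, doc in enumerate(docs, start=1):
        for path in FIELDS:
            value = doc.findtext(path)
            if value is None or not value.strip():
                problems.append(f"document {n}: missing {path}")
        # identifier and revision are described as numbers
        for path in ("docid/identifier", "docid/revision"):
            value = (doc.findtext(path) or "").strip()
            if value and not value.isdigit():
                problems.append(f"document {n}: {path} is not a number")
    return problems

SAMPLE = """<hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList">
    <document>
        <docid>
            <scope>demoDocument</scope>
            <identifier>1</identifier>
            <revision>5</revision>
        </docid>
        <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
        <documentURL>http://www.lternet.edu/~dcosta/document1.xml</documentURL>
    </document>
    <document>
        <docid>
            <scope>demoDocument</scope>
            <identifier>2</identifier>
        </docid>
        <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
        <documentURL>http://www.lternet.edu/~dcosta/document2.xml</documentURL>
    </document>
</hrv:harvestList>"""

print(check_harvest_list(SAMPLE))
```

For this sample, the only problem reported is the missing revision in the second document.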

Reviewing Harvester Reports to the Site Contact

After every scheduled harvest that takes place at a particular Harvest Site, Harvester will send an email report to the Site Contact detailing the operations that were performed during the harvest. The report will contain information about the operations that were performed by Harvester at that site, such as which EML documents were harvested and whether any errors were encountered.

The Site Contact should review the report, paying particularly close attention to any errors that are reported. Errors are indicated by operations that display a status value of 1, while operations that display a status value of 0 indicate that the operation completed successfully.

When errors are reported, the Site Contact should try to determine whether the source of the error is something that can be corrected at the site. Common causes of errors might be:

A document URL specified in the Harvest List does not match the location of the actual EML file on the disk
The Harvest List does not contain valid XML as specified in the harvestList.xsd schema
The URL to the Harvest List that was specified during registration with Harvester does not match the actual location of the Harvest List on the disk
An EML document that Harvester attempted to upload to Metacat does not contain valid EML

If the Site Contact is unable to determine the cause of the error and its resolution, he or she should contact the Harvester Administrator for assistance.
