Metacat Harvester

Introduction

The Metacat Harvester (henceforth referred to as "Harvester") is a program that automates the retrieval of EML documents from one or more sites and their subsequent upload (insert or update) to Metacat. Harvester uses pull technology: on a regular schedule, it retrieves the documents from each site and uploads them to Metacat.

Although Harvester is included with a Metacat installation (beginning with Metacat version 1.4.0), it is an optional extension to Metacat's functionality.

Definitions

The following table defines a number of terms that are useful in discussing Harvester and its features.

Term | Definition
Harvester | The Harvester program, a Java application that is bundled with the Metacat distribution. When a user installs Metacat on a system, the Harvester program is automatically included in the installation.
Harvester Administrator | The individual who installs and manages Harvester. Typically, this would be the same individual who installs and manages Metacat at a given installation.
Harvest Site | A location from which Harvester can retrieve EML documents. A given Harvester can retrieve documents from any number of different Harvest Sites.
Harvest | The act (by Harvester) of visiting a Harvest Site, retrieving a number of EML documents, and inserting or updating the documents to Metacat.
Harvest List | An XML document that lists a set of EML documents to be harvested. The Harvest List must conform to an XML Schema, harvestList.xsd.
Site Contact | The individual at a particular Harvest Site who registers with Harvester, composes a Harvest List, and periodically prepares the site's EML documents for retrieval and upload to Metacat.
Harvest List URL | A URL to the Harvest List, as specified by the Site Contact. Each Harvest Site corresponds to a Harvest List URL. Harvester uses the URL to locate and read a site's Harvest List.
Document URL | A URL to an EML document, as specified in the Harvest List. The Harvest List may contain any number of Document URLs. Each Document URL provides a locator to a document to be harvested.
Harvester Registration Page | A web page that provides a means for a Site Contact to register with Harvester to schedule regular harvests from the site. Registration involves logging in and then specifying various settings for the Harvest Site, such as the Harvest List URL, the harvest frequency, and the email address of the Site Contact.

Managing Harvester

Harvester is managed by the Harvester Administrator. Typically, the same individual who manages a Metacat server would also act as the Harvester Administrator. The responsibilities of the Harvester Administrator include configuring and running Harvester, as described in the sections below.

Configuring Harvester

Harvester must be configured to interact with a working Metacat installation. Thus, a properly installed and configured Metacat is a prerequisite to running Harvester. Additionally, Harvester has a number of settable properties that control its behavior. All Harvester configuration information is managed in a single file, metacat.properties, located at:

      METACAT_HOME/lib/metacat.properties
where METACAT_HOME is the top-level directory in which Metacat is installed.

Harvester properties are grouped together in metacat.properties, beginning after the comment line:

      # Harvester properties

The Harvester Administrator should edit metacat.properties, setting appropriate values for the harvesterAdministrator property, the smtpServer property, and possibly other properties. The following table is a summary of each property and its function.

Property | Description | Possible or default value
connectToMetacat | Determines whether Harvester should connect to Metacat to upload documents. It should be set to true under most circumstances. Setting this property to false can be useful for testing whether Harvester is able to retrieve documents from a site without actually connecting to Metacat to upload the documents. | true or false. Default: true
delay | The number of hours that Harvester will wait before beginning its first harvest. For example, if Harvester is run at 1:00 p.m. and the delay is set to 12, Harvester will begin its first harvest at 1:00 a.m. the next day. | Default: 0
harvesterAdministrator | The email address of the Harvester Administrator. Harvester will send email reports to this address after every harvest. You may enter multiple email addresses by separating each address with a comma or semicolon, for example, "name1@abc.edu,name2@abc.edu". | An email address, or multiple email addresses separated by commas or semicolons
logPeriod | The number of days that Harvester should retain log entries of harvest operations in the database. Harvester log entries record information such as which documents were harvested, from which sites, and whether any errors were encountered during the harvest. Log entries older than logPeriod days are purged from the database at the end of each harvest. | Default: 90
maxHarvests | The maximum number of harvests that Harvester should execute before shutting down. When the Harvester program is executed, it will continue running until it has executed maxHarvests harvests, and then the program will terminate. | Default: 30
period | The number of hours between harvests. Harvester will run a new harvest every period hours, until maxHarvests harvests have been run. | Default: 24
smtpServer | The SMTP server that Harvester uses for sending email messages to the Harvester Administrator and to Site Contacts. | A host name, for example: somehost.institution.edu. Default: localhost. Note that the default value will only work if the Harvester host machine has been configured as an SMTP server.
Harvester Operation Properties (GetDocError, GetDocSuccess, etc.) | This group of properties is used by Harvester to report information about the operations it performs for inclusion in log entries and email messages. Under most circumstances the values of these properties should not be modified. |
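
The way these properties fit together can be illustrated with a short Java sketch. It is not Harvester's own code: the class name HarvesterSettingsSketch and its methods are hypothetical. It assumes that metacat.properties uses standard Java key=value properties syntax, and that period is measured between harvest start times. The sketch applies the defaults listed in the table, splits harvesterAdministrator on commas or semicolons, and lists the times at which harvests would begin.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Illustrative sketch only; hypothetical class, not part of Metacat. */
final class HarvesterSettingsSketch {

    final boolean connectToMetacat;
    final int delayHours;
    final List<String> administrators;
    final int logPeriodDays;
    final int maxHarvests;
    final int periodHours;
    final String smtpServer;

    HarvesterSettingsSketch(Properties p) {
        // Fallback values are the defaults documented in the property table.
        connectToMetacat =
            Boolean.parseBoolean(p.getProperty("connectToMetacat", "true").trim());
        delayHours = intProperty(p, "delay", 0);
        administrators = splitAddresses(p.getProperty("harvesterAdministrator", ""));
        logPeriodDays = intProperty(p, "logPeriod", 90);
        maxHarvests = intProperty(p, "maxHarvests", 30);
        periodHours = intProperty(p, "period", 24);
        smtpServer = p.getProperty("smtpServer", "localhost").trim();
    }

    /** Parses properties text (assumed to be standard key=value syntax). */
    static HarvesterSettingsSketch parse(String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new HarvesterSettingsSketch(p);
    }

    private static int intProperty(Properties p, String key, int defaultValue) {
        return Integer.parseInt(p.getProperty(key, Integer.toString(defaultValue)).trim());
    }

    /** Splits a list of email addresses separated by commas or semicolons. */
    static List<String> splitAddresses(String raw) {
        List<String> addresses = new ArrayList<>();
        for (String part : raw.split("[,;]")) {
            String address = part.trim();
            if (!address.isEmpty()) {
                addresses.add(address);
            }
        }
        return addresses;
    }

    /**
     * Start times of each harvest: the first one delayHours after the program
     * starts, then one every periodHours, for maxHarvests harvests in total.
     * Assumption: the time a harvest takes to run is ignored.
     */
    List<LocalDateTime> harvestTimes(LocalDateTime programStart) {
        List<LocalDateTime> times = new ArrayList<>();
        LocalDateTime next = programStart.plusHours(delayHours);
        for (int i = 0; i < maxHarvests; i++) {
            times.add(next);
            next = next.plusHours(periodHours);
        }
        return times;
    }

    public static void main(String[] args) {
        String text = "# Harvester properties\n"
            + "delay=12\n"
            + "harvesterAdministrator=name1@abc.edu,name2@abc.edu\n"
            + "maxHarvests=3\n";
        HarvesterSettingsSketch settings = parse(text);
        LocalDateTime start = LocalDateTime.of(2024, 1, 1, 13, 0); // 1:00 p.m.
        System.out.println(settings.administrators);
        System.out.println(settings.harvestTimes(start));
    }
}
```

With delay set to 12 and a program start of 1:00 p.m., the first computed harvest time falls at 1:00 a.m. the next day, matching the example given for the delay property. Properties that are not set in the text (here period, logPeriod, smtpServer, and connectToMetacat) fall back to their documented defaults.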

Running Harvester

After Harvester has been appropriately configured, it can be run in either of two ways: (A) in a command window, or (B) as a servlet. If you wish only to test that Harvester is functioning, or if you expect to use Harvester infrequently, it may be desirable to run it from a command window. Under most circumstances, however, you will want Harvester to run continuously as a background servlet process, which eliminates the need to keep a command window open while Harvester is running. Both procedures are described below.