Metacat Harvester

Introduction

The Metacat Harvester (henceforth referred to as "Harvester") is a program that automates the retrieval of EML documents from one or more sites and their subsequent upload (insert or update) to Metacat. Harvester uses pull technology: on a regular schedule, it retrieves documents from each registered site and uploads them to Metacat.

Although Harvester is included with a Metacat installation (beginning with Metacat version 1.4.0), it is an optional extension to Metacat's functionality.

Definitions

The following table defines a number of terms that are useful in discussing Harvester and its features.

Term
    Definition

Harvester
    The Harvester program, a Java application that is bundled with the Metacat distribution. When a user installs Metacat on a system, the Harvester program is automatically included in the installation.

Harvester Administrator
    The individual who installs and manages Harvester. Typically, this would be the same individual who installs and manages Metacat at a given installation.

Harvest Site
    A location from which Harvester can retrieve EML documents. A given Harvester can retrieve documents from any number of different Harvest Sites.

Harvest
    The act (by Harvester) of visiting a Harvest Site, retrieving a number of EML documents, and inserting or updating the documents to Metacat.

Harvest List
    An XML document that lists a set of EML documents to be harvested. The Harvest List must conform to an XML Schema, harvestList.xsd.

Site Contact
    The individual at a particular Harvest Site who registers with Harvester, composes a Harvest List, and periodically prepares the site's EML documents for retrieval and upload to Metacat.

Harvest List URL
    A URL to the Harvest List, as specified by the Site Contact. Each Harvest Site corresponds to a Harvest List URL. Harvester uses the URL to locate and read a site's Harvest List.

Document URL
    A URL to an EML document, as specified in the Harvest List. The Harvest List may contain any number of Document URLs. Each Document URL provides a locator to a document to be harvested.

Harvester Registration Page
    A web page that provides a means for a Site Contact to register with Harvester to schedule regular harvests from the site. Registration involves logging in and then specifying various settings for the Harvest Site, such as the Harvest List URL, the harvest frequency, and the email address of the Site Contact.

Managing Harvester

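Harvester is managed by the Harvester Administrator. Typically, the same individual who manages a Metacat server would also act as the Harvester Administrator. The responsibilities of the Harvester Administrator are configuring Harvester, running Harvester, and reviewing the reports that Harvester sends after each harvest. Each is described below.

Configuring Harvester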

Harvester must be configured to interact with a working Metacat installation. Thus, a properly configured and installed Metacat is a prerequisite to running Harvester. Additionally, Harvester has a number of settable properties that control its behavior. All Harvester configuration information is managed in a single file, harvester.properties, located at:

      METACAT_HOME/lib/harvester/harvester.properties
where METACAT_HOME is the top-level directory that Metacat is installed in.

The Harvester Administrator should edit harvester.properties, setting appropriate values for the Metacat URL, database driver, database connection, and other settings. The following table is a summary of each property and its function.

Property
    Description
    Possible or default value

connectToMetacat
    This property determines whether Harvester should connect to Metacat to upload documents. It should be set to true under most circumstances. Setting this property to false can be useful for testing whether Harvester is able to retrieve documents from a site without actually connecting to Metacat to upload the documents.
    Possible values: true | false
    Default: true

dbDriver
    The JDBC driver to be used to access the backend database. This setting should match the value of the dbDriver property as set in the build.xml file as appropriate to the database being used (Oracle, PostgreSQL, or SQL Server).
    Examples:
        oracle.jdbc.driver.OracleDriver
        org.postgresql.Driver
        com.microsoft.jdbc.sqlserver.SQLServerDriver

defaultDB
    The JDBC connection string that Metacat uses to connect to the backend database. This setting should match the value of the jdbc-connect property as set in the build.properties file in the associated Metacat installation.
    Example:
        jdbc:oracle:thin:@server.domain.com:1521:Metacat

delay
    The number of hours that Harvester will wait before beginning its first harvest. For example, if Harvester is run at 1:00 p.m., and the delay is set to 12, Harvester will begin its first harvest at 1:00 a.m.
    Default: 0

harvesterAdministrator
    The email address of the Harvester Administrator. Harvester will send email reports to this address after every harvest.
    Possible value: an email address

logPeriod
    The number of days that Harvester should retain log entries of harvest operations in the database. Harvester log entries record information such as which documents were harvested, from which sites, and whether any errors were encountered during the harvest. Log entries older than logPeriod number of days are purged from the database at the end of each harvest.
    Default: 90

maxHarvests
    The maximum number of harvests that Harvester should execute before shutting down. When the Harvester program is executed, it will continue running until it has executed maxHarvests number of harvests and then the program will terminate.
    Default: 30

metacatURL
    The URL of the Metacat servlet to which Harvester should connect for uploading documents.
    Example:
        http://somehost.institution.edu:8080/knb/servlet/metacat

password
    The password that Harvester uses to access the backend database. This setting should match the value of the password property as set in the build.properties file in the associated Metacat installation.

period
    The number of hours between harvests. Harvester will run a new harvest every period number of hours, until the maxHarvests number of harvests have been run.
    Default: 24

smtpServer
    The SMTP server that Harvester uses for sending email messages to the Harvester Administrator and to Site Contacts.
    Possible value: a host name, for example: somehost.institution.edu
    Default: localhost
    Note that the default value will only work if the Harvester host machine has been configured as an SMTP server.

user
    The username that Metacat uses to access the backend database. This setting should match the user value as set in the build.properties file in the associated Metacat installation.

Harvester Operation Properties (GetDocError, GetDocSuccess, etc.)
    This group of properties is used by Harvester to report information about the operations it performs for inclusion in log entries and email messages. Under most circumstances the values of these properties should not be modified.
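
As an illustration of how these settings fit together, the following Python sketch parses properties text and checks a few of the values described above. It is not part of Harvester. It assumes that harvester.properties uses plain key=value lines with # comments and that the numeric properties are whole numbers; neither is confirmed here. The function names parse_properties and check_settings, and the sample values, are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines; blank lines and '#' comments are skipped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_settings(props):
    """Return a list of problems found in the settings (empty if none)."""
    problems = []
    if props.get("connectToMetacat", "true") not in ("true", "false"):
        problems.append("connectToMetacat must be true or false")
    # Defaults below are the ones listed in the property summary.
    numeric_defaults = (("delay", "0"), ("logPeriod", "90"),
                        ("maxHarvests", "30"), ("period", "24"))
    for name, default in numeric_defaults:
        if not props.get(name, default).isdigit():
            problems.append(name + " must be a non-negative whole number")
    for name in ("metacatURL", "harvesterAdministrator"):
        if not props.get(name):
            problems.append(name + " is not set")
    return problems


# Hypothetical sample values, for illustration only.
SAMPLE = """\
# harvester.properties (excerpt)
connectToMetacat=true
delay=12
period=24
maxHarvests=30
metacatURL=http://somehost.institution.edu:8080/knb/servlet/metacat
harvesterAdministrator=admin@institution.edu
"""

settings = parse_properties(SAMPLE)
problems = check_settings(settings)
for problem in problems:
    print(problem)
```

A check of this kind can catch a mistyped value before Harvester is started, but it does not verify that the database or the Metacat servlet is actually reachable.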

Running Harvester
After Harvester has been appropriately configured, it can be run as follows:
  1. Open a system command window or terminal window.
  2. Set the METACAT_HOME environment variable to the value of the Metacat installation directory. Some examples follow:
  3. cd to the following directory:
  4. Run the appropriate Harvester shell script, as determined by the operating system:

The Harvester application will start executing. It will begin its first harvest after delay hours (as specified in the harvester.properties file), and it will then run a new harvest every period hours until maxHarvests harvests have been completed.
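
The interaction of delay, period, and maxHarvests can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it ignores how long each harvest takes and assumes that each period is measured from the planned start of the previous harvest, which the text above does not confirm. The function name harvest_times and the start date are hypothetical.

```python
from datetime import datetime, timedelta


def harvest_times(start, delay_hours, period_hours, max_harvests):
    """List the planned start time of each harvest."""
    first = start + timedelta(hours=delay_hours)
    return [first + timedelta(hours=period_hours * n)
            for n in range(max_harvests)]


# Harvester started at 1:00 p.m. with delay=12, period=24, maxHarvests=3.
schedule = harvest_times(datetime(2024, 1, 1, 13, 0), 12, 24, 3)
for when in schedule:
    print(when.isoformat(sep=" ", timespec="minutes"))
```

With the default values (delay of 0, period of 24, maxHarvests of 30), the same calculation gives one harvest per day for 30 days, starting immediately.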

Reviewing Harvester Reports to the Harvester Administrator

After every harvest, Harvester will send an email report to the Harvester Administrator detailing the operations that were performed during the harvest. The report will contain information about each of the Harvest Sites that were harvested from, such as which EML documents were harvested and whether any errors were encountered.

The harvest report will contain a list of log entries, where each log entry describes an operation that was performed by Harvester. Log entries that show a status value of 1 indicate that an error occurred during the operation, while those that show a status value of 0 indicate that the operation was completed successfully.

The Harvester Administrator should review the report, paying particularly close attention to any errors that are reported and to the accompanying error messages that are displayed. When errors are reported at a particular site, the Harvester Administrator should contact the Site Contact to determine the source of the error and its resolution. See Reviewing Harvester Reports to the Site Contact for a description of common sources of errors at a Harvest Site.

Errors that are independent of a particular site may indicate a problem with Harvester itself, Metacat, or the database connection. Refer to the error message to determine the source of the error and its resolution.
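
When a report is long, it can help to group the error entries by site before contacting anyone. The Python sketch below shows one way to do that. The structure of a log entry here (a dictionary with site, status, and message keys) is a hypothetical stand-in, since the actual report format is not described in this document; only the meaning of the status values (1 for an error, 0 for success) comes from the text above.

```python
from collections import defaultdict


def errors_by_site(entries):
    """Group the messages of failed operations (status 1) by site.

    Entries whose site is None are treated as independent of any site.
    """
    grouped = defaultdict(list)
    for entry in entries:
        if entry["status"] == 1:
            grouped[entry.get("site")].append(entry["message"])
    return dict(grouped)


# Hypothetical log entries, for illustration only.
SITE = "http://somehost.institution.edu/~myname/harvestList.xml"
entries = [
    {"site": SITE, "status": 0, "message": "document1.xml retrieved"},
    {"site": SITE, "status": 1, "message": "document2.xml could not be retrieved"},
    {"site": None, "status": 1, "message": "database connection failed"},
]

report = errors_by_site(entries)
for site, messages in report.items():
    print(site or "(not site-specific)", messages)
```

Errors grouped under a site would be raised with that site's Site Contact, while those not tied to any site point to Harvester, Metacat, or the database connection.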

Managing a Harvest Site

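A Harvest Site is managed by a Site Contact. The responsibilities of a Site Contact are registering with Harvester, composing a Harvest List, preparing EML documents for harvest, and reviewing the reports that Harvester sends to the Site Contact. Each is described below.

Registering with Harvester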

A Site Contact registers a site with Harvester by logging in to the Harvester Registration page and entering several items of information that Harvester needs to know about the site.

  1. Logging in to the Harvester Registration Page

    The Harvester Registration page is accessed from Metacat. For example, if the Metacat server that you wish to register with resides at the following URL:

      http://somehost.somelocation.edu:8080/knb/index.jsp
    then the Harvester Registration page would be accessed at:
      http://somehost.somelocation.edu:8080/knb/style/skins/dev/harvesterRegistrationLogin.html

    After bringing up this page in your browser, log in to your Metacat account by entering your username and password. The username should include the full LDAP specification, for example:

      Username:   uid=jdoe,o=lter,dc=ecoinformatics,dc=org
      Password:   *******
      
    In some cases, a Site Contact may need to log in to an anonymous account rather than his or her personal account. For instance, an LTER Information Manager may need to log in to a dedicated account, named with a three-letter acronym, that has been set up for the LTER site. For example:
      Username:   uid=GCE,o=lter,dc=ecoinformatics,dc=org
      Password:   *******
      
    is the account login that would be used by the LTER Information Manager at the GCE (Georgia Coastal Ecosystems) site.

  2. Registering with Harvester

    After logging in, you will be presented with a web form that prompts you to enter information about your site and how often you want to schedule harvests at your site. For example:

      Email address:            myname@institution.edu
      Harvest List URL:         http://somehost.institution.edu/~myname/harvestList.xml
      Harvest Frequency (1-99): 2
      Unit:                     ( ) day(s)    (*) week(s)   ( ) month(s)
      
    After values have been entered for each of these fields, click the Register button to register your site with Harvester.

    In the example shown above, Harvester will attempt to harvest documents from the site once every 2 weeks, will access the site's Harvest List at the URL "http://somehost.institution.edu/~myname/harvestList.xml", and will send email reports to the Site Contact at the email address "myname@institution.edu".

  3. Unregistering with Harvester

    At any time after you have registered with Harvester, you may discontinue harvests at your site by unregistering. Simply log in as described above and then click the Unregister button. Harvester will then discontinue harvests at the site.

Composing a Harvest List

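A Harvest List is an XML file that holds a list of EML documents to be harvested. For each EML document in the list, the following information must be specified: the document's docid (made up of a scope, an identifier, and a revision), its documentType, and its documentURL.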

The contents of a Harvest List XML file must conform to a particular XML Schema, as defined in file harvestList.xsd. The contents of a valid Harvest List can best be illustrated by example. The sample Harvest List below contains two <document> elements that specify the information that Harvester needs to retrieve a pair of EML documents and upload them to Metacat:

<?xml version="1.0" encoding="UTF-8" ?>
<hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList" >
    <document>
        <docid>
            <scope>demoDocument</scope>
            <identifier>1</identifier>
            <revision>5</revision>
        </docid>
        <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
        <documentURL>http://www.lternet.edu/~dcosta/document1.xml</documentURL>
    </document>
    <document>
        <docid>
            <scope>demoDocument</scope>
            <identifier>2</identifier>
            <revision>1</revision>
        </docid>
        <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
        <documentURL>http://www.lternet.edu/~dcosta/document2.xml</documentURL>
    </document>
</hrv:harvestList>
  

After editing the Harvest List, ensure that the Harvest List XML file resides at the appropriate location on disk as specified by the URL that was entered during the registration process.
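
Before publishing a Harvest List, it can be useful to read it back and confirm that each entry carries the expected values. The following Python sketch parses the sample Harvest List shown above using the standard library. It is an illustration only: the function name read_harvest_list is hypothetical, and the sketch checks neither that the file conforms to harvestList.xsd nor that the document URLs are reachable.

```python
import xml.etree.ElementTree as ET

# The sample Harvest List from this section, as bytes.
HARVEST_LIST = b"""<?xml version="1.0" encoding="UTF-8" ?>
<hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList" >
    <document>
        <docid>
            <scope>demoDocument</scope>
            <identifier>1</identifier>
            <revision>5</revision>
        </docid>
        <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
        <documentURL>http://www.lternet.edu/~dcosta/document1.xml</documentURL>
    </document>
    <document>
        <docid>
            <scope>demoDocument</scope>
            <identifier>2</identifier>
            <revision>1</revision>
        </docid>
        <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
        <documentURL>http://www.lternet.edu/~dcosta/document2.xml</documentURL>
    </document>
</hrv:harvestList>
"""


def read_harvest_list(xml_bytes):
    """Return one dictionary per <document> element in a Harvest List."""
    root = ET.fromstring(xml_bytes)
    entries = []
    # In the sample, <document> and its children carry no namespace prefix,
    # so they can be looked up by their plain names.
    for doc in root.findall("document"):
        entries.append({
            "docid": (
                doc.findtext("docid/scope"),
                int(doc.findtext("docid/identifier")),
                int(doc.findtext("docid/revision")),
            ),
            "documentType": doc.findtext("documentType"),
            "documentURL": doc.findtext("documentURL"),
        })
    return entries


entries = read_harvest_list(HARVEST_LIST)
for entry in entries:
    print(entry["docid"], entry["documentURL"])
```

A missing identifier or revision element would make int() fail with a TypeError, which is a crude but immediate signal that an entry is incomplete.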

Preparing EML Documents for Harvest

To prepare a set of EML documents for harvest, ensure that the following is true for each document:

Reviewing Harvester Reports to the Site Contact

After every scheduled harvest that takes place at a particular Harvest Site, Harvester will send an email report to the Site Contact detailing the operations that were performed during the harvest. The report will contain information about the operations that were performed by Harvester at that site, such as which EML documents were harvested and whether any errors were encountered.

The Site Contact should review the report, paying particularly close attention to any errors that are reported. Errors are indicated by operations that display a status value of 1, while operations that display a status value of 0 indicate that the operation completed successfully.

When errors are reported, the Site Contact should try to determine whether the source of the error is something that can be corrected at the site. Common causes of errors might be:

If the Site Contact is unable to determine the cause of the error and its resolution, he or she should contact the Harvester Administrator for assistance.
