Metacat Harvester
Although Harvester is included with every Metacat installation (beginning with Metacat version 1.4.0), it is an optional extension to Metacat's functionality.
Term | Definition |
Harvester | The Harvester program, a Java application that is bundled with the Metacat distribution. When a user installs Metacat on a system, the Harvester program is automatically included in the installation. |
Harvester Administrator | The individual who installs and manages Harvester. Typically, this would be the same individual who installs and manages Metacat at a given installation. |
Harvest Site | A location from which Harvester can retrieve EML documents. A given Harvester can retrieve documents from any number of different Harvest Sites. |
Harvest | The act (by Harvester) of visiting a Harvest Site, retrieving a number of EML documents, and inserting or updating the documents to Metacat. |
Harvest List | An XML document that lists a set of EML documents to be harvested. The Harvest List must conform to an XML Schema, harvestList.xsd. |
Site Contact | The individual at a particular Harvest Site who registers with Harvester, composes a Harvest List, and periodically prepares the site's EML documents for retrieval and upload to Metacat. |
Harvest List URL | A URL to the Harvest List, as specified by the Site Contact. Each Harvest Site corresponds to a Harvest List URL. Harvester uses the URL to locate and read a site's Harvest List. |
Document URL | A URL to an EML document, as specified in the Harvest List. The Harvest List may contain any number of Document URLs. Each Document URL provides a locator to a document to be harvested. |
Harvester Registration Page | A web page that provides a means for a Site Contact to register with Harvester to schedule regular harvests from the site. Registration involves logging in and then specifying various settings for the Harvest Site, such as the Harvest List URL, the harvest frequency, and the email address of the Site Contact. |
Harvester must be configured to interact with a working Metacat installation. Thus, a properly configured and installed Metacat is a prerequisite to running Harvester. Additionally, Harvester has a number of settable properties that control its behavior. All Harvester configuration information is managed in a single file, metacat.properties, located at:
METACAT_HOME/lib/metacat.properties
where METACAT_HOME is the top-level directory in which Metacat is installed.
Harvester properties are grouped together in metacat.properties, beginning after the comment line:
# Harvester properties
The Harvester Administrator should edit metacat.properties, setting appropriate values for the harvesterAdministrator property, the smtpServer property, and possibly other properties. The following table summarizes each property and its function.
Property | Description | Possible or default value |
connectToMetacat | This property determines whether Harvester should connect to Metacat to upload documents. It should be set to true under most circumstances. Setting this property to false can be useful for testing whether Harvester is able to retrieve documents from a site without actually connecting to Metacat to upload the documents. | true or false. Default: true |
delay | The number of hours that Harvester will wait before beginning its first harvest. For example, if Harvester is run at 1:00 p.m., and the delay is set to 12, Harvester will begin its first harvest at 1:00 a.m. | Default: 0 |
harvesterAdministrator | The email address of the Harvester Administrator. Harvester will send email reports to this address after every harvest. | An email address |
logPeriod | The number of days that Harvester should retain log entries of harvest operations in the database. Harvester log entries record information such as which documents were harvested, from which sites, and whether any errors were encountered during the harvest. Log entries older than logPeriod number of days are purged from the database at the end of each harvest. | Default: 90 |
maxHarvests | The maximum number of harvests that Harvester should execute before shutting down. When the Harvester program is executed, it will continue running until it has executed maxHarvests number of harvests and then the program will terminate. | Default: 30 |
period | The number of hours between harvests. Harvester will run a new harvest every period number of hours, until the maxHarvests number of harvests have been run. | Default: 24 |
smtpServer | The SMTP server that Harvester uses for sending email messages to the Harvester Administrator and to Site Contacts. | A host name, for example: somehost.institution.edu. Default: localhost. Note that the default value will only work if the Harvester host machine has been configured as an SMTP server. |
Harvester Operation Properties (GetDocError, GetDocSuccess, etc.) | This group of properties is used by Harvester to report information about the operations it performs for inclusion in log entries and email messages. Under most circumstances the values of these properties should not be modified. |
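The property table above can be turned into a quick sanity check before starting Harvester. The following Python sketch is illustrative only: it assumes the properties are written as simple key=value lines (as is usual for a Java properties file), it assumes the numeric properties are whole numbers, and the address admin@institution.edu is a made-up value. The default values are the ones listed in the table.

```python
# Defaults copied from the property table; harvesterAdministrator has no default.
DEFAULTS = {
    "connectToMetacat": "true",
    "delay": "0",
    "logPeriod": "90",
    "maxHarvests": "30",
    "period": "24",
    "smtpServer": "localhost",
}


def parse_harvester_properties(text):
    """Parse key=value lines, skipping blank lines and # comments (simplified)."""
    props = dict(DEFAULTS)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_harvester_properties(props):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if "@" not in props.get("harvesterAdministrator", ""):
        problems.append("harvesterAdministrator should be an email address")
    if props["connectToMetacat"] not in ("true", "false"):
        problems.append("connectToMetacat should be true or false")
    # Assumption: these properties are non-negative whole numbers.
    for name in ("delay", "logPeriod", "maxHarvests", "period"):
        if not props[name].isdigit():
            problems.append(name + " should be a non-negative whole number")
    return problems


# Hypothetical fragment of metacat.properties.
SAMPLE = """\
# Harvester properties
harvesterAdministrator=admin@institution.edu
smtpServer=somehost.institution.edu
period=12
"""

props = parse_harvester_properties(SAMPLE)
problems = check_harvester_properties(props)
print(props["period"], props["maxHarvests"], problems)
```

Properties that are absent from the file fall back to the defaults in the table, so in this sample maxHarvests remains 30 while period is overridden to 12.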
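To run Harvester from a command window, set the METACAT_HOME environment variable, change to the Harvester directory, and run the Harvester script.
On Windows:
set METACAT_HOME=C:\somePath\metacat
cd %METACAT_HOME%\lib\harvester
runHarvester.bat
On Linux/Unix:
export METACAT_HOME=/home/somePath/metacat
cd $METACAT_HOME/lib/harvester
sh runHarvester.sh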
The Harvester application will start executing. It will begin its first harvest after delay number of hours (as specified in the metacat.properties file). The application will continue running a new harvest every period number of hours until maxHarvests number of harvests have been completed.
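The interaction of delay, period, and maxHarvests can be worked through with a little date arithmetic. The sketch below is an approximation rather than Harvester's actual scheduling code: it assumes that period is measured from the start of one harvest to the start of the next, and the start date is arbitrary.

```python
from datetime import datetime, timedelta


def harvest_start_times(started_at, delay_hours, period_hours, max_harvests):
    """List the expected start time of each harvest for one Harvester run."""
    first = started_at + timedelta(hours=delay_hours)
    return [first + timedelta(hours=period_hours * i) for i in range(max_harvests)]


# Harvester started at 1:00 p.m. with delay=12, period=24, and maxHarvests=3.
started = datetime(2024, 1, 1, 13, 0)
times = harvest_start_times(started, delay_hours=12, period_hours=24, max_harvests=3)
for t in times:
    print(t.isoformat(sep=" ", timespec="minutes"))
```

With these values the first harvest falls at 1:00 a.m. on the following day, which matches the example given for the delay property, and the program would terminate after the third harvest.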
After every harvest, Harvester will send an email report to the Harvester Administrator detailing the operations that were performed during the harvest. The report will contain information about each of the Harvest Sites that were harvested from, such as which EML documents were harvested and whether any errors were encountered.
The harvest report will contain a list of log entries, where each log entry describes an operation that was performed by Harvester. Log entries that show a status value of 1 indicate that an error occurred during the operation, while those that show a status value of 0 indicate that the operation was completed successfully.
The Harvester Administrator should review the report, paying particularly close attention to any errors that are reported and to the accompanying error messages that are displayed. When errors are reported at a particular site, the Harvester Administrator should contact the Site Contact to determine the source of the error and its resolution. See Reviewing Harvester Reports to the Site Contact for a description of common sources of errors at a Harvest Site.
Errors that are independent of a particular site may indicate a problem with Harvester itself, Metacat, or the database connection. Refer to the error message to determine the source of the error and its resolution.
A Site Contact registers a site with Harvester by logging in to the Harvester Registration page and entering several items of information that Harvester needs to know about the site.
The Harvester Registration page is accessed from Metacat. For example, if the Metacat server that you wish to register with resides at the following URL:
http://somehost.somelocation.edu:8080/knb/index.jsp
then the Harvester Registration page would be accessed at:
http://somehost.somelocation.edu:8080/knb/style/skins/knb/harvesterRegistrationLogin.html
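The relationship between the two URLs above can be expressed as a relative-URL resolution. This Python sketch assumes that the registration page always sits at the same path relative to the Metacat index page as in the example; the skin directory (knb here) may differ on other installations, and the function name is hypothetical.

```python
from urllib.parse import urljoin

# Path of the registration page relative to the Metacat context, per the example.
REGISTRATION_PATH = "style/skins/knb/harvesterRegistrationLogin.html"


def registration_url(metacat_index_url):
    """Resolve the registration page relative to the Metacat index page URL."""
    return urljoin(metacat_index_url, REGISTRATION_PATH)


print(registration_url("http://somehost.somelocation.edu:8080/knb/index.jsp"))
```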
After bringing up this page in your browser, log in to your Metacat account by entering your username and password. The username should include the full LDAP specification, for example:
Username: uid=jdoe,o=lter,dc=ecoinformatics,dc=org
Password: *******
In some cases, a Site Contact may need to log in to an anonymous account rather than his or her personal account. For example, an LTER Information Manager may need to log in to a dedicated account, named with a three-letter acronym, that has been set up for the LTER site. For example:
Username: uid=GCE,o=lter,dc=ecoinformatics,dc=org
Password: *******
is the account login that would be used by the LTER Information Manager at the GCE (Georgia Coastal Ecosystems) site.
After logging in, you will be presented with a web form that prompts you to enter information about your site and how often you want to schedule harvests at your site. For example:
Email address: myname@institution.edu
Harvest List URL: http://somehost.institution.edu/~myname/harvestList.xml
Harvest Frequency (1-99): 2
Unit: ( ) day(s) (*) week(s) ( ) month(s)
After values have been entered for each of these fields, click the Register button to register your site with Harvester.
In the example shown above, Harvester will attempt to harvest documents from the site once every 2 weeks. It will access the site's Harvest List at URL "http://somehost.institution.edu/~myname/harvestList.xml" and send email reports to the Site Contact at email address "myname@institution.edu".
At any time after you have registered with Harvester, you may discontinue harvests at your site by unregistering. Simply log in as described above and then click the Unregister button. Harvester will then discontinue harvests at the site.
A Harvest List is an XML file that holds a list of EML documents to be harvested. For each EML document in the list, the following information must be specified:
docid, which consists of the:
    scope, e.g. "demoDocument". The scope is an identifier that indicates which group of documents this document belongs to.
    identifier, e.g. "1". The identifier is a number that uniquely identifies this document within the scope.
    revision, e.g. "5". The revision is a number that indicates the current revision of this document.
documentType, e.g. "eml://ecoinformatics.org/eml-2.0.0". The documentType identifies the document as an EML document.
documentURL, e.g. "http://www.lternet.edu/~dcosta/document1.xml". The documentURL specifies a place where Harvester can locate and retrieve the document via HTTP.
The contents of a Harvest List XML file must conform to a particular XML Schema, as defined in file harvestList.xsd. The contents of a valid Harvest List can best be illustrated by example. The sample Harvest List below contains two <document> elements that specify the information that Harvester needs to retrieve a pair of EML documents and upload them to Metacat:
<?xml version="1.0" encoding="UTF-8" ?>
<hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList">
  <document>
    <docid>
      <scope>demoDocument</scope>
      <identifier>1</identifier>
      <revision>5</revision>
    </docid>
    <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
    <documentURL>http://www.lternet.edu/~dcosta/document1.xml</documentURL>
  </document>
  <document>
    <docid>
      <scope>demoDocument</scope>
      <identifier>2</identifier>
      <revision>1</revision>
    </docid>
    <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
    <documentURL>http://www.lternet.edu/~dcosta/document2.xml</documentURL>
  </document>
</hrv:harvestList>
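To see what information a Harvest List carries for each document, it can help to read one programmatically. The Python sketch below parses a shortened copy of the sample above with the standard library. It is not how Harvester itself reads the list, it performs no validation against harvestList.xsd, and joining scope, identifier, and revision with dots into a single docid string is an assumption made for display purposes.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample Harvest List (first document only).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList">
  <document>
    <docid>
      <scope>demoDocument</scope>
      <identifier>1</identifier>
      <revision>5</revision>
    </docid>
    <documentType>eml://ecoinformatics.org/eml-2.0.0</documentType>
    <documentURL>http://www.lternet.edu/~dcosta/document1.xml</documentURL>
  </document>
</hrv:harvestList>
"""


def read_harvest_list(xml_bytes):
    """Return one dictionary per <document> element in the Harvest List."""
    root = ET.fromstring(xml_bytes)
    entries = []
    # The <document> children carry no namespace prefix, so plain names match.
    for document in root.findall("document"):
        scope = document.findtext("docid/scope")
        identifier = document.findtext("docid/identifier")
        revision = document.findtext("docid/revision")
        entries.append({
            # Assumption: dotted form used here only as a readable label.
            "docid": "%s.%s.%s" % (scope, identifier, revision),
            "documentType": document.findtext("documentType"),
            "documentURL": document.findtext("documentURL"),
        })
    return entries


entries = read_harvest_list(SAMPLE)
for entry in entries:
    print(entry["docid"], entry["documentURL"])
```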
After editing the Harvest List, ensure that the Harvest List XML file resides at the appropriate location on disk as specified by the URL that was entered during the registration process.
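A site with many documents might prefer to generate its Harvest List rather than edit it by hand. The following Python sketch builds the same structure as the sample Harvest List from a list of records. The record layout (a list of dictionaries) and the function name are illustrative assumptions, and the output is not validated against harvestList.xsd.

```python
import xml.etree.ElementTree as ET

HARVEST_LIST_NS = "eml://ecoinformatics.org/harvestList"
# Serialize the root element with the hrv prefix used in the sample.
ET.register_namespace("hrv", HARVEST_LIST_NS)


def build_harvest_list(documents):
    """Build Harvest List XML text from a list of document records."""
    root = ET.Element("{%s}harvestList" % HARVEST_LIST_NS)
    for doc in documents:
        document = ET.SubElement(root, "document")
        docid = ET.SubElement(document, "docid")
        ET.SubElement(docid, "scope").text = doc["scope"]
        ET.SubElement(docid, "identifier").text = str(doc["identifier"])
        ET.SubElement(docid, "revision").text = str(doc["revision"])
        ET.SubElement(document, "documentType").text = doc["documentType"]
        ET.SubElement(document, "documentURL").text = doc["documentURL"]
    return ET.tostring(root, encoding="unicode")


# Records matching the two documents in the sample Harvest List.
records = [
    {"scope": "demoDocument", "identifier": 1, "revision": 5,
     "documentType": "eml://ecoinformatics.org/eml-2.0.0",
     "documentURL": "http://www.lternet.edu/~dcosta/document1.xml"},
    {"scope": "demoDocument", "identifier": 2, "revision": 1,
     "documentType": "eml://ecoinformatics.org/eml-2.0.0",
     "documentURL": "http://www.lternet.edu/~dcosta/document2.xml"},
]

harvest_list_xml = build_harvest_list(records)
print(harvest_list_xml)
```

The resulting text would then be saved to the file that the registered Harvest List URL points to.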
To prepare a set of EML documents for harvest, ensure that the following is true for each document:
After every scheduled harvest that takes place at a particular Harvest Site, Harvester will send an email report to the Site Contact detailing the operations that were performed during the harvest. The report will contain information about the operations that were performed by Harvester at that site, such as which EML documents were harvested and whether any errors were encountered.
The Site Contact should review the report, paying particularly close attention to any errors that are reported. Errors are indicated by operations that display a status value of 1, while operations that display a status value of 0 indicate that the operation completed successfully.
When errors are reported, the Site Contact should try to determine whether the source of the error is something that can be corrected at the site. Common causes of errors might be:
If the Site Contact is unable to determine the cause of the error and its resolution, he or she should contact the Harvester Administrator for assistance.