Harvester Class Descriptions               Revision 1.1         12/12/03

Class       Description

Harvester
            The main controller class. Reads the harvest site schedule from the
            database, creating a HarvestSiteSchedule object for each entry in
            the HARVEST_SITE_SCHEDULE table. Manages a list of HarvestLog
            entries to keep a permanent record of harvest operations. Provides
            operations to read the harvest log from the database, insert new
            log entries, and write the harvest log to the database.

HarvestSiteSchedule
            Manages site scheduling and other data for a site.
            Corresponds to a single entry in the HARVEST_SITE_SCHEDULE
            table. For a given site, stores its contactEmail address, the
            date of its last harvest, the date of its next harvest (if stored
            explicitly), the document list URL, the LDAP DN, the harvest
            frequency, and the frequency unit (e.g. days, weeks, or months).
            If the date of the next harvest has not been stored explicitly,
            it is derived from the date of the last harvest and the
            harvest frequency. Provides operations to get and parse the site's
            document list. Provides operations to interact with the
            HARVEST_SITE_SCHEDULE database table. Manages a list of
            HarvestDocument objects, one for each entry in the site's
            document list.

HarvestDocument
            Represents a single document to be harvested. Stores data about
            the document (documentURL, documentType, identifier, revision,
            scope). Provides operations to get the document from the site
            and put (insert or update) the document to Metacat. Queries
            Metacat to determine whether Metacat already has the document
            and to determine the highest revision number of the document
            stored in Metacat.

HarvestLog
            Represents a single harvest log entry. For a given harvest
            operation, records the date, type of operation, message string,
            status, and detailLogID (if this operation generated an error
            that involved a harvest document; see HarvestDetailLog below).
            Interacts with the HARVEST_LOG database table, and retrieves
            information about operations from the HARVEST_OPERATION table.

HarvestDetailLog
            Stores detailed information about a harvest operation on a document
            when the operation results in an error (e.g. the document could not
            be retrieved from the site, or could not be inserted into or updated
            in Metacat). Stores a unique identifier (detailLogID), the identifier
            of its associated HarvestLog object (harvestLogID), the error message,
            and a HarvestDocument object. Interacts with the HARVEST_DETAIL_LOG
            table.
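
The HarvestSiteSchedule description says that the next harvest date, when
not stored explicitly, is derived from the last harvest date and the harvest
frequency. The Python sketch below illustrates one way that derivation could
work. It is not the project's code: the class name SiteSchedule, its field
names, the add_months helper, the end-of-month clamping rule, and the choice
to treat overdue sites as due are all assumptions made for illustration.

```python
import calendar
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month.

    Hypothetical helper; the clamping rule is an assumption.
    """
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


@dataclass
class SiteSchedule:
    """Hypothetical stand-in for the scheduling fields of HarvestSiteSchedule."""
    date_last_harvest: date
    update_frequency: int
    unit: str  # "days", "weeks", or "months"
    date_next_harvest: Optional[date] = None  # set only if stored explicitly

    def next_harvest(self) -> date:
        # An explicitly stored date takes precedence over the derived one.
        if self.date_next_harvest is not None:
            return self.date_next_harvest
        if self.unit == "days":
            return self.date_last_harvest + timedelta(days=self.update_frequency)
        if self.unit == "weeks":
            return self.date_last_harvest + timedelta(weeks=self.update_frequency)
        if self.unit == "months":
            return add_months(self.date_last_harvest, self.update_frequency)
        raise ValueError(f"unknown frequency unit: {self.unit!r}")

    def is_due(self, today: date) -> bool:
        # Assumption: a site whose next harvest date has passed is also due.
        return self.next_harvest() <= today
```

For example, SiteSchedule(date(2003, 12, 12), 2, "weeks").next_harvest()
evaluates to date(2003, 12, 26).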


Sequence of Events for Harvester Pull Operation

1. Harvester starts up. (Add log entry to record startup operation.)

2. Harvester reads the harvest site schedule from the database, creating a
   HarvestSiteSchedule object for every record in the
   HARVEST_SITE_SCHEDULE table.

3. For each HarvestSiteSchedule object:

    4. If this site is due to be harvested today:

        5. Get the site's document list from the documentListURL.

        6. Parse the document list file, creating a HarvestDocument
           object for each document element in the file. (If the file is not a
           valid document list, log an exception, send email report, and exit.)

        7. For each HarvestDocument object:

            8. Query Metacat as to whether it already has this document.

            9. If Metacat does not already have the document:

                10. Get the document from the site, using the document URL.

                11. Parse the document to check that it is valid EML. (If not valid,
                    log an exception and continue.)

                12. Insert or update the document in Metacat.

                13. Log the result of inserting or updating the document in
                    Metacat (including any exceptions).

            14. Else, Metacat already has the document. Determine the
                highest revision of the document in Metacat and provide
                this information in the report to the site contact (step 15).

        15. Generate and send an email report to the site contact. The report
            will contain results of this harvest, and the (tentatively)
            scheduled date of the next harvest.

16. Generate and send an email report to the harvester administrator. The report
    is a composite of all the individual site reports. (Alternatively, this
    could be written to a log file instead of emailed. An email message could
    simply contain the location of the log file and a brief summary.)

17. Harvester shuts down. (Add log entry for shutdown.)
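
The per-document part of the sequence above (steps 7 through 14) can be
sketched in Python as follows. This is an illustration under assumptions,
not the Harvester's implementation: the Document class, the harvest_site
function, and the five injected callables (has_document, highest_revision,
fetch, is_valid_eml, put) are hypothetical names standing in for the Metacat
queries, the site download, and the EML validation, none of which are
specified here. The report is modeled as a plain list of strings.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical stand-in for HarvestDocument."""
    url: str
    identifier: str
    revision: int


def harvest_site(
    documents: List[Document],
    has_document: Callable[[Document], bool],
    highest_revision: Callable[[Document], int],
    fetch: Callable[[str], str],
    is_valid_eml: Callable[[str], bool],
    put: Callable[[Document, str], None],
) -> List[str]:
    """Run steps 7-14 for one site and return the lines of its report.

    Every callable is an injected placeholder (an assumption of this sketch).
    """
    report: List[str] = []
    for doc in documents:                          # step 7
        label = f"{doc.identifier}.{doc.revision}"
        if has_document(doc):                      # step 8
            # Step 14: report the highest revision Metacat already holds.
            report.append(
                f"{label}: already in Metacat, "
                f"highest revision {highest_revision(doc)}"
            )
            continue
        try:
            text = fetch(doc.url)                  # step 10
            if not is_valid_eml(text):             # step 11
                report.append(f"{label}: not valid EML, skipped")
                continue
            put(doc, text)                         # step 12
            report.append(f"{label}: inserted or updated")   # step 13
        except Exception as exc:                   # step 13: log any exceptions
            report.append(f"{label}: error: {exc}")
    return report
```

Passing the Metacat and site operations in as callables keeps the control
flow testable without a database or network connection. A failure on one
document is recorded in the report and the loop moves on to the next
document, which matches the "log an exception and continue" wording of
step 11.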