Harvester Class Descriptions Revision 1.1 12/12/03

Class Description

Harvester
The main controller class. Reads the harvest site schedule from the
database, creating a HarvestSiteSchedule object for each entry in
the HARVEST_SITE_SCHEDULE table. Manages a list of HarvestLog
entries to keep a permanent record of harvest operations. Provides
operations to read the harvest log from the database, insert new
log entries, and write the harvest log to the database.

HarvestSiteSchedule
Manages site scheduling and other data for a site.
Corresponds to a single entry in the HARVEST_SITE_SCHEDULE
table. For a given site, stores its contactEmail address, the
date of its last harvest, the date of its next harvest (if stored
explicitly), the document list URL, the LDAP DN, the harvest
frequency, and the frequency unit (e.g. days, weeks, or months).
If the date of the next harvest has not been stored explicitly,
it is derived from the date of the last harvest and the
harvest frequency. Provides operations to get and parse the site's
document list. Provides operations to interact with the
HARVEST_SITE_SCHEDULE database table. Manages a list of
HarvestDocument objects, one for each entry in the site's
document list.

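The derivation of the next harvest date from the last harvest date, the
harvest frequency, and the frequency unit can be sketched as follows. This is
a minimal illustration only, not the Harvester's actual code: the function
names, the accepted unit spellings, and the month-end clamping rule are all
assumptions, and it assumes a last harvest date is always available.

```python
import calendar
from datetime import date, timedelta
from typing import Optional


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of shorter months."""
    index = start.year * 12 + (start.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def next_harvest_date(last_harvest: date, frequency: int, unit: str,
                      explicit_next: Optional[date] = None) -> date:
    """Return the stored next-harvest date, or derive it when none is stored."""
    if explicit_next is not None:
        # An explicitly stored date takes precedence over the derived one.
        return explicit_next
    unit = unit.lower()
    if unit in ("day", "days"):
        return last_harvest + timedelta(days=frequency)
    if unit in ("week", "weeks"):
        return last_harvest + timedelta(weeks=frequency)
    if unit in ("month", "months"):
        return add_months(last_harvest, frequency)
    raise ValueError(f"unsupported frequency unit: {unit!r}")
```

For example, a site last harvested on 12 December 2003 with a frequency of
2 weeks would be due again on 26 December 2003 under these assumptions.
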
HarvestDocument
Represents a single document to be harvested. Stores data about
the document (documentURL, documentType, identifier, revision,
scope). Provides operations to get the document from the site
and put (insert or update) the document to Metacat. Queries
Metacat to determine whether Metacat already has the document
and to determine the highest revision number of the document
stored in Metacat.

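The description above does not spell out how the put operation chooses
between an insert and an update. One plausible reading, shown here purely as
a hypothetical sketch, is to compare the document's revision with the highest
revision Metacat reports. The function name, the return values, and the rule
itself are assumptions rather than documented Harvester behavior.

```python
from typing import Optional


def choose_put_action(candidate_revision: int,
                      highest_in_metacat: Optional[int]) -> str:
    """Pick 'insert', 'update', or 'skip' for one harvest document.

    highest_in_metacat is None when Metacat holds no revision of the
    document at all (an assumption about how absence is represented).
    """
    if highest_in_metacat is None:
        return "insert"   # Metacat has never seen this identifier
    if candidate_revision > highest_in_metacat:
        return "update"   # the site offers a newer revision
    return "skip"         # Metacat is already at this revision or later
```
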
HarvestLog
Represents a single harvest log entry. For a given harvest
operation, records the date, type of operation, message string,
status, and detailLogID (if this operation generated an error
that involved a harvest document; see HarvestDetailLog below).
Interacts with the HARVEST_LOG database table, and retrieves
information about operations from the HARVEST_OPERATION table.

HarvestDetailLog
Stores detailed information about a harvest operation on a document
when the operation results in an error (e.g. the document could not
be retrieved from the site, or could not be inserted or updated into
Metacat). Stores a unique identifier (detailLogID), the identifier
of its associated HarvestLog object (harvestLogID), the error message,
and a HarvestDocument object. Interacts with the HARVEST_DETAIL_LOG
table.

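The relationship between a HarvestLog entry and its optional HarvestDetailLog
can be pictured with two small record types. This is a sketch under stated
assumptions: the field types, the integer status code, the operation string,
and the use of a plain URL string in place of a full HarvestDocument object
are illustrative choices, and no database interaction is shown.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class HarvestDetailLog:
    detail_log_id: int      # unique identifier (detailLogID)
    harvest_log_id: int     # identifier of the owning HarvestLog
    error_message: str
    document_url: str       # stand-in for the HarvestDocument object


@dataclass
class HarvestLog:
    harvest_log_id: int
    harvest_date: date
    operation: str          # type of operation (hypothetical string code)
    message: str
    status: int             # assumed convention: 0 = success, non-zero = error
    detail: Optional[HarvestDetailLog] = None

    @property
    def detail_log_id(self) -> Optional[int]:
        """detailLogID is only present when a document-level error occurred."""
        return self.detail.detail_log_id if self.detail else None


def log_document_error(log_id: int, detail_id: int, when: date,
                       operation: str, document_url: str,
                       error: str) -> HarvestLog:
    """Build a log entry plus its detail record for a failed document operation."""
    detail = HarvestDetailLog(detail_id, log_id, error, document_url)
    return HarvestLog(log_id, when, operation, error, status=1, detail=detail)
```
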
Sequence of Events for Harvester Pull Operation

1. Harvester starts up. (Add log entry to record startup operation.)

2. Harvester reads the Harvest registry from the database, creating a
   HarvestSiteSchedule object for every record in the
   HARVEST_SITE_SCHEDULE table.

3. For each HarvestSiteSchedule object:

   4. If this site is due to be harvested today:

      5. Get the harvest configuration from the documentListURL.

      6. Parse the document list file, creating a HarvestDocument
         object for each document element in the file. (If the file is
         not a valid document list, log an exception, send an email
         report, and exit.)

      7. For each HarvestDocument object:

         8. Query Metacat as to whether it already has this document.

         9. If Metacat does not already have the document:

            10. Get the document from the site, using the document URL.

            11. Parse the document to check that it is valid EML. (If not
                valid, log an exception and continue.)

            12. Insert or update the document to Metacat.

            13. Log the result of inserting or updating the document to
                Metacat (including any exceptions).

         14. Else, Metacat already has the document. Determine the
             highest revision of the document in Metacat and provide
             this information in the report to the site contact (step 15).

      15. Generate and send an email report to the site contact. The report
          will contain results of this harvest, and the (tentatively)
          scheduled date of the next harvest.

16. Generate and send an email report to the harvester administrator. The
    report is a composite of all the individual site reports.
    (Alternatively, this could be written to a log file instead of emailed.
    An email message could simply contain the location of the log file and
    a brief summary.)

17. Harvester shuts down. (Add log entry for shutdown.)

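The control flow of steps 3 through 15 can be sketched as a single loop. This
is a simplified illustration, not the Harvester implementation: the Site and
Doc types and every callable parameter are hypothetical stand-ins for the
site, document, and Metacat operations described above. Logging to the
database, sending email, and the early exit on an invalid document list
(step 6) are left out, and the function only collects the report lines that
step 15 would send.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Iterable, List


@dataclass
class Doc:
    identifier: str
    revision: int
    url: str


@dataclass
class Site:
    contact_email: str
    document_list_url: str
    next_harvest: date


def run_pull(sites: Iterable[Site], today: date,
             get_document_list: Callable[[str], List[Doc]],
             get_document: Callable[[str], str],
             is_valid_eml: Callable[[str], bool],
             has_document: Callable[[Doc], bool],
             highest_revision: Callable[[str], int],
             put_document: Callable[[Doc, str], None]) -> Dict[str, List[str]]:
    """Return report lines keyed by site contact email for every site that is due."""
    reports: Dict[str, List[str]] = {}
    for site in sites:                                          # step 3
        if site.next_harvest > today:                           # step 4
            continue
        lines: List[str] = []
        for doc in get_document_list(site.document_list_url):   # steps 5-7
            if has_document(doc):                               # step 8
                rev = highest_revision(doc.identifier)          # step 14
                lines.append(f"{doc.identifier}: already in Metacat (highest revision {rev})")
                continue
            content = get_document(doc.url)                     # step 10
            if not is_valid_eml(content):                       # step 11
                lines.append(f"{doc.identifier}: not valid EML, skipped")
                continue
            try:
                put_document(doc, content)                      # step 12
                lines.append(f"{doc.identifier}: stored revision {doc.revision}")
            except Exception as exc:                            # step 13
                lines.append(f"{doc.identifier}: put failed ({exc})")
        reports[site.contact_email] = lines                     # input to step 15
    return reports
```

Passing the site and Metacat operations in as callables keeps the sketch
runnable with in-memory fakes; a composite administrator report (step 16)
could then be assembled from the returned dictionary.
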