Harvester Class Descriptions Revision 1.1 12/12/03

Class Description

Harvester
The main controller class. Reads the harvest site schedule from the
database, creating a HarvestSiteSchedule object for each entry in
the HARVEST_SITE_SCHEDULE table. Manages a list of HarvestLog
entries to keep a permanent record of harvest operations. Provides
operations to read the harvest log from the database, insert new
log entries, and write the harvest log to the database.

HarvestSiteSchedule
Manages site scheduling and other data for a site.
Corresponds to a single entry in the HARVEST_SITE_SCHEDULE
table. For a given site, stores its contact email address, the
date of its last harvest, the date of its next harvest (if stored
explicitly), the document list URL, the LDAP DN, the harvest
frequency, and the frequency unit (e.g. days, weeks, or months).
If the date of the next harvest has not been stored explicitly,
it is derived from the date of the last harvest and the
harvest frequency. Provides operations to get and parse the site's
document list. Provides operations to interact with the
HARVEST_SITE_SCHEDULE database table. Manages a list of
HarvestDocument objects, one for each entry in the site's
document list.

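The derivation of the next harvest date can be sketched as follows. This
is a minimal illustration in Python, not the Harvester's own code: the
SiteSchedule class, its field names, and the add_months helper are all
hypothetical, and treating an overdue site as due is an assumption.

```python
import calendar
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


def add_months(start: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of the target month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


@dataclass
class SiteSchedule:
    """Hypothetical stand-in for the scheduling fields of HarvestSiteSchedule."""
    date_last_harvest: date
    harvest_frequency: int
    unit: str  # "days", "weeks", or "months"
    date_next_harvest: Optional[date] = None  # set only if stored explicitly

    def next_harvest(self) -> date:
        # An explicitly stored date takes precedence over the derived one.
        if self.date_next_harvest is not None:
            return self.date_next_harvest
        if self.unit == "days":
            return self.date_last_harvest + timedelta(days=self.harvest_frequency)
        if self.unit == "weeks":
            return self.date_last_harvest + timedelta(weeks=self.harvest_frequency)
        if self.unit == "months":
            return add_months(self.date_last_harvest, self.harvest_frequency)
        raise ValueError(f"unknown frequency unit: {self.unit!r}")

    def is_due(self, today: date) -> bool:
        # Assumption: a site whose date has already passed is also due.
        return self.next_harvest() <= today
```

Clamping month arithmetic to the end of the month is one possible design
choice; the source does not say how month-based frequencies are handled.
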
HarvestDocument
Represents a single document to be harvested. Stores data about
the document (documentURL, documentType, identifier, revision,
scope). Provides operations to get the document from the site
and put (insert or update) the document to Metacat. Queries
Metacat to determine whether Metacat already has the document
and to determine the highest revision number of the document
stored in Metacat.

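One way to read the insert-or-update decision is sketched below in
Python. FakeMetacat is an in-memory stand-in, not the real Metacat
client API, and the rule itself (insert when no revision exists, update
when the site's revision is higher, otherwise skip) is an assumption
about what "already has the document" means.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


class FakeMetacat:
    """In-memory stand-in for Metacat (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # Maps (scope, identifier) to the highest stored revision.
        self._revisions: Dict[Tuple[str, int], int] = {}

    def highest_revision(self, scope: str, identifier: int) -> Optional[int]:
        return self._revisions.get((scope, identifier))

    def store(self, scope: str, identifier: int, revision: int) -> None:
        self._revisions[(scope, identifier)] = revision


@dataclass
class Document:
    """Hypothetical subset of the HarvestDocument fields."""
    scope: str
    identifier: int
    revision: int


def choose_action(doc: Document, metacat: FakeMetacat) -> str:
    """Return "insert", "update", or "skip" for this document."""
    highest = metacat.highest_revision(doc.scope, doc.identifier)
    if highest is None:
        return "insert"   # Metacat has no revision of this document
    if doc.revision > highest:
        return "update"   # the site offers a newer revision
    return "skip"         # Metacat already has this revision or a later one
```
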
HarvestLog
Represents a single harvest log entry. For a given harvest
operation, records the date, type of operation, message string,
status, and detailLogID (if this operation generated an error
that involved a harvest document; see HarvestDetailLog below).
Interacts with the HARVEST_LOG database table, and retrieves
information about operations from the HARVEST_OPERATION table.

HarvestDetailLog
Stores detailed information about a harvest operation on a document
when the operation results in an error (e.g. the document could not
be retrieved from the site, or could not be inserted into or updated
in Metacat). Stores a unique identifier (detailLogID), the identifier
of its associated HarvestLog object (harvestLogID), the error message,
and a HarvestDocument object. Interacts with the HARVEST_DETAIL_LOG
table.


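The link between a log entry and its detail record might look like the
following Python sketch. The class names, field names, ID generation,
and the convention that a nonzero status means an error are all
assumptions made for illustration; the database tables are left out.

```python
from dataclasses import dataclass
from datetime import date
from itertools import count
from typing import List, Optional


@dataclass
class DetailLogEntry:
    detail_log_id: int
    harvest_log_id: int   # points back to the owning log entry
    error_message: str
    document_url: str     # stand-in for the HarvestDocument object


@dataclass
class LogEntry:
    harvest_log_id: int
    harvest_date: date
    operation: str
    message: str
    status: int           # assumption: 0 means success, nonzero means error
    detail_log_id: Optional[int] = None


class HarvestLogBook:
    """Hypothetical in-memory holder for log entries and their details."""

    def __init__(self) -> None:
        self.entries: List[LogEntry] = []
        self.details: List[DetailLogEntry] = []
        self._log_ids = count(1)
        self._detail_ids = count(1)

    def record(self, harvest_date: date, operation: str, message: str,
               status: int = 0,
               failed_document_url: Optional[str] = None) -> LogEntry:
        entry = LogEntry(next(self._log_ids), harvest_date, operation,
                         message, status)
        # A detail record is created only for errors that involve a document.
        if status != 0 and failed_document_url is not None:
            detail = DetailLogEntry(next(self._detail_ids),
                                    entry.harvest_log_id, message,
                                    failed_document_url)
            entry.detail_log_id = detail.detail_log_id
            self.details.append(detail)
        self.entries.append(entry)
        return entry
```
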
Sequence of Events for Harvester Pull Operation

1. Harvester starts up. (Add log entry to record startup operation.)

2. Harvester reads the harvest site schedule from the database, creating a
HarvestSiteSchedule object for every record in the
HARVEST_SITE_SCHEDULE table.

3. For each HarvestSiteSchedule object:

4. If this site is due to be harvested today:

5. Get the site's document list from the documentListURL.

6. Parse the document list file, creating a HarvestDocument
object for each document element in the file. (If the file is not a
valid document list, log an exception, send an email report, and exit.)

7. For each HarvestDocument object:

8. Query Metacat as to whether it already has this document.

9. If Metacat does not already have the document:

10. Get the document from the site, using the document URL.

11. Parse the document to check that it is valid EML. (If not valid,
log an exception and continue.)

12. Insert or update the document to Metacat.

13. Log the result of inserting or updating the document to
Metacat (including any exceptions).

14. Else, Metacat already has the document. Determine the
highest revision of the document in Metacat and provide
this information in the report to the site contact (step 15).

15. Generate and send an email report to the site contact. The report
will contain results of this harvest, and the (tentatively)
scheduled date of the next harvest.

16. Generate and send an email report to the harvester administrator. The report
is a composite of all the individual site reports. (Alternatively, this
could be written to a log file instead of emailed. An email message could
simply contain the location of the log file and a brief summary.)

17. Harvester shuts down. (Add log entry for shutdown.)

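The core of the sequence above (steps 3 through 15) can be sketched in
Python as a pair of nested loops. This is an illustration under stated
assumptions, not the Harvester implementation: Doc, Site, and harvest
are hypothetical names, Metacat is modelled as a dictionary of highest
revisions, document retrieval and EML validation are passed in as
functions, and database access, logging, and email delivery (steps 1, 2,
16, and 17) are omitted.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List, Tuple

DocKey = Tuple[str, int]  # (scope, identifier)


@dataclass
class Doc:
    scope: str
    identifier: int
    revision: int
    url: str


@dataclass
class Site:
    contact_email: str
    next_harvest: date
    documents: List[Doc]  # stands in for getting and parsing the list (steps 5-6)


def harvest(sites: List[Site], metacat: Dict[DocKey, int],
            fetch: Callable[[str], str],
            is_valid_eml: Callable[[str], bool],
            today: date) -> Dict[str, List[str]]:
    """Return one report (a list of lines) per harvested site contact."""
    reports: Dict[str, List[str]] = {}
    for site in sites:                                       # step 3
        if site.next_harvest > today:                        # step 4
            continue
        lines: List[str] = []
        for doc in site.documents:                           # step 7
            key = (doc.scope, doc.identifier)
            highest = metacat.get(key)                       # step 8
            if highest is None or doc.revision > highest:    # step 9
                text = fetch(doc.url)                        # step 10
                if not is_valid_eml(text):                   # step 11
                    lines.append(f"{doc.url}: invalid EML, skipped")
                    continue
                action = "inserted" if highest is None else "updated"
                metacat[key] = doc.revision                  # step 12
                lines.append(f"{doc.url}: {action} revision {doc.revision}")  # step 13
            else:                                            # step 14
                lines.append(f"{doc.url}: Metacat already has revision {highest}")
        reports[site.contact_email] = lines                  # step 15 (sending omitted)
    return reports
```

Passing fetch and is_valid_eml in as parameters keeps the control flow
testable without a network connection or a running Metacat server.
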