Revision 2009
Added by Duane Costa almost 21 years ago
--- docs/dev/harvester/HarvesterClassDescriptions.txt
+++ docs/dev/harvester/HarvesterClassDescriptions.txt
@@ -3,83 +3,102 @@
 Class Description
 
 Harvester
-The main controller class. Reads the harvest registry from the
-database, creating a HarvestInformation object for each entry in
-the harvest registry. Manages a list of HarvestLog entries to keep
-a permanent record of harvest operations. Provides operations to
-read the harvest log from the database, add new harvest log entries,
-and write the harvest log to the database.
+The main controller class. Reads the harvest site schedule from the
+database, creating a HarvestSiteSchedule object for each entry in
+the HARVEST_SITE_SCHEDULE table. Manages a list of HarvestLog
+entries to keep a permanent record of harvest operations. Provides
+operations to read the harvest log from the database, insert new
+log entries, and write the harvest log to the database.
 
-HarvestInformation
-Represents information for a single harvest site in the harvest
-registry. For a given site, stores its ldapDN, its
-harvestConfigurationURL, the date of its last harvest, the date of
-its next harvest (if stored explicitly), and the harvest frequency.
-If the date of the next harvest has not been stored explicitly, it
-can be derived from the date of the last harvest and the harvest
-frequency. Provides operations to get and parse the site's harvest
-configuration document. For each document listed in the harvest
-configuration document, creates a HarvestDocument object. Manages
-harvest scheduling for a site. Provides operations to interact with
-the HARVEST_INFORMATION database table.
+HarvestSiteSchedule
+Manages site scheduling and other data for a site.
+Corresponds to a single entry in the HARVEST_SITE_SCHEDULE
+table. For a given site, stores its contactEmail address, the
+date of its last harvest, the date of its next harvest (if stored
+explicitly), the document list URL, the LDAP DN, the harvest
+frequency, and the frequency unit (e.g. days, weeks, or months).
+If the date of the next harvest has not been stored explicitly,
+it is derived from the date of the last harvest and the
+harvest frequency. Provides operations to get and parse the site's
+document list. Provides operations to interact with the
+HARVEST_SITE_SCHEDULE database table. Manages a list a
+HarvestDocument objects, one for each entry in the site's
+document list.
 
 HarvestDocument
-Represents a single document to be harvested. Stores data about the
-document (documentURL, documentType, docid, etc.). Provides
-operations to get the document from the site and put (insert or
-update) the document to Metacat. Provides operations to interact
-with the HARVEST_DOCUMENT database table.
+Represents a single document to be harvested. Stores data about
+the document (documentURL, documentType, identifier, revision,
+scope). Provides operations to get the document from the site
+and put (insert or update) the document to Metacat. Queries
+Metacat to determine whether Metacat already has the document
+and to determine the highest revision number of the document
+stored in Metacat.
 
 HarvestLog
 Represents a single harvest log entry. For a given Harvest
 operation, records the date, type of operation, message string,
-status, and docid (if this operation pertains to a HarvestDocument).
-Provides operations to interact with the HARVEST_LOG database table.
+status, and detailLogID (if this operation generated an error
+that involved a harvest document; see HarvestDetailLog below).
+Interacts with the HARVEST_LOG database table, and retrieves
+information about operations from the HARVEST_OPERATION table.
 
+HarvestDetailLog
+Stores detailed information about a harvest operation on a document
+when the operation results in an error (e.g. the document could not
+be retrieved from the site, or could not be inserted or updated into
+Metacat). Stores a unique identifier (detailLogID), the identifier
+of its associated HarvestLog object (harvestLogID), the error message,
+and a HarvestDocument object. Interacts with the HARVEST_DETAIL_LOG
+table.
+
 
-Sequence of Events for Pull Operation
+Sequence of Events for Harvester Pull Operation
 
 1. Harvester starts up. (Add log entry to record startup operation.)
 
 2. Harvester reads the Harvest registry from the database, creating a
-HarvestInformation object for every record in the HARVEST_INFORMATION table.
+HarvestSiteSchedule object for every record in the
+HARVEST_SITE_SCHEDULE table.
 
-3. For each HarvestInformation object:
+3. For each HarvestSiteSchedule object:
 
 4. If this site is due to be harvested today:
 
-5. Get the harvest configuration from the harvestConfigurationURL.
+5. Get the harvest configuration from the documentListURL.
 
-6. Parse the harvest configuration file, creating a HarvestDocument
+6. Parse the document list file, creating a HarvestDocument
 object for each document element in the file. (If file is not a
-valid harvest configuration file, log an exception, send email
-report, and exit.)
+valid document list, log an exception, send email report, and exit.)
 
-7. For each document element in the harvest configuration file:
+7. For each HarvestDocument object:
 
-8. Get the document from the site, using the document URL.
+8. Query Metacat as to whether it already has this document.
 
-9. Parse the document to check that it is valid EML. (If not valid,
-log an exception and continue.)
+9. If Metacat does not already have the document:
 
-10. Query Metacat as to whether it already has this document
+10. Get the document from the site, using the document URL.
 
-11. If Metacat does not already have the document:
+11. Parse the document to check that it is valid EML. (If not valid,
+log an exception and continue.)
 
 12. Insert or update the document to Metacat.
 
 13. Log the result of inserting or updating the document to
 Metacat (including any exceptions).
 
-14. Generate and send an email report to the site contact. The report
+14. Else, Metacat already has the document. Determine the
+highest revision of the document in Metacat and provide
+this information in the report to the site contact (15).
+
+15. Generate and send an email report to the site contact. The report
 will contain results of this harvest, and the (tentatively)
 scheduled date of the next harvest.
 
-15. Generate and send an email report to the harvester administrator. The report
+16. Generate and send an email report to the harvester administrator. The report
 is a composite of all the individual site reports. (Alternatively, this
 could be written to a log file instead of emailed. An email message could
 simply contain the location of the log file and a brief summary.)
 
-16. Harvester shuts down. (Add log entry for shutdown.)
+17. Harvester shuts down. (Add log entry for shutdown.)
 
 
Revisions to Harvester Class Diagram and Harvester Class Descriptions.
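
Reviewer note: the revised HarvestSiteSchedule description and steps 4 and 8-14 of the pull sequence can be illustrated with a minimal Java sketch. It is not taken from the Harvester source. The class name HarvestSketch, the method names, the Action enum, and the use of java.time are all assumptions made for illustration. It also assumes that a site whose next harvest date has already passed counts as "due", which the description does not state.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch only; none of these names come from the Harvester code base.
public class HarvestSketch {

    /** Hypothetical outcome of the per-document decision in steps 8-14. */
    public enum Action { FETCH_VALIDATE_AND_PUT, REPORT_HIGHEST_REVISION }

    /**
     * Returns the explicitly stored next-harvest date if there is one;
     * otherwise derives it from the last harvest date, the harvest
     * frequency, and the frequency unit (days, weeks, or months).
     */
    public static LocalDate nextHarvestDate(LocalDate explicitNext,
                                            LocalDate lastHarvest,
                                            int frequency,
                                            ChronoUnit unit) {
        if (explicitNext != null) {
            return explicitNext;
        }
        return lastHarvest.plus(frequency, unit);
    }

    /** Step 4: assumed rule that a site is due once its next date is today or earlier. */
    public static boolean isDue(LocalDate nextHarvest, LocalDate today) {
        return !nextHarvest.isAfter(today);
    }

    /**
     * Steps 8-14: Metacat is queried first. A document it lacks is fetched,
     * validated, and put; a document it already has is only reported,
     * together with its highest revision.
     */
    public static Action planAction(boolean metacatHasDocument) {
        return metacatHasDocument
                ? Action.REPORT_HIGHEST_REVISION
                : Action.FETCH_VALIDATE_AND_PUT;
    }

    public static void main(String[] args) {
        LocalDate last = LocalDate.of(2004, 1, 31);
        LocalDate next = nextHarvestDate(null, last, 1, ChronoUnit.MONTHS);
        System.out.println("next harvest: " + next);
        System.out.println("due today: " + isDue(next, LocalDate.now()));
        System.out.println("action: " + planAction(false));
    }
}
```

Querying Metacat before fetching, as the revised sequence does, means a document that Metacat already holds is never retrieved from the site at all.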