Harvester Class Descriptions               Revision 1.1         12/12/03

Class       Description

Harvester
            The main controller class. Reads the harvest registry from the
            database, creating a HarvestInformation object for each entry in
            the harvest registry. Manages a list of HarvestLog entries to keep
            a permanent record of harvest operations. Provides operations to
            read the harvest log from the database, add new harvest log entries,
            and write the harvest log to the database.

HarvestInformation
            Represents information for a single harvest site in the harvest
            registry. For a given site, stores its ldapDN, its
            harvestConfigurationURL, the date of its last harvest, the date of
            its next harvest (if stored explicitly), and the harvest frequency.
            If the date of the next harvest has not been stored explicitly, it
            can be derived from the date of the last harvest and the harvest
            frequency. Provides operations to get and parse the site's harvest
            configuration document. For each document listed in the harvest
            configuration document, creates a HarvestDocument object. Manages
            harvest scheduling for a site. Provides operations to interact with
            the HARVEST_INFORMATION database table.

HarvestDocument
            Represents a single document to be harvested. Stores data about the
            document (documentURL, documentType, docid, etc.). Provides
            operations to get the document from the site and put (insert or
            update) the document to Metacat. Provides operations to interact
            with the HARVEST_DOCUMENT database table.

HarvestLog
            Represents a single harvest log entry. For a given harvest
            operation, records the date, type of operation, message string,
            status, and docid (if this operation pertains to a HarvestDocument).
            Provides operations to interact with the HARVEST_LOG database table.
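The HarvestInformation entry above says that the next harvest date can be
derived from the last harvest date and the harvest frequency when it is not
stored explicitly. A minimal sketch of that rule follows. The class name
HarvestScheduleSketch, its method names, and the choice of days as the unit
of the harvest frequency are assumptions made for illustration; they are not
taken from the actual Harvester code. The sketch also treats a site as due
when its next harvest date is today or earlier, which is one possible reading
of "due to be harvested today".

```java
import java.time.LocalDate;
import java.util.Objects;

// Hypothetical sketch of the scheduling rule; not the real HarvestInformation class.
final class HarvestScheduleSketch {
    private final LocalDate lastHarvestDate;   // date of the last harvest
    private final LocalDate nextHarvestDate;   // null when not stored explicitly
    private final int harvestFrequencyDays;    // assumption: frequency is expressed in days

    HarvestScheduleSketch(LocalDate lastHarvestDate,
                          LocalDate nextHarvestDate,
                          int harvestFrequencyDays) {
        this.lastHarvestDate = Objects.requireNonNull(lastHarvestDate);
        this.nextHarvestDate = nextHarvestDate;
        this.harvestFrequencyDays = harvestFrequencyDays;
    }

    // Use the explicit date when present; otherwise derive it from
    // the last harvest date plus the harvest frequency.
    LocalDate effectiveNextHarvestDate() {
        return nextHarvestDate != null
                ? nextHarvestDate
                : lastHarvestDate.plusDays(harvestFrequencyDays);
    }

    // Assumption: a site is due when its next harvest date is today or earlier.
    boolean isDueOn(LocalDate today) {
        return !today.isBefore(effectiveNextHarvestDate());
    }
}
```

Under these assumptions, a site last harvested on 2003-12-01 with a frequency
of 7 days and no explicit next date becomes due on 2003-12-08.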

    
Sequence of Events for Pull Operation

1. Harvester starts up. (Add log entry to record startup operation.)

2. Harvester reads the harvest registry from the database, creating a
   HarvestInformation object for every record in the HARVEST_INFORMATION table.

3. For each HarvestInformation object:

    4. If this site is due to be harvested today:

        5. Get the harvest configuration from the harvestConfigurationURL.

        6. Parse the harvest configuration file, creating a HarvestDocument
           object for each document element in the file. (If the file is not
           a valid harvest configuration file, log an exception, send an
           email report, and exit.)

        7. For each document element in the harvest configuration file:

            8. Get the document from the site, using the document URL.

            9. Parse the document to check that it is valid EML. (If not valid,
               log an exception and continue.)

            10. Query Metacat as to whether it already has this document.

            11. If Metacat does not already have the document:

                12. Insert or update the document to Metacat.

                13. Log the result of inserting or updating the document to
                    Metacat (including any exceptions).

        14. Generate and send an email report to the site contact. The report
            contains the results of this harvest and the (tentatively)
            scheduled date of the next harvest.

15. Generate and send an email report to the harvester administrator. The report
    is a composite of all the individual site reports. (Alternatively, this
    could be written to a log file instead of emailed. An email message could
    simply contain the location of the log file and a brief summary.)

16. Harvester shuts down. (Add log entry for shutdown.)
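
The control flow of the steps above can be sketched roughly as follows. This
is an illustration only. SiteSketch, RepositorySketch, and PullLoopSketch are
hypothetical stand-ins rather than the real Harvester classes, log entries are
plain strings, and the email reports are reduced to log placeholders. The
sketch also reads "exit" in step 6 as abandoning the current site, which is an
assumption; the step could equally mean stopping the whole run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one harvest site (a HarvestInformation entry).
interface SiteSketch {
    boolean isDueToday();                          // step 4
    List<String> documentUrls() throws Exception;  // steps 5-6: get and parse the configuration
}

// Hypothetical stand-in for access to the remote site and to Metacat.
interface RepositorySketch {
    String fetch(String url) throws Exception;                     // step 8
    boolean isValidEml(String document);                           // step 9
    boolean hasDocument(String url);                               // step 10
    void insertOrUpdate(String url, String doc) throws Exception;  // step 12
}

final class PullLoopSketch {
    final List<String> log = new ArrayList<>();    // simplified harvest log

    void run(List<SiteSketch> sites, RepositorySketch repo) {
        log.add("startup");                                    // step 1
        for (SiteSketch site : sites) {                        // step 3
            if (!site.isDueToday()) {
                continue;                                      // step 4
            }
            List<String> urls;
            try {
                urls = site.documentUrls();                    // steps 5-6
            } catch (Exception e) {
                // Assumption: an invalid configuration abandons this site only.
                log.add("invalid configuration: " + e.getMessage());
                continue;
            }
            for (String url : urls) {                          // step 7
                try {
                    String doc = repo.fetch(url);              // step 8
                    if (!repo.isValidEml(doc)) {               // step 9
                        log.add("invalid EML: " + url);
                        continue;
                    }
                    if (repo.hasDocument(url)) {               // steps 10-11
                        continue;
                    }
                    repo.insertOrUpdate(url, doc);             // step 12
                    log.add("uploaded: " + url);               // step 13
                } catch (Exception e) {
                    log.add("error: " + url + ": " + e.getMessage());
                }
            }
            log.add("site report sent");                       // step 14 (placeholder)
        }
        log.add("administrator report sent");                  // step 15 (placeholder)
        log.add("shutdown");                                   // step 16
    }
}
```

Keeping the per-document work inside its own try block mirrors step 9, where
a failure on one document is logged and the loop continues with the next one.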