Revision 2009

Revisions to Harvester Class Diagram and Harvester Class Descriptions.

View differences:

docs/dev/harvester/HarvesterClassDescriptions.txt
3 3
Class       Description
4 4

  
5 5
Harvester
6
            The main controller class. Reads the harvest registry from the
7
            database, creating a HarvestInformation object for each entry in
8
            the harvest registry. Manages a list of HarvestLog entries to keep
9
            a permanent record of harvest operations. Provides operations to
10
            read the harvest log from the database, add new harvest log entries,
11
            and write the harvest log to the database.
6
            The main controller class. Reads the harvest site schedule from the
7
            database, creating a HarvestSiteSchedule object for each entry in
8
            the HARVEST_SITE_SCHEDULE table. Manages a list of HarvestLog 
9
            entries to keep a permanent record of harvest operations. Provides 
10
            operations to read the harvest log from the database, insert new
11
            log entries, and write the harvest log to the database.
12 12

  
13
HarvestInformation
14
            Represents information for a single harvest site in the harvest
15
            registry. For a given site, stores its ldapDN, its
16
            harvestConfigurationURL, the date of its last harvest, the date of
17
            its next harvest (if stored explicitly), and the harvest frequency.
18
            If the date of the next harvest has not been stored explicitly, it
19
            can be derived from the date of the last harvest and the harvest
20
            frequency. Provides operations to get and parse the site's harvest
21
            configuration document. For each document listed in the harvest
22
            configuration document, creates a HarvestDocument object. Manages
23
            harvest scheduling for a site. Provides operations to interact with
24
            the HARVEST_INFORMATION database table.
13
HarvestSiteSchedule
14
            Manages site scheduling and other data for a site.
15
            Corresponds to a single entry in the HARVEST_SITE_SCHEDULE 
16
            table. For a given site, stores its contactEmail address, the 
17
            date of its last harvest, the date of its next harvest (if stored
18
            explicitly), the document list URL, the LDAP DN, the harvest 
19
            frequency, and the frequency unit (e.g. days, weeks, or months). 
20
            If the date of the next harvest has not been stored explicitly, 
21
            it is derived from the date of the last harvest and the 
22
            harvest frequency. Provides operations to get and parse the site's
23
            document list. Provides operations to interact with the 
24
            HARVEST_SITE_SCHEDULE database table. Manages a list of
25
            HarvestDocument objects, one for each entry in the site's 
26
            document list.
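
The revised HarvestSiteSchedule description says the next harvest date is derived from the last harvest date and the harvest frequency when it is not stored explicitly. A minimal sketch of that rule follows. It is written in Python for brevity and is not the project's code; the names `next_harvest_date` and `add_months`, the unit strings, and the choice to clamp month arithmetic to the last day of the target month are all assumptions made for illustration.

```python
import calendar
from datetime import date, timedelta
from typing import Optional


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month.

    The clamping policy is an assumption; the description does not say how
    month arithmetic should behave for dates such as January 31.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def next_harvest_date(last_harvest: date, frequency: int, unit: str,
                      stored_next: Optional[date] = None) -> date:
    """Return the explicitly stored date if present, else derive it."""
    if stored_next is not None:
        return stored_next
    if unit == "days":
        return last_harvest + timedelta(days=frequency)
    if unit == "weeks":
        return last_harvest + timedelta(weeks=frequency)
    if unit == "months":
        return add_months(last_harvest, frequency)
    raise ValueError(f"unknown frequency unit: {unit!r}")
```

For example, `next_harvest_date(date(2004, 3, 1), 2, "weeks")` yields March 15, 2004, and an explicitly stored date always takes precedence over the derived one.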
25 27

  
26 28
HarvestDocument
27
            Represents a single document to be harvested. Stores data about the
28
            document (documentURL, documentType, docid, etc.). Provides
29
            operations to get the document from the site and put (insert or
30
            update) the document to Metacat. Provides operations to interact
31
            with the HARVEST_DOCUMENT database table.
29
            Represents a single document to be harvested. Stores data about
30
            the document (documentURL, documentType, identifier, revision,
31
            scope). Provides operations to get the document from the site 
32
            and put (insert or update) the document to Metacat. Queries
33
            Metacat to determine whether Metacat already has the document
34
            and to determine the highest revision number of the document
35
            stored in Metacat.
32 36

  
33 37
HarvestLog
34 38
            Represents a single harvest log entry. For a given Harvest
35 39
            operation, records the date, type of operation, message string,
36
            status, and docid (if this operation pertains to a HarvestDocument).
37
            Provides operations to interact with the HARVEST_LOG database table.
40
            status, and detailLogID (if this operation generated an error
41
            that involved a harvest document; see HarvestDetailLog below).
42
            Interacts with the HARVEST_LOG database table, and retrieves 
43
            information about operations from the HARVEST_OPERATION table.
38 44

  
45
HarvestDetailLog
46
            Stores detailed information about a harvest operation on a document
47
            when the operation results in an error (e.g. the document could not 
48
            be retrieved from the site, or could not be inserted or updated into 
49
            Metacat). Stores a unique identifier (detailLogID), the identifier 
50
            of its associated HarvestLog object (harvestLogID), the error message, 
51
            and a HarvestDocument object. Interacts with the HARVEST_DETAIL_LOG 
52
            table.
53

  
39 54
            
40
Sequence of Events for Pull Operation
55
Sequence of Events for Harvester Pull Operation
41 56

  
42 57
1. Harvester starts up. (Add log entry to record startup operation.)
43 58

  
44 59
2. Harvester reads the Harvest registry from the database, creating a
45
   HarvestInformation object for every record in the HARVEST_INFORMATION table.
60
   HarvestSiteSchedule object for every record in the
61
   HARVEST_SITE_SCHEDULE table.
46 62

  
47
3. For each HarvestInformation object:
63
3. For each HarvestSiteSchedule object:
48 64

  
49 65
    4. If this site is due to be harvested today:
50 66

  
51
        5. Get the harvest configuration from the harvestConfigurationURL.
67
        5. Get the harvest configuration from the documentListURL.
52 68

  
53
        6. Parse the harvest configuration file, creating a HarvestDocument 
69
        6. Parse the document list file, creating a HarvestDocument 
54 70
           object for each document element in the file. (If file is not a 
55
           valid harvest configuration file, log an exception, send email 
56
           report, and exit.)
71
           valid document list, log an exception, send email report, and exit.)
57 72

  
58
        7. For each document element in the harvest configuration file:
73
        7. For each HarvestDocument object:
59 74

  
60
            8. Get the document from the site, using the document URL.
75
            8. Query Metacat as to whether it already has this document.
61 76

  
62
            9. Parse the document to check that it is valid EML. (If not valid, 
63
               log an exception and continue.)
77
            9. If Metacat does not already have the document:
64 78

  
65
            10. Query Metacat as to whether it already has this document
79
                10. Get the document from the site, using the document URL.
66 80

  
67
            11. If Metacat does not already have the document:
81
                11. Parse the document to check that it is valid EML. (If not valid, 
82
                    log an exception and continue.)
68 83

  
69 84
                12. Insert or update the document to Metacat.
70 85

  
71 86
                13. Log the result of inserting or updating the document to
72 87
                    Metacat (including any exceptions).
73 88

  
74
        14. Generate and send an email report to the site contact. The report 
89
            14. Else, Metacat already has the document. Determine the
90
                highest revision of the document in Metacat and provide
91
                this information in the report to the site contact (15).
92

  
93
        15. Generate and send an email report to the site contact. The report 
75 94
            will contain results of this harvest, and the (tentatively)
76 95
            scheduled date of the next harvest.
77 96

  
78
15. Generate and send an email report to the harvester administrator. The report
97
16. Generate and send an email report to the harvester administrator. The report
79 98
    is a composite of all the individual site reports. (Alternatively, this 
80 99
    could be written to a log file instead of emailed. An email message could 
81 100
    simply contain the location of the log file and a brief summary.)
82 101

  
83
16. Harvester shuts down. (Add log entry for shutdown.)
102
17. Harvester shuts down. (Add log entry for shutdown.)
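
The per-site portion of the revised sequence (steps 4 and 7 through 14) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the Harvester implementation: `InMemoryMetacat` is a hypothetical stand-in for Metacat, `fetch` and `is_valid_eml` are caller-supplied placeholders for document retrieval and EML validation, and "already has the document" is interpreted here as "this exact revision is stored". Fetching and parsing the document list (steps 5 and 6) and sending email (step 15) are omitted.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class HarvestDocument:
    document_url: str
    scope: str
    identifier: int
    revision: int

    @property
    def docid(self) -> str:
        return f"{self.scope}.{self.identifier}"


class InMemoryMetacat:
    """Hypothetical stand-in that tracks stored revisions per docid."""

    def __init__(self) -> None:
        self._revisions: Dict[str, Set[int]] = {}

    def has_document(self, doc: HarvestDocument) -> bool:
        return doc.revision in self._revisions.get(doc.docid, set())

    def highest_revision(self, doc: HarvestDocument) -> int:
        return max(self._revisions[doc.docid])

    def put(self, doc: HarvestDocument, content: str) -> None:
        self._revisions.setdefault(doc.docid, set()).add(doc.revision)


def harvest_site(documents: List[HarvestDocument], next_harvest: date,
                 today: date, metacat: InMemoryMetacat,
                 fetch: Callable[[str], str],
                 is_valid_eml: Callable[[str], bool]) -> List[str]:
    """Return report lines for one site; empty if the site is not due."""
    report: List[str] = []
    if next_harvest > today:                      # step 4: not due yet
        return report
    for doc in documents:                         # step 7
        if metacat.has_document(doc):             # steps 8 and 14
            highest = metacat.highest_revision(doc)
            report.append(f"{doc.docid}: already in Metacat, "
                          f"highest revision {highest}")
            continue
        content = fetch(doc.document_url)         # step 10
        if not is_valid_eml(content):             # step 11: log and continue
            report.append(f"{doc.docid}.{doc.revision}: invalid EML, skipped")
            continue
        metacat.put(doc, content)                 # step 12
        report.append(f"{doc.docid}.{doc.revision}: inserted or updated")  # step 13
    return report
```

Querying Metacat before fetching, as the revised ordering does, means a document that Metacat already holds is never downloaded or validated again; the sketch reflects this by returning to the loop before `fetch` is called.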
84 103
                             
85 104

  