Bug #5522
download linked KNB data and convert links in EML to ORE packages
0%
Description
The KNB data sets, and EML data in general, represent linkages to data as online/url linkages in EML documents. When we convert the KNB to a DataONE Member Node, we need a mechanism to convert these EML packages into DataONE ORE-based data packages. Depending on the specific situation, different steps will need to be taken:
1) For packages that arrive via the DataONE services, do nothing
2) For packages that arrive via the Metacat and EcoGrid services, check all online/url links:
a) if it is an ecogrid:// link, then create the corresponding link in an ore document
b) if it is a URL marked as "information" in EML, ignore it
c) if it is a URL marked as "download" in EML, then:
i) attempt to download the data, and if successful
-- check if it is real data (hard to do in general, but filtering out obvious HTML errors, login pages, HTML pages, etc. would be tractable)
-- insert it into the MN using the permissions and policies specified in the EML document (need to determine what the ID would be for this object -- maybe the original URL, but need to ensure uniqueness and < 800 chars, etc)
-- add a link to the ORE document for this dataset
d) insert the final ORE document that's been assembled (need to determine the identifier to use)
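The link-handling rules in step 2 can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not Metacat code: the OnlineLink type and the function names classify_link, looks_like_data, and identifier_for are hypothetical, the HTML check is only a crude heuristic for the "obvious" cases, and using the original URL (truncated with a hash suffix when too long) as the identifier is just one of the options the description leaves open.

```python
import hashlib
from dataclasses import dataclass

MAX_ID_LENGTH = 800  # identifiers must stay under this length (from the description)


@dataclass
class OnlineLink:
    """Hypothetical stand-in for one EML online/url element."""
    url: str
    function: str = "download"  # EML marks a url as "download" or "information"


def classify_link(link: OnlineLink) -> str:
    """Return the action for one link: 'link', 'ignore', or 'download'."""
    if link.url.startswith("ecogrid://"):
        return "link"      # 2a: reference the existing object in the ORE document
    if link.function == "information":
        return "ignore"    # 2b: informational URLs are not data
    return "download"      # 2c: attempt to fetch the data


def looks_like_data(body: bytes, content_type: str = "") -> bool:
    """Reject empty bodies and obvious HTML (error or login pages). Heuristic only."""
    if not body:
        return False
    if "html" in content_type.lower():
        return False
    head = body[:512].lstrip().lower()
    return not (head.startswith(b"<!doctype html") or head.startswith(b"<html"))


def identifier_for(url: str, existing: set) -> str:
    """Derive an identifier from the URL that is unique and under MAX_ID_LENGTH."""
    if len(url) < MAX_ID_LENGTH:
        base = url
    else:
        # Assumption: truncate and append a short hash so distinct long URLs differ.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
        base = url[: MAX_ID_LENGTH - 40] + "-" + digest
    candidate, n = base, 1
    while candidate in existing:
        suffix = f".{n}"
        candidate = base[: MAX_ID_LENGTH - 1 - len(suffix)] + suffix
        n += 1
    return candidate
```

Only links classified as "download" would be fetched, and only bodies that pass looks_like_data would be inserted into the MN and added to the ORE document.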
This utility method should be callable in two ways:
1) For an existing EML document already in Metacat, likely to be run on initial conversion and then periodically to be sure all proper data packages are created
-- need to be sure that this doesn't create duplicate packages
2) On any INSERT or UPDATE calls
-- when EML is updated, need to rebuild the package
-- when data objects are updated, need to rebuild the package
-- but need to watch out for sequential ops not interfering (e.g., when Morpho updates a data file, then updates an EML file to point at the new data file in a second step, we should only create one new ORE package version)
-- on update calls, be sure to set appropriate obsoletes/obsoletedBy properties on the ORE package (the update() calls themselves should handle these properties for the sysmeta for EML and data objects already)
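One possible way to avoid duplicate packages and to handle the two-step Morpho update is to make the rebuild idempotent: recompute the member list from the links in the current EML, and only create a new ORE version when that list differs from the latest package. The sketch below illustrates the idea in Python. The PackageRegistry and PackageVersion names, the in-memory storage, and the "resourceMap_" identifier scheme are all assumptions for illustration, since the identifier to use is still undecided.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class PackageVersion:
    """Hypothetical record of one ORE package revision."""
    pid: str
    revision: int
    members: Tuple[str, ...]
    obsoletes: Optional[str] = None
    obsoleted_by: Optional[str] = None


class PackageRegistry:
    """In-memory stand-in for wherever ORE packages are stored."""

    def __init__(self) -> None:
        self.versions: Dict[str, PackageVersion] = {}
        self.latest: Dict[str, str] = {}  # package key -> pid of newest version

    def rebuild(self, key: str, members: Iterable[str]) -> PackageVersion:
        """Create a new package version only if the member list changed."""
        wanted = tuple(sorted(set(members)))
        previous_pid = self.latest.get(key)
        previous = self.versions[previous_pid] if previous_pid else None
        if previous is not None and previous.members == wanted:
            return previous  # unchanged: running again creates no duplicate

        revision = 1 if previous is None else previous.revision + 1
        new = PackageVersion(
            pid=f"resourceMap_{key}.{revision}",  # assumed identifier scheme
            revision=revision,
            members=wanted,
            obsoletes=previous.pid if previous else None,
        )
        if previous is not None:
            previous.obsoleted_by = new.pid  # keep the version chain linked both ways
        self.versions[new.pid] = new
        self.latest[key] = new.pid
        return new
```

Under this approach, a data file update that the EML does not yet point at leaves the member list unchanged, so no package is created until the follow-up EML update, which then yields exactly one new version. Whether that matches the intended behavior for standalone data updates would still need to be confirmed.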