<h1>Metacat: Issues (Ecoinformatics Redmine)</h1>
<p><a class="external" href="https://projects.ecoinformatics.org/ecoinfo/">https://projects.ecoinformatics.org/ecoinfo/</a> · Updated 2015-07-02</p>

<h3>Support #6793 (In Progress): Update DOIs from KNB to redirect to view service</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/6793">https://projects.ecoinformatics.org/ecoinfo/issues/6793</a> · 2015-07-02 · Matt Jones (jones@nceas.ucsb.edu)</p>
<p>In <a class="issue tracker-2 status-3 priority-2 priority-default closed" title="Feature: Create a new service to update DOI pointers (Resolved)" href="https://projects.ecoinformatics.org/ecoinfo/issues/6530">#6530</a> and <a class="issue tracker-5 status-5 priority-2 priority-default closed" title="Story: Keep DOI registrations current and resolvable (Closed)" href="https://projects.ecoinformatics.org/ecoinfo/issues/6440">#6440</a>, we added features to update DOI registrations, but many originally assigned DOIs still redirect to the raw EML document rather than to our landing page for the data set. We need to fix all of the /AA/ DOI registrations in the KNB and ensure they point to the right View service page. For a metadata DOI, that is the associated /view URL for that DOI; for data files and resource maps, it is the /view URL for the associated metadata. For example:</p>
<ul>
<li>Metadata doi:10.5063/AA/nceas.227.15 should redirect to <a class="external" href="https://knb.ecoinformatics.org/#view/doi:10.5063/AA/nceas.227.15">https://knb.ecoinformatics.org/#view/doi:10.5063/AA/nceas.227.15</a></li>
<li>Data doi:10.5063/AA/wtyburczy.30.1 should redirect to the same metadata file <a class="external" href="https://knb.ecoinformatics.org/#view/doi:10.5063/AA/nceas.227.15">https://knb.ecoinformatics.org/#view/doi:10.5063/AA/nceas.227.15</a></li>
</ul>
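The PID-to-landing-page mapping itself is mechanical; a minimal sketch (`viewUrlFor` is an illustrative helper, not an existing Metacat API -- the caller must already know the metadata PID associated with a data or resource-map PID):

```java
// Illustrative helper (not an existing Metacat API): build the view-service
// landing page URL that a DOI registration should redirect to.
class DoiRedirect {
    // For metadata PIDs, pass the PID itself; for data or resource-map PIDs,
    // pass the PID of the associated metadata document.
    static String viewUrlFor(String repositoryBase, String metadataPid) {
        return repositoryBase + "/#view/" + metadataPid;
    }
}
```

So both example DOIs above would be registered with the target `viewUrlFor("https://knb.ecoinformatics.org", "doi:10.5063/AA/nceas.227.15")`.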
<p>Also, when a user updates metadata for a package (but doesn't change the data), the DOI redirect for the data will need to be updated to point to the new metadata. Let's verify that this is happening automatically in Metacat.</p>

<h3>Task #5994 (New): Create REST API for accessing statistics</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/5994">https://projects.ecoinformatics.org/ecoinfo/issues/5994</a> · 2013-05-25 · Matt Jones (jones@nceas.ucsb.edu)</p>
<p>For objects, users, packages, nodes, etc.</p>

<h3>Task #5993 (New): Summarize and index statistics for fast access</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/5993">https://projects.ecoinformatics.org/ecoinfo/issues/5993</a> · 2013-05-25 · Matt Jones (jones@nceas.ucsb.edu)</p>

<h3>Task #5992 (New): Track citations</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/5992">https://projects.ecoinformatics.org/ecoinfo/issues/5992</a> · 2013-05-25 · Matt Jones (jones@nceas.ucsb.edu)</p>
<p>Or interface with Impact Story</p>

<h3>Task #5991 (New): Track views</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/5991">https://projects.ecoinformatics.org/ecoinfo/issues/5991</a> · 2013-05-25 · Matt Jones (jones@nceas.ucsb.edu)</p>

<h3>Task #5990 (New): Track downloads</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/5990">https://projects.ecoinformatics.org/ecoinfo/issues/5990</a> · 2013-05-25 · Matt Jones (jones@nceas.ucsb.edu)</p>

<h3>Feature #5989 (In Progress): Track data download, view and citation statistics</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/5989">https://projects.ecoinformatics.org/ecoinfo/issues/5989</a> · 2013-05-25 · Matt Jones (jones@nceas.ucsb.edu)</p>
<p>Currently the only usage stats we have in Metacat are the raw logs. This new service would provide several statistical reports in machine-readable form, intended for efficient use by clients building user interface displays that show those statistics.</p>
<p>The service should include the following response statistics, and be extensible to add other tracked statistics as needed:</p>
<ol>
<li>Number of views (defined as number of times the metadata has been viewed on the web)</li>
<li>Number of package downloads (needs definition)</li>
<li>Size in bytes of package downloads</li>
<li>Number of citations (implement in a second phase)</li>
</ol>
<p>For each of these statistics, calling apps should be able to constrain the results to only include records matching:</p>
<ol>
<li>a PID or list of PIDs</li>
<li>creator or list of creators (DN, or ORCID, or some amalgam -- to be discussed)</li>
<li>a time range of access event (upload, download, etc.)</li>
<li>? spatial location of access event (upload, download, etc.)</li>
<li>? IP Address</li>
<li>accessor or list of accessors (DN, or ORCID, or some amalgam, needs ACL -- to be discussed)</li>
</ol>
<p>For each of these statistics, calling apps should be able to request the statistic aggregated by several specific facets, including the following (in order of importance):</p>
<ol>
<li>User (DN, or ORCID, or some amalgam -- to be discussed)</li>
<li>Time range, aggregated to requested unit (day, week, month, year)</li>
<li>? Spatial range, aggregated to requested unit (to be discussed)</li>
</ol>
<p>Intersections of these aggregated facets should also be possible, but are a lower priority than the facets alone. For example, when finished, one should be able to request the following reports, among others:</p>
<ol>
<li>{Views,Downloads,Bytes,Citations} for a given pid or list of pids</li>
<li>{Views,Downloads,Bytes,Citations} by user (aggregates across pids)</li>
<li>{Views,Downloads,Bytes,Citations} by month (aggregates across pids)</li>
<li>{Views,Downloads,Bytes,Citations} by spatial location (aggregates across pids)</li>
<li>{Views,Downloads,Bytes,Citations} for a given pid by month for a specific time range</li>
<li>{Views,Downloads,Bytes,Citations} by user by month</li>
<li>etc.</li>
</ol>
<p>The download format (JSON? XML?) should allow for an extended set of response variables and an extensible set of aggregating facets. This needs discussion, but XML is the likely choice, as that is DataONE's initial choice for all other services. Consider supporting both if useful.</p>
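To make the format discussion concrete, a purely hypothetical request and XML response for a per-month download report might look like the following; the endpoint name, parameter names, element names, and counts are all invented for illustration, not an existing API:

```
<!-- Hypothetical request:
     GET /stats?statistic=downloads
         &pid=doi:10.5063/AA/nceas.227.15&aggregate=month -->
<statistics statistic="downloads" pid="doi:10.5063/AA/nceas.227.15">
  <period unit="month" start="2013-01-01" count="42"/>
  <period unit="month" start="2013-02-01" count="17"/>
</statistics>
```

An extensible format like this leaves room for additional statistics and facets as attributes or sibling elements without breaking existing clients.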
<p>The REST API for this service should be developed in the DataONE space, with the intention that it be implementable by other MNs and CNs in DataONE.</p>

<h3>Bug #5522 (New): download linked KNB data and convert links in EML to ORE packages</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/5522">https://projects.ecoinformatics.org/ecoinfo/issues/5522</a> · 2011-10-28 · Matt Jones (jones@nceas.ucsb.edu)</p>
<p>The KNB data sets, and EML data in general, represent linkages to data as online/url links in EML documents. When we convert the KNB to a DataONE Member Node, we need a mechanism to convert these EML packages into DataONE ORE-based data packages. Depending on the specific situation, different steps will need to be taken:</p>
<ol>
<li>For packages that arrive via the DataONE services, do nothing.</li>
<li>For packages that arrive via the Metacat and EcoGrid services, check all online/url links:
<ul>
<li>if it is an ecogrid:// link, create the corresponding link in an ORE document</li>
<li>if it is a URL marked as "information" in EML, ignore it</li>
<li>if it is a URL marked as "download" in EML, attempt to download the data, and if successful:
<ul>
<li>check if it is real data (hard to do, but filtering out obvious HTML errors, login pages, HTML pages, etc. would be tractable)</li>
<li>insert it into the MN using the permissions and policies specified in the EML document (need to determine what the ID would be for this object -- maybe the original URL, but need to ensure uniqueness, length &lt; 800 chars, etc.)</li>
<li>add a link to the ORE document for this dataset</li>
</ul></li>
<li>insert the final ORE document that's been assembled (need to determine the identifier to use)</li>
</ul></li>
</ol>
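The per-link decision in step 2 could be sketched like this; the `LinkAction` enum and `classify` signature are invented for illustration (EML's online/url element carries the `function` attribute checked here, with "information" and "download" as its values):

```java
// Illustrative sketch of step 2: decide what to do with each EML online/url link.
class EmlLinkClassifier {
    enum LinkAction { ADD_ORE_LINK, IGNORE, DOWNLOAD_AND_INSERT }

    // 'url' is the online/url value; 'function' is the EML function attribute
    // ("information" or "download"; EML treats a missing value as "download").
    static LinkAction classify(String url, String function) {
        if (url.startsWith("ecogrid://")) {
            return LinkAction.ADD_ORE_LINK;        // step 2a
        }
        if ("information".equals(function)) {
            return LinkAction.IGNORE;              // step 2b
        }
        return LinkAction.DOWNLOAD_AND_INSERT;     // step 2c
    }
}
```

The download-and-insert branch is where the "is it real data" filtering and identifier assignment from the steps above would happen.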
<p>This utility method should be callable in two ways:</p>
<ol>
<li>For an existing EML document already in Metacat -- likely to be run on initial conversion, and periodically, to be sure all proper data packages are created. Need to be sure that this doesn't create duplicate packages.</li>
<li>On any INSERT or UPDATE calls:
<ul>
<li>when EML is updated, rebuild the package</li>
<li>when data objects are updated, rebuild the package</li>
<li>watch out for sequential ops interfering (e.g., when Morpho updates a data file, then updates an EML file to point at the new data file in a second step, we should only create one new ORE package version)</li>
<li>on update calls, be sure to set appropriate obsoletes/obsoletedBy properties on the ORE package (the update() calls themselves should already handle these properties in the sysmeta for EML and data objects)</li>
</ul></li>
</ol>

<h3>Bug #4551 (In Progress): performance enhancement through index reduction</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/4551">https://projects.ecoinformatics.org/ecoinfo/issues/4551</a> · 2009-11-14 · Matt Jones (jones@nceas.ucsb.edu)</p>
<p>PISCO configures their metacat server to index only absolute paths in their metadata documents (rather than relative paths), which decreases the size of their index tables by about 3/4 and results in significant query speed improvements.</p>
<p>This also requires refactoring clients (skins, Morpho, Kepler, ...) to pose queries only with absolute paths, as queries using relative paths would not take advantage of the index.</p>
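For illustration, a client query constrained to an absolute path (and so able to use the reduced index) might look roughly like the following; this is a from-memory sketch of Metacat's pathquery format, so verify element names and the exact absolute-path spelling against the pathquery DTD before relying on it:

```
<pathquery version="1.2">
  <querygroup operator="INTERSECT">
    <queryterm searchmode="contains" casesensitive="false">
      <value>salmon</value>
      <!-- absolute path; a bare relative path like "title" would miss the index -->
      <pathexpr>/eml/dataset/title</pathexpr>
    </queryterm>
  </querygroup>
</pathquery>
```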
<p>Investigate whether there are advantages to this approach for the KNB Metacat installations -- I suspect there would be, as the xml_index table would shrink massively. This may be important as we grow the size of our collection through DataONE.</p>

<h3>Bug #3835 (In Progress): design and implement OAI-PMH compliant harvest subsystem</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/3835">https://projects.ecoinformatics.org/ecoinfo/issues/3835</a> · 2009-02-24 · Matt Jones (jones@nceas.ucsb.edu)</p>
<p>Metacat's current harvest mechanism works well but is a proprietary system. The Dryad project has proposed to implement an OAI-PMH compliant harvest subsystem for Metacat in order to allow Metacat to interact more effectively with other systems that implement this protocol. This is a tracking bug for the design and implementation of this feature. Other more detailed bugs will be filed for specific tasks. It would be useful if the final system allowed Metacat to act as both an OAI-PMH Data Provider and as an OAI-PMH Service Provider, allowing us to both serve and harvest documents from OAI-PMH servers.</p>
<p>Some issues to consider and discuss:<br />1) lack of record authorization mechanisms in OAI-PMH. Metacat currently allows harvest with access controls on harvested records. Reverting to a purely OAI-PMH system would eliminate this capability that is used by many of our harvest clients (especially for data, but somewhat for metadata as well). So the design needs to consider a hybrid that allows both public records to be exposed through OAI-PMH and restricted records to be exposed through a protocol like Metacat's that supports access control. What is our design goal here?</p>
<p>2) A corollary of (1) is how to determine who is allowed to update a given record. Does OAI-PMH assume providers always originate from a constant URL endpoint in order to avoid authenticating data providers? That assumption is probably not reasonable even over short periods of time (a few years): sites change domain names, and the harvester needs to be able to adjust to these changes, update endpoints, and still handle record replacement. Maybe this is a non-issue if OAI-PMH allows provider endpoints to be updated.</p>
<p>3) Date-based change detection in OAI-PMH versus GUID-based versioning in metacat. How should these be reconciled? If a PMH harvest occurs every ten days, but a metadata document is revised three times in that interval, does OAI-PMH only get the most recent version? How are the other versions archived and made accessible over time?</p>
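As an illustration of the datestamp issue, an incremental OAI-PMH harvest is driven purely by a date range (the harvest endpoint below is hypothetical; the verb and parameters are standard OAI-PMH):

```
https://knb.ecoinformatics.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2009-02-01&until=2009-02-10
```

Because each OAI-PMH record carries a single datestamp of last modification, a request like this returns only the current version of each changed record; revisions made and superseded between harvests are never exposed, which is exactly the reconciliation problem with Metacat's GUID-based versioning.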
<p>4) Data objects. The Metacat harvester allows one to transfer objects of any type, which is used to harvest both metadata objects of various formats (e.g., EML and FGDC) as well as the associated data objects. Each of these objects has its own unique identifier. How would this be handled under OAI-PMH?</p>
<p>A nice background set of slides is here:<br /><a class="external" href="http://www.oaforum.org/otherfiles/berl_oai-tutorial_e.ppt">http://www.oaforum.org/otherfiles/berl_oai-tutorial_e.ppt</a></p>

<h3>Bug #3142 (New): metacat client uses in-memory buffer for posting data</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/3142">https://projects.ecoinformatics.org/ecoinfo/issues/3142</a> · 2008-02-08 · Matt Jones (jones@nceas.ucsb.edu)</p>
<p>The size of XML files (and probably data files) that can be sent to metacat is memory limited in client applications because the MetacatClient implementation assumes the payload can be loaded into a memory buffer before it is sent. This is done to calculate the size of the payload before POSTing it. We need new insert(), update(), and upload() methods that take a size parameter so that the Reader or InputStream can be streamed directly over the http connection instead of being accumulated in an in-memory buffer.</p>
<p>We have code that does this in Morpho already using Apache's httpclient library, but this should make its way into MetacatClient. With JDK 1.5.x and later, Sun's http protocol handler supports streaming POSTs, but you have to set up a separate HttpURLConnection with a new protocol handler and call setFixedLengthStreamingMode(). See:<br /> <a class="external" href="http://java.sun.com/j2se/1.5.0/docs/api/java/net/HttpURLConnection.html#setFixedLengthStreamingMode(int)">http://java.sun.com/j2se/1.5.0/docs/api/java/net/HttpURLConnection.html#setFixedLengthStreamingMode(int)</a></p>
<p>This would be an alternative to using httpclient, but probably still requires registering a newly configured protocol handler.</p>
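A minimal sketch of the streaming approach using the JDK's own HttpURLConnection (the `post` helper and its error handling are illustrative; `setFixedLengthStreamingMode(int)` is the JDK 1.5 call mentioned above, though the try-with-resources syntax below is newer than that):

```java
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative sketch: stream a POST body of known size over the connection
// instead of accumulating it in an in-memory buffer first.
class StreamingPost {
    // Copy in fixed-size chunks; memory use is bounded by the buffer size.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    static void post(URL endpoint, File payload) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // Declaring the exact size up front lets the JDK stream the body
        // rather than buffer it (int overload on JDK 1.5; JDK 7 adds long).
        conn.setFixedLengthStreamingMode((int) payload.length());
        try (InputStream in = new FileInputStream(payload);
             OutputStream out = conn.getOutputStream()) {
            copy(in, out);
        }
        if (conn.getResponseCode() >= 400) {
            throw new IOException("POST failed: " + conn.getResponseCode());
        }
    }
}
```

This mirrors the proposed insert()/update()/upload() variants that take a size parameter: the caller supplies the length, and the payload is copied straight from the Reader or InputStream to the socket.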
<p>We also may have trouble with Metacat itself, because it also reads data into a String, as described in bug #1122.</p>

<h3>Bug #3104 (New): provide accessor for organization lists for clients to build login widgets</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/3104">https://projects.ecoinformatics.org/ecoinfo/issues/3104</a> · 2008-01-25 · Matt Jones (jones@nceas.ucsb.edu)</p>
<p>Metacat logins combine a unique username with an organization name to build a distinguished name (DN) for login. Metacat clients generally present a dropdown list for users to choose their organization, and then generate the DN to be sent to Metacat. These client-side lists of organizations have propagated across all of our Metacat client skins, Morpho, the ldapweb utilities, and other places, so it is very hard to keep them updated.</p>
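For context, a sketch of the two pieces involved: the DN assembly clients perform today, and the LDAP query a server-side accessor could run instead of relying on hard-coded lists. The base DN, directory suffix, and attribute names below are assumptions about the KNB LDAP schema, not documented Metacat behavior:

```java
import java.util.ArrayList;
import java.util.List;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

// Illustrative sketch; the DN template and LDAP schema are assumptions.
class OrgList {
    // How clients currently assemble a login DN from username + organization.
    static String buildDn(String uid, String org) {
        return "uid=" + uid + ",o=" + org + ",dc=ecoinformatics,dc=org";
    }

    // Server side of a hypothetical getOrganizationList: query LDAP for all
    // organization entries under the base DN and return their 'o' values.
    static List<String> listOrganizations(DirContext ctx, String baseDn)
            throws NamingException {
        SearchControls sc = new SearchControls();
        sc.setSearchScope(SearchControls.ONELEVEL_SCOPE);
        List<String> orgs = new ArrayList<String>();
        NamingEnumeration<SearchResult> results =
                ctx.search(baseDn, "(objectClass=organization)", sc);
        while (results.hasMore()) {
            orgs.add((String) results.next().getAttributes().get("o").get());
        }
        return orgs;
    }
}
```

With an accessor like this on the server, clients would fetch the list at login time and only the DN assembly would remain client-side.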
<p>We need an access method in Metacat (getOrganizationList) that returns a list of organizations to clients that want to build the dropdown list dynamically. Metacat should get the list by querying the LDAP server for the list of organizations, so that new organizations added to LDAP automatically become available to all clients.</p>

<h3>Bug #1542 (New): SQL Server support broken</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/1542">https://projects.ecoinformatics.org/ecoinfo/issues/1542</a> · 2004-04-30 · Matt Jones (jones@nceas.ucsb.edu)</p>
<p>Support for the MS SQL Server database was maintained in versions of Metacat prior to 1.3. Now the xmltables-sqlserver.sql and the associated upgrade*-sqlserver.sql scripts are either not up to date or are missing entirely. Need to port the database changes to SQL Server and test all functions, including upgrades from 1.3 to 1.4, before releasing 1.4.</p>

<h3>Bug #1452 (In Progress): dtd filenames clash if reused for multiple PUBLIC identifiers</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/1452">https://projects.ecoinformatics.org/ecoinfo/issues/1452</a> · 2004-04-05 · Matt Jones (jones@nceas.ucsb.edu)</p>
<p>Problem reported by Rod Spears:</p>
<p>Ok, this is partially intended behavior. Metacat takes the following attitude towards establishing the relationship between a PUBLIC identifier/namespace and an associated DTD or schema:</p>
<pre><code>1) When a document is submitted, check its PUBLIC id/namespace
   a) if it is not registered, then try to retrieve the DTD from
      either the passed-in parameters, or from the provided
      SYSTEM identifier, or from an xsi:schemaLocation. If the schema
      is obtained, cache it and record its location and the public
      identifier. Fail with an error if the schema can't be obtained.
   b) if we already have it registered, look up the cached version of
      the schema and use it for validation, ignoring any data the
      user passes in.</code></pre>
<p>This means that the first submitted document with a given type determines the DTD/schema used for validation for all subsequent documents submitted as that type. This allows an administrator to pre-register several document types that are important to him and be sure that any submitted documents are valid with respect to the schema he provided. Metacat ships with several pre-registered schemas and DTDs for EML.</p>
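Because the cache stores each DTD under the filename taken from its SYSTEM identifier, two different PUBLIC identifiers that share a filename can collide (the bug described next). A graceful fix is to probe for an unused variant of the name; a minimal sketch (`uniqueName` and its suffix scheme are invented here, not Metacat code):

```java
import java.util.Set;

// Illustrative sketch of graceful DTD renaming: if the requested filename is
// already taken in the cache, append a numeric suffix before the extension.
class DtdCache {
    static String uniqueName(Set<String> existing, String requested) {
        if (!existing.contains(requested)) {
            return requested;
        }
        int dot = requested.lastIndexOf('.');
        String stem = (dot < 0) ? requested : requested.substring(0, dot);
        String ext = (dot < 0) ? "" : requested.substring(dot);
        for (int i = 1; ; i++) {
            String candidate = stem + "_" + i + ext;
            if (!existing.contains(candidate)) {
                return candidate;
            }
        }
    }
}
```

The renamed file would then be recorded against the new PUBLIC identifier, so both identifiers keep independent cached copies.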
<p>So, your issue is this: the first time you registered the DTD, it uploaded the ecogridregistry.205.22.dtd file to Metacat's DTD cache. Later, when you tried to upload a new document using a different PUBLIC ID but the same SYSTEM ID, it tried to save the file ecogridregistry.205.22.dtd but found that it already existed in the DTD cache, so it couldn't. This is a bug. There's no reason we should use the identical filename that is passed in to us for the DTD filename, so we should gracefully rename the DTD file when a name is already in use. This hasn't cropped up before because we haven't had people using the same DTD for different PUBLIC identifiers. You can work around it by simply renaming your DTD (to anything other than its current name) and then resubmitting. I'll file this as yet another bug -- yikes.</p>

<h3>Bug #213 (New): transaction support for packages</h3>
<p><a href="https://projects.ecoinformatics.org/ecoinfo/issues/213">https://projects.ecoinformatics.org/ecoinfo/issues/213</a> · 2001-04-09 · Matt Jones (jones@nceas.ucsb.edu)</p>
<p>Need to build in transaction support for packages. A client should be able to insert (or update) a bunch of components of a package and be sure that they all succeed or all fail. This is especially important if we allow submissions as "jar" files or otherwise. Still need to be able to insert individual components, though.</p>
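The all-or-nothing behavior can be sketched as insert-with-rollback; here a plain Map stands in for Metacat's storage layer, and the whole shape is illustrative rather than a proposed API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative all-or-nothing package insert: if any component fails,
// undo the components inserted so far, then rethrow the failure.
class PackageTx {
    static void insertAll(Map<String, String> store, Map<String, String> components) {
        List<String> inserted = new ArrayList<String>();
        try {
            for (Map.Entry<String, String> e : components.entrySet()) {
                if (store.containsKey(e.getKey())) {
                    throw new IllegalStateException("duplicate id: " + e.getKey());
                }
                store.put(e.getKey(), e.getValue());
                inserted.add(e.getKey());
            }
        } catch (RuntimeException failure) {
            for (String id : inserted) {   // roll back partial work
                store.remove(id);
            }
            throw failure;
        }
    }
}
```

A real implementation would delegate to database transactions rather than compensating deletes, but the observable contract is the same: after a failed package submission, none of its components are visible.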