Added a new class, Sitemap, that generates a series of XML documents listing the URLs of Metacat documents in the format defined by the Sitemap Protocol. The protocol is described at Google:
https://www.google.com/webmasters/tools/docs/en/protocol.html
The Sitemap class extends TimerTask so that it can be scheduled to run periodically. New configuration options were added to metacat.properties to control where the sitemaps are written and how often they are updated. By default they are regenerated once a day, since more frequent updates are overkill for search engines.
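As a rough illustration of the scheduling approach, here is a minimal sketch that uses java.util.Timer to run a TimerTask once a day, guarded so that it is only scheduled on the first call. The class names SitemapTaskSketch and SitemapSchedulerSketch are hypothetical stand-ins, not the actual Metacat classes, and the interval is hard-coded here rather than read from metacat.properties.

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical stand-in for the Sitemap task. The real class writes sitemap
// files; this one just runs whatever work it is given.
class SitemapTaskSketch extends TimerTask {
    private final Runnable work;

    SitemapTaskSketch(Runnable work) {
        this.work = work;
    }

    @Override
    public void run() {
        work.run();
    }
}

// Hypothetical helper showing a "schedule on first call only" guard,
// similar in spirit to scheduling from a servlet's first request.
class SitemapSchedulerSketch {
    // One day in milliseconds (assumed default interval).
    static final long ONE_DAY_MS = 24L * 60 * 60 * 1000;

    private Timer timer;

    // Schedules the task immediately and then every intervalMs.
    // Returns true only if this call did the scheduling.
    synchronized boolean scheduleOnce(TimerTask task, long intervalMs) {
        if (timer != null) {
            return false; // already scheduled by an earlier call
        }
        timer = new Timer("sitemap-timer", true); // daemon thread
        timer.schedule(task, 0L, intervalMs);
        return true;
    }

    // Cancels the timer, e.g. when the web application shuts down.
    synchronized void shutdown() {
        if (timer != null) {
            timer.cancel();
            timer = null;
        }
    }
}
```

A caller would create one SitemapSchedulerSketch, call scheduleOnce(task, SitemapSchedulerSketch.ONE_DAY_MS) from its request-handling path, and call shutdown() on teardown. A daemon timer thread is used in this sketch so that it does not keep the JVM alive.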
Included a JUnit test for the Sitemap generation functionality.
Included changes to MetaCatServlet to schedule the Sitemap task the first time Metacat is called.
This implements the needed functionality for task (2), but needs to be tested more.
An example of a generated sitemap file for a single document looks like this:
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
<url><loc>http://knb.ecoinformatics.org/knb/metacat/test.1.1/knb</loc></url>
</urlset>
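For illustration, output of this shape could be produced by something like the sketch below. It is not the actual Sitemap implementation: the class SitemapXmlSketch, its method names, and the URL pattern (base URL, then docid, then skin name) are assumptions inferred from the example above. The namespace is the one shown in the example.

```java
import java.util.List;

// Hypothetical sketch of building a sitemap <urlset> document as a string.
class SitemapXmlSketch {
    // Namespace taken from the example sitemap file.
    static final String NS = "http://www.google.com/schemas/sitemap/0.84";

    // Escapes the five XML special characters so URLs are safe inside <loc>.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Builds one <url><loc>...</loc></url> entry per docid.
    // Assumed URL pattern: baseUrl/docid/skin.
    static String buildUrlset(String baseUrl, String skin, List<String> docids) {
        StringBuilder sb = new StringBuilder();
        sb.append("<urlset xmlns=\"").append(NS).append("\">\n");
        for (String docid : docids) {
            String loc = baseUrl + "/" + docid + "/" + skin;
            sb.append("<url><loc>").append(escapeXml(loc)).append("</loc></url>\n");
        }
        sb.append("</urlset>\n");
        return sb.toString();
    }
}
```

Calling buildUrlset("http://knb.ecoinformatics.org/knb/metacat", "knb", docids) with a list containing only "test.1.1" yields the same three lines as the example file. Escaping matters in practice because URLs containing characters such as & would otherwise produce invalid XML.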
To finally finish the site indexing improvements, we need to:
(3) Improve the XSLT styles to set proper titles and META keywords for EML documents
(4) Test against a large document store
(5) Fix the Sitemap query
    a) to include document types other than eml
    b) to only include publicly accessible documents
(6) Register the sitemap files with Google once this system is installed on KNB