Bug #2826
closed
add ability for search engines to index metacat documents
0%
Description
Need to enable search engines to index metacat contents. This involves (1) making download URLs for each XML document that do not use the dynamic query syntax, and (2) providing a site map of the contents of the metacat server.
Task (1) is complete, and supports urls of this syntax:
http://localhost:8180/knb/metacat/test.1.1/knb
If a URL like this is processed, any subsequent 'action' parameters are ignored and are not processed.
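As a rough illustration of the path-style URL handling described above, here is a minimal sketch. The class DocumentPath and its parse method are hypothetical and are not Metacat's actual code. It also assumes that the segment after the document id ("knb" in the example) names the display style; that reading is an assumption. The query string is never consulted, so an 'action' parameter has no effect.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of Metacat.
final class DocumentPath {
    final String docid;
    final String skin; // assumed meaning of the trailing segment; null if absent

    private DocumentPath(String docid, String skin) {
        this.docid = docid;
        this.skin = skin;
    }

    // Parses URLs of the form <servletPrefix>/<docid>[/<skin>].
    // Returns null when the URL does not match that shape.
    static DocumentPath parse(String url, String servletPrefix) {
        // getPath() excludes the query string, so parameters such as
        // 'action' are ignored by construction.
        String path = URI.create(url).getPath();
        if (path == null || !path.startsWith(servletPrefix + "/")) {
            return null;
        }
        String[] parts = path.substring(servletPrefix.length() + 1).split("/");
        if (parts.length == 0 || parts[0].isEmpty()) {
            return null;
        }
        String skin = parts.length > 1 ? parts[1] : null;
        return new DocumentPath(parts[0], skin);
    }

    public static void main(String[] args) {
        DocumentPath p = parse(
            "http://localhost:8180/knb/metacat/test.1.1/knb?action=delete",
            "/knb/metacat");
        System.out.println(p.docid + " " + p.skin);
    }
}
```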
Task (2) is yet to be finished.
Updated by Matt Jones over 17 years ago
Added a new class called Sitemap that is used to generate a series of XML documents representing the URLs of metacat documents following the sitemap protocol. The Sitemap Protocol is described at Google:
https://www.google.com/webmasters/tools/docs/en/protocol.html
The Sitemap class extends TimerTask so that it can be scheduled to run once a day or so. New configuration options were added to metacat.properties to control where the sitemaps are written and how often they are updated. By default we do it once a day, since updating more often is overkill for search engines.
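For readers unfamiliar with TimerTask, a minimal sketch of the scheduling pattern follows. SitemapTaskSketch and its schedule helper are hypothetical stand-ins, not the actual Sitemap class. The one-day interval is hard-coded here, whereas the real value would presumably be read from metacat.properties.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the Sitemap task; for illustration only.
class SitemapTaskSketch extends TimerTask {
    static final long ONE_DAY_MS = 24L * 60 * 60 * 1000;

    // Counts invocations so the sketch has something observable.
    final AtomicInteger runs = new AtomicInteger();

    @Override
    public void run() {
        // A real task would regenerate the sitemap files here.
        runs.incrementAndGet();
    }

    // Schedules the task on a daemon timer thread so that it does not
    // keep the JVM alive on shutdown.
    static Timer schedule(TimerTask task, long firstDelayMs, long intervalMs) {
        Timer timer = new Timer("sitemap-timer", true);
        timer.schedule(task, firstDelayMs, intervalMs);
        return timer;
    }

    public static void main(String[] args) {
        // Run once immediately, then once a day.
        Timer timer = schedule(new SitemapTaskSketch(), 0L, ONE_DAY_MS);
        timer.cancel(); // cancelled right away so the sketch exits
    }
}
```

A servlet would typically create the timer once, for example on first request as described in this comment, and cancel it when the servlet is destroyed.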
Included a JUnit unit test to test the Sitemap generation functionality.
Included changes to MetaCatServlet to schedule the Sitemap task the first time Metacat is called.
This implements the needed functionality for task (2), but needs to be tested more.
An example of a generated sitemap file for a single document looks like this:
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
<url><loc>http://knb.ecoinformatics.org/knb/metacat/test.1.1/knb</loc></url>
</urlset>
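A file of that shape could be produced along these lines. This is a sketch only: SitemapWriterSketch, build, and escapeXml are hypothetical names, and the real Sitemap class may build its output quite differently. The namespace is copied from the example above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sitemap builder for illustration; not Metacat's Sitemap class.
final class SitemapWriterSketch {
    static final String NS = "http://www.google.com/schemas/sitemap/0.84";

    // Escapes the five XML special characters; '&' must be replaced first.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Emits one <url><loc> entry per document id, using the
    // <baseUrl>/<docid>/<skin> URL pattern from the ticket.
    static String build(String baseUrl, List<String> docids, String skin) {
        StringBuilder sb = new StringBuilder();
        sb.append("<urlset xmlns=\"").append(NS).append("\">\n");
        for (String docid : docids) {
            String loc = baseUrl + "/" + docid + "/" + skin;
            sb.append("<url><loc>").append(escapeXml(loc)).append("</loc></url>\n");
        }
        sb.append("</urlset>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(build("http://knb.ecoinformatics.org/knb/metacat",
                               Arrays.asList("test.1.1"), "knb"));
    }
}
```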
To finally finish the site indexing improvements, we need to:
(3) Improve the XSLT styles to set proper titles and META keywords for EML documents
(4) Test against a large document store
(5) Fix the Sitemap query
a) to include document types other than eml
b) to only include publicly accessible documents
(6) Register the sitemap files with Google once this system is installed on KNB
Updated by Matt Jones over 17 years ago
Finished task (3) by adding the EML dataset title to the HTML title in the document head. EML documents will now display better in search engines. Chose not to add META keyword tags: the consensus is that search engines no longer use them, because so many people have tried to manipulate keywords.
Tasks 4, 5, 6 from Comment #1 are still outstanding.
Updated by Michael Daigle about 16 years ago
Changed the SQL for the sitemap generation to only pull documents that are publicly readable.
Added some file inserts to the test case. Each insert has different permissions. Only the publicly readable document should be included in the sitemap.
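The intended behaviour can be sketched in memory as follows. Everything here is hypothetical: AccessRule, the "public" principal name, and the "read" permission string are illustrative assumptions, not Metacat's schema, and the actual change was made in SQL. The sketch also ignores deny rules and rule ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory version of the public-read filter; illustration only.
final class PublicFilterSketch {
    // Hypothetical access rule: which principal holds which permission on a document.
    static final class AccessRule {
        final String docid;
        final String principal;
        final String permission;

        AccessRule(String docid, String principal, String permission) {
            this.docid = docid;
            this.principal = principal;
            this.permission = permission;
        }
    }

    // Keeps only documents that have a rule granting "read" to "public".
    static List<String> publiclyReadable(List<String> docids, List<AccessRule> rules) {
        List<String> result = new ArrayList<>();
        for (String docid : docids) {
            for (AccessRule r : rules) {
                if (r.docid.equals(docid)
                        && "public".equals(r.principal)
                        && "read".equals(r.permission)) {
                    result.add(docid);
                    break; // one matching rule is enough
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<AccessRule> rules = Arrays.asList(
            new AccessRule("test.1.1", "public", "read"),
            new AccessRule("test.2.1", "someuser", "read"),
            new AccessRule("test.3.1", "public", "write"));
        System.out.println(publiclyReadable(
            Arrays.asList("test.1.1", "test.2.1", "test.3.1"), rules));
    }
}
```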
Created documentation in the metacat tour for sitemap creation and registration.