Bug #2826: add ability for search engines to index metacat documents - Metacat - Ecoinformatics Redmine

Bug #2826

closed

add ability for search engines to index metacat documents

Added by Matt Jones almost 18 years ago. Updated over 16 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Michael Daigle

Category:

metacat

Target version:

1.9

Start date:

04/17/2007

Due date:

% Done:

Estimated time:

Bugzilla-Id:

2826

Description

Need to enable search engines to index metacat contents. This involves (1) making download URLs for each XML document that do not use the dynamic query syntax, and (2) providing a site map of the contents of the metacat server.

Task (1) is complete, and supports urls of this syntax:
http://localhost:8180/knb/metacat/test.1.1/knb
If a URL like this is processed, any subsequent 'action' parameters are ignored and not processed.
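
As a rough illustration of how a path-style URL like the one above could be taken apart, here is a minimal sketch. The class name DocumentUrl and its fields are hypothetical and not taken from the Metacat code base. It assumes the segment after "metacat" is the document id and that the trailing segment selects a display style, which this report does not state explicitly.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Metacat. */
final class DocumentUrl {
    final String docid;
    final String style; // assumption: the trailing segment names a display style

    private DocumentUrl(String docid, String style) {
        this.docid = docid;
        this.style = style;
    }

    /**
     * Parses a path such as "/knb/metacat/test.1.1/knb".
     * Returns empty when the path does not contain "metacat" followed
     * by two non-empty segments.
     */
    static Optional<DocumentUrl> parse(String path) {
        // Drop the query string so 'action' parameters are never consulted.
        int q = path.indexOf('?');
        String clean = (q >= 0) ? path.substring(0, q) : path;
        String[] parts = clean.split("/");
        for (int i = 0; i + 2 < parts.length; i++) {
            if (parts[i].equals("metacat")
                    && !parts[i + 1].isEmpty()
                    && !parts[i + 2].isEmpty()) {
                return Optional.of(new DocumentUrl(parts[i + 1], parts[i + 2]));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        DocumentUrl u = parse("/knb/metacat/test.1.1/knb?action=query").get();
        System.out.println(u.docid + " " + u.style);
    }
}
```

Discarding the query string before matching is one simple way to get the "action parameters are ignored" behavior; the real servlet may do this differently.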

Task (2) is yet to be finished.

Updated by Matt Jones almost 18 years ago

Added a new class called Sitemap that is used to generate a series of XML documents representing the URLs of metacat documents following the sitemap protocol. The Sitemap Protocol is described at Google:
https://www.google.com/webmasters/tools/docs/en/protocol.html

The Sitemap class extends TimerTask so that it can be scheduled to run once a day or so. New configuration options were added to metacat.properties to control where the sitemaps are written and how often they are updated. By default we do it once a day, as more often is overkill for search engines.
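
For readers unfamiliar with TimerTask, a minimal sketch of the scheduling pattern follows. SitemapTask is a hypothetical stand-in for the Sitemap class, and the interval is hard-coded here, whereas the actual value would presumably come from metacat.properties.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the Sitemap class; for illustration only. */
final class SitemapTask extends TimerTask {
    private final Runnable generator;

    SitemapTask(Runnable generator) {
        this.generator = generator;
    }

    @Override
    public void run() {
        // Each timer tick regenerates the sitemap files.
        generator.run();
    }

    /** Runs the generator once immediately, then again every intervalMillis. */
    static Timer schedule(Runnable generator, long intervalMillis) {
        Timer timer = new Timer("sitemap", true); // daemon thread
        timer.schedule(new SitemapTask(generator), 0L, intervalMillis);
        return timer;
    }

    public static void main(String[] args) throws InterruptedException {
        long oneDay = TimeUnit.DAYS.toMillis(1);
        CountDownLatch firstRun = new CountDownLatch(1);
        Timer timer = schedule(() -> {
            System.out.println("writing sitemaps");
            firstRun.countDown();
        }, oneDay);
        firstRun.await(5, TimeUnit.SECONDS); // wait for the first run
        timer.cancel();
    }
}
```

A daemon timer thread is used in this sketch so that the timer does not keep the JVM alive on shutdown; whether Metacat does the same is not stated here.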

Included a JUnit unit test to test the Sitemap generation functionality.

Included changes to MetaCatServlet to schedule the Sitemap task the first time Metacat is called.

This implements the needed functionality for task (2), but needs to be tested more.

An example of a generated sitemap file for a single document looks like this:

<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
<url><loc>http://knb.ecoinformatics.org/knb/metacat/test.1.1/knb</loc></url>
</urlset>
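
A file of this shape could be produced along the following lines. This is only a sketch: SitemapWriter, its method names, and the idea of passing the base URL and trailing path segment as arguments are assumptions for illustration, not the actual Sitemap implementation.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sitemap builder; for illustration only. */
final class SitemapWriter {
    static final String NS = "http://www.google.com/schemas/sitemap/0.84";

    /** Escapes the characters that are special in XML text. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /** Builds one urlset document with a url entry per document id. */
    static String build(String baseUrl, String suffix, List<String> docids) {
        StringBuilder sb = new StringBuilder();
        sb.append("<urlset xmlns=\"").append(NS).append("\">\n");
        for (String docid : docids) {
            String loc = baseUrl + "/" + docid + "/" + suffix;
            sb.append("<url><loc>").append(escape(loc)).append("</loc></url>\n");
        }
        sb.append("</urlset>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(build("http://knb.ecoinformatics.org/knb/metacat",
                "knb", Arrays.asList("test.1.1")));
    }
}
```

Escaping the URL matters because the sitemap protocol requires entity-escaped values, and a raw ampersand would make the file invalid XML.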

To finally finish the site indexing improvements, we need to:
(3) Improve the XSLT styles to set proper titles and META keywords for EML documents
(4) Test against a large document store
(5) Fix the Sitemap query
    a) to include document types other than eml
    b) to only include publicly accessible documents
(6) Register the sitemap files with Google once this system is installed on KNB

Updated by Matt Jones almost 18 years ago

Finished task (3) by adding the EML dataset title to the HTML title in the document head. EML documents will now display better in search engines. Chose not to add META keyword tags: the consensus is that search engines no longer use them, because so many people have tried to manipulate keywords.

Tasks 4, 5, 6 from Comment #1 are still outstanding.
