KNB Software Development Guide: Metacat Query Caching and Indexing Mechanisms


Table of Contents
Overview
All XML Path Indexing
How All XML Path Indexing Works
Drawbacks to All XML Path Indexing
Query Caching in Memory
How Memory Caching Works
Memory Caching Drawbacks
Common XML Path Indexing
How Common Path Indexing Works
Common Path Indexing Drawbacks
Storing Unique Return Fields
How Storing Unique Return Fields Works
Drawbacks to Storing Unique Return Fields
Overview

Unfortunately, the act of searching (querying) an XML document that has been deconstructed and stored in a relational database is somewhat slow. The table that stores the XML elements (xml_nodes) becomes very large, very fast. In an effort to speed up these queries, Metacat has implemented several levels of query caching and indexing:

All XML path indexing
Query caching in memory
Common XML path indexing
Storing unique return fields

We will discuss each of these caching strategies in turn.

All XML Path Indexing
How All XML Path Indexing Works

Aside from the column indexing that is typical of performance tuning on any database, an additional table named xml_index was created to improve search performance against the xml_nodes table. The xml_index table keeps track of the XML path for each node in the nodes table. This includes subpaths, so if you had an element in the XML that looked like:

<a><b><c>value</c></b></a>

the paths that would get indexed for this node would be:

c
b/c
a/b/c

When querying, the query path is first looked up in xml_index and then joined to the xml_nodes table via the nodeid.
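The lookup described above can be sketched with an in-memory SQLite database. The column names and schemas here are simplified assumptions for illustration; the real Metacat tables have more columns.

```python
import sqlite3

# Minimal stand-ins for the two tables (schemas simplified for the sketch).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE xml_nodes (nodeid INTEGER, docid TEXT, nodedata TEXT)")
cur.execute("CREATE TABLE xml_index (nodeid INTEGER, path TEXT)")

# Index the node from <a><b><c>value</c></b></a> under all of its subpaths.
cur.execute("INSERT INTO xml_nodes VALUES (1, 'doc.1', 'value')")
for path in ("c", "b/c", "a/b/c"):
    cur.execute("INSERT INTO xml_index VALUES (1, ?)", (path,))

# Query: resolve the path in xml_index first, then join to xml_nodes
# via the nodeid to fetch the document and value.
cur.execute("""
    SELECT n.docid, n.nodedata
    FROM xml_index i
    JOIN xml_nodes n ON n.nodeid = i.nodeid
    WHERE i.path = 'a/b/c'
""")
row = cur.fetchone()
print(row)  # ('doc.1', 'value')
```

Any of the three indexed subpaths ('c', 'b/c', or 'a/b/c') would resolve to the same node, which is what lets queries match a path at any depth.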

Drawbacks to All XML Path Indexing

Indexing all paths improves performance, but is still relatively slow since a join against the xml_nodes table takes time.

Query Caching in Memory
How Memory Caching Works

Metacat keeps a Hashtable of recent queries and their results. This information is only held for queries run by the "public" user. The key is the SQL that was run for the query; the value is the HTML that the query returned. This cache is checked first to see if the current SQL statement has already been run. If so, the associated result HTML is returned.
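The caching logic can be sketched as follows. This is an illustrative Python sketch, not Metacat's actual Java code; the function and variable names are invented for the example.

```python
# Cache mapping the query SQL string -> the HTML result it produced.
# Populated only for searches by the "public" user.
query_cache = {}

def cached_query(sql, username, run_query):
    """Return cached HTML for public queries; run the query on a miss."""
    if username != "public":
        return run_query(sql)  # logged-in users always bypass the cache
    if sql not in query_cache:
        query_cache[sql] = run_query(sql)
    # On any document insert/update/delete Metacat clears the whole
    # cache, i.e. the equivalent of query_cache.clear().
    return query_cache[sql]

calls = []
def fake_run(sql):
    calls.append(sql)
    return "<resultset/>"

cached_query("SELECT docid FROM ...", "public", fake_run)
cached_query("SELECT docid FROM ...", "public", fake_run)  # cache hit
print(len(calls))  # 1 -- the query only ran once
```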

Memory Caching Drawbacks

The first issue is that every time a document is inserted, updated or deleted, the query cache is potentially out of sync. So anytime any document is changed, the entire cache is cleared, and the process of filling the cache starts over.

The second issue is that the cache could potentially become large, using a lot of memory. This can happen if a large number of different queries are performed between document inserts, updates, or deletes, and many of those queries return large result sets. Note that this is a corner case, and the cache size is configurable using the database.queryresultCacheSize property in metacat.properties.

Third, as mentioned before, this caching is only done on searches performed by the "public" user. Searches by logged-in users will not be cached.

Common XML Path Indexing
How Common Path Indexing Works

Metacat queries work in two stages. The first is to get the matching document IDs. The second is to get the requested return fields for those IDs. As discussed in the All XML Path Indexing section above, some effort was made to improve the first step, but it was still slow. Metacat made use of the fact that there are certain paths in EML that are frequently searched to further speed up queries.

Paths that are determined to be searched frequently are entered as a comma-delimited list in the xml.indexPaths property in metacat.properties. At startup time, Metacat determines whether the xml.indexPaths property has changed, scans the xml_nodes table, and adds any nodes with matching paths to the xml_path_index table. In this case, unlike the xml_index table, the node value is also stored. So if a desired value is found for the given path, the docid is returned and the nodes table is never touched.
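The key difference from xml_index is that the value lives in the index table itself, so a search can resolve to a docid without any join. A minimal sketch, again with assumed column names and an example EML-style path:

```python
import sqlite3

# Simplified stand-in for the pre-resolved path/value table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE xml_path_index (docid TEXT, path TEXT, nodedata TEXT)")

# At startup, nodes whose paths match the configured xml.indexPaths
# list are copied in along with their values.
cur.execute("INSERT INTO xml_path_index VALUES ('doc.1', 'dataset/title', 'Soil survey')")

# Because the value is stored next to the path, a search resolves
# directly to a docid; xml_nodes is never touched.
cur.execute("""
    SELECT docid FROM xml_path_index
    WHERE path = 'dataset/title' AND nodedata LIKE 'Soil%'
""")
rows = cur.fetchall()
print(rows)  # [('doc.1',)]
```

Note the trailing wildcard above: a leading wildcard (e.g. LIKE '%survey') is the case where the database cannot use its index on the value column, which is the drawback discussed below.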

Common Path Indexing Drawbacks

An obvious issue with the xml_path_index table is that it only holds values for predetermined paths. Any other path will not benefit from this indexing.

Also, if a wildcard is used in the search value, the database index on the value column cannot be used, which slows down the search.

Storing Unique Return Fields
How Storing Unique Return Fields Works

Once the matching docids have been retrieved, Metacat makes a second pass and gets the desired return fields for each document. To help speed this up, Metacat stores unique combinations of query return fields for each document.

This process involves two tables. The first, xml_returnfield, holds the unique return field combinations, stored as a pipe-delimited (|) string. The second, xml_queryresult, holds the id of the xml_returnfield row with the appropriate return fields, the docid, and the query result.
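The two-table lookup can be sketched like this. The schemas and the sample return-field string are simplified assumptions for illustration only.

```python
import sqlite3

# Simplified stand-ins for the two return-field caching tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE xml_returnfield (id INTEGER PRIMARY KEY, returnfields TEXT)")
cur.execute("CREATE TABLE xml_queryresult (returnfield_id INTEGER, docid TEXT, queryresult TEXT)")

# One row per unique, pipe-delimited return-field combination...
cur.execute("INSERT INTO xml_returnfield (returnfields) VALUES ('dataset/title|creator/surName')")
# ...and one cached result per (field combination, docid) pair.
cur.execute("INSERT INTO xml_queryresult VALUES (1, 'doc.1', 'cached-html-fragment')")

# A later query with the same return fields and docid is answered
# straight from xml_queryresult, skipping xml_index and xml_nodes.
cur.execute("""
    SELECT q.queryresult
    FROM xml_returnfield r
    JOIN xml_queryresult q ON q.returnfield_id = r.id
    WHERE r.returnfields = 'dataset/title|creator/surName'
      AND q.docid = 'doc.1'
""")
result = cur.fetchone()[0]
print(result)  # cached-html-fragment
```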

If the desired docid and return fields are found in these two tables, the query results are returned directly from the xml_queryresult table and the xml_index and xml_nodes tables are never touched.

Drawbacks to Storing Unique Return Fields

The main issue with storing unique query results is that the query must be performed at least once for each unique set of results and docid before it is added to the xml_returnfield and xml_queryresult tables. This first run goes against the xml_index and xml_nodes tables and is slow.
