1 |
4830
|
daigle
|
<!--
|
2 |
|
|
* query-caching.html
|
3 |
|
|
*
|
4 |
|
|
* Authors: Michael Daigle
|
5 |
|
|
* Copyright: 2009 Regents of the University of California and the
|
6 |
|
|
* National Center for Ecological Analysis and Synthesis
|
7 |
|
|
* For Details: http://www.nceas.ucsb.edu/
|
8 |
|
|
* Created: 2009 March 4
|
9 |
|
|
* Version:
|
10 |
|
|
* File Info: '$ '
|
11 |
|
|
*
|
12 |
|
|
*
|
13 |
|
|
-->
|
14 |
|
|
<html>
|
15 |
|
|
<head>
|
16 |
|
|
<title>Metacat Query Caching and Indexing Mechanisms</title>
|
17 |
|
|
<!-- unfortunately, we have to look for the common css file in the
|
18 |
|
|
user docs directory. This is because the user docs deploy to
|
19 |
|
|
the top level of the metacat docs on the knb web server -->
|
20 |
|
|
<link rel="stylesheet" type="text/css" href="../user/common.css">
|
21 |
|
|
<link rel="stylesheet" type="text/css" href="./default.css">
|
22 |
|
|
</head>
|
23 |
|
|
|
24 |
|
|
<body>
|
25 |
|
|
<table class="tabledefault" width="100%">
|
26 |
|
|
<tr>
|
27 |
|
|
<td rowspan="2"><img src="./images/KNBLogo.gif"></td>
|
28 |
|
|
<td colspan="7"><div class="title">KNB Software Development Guide: Metacat Query Caching and Indexing Mechanisms</div></td>
|
29 |
|
|
</tr>
|
30 |
|
|
<tr>
|
31 |
|
|
<td><a href="/" class="toollink"> KNB Home </a></td>
|
32 |
|
|
<td><a href="/data.html" class="toollink"> Data </a></td>
|
33 |
|
|
<td><a href="/people.html" class="toollink"> People </a></td>
|
34 |
|
|
<td><a href="/informatics" class="toollink"> Informatics </a></td>
|
35 |
|
|
<td><a href="/biodiversity" class="toollink"> Biocomplexity </a></td>
|
36 |
|
|
<td><a href="/education" class="toollink"> Education </a></td>
|
37 |
|
|
<td><a href="/software" class="toollink"> Software </a></td>
|
38 |
|
|
</tr>
|
39 |
|
|
</table>
|
40 |
|
|
<br>
|
41 |
|
|
|
42 |
|
|
<table width="100%">
|
43 |
|
|
<tr>
|
44 |
|
|
<td class="tablehead" colspan="2"><p class="label">Metacat Service Based Architecture</p></td>
|
45 |
|
|
<td class="tablehead" colspan="2" align="right">
|
46 |
|
|
<!-- a href="add back file here when one exists" -->Back<!-- /a --> | <a href="./index.html">Home</a> |
|
47 |
|
|
<a href="./service-base.html">Next</a>
|
48 |
|
|
</td>
|
49 |
|
|
</tr>
|
50 |
|
|
</table>
|
51 |
|
|
|
52 |
|
|
<div class="header1">Table of Contents</div>
|
53 |
|
|
<div class="toc">
|
54 |
|
|
<div class="toc1"><a href="#Overview">Overview</a></div>
|
55 |
|
|
<div class="toc1"><a href="#XMLNodesIndex">All XML Path Indexing</a></div>
|
56 |
|
|
<div class="toc2"><a href="#HowXMLNodesIndexWorks">How All XML Path Indexing Works</a></div>
|
57 |
|
|
<div class="toc2"><a href="#XMLNodesIndexDrawbacks">Drawbacks to All XML Path Indexing</a></div>
|
58 |
|
|
<div class="toc1"><a href="#MemoryCaching">Query Caching in Memory</a></div>
|
59 |
|
|
<div class="toc2"><a href="#HowMemoryCachingWorks">How Memory Caching Works</a></div>
|
60 |
|
|
<div class="toc2"><a href="#MemoryCachingDrawbacks">Memory Caching Drawbacks</a></div>
|
61 |
|
|
<div class="toc1"><a href="#PathIndexing">Common XML Path Indexing</a></div>
|
62 |
|
|
<div class="toc2"><a href="#HowPathIndexingWorks">How Common Path Indexing Works</a></div>
|
63 |
|
|
<div class="toc2"><a href="#PathIndexingDrawbacks">Common Path Indexing Drawbacks</a></div>
|
64 |
|
|
<div class="toc1"><a href="#UniqueResultset">Storing Unique Storing Unique Return Fields</a></div>
|
65 |
|
|
<div class="toc2"><a href="#HowUniqueResultsetWorks">How Storing Unique Storing Unique Return Fields Works</a></div>
|
66 |
|
|
</div>
|
67 |
|
|
|
68 |
|
|
<a name="Overview"></a><div class="header1">Overview</div>
|
69 |
|
|
<p>Unfortunately, the act of searching (querying) an xml document that has
|
70 |
|
|
been deconstructed and put into a relational database is somewhat slow. The table that
|
71 |
|
|
stores xml elements (xml_nodes) becomes very large, very fast. In an
|
72 |
|
|
effort to speed up these queries, metacat has implemented several levels of query
|
73 |
|
|
caching and indexing:</p>
|
74 |
|
|
|
75 |
|
|
<ul>
|
76 |
|
|
<li>All XML path indexing - store all xml paths in a separate table.</li>
|
77 |
|
|
<li>Memory caching - recent query results are held in memory.</li>
|
78 |
|
|
<li>Common XML path indexing - store predetermined xml paths in a separate table.</li>
|
79 |
|
|
<li>Store unique resultset combinations - unique resultsets are stored in the
|
80 |
|
|
database on a per-docid basis.</li>
|
81 |
|
|
</ul>
|
82 |
|
|
|
83 |
|
|
<p>We will discuss each of these caching strategies in turn.</p>
|
84 |
|
|
|
85 |
|
|
<a name="XMLNodesIndex"></a><div class="header1">All XML Path Indexing</div>
|
86 |
|
|
<a name="HowXMLNodesIndexWorks"></a><div class="header2">How All XML Path Indexing Works</div>
|
87 |
|
|
<p>Aside from the column indexing that is typical with performance tuning on any
|
88 |
|
|
database, an additional table named xml_index was created to improve search
|
89 |
|
|
performance against the xml_nodes table. The xml_index table keeps track of the
|
90 |
|
|
xml paths for each node in the nodes table. This includes subpaths, so if you
|
91 |
|
|
had an element in the xml that looked like: </p>
|
92 |
|
|
|
93 |
|
|
<div class="code"><a><b><c>value</c></b></a></div>
|
94 |
|
|
|
95 |
|
|
<p>the paths that would get indexed for this node would be:</p>
|
96 |
|
|
|
97 |
|
|
<div class="code">c<br>b/c<br>a/b/c</div>
|
98 |
|
|
|
99 |
|
|
<p>When querying, the query path is first looked up in xml_index and then joined to
|
100 |
|
|
the xml_nodes table via the nodeid.
|
101 |
|
|
|
102 |
|
|
<a name="XMLNodesIndexDrawbacks"></a><div class="header2">Drawbacks to All XML Path Indexing</div>
|
103 |
|
|
<p>Indexing all paths improves performance, but is still relatively slow since a join
|
104 |
|
|
against the xml_nodes table takes time.</p>
|
105 |
|
|
|
106 |
|
|
<a name="MemoryCaching"></a><div class="header1">Query Caching in Memory</div>
|
107 |
|
|
<a name="HowMemoryCachingWorks"></a><div class="header2">How Memory Caching Works</div>
|
108 |
|
|
<p>Metacat keeps a Hashtable of recent queries and their results. This information is only
|
109 |
|
|
held for queries that were run by the "public" user. The key is the sql that
|
110 |
|
|
that was run for the query. The value is the html that was returned from the query.
|
111 |
|
|
This cache is checked first to see if the current sql statement was already run. If so,
|
112 |
|
|
the associated result html is returned.</p>
|
113 |
|
|
<a name="MemoryCachingDrawbacks"></a><div class="header2">Memory Caching Drawbacks</div>
|
114 |
|
|
<p>The first issue is that every time a document is inserted, updated or deleted, the
|
115 |
|
|
query cache is potentially out of sync. So anytime any document is changed, the entire
|
116 |
|
|
cache is cleared, and the process of filling the cache starts over.</p>
|
117 |
|
|
<p>The second issue is that the cache could potentially become large, using a lot of
|
118 |
|
|
memory. This could be caused if a large number of different queries are performed
|
119 |
|
|
inbetween any documents being inserted, updated or deleted, and many of those queries
|
120 |
|
|
return large result sets. Note that this is a corner case, and the cache size is
|
121 |
|
|
configurable using the database.queryresultCacheSize property in metacat.properties.<p>
|
122 |
|
|
<p>Third, as mentioned before, this caching is only done on searches performed by
|
123 |
|
|
the "public" user. Any logged in user searches will not be cached.</p>
|
124 |
|
|
|
125 |
|
|
<a name="PathIndexing"></a><div class="header1">Common XML Path Indexing</div>
|
126 |
|
|
<a name="HowPathIndexingWorks"></a><div class="header2">How Common Path Indexing Works</div>
|
127 |
|
|
<p>Metacat queries work in two stages. The first is to get the matching document IDs.
|
128 |
|
|
The second is to get the requested return fields for those IDs. As discussed in the
|
129 |
|
|
<a href="#XMLNodesIndex">All XML Path Indexing</a> section above, some effort was made to
|
130 |
|
|
improve the first step, but it was still slow. Metacat made use of the fact that
|
131 |
|
|
there are certain paths in EML that are frequently searched to further speed up queries.</p>
|
132 |
|
|
|
133 |
|
|
<p>Paths that are determined to be searched frequently are entered as a comma delimited
|
134 |
|
|
list in the xml.indexPaths property in metacat.properties. At startup time, metacat determines
|
135 |
|
|
if there are any changes to the xml.indexPaths property, scans the xml_nodes table and adds
|
136 |
|
|
any nodes with matching paths to the xml_path_index table. In this case, unlike the xml_index
|
137 |
|
|
table, the node value is stored. So if a desired value is found for the given path, the
|
138 |
|
|
docid is returned and the nodes table is never touched.
|
139 |
|
|
|
140 |
|
|
<a name="PathIndexingDrawbacks"></a><div class="header2">Common Path Indexing Drawbacks</div>
|
141 |
|
|
<p>An obvious issue with the xml_path_index table is that is only holds values for predetermined
|
142 |
|
|
paths. Any other path will not benefit from this indexing.</p>
|
143 |
|
|
|
144 |
|
|
<p>Also, if a wildcard has been used in the search value, the database index for the value
|
145 |
|
|
column will not be used. This will slow down the search.</p>
|
146 |
|
|
|
147 |
|
|
<a name="UniqueResultset"></a><div class="header1">Storing Unique Return Fields</div>
|
148 |
|
|
<a name="HowUniqueResultsetWorks"></a><div class="header2">How Storing Unique Return Fields Works</div>
|
149 |
|
|
<p>Once the matching docids have been retrieved, metacat makes a second pass and gets the desired
|
150 |
|
|
return fields for each document. To help speed this up, metacat databases unique combinations of
|
151 |
|
|
query return fields for each document.</p>
|
152 |
|
|
|
153 |
|
|
<p>This process involves two tables. The first, xml_returnfield, keeps unique return field combinations.
|
154 |
|
|
The return fields are a pipe (|) delimited string in the table. The second table, xml_queryresult, keeps
|
155 |
|
|
the id for the row in the xml_returnfield that has the appropriate return fields, the docid and the
|
156 |
|
|
query result.</p>
|
157 |
|
|
|
158 |
|
|
<p>If the desired docid and result fields are found in these two tables, the query results are returned
|
159 |
|
|
directly from the xml_queryresult table and the xml_nodes table is never touched.</p>
|
160 |
|
|
|
161 |
|
|
<p>Overall, this strategy works well.</p>
|
162 |
|
|
|
163 |
|
|
<br>
|
164 |
|
|
<!-- a href="add back file here when one exists" -->Back<!-- /a --> | <a href="./index.html">Home</a> |
|
165 |
|
|
<a href="./service-base.html">Next</a>
|
166 |
|
|
</ul>
|
167 |
|
|
|
168 |
|
|
</body>
|
169 |
|
|
</html>
|