1
|
<!--
|
2
|
* query-caching.html
|
3
|
*
|
4
|
* Authors: Michael Daigle
|
5
|
* Copyright: 2009 Regents of the University of California and the
|
6
|
* National Center for Ecological Analysis and Synthesis
|
7
|
* For Details: http://www.nceas.ucsb.edu/
|
8
|
* Created: 2009 March 4
|
9
|
* Version:
|
10
|
* File Info: '$ '
|
11
|
*
|
12
|
*
|
13
|
-->
|
14
|
<html>
|
15
|
<head>
|
16
|
<title>Metacat Query Caching and Indexing Mechanisms</title>
|
17
|
<!-- unfortunately, we have to look for the common css file in the
|
18
|
user docs directory. This is because the user docs deploy to
|
19
|
the top level of the metacat docs on the knb web server -->
|
20
|
<link rel="stylesheet" type="text/css" href="../user/common.css">
|
21
|
<link rel="stylesheet" type="text/css" href="./default.css">
|
22
|
</head>
|
23
|
|
24
|
<body>
|
25
|
<table class="tabledefault" width="100%">
|
26
|
<tr>
|
27
|
<td rowspan="2"><img src="./images/KNBLogo.gif"></td>
|
28
|
<td colspan="7"><div class="title">KNB Software Development Guide: Metacat Query Caching and Indexing Mechanisms</div></td>
|
29
|
</tr>
|
30
|
<tr>
|
31
|
<td><a href="/" class="toollink"> KNB Home </a></td>
|
32
|
<td><a href="/data.html" class="toollink"> Data </a></td>
|
33
|
<td><a href="/people.html" class="toollink"> People </a></td>
|
34
|
<td><a href="/informatics" class="toollink"> Informatics </a></td>
|
35
|
<td><a href="/biodiversity" class="toollink"> Biocomplexity </a></td>
|
36
|
<td><a href="/education" class="toollink"> Education </a></td>
|
37
|
<td><a href="/software" class="toollink"> Software </a></td>
|
38
|
</tr>
|
39
|
</table>
|
40
|
<br>
|
41
|
|
42
|
<table width="100%">
|
43
|
<tr>
|
44
|
<td class="tablehead" colspan="2"><p class="label">Metacat Service Based Architecture</p></td>
|
45
|
<td class="tablehead" colspan="2" align="right">
|
46
|
<!-- a href="add back file here when one exists" -->Back<!-- /a --> | <a href="./index.html">Home</a> |
|
47
|
<a href="./service-base.html">Next</a>
|
48
|
</td>
|
49
|
</tr>
|
50
|
</table>
|
51
|
|
52
|
<div class="header1">Table of Contents</div>
|
53
|
<div class="toc">
|
54
|
<div class="toc1"><a href="#Overview">Overview</a></div>
|
55
|
<div class="toc1"><a href="#XMLNodesIndex">All XML Path Indexing</a></div>
|
56
|
<div class="toc2"><a href="#HowXMLNodesIndexWorks">How All XML Path Indexing Works</a></div>
|
57
|
<div class="toc2"><a href="#XMLNodesIndexDrawbacks">Drawbacks to All XML Path Indexing</a></div>
|
58
|
<div class="toc1"><a href="#MemoryCaching">Query Caching in Memory</a></div>
|
59
|
<div class="toc2"><a href="#HowMemoryCachingWorks">How Memory Caching Works</a></div>
|
60
|
<div class="toc2"><a href="#MemoryCachingDrawbacks">Memory Caching Drawbacks</a></div>
|
61
|
<div class="toc1"><a href="#PathIndexing">Common XML Path Indexing</a></div>
|
62
|
<div class="toc2"><a href="#HowPathIndexingWorks">How Common Path Indexing Works</a></div>
|
63
|
<div class="toc2"><a href="#PathIndexingDrawbacks">Common Path Indexing Drawbacks</a></div>
|
64
|
<div class="toc1"><a href="#UniqueResultset">Storing Unique Storing Unique Return Fields</a></div>
|
65
|
<div class="toc2"><a href="#HowUniqueResultsetWorks">How Storing Unique Storing Unique Return Fields Works</a></div>
|
66
|
</div>
|
67
|
|
68
|
<a name="Overview"></a><div class="header1">Overview</div>
|
69
|
<p>Unfortunately, the act of searching (querying) an xml document that has
|
70
|
been deconstructed and put into a relational database is somewhat slow. The table that
|
71
|
stores xml elements (xml_nodes) becomes very large, very fast. In an
|
72
|
effort to speed up these queries, metacat has implemented several levels of query
|
73
|
caching and indexing:</p>
|
74
|
|
75
|
<ul>
|
76
|
<li>All XML path indexing - store all xml paths in a separate table.</li>
|
77
|
<li>Memory caching - recent query results are held in memory.</li>
|
78
|
<li>Common XML path indexing - store predetermined xml paths in a separate table.</li>
|
79
|
<li>Store unique resultset combinations - unique resultsets are stored in the
|
80
|
database on a per-docid basis.</li>
|
81
|
</ul>
|
82
|
|
83
|
<p>We will discuss each of these caching strategies in turn.</p>
|
84
|
|
85
|
<a name="XMLNodesIndex"></a><div class="header1">All XML Path Indexing</div>
|
86
|
<a name="HowXMLNodesIndexWorks"></a><div class="header2">How All XML Path Indexing Works</div>
|
87
|
<p>Aside from the column indexing that is typical with performance tuning on any
|
88
|
database, an additional table named xml_index was created to improve search
|
89
|
performance against the xml_nodes table. The xml_index table keeps track of the
|
90
|
xml paths for each node in the nodes table. This includes subpaths, so if you
|
91
|
had an element in the xml that looked like: </p>
|
92
|
|
93
|
<div class="code"><a><b><c>value</c></b></a></div>
|
94
|
|
95
|
<p>the paths that would get indexed for this node would be:</p>
|
96
|
|
97
|
<div class="code">c<br>b/c<br>a/b/c</div>
|
98
|
|
99
|
<p>When querying, the query path is first looked up in xml_index and then joined to
|
100
|
the xml_nodes table via the nodeid.
|
101
|
|
102
|
<a name="XMLNodesIndexDrawbacks"></a><div class="header2">Drawbacks to All XML Path Indexing</div>
|
103
|
<p>Indexing all paths improves performance, but is still relatively slow since a join
|
104
|
against the xml_nodes table takes time.</p>
|
105
|
|
106
|
<a name="MemoryCaching"></a><div class="header1">Query Caching in Memory</div>
|
107
|
<a name="HowMemoryCachingWorks"></a><div class="header2">How Memory Caching Works</div>
|
108
|
<p>Metacat keeps a Hashtable of recent queries and their results. This information is only
|
109
|
held for queries that were run by the "public" user. The key is the sql that
|
110
|
that was run for the query. The value is the html that was returned from the query.
|
111
|
This cache is checked first to see if the current sql statement was already run. If so,
|
112
|
the associated result html is returned.</p>
|
113
|
<a name="MemoryCachingDrawbacks"></a><div class="header2">Memory Caching Drawbacks</div>
|
114
|
<p>The first issue is that every time a document is inserted, updated or deleted, the
|
115
|
query cache is potentially out of sync. So anytime any document is changed, the entire
|
116
|
cache is cleared, and the process of filling the cache starts over.</p>
|
117
|
<p>The second issue is that the cache could potentially become large, using a lot of
|
118
|
memory. This could be caused if a large number of different queries are performed
|
119
|
inbetween any documents being inserted, updated or deleted, and many of those queries
|
120
|
return large result sets. Note that this is a corner case, and the cache size is
|
121
|
configurable using the database.queryresultCacheSize property in metacat.properties.<p>
|
122
|
<p>Third, as mentioned before, this caching is only done on searches performed by
|
123
|
the "public" user. Any logged in user searches will not be cached.</p>
|
124
|
|
125
|
<a name="PathIndexing"></a><div class="header1">Common XML Path Indexing</div>
|
126
|
<a name="HowPathIndexingWorks"></a><div class="header2">How Common Path Indexing Works</div>
|
127
|
<p>Metacat queries work in two stages. The first is to get the matching document IDs.
|
128
|
The second is to get the requested return fields for those IDs. As discussed in the
|
129
|
<a href="#XMLNodesIndex">All XML Path Indexing</a> section above, some effort was made to
|
130
|
improve the first step, but it was still slow. Metacat made use of the fact that
|
131
|
there are certain paths in EML that are frequently searched to further speed up queries.</p>
|
132
|
|
133
|
<p>Paths that are determined to be searched frequently are entered as a comma delimited
|
134
|
list in the xml.indexPaths property in metacat.properties. At startup time, metacat determines
|
135
|
if there are any changes to the xml.indexPaths property, scans the xml_nodes table and adds
|
136
|
any nodes with matching paths to the xml_path_index table. In this case, unlike the xml_index
|
137
|
table, the node value is stored. So if a desired value is found for the given path, the
|
138
|
docid is returned and the nodes table is never touched.
|
139
|
|
140
|
<a name="PathIndexingDrawbacks"></a><div class="header2">Common Path Indexing Drawbacks</div>
|
141
|
<p>An obvious issue with the xml_path_index table is that is only holds values for predetermined
|
142
|
paths. Any other path will not benefit from this indexing.</p>
|
143
|
|
144
|
<p>Also, if a wildcard has been used in the search value, the database index for the value
|
145
|
column will not be used. This will slow down the search.</p>
|
146
|
|
147
|
<a name="UniqueResultset"></a><div class="header1">Storing Unique Return Fields</div>
|
148
|
<a name="HowUniqueResultsetWorks"></a><div class="header2">How Storing Unique Return Fields Works</div>
|
149
|
<p>Once the matching docids have been retrieved, metacat makes a second pass and gets the desired
|
150
|
return fields for each document. To help speed this up, metacat databases unique combinations of
|
151
|
query return fields for each document.</p>
|
152
|
|
153
|
<p>This process involves two tables. The first, xml_returnfield, keeps unique return field combinations.
|
154
|
The return fields are a pipe (|) delimited string in the table. The second table, xml_queryresult, keeps
|
155
|
the id for the row in the xml_returnfield that has the appropriate return fields, the docid and the
|
156
|
query result.</p>
|
157
|
|
158
|
<p>If the desired docid and result fields are found in these two tables, the query results are returned
|
159
|
directly from the xml_queryresult table and the xml_nodes table is never touched.</p>
|
160
|
|
161
|
<p>Overall, this strategy works well.</p>
|
162
|
|
163
|
<br>
|
164
|
<!-- a href="add back file here when one exists" -->Back<!-- /a --> | <a href="./index.html">Home</a> |
|
165
|
<a href="./service-base.html">Next</a>
|
166
|
</ul>
|
167
|
|
168
|
</body>
|
169
|
</html>
|