Project

General

Profile

1
<!--
2
  * query-caching.html
3
  *
4
  *      Authors: Michael Daigle
5
  *    Copyright: 2009 Regents of the University of California and the
6
  *               National Center for Ecological Analysis and Synthesis
7
  *  For Details: http://www.nceas.ucsb.edu/
8
  *      Created: 2009 March 4
9
  *      Version: 
10
  *    File Info: '$ '
11
  * 
12
  * 
13
-->
14
<html>
15
<head>
16
<title>Metacat Query Caching and Indexing Mechanisms</title>
17
<!-- unfortunately, we have to look for the common css file in the 
18
     user docs directory.  This is because the user docs deploy to 
19
     the top level of the metacat docs on the knb web server -->
20
<link rel="stylesheet" type="text/css" href="../user/common.css">
21
<link rel="stylesheet" type="text/css" href="./default.css">
22
</head> 
23

    
24
<body>
25
  <table class="tabledefault" width="100%">
26
    <tr>
27
      <td rowspan="2"><img src="./images/KNBLogo.gif"></td>
28
      <td colspan="7"><div class="title">KNB Software Development Guide: Metacat Query Caching and Indexing Mechanisms</div></td>
29
    </tr>
30
    <tr>
31
      <td><a href="/" class="toollink"> KNB Home </a></td>
32
      <td><a href="/data.html" class="toollink"> Data </a></td>
33
      <td><a href="/people.html" class="toollink"> People </a></td>
34
      <td><a href="/informatics" class="toollink"> Informatics </a></td>
35
      <td><a href="/biodiversity" class="toollink"> Biocomplexity </a></td>
36
      <td><a href="/education" class="toollink"> Education </a></td>
37
      <td><a href="/software" class="toollink"> Software </a></td>
38
    </tr>
39
  </table>
40
  <br>
41

    
42
  <table width="100%">
43
    <tr>
44
      <td class="tablehead" colspan="2"><p class="label">Metacat Service Based Architecture</p></td>
45
      <td class="tablehead" colspan="2" align="right">
46
        <!-- a href="add back file here when one exists" -->Back<!-- /a --> | <a href="./index.html">Home</a> | 
47
        <a href="./service-base.html">Next</a>
48
      </td>
49
    </tr>
50
  </table>
51
  
52
  <div class="header1">Table of Contents</div>
53
  <div class="toc">
54
    <div class="toc1"><a href="#Overview">Overview</a></div>
55
    <div class="toc1"><a href="#XMLNodesIndex">All XML Path Indexing</a></div>
56
      <div class="toc2"><a href="#HowXMLNodesIndexWorks">How All XML Path Indexing Works</a></div>
57
      <div class="toc2"><a href="#XMLNodesIndexDrawbacks">Drawbacks to All XML Path Indexing</a></div>
58
    <div class="toc1"><a href="#MemoryCaching">Query Caching in Memory</a></div>
59
      <div class="toc2"><a href="#HowMemoryCachingWorks">How Memory Caching Works</a></div>
60
      <div class="toc2"><a href="#MemoryCachingDrawbacks">Memory Caching Drawbacks</a></div>
61
    <div class="toc1"><a href="#PathIndexing">Common XML Path Indexing</a></div>
62
      <div class="toc2"><a href="#HowPathIndexingWorks">How Common Path Indexing Works</a></div>
63
      <div class="toc2"><a href="#PathIndexingDrawbacks">Common Path Indexing Drawbacks</a></div>
64
    <div class="toc1"><a href="#UniqueResultset">Storing Unique Return Fields</a></div>
65
      <div class="toc2"><a href="#HowUniqueResultsetWorks">How Storing Unique Return Fields Works</a></div>
66
      <div class="toc2"><a href="#UniqueResultsetDrawbacks">Drawbacks to Storing Unique Return Fields</a></div>
67
  </div>  
68
  
69
  <a name="Overview"></a><div class="header1">Overview</div>   
70
  <p>Unfortunately, the act of searching (querying) an xml document that has
71
  been deconstructed and put into a relational database is somewhat slow.  The table that 
72
  stores xml elements (xml_nodes) becomes very large, very fast. In an
73
  effort to speed up these queries, metacat has implemented several levels of query
74
  caching and indexing:</p>
75
  
76
  <ul>
77
    <li>All XML path indexing - store all xml paths in a separate table.</li>
78
    <li>Memory caching - recent query results are held in memory.</li>
79
    <li>Common XML path indexing - store predetermined xml paths in a separate table.</li>
80
    <li>Store unique resultset combinations - unique resultsets are stored in the
81
    database on a per-docid basis.</li>
82
  </ul>  
83
    
84
  <p>We will discuss each of these caching strategies in turn.</p>
85
  
86
  <a name="XMLNodesIndex"></a><div class="header1">All XML Path Indexing</div>
87
  <a name="HowXMLNodesIndexWorks"></a><div class="header2">How All XML Path Indexing Works</div>
88
  <p>Aside from the column indexing that is typical with performance tuning on any 
89
  database, an additional table named xml_index was created to improve search 
90
  performance against the xml_nodes table.  The xml_index table keeps track of the 
91
  xml paths for each node in the nodes table.  This includes subpaths, so if you
92
  had an element in the xml that looked like: </p>
93
  
94
  <div class="code">&lt;a&gt;&lt;b&gt;&lt;c&gt;value&lt;/c&gt;&lt;/b&gt;&lt;/a&gt;</div>
95
  
96
  <p>the paths that would get indexed for this node would be:</p>
97
  
98
  <div class="code">c<br>b/c<br>a/b/c</div>
99
  
100
  <p>When querying, the query path is first looked up in xml_index and then joined to
101
  the xml_nodes table via the nodeid.  
102
  
103
  <a name="XMLNodesIndexDrawbacks"></a><div class="header2">Drawbacks to All XML Path Indexing</div>
104
  <p>Indexing all paths improves performance, but is still relatively slow since a join 
105
  against the xml_nodes table takes time.</p>
106
  
107
  <a name="MemoryCaching"></a><div class="header1">Query Caching in Memory</div> 
108
    <a name="HowMemoryCachingWorks"></a><div class="header2">How Memory Caching Works</div>  
109
    <p>Metacat keeps a Hashtable of recent queries and their results.  This information is only
110
    held for queries that were run by the "public" user.  The key is the sql that
111
    that was run for the query.  The value is the html that was returned from the query.
112
    This cache is checked first to see if the current sql statement was already run.  If so,
113
    the associated result html is returned.</p>
114
    <a name="MemoryCachingDrawbacks"></a><div class="header2">Memory Caching Drawbacks</div>  
115
    <p>The first issue is that every time a document is inserted, updated or deleted, the 
116
    query cache is potentially out of sync.  So anytime any document is changed, the entire
117
    cache is cleared, and the process of filling the cache starts over.</p>
118
    <p>The second issue is that the cache could potentially become large, using a lot of
119
    memory.  This could be caused if a large number of different queries are performed 
120
    inbetween any documents being inserted, updated or deleted, and many of those queries
121
    return large result sets.  Note that this is a corner case, and the cache size is 
122
    configurable using the database.queryresultCacheSize property in metacat.properties.<p>
123
    <p>Third, as mentioned before, this caching is only done on searches performed by
124
    the "public" user.  Any logged in user searches will not be cached.</p>
125
  
126
  <a name="PathIndexing"></a><div class="header1">Common XML Path Indexing</div> 
127
    <a name="HowPathIndexingWorks"></a><div class="header2">How Common Path Indexing Works</div> 
128
    <p>Metacat queries work in two stages.  The first is to get the matching document IDs.  
129
    The second is to get the requested return fields for those IDs.  As discussed in the 
130
    <a href="#XMLNodesIndex">All XML Path Indexing</a> section above, some effort was made to 
131
    improve the first step, but it was still slow.  Metacat made use of the fact that 
132
    there are certain paths in EML that are frequently searched to further speed up queries.</p>
133
    
134
    <p>Paths that are determined to be searched frequently are entered as a comma delimited
135
    list in the xml.indexPaths property in metacat.properties.  At startup time, metacat determines
136
    if there are any changes to the xml.indexPaths property, scans the xml_nodes table and adds
137
    any nodes with matching paths to the xml_path_index table.  In this case, unlike the xml_index
138
    table, the node value is stored.  So if a desired value is found for the given path, the 
139
    docid is returned and the nodes table is never touched.
140
    
141
    <a name="PathIndexingDrawbacks"></a><div class="header2">Common Path Indexing Drawbacks</div> 
142
    <p>An obvious issue with the xml_path_index table is that is only holds values for predetermined
143
    paths.  Any other path will not benefit from this indexing.</p>
144
    
145
    <p>Also, if a wildcard has been used in the search value, the database index for the value 
146
    column will not be used.  This will slow down the search.</p>
147
      
148
  <a name="UniqueResultset"></a><div class="header1">Storing Unique Return Fields</div> 
149
    <a name="HowUniqueResultsetWorks"></a><div class="header2">How Storing Unique Return Fields Works</div> 
150
    <p>Once the matching docids have been retrieved, metacat makes a second pass and gets the desired
151
    return fields for each document. To help speed this up, metacat databases unique combinations of 
152
    query return fields for each document.</p>
153
    
154
    <p>This process involves two tables.  The first, xml_returnfield, keeps unique return field combinations.
155
    The return fields are a pipe (|) delimited string in the table. The second table, xml_queryresult, keeps 
156
    the id for the row in the xml_returnfield that has the appropriate return fields, the docid and the 
157
    query result.</p>
158
    
159
    <p>If the desired docid and result fields are found in these two tables, the query results are returned
160
    directly from the xml_queryresult table and the xml_indes and xml_nodes tables are never touched.</p>
161
    
162
    <a name="UniqueResultsetDrawbacks"></a><div class="header2">Drawbacks to Storing Unique Return Fields</div> 
163
    <p>The main issue with storing unique query results is that the query must be performed at least once for
164
    each unique set of results and docid before it is added to the xml_returnfield and xml_queryresult tables. 
165
    This first run goes against the xml_index and xml_nodes tables and is slow.</p>
166
  <br>
167
  <!-- a href="add back file here when one exists" -->Back<!-- /a --> | <a href="./index.html">Home</a> | 
168
  <a href="./service-base.html">Next</a>
169
  </ul>
170

    
171
</body>
172
</html>
(28-28/35)