..
  @startuml images/stats-activity-diagram.png
  (*) --> "Initialize event log timer"
  --> "read an event_log entry"
  --> "read system metadata"
  --> "write to stats Solr index"
  @enduml

..
  @startuml images/stats-query-sequence-diagram.png
  participant client
  client -> MNRestServlet : doGet(request)
  activate MNRestServlet
  MNRestServlet -> MNResourceHandler: handle(get)
  activate MNResourceHandler
  MNResourceHandler -> MNResourceHandler: doQuery(engine, query)
  MNResourceHandler -> MNodeService: query(engine, query)
  activate MNodeService
  MNodeService -> StatsQueryService: query(query, subjects)
  activate StatsQueryService
  StatsQueryService -> SolrServer: query(query)
  activate SolrServer
  SolrServer -> StatsQueryService: inputstream
  deactivate SolrServer
  StatsQueryService -> MNodeService: inputstream
  deactivate StatsQueryService
  MNodeService -> MNResourceHandler: inputstream
  deactivate MNodeService
  MNResourceHandler -> MNRestServlet: response
  deactivate MNResourceHandler
  MNRestServlet -> client: response
  deactivate MNRestServlet
  @enduml


Metacat Usage Statistics Service
================================

Overview
--------
This document describes a proposed usage statistics service for Metacat.

The new service will give clients Metacat usage information about data and metadata access events.


Requirements
------------

The statistics service should have an easy-to-learn API that allows query fields to be added
and that provides reports in XML and JSON.

Provided Statistics
___________________

The service will include the following statistics:

* Dataset views
* Package downloads
* Size in bytes of package downloads
* Citations

Results Filtering
_________________

Reports returned by the service must be filterable by the following fields:

* A PID or list of PIDs
* Creator or list of creators (DN, or ORCID, or some amalgam -- to be discussed)
* A time range of access event (upload, download, etc.)
* Spatial location of access event (upload, download, etc.)
* IP Address
* Accessor or list of accessors (DN, or ORCID, or some amalgam, needs ACL -- to be discussed)

Results Aggregation
___________________

Reports must be able to be aggregated by the following fields:

* User (DN, or ORCID, or some amalgam)
* Time range, aggregated to requested unit (day, week, month, year)
* Spatial range, aggregated to requested unit

Performance
___________

The query service should provide results quickly, as it will be used to construct the user dashboard and possibly other UI elements.

Statistics Service Solr Index
-----------------------------
Currently Metacat writes access information to the table ‘access_log’, which has the following fields:

===========  ===========================
name         data type
===========  ===========================
entryid      bigint
ip_address   character varying(512)
user_agent   character varying(512)
principal    character varying(512)
docid        character varying(250)
event        character varying(512)
date_logged  timestamp without time zone
===========  ===========================

In order to provide fast queries, aggregation and faceting of selected fields, access log information will be exported from the current
‘access_log’ table and from the ‘systemmetadata’ table into a new Solr index that will be configured in Metacat as a second Solr core.
The new Solr index will be based on access events and will contain the fields shown in the following table:

============  =========
name          data type
============  =========
id            str
datetime      date
event         str
location      location
pid           str
rightsHolder  str
principal     str
size          int
formatId      str
============  =========

A document in the new Solr index will look like the following example:

::

  <doc>
    <str name="id">2E3E8935-364E-4000-9357-6CE4E067D236</str>
    <date name="datetime">2014-01-01T01:01:01Z</date>
    <str name="event">read</str>
    <location name="location">45.17614,-93.87341</location>
    <str name="pid">sla.2.1</str>
    <str name="rightsHolder">uid=williams,o=unaffiliated,dc=ecoinformatics,dc=org</str>
    <str name="principal">uid=williams,o=unaffiliated,dc=ecoinformatics,dc=org</str>
    <int name="size">52273</int>
    <str name="formatId">eml://ecoinformatics.org/eml-2.0.1</str>
  </doc>

The second Solr core, which will contain the usage statistics, requires a modification to the existing solr.xml file:

::

  <solr persistent="false">
    <!--
    adminPath: RequestHandler path to manage cores.
    If 'null' (or absent), cores will not be manageable via request handler
    -->
    <cores adminPath="/admin/cores" defaultCoreName="collection1">
      <core name="collection1" instanceDir="." />
      <core name="stats" instanceDir="." />
    </cores>
  </solr>

A Java TimerTask will run the import method, which will read event records from the Metacat access_log table, combine them with
record data from the systemmetadata table, and write each combined entry to the stats Solr index. Access_log entry types such as
‘synchronization_failed’ and ‘replication’ will be filtered out and will not be written to the Solr index. The time of the last
record imported from access_log will be stored so that subsequent imports will start from the next unimported event record.
If required, the data may be aggregated by time interval, such as week or month.

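The import step described above can be sketched in a few lines. This is only an illustration in Python, not the planned Java implementation: the function name ``build_stats_docs``, the in-memory row dictionaries, the use of ``entryid`` as the document id, and the use of ``docid`` as the pid are all assumptions made for the example. The ``location`` field is left out because how it is derived is still an open question.

::

  from datetime import datetime

  # Event types that the design excludes from the stats index.
  EXCLUDED_EVENTS = {"synchronization_failed", "replication"}


  def build_stats_docs(access_rows, sysmeta_by_pid, last_imported=None):
      """Combine new access_log rows with system metadata.

      Returns (docs, newest): the documents to index and the timestamp of
      the newest row examined, to be stored for the next run.
      """
      docs = []
      newest = last_imported
      for row in sorted(access_rows, key=lambda r: r["date_logged"]):
          logged = row["date_logged"]
          if last_imported is not None and logged <= last_imported:
              continue  # already handled by an earlier import
          newest = logged
          if row["event"] in EXCLUDED_EVENTS:
              continue  # filtered out, but still counts as processed
          # Assumption: docid can be used directly as the pid lookup key.
          sysmeta = sysmeta_by_pid.get(row["docid"], {})
          docs.append({
              "id": str(row["entryid"]),  # hypothetical id scheme
              # Assumption: date_logged (no time zone) is treated as UTC.
              "datetime": logged.strftime("%Y-%m-%dT%H:%M:%SZ"),
              "event": row["event"],
              "pid": row["docid"],
              "rightsHolder": sysmeta.get("rightsHolder"),
              "principal": row["principal"],
              "size": sysmeta.get("size"),
              "formatId": sysmeta.get("formatId"),
          })
      return docs, newest


  # Made-up sample data, for illustration only.
  sample_rows = [
      {"entryid": 1, "docid": "sla.2.1", "event": "read",
       "principal": "uid=williams", "date_logged": datetime(2014, 1, 1, 1, 1, 1)},
      {"entryid": 2, "docid": "sla.2.1", "event": "replication",
       "principal": "uid=williams", "date_logged": datetime(2014, 1, 2, 0, 0, 0)},
  ]
  sample_sysmeta = {"sla.2.1": {"rightsHolder": "uid=williams", "size": 52273,
                                "formatId": "eml://ecoinformatics.org/eml-2.0.1"}}
  docs, mark = build_stats_docs(sample_rows, sample_sysmeta)

Passing the returned timestamp back in as ``last_imported`` on the next run skips everything already examined, which is the role of the stored import time in the paragraph above.
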

The statistics service will be exposed as a new query engine with a DataONE URL such as:

::

  https://hostname/knb/d1/mn/v1/query/stats/<query>

Queries will be passed to the new Solr query engine using the standard Solr query syntax.


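As a rough client-side illustration of this URL pattern, the following sketch assembles a stats query from a list of Solr parameters. The helper name ``stats_query_url`` is hypothetical, and the choice to percent-encode everything except ``*`` and ``:`` in parameter values is an assumption, not a documented requirement of the service.

::

  from urllib.parse import quote


  def stats_query_url(base_url, params):
      """Build a path-style stats query URL from (name, value) pairs.

      A list of pairs is used rather than a dict because Solr parameters
      such as fq may be repeated.
      """
      encoded = "&".join(
          f"{quote(name)}={quote(value, safe='*:')}" for name, value in params
      )
      return f"{base_url}/query/stats/{encoded}"


  url = stats_query_url(
      "https://hostname/knb/d1/mn/v1",
      [("q", "*:*"), ("fq", "event:read"), ("stats", "true"),
       ("stats.field", "size"), ("stats.facet", "pid"), ("rows", "0")],
  )

Reserved characters in values are percent-encoded by this helper, so a range gap of ``+1MONTH`` would be sent as ``%2B1MONTH``.
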

One new class, StatsQueryService, will be added to Metacat to handle stats queries. Figure 2 shows a call trace for a statistics
service query.

.. figure:: images/stats-query-sequence-diagram.png

   Figure 2. Statistics query sequence diagram.


The StatsQueryService class will transform the incoming query to Solr parameters, issue the query, and return the query result as a byte stream of text/html content.

Statistics Service Usage
------------------------

The following sections show some of the queries that will be available through the statistics service.

Usage of pids provided by a specified rights holder
___________________________________________________

The following example queries the download volume for pids whose rightsHolder is williams, with download size statistics aggregated by pid:

::

  http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=rightsHolder:uid=williams*&fq=event:read&stats=true&stats.field=size&rows=0&stats.facet=pid

The following result is returned:

::

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
  ...
    <result name="response" numFound="8" start="0"/>
    <lst name="stats">
      <lst name="stats_fields">
        <lst name="size">
          <double name="min">30.0</double>
          <double name="max">1000.0</double>
          <double name="sum">3150.0</double>
          <long name="count">8</long>
          <long name="missing">0</long>
          <double name="sumOfSquares">3004500.0</double>
          <double name="mean">393.75</double>
          <double name="stddev">502.0226944215627</double>
          <lst name="facets">
            <lst name="pid">
              <lst name="sla.3.1">
                <double name="min">1000.0</double>
                <double name="max">1000.0</double>
                <double name="sum">3000.0</double>
                <long name="count">3</long>
                <long name="missing">0</long>
                <double name="sumOfSquares">3000000.0</double>
                <double name="mean">1000.0</double>
                <double name="stddev">0.0</double>
              </lst>
              <lst name="sla.2.1">
                <double name="min">30.0</double>
                <double name="max">30.0</double>
                <double name="sum">150.0</double>
                <long name="count">5</long>
                <long name="missing">0</long>
                <double name="sumOfSquares">4500.0</double>
                <double name="mean">30.0</double>
                <double name="stddev">0.0</double>
              </lst>
            </lst>
          </lst>
        </lst>
      </lst>
    </lst>
  </response>

The previous query can be constrained to a specific time range by adding a filter on datetime, for example:

::

  &fq=datetime:[2013-01-01T23:59:59Z TO 2013-04-30T23:59:59Z]

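A client such as the user dashboard would need to pull the per-pid totals out of a stats response like the one in this section. The following is a minimal sketch using the Python standard library; the function name ``download_bytes_by_pid`` is hypothetical, and the element path assumes the response layout shown in the example (``stats.field=size`` with ``stats.facet=pid``).

::

  import xml.etree.ElementTree as ET

  # Location of the per-pid facet entries within a Solr stats response.
  PID_FACETS = (
      "./lst[@name='stats']/lst[@name='stats_fields']/lst[@name='size']"
      "/lst[@name='facets']/lst[@name='pid']/lst"
  )


  def download_bytes_by_pid(response_xml):
      """Map each pid in a stats response to the sum of its download sizes."""
      root = ET.fromstring(response_xml)
      return {
          facet.get("name"): float(facet.find("double[@name='sum']").text)
          for facet in root.findall(PID_FACETS)
      }

For the example response in this section, the function would report 3000.0 bytes for sla.3.1 and 150.0 bytes for sla.2.1.
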

Data uploads
____________

The following query counts data uploads by a specified user, grouped by format type:

::

  http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=rightsHolder:uid=williams*&fq=event:create&facet=true&facet.field=formatId&rows=0

::

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
  ...
    <result name="response" numFound="3" start="0"/>
    <lst name="facet_counts">
      <lst name="facet_queries"/>
      <lst name="facet_fields">
        <lst name="formatId">
          <int name="BIN">2</int>
          <int name="eml://ecoinformatics.org/eml-2.1.1">1</int>
          <int name="text/csv">0</int>
        </lst>
      </lst>
      <lst name="facet_dates"/>
      <lst name="facet_ranges"/>
    </lst>
  </response>

Data downloads
______________

The following query shows data download counts by a specific user for each month in 2013:

::

  http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=principal:williams&fq=event:read&fq=formatId:BIN&facet=true&facet.field=event&facet.range=datetime&facet.range.start=2013-01-01T01:01:01Z&facet.range.end=2013-12-31T23:59:59Z&facet.range.gap=%2B1MONTH

::

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
  ...
      <lst name="facet_ranges">
        <lst name="datetime">
          <lst name="counts">
            <int name="2013-01-01T01:01:01Z">0</int>
            <int name="2013-02-01T01:01:01Z">0</int>
            <int name="2013-03-01T01:01:01Z">0</int>
            <int name="2013-04-01T01:01:01Z">0</int>
            <int name="2013-05-01T01:01:01Z">0</int>
            <int name="2013-06-01T01:01:01Z">2</int>
            <int name="2013-07-01T01:01:01Z">1</int>
            <int name="2013-08-01T01:01:01Z">0</int>
            <int name="2013-09-01T01:01:01Z">0</int>
            <int name="2013-10-01T01:01:01Z">0</int>
            <int name="2013-11-01T01:01:01Z">0</int>
            <int name="2013-12-01T01:01:01Z">0</int>
          </lst>
          <str name="gap">+1MONTH</str>
          <date name="start">2013-01-01T01:01:01Z</date>
          <date name="end">2014-01-01T01:01:01Z</date>
        </lst>
      </lst>
    </lst>
  </response>

The following query shows EML metadata downloads by a specific user for each month in 2013:

::

  http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=principal:*williams*&fq=event:read&fq=formatId:*eml*&facet=true&facet.field=event&facet.range=datetime&facet.range.start=2013-01-01T01:01:01Z&facet.range.end=2013-12-31T23:59:59Z&facet.range.gap=%2B1MONTH

::

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
  ...
      <lst name="facet_ranges">
        <lst name="datetime">
          <lst name="counts">
            <int name="2013-01-01T01:01:01Z">0</int>
            <int name="2013-02-01T01:01:01Z">0</int>
            <int name="2013-03-01T01:01:01Z">0</int>
            <int name="2013-04-01T01:01:01Z">1</int>
            <int name="2013-05-01T01:01:01Z">1</int>
            <int name="2013-06-01T01:01:01Z">0</int>
            <int name="2013-07-01T01:01:01Z">2</int>
            <int name="2013-08-01T01:01:01Z">0</int>
            <int name="2013-09-01T01:01:01Z">0</int>
            <int name="2013-10-01T01:01:01Z">0</int>
            <int name="2013-11-01T01:01:01Z">0</int>
            <int name="2013-12-01T01:01:01Z">0</int>
          </lst>
          <str name="gap">+1MONTH</str>
          <date name="start">2013-01-01T01:01:01Z</date>
          <date name="end">2014-01-01T01:01:01Z</date>
        </lst>
      </lst>
    </lst>
  </response>

Unresolved Issues/Questions
---------------------------

1. How is the location of an event determined? What do we mean by location?
2. Currently Solr (3.x and 4.x) doesn’t allow the stats component to facet by date/time interval, so it isn't possible to use it to calculate total download volume for each interval in a time range, such as every month for the last 10 years, in a single query. Therefore, for calculated amounts, a query for each time interval is required.
3. Where will citation info come from? Do we import this into the Solr index?
4. Are there text fields that the statistics service should include, e.g. do we want to provide statistics for queries such as "how many pids were downloaded that mention kelp?"

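Issue 2 implies that a client wanting monthly download volume has to issue one stats query per month. The following sketch generates the twelve ``datetime`` filter queries for a year; the function name ``monthly_filters`` is hypothetical, and using inclusive ranges that end on the last second of each month is an assumption chosen to keep the syntax compatible with both Solr 3.x and 4.x.

::

  from datetime import datetime, timedelta

  SOLR_TIME = "%Y-%m-%dT%H:%M:%SZ"


  def monthly_filters(year):
      """Return one inclusive datetime filter query per month of the year."""
      filters = []
      for month in range(1, 13):
          start = datetime(year, month, 1)
          # First instant of the following month (January of the next
          # year when the current month is December).
          next_start = datetime(year + month // 12, month % 12 + 1, 1)
          end = next_start - timedelta(seconds=1)
          filters.append(
              f"datetime:[{start.strftime(SOLR_TIME)} TO {end.strftime(SOLR_TIME)}]"
          )
      return filters


  filters_2013 = monthly_filters(2013)

Each generated filter would be added as an extra ``fq`` parameter to a ``stats=true&stats.field=size`` query, giving one request, and one download total, per month.
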