..
  @startuml images/stats-activity-diagram.png
  (*) --> "Initialize event log timer"
  --> "read an event_log entry"
  --> "read system metadata"
  --> "write to stats Solr index"
  @enduml

..
  @startuml images/stats-query-sequence-diagram.png
  participant client
  client -> MNRestServlet : doGet(request)
  activate MNRestServlet
  MNRestServlet -> MNResourceHandler: handle(get)
  activate MNResourceHandler
  MNResourceHandler -> MNResourceHandler: doQuery(engine, query)
  MNResourceHandler -> MNodeService: query(engine, query)
  activate MNodeService
  MNodeService -> StatsQueryService: query(query, subjects)
  activate StatsQueryService
  StatsQueryService -> SolrServer: query(query)
  activate SolrServer
  SolrServer -> StatsQueryService: inputstream
  deactivate SolrServer
  StatsQueryService -> MNodeService: inputstream
  deactivate StatsQueryService
  MNodeService -> MNResourceHandler: inputstream
  deactivate MNodeService
  MNResourceHandler -> MNRestServlet: response
  deactivate MNResourceHandler
  MNRestServlet -> client: response
  deactivate MNRestServlet
  @enduml

Metacat Usage Statistics Service
================================

Overview
--------

This document describes a proposed usage statistics service for Metacat.

This new service will provide clients with Metacat usage information about data and metadata access events.

Requirements
------------

The statistics service should have an easy-to-learn API that allows query fields to be added
and provides reports in XML and JSON.

Provided Statistics
___________________

The service will include the following statistics:

* Dataset views
* Package downloads
* Size in bytes of package downloads
* Citations

Results Filtering
_________________

Reports returned by the service must be filterable by the following fields:

* A PID or list of PIDs
* Creator or list of creators (DN, or ORCID, or some amalgam -- to be discussed)
* A time range of access event (upload, download, etc.)
* Spatial location of access event (upload, download, etc.)
* IP Address
* Accessor or list of accessors (DN, or ORCID, or some amalgam, needs ACL -- to be discussed)

Results Aggregation
___________________

Reports must support aggregation by the following fields:

* User (DN, or ORCID, or some amalgam)
* Time range, aggregated to requested unit (day, week, month, year)
* Spatial range, aggregated to requested unit

Performance
___________

The query service should provide results quickly, as it will be used to construct the user dashboard and possibly other UI elements.

Statistics Service Solr Index
-----------------------------

Currently Metacat writes access information to the ‘access_log’ table, which has the following fields:

=========== ===========================
name        data type
----------- ---------------------------
entryid     bigint
ip_address  character varying(512)
user_agent  character varying(512)
principal   character varying(512)
docid       character varying(250)
event       character varying(512)
date_logged timestamp without time zone
=========== ===========================

To provide fast queries, aggregation, and faceting of selected fields, access log information will be exported from the current
‘access_log’ table and from the ‘systemmetadata’ table into a new Solr index that will be configured in Metacat as a second
Solr core. The new Solr index will be based on access events and will contain the fields shown in the following table:

============== ===========
name           data type
-------------- -----------
id             str
datetime       date
event          str
location       location
pid            str
rightsHolder   str
principal      str
size           int
formatId       str
============== ===========

For example, a document in the new Solr index will look like this:

::

  <doc>
    <str name="id">2E3E8935-364E-4000-9357-6CE4E067D236</str>
    <date name="datetime">2014-01-01T01:01:01Z</date>
    <str name="event">read</str>
    <location name="location">45.17614,-93.87341</location>
    <str name="pid">sla.2.1</str>
    <str name="rightsHolder">uid=williams,o=unaffiliated,dc=ecoinformatics,dc=org</str>
    <str name="principal">uid=williams,o=unaffiliated,dc=ecoinformatics,dc=org</str>
    <int name="size">52273</int>
    <str name="formatId">eml://ecoinformatics.org/eml-2.0.1</str>
  </doc>

The second Solr core that will contain usage statistics will require a modification to the existing solr.xml file:

::

  <solr persistent="false">
    <!--
    adminPath: RequestHandler path to manage cores.
      If 'null' (or absent), cores will not be manageable via request handler
    -->
    <cores adminPath="/admin/cores" defaultCoreName="collection1">
      <core name="collection1" instanceDir="." />
      <core name="stats" instanceDir="."/>
    </cores>
  </solr>

A Java TimerTask will run the import method, which will read event records from the Metacat access_log table, combine them
with record data from the systemmetadata table, and write each combined entry to the stats Solr index. Access_log entry types
such as ‘synchronization_failed’ and ‘replication’ will be filtered out and will not be written to the Solr index. The time of
the last record imported from access_log will be stored so that subsequent imports start from the next unimported event
record. If required, the data may be aggregated by time interval, such as week or month.

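The import step described above can be sketched in a few lines. This is only an illustration of the filter-and-merge logic, written in Python for brevity rather than as the Java TimerTask itself. The function name ``build_stats_docs``, the in-memory ``sysmeta_by_pid`` lookup, the use of ``entryid`` as the Solr ``id``, and the direct use of ``docid`` as ``pid`` are all assumptions, not part of the proposal. The ``location`` field is omitted because its source is still an open question.

```python
from datetime import datetime
from typing import Optional

# Event types that are skipped during import (named in the text above).
EXCLUDED_EVENTS = {"synchronization_failed", "replication"}


def build_stats_docs(access_rows, sysmeta_by_pid, last_imported: Optional[datetime]):
    """Combine new access_log rows with system metadata into stats documents.

    access_rows: iterable of dicts keyed by the access_log column names.
    sysmeta_by_pid: hypothetical dict of system metadata keyed by identifier.
    last_imported: timestamp of the last event already imported, or None.
    Returns (docs, new_last_imported).
    """
    docs = []
    newest = last_imported
    for row in sorted(access_rows, key=lambda r: r["date_logged"]):
        if last_imported is not None and row["date_logged"] <= last_imported:
            continue  # already handled by a previous import run
        newest = row["date_logged"]
        if row["event"] in EXCLUDED_EVENTS:
            continue  # not written to the index, but the marker still advances
        meta = sysmeta_by_pid.get(row["docid"], {})
        docs.append({
            "id": str(row["entryid"]),  # assumption: entryid reused as document id
            "datetime": row["date_logged"].strftime("%Y-%m-%dT%H:%M:%SZ"),
            "event": row["event"],
            "pid": row["docid"],  # assumption: docid maps directly to pid
            "principal": row["principal"],
            "rightsHolder": meta.get("rightsHolder"),
            "size": meta.get("size"),
            "formatId": meta.get("formatId"),
        })
    return docs, newest
```

Comparing on the timestamp alone would skip events that share the exact timestamp of the last imported record, so a real implementation might track ``entryid`` instead.
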
The statistics service will be exposed as a new query engine with a DataONE URL such as:

::

  https://hostname/knb/d1/mn/v1/query/stats/<query>

Queries will be passed to the new Solr query engine using the standard Solr query syntax.

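A client could assemble such a URL from individual Solr parameters. The following Python sketch uses only the standard library. The helper name ``stats_query_url`` and the ``STATS_ENDPOINT`` constant are hypothetical, and it is an assumption that the service accepts percent-encoded parameter values appended directly to the ``stats/`` path as in the URL pattern above.

```python
from urllib.parse import urlencode

# Placeholder base URL; the host name and context depend on the deployment.
STATS_ENDPOINT = "https://hostname/knb/d1/mn/v1/query/stats/"


def stats_query_url(filters, extra_params=(), endpoint=STATS_ENDPOINT):
    """Return a stats query URL with one fq parameter per filter.

    filters: Solr filter queries such as "event:read".
    extra_params: additional (name, value) pairs such as ("rows", "0").
    """
    params = [("q", "*:*")]
    params += [("fq", fq) for fq in filters]
    params += list(extra_params)
    # Leave '*' and ':' readable; other reserved characters are percent-encoded.
    return endpoint + urlencode(params, safe="*:")


if __name__ == "__main__":
    print(stats_query_url(
        ["rightsHolder:uid=williams*", "event:read"],
        [("stats", "true"), ("stats.field", "size"), ("rows", "0")],
    ))
```

Passing the parameters as a list of pairs keeps repeated ``fq`` entries in order, which a plain dictionary could not represent.
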
One new class, StatsQueryService, will be added to Metacat to handle stats queries. Figure 2 shows a call trace for a statistics
service query.

.. figure:: images/stats-query-sequence-diagram.png

   Figure 2. Statistics query sequence diagram.

The StatsQueryService class will transform the incoming query to Solr parameters, issue the query, and return the query result as a byte stream of text/html content.

Statistics Service Usage
------------------------

The following sections show some of the queries that will be available through the statistics service.

Usage of pids provided by a specified rights holder
___________________________________________________

The following example shows a query for download volume for pids created by rightsHolder=williams with download size statistics aggregated by pid:

::

  http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=rightsHolder:uid=williams*&fq=event:read&stats=true&stats.field=size&rows=0&stats.facet=pid

The following result is returned:

::

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
  ...
    <result name="response" numFound="8" start="0"/>
    <lst name="stats">
      <lst name="stats_fields">
        <lst name="size">
          <double name="min">30.0</double>
          <double name="max">1000.0</double>
          <double name="sum">3150.0</double>
          <long name="count">8</long>
          <long name="missing">0</long>
          <double name="sumOfSquares">3004500.0</double>
          <double name="mean">393.75</double>
          <double name="stddev">502.0226944215627</double>
          <lst name="facets">
            <lst name="pid">
              <lst name="sla.3.1">
                <double name="min">1000.0</double>
                <double name="max">1000.0</double>
                <double name="sum">3000.0</double>
                <long name="count">3</long>
                <long name="missing">0</long>
                <double name="sumOfSquares">3000000.0</double>
                <double name="mean">1000.0</double>
                <double name="stddev">0.0</double>
              </lst>
              <lst name="sla.2.1">
                <double name="min">30.0</double>
                <double name="max">30.0</double>
                <double name="sum">150.0</double>
                <long name="count">5</long>
                <long name="missing">0</long>
                <double name="sumOfSquares">4500.0</double>
                <double name="mean">30.0</double>
                <double name="stddev">0.0</double>
              </lst>
            </lst>
          </lst>
        </lst>
      </lst>
    </lst>
  </response>

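A client that only needs the per-pid totals can pull them out of this response with a standard XML parser. The sketch below uses Python's ``xml.etree.ElementTree``. The function name ``download_volume_by_pid`` is hypothetical, and the element layout is assumed to match the sample response shown above.

```python
import xml.etree.ElementTree as ET


def download_volume_by_pid(response_xml, field="size", facet="pid"):
    """Return {pid: (download_count, total_bytes)} from a stats.facet response."""
    root = ET.fromstring(response_xml)
    # Path to the per-facet-value <lst> elements under the stats field.
    path = (
        "./lst[@name='stats']/lst[@name='stats_fields']"
        f"/lst[@name='{field}']/lst[@name='facets']/lst[@name='{facet}']/lst"
    )
    volumes = {}
    for entry in root.findall(path):
        count = int(entry.find("long[@name='count']").text)
        total = float(entry.find("double[@name='sum']").text)
        volumes[entry.get("name")] = (count, total)
    return volumes
```

Applied to the sample response, this would report three downloads totalling 3000 bytes for ``sla.3.1`` and five downloads totalling 150 bytes for ``sla.2.1``.
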
The previous query can be constrained to a specific time period by adding a time range filter, for example:

::

  &fq=datetime:[2013-01-01T23:59:59Z TO 2013-04-30T23:59:59Z]

Data uploads
____________

The following query shows counts of data uploads by format type by a specified user:

::

  http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=rightsHolder:uid=williams*&fq=event:create&facet=true&facet.field=formatId&rows=0

The following result is returned:

::

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
  ...
    <result name="response" numFound="3" start="0"/>
    <lst name="facet_counts">
      <lst name="facet_queries"/>
      <lst name="facet_fields">
        <lst name="formatId">
          <int name="BIN">2</int>
          <int name="eml://ecoinformatics.org/eml-2.1.1">1</int>
          <int name="text/csv">0</int>
        </lst>
      </lst>
      <lst name="facet_dates"/>
      <lst name="facet_ranges"/>
    </lst>
  </response>

Data downloads
______________

The following query shows data download counts by a specific user for each month in 2013:

::

  http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=principal:williams&fq=event:read&fq=formatId:BIN&facet=true&facet.field=event&facet.range=datetime&facet.range.start=2013-01-01T01:01:01Z&facet.range.end=2013-12-31T23:59:59Z&facet.range.gap=%2B1MONTH

The following result is returned:

::

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
  ...
      <lst name="facet_ranges">
        <lst name="datetime">
          <lst name="counts">
            <int name="2013-01-01T01:01:01Z">0</int>
            <int name="2013-02-01T01:01:01Z">0</int>
            <int name="2013-03-01T01:01:01Z">0</int>
            <int name="2013-04-01T01:01:01Z">0</int>
            <int name="2013-05-01T01:01:01Z">0</int>
            <int name="2013-06-01T01:01:01Z">2</int>
            <int name="2013-07-01T01:01:01Z">1</int>
            <int name="2013-08-01T01:01:01Z">0</int>
            <int name="2013-09-01T01:01:01Z">0</int>
            <int name="2013-10-01T01:01:01Z">0</int>
            <int name="2013-11-01T01:01:01Z">0</int>
            <int name="2013-12-01T01:01:01Z">0</int>
          </lst>
          <str name="gap">+1MONTH</str>
          <date name="start">2013-01-01T01:01:01Z</date>
          <date name="end">2014-01-01T01:01:01Z</date>
        </lst>
      </lst>
    </lst>
  </response>

The following query shows EML metadata downloads by a specific user for each month in 2013:

::

  http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=principal:*williams*&fq=event:read&fq=formatId:*eml*&facet=true&facet.field=event&facet.range=datetime&facet.range.start=2013-01-01T01:01:01Z&facet.range.end=2013-12-31T23:59:59Z&facet.range.gap=%2B1MONTH

The following result is returned:

::

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
  ...
      <lst name="facet_ranges">
        <lst name="datetime">
          <lst name="counts">
            <int name="2013-01-01T01:01:01Z">0</int>
            <int name="2013-02-01T01:01:01Z">0</int>
            <int name="2013-03-01T01:01:01Z">0</int>
            <int name="2013-04-01T01:01:01Z">1</int>
            <int name="2013-05-01T01:01:01Z">1</int>
            <int name="2013-06-01T01:01:01Z">0</int>
            <int name="2013-07-01T01:01:01Z">2</int>
            <int name="2013-08-01T01:01:01Z">0</int>
            <int name="2013-09-01T01:01:01Z">0</int>
            <int name="2013-10-01T01:01:01Z">0</int>
            <int name="2013-11-01T01:01:01Z">0</int>
            <int name="2013-12-01T01:01:01Z">0</int>
          </lst>
          <str name="gap">+1MONTH</str>
          <date name="start">2013-01-01T01:01:01Z</date>
          <date name="end">2014-01-01T01:01:01Z</date>
        </lst>
      </lst>
    </lst>
  </response>

Unresolved Issues/Questions
---------------------------

1. How is the location of an event determined? What do we mean by location?
2. Currently Solr (3.x and 4.x) doesn’t allow the stats component to facet by date/time interval, so it isn't possible to use it to calculate total download volume for each interval over a time range, such as every month for the last 10 years. Therefore, for calculated amounts, a separate query is required for each time interval.
3. Where will citation info come from? Do we import this into the Solr index?
4. Are there text fields that the statistics service should include, e.g. do we want to provide statistics for queries such as "how many pids were downloaded that mention kelp?"?
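
For item 2, one possible workaround is for the client to issue one stats query per interval. The Python sketch below generates one ``datetime`` range filter per calendar month, each of which could be added as an ``fq`` parameter to a stats query. The function name ``monthly_ranges`` is hypothetical, and the ranges assume that event timestamps have one-second resolution.

```python
from datetime import datetime, timedelta

SOLR_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def monthly_ranges(start_year, start_month, months):
    """Yield one Solr range filter on the datetime field per calendar month."""
    year, month = start_year, start_month
    for _ in range(months):
        # Work out the first day of the following month, rolling over in December.
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        first = datetime(year, month, 1)
        # The last second of the month is one second before the next month starts.
        last = datetime(next_year, next_month, 1) - timedelta(seconds=1)
        yield "datetime:[{} TO {}]".format(
            first.strftime(SOLR_DATE_FORMAT), last.strftime(SOLR_DATE_FORMAT)
        )
        year, month = next_year, next_month
```

Covering the last 10 years at monthly resolution would take 120 such queries, which is worth weighing against the performance requirement above.
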