..
   @startuml images/stats-activity-diagram.png
   (*) --> "Initialize event log timer"
   --> "read an event_log entry"
   --> "read system metadata"
   --> "write to stats Solr index"
   @enduml

..
   @startuml images/stats-query-sequence-diagram.png
   participant client
   client -> MNRestServlet : doGet(request)
   activate MNRestServlet
   MNRestServlet -> MNResourceHandler: handle(get)
   activate MNResourceHandler
   MNResourceHandler -> MNResourceHandler: doQuery(engine, query)
   MNResourceHandler -> MNodeService: query(engine, query)
   activate MNodeService
   MNodeService -> StatsQueryService: query(query, subjects)
   activate StatsQueryService
   StatsQueryService -> SolrServer: query(query)
   activate SolrServer
   SolrServer -> StatsQueryService: inputstream
   deactivate SolrServer
   StatsQueryService -> MNodeService: inputstream
   deactivate StatsQueryService
   MNodeService -> MNResourceHandler: inputstream
   deactivate MNodeService
   MNResourceHandler -> MNRestServlet: response
   deactivate MNResourceHandler
   MNRestServlet -> client: response
   deactivate MNRestServlet
   @enduml

Metacat Usage Statistics Service
================================

Overview
--------
This document describes a proposed usage statistics service for Metacat.

The new service will provide clients with Metacat usage information about data and metadata access events.


Requirements
------------

The statistics service should have an easy-to-learn API that allows query fields to be added,
and it should provide reports in XML and JSON.

Provided Statistics
___________________

The service will include the following statistics:

* Dataset views
* Package downloads
* Size in bytes of package downloads
* Citations

Results Filtering
_________________

Reports returned by the service must be filterable by the following fields:

* A PID or list of PIDs
* Creator or list of creators (DN, or ORCID, or some amalgam -- to be discussed)
* A time range of access event (upload, download, etc.)
* Spatial location of access event (upload, download, etc.)
* IP Address
* Accessor or list of accessors (DN, or ORCID, or some amalgam, needs ACL -- to be discussed)

Results Aggregation
___________________

Reports must be able to be aggregated by the following fields:

* User (DN, or ORCID, or some amalgam)
* Time range, aggregated to requested unit (day, week, month, year)
* Spatial range, aggregated to requested unit

Performance
___________

The query service should provide results quickly, as it will be used to construct the user dashboard and possibly other UI elements.


Statistics Service Solr Index
-----------------------------
Currently Metacat writes access information to the table ‘access_log’, which has the following fields:

=========== ===========================
name        data type
=========== ===========================
entryid     bigint
ip_address  character varying(512)
user_agent  character varying(512)
principal   character varying(512)
docid       character varying(250)
event       character varying(512)
date_logged timestamp without time zone
=========== ===========================

In order to provide fast queries, aggregation and faceting of selected fields, access log information will be exported from the current
‘access_log’ table and from
the ‘systemmetadata’ table into a new Solr index that will be configured in Metacat as a second Solr core. The new Solr index will
be based on access events and will contain the fields shown in the following table:

============== ===========
name           data type
============== ===========
id             str
datetime       date
event          str
location       location
pid            str
rightsHolder   str
principal      str
size           int
formatId       str
============== ===========

A document in the new Solr index will look like the following example:

::

  <doc>
    <str name="id">2E3E8935-364E-4000-9357-6CE4E067D236</str>
    <date name="datetime">2014-01-01T01:01:01Z</date>
    <str name="event">read</str>
    <location name="location">45.17614,-93.87341</location>
    <str name="pid">sla.2.1</str>
    <str name="rightsHolder">uid=williams,o=unaffiliated,dc=ecoinformatics,dc=org</str>
    <str name="principal">uid=williams,o=unaffiliated,dc=ecoinformatics,dc=org</str>
    <int name="size">52273</int>
    <str name="formatId">eml://ecoinformatics.org/eml-2.0.1</str>
  </doc>


The second Solr core, which will contain the usage statistics, will require a modification to the existing solr.xml file:

::

  <solr persistent="false">
    <!--
    adminPath: RequestHandler path to manage cores.
    If 'null' (or absent), cores will not be manageable via request handler
    -->
    <cores adminPath="/admin/cores" defaultCoreName="collection1">
      <core name="collection1" instanceDir="." />
      <core name="stats" instanceDir="."/>
    </cores>
  </solr>


A Java TimerTask will run the import method, which will read event records from the Metacat access_log table, combine them with
record data from the systemmetadata table,
and write each combined entry to the stats Solr index. Access_log entry types such as ‘synchronization_failed’ and ‘replication’
will be filtered out and
will not be written to the Solr index. The time of the last record imported from access_log will be stored so that subsequent
imports start from the next unimported event record. If required, the data may be aggregated by time interval, such as week or
month.

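The import step described above can be sketched as follows. This is a minimal Python illustration of the filtering and "last imported" bookkeeping only, not the proposed Java TimerTask; the function name ``import_events``, the ``sysmeta_by_pid`` lookup, and the use of ``entryid`` as the document id are assumptions made for the example.

```python
from datetime import datetime

# Event types the text says are excluded from the stats index.
EXCLUDED_EVENTS = {"synchronization_failed", "replication"}


def import_events(access_rows, sysmeta_by_pid, last_imported):
    """Return (stats_docs, new_last_imported) for rows logged after last_imported.

    access_rows: dicts keyed by access_log column names.
    sysmeta_by_pid: hypothetical lookup of systemmetadata values keyed by docid.
    """
    docs = []
    newest = last_imported
    for row in sorted(access_rows, key=lambda r: r["date_logged"]):
        if row["date_logged"] <= last_imported:
            continue  # already handled by an earlier run
        newest = row["date_logged"]  # the stored time advances even for filtered rows
        if row["event"] in EXCLUDED_EVENTS:
            continue  # filtered out, never written to the index
        meta = sysmeta_by_pid.get(row["docid"], {})
        docs.append({
            "id": str(row["entryid"]),  # assumption: reuse entryid as the Solr id
            "datetime": row["date_logged"].strftime("%Y-%m-%dT%H:%M:%SZ"),
            "event": row["event"],
            "pid": row["docid"],
            "principal": row["principal"],
            "rightsHolder": meta.get("rightsHolder"),
            "size": meta.get("size"),
            "formatId": meta.get("formatId"),
        })
    return docs, newest
```

The returned time would be persisted between runs so that the next import resumes where this one stopped. Comparing on the timestamp alone assumes that no two access_log rows share the same ``date_logged`` value across runs; comparing on ``entryid`` instead would avoid that assumption.
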

The statistics service will be exposed as a new query engine with a DataONE URL such as:

::

  https://hostname/knb/d1/mn/v1/query/stats/<query>

Queries will be passed to the new Solr query engine using the standard Solr query syntax.

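As an illustration of how a client might assemble such a URL, the sketch below percent-encodes Solr parameters and appends them to the path shown above. The helper name ``stats_query_url`` is hypothetical, and the assumption that the query string follows ``/query/stats/`` directly is taken from the URL pattern in this document.

```python
from urllib.parse import urlencode


def stats_query_url(host, params):
    """Build a stats query URL from (name, value) pairs.

    A list of pairs is used rather than a dict so that repeated
    parameters such as fq keep their order.
    """
    # Leave '*' and ':' readable; everything else unsafe is percent-encoded.
    query = urlencode(params, safe="*:")
    return f"https://{host}/knb/d1/mn/v1/query/stats/{query}"
```

For example, a ``facet.range.gap`` value of ``+1MONTH`` is emitted as ``%2B1MONTH``, so the plus sign is not read as a space by the server.
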

One new class, StatsQueryService, will be added to Metacat to handle stats queries. Figure 2 shows a call trace for a statistics
service query.

.. figure:: images/stats-query-sequence-diagram.png

   Figure 2. Statistics query sequence diagram.

The StatsQueryService class will transform the incoming query to Solr parameters, issue the query, and return the query result as a byte stream of text/html content.


Statistics Service Usage
------------------------

The following sections show some of the queries that will be available through the statistics service.

Usage of pids provided by a specified rights holder
___________________________________________________

The following example queries the download volume for pids whose rightsHolder matches uid=williams*, with download size statistics aggregated by pid:

::

  http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=rightsHolder:uid=williams*&fq=event:read&stats=true&stats.field=size&rows=0&stats.facet=pid

The following result is returned:

::

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
    ...
    <result name="response" numFound="8" start="0"/>
    <lst name="stats">
      <lst name="stats_fields">
        <lst name="size">
          <double name="min">30.0</double>
          <double name="max">1000.0</double>
          <double name="sum">3150.0</double>
          <long name="count">8</long>
          <long name="missing">0</long>
          <double name="sumOfSquares">3004500.0</double>
          <double name="mean">393.75</double>
          <double name="stddev">502.0226944215627</double>
          <lst name="facets">
            <lst name="pid">
              <lst name="sla.3.1">
                <double name="min">1000.0</double>
                <double name="max">1000.0</double>
                <double name="sum">3000.0</double>
                <long name="count">3</long>
                <long name="missing">0</long>
                <double name="sumOfSquares">3000000.0</double>
                <double name="mean">1000.0</double>
                <double name="stddev">0.0</double>
              </lst>
              <lst name="sla.2.1">
                <double name="min">30.0</double>
                <double name="max">30.0</double>
                <double name="sum">150.0</double>
                <long name="count">5</long>
                <long name="missing">0</long>
                <double name="sumOfSquares">4500.0</double>
                <double name="mean">30.0</double>
                <double name="stddev">0.0</double>
              </lst>
            </lst>
          </lst>
        </lst>
      </lst>
    </lst>
  </response>

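A client such as the user dashboard would typically need only the per-pid totals from this response. The following sketch, using Python's standard ``xml.etree.ElementTree`` module, shows one way to extract them; the function name ``download_bytes_by_pid`` is hypothetical, and the response is assumed to be complete, well-formed XML (the ``...`` elision above would have to be replaced by the real content).

```python
import xml.etree.ElementTree as ET


def download_bytes_by_pid(response_xml, field="size", facet="pid"):
    """Map each facet value (pid) to the 'sum' statistic of the given stats field."""
    root = ET.fromstring(response_xml)
    # Walk stats -> stats_fields -> <field> -> facets -> <facet> -> one <lst> per pid.
    path = (
        f"./lst[@name='stats']/lst[@name='stats_fields']/lst[@name='{field}']"
        f"/lst[@name='facets']/lst[@name='{facet}']/lst"
    )
    totals = {}
    for entry in root.findall(path):
        total = entry.find("./double[@name='sum']")
        totals[entry.get("name")] = float(total.text)
    return totals
```

Applied to the response above, this would yield one entry per pid, with the ``sum`` values 3000.0 for sla.3.1 and 150.0 for sla.2.1.
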

The previous query can be constrained to a specific time period by adding a time range filter, e.g.:

::

  &fq=datetime:[2013-01-01T23:59:59Z TO 2013-04-30T23:59:59Z]

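A time range filter of this kind contains brackets and spaces, which need percent-encoding when placed in a URL. The sketch below builds the encoded parameter from two ``datetime`` values; the helper name ``datetime_range_filter`` is hypothetical, and naive ``datetime`` objects are assumed to be in UTC. Constructing the bounds as ``datetime`` objects also rejects impossible calendar dates before a query is sent.

```python
from datetime import datetime
from urllib.parse import quote


def datetime_range_filter(start, end, field="datetime"):
    """Return an encoded '&fq=field:[start TO end]' parameter (inclusive bounds)."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # assumption: naive datetimes are already UTC
    clause = f"{field}:[{start.strftime(fmt)} TO {end.strftime(fmt)}]"
    # Keep ':' readable; brackets and spaces become %5B, %5D and %20.
    return "&fq=" + quote(clause, safe=":")
```
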

Data uploads
____________

The following query shows counts of data uploads by format type by a specified user:

::

  http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=rightsHolder:uid=williams*&fq=event:create&facet=true&facet.field=formatId&rows=0

::

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
    ...
    <result name="response" numFound="3" start="0"/>
    <lst name="facet_counts">
      <lst name="facet_queries"/>
      <lst name="facet_fields">
        <lst name="formatId">
          <int name="BIN">2</int>
          <int name="eml://ecoinformatics.org/eml-2.1.1">1</int>
          <int name="text/csv">0</int>
        </lst>
      </lst>
      <lst name="facet_dates"/>
      <lst name="facet_ranges"/>
    </lst>
  </response>


Data downloads
______________

The following query shows data download counts by a specific user for each month in 2013:

::

  http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=principal:williams&fq=event:read&fq=formatId:BIN&facet=true&facet.field=event&facet.range=datetime&facet.range.start=2013-01-01T01:01:01Z&facet.range.end=2013-12-31T23:59:59Z&facet.range.gap=%2B1MONTH

::

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
    ...
      <lst name="facet_ranges">
        <lst name="datetime">
          <lst name="counts">
            <int name="2013-01-01T01:01:01Z">0</int>
            <int name="2013-02-01T01:01:01Z">0</int>
            <int name="2013-03-01T01:01:01Z">0</int>
            <int name="2013-04-01T01:01:01Z">0</int>
            <int name="2013-05-01T01:01:01Z">0</int>
            <int name="2013-06-01T01:01:01Z">2</int>
            <int name="2013-07-01T01:01:01Z">1</int>
            <int name="2013-08-01T01:01:01Z">0</int>
            <int name="2013-09-01T01:01:01Z">0</int>
            <int name="2013-10-01T01:01:01Z">0</int>
            <int name="2013-11-01T01:01:01Z">0</int>
            <int name="2013-12-01T01:01:01Z">0</int>
          </lst>
          <str name="gap">+1MONTH</str>
          <date name="start">2013-01-01T01:01:01Z</date>
          <date name="end">2014-01-01T01:01:01Z</date>
        </lst>
      </lst>
    </lst>
  </response>

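To turn a range-facet response like this into values a chart can use, the counts can be read into a mapping keyed by month. This is a minimal sketch using Python's standard ``xml.etree.ElementTree`` module; the function name ``monthly_counts`` is hypothetical, and the input is assumed to be the complete response rather than the abbreviated listing shown in this document.

```python
import xml.etree.ElementTree as ET


def monthly_counts(response_xml, field="datetime"):
    """Map 'YYYY-MM' to the range-facet count for each bucket of the given field."""
    root = ET.fromstring(response_xml)
    path = (
        f".//lst[@name='facet_ranges']/lst[@name='{field}']"
        "/lst[@name='counts']/int"
    )
    # Each <int> is named after the bucket start time; keep only year and month.
    return {entry.get("name")[:7]: int(entry.text) for entry in root.findall(path)}
```

Truncating the bucket name to ``YYYY-MM`` only makes sense for a ``+1MONTH`` gap; other gaps would need a different key.
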

The following query shows EML metadata downloads by a specific user for each month in 2013:

::

  http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=principal:*williams*&fq=event:read&fq=formatId:*eml*&facet=true&facet.field=event&facet.range=datetime&facet.range.start=2013-01-01T01:01:01Z&facet.range.end=2013-12-31T23:59:59Z&facet.range.gap=%2B1MONTH

::

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
    ...
      <lst name="facet_ranges">
        <lst name="datetime">
          <lst name="counts">
            <int name="2013-01-01T01:01:01Z">0</int>
            <int name="2013-02-01T01:01:01Z">0</int>
            <int name="2013-03-01T01:01:01Z">0</int>
            <int name="2013-04-01T01:01:01Z">1</int>
            <int name="2013-05-01T01:01:01Z">1</int>
            <int name="2013-06-01T01:01:01Z">0</int>
            <int name="2013-07-01T01:01:01Z">2</int>
            <int name="2013-08-01T01:01:01Z">0</int>
            <int name="2013-09-01T01:01:01Z">0</int>
            <int name="2013-10-01T01:01:01Z">0</int>
            <int name="2013-11-01T01:01:01Z">0</int>
            <int name="2013-12-01T01:01:01Z">0</int>
          </lst>
          <str name="gap">+1MONTH</str>
          <date name="start">2013-01-01T01:01:01Z</date>
          <date name="end">2014-01-01T01:01:01Z</date>
        </lst>
      </lst>
    </lst>
  </response>


Unresolved Issues/Questions
---------------------------

1. How is the location of an event determined? What do we mean by location?
2. Currently Solr (3.x and 4.x) doesn't allow faceting by date/time interval within the stats component, so it isn't possible to use the stats component to calculate total download volume for each time interval over a time range, such as every month for the last 10 years. Therefore, for calculated amounts, a query for each time interval is required.
3. Where will citation info come from? Do we import this into the Solr index?
4. Are there text fields that the statistics service should include, i.e. do we want to provide statistics for queries such as "how many pids were downloaded that mention kelp?"
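The one-query-per-interval workaround mentioned in issue 2 could be driven by a list of monthly filter clauses, each of which would be combined with ``stats=true&stats.field=size`` in a separate request. The sketch below generates those clauses for one year; the function name ``month_filters`` is hypothetical, and the inclusive upper bound assumes that access events are recorded at whole-second resolution.

```python
from datetime import datetime, timedelta


def month_filters(year, field="datetime"):
    """Return twelve inclusive Solr range clauses, one per calendar month of the year."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    clauses = []
    for month in range(1, 13):
        start = datetime(year, month, 1)
        # First instant of the following month, rolling over to January after December.
        next_start = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
        # Assumption: timestamps have whole-second resolution, so the last
        # second of the month is a safe inclusive upper bound.
        end = next_start - timedelta(seconds=1)
        clauses.append(f"{field}:[{start.strftime(fmt)} TO {end.strftime(fmt)}]")
    return clauses
```

Ten years at monthly resolution would require 120 such requests, which is the cost the issue above is pointing at.
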