Revision 8586

Design document for the new Metacat statistics service

View differences:

docs/user/metacat/source/development.rst
17 17

  
18 18
   identifiers
19 19
   doi
20
   statistics-service
20 21

  
docs/user/metacat/source/statistics-service.rst
1

  
2
..
3
  @startuml images/stats-activity-diagram.png
4
    (*) --> "Initialize event log timer"
5
    --> "read an event_log entry"
6
	--> "read system metadata"
7
	--> "write to stats Solr index"
8
  @enduml
9

  
10
..
11
  @startuml images/stats-query-sequence-diagram.png
12
	participant client
13
	client -> MNRestServlet : doGet(request)
14
	activate MNRestServlet
15
	MNRestServlet -> MNResourceHandler: handle(get)
16
	activate MNResourceHandler
17
	MNResourceHandler -> MNResourceHandler: doQuery(engine, query)
18
	MNResourceHandler -> MNodeService: query(engine, query)
19
	activate MNodeService
20
	MNodeService -> StatsQueryService: query(query, subjects)
21
	activate StatsQueryService
22
	StatsQueryService -> SolrServer: query(query)
23
	activate SolrServer
24
	SolrServer -> StatsQueryService: inputstream
25
	deactivate SolrServer
26
	StatsQueryService -> MNodeService: inputstream
27
	deactivate StatsQueryService
28
	MNodeService -> MNResourceHandler: inputstream
29
	deactivate MNodeService
30
	MNResourceHandler -> MNRestServlet: response
31
	deactivate MNResourceHandler
32
	MNRestServlet -> client: response
33
	deactivate MNRestServlet
34
  @enduml
35

  
36

  
37
Metacat Usage Statistics Service
38
================================
39

  
40
Overview
41
--------
42
This document describes a proposed usage statistics service for Metacat. 
43

  
44
This new service will provide clients with Metacat usage information about data and metadata access events.
45

  
46

  
47
Requirements
48
------------
49

  
50
The statistics service should have an easy-to-learn API that allows query fields to be added
51
and provides reports in XML and JSON.
52

  
53
Provided Statistics
54
___________________
55

  
56
The service will include the following statistics: 
57

  
58
	* Dataset views
59
	* Package downloads
60
	* Size in bytes of package downloads
61
	* Citations
62

  
63
Results Filtering
64
_________________
65

  
66
Reports returned by the service must be filterable by the following fields:
67

  
68
	* A PID or list of PIDs
69
	* Creator or list of creators (DN, or ORCID, or some amalgam -- to be discussed)
70
	* A time range of access event (upload, download, etc.)
71
	* Spatial location of access event (upload, download, etc.)
72
	* IP Address
73
	* Accessor or list of accessors (DN, or ORCID, or some amalgam, needs ACL -- to be discussed)
74

  
75
Results Aggregation
76
___________________
77

  
78
Reports must support aggregation by the following fields:
79
	* User (DN, or ORCID, or some amalgam)
80
	* Time range, aggregated to requested unit (day, week, month, year)
81
	* Spatial range, aggregated to requested unit
82
	
83
Performance
84
___________
85

  
86
The query service should provide results quickly, as it will be used to construct the user dashboard and possibly other UI elements.
87

  
88
Statistics Service Solr Index
89
-----------------------------
90
Currently Metacat writes access information to the table ‘access_log’ that has the fields:
91
	
92
=========== ===========================
93
name        data type
94
=========== ===========================
95
entryid     bigint
96
ip_address  character varying(512)
97
user_agent  character varying(512)
98
principal   character varying(512)
99
docid       character varying(250)
100
event       character varying(512) 
101
date_logged timestamp without time zone
102
=========== ===========================
103

  
104
In order to provide fast queries, aggregation and faceting of selected fields, access log information will be exported from the current 
105
‘access_log’ table and from 
106
the ‘systemmetadata’ table into a new Solr index that will be configured in Metacat as a second Solr core. The new Solr index will 
107
be based on access events and will contain the fields shown in the following table:
108

  
109
==============  =========
110
name            data type
111
==============  =========
112
id              str
113
datetime        date
114
event           str
115
location        location
116
pid             str
117
rightsHolder    str
118
principal       str
119
size            int
120
formatId        str
121
==============  =========
122

  
123
A document in the new Solr index will look like the following example:
124

  
125
::
126

  
127
	<doc>
128
	<str name="id">2E3E8935-364E-4000-9357-6CE4E067D236</str>
129
	<date name="datetime">2014-01-01T01:01:01Z</date>
130
	<str name="event">read</str>
131
	<location name="location">45.17614,-93.87341</location>
132
	<str name="pid">sla.2.1</str>
133
	<str name="rightsHolder">uid=williams,o=unaffiliated,dc=ecoinformatics,dc=org</str>
134
	<str name="principal">uid=williams,o=unaffiliated,dc=ecoinformatics,dc=org</str>
135
	<int name="size">52273</int>
136
	<str name="formatId">eml://ecoinformatics.org/eml-2.0.1</str>
137
	</doc>
138

  
139
The second Solr core that will contain usage statistics will require a modification to the existing solr.xml file:
140

  
141
::
142

  
143
	<solr persistent="false">
144
	  <!--
145
	  adminPath: RequestHandler path to manage cores.
146
	    If 'null' (or absent), cores will not be manageable via request handler
147
	  -->
148
	  <cores adminPath="/admin/cores" defaultCoreName="collection1">
149
	    <core name="collection1" instanceDir="." />
150
	    <core name="stats" instanceDir="."/>
151
	  </cores>
152
	</solr>
153

  
154
A Java TimerTask will run the import method, which will read event records from the Metacat access_log table, combine them with
155
record data from the systemmetadata table,
156
and write each combined entry to the stats Solr index. Access_log entry types such as ‘synchronization_failed’ and ‘replication’
157
will be filtered out and
158
will not be written to the Solr index. The time of the last record imported from access_log will be stored so that subsequent
159
imports will start from the next unimported event record. If required, the data may be aggregated by time interval, such as week or
160
month.
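
The import step described above can be sketched as follows. This is a minimal illustration in Python, not the planned Java TimerTask: the function name build_stats_docs, the in-memory row dictionaries, the system metadata keys (rights_holder, size, format_id) and the use of entryid as the Solr document id are all assumptions made for the example and are not Metacat APIs.

```python
from datetime import datetime

# Event types named in the design as excluded from the stats index.
EXCLUDED_EVENTS = {"synchronization_failed", "replication"}


def build_stats_docs(access_rows, sysmeta_by_docid, last_imported):
    """Combine access_log rows with system metadata into stats documents.

    access_rows: iterable of dicts keyed by the access_log column names.
    sysmeta_by_docid: hypothetical lookup of system metadata keyed by docid.
    last_imported: date_logged of the last row handled by the previous run.
    Returns (docs, new_last_imported).
    """
    docs = []
    newest = last_imported
    for row in sorted(access_rows, key=lambda r: r["date_logged"]):
        if row["date_logged"] <= last_imported:
            continue  # already handled by an earlier run
        newest = row["date_logged"]
        if row["event"] in EXCLUDED_EVENTS:
            continue  # not indexed, but the stored import time still advances
        meta = sysmeta_by_docid.get(row["docid"], {})
        docs.append({
            "id": str(row["entryid"]),  # assumption: entryid reused as document id
            "datetime": row["date_logged"].strftime("%Y-%m-%dT%H:%M:%SZ"),
            "event": row["event"],
            "pid": row["docid"],
            "principal": row["principal"],
            "rightsHolder": meta.get("rights_holder"),
            "size": meta.get("size"),
            "formatId": meta.get("format_id"),
        })
    return docs, newest
```

The returned timestamp would be persisted and passed back in as last_imported on the next timer run. Rows that share a timestamp with the stored value would be skipped by this comparison, so tracking the last entryid instead may be worth considering.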
161

  
162
The statistics service will be exposed as a new query engine with a DataONE URL such as:
163

  
164
::
165

  
166
	https://hostname/knb/d1/mn/v1/query/stats/<query>
167

  
168
Queries will be passed to the new Solr query engine using the standard Solr query syntax.
169

  
170

  
171
One new class will be added to Metacat to handle stats queries, StatsQueryService. Figure 2 shows a call trace for a statistics
172
service query.
173

  
174
.. figure:: images/stats-query-sequence-diagram.png
175

  
176
   Figure 2. Statistics query sequence diagram.
177

  
178

  
179
The StatsQueryService class will transform the incoming query to Solr parameters, issue the query, and return the query result as a byte stream of text/html content.
180

  
181
Statistics Service Usage
182
------------------------
183

  
184
The following sections show some of the queries that will be available through the statistics service.
185

  
186
Usage of pids provided by a specified rights holder
187
___________________________________________________
188

  
189
The following example shows a query for download volume for pids created by rightsHolder=williams with download size statistics aggregated by pid:
190

  
191
::
192

  
193
	http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=rightsHolder:uid=williams*&fq=event:read&stats=true&stats.field=size&rows=0&stats.facet=pid
194

  
195
The following result is returned:
196

  
197
::
198

  
199
	<?xml version="1.0" encoding="UTF-8"?>
200
	<response>
201
	  ...
202
	  <result name="response" numFound="8" start="0"/>
203
	  <lst name="stats">
204
	    <lst name="stats_fields">
205
	      <lst name="size">
206
	        <double name="min">30.0</double>
207
	        <double name="max">1000.0</double>
208
	        <double name="sum">3150.0</double>
209
	        <long name="count">8</long>
210
	        <long name="missing">0</long>
211
	        <double name="sumOfSquares">3004500.0</double>
212
	        <double name="mean">393.75</double>
213
	        <double name="stddev">502.0226944215627</double>
214
	        <lst name="facets">
215
	          <lst name="pid">
216
	            <lst name="sla.3.1">
217
	              <double name="min">1000.0</double>
218
	              <double name="max">1000.0</double>
219
	              <double name="sum">3000.0</double>
220
	              <long name="count">3</long>
221
	              <long name="missing">0</long>
222
	              <double name="sumOfSquares">3000000.0</double>
223
	              <double name="mean">1000.0</double>
224
	              <double name="stddev">0.0</double>
225
	            </lst>
226
	            <lst name="sla.2.1">
227
	              <double name="min">30.0</double>
228
	              <double name="max">30.0</double>
229
	              <double name="sum">150.0</double>
230
	              <long name="count">5</long>
231
	              <long name="missing">0</long>
232
	              <double name="sumOfSquares">4500.0</double>
233
	              <double name="mean">30.0</double>
234
	              <double name="stddev">0.0</double>
235
	            </lst>
236
	          </lst>
237
	        </lst>
238
	      </lst>
239
	    </lst>
240
	  </lst>
241
	</response>
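
A client could pull the per-pid download volume out of a response like the one above with a few lines of XML handling. The following is a sketch only, using Python's standard library; the function name download_bytes_by_pid is hypothetical, and the sample is a trimmed copy of the response shown above.

```python
import xml.etree.ElementTree as ET

# Path to the per-pid facet list inside a Solr stats response.
PID_FACETS = (".//lst[@name='stats_fields']/lst[@name='size']"
              "/lst[@name='facets']/lst[@name='pid']")


def download_bytes_by_pid(response_xml):
    """Return {pid: summed size} from a stats response faceted by pid."""
    root = ET.fromstring(response_xml.strip())
    totals = {}
    pid_facets = root.find(PID_FACETS)
    if pid_facets is None:
        return totals  # no stats.facet=pid section in this response
    for entry in pid_facets.findall("lst"):
        total = entry.find("double[@name='sum']")
        if total is not None:
            totals[entry.get("name")] = float(total.text)
    return totals


SAMPLE = """
<response>
  <lst name="stats">
    <lst name="stats_fields">
      <lst name="size">
        <double name="sum">3150.0</double>
        <lst name="facets">
          <lst name="pid">
            <lst name="sla.3.1"><double name="sum">3000.0</double></lst>
            <lst name="sla.2.1"><double name="sum">150.0</double></lst>
          </lst>
        </lst>
      </lst>
    </lst>
  </lst>
</response>
"""

print(download_bytes_by_pid(SAMPLE))  # {'sla.3.1': 3000.0, 'sla.2.1': 150.0}
```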
242
	
243
The previous query can be constrained to a specific time period by adding a time range filter, e.g.
244

  
245
::
246

  
247
	&fq=datetime:[2013-01-01T23:59:59Z TO 2013-04-30T23:59:59Z]
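
Because a range filter contains spaces and square brackets, a client will normally need to percent-encode it before sending the request. The sketch below assembles the full query URL with Python's standard library. The helper name stats_query_url is hypothetical, the host myd1host is taken from the examples in this document, and it is assumed that the service accepts the encoded parameters directly after /stats/ as in those examples.

```python
from urllib.parse import quote, urlencode

BASE = "http://myd1host/knb/d1/mn/v1/query/stats/"  # host from the examples


def stats_query_url(params, base=BASE):
    """Return base plus percent-encoded Solr parameters.

    params is a list of (name, value) pairs so that repeated
    parameters such as fq are all kept, in order.
    """
    # ':' and '*' stay readable; spaces, brackets and '=' in values are encoded.
    return base + urlencode(params, safe=":*", quote_via=quote)


url = stats_query_url([
    ("q", "*:*"),
    ("fq", "rightsHolder:uid=williams*"),
    ("fq", "event:read"),
    ("fq", "datetime:[2013-01-01T23:59:59Z TO 2013-04-30T23:59:59Z]"),
    ("stats", "true"),
    ("stats.field", "size"),
    ("stats.facet", "pid"),
    ("rows", "0"),
])
print(url)
```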
248

  
249
Data uploads 
250
____________
251

  
252
The following query shows counts of data uploads by format type by a specified user:
253

  
254
::
255

  
256
	http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=rightsHolder:uid=williams*&fq=event:create&facet=true&facet.field=formatId&rows=0
257

  
258
::
259

  
260
	<?xml version="1.0" encoding="UTF-8"?>
261
	<response>
262
	  ...
263
	  <result name="response" numFound="3" start="0"/>
264
	  <lst name="facet_counts">
265
	    <lst name="facet_queries"/>
266
	    <lst name="facet_fields">
267
	      <lst name="formatId">
268
	        <int name="BIN">2</int>
269
	        <int name="eml://ecoinformatics.org/eml-2.1.1">1</int>
270
	        <int name="text/csv">0</int>
271
	      </lst>
272
	    </lst>
273
	    <lst name="facet_dates"/>
274
	    <lst name="facet_ranges"/>
275
	  </lst>
276
	</response>
277

  
278
Data downloads
279
______________
280

  
281
The following query shows data download counts by a specific user for each month in 2013:
282

  
283
::
284

  
285
    http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=principal:williams&fq=event:read&fq=formatId:BIN&facet=true&facet.field=event&facet.range=datetime&facet.range.start=2013-01-01T01:01:01Z&facet.range.end=2013-12-31T23:59:59Z&facet.range.gap=%2B1MONTH
286

  
287
::
288

  
289
	<?xml version="1.0" encoding="UTF-8"?>
290
	<response>
291
	    ...
292
	    <lst name="facet_ranges">
293
	      <lst name="datetime">
294
	        <lst name="counts">
295
	          <int name="2013-01-01T01:01:01Z">0</int>
296
	          <int name="2013-02-01T01:01:01Z">0</int>
297
	          <int name="2013-03-01T01:01:01Z">0</int>
298
	          <int name="2013-04-01T01:01:01Z">0</int>
299
	          <int name="2013-05-01T01:01:01Z">0</int>
300
	          <int name="2013-06-01T01:01:01Z">2</int>
301
	          <int name="2013-07-01T01:01:01Z">1</int>
302
	          <int name="2013-08-01T01:01:01Z">0</int>
303
	          <int name="2013-09-01T01:01:01Z">0</int>
304
	          <int name="2013-10-01T01:01:01Z">0</int>
305
	          <int name="2013-11-01T01:01:01Z">0</int>
306
	          <int name="2013-12-01T01:01:01Z">0</int>
307
	        </lst>
308
	        <str name="gap">+1MONTH</str>
309
	        <date name="start">2013-01-01T01:01:01Z</date>
310
	        <date name="end">2014-01-01T01:01:01Z</date>
311
	      </lst>
312
	    </lst>
313
	  </lst>
314
	</response>
315

  
316
The following query shows EML metadata downloads by a specific user for each month in 2013.
317

  
318
::
319

  
320
	http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=principal:*williams*&fq=event:read&fq=formatId:*eml*&facet=true&facet.field=event&facet.range=datetime&facet.range.start=2013-01-01T01:01:01Z&facet.range.end=2013-12-31T23:59:59Z&facet.range.gap=%2B1MONTH
321

  
322
::
323

  
324
	<?xml version="1.0" encoding="UTF-8"?>
325
	<response>
326
		...
327
	    <lst name="facet_ranges">
328
	      <lst name="datetime">
329
	        <lst name="counts">
330
	          <int name="2013-01-01T01:01:01Z">0</int>
331
	          <int name="2013-02-01T01:01:01Z">0</int>
332
	          <int name="2013-03-01T01:01:01Z">0</int>
333
	          <int name="2013-04-01T01:01:01Z">1</int>
334
	          <int name="2013-05-01T01:01:01Z">1</int>
335
	          <int name="2013-06-01T01:01:01Z">0</int>
336
	          <int name="2013-07-01T01:01:01Z">2</int>
337
	          <int name="2013-08-01T01:01:01Z">0</int>
338
	          <int name="2013-09-01T01:01:01Z">0</int>
339
	          <int name="2013-10-01T01:01:01Z">0</int>
340
	          <int name="2013-11-01T01:01:01Z">0</int>
341
	          <int name="2013-12-01T01:01:01Z">0</int>
342
	        </lst>
343
	        <str name="gap">+1MONTH</str>
344
	        <date name="start">2013-01-01T01:01:01Z</date>
345
	        <date name="end">2014-01-01T01:01:01Z</date>
346
	      </lst>
347
	    </lst>
348
	  </lst>
349
	</response>
350

  
351
Unresolved Issues/Questions
352
---------------------------
353

  
354
	1. How is the location of an event determined? What do we mean by location?
355
	2. The Solr (3.x and 4.x) stats component currently doesn’t support faceting by date/time interval, so it isn't possible to use it to calculate total download volume for each time interval over a time range, such as every month for the last 10 years. Therefore, for calculated amounts, a separate query is required for each time interval.
356
	3. Where will citation info come from? Do we import this into the Solr index?
357
	4. Are there text fields that the statistics service should include, e.g. do we want to provide statistics for queries such as "how many pids were downloaded that mention kelp?"
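
Regarding issue 2, one workaround is for the client to issue one stats query per interval, each with its own datetime filter. The sketch below generates the twelve monthly filters for a year; each string would be passed as an fq parameter alongside stats=true and stats.field=size. The function name monthly_filters is hypothetical, and inclusive bounds are used on the assumption that the filter must work on both Solr 3.x and 4.x.

```python
from datetime import datetime, timedelta

SOLR_DATE = "%Y-%m-%dT%H:%M:%SZ"


def monthly_filters(year):
    """Return one inclusive datetime range filter per calendar month."""
    filters = []
    for month in range(1, 13):
        start = datetime(year, month, 1)
        if month == 12:
            next_start = datetime(year + 1, 1, 1)
        else:
            next_start = datetime(year, month + 1, 1)
        # Last second of the month, so adjacent ranges do not overlap.
        end = next_start - timedelta(seconds=1)
        filters.append("datetime:[{} TO {}]".format(
            start.strftime(SOLR_DATE), end.strftime(SOLR_DATE)))
    return filters


for fq in monthly_filters(2013)[:2]:
    print(fq)
```

Covering ten years at monthly resolution this way means 120 requests, which is the cost the issue describes.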
358

  
359

  
360

  
361

  
