DataONE Member Node Support
===========================
DataONE_ is a federation of data repositories that aims to improve
interoperability among data repository software systems and advance the
preservation of scientific data for future use.
Metacat deployments can be configured to participate in DataONE_. This
chapter describes the DataONE_ data federation, its architecture, and the
way in which Metacat can be used to participate as a node in the DataONE system.

.. _DataONE: http://dataone.org/

What is DataONE?
----------------
The DataONE_ project is a collaboration among scientists, technologists, librarians,
and social scientists to build a robust, interoperable, and sustainable system for
preserving and accessing Earth observational data at national and global scales.
Supported by the U.S. National Science Foundation, DataONE partners focus on
technological, financial, and organizational sustainability approaches to
building a distributed network of data repositories that are fully interoperable,
even when those repositories use divergent underlying software and support different
data and metadata content standards. DataONE defines a common web-service
programming interface that allows the main software components of the DataONE system
to seamlessly communicate. The components of the DataONE system include:

* DataONE Service Interface
* Member Nodes
* Coordinating Nodes
* Investigator Toolkit

Metacat implements the services needed to operate as a DataONE Member Node,
as described below. The service interface then allows many different scientific
software tools for data management, analysis, visualization, and other parts of
the scientific lifecycle to communicate directly with Metacat without being
further specialized beyond the support needed for DataONE. This streamlines the
process of writing scientific software both for servers and client tools.

The DataONE Service Interface
-----------------------------
DataONE achieves interoperability by defining a lightweight but powerful set of
REST_ web services that can be implemented by various data management software
systems to allow those systems to communicate effectively with one another and
exchange data, metadata, and other scientific objects. This `DataONE Service Interface`_
is an open standard that defines the communication protocols and technical
expectations for software components that wish to participate in the DataONE
federation. This service interface is divided into `four distinct tiers`_, with the
intention that any given software system may implement only those tiers that are
relevant to its repository; for example, a data aggregator might only implement
the Tier 1 interfaces that provide anonymous access to public data sets, while
a complete data management system like Metacat can implement all four tiers:

1. **Tier 1:** Read-only, anonymous data access
2. **Tier 2:** Read-only, with authentication and access control
3. **Tier 3:** Full write access
4. **Tier 4:** Replication target services

.. _REST: http://en.wikipedia.org/wiki/Representational_state_transfer

.. _DataONE Service Interface: http://releases.dataone.org/online/d1-architecture-1.0.0

.. _four distinct tiers: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/index.html

Member Nodes
------------
In DataONE, Member Nodes form the core of the network: they represent
particular scientific communities, manage and preserve their data and metadata, and
provide tools to their community for contributing, managing, and accessing data.
DataONE provides a standard way for these individual repositories to interact, and helps
to coordinate among the Member Nodes in the federation. This allows Member Nodes
to provide services to each other, such as replication of data for backup and failover.
To be a Member Node, a repository must implement the Member Node service interface,
and then register with DataONE. Metacat provides this implementation automatically,
and provides an easy configuration option to register a Metacat instance as a
DataONE Member Node (see the configuration section below). If you are deploying a Metacat
instance, it is relatively simple to become a Member Node, but keep in mind that
DataONE is aiming for longevity and preservation, and so is selecting for nodes
that have long-term data preservation as part of their mission.

Coordinating Nodes
------------------
The DataONE Coordinating Nodes provide a set of services to Member Nodes that
allow Member Nodes to easily interact with one another and to provide a unified
view of the whole DataONE Federation. The main services provided by Coordinating
Nodes are:

* Global search index for all metadata and web portal for data discovery
* Resolution service to map unique identifiers to the Member Nodes that hold data
* Authentication against a shared set of accounts based on CILogon_ and InCommon_
* Replication management services to reliably replicate data according to
  policies set by the Member Nodes
* Fixity checking to ensure that preserved objects remain valid
* Member Node registration and management
* Aggregated logging for data access across the whole federation

Three geographically distributed Coordinating Nodes replicate these coordinating
services at UC Santa Barbara, the University of New Mexico, and the Oak Ridge Campus.
Coordinating Nodes are set up in a fully redundant manner, such that any of the Coordinating
Nodes can be offline and the others will continue to provide the services
without interruption. The Coordinating Nodes expose their services at::

    https://cn.dataone.org/cn

The DataONE search portal is available at::

    https://cn.dataone.org/

.. _CILogon: http://www.cilogon.org

.. _InCommon: http://incommon.org

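As a minimal sketch of how a client might address the Coordinating Node
service endpoint given above, the following Python builds a request URL for the
resolution service. The ``/v1/resolve/`` path and the sample identifier are
assumptions made for illustration and do not come from this chapter; the DataONE
API documentation is the authority on the actual paths.

```python
from urllib.parse import quote

# Base URL of the Coordinating Node services, as given in this chapter.
CN_BASE_URL = "https://cn.dataone.org/cn"


def resolve_url(pid, base_url=CN_BASE_URL, version="v1"):
    """Build a URL asking a Coordinating Node which nodes hold an object.

    The "resolve" path segment and the "v1" version are assumptions.
    """
    # Encode every reserved character, including "/", so that the
    # identifier stays within a single path segment.
    encoded_pid = quote(pid, safe="")
    return f"{base_url}/{version}/resolve/{encoded_pid}"


# "doi:10.5063/EXAMPLE" is a hypothetical identifier.
print(resolve_url("doi:10.5063/EXAMPLE"))
```

The function only constructs the URL; sending the request and handling the
response are left to whichever HTTP client the tool already uses.
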
Investigator Toolkit
--------------------
In order to provide scientists with convenient access to the data and metadata in
DataONE, the Investigator Toolkit is a library of software tools that have been
adapted to work with DataONE via the service interface and can be used to
discover, manage, analyze, and visualize data in DataONE. For example, DataONE
plans to release metadata editors (e.g., Morpho), data search tools (e.g., Mercury),
data access tools (e.g., ONEDrive), and data analysis tools (e.g., R) that all
know how to interact with DataONE Member Nodes and Coordinating Nodes. Consequently,
scientists will be able to access data from any DataONE Member Node, such as a Metacat
node, directly from within the R environment. In addition, software tools that
are written to work with one Member Node should also work with others, thereby
greatly increasing the efficiency of creating an entire toolkit of software that
is useful to investigators.

Because DataONE services are REST web services, software written in any
programming language can be adapted to interact with DataONE.
In addition, to ease the process of adapting tools to work with DataONE, libraries
are provided for common programming languages such as Java (d1-libclient-java)
and Python (d1_libclient-python) that allow simple function calls
to be used to access any DataONE service.

Configuring Metacat as a Member Node
------------------------------------
Configuring Metacat as a DataONE Member Node is accomplished with the standard
Metacat Administrative configuration utility. To access the utility, visit the
following URL::

    http://<yourhost.org>/<context>/admin

where ``<yourhost.org>`` represents the hostname of your web server running Metacat,
and ``<context>`` is the name of the web context in which Metacat was installed.
Once at the administrative utility, click on the DataONE configuration link, which
should show the following screen:

.. figure:: images/screenshots/screen-dataone-config.png
   :align: center

   The configuration screen for configuring Metacat as a DataONE node.

To configure Metacat as a node in the DataONE network, configure the properties shown
in the figure above. The Node Name should be a short name for the node that can
be used in user interface displays that list the node. For example, one node in
DataONE is the 'Knowledge Network for Biocomplexity'. Also provide a brief sentence
or two describing the node, including its intended scope and purpose.

The Node Identifier field is a unique identifier assigned by DataONE to identify
this node even when the node changes physical locations over time. After Metacat
registers with the DataONE Coordinating Nodes (when you click 'Register' at the
bottom of this form), the Node Identifier should not be changed. **It is critical that
you not change the Node Identifier after registration**, as that will break the connection with the
DataONE network. Changing this field should only happen in the rare case
in which a new Metacat instance is being established to act as the provider for an
existing DataONE Member Node, in which case the field can be edited to set it to
the value of a valid, existing Node Identifier.

The Node Subject and Node Certificate Path are linked fields that are critical for
proper operation of the node. To act as a Member Node in DataONE, you must obtain
an X.509 certificate that can be used to identify this node and allow it to securely
communicate using SSL with other nodes and client applications. This certificate can
be obtained from the DataONE Certificate Authority.
Once you have the certificate in hand, use a tool such
as ``openssl`` to determine the exact subject distinguished name in the
certificate, and use that to set the Node Subject field. Set the Node
Certificate Path to the location on the system in which you stored the
certificate file. Be sure to protect the certificate file, as it contains the
private key that is used to authenticate this node within DataONE.

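One way to read the subject from the certificate is sketched below in Python,
which wraps the ``openssl x509`` command. The function names are illustrative
and are not part of Metacat, and the sketch assumes a PEM-encoded certificate
and an ``openssl`` binary on the ``PATH``. The printed subject format differs
between ``openssl`` versions, so the output may need adjusting before it is
entered in the Node Subject field.

```python
import subprocess


def subject_command(cert_path):
    """Return the openssl invocation that prints a certificate's subject."""
    return ["openssl", "x509", "-in", cert_path, "-noout", "-subject"]


def read_subject(cert_path):
    """Run openssl and return its subject line with whitespace trimmed.

    Raises subprocess.CalledProcessError if openssl cannot read the file.
    """
    result = subprocess.run(
        subject_command(cert_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Calling ``read_subject`` with the same location that was entered as the Node
Certificate Path helps keep the two linked fields consistent.
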
.. Note::

  For Tier 2 deployments and above, the Metacat Member Node must have Apache configured to request
  client certificates. Detailed instructions are included at the end of this chapter.

The ``Enable DataONE Services`` checkbox allows the administrator to decide whether to
turn on synchronization with the DataONE network. When this box is unchecked, the
DataONE Coordinating Nodes will not attempt to synchronize at all; when it is checked,
DataONE will periodically contact the node to synchronize all metadata content.
To be part of the DataONE network, this box must be checked, as that allows
DataONE to receive a copy of the metadata associated with each object in the Metacat
system. The switch is provided for those rare cases when a node needs to be disconnected
from DataONE for maintenance or service outages. When the box is checked, DataONE
contacts the node using the schedule provided in the ``Synchronization Schedule``
fields. The example in the dialog above has synchronization occurring once every third
minute, at the 10-second mark of those minutes. The syntax for these schedules
follows the Quartz Crontab Entry syntax, which provides for many flexible schedule
configurations. If the administrator desires a less frequent schedule, such as daily,
that can be configured by changing the ``*`` in the ``Hours`` field to be a concrete
hour (such as ``11``) and the ``Minutes`` field to a concrete value like ``15``,
which would change the schedule to synchronize at 11:15 am daily.

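The two schedules described above can be sketched as Quartz-style cron
expressions. In this Python illustration the helper name is hypothetical, the
field values ``10`` and ``0/3`` are inferred from the description rather than
read from the dialog, and the trailing day and month fields are assumed to mean
"every day".

```python
def quartz_schedule(seconds, minutes, hours):
    """Join schedule fields into a Quartz-style cron expression.

    Quartz orders the fields as seconds, minutes, hours, day-of-month,
    month, day-of-week. The last three are fixed here to "every day",
    which is an assumption made for this sketch.
    """
    return f"{seconds} {minutes} {hours} * * ?"


# Every third minute, at the 10-second mark of that minute.
frequent = quartz_schedule("10", "0/3", "*")

# Once a day at 11:15, keeping the 10-second offset.
daily = quartz_schedule("10", "15", "11")

print(frequent)
print(daily)
```

Only the ``Hours`` and ``Minutes`` values differ between the two expressions,
which mirrors the change described for moving to a daily schedule.
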
The Replication section is used to configure replication options for the node
overall and for objects stored in Metacat. The ``Accept and Store Replicas``
checkbox is used to indicate that the administrator of this node is willing to allow
replica data and metadata from other Member Nodes to be stored on this node. We
encourage people to allow replication to their nodes, as this increases the
scalability and flexibility of the network overall. The three "Default" fields set
the default values for the replication policies for data and metadata on this node
that are generated when System Metadata is not available for an object (such as when
it originates from a client that is not DataONE compliant). The ``Default Number of
Replicas`` determines how many replica copies of the object should be stored on
other Member Nodes. A value of 0 or less indicates that no replicas should be
stored. In addition, you can specify a list of nodes that are either preferred for
use when choosing replica nodes, or that are blocked from use as replica nodes.
This allows Member Nodes to set up bidirectional agreements with partner nodes to
replicate data across their sites. The values for ``Default Preferred Nodes``
and ``Default Blocked Nodes`` are each a comma-separated list of NodeReference identifiers
that were assigned to the target nodes by DataONE.

Once these parameters have been properly set, use the ``Register`` button to
request to register with the DataONE Coordinating Node. This will generate a
registration document describing this Metacat instance and send it to the
Coordinating Node registration service. At that point, all that remains is to wait for
the DataONE administrators to approve the node registration. Details of the approval
process can be found on the `DataONE web site`_.

.. _DataONE web site: http://www.dataone.org

Access Control Policies
-----------------------
Metacat has supported fine-grained access control for objects in the system since
its inception. DataONE has devised a simple but effective access control system
that is compatible with the prior system in Metacat. For each object in the DataONE
system (including data objects, scientific metadata objects, and resource maps),
a SystemMetadata_ document describes the critical metadata needed to manage that
object in the system. This metadata includes a ``RightsHolder`` field and an
``AuthoritativeMemberNode`` field that are used to list the people and node that
have ultimate control over the disposition of the object. In addition, a separate
AccessPolicy_ can be included in the ``SystemMetadata`` for the object. This ``AccessPolicy``
consists of a set of rules that grant additional permissions to other people,
groups, and systems in DataONE. For example, for one data file, two users
(Alice and Bob) may be able to make changes to the object, and the general public may
be allowed to read the object. In the absence of explicit rules extending these permissions,
Metacat enforces the rule that only the ``RightsHolder`` and ``AuthoritativeMemberNode`` have
rights to the object, and that the Coordinating Node can manage ``SystemMetadata``
for the object. An example AccessPolicy that might be submitted with a dataset
(giving Alice and Bob permission to read and write the object) follows:

::

  ...
  <accessPolicy>
      <allow>
          <subject>/C=US/O=SomeIdP/CN=Alice</subject>
          <subject>/C=US/O=SomeIdP/CN=Bob</subject>
          <permission>read</permission>
          <permission>write</permission>
      </allow>
  </accessPolicy>
  ...

These AccessPolicies can be embedded inside of the ``SystemMetadata`` that accompanies
submission of an object through the `MNStorage.create`_ and `MNStorage.update`_ services,
or can be set using the `CNAuthorization.setAccessPolicy`_ service.

.. _SystemMetadata: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/Types.html#Types.AccessPolicy

.. _AccessPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/Types.html#Types.AccessPolicy

.. _MNStorage.create: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/MN_APIs.html#MNStorage.create

.. _MNStorage.update: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/MN_APIs.html#MNStorage.update

.. _CNAuthorization.setAccessPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/CN_APIs.html#CNAuthorization.setAccessPolicy

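For clients that assemble System Metadata programmatically, the AccessPolicy
fragment shown in this section could be generated along the following lines.
This is a sketch using only the Python standard library. The helper name is
illustrative, and XML namespaces are omitted for brevity, so a real
``SystemMetadata`` document would need more than this.

```python
import xml.etree.ElementTree as ET


def build_access_policy(subjects, permissions):
    """Return an accessPolicy element holding a single allow rule.

    Element names follow the example fragment in this chapter;
    namespaces are deliberately left out of this sketch.
    """
    policy = ET.Element("accessPolicy")
    allow = ET.SubElement(policy, "allow")
    for subject in subjects:
        ET.SubElement(allow, "subject").text = subject
    for permission in permissions:
        ET.SubElement(allow, "permission").text = permission
    return policy


# Grant Alice and Bob read and write permission, as in the example above.
policy = build_access_policy(
    ["/C=US/O=SomeIdP/CN=Alice", "/C=US/O=SomeIdP/CN=Bob"],
    ["read", "write"],
)
print(ET.tostring(policy, encoding="unicode"))
```

Building the fragment from lists keeps the subjects and permissions in one
place, which is convenient when the same rule is applied to many objects.
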
Configuration as a replication target
-------------------------------------
DataONE is designed to enable a robust preservation environment through replication
of digital objects at multiple Member Nodes. Any Member Node in DataONE that implements
the Tier 4 Service interface can offer to act as a target for object replication.
Currently, Metacat configuration supports turning this replication function on or off.
When the 'Act as a replication target' checkbox is checked, Metacat will notify
the Coordinating Nodes in DataONE that it is available to house replicas of objects
from other nodes. Shortly thereafter, the Coordinating Nodes may notify Metacat to
replicate objects from throughout the system, which it will start to do. These objects
will then begin to be listed in the Metacat catalog.

.. Note::

  Future versions of Metacat will allow finer specification of the Node
  Replication Policy, which determines the set of objects
  that it is willing to replicate, using constraints on object size, total objects,
  source nodes, and object format types.

Object Replication Policies
---------------------------
In addition to access control, each object can also have a ``ReplicationPolicy``
associated with it that determines whether DataONE should attempt to replicate the
object for failover and backup purposes to other Member Nodes in the federation.
Both the ``RightsHolder`` and ``AuthoritativeMemberNode`` for an object can set the
``ReplicationPolicy``, which consists of fields that describe how many replicas
should be maintained, and any nodes that are preferred for housing those replicas, or
that should be blocked from housing replicas.

These ReplicationPolicies can be embedded inside of the ``SystemMetadata`` that accompanies
submission of an object through the `MNStorage.create`_ and `MNStorage.update`_ services,
or can be set using the `CNReplication.setReplicationPolicy`_ service.

.. _CNReplication.setReplicationPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/CN_APIs.html#CNReplication.setReplicationPolicy


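A ReplicationPolicy with the fields described in this section might be
assembled as in the Python sketch below. The element and attribute names
(``replicationPolicy``, ``replicationAllowed``, ``numberReplicas``,
``preferredMemberNode``, ``blockedMemberNode``) are assumptions drawn from the
DataONE type definitions rather than from this chapter, the node identifiers
are hypothetical, and namespaces are omitted.

```python
import xml.etree.ElementTree as ET


def build_replication_policy(number_replicas, preferred=(), blocked=()):
    """Return a replicationPolicy element (names are assumptions).

    A replica count of 0 or less is treated as "do not replicate",
    which is a design choice made for this sketch.
    """
    allowed = number_replicas > 0
    policy = ET.Element("replicationPolicy")
    policy.set("replicationAllowed", "true" if allowed else "false")
    policy.set("numberReplicas", str(max(number_replicas, 0)))
    for node in preferred:
        ET.SubElement(policy, "preferredMemberNode").text = node
    for node in blocked:
        ET.SubElement(policy, "blockedMemberNode").text = node
    return policy


# Hypothetical node identifiers, used only for illustration.
replication = build_replication_policy(
    2,
    preferred=["urn:node:EXAMPLE_A"],
    blocked=["urn:node:EXAMPLE_B"],
)
print(ET.tostring(replication, encoding="unicode"))
```

The resulting fragment would sit inside the ``SystemMetadata`` submitted with
the object, alongside any AccessPolicy.
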
Generating DataONE System Metadata
----------------------------------
When a Metacat instance becomes a Member Node, System Metadata must be generated for the existing content.
This can be invoked in the Replication configuration screen of the Metacat administration interface. Initially,
Metacat instances will only need to generate System Metadata for their local content (the ``localhost`` entry).
In cases where Metacat has participated in replication with other Metacat servers, it may be useful to generate System Metadata
for those replica records as well. Please consult both the replication partner's administrator and the DataONE administrators before
generating System Metadata for replica content.

.. figure:: images/screenshots/screen-dataone-replication.png
   :align: center

   The replication configuration screen for generating System Metadata.

Apache configuration details
----------------------------
These Apache directives are crucial for Metacat to function as a Tier 2+ Member Node:

::

  ...
  AllowEncodedSlashes On
  AcceptPathInfo On
  JkOptions +ForwardURICompatUnparsed
  SSLEngine on
  SSLOptions +StrictRequire +StdEnvVars +ExportCertData
  SSLVerifyClient optional
  SSLVerifyDepth 10
  SSLCertificateFile /etc/ssl/certs/<your_server_certificate>
  SSLCertificateKeyFile /etc/ssl/private/<your_server_key>
  SSLCACertificatePath /etc/ssl/certs/
  ...

Where ``<your_server_certificate>`` and ``<your_server_key>`` are the certificate/key pair used by Apache
to identify the server to clients. The DataONE Certificate Authority certificate, available from the DataONE administrators,
will also need to be added to the directory specified by ``SSLCACertificatePath``
in order to validate client certificates signed by that authority. DataONE has also provided a CA chain file that may be used in lieu of directory-based CA
configuration. The ``SSLCACertificateFile`` directive should be used when configuring your Member Node with the DataONE CA chain.
When these changes have been applied, rehash the certificate directory and restart Apache:

::

  cd /etc/ssl/certs
  sudo c_rehash
  sudo /etc/init.d/apache2 restart


Configure Tomcat to allow DataONE identifiers
----------------------------------------------
Edit ``/etc/tomcat/catalina.properties`` to include:

::

  org.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true
  org.apache.catalina.connector.CoyoteAdapter.ALLOW_BACKSLASH=true
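These settings matter because identifiers may contain characters such as ``/``
that must be percent-encoded to fit in a single URL path segment. The Python
sketch below illustrates the effect. The ``/object/`` path and the identifier
are assumptions chosen for the example and are not taken from this chapter.

```python
from urllib.parse import quote, unquote

# Hypothetical identifier containing a forward slash.
pid = "doi:10.5063/EXAMPLE"

# Unencoded, the identifier spills across two path segments.
raw_path = "/object/" + pid

# Encoded, the slash becomes %2F and the identifier stays in one segment.
encoded_path = "/object/" + quote(pid, safe="")

print(raw_path.split("/"))
print(encoded_path.split("/"))

# Decoding the final segment recovers the original identifier, which is
# what the server has to be permitted to do.
recovered = unquote(encoded_path.split("/")[-1])
print(recovered == pid)
```

A server that rejects encoded slashes would refuse the second form of the
path, so requests for such identifiers could not reach Metacat.
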