DataONE Member Node Support
===========================
DataONE_ is a federation of data repositories that aims to improve
interoperability among data repository software systems and advance the
preservation of scientific data for future use.
Metacat deployments can be configured to participate in DataONE_. This
chapter describes the DataONE_ data federation, its architecture, and the
way in which Metacat can be used to participate as a node in the DataONE system.

.. _DataONE: http://dataone.org/

What is DataONE?
----------------
The DataONE_ project is a collaboration among scientists, technologists, librarians,
and social scientists to build a robust, interoperable, and sustainable system for
preserving and accessing Earth observational data at national and global scales.
Supported by the U.S. National Science Foundation, DataONE partners focus on
technological, financial, and organizational sustainability approaches to
building a distributed network of data repositories that are fully interoperable,
even when those repositories use divergent underlying software and support different
data and metadata content standards. DataONE defines a common web-service
programming interface that allows the main software components of the DataONE system
to communicate with one another. The components of the DataONE system include:

* DataONE Service Interface
* Member Nodes
* Coordinating Nodes
* Investigator Toolkit

Metacat implements the services needed to operate as a DataONE Member Node,
as described below. The service interface then allows many different scientific
software tools for data management, analysis, visualization, and other parts of
the scientific lifecycle to communicate directly with Metacat without being
further specialized beyond the support needed for DataONE. This streamlines the
process of writing scientific software for both servers and client tools.

The DataONE Service Interface
-----------------------------
DataONE achieves interoperability by defining a lightweight but powerful set of
REST_ web services that can be implemented by various data management software
systems, allowing those systems to communicate with one another and to
exchange data, metadata, and other scientific objects. This `DataONE Service Interface`_
is an open standard that defines the communication protocols and technical
expectations for software components that wish to participate in the DataONE
federation. The service interface is divided into `four distinct tiers`_, with the
intention that any given software system may implement only those tiers that are
relevant to its repository. For example, a data aggregator might implement only
the Tier 1 interfaces that provide anonymous access to public data sets, while
a complete data management system like Metacat can implement all four tiers:

1. **Tier 1:** Read-only, anonymous data access
2. **Tier 2:** Read-only, with authentication and access control
3. **Tier 3:** Full write access
4. **Tier 4:** Replication target services

.. _REST: http://en.wikipedia.org/wiki/Representational_state_transfer

.. _DataONE Service Interface: http://releases.dataone.org/online/d1-architecture-1.0.0

.. _four distinct tiers: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/index.html

Member Nodes
------------
In DataONE, Member Nodes are the core of the network: they represent
particular scientific communities, manage and preserve their data and metadata, and
provide tools to their community for contributing, managing, and accessing data.
DataONE provides a standard way for these individual repositories to interact, and helps
to coordinate among the Member Nodes in the federation. This allows Member Nodes
to provide services to each other, such as replication of data for backup and failover.
To be a Member Node, a repository must implement the Member Node service interface
and then register with DataONE. Metacat provides this implementation automatically,
and provides an easy configuration option to register a Metacat instance as a
DataONE Member Node (see the configuration section below). If you are deploying a Metacat
instance, it is relatively simple to become a Member Node, but keep in mind that
DataONE is aiming for longevity and preservation, and so is selecting for nodes
that have long-term data preservation as part of their mission.

Coordinating Nodes
------------------
The DataONE Coordinating Nodes provide a set of services to Member Nodes that
allow Member Nodes to easily interact with one another and that provide a unified
view of the whole DataONE federation. The main services provided by Coordinating
Nodes are:

* Global search index for all metadata and web portal for data discovery
* Resolution service to map unique identifiers to the Member Nodes that hold data
* Authentication against a shared set of accounts based on CILogon_ and InCommon_
* Replication management services to reliably replicate data according to
  policies set by the Member Nodes
* Fixity checking to ensure that preserved objects remain valid
* Member Node registration and management
* Aggregated logging for data access across the whole federation

Three geographically distributed Coordinating Nodes replicate these coordinating
services at UC Santa Barbara, the University of New Mexico, and the Oak Ridge Campus.
Coordinating Nodes are set up in a fully redundant manner, such that any of the Coordinating
Nodes can be offline and the others will continue to provide the services
without interruption. The Coordinating Nodes expose their services at::

    https://cn.dataone.org/cn

The DataONE search portal is available at::

    https://cn.dataone.org/

.. _CILogon: http://www.cilogon.org

.. _InCommon: http://incommon.org

Investigator Toolkit
--------------------
To provide scientists with convenient access to the data and metadata in
DataONE, the Investigator Toolkit is a library of software tools that have been
adapted to work with DataONE via the service interface and can be used to
discover, manage, analyze, and visualize data in DataONE. For example, DataONE
plans to release metadata editors (e.g., Morpho), data search tools (e.g., Mercury),
data access tools (e.g., ONEDrive), and data analysis tools (e.g., R) that all
know how to interact with DataONE Member Nodes and Coordinating Nodes. Consequently,
scientists will be able to access data from any DataONE Member Node, such as a Metacat
node, directly from within the R environment. In addition, software tools that
are written to work with one Member Node should also work with others, which
greatly reduces the effort of creating an entire toolkit of software that
is useful to investigators.

Because DataONE services are REST web services, software written in any
programming language can be adapted to interact with DataONE.
In addition, to ease the process of adapting tools to work with DataONE, libraries
are provided for common programming languages such as Java (d1-libclient-java)
and Python (d1_libclient-python) that allow simple function calls
to be used to access any DataONE service.

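Because every service is an HTTP resource, a client needs little more than
correctly formed URLs. The following is a minimal sketch using only the Python
standard library rather than the DataONE client libraries. It assumes the
version 1 REST resource names (``object``, ``meta``, ``resolve``) from the
DataONE architecture documentation, and the Member Node base URL shown is a
hypothetical placeholder; verify both against your own deployment.

.. code-block:: python

    from urllib.parse import quote

    # Coordinating Node base URL, as given earlier in this chapter.
    CN_BASE_URL = "https://cn.dataone.org/cn"

    def rest_url(base_url, resource, pid=None, version="v1"):
        """Build a URL of the form <base>/<version>/<resource>[/<pid>].

        The identifier is percent-encoded in full, because identifiers
        may contain characters such as ':' and '/'.
        """
        parts = [base_url.rstrip("/"), version, resource]
        if pid is not None:
            parts.append(quote(pid, safe=""))
        return "/".join(parts)

    # Hypothetical Member Node base URL; substitute your host and context.
    mn_base = "https://yourhost.org/context/d1/mn"
    pid = "doi:10.1234/example"  # hypothetical identifier

    print(rest_url(mn_base, "object", pid))       # the object's bytes
    print(rest_url(mn_base, "meta", pid))         # its SystemMetadata
    print(rest_url(CN_BASE_URL, "resolve", pid))  # which nodes hold it

Encoding the identifier with ``safe=""`` keeps an embedded ``/`` from being
read as a path separator, which is the most common mistake when building these
URLs by hand.
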
Configuring Metacat as a Member Node
------------------------------------
Configuring Metacat as a DataONE Member Node is accomplished with the standard
Metacat administrative configuration utility. To access the utility, visit the
following URL::

    http://<yourhost.org>/<context>/admin

where ``<yourhost.org>`` represents the hostname of your web server running Metacat,
and ``<context>`` is the name of the web context in which Metacat was installed.
Once at the administrative utility, click on the DataONE configuration link, which
should show the following screen:

.. figure:: images/screenshots/screen-dataone-config.png
   :align: center

   The configuration screen for configuring Metacat as a DataONE node.

To configure Metacat as a node in the DataONE network, set the properties shown
in the figure above. The Node Name should be a short name for the node that can
be used in user interface displays that list the node. For example, one node in
DataONE is the 'Knowledge Network for Biocomplexity'. Also provide a brief sentence
or two describing the node, including its intended scope and purpose.

The Node Identifier field is a unique identifier assigned by DataONE to identify
this node even when the node changes physical locations over time. When Metacat
registers with the DataONE Coordinating Nodes (when you click 'Register' at the
bottom of this form), the Node Identifier is automatically set. **It is critical that
you not change the Node Identifier**, as that will break the connection with the
DataONE network. The ability to edit this field is only provided for the rare case
in which a new Metacat instance is being established to act as the provider for an
existing DataONE Member Node, in which case the field can be edited to set it to
the value of a valid, existing Node Identifier.

The Node Subject and Node Certificate Path are linked fields that are critical for
proper operation of the node. To act as a Member Node in DataONE, you must obtain
an X.509 certificate that can be used to identify this node and allow it to securely
communicate using SSL with other nodes and client applications. This certificate can
be obtained either from the DataONE Certificate Authority or from a commercial
provider of certificates. Once you have the certificate in hand, use a tool such
as ``openssl`` to determine the exact subject distinguished name in the
certificate, and use that to set the Node Subject field. Set the Node
Certificate Path to the location on the system in which you stored the
certificate file.

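One way to read the subject is to call the ``openssl x509`` command from a
short script. The sketch below is illustrative only: the helper names are
hypothetical, the optional ``-nameopt`` value controls the printed format, and
which format the Node Subject field expects is an assumption you should confirm
against the subject string your certificate issuer gave you.

.. code-block:: python

    import subprocess

    def subject_command(cert_path, nameopt=None):
        """Return the openssl command line that prints a certificate subject."""
        cmd = ["openssl", "x509", "-in", cert_path, "-noout", "-subject"]
        if nameopt is not None:
            # For example "RFC2253" for the comma-separated form.
            cmd += ["-nameopt", nameopt]
        return cmd

    def parse_subject_line(line):
        """Strip the leading 'subject=' label that openssl prints."""
        return line.strip().split("=", 1)[1].strip()

    def read_subject(cert_path, nameopt=None):
        """Run openssl (must be on the PATH) and return the subject DN."""
        result = subprocess.run(
            subject_command(cert_path, nameopt),
            check=True,
            capture_output=True,
            text=True,
        )
        return parse_subject_line(result.stdout)

Calling ``read_subject`` with the same path you enter in the Node Certificate
Path field gives a string that can be pasted into the Node Subject field
without retyping it.
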
The ``Synchronize`` checkbox allows the administrator to decide whether to turn on
synchronization with the DataONE network. When this box is unchecked, the DataONE
Coordinating Nodes will not attempt to synchronize at all. When it is checked,
DataONE will periodically contact the node to synchronize all metadata content.
To be part of the DataONE network, this box must be checked, as that allows
DataONE to receive a copy of the metadata associated with each object in the Metacat
system. The switch is provided for those rare cases when a node needs to be disconnected
from DataONE for maintenance or service outages. When the box is checked, DataONE
contacts the node using the schedule provided in the ``Synchronization Schedule``
fields. The example in the dialog above has synchronization occurring once every third
minute, at the 10 second mark of that minute. The syntax for these schedules
follows the Quartz Crontab Entry syntax, which provides for many flexible schedule
configurations. If the administrator desires a less frequent schedule, such as daily,
that can be configured by changing the ``*`` in the ``Hours`` field to a concrete
hour (such as ``11``) and the ``Minutes`` field to a concrete value like ``15``,
which would change the schedule to synchronize at 11:15 am daily.

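The two schedules described above can be written out as complete Quartz
expressions. This is a minimal sketch that assumes the standard Quartz field
order (seconds, minutes, hours, day of month, month, day of week); the helper
name is hypothetical, and the exact values shown in your configuration screen
may differ.

.. code-block:: python

    def quartz_schedule(seconds="0", minutes="*", hours="*",
                        day_of_month="*", month="*", day_of_week="?"):
        """Join the schedule fields into a single Quartz cron expression.

        Quartz requires '?' in either the day-of-month or the day-of-week
        field, so day_of_week defaults to '?'.
        """
        return " ".join(
            [seconds, minutes, hours, day_of_month, month, day_of_week]
        )

    # Every third minute, at the 10 second mark.
    every_third_minute = quartz_schedule(seconds="10", minutes="0/3")
    print(every_third_minute)  # 10 0/3 * * * ?

    # Daily at 11:15 am, still at the 10 second mark.
    daily = quartz_schedule(seconds="10", minutes="15", hours="11")
    print(daily)  # 10 15 11 * * ?
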
Once these parameters have been properly set, use the ``Register`` button to
request registration with the DataONE Coordinating Node. This will generate a
registration document describing this Metacat instance and send it to the
Coordinating Node registration service, which will return a unique Node Identifier
that Metacat records. At that point, all that remains is to wait for
the DataONE administrators to approve the node registration. Details of the approval
process can be found on the `DataONE web site`_.

.. _DataONE web site: http://www.dataone.org

Access Control Policies
-----------------------
Metacat has supported fine-grained access control for objects in the system since
its inception. DataONE has devised a simple but effective access control system
that is compatible with the prior system in Metacat. For each object in the DataONE
system (including data objects, scientific metadata objects, and resource maps),
a SystemMetadata_ document describes the critical metadata needed to manage that
object in the system. This metadata includes a ``RightsHolder`` field and an
``AuthoritativeMemberNode`` field that are used to list the people and node that
have ultimate control over the disposition of the object. In addition, a separate
AccessPolicy_ can be included in the ``SystemMetadata`` for the object. This ``AccessPolicy``
consists of a set of rules that grant additional permissions to other people,
groups, and systems in DataONE. For example, for one data file, two users
(Alice and Bob) may be able to make changes to the object, and the general public may
be allowed to read the object. In the absence of explicit rules extending these permissions,
Metacat enforces the rule that only the ``RightsHolder`` and ``AuthoritativeMemberNode`` have
rights to the object, and that the Coordinating Node can manage ``SystemMetadata``
for the object. An example AccessPolicy that might be submitted with a dataset
(giving Alice and Bob permission to read and write the object) follows:

::

    ...
    <accessPolicy>
        <allow>
            <subject>/C=US/O=SomeIdP/CN=Alice</subject>
            <subject>/C=US/O=SomeIdP/CN=Bob</subject>
            <permission>read</permission>
            <permission>write</permission>
        </allow>
    </accessPolicy>
    ...

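A client that assembles ``SystemMetadata`` programmatically can generate the
same fragment instead of writing it by hand. The following sketch uses only the
Python standard library; the function name is hypothetical, and the element
names are taken from the example above rather than validated against the
DataONE schema.

.. code-block:: python

    import xml.etree.ElementTree as ET

    def build_access_policy(rules):
        """Build an <accessPolicy> element.

        ``rules`` is a list of (subjects, permissions) pairs; each pair
        becomes one <allow> rule.
        """
        policy = ET.Element("accessPolicy")
        for subjects, permissions in rules:
            allow = ET.SubElement(policy, "allow")
            for subject in subjects:
                ET.SubElement(allow, "subject").text = subject
            for permission in permissions:
                ET.SubElement(allow, "permission").text = permission
        return policy

    # Reproduce the rule from the example: Alice and Bob may read and write.
    policy = build_access_policy([
        (["/C=US/O=SomeIdP/CN=Alice", "/C=US/O=SomeIdP/CN=Bob"],
         ["read", "write"]),
    ])
    print(ET.tostring(policy, encoding="unicode"))

Generating the elements through an XML library also takes care of escaping
characters such as ``&`` or ``<`` that may occur in a subject name.
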
These AccessPolicies can be embedded inside the ``SystemMetadata`` that accompanies
submission of an object through the `MNStorage.create`_ and `MNStorage.update`_ services,
or can be set using the `CNAuthorization.setAccessPolicy`_ service.

.. _SystemMetadata: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/Types.html#Types.SystemMetadata

.. _AccessPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/Types.html#Types.AccessPolicy

.. _MNStorage.create: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/MN_APIs.html#MNStorage.create

.. _MNStorage.update: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/MN_APIs.html#MNStorage.update

.. _CNAuthorization.setAccessPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/CN_APIs.html#CNAuthorization.setAccessPolicy

Configuration as a replication target
-------------------------------------
DataONE is designed to enable a robust preservation environment through replication
of digital objects at multiple Member Nodes. Any Member Node in DataONE that implements
the Tier 4 service interface can offer to act as a target for object replication.
Currently, Metacat configuration supports turning this replication function on or off.
When the 'Act as a replication target' checkbox is checked, Metacat will notify
the Coordinating Nodes in DataONE that it is available to house replicas of objects
from other nodes. Shortly thereafter, the Coordinating Nodes may notify Metacat to
replicate objects from throughout the system, which it will start to do. These objects
will begin to be listed in the Metacat catalog.

.. Note::

    Future versions of Metacat will allow finer specification of the Node
    Replication Policy, which determines the set of objects
    that it is willing to replicate, using constraints on object size, total objects,
    source nodes, and object format types.

Object Replication Policies
---------------------------
In addition to access control, each object can also have a ``ReplicationPolicy``
associated with it that determines whether DataONE should attempt to replicate the
object for failover and backup purposes to other Member Nodes in the federation.
Both the ``RightsHolder`` and ``AuthoritativeMemberNode`` for an object can set the
``ReplicationPolicy``, which consists of fields that describe how many replicas
should be maintained, and any nodes that are preferred for housing those replicas or
that should be blocked from housing replicas.

These ReplicationPolicies can be embedded inside the ``SystemMetadata`` that accompanies
submission of an object through the `MNStorage.create`_ and `MNStorage.update`_ services,
or can be set using the `CNReplication.setReplicationPolicy`_ service.

.. _CNReplication.setReplicationPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/CN_APIs.html#CNReplication.setReplicationPolicy

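The fields described above map naturally onto a small XML fragment. The sketch
below is an illustration under stated assumptions: the element and attribute
names (``replicationPolicy``, ``replicationAllowed``, ``numberReplicas``,
``preferredMemberNode``, ``blockedMemberNode``) follow the DataONE type
definitions as best understood here and should be checked against the schema,
and the function name and node identifiers are hypothetical.

.. code-block:: python

    import xml.etree.ElementTree as ET

    def build_replication_policy(number_replicas, preferred=(), blocked=(),
                                 allowed=True):
        """Build a <replicationPolicy> element from the described fields."""
        policy = ET.Element("replicationPolicy", {
            "replicationAllowed": "true" if allowed else "false",
            "numberReplicas": str(number_replicas),
        })
        for node_id in preferred:
            ET.SubElement(policy, "preferredMemberNode").text = node_id
        for node_id in blocked:
            ET.SubElement(policy, "blockedMemberNode").text = node_id
        return policy

    # Ask for three replicas, prefer one node, and block another.
    policy = build_replication_policy(
        3,
        preferred=["urn:node:EXAMPLE_A"],  # hypothetical node identifier
        blocked=["urn:node:EXAMPLE_B"],    # hypothetical node identifier
    )
    print(ET.tostring(policy, encoding="unicode"))
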