DataONE Member Node Support
===========================
DataONE_ is a federation of data repositories that aims to improve
interoperability among data repository software systems and advance the
preservation of scientific data for future use.
Metacat deployments can be configured to participate in DataONE_. This
chapter describes the DataONE_ data federation, its architecture, and the
way in which Metacat can be used to participate as a node in the DataONE system.

.. _DataONE: http://dataone.org/

    
What is DataONE?
----------------
The DataONE_ project is a collaboration among scientists, technologists, librarians,
and social scientists to build a robust, interoperable, and sustainable system for
preserving and accessing Earth observational data at national and global scales.
Supported by the U.S. National Science Foundation, DataONE partners focus on
technological, financial, and organizational sustainability approaches to
building a distributed network of data repositories that are fully interoperable,
even when those repositories use divergent underlying software and support different
data and metadata content standards. DataONE defines a common web-service
programming interface that allows the main software components of the DataONE system
to communicate seamlessly. The components of the DataONE system include:

* DataONE Service Interface
* Member Nodes
* Coordinating Nodes
* Investigator Toolkit

Metacat implements the services needed to operate as a DataONE Member Node,
as described below.  The service interface then allows many different scientific
software tools for data management, analysis, visualization, and other parts of
the scientific lifecycle to communicate directly with Metacat without being
further specialized beyond the support needed for DataONE.  This streamlines the
process of writing scientific software both for servers and client tools.

    
The DataONE Service Interface
-----------------------------
DataONE achieves interoperability by defining a lightweight but powerful set of
REST_ web services that can be implemented by various data management software
systems, allowing those systems to communicate effectively with one another and
to exchange data, metadata, and other scientific objects.  This `DataONE Service Interface`_
is an open standard that defines the communication protocols and technical
expectations for software components that wish to participate in the DataONE
federation. This service interface is divided into `four distinct tiers`_, with the
intention that any given software system may implement only those tiers that are
relevant to its repository; for example, a data aggregator might only implement
the Tier 1 interfaces that provide anonymous access to public data sets, while
a complete data management system like Metacat can implement all four tiers:

1. **Tier 1:** Read-only, anonymous data access
2. **Tier 2:** Read-only, with authentication and access control
3. **Tier 3:** Full write access
4. **Tier 4:** Replication target services

.. _REST: http://en.wikipedia.org/wiki/Representational_state_transfer

.. _DataONE Service Interface: http://releases.dataone.org/online/d1-architecture-1.0.0

.. _four distinct tiers: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/index.html
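
As a rough illustration of what a Tier 1 client does, the following Python
sketch builds request URLs for a few read-only calls against a Member Node.
The base URL ``https://mn.example.org/metacat/d1/mn`` and the helper name
``member_node_url`` are hypothetical.  The ``/v1/object``, ``/v1/meta``, and
``/v1/node`` paths reflect our reading of the version 1 REST mapping in the
DataONE architecture documentation and should be checked against the API
reference for the version you deploy.

.. code-block:: python

  from urllib.parse import quote

  def member_node_url(base_url, resource, pid=None):
      """Build a DataONE v1 Member Node REST URL (illustrative helper)."""
      url = f"{base_url.rstrip('/')}/v1/{resource}"
      if pid is not None:
          # Identifiers may contain '/', ':' and other reserved characters,
          # so percent-encode the whole identifier as a single path segment.
          url += "/" + quote(pid, safe="")
      return url

  BASE = "https://mn.example.org/metacat/d1/mn"  # hypothetical Member Node
  PID = "doi:10.5063/AA/example.1.1"             # hypothetical identifier

  object_url = member_node_url(BASE, "object", PID)   # object bytes
  sysmeta_url = member_node_url(BASE, "meta", PID)    # system metadata
  node_url = member_node_url(BASE, "node")            # node capabilities

Because these calls are anonymous and read-only, any HTTP client can issue
them with a plain ``GET`` request; no certificate is needed for public objects.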

    
Member Nodes
------------
In DataONE, Member Nodes form the core of the network: they represent
particular scientific communities, manage and preserve their data and metadata, and
provide tools to their community for contributing, managing, and accessing data.
DataONE provides a standard way for these individual repositories to interact, and helps
to coordinate among the Member Nodes in the federation.  This allows Member Nodes
to provide services to each other, such as replication of data for backup and failover.
To be a Member Node, a repository must implement the Member Node service interface,
and then register with DataONE.  Metacat provides this implementation automatically,
and provides an easy configuration option to register a Metacat instance as a
DataONE Member Node (see the configuration section below). If you are deploying a Metacat
instance, it is relatively simple to become a Member Node, but keep in mind that
DataONE is aiming for longevity and preservation, and so is selecting for nodes
that have long-term data preservation as part of their mission.

    
Coordinating Nodes
------------------
The DataONE Coordinating Nodes provide a set of services to Member Nodes that
allow Member Nodes to easily interact with one another and to provide a unified
view of the whole DataONE Federation.  The main services provided by Coordinating
Nodes are:

* Global search index for all metadata and web portal for data discovery
* Resolution service to map unique identifiers to the Member Nodes that hold data
* Authentication against a shared set of accounts based on CILogon_ and InCommon_
* Replication management services to reliably replicate data according to
  policies set by the Member Nodes
* Fixity checking to ensure that preserved objects remain valid
* Member Node registration and management
* Aggregated logging for data access across the whole federation

Three geographically distributed Coordinating Nodes replicate these coordinating
services at UC Santa Barbara, the University of New Mexico, and the Oak Ridge Campus.
Coordinating Nodes are set up in a fully redundant manner, such that any of the
Coordinating Nodes can be offline and the others will continue to provide the services
without interruption.  The Coordinating Nodes expose their services at::

  https://cn.dataone.org/cn

And the DataONE search portal is available at::

  https://cn.dataone.org/

.. _CILogon: http://www.cilogon.org

.. _InCommon: http://incommon.org
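
To make the resolution and registration services more concrete, the sketch
below builds URLs against the Coordinating Node service root given above.
The helper names are hypothetical, and the ``/v1/resolve`` and ``/v1/node``
paths are our understanding of the version 1 REST mapping rather than
something this chapter specifies; verify them against the DataONE API
reference before relying on them.

.. code-block:: python

  from urllib.parse import quote
  from urllib.request import urlopen

  CN_BASE_URL = "https://cn.dataone.org/cn"  # service root from the text above

  def resolve_url(pid, cn_base_url=CN_BASE_URL):
      """URL that asks a Coordinating Node which nodes hold an identifier."""
      return f"{cn_base_url}/v1/resolve/{quote(pid, safe='')}"

  def node_list_url(cn_base_url=CN_BASE_URL):
      """URL that lists the nodes registered with the federation."""
      return f"{cn_base_url}/v1/node"

  def fetch(url, timeout=30):
      """Issue a plain GET and return the response body as bytes."""
      with urlopen(url, timeout=timeout) as response:
          return response.read()

Calling ``fetch(node_list_url())`` requires network access and returns an XML
document; the URL helpers themselves have no side effects.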

    
Investigator Toolkit
--------------------
In order to provide scientists with convenient access to the data and metadata in
DataONE, the Investigator Toolkit provides a library of software tools that have been
adapted to work with DataONE via the service interface and can be used to
discover, manage, analyze, and visualize data in DataONE.  For example, DataONE
plans to release metadata editors (e.g., Morpho), data search tools (e.g., Mercury),
data access tools (e.g., ONEDrive), and data analysis tools (e.g., R) that all
know how to interact with DataONE Member Nodes and Coordinating Nodes.  Consequently,
scientists will be able to access data from any DataONE Member Node, such as a Metacat
node, directly from within the R environment.  In addition, software tools that
are written to work with one Member Node should also work with others, thereby
greatly increasing the efficiency of creating an entire toolkit of software that
is useful to investigators.

Because DataONE services are REST web services, software written in any
programming language can be adapted to interact with DataONE.
In addition, to ease the process of adapting tools to work with DataONE, libraries
are provided for common programming languages such as Java (d1-libclient-java)
and Python (d1_libclient-python) that allow simple function calls
to be used to access any DataONE service.

    
Configuring Metacat as a Member Node
------------------------------------
Configuring Metacat as a DataONE Member Node is accomplished with the standard
Metacat Administrative configuration utility. To access the utility, visit the
following URL::

  http://<yourhost.org>/<context>/admin

where ``<yourhost.org>`` represents the hostname of your webserver running Metacat,
and ``<context>`` is the name of the web context in which Metacat was installed.
Once at the administrative utility, click on the DataONE configuration link, which
should show the following screen:

.. figure:: images/screenshots/screen-dataone-config.png
   :align: center

   The configuration screen for configuring Metacat as a DataONE node.

To configure Metacat as a node in the DataONE network, configure the properties shown
in the figure above.  The Node Name should be a short name for the node that can
be used in user interface displays that list the node.  For example, one node in
DataONE is the 'Knowledge Network for Biocomplexity'.  Also provide a brief sentence
or two describing the node, including its intended scope and purpose.

The Node Identifier field is a unique identifier assigned by DataONE to identify
this node even when the node changes physical locations over time.  When Metacat
registers with the DataONE Coordinating Nodes (when you click 'Register' at the
bottom of this form), the Node Identifier is automatically set.  **It is critical that
you not change the Node Identifier**, as that will break the connection with the
DataONE network.  The ability to edit this field is only provided for the rare case
in which a new Metacat instance is being established to act as the provider for an
existing DataONE Member Node, in which case the field can be edited to set it to
the value of a valid, existing Node Identifier.

The Node Subject and Node Certificate Path are linked fields that are critical for
proper operation of the node.  To act as a Member Node in DataONE, you must obtain
an X.509 certificate that can be used to identify this node and allow it to securely
communicate using SSL with other nodes and client applications.  This certificate can
either be obtained from the DataONE Certificate Authority, or from a commercial
provider of certificates. Once you have the certificate in hand, use a tool such
as ``openssl`` to determine the exact subject distinguished name in the
certificate, and use that to set the Node Subject field.  Set the Node
Certificate Path to the location on the system in which you stored the
certificate file.
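
One possible way to read the subject from a certificate is to call the
``openssl x509`` command and strip the label it prints.  The sketch below
wraps that in Python; the function names and the example certificate path are
hypothetical, and ``openssl`` must be installed and on the ``PATH``.  The
layout of the printed distinguished name varies between ``openssl`` versions,
so compare the result with the subject issued to you before entering it in
the Node Subject field.

.. code-block:: python

  import subprocess

  def subject_command(cert_path):
      """Return the openssl argument list that prints a certificate subject."""
      return ["openssl", "x509", "-in", cert_path, "-noout", "-subject"]

  def parse_subject(openssl_output):
      """Strip the leading 'subject=' label from openssl's output."""
      text = openssl_output.strip()
      if text.startswith("subject="):
          text = text[len("subject="):]
      return text.strip()

  def read_subject(cert_path):
      """Run openssl on cert_path and return the subject string."""
      result = subprocess.run(subject_command(cert_path),
                              capture_output=True, text=True, check=True)
      return parse_subject(result.stdout)

For example, ``read_subject("/var/metacat/certs/node.pem")`` (a made-up path)
would return the subject string for the certificate stored at that location.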

    
The ``Synchronize`` checkbox allows the administrator to decide whether to turn on
synchronization with the DataONE network.  When this box is unchecked, the DataONE
Coordinating Nodes will not attempt to synchronize at all; when it is checked,
DataONE will periodically contact the node to synchronize all metadata content.
To be part of the DataONE network, this box must be checked, as that allows
DataONE to receive a copy of the metadata associated with each object in the Metacat
system.  The switch is provided for those rare cases when a node needs to be disconnected
from DataONE for maintenance or service outages.  When the box is checked, DataONE
contacts the node using the schedule provided in the ``Synchronization Schedule``
fields.  The example in the dialog above has synchronization occurring once every third
minute, at the 10 second mark of those minutes.  The syntax for these schedules
follows the Quartz Crontab Entry syntax, which provides for many flexible schedule
configurations.  If the administrator desires a less frequent schedule, such as daily,
that can be configured by changing the ``*`` in the ``Hours`` field to a concrete
hour (such as ``11``) and the ``Minutes`` field to a concrete value like ``15``,
which would change the schedule to synchronize at 11:15 am daily.
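
To show how the two schedules described above look as Quartz cron
expressions, the sketch below joins the six mandatory Quartz fields (seconds,
minutes, hours, day of month, month, day of week) into one string.  The
function name is hypothetical, and how Metacat assembles its form fields
internally is an assumption; only the Quartz field order is taken as given.

.. code-block:: python

  def quartz_schedule(seconds="0", minutes="*", hours="*",
                      day_of_month="*", month="*", day_of_week="?"):
      """Join six Quartz cron fields into one expression string."""
      # This sketch insists that exactly one of the two day fields is '?',
      # a conservative reading of the Quartz rule that both cannot be set.
      if (day_of_month == "?") == (day_of_week == "?"):
          raise ValueError(
              "exactly one of day_of_month and day_of_week must be '?'")
      fields = (seconds, minutes, hours, day_of_month, month, day_of_week)
      return " ".join(str(field) for field in fields)

  # Every third minute, at the 10 second mark.
  every_third_minute = quartz_schedule(seconds="10", minutes="0/3")

  # Daily at 11:15 am, keeping the 10 second mark.
  daily_at_1115 = quartz_schedule(seconds="10", minutes="15", hours="11")

A less frequent schedule reduces load on the node but delays the point at
which new or changed metadata becomes visible in the federation.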

    
Once these parameters have been properly set, use the ``Register`` button to
request registration with the DataONE Coordinating Node.  This will generate a
registration document describing this Metacat instance and send it to the
Coordinating Node registration service, which will return a unique Node Identifier
that Metacat records.  At that point, all that remains is to wait for
the DataONE administrators to approve the node registration.  Details of the approval
process can be found on the `DataONE web site`_.

.. _DataONE web site: http://www.dataone.org

    
Access Control Policies
-----------------------
Metacat has supported fine-grained access control for objects in the system since
its inception.  DataONE has devised a simple but effective access control system
that is compatible with the prior system in Metacat.  For each object in the DataONE
system (including data objects, scientific metadata objects, and resource maps),
a SystemMetadata_ document describes the critical metadata needed to manage that
object in the system.  This metadata includes a ``RightsHolder`` field and an
``AuthoritativeMemberNode`` field that are used to list the people and node that
have ultimate control over the disposition of the object.  In addition, a separate
AccessPolicy_ can be included in the ``SystemMetadata`` for the object.  This ``AccessPolicy``
consists of a set of rules that grant additional permissions to other people,
groups, and systems in DataONE.  For example, for one data file, two users
(Alice and Bob) may be able to make changes to the object, and the general public may
be allowed to read the object.  In the absence of explicit rules extending these permissions,
Metacat enforces the rule that only the ``RightsHolder`` and ``AuthoritativeMemberNode`` have
rights to the object, and that the Coordinating Node can manage ``SystemMetadata``
for the object.  An example AccessPolicy that might be submitted with a dataset
(giving Alice and Bob permission to read and write the object) follows:

::

  ...
  <accessPolicy>
    <allow>
      <subject>/C=US/O=SomeIdP/CN=Alice</subject>
      <subject>/C=US/O=SomeIdP/CN=Bob</subject>
      <permission>read</permission>
      <permission>write</permission>
    </allow>
  </accessPolicy>
  ...

These AccessPolicies can be embedded inside of the ``SystemMetadata`` that accompanies
submission of an object through the `MNStorage.create`_ and `MNStorage.update`_ services,
or can be set using the `CNAuthorization.setAccessPolicy`_ service.
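
A policy fragment like the example above can also be generated
programmatically.  The following sketch uses Python's standard
``xml.etree.ElementTree`` module; the function name is hypothetical, and the
second rule, which grants read access to the symbolic subject ``public``, is
an assumption about how public access is expressed that should be checked
against the DataONE type documentation.

.. code-block:: python

  import xml.etree.ElementTree as ET

  def build_access_policy(rules):
      """Build an <accessPolicy> element from (subjects, permissions) pairs."""
      policy = ET.Element("accessPolicy")
      for subjects, permissions in rules:
          # Each rule becomes one <allow> element listing who gets what.
          allow = ET.SubElement(policy, "allow")
          for subject in subjects:
              ET.SubElement(allow, "subject").text = subject
          for permission in permissions:
              ET.SubElement(allow, "permission").text = permission
      return policy

  policy = build_access_policy([
      (["/C=US/O=SomeIdP/CN=Alice", "/C=US/O=SomeIdP/CN=Bob"],
       ["read", "write"]),
      (["public"], ["read"]),  # assumed symbolic subject for anonymous users
  ])
  policy_xml = ET.tostring(policy, encoding="unicode")

The resulting string is only the ``accessPolicy`` fragment; it still has to be
placed inside a complete ``SystemMetadata`` document before submission.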

    
.. _SystemMetadata: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/Types.html#Types.SystemMetadata

.. _AccessPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/Types.html#Types.AccessPolicy

.. _MNStorage.create: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/MN_APIs.html#MNStorage.create

.. _MNStorage.update: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/MN_APIs.html#MNStorage.update

.. _CNAuthorization.setAccessPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/CN_APIs.html#CNAuthorization.setAccessPolicy

    
Configuration as a replication target
-------------------------------------
DataONE is designed to enable a robust preservation environment through replication
of digital objects at multiple Member Nodes.  Any Member Node in DataONE that implements
the Tier 4 Service interface can offer to act as a target for object replication.
Currently, Metacat configuration supports turning this replication function on or off.
When the 'Act as a replication target' checkbox is checked, Metacat will notify
the Coordinating Nodes in DataONE that it is available to house replicas of objects
from other nodes.  Shortly thereafter, the Coordinating Nodes may notify Metacat to
replicate objects from throughout the system, which it will start to do.  These objects
will then begin to be listed in the Metacat catalog.

.. Note::

  Future versions of Metacat will allow finer specification of the Node
  Replication Policy, which determines the set of objects
  that it is willing to replicate, using constraints on object size, total objects,
  source nodes, and object format types.

    
Object Replication Policies
---------------------------
In addition to access control, each object can also have a ``ReplicationPolicy``
associated with it that determines whether DataONE should attempt to replicate the
object for failover and backup purposes to other Member Nodes in the federation.
Both the ``RightsHolder`` and ``AuthoritativeMemberNode`` for an object can set the
``ReplicationPolicy``, which consists of fields that describe how many replicas
should be maintained, and any nodes that are preferred for housing those replicas, or
that should be blocked from housing replicas.

These ReplicationPolicies can be embedded inside of the ``SystemMetadata`` that accompanies
submission of an object through the `MNStorage.create`_ and `MNStorage.update`_ services,
or can be set using the `CNReplication.setReplicationPolicy`_ service.
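
As an illustration of the fields just described, the sketch below builds a
``replicationPolicy`` fragment with Python's standard library.  The attribute
names (``replicationAllowed``, ``numberReplicas``) and element names
(``preferredMemberNode``, ``blockedMemberNode``) are our understanding of the
DataONE schema rather than something this chapter specifies, and the node
identifiers are made up; check both against the DataONE type documentation.

.. code-block:: python

  import xml.etree.ElementTree as ET

  def build_replication_policy(number_replicas, preferred=(), blocked=(),
                               allowed=True):
      """Build a <replicationPolicy> element (names assumed from the schema)."""
      policy = ET.Element("replicationPolicy", {
          "replicationAllowed": "true" if allowed else "false",
          "numberReplicas": str(number_replicas),
      })
      for node_id in preferred:
          # Nodes the owner would like replicas to be stored on.
          ET.SubElement(policy, "preferredMemberNode").text = node_id
      for node_id in blocked:
          # Nodes that must never receive a replica.
          ET.SubElement(policy, "blockedMemberNode").text = node_id
      return policy

  replication_policy = build_replication_policy(
      3,
      preferred=["urn:node:EXAMPLE_A"],  # hypothetical node identifiers
      blocked=["urn:node:EXAMPLE_B"],
  )
  replication_xml = ET.tostring(replication_policy, encoding="unicode")

As with access policies, the fragment belongs inside the ``SystemMetadata``
document for the object it governs.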

    
.. _CNReplication.setReplicationPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/CN_APIs.html#CNReplication.setReplicationPolicy