DataONE Member Node Support
===========================
DataONE_ is a federation of data repositories that aims to improve
interoperability among data repository software systems and advance the
preservation of scientific data for future use.
Metacat deployments can be configured to participate in DataONE_. This
chapter describes the DataONE_ data federation, its architecture, and the
way in which Metacat can be used to participate as a node in the DataONE system.

    
.. _DataONE: http://dataone.org/

    
What is DataONE?
----------------
The DataONE_ project is a collaboration among scientists, technologists, librarians,
and social scientists to build a robust, interoperable, and sustainable system for
preserving and accessing Earth observational data at national and global scales.
Supported by the U.S. National Science Foundation, DataONE partners focus on
technological, financial, and organizational sustainability approaches to
building a distributed network of data repositories that are fully interoperable,
even when those repositories use divergent underlying software and support different
data and metadata content standards. DataONE defines a common web-service
programming interface that allows the main software components of the DataONE system
to communicate seamlessly. The components of the DataONE system include:

    
* DataONE Service Interface
* Member Nodes
* Coordinating Nodes
* Investigator Toolkit

    
Metacat implements the services needed to operate as a DataONE Member Node,
as described below.  The service interface then allows many different scientific
software tools for data management, analysis, visualization, and other parts of
the scientific lifecycle to communicate directly with Metacat without being
further specialized beyond the support needed for DataONE.  This streamlines the
process of writing scientific software both for servers and client tools.

    
The DataONE Service Interface
-----------------------------
DataONE achieves interoperability by defining a lightweight but powerful set of
REST_ web services that can be implemented by various data management software
systems, allowing those systems to communicate effectively with one another and
exchange data, metadata, and other scientific objects.  This `DataONE Service Interface`_
is an open standard that defines the communication protocols and technical
expectations for software components that wish to participate in the DataONE
federation. This service interface is divided into `four distinct tiers`_, with the
intention that any given software system may implement only those tiers that are
relevant to its repository; for example, a data aggregator might only implement
the Tier 1 interfaces that provide anonymous access to public data sets, while
a complete data management system like Metacat can implement all four tiers:

    
1. **Tier 1:** Read-only, anonymous data access
2. **Tier 2:** Read-only, with authentication and access control
3. **Tier 3:** Full write access
4. **Tier 4:** Replication target services
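
The way the tiers build on one another can be sketched in a few lines of Python. This is only an illustration: the API group names (``MNCore``, ``MNRead``, ``MNAuthorization``, ``MNStorage``, ``MNReplication``) and the assumption that each tier includes the tiers below it are taken from the DataONE architecture documentation rather than from this chapter, and the helper ``apis_for_tier`` is a hypothetical name.

```python
# Illustrative mapping of each tier to the Member Node API group it adds.
# The group names are an assumption based on the DataONE architecture docs.
TIER_APIS = {
    1: ["MNCore", "MNRead"],   # read-only, anonymous data access
    2: ["MNAuthorization"],    # authentication and access control
    3: ["MNStorage"],          # write access
    4: ["MNReplication"],      # replication target services
}


def apis_for_tier(tier):
    """Return every API group a node at the given tier would implement.

    Assumes tiers are cumulative: a Tier 3 node also provides Tiers 1 and 2.
    """
    if tier not in TIER_APIS:
        raise ValueError("tier must be between 1 and 4")
    apis = []
    for level in range(1, tier + 1):
        apis.extend(TIER_APIS[level])
    return apis


print(apis_for_tier(1))  # a read-only aggregator
print(apis_for_tier(4))  # a full node such as Metacat
```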

    
.. _REST: http://en.wikipedia.org/wiki/Representational_state_transfer

.. _DataONE Service Interface: http://releases.dataone.org/online/d1-architecture-1.0.0

.. _four distinct tiers: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/index.html

    
Member Nodes
------------
In DataONE, Member Nodes form the core of the network: they represent
particular scientific communities, manage and preserve their data and metadata, and
provide tools to their community for contributing, managing, and accessing data.
DataONE provides a standard way for these individual repositories to interact, and helps
to coordinate among the Member Nodes in the federation.  This allows Member Nodes
to provide services to each other, such as replication of data for backup and failover.
To be a Member Node, a repository must implement the Member Node service interface
and then register with DataONE.  Metacat provides this implementation automatically,
and provides an easy configuration option to register a Metacat instance as a
DataONE Member Node (see the configuration section below). If you are deploying a Metacat
instance, it is relatively simple to become a Member Node, but keep in mind that
DataONE is aiming for longevity and preservation, and so is selecting for nodes
that have long-term data preservation as part of their mission.

    
Coordinating Nodes
------------------
The DataONE Coordinating Nodes provide a set of services to Member Nodes that
allow Member Nodes to easily interact with one another and to provide a unified
view of the whole DataONE Federation.  The main services provided by Coordinating
Nodes are:

    
* Global search index for all metadata and web portal for data discovery
* Resolution service to map unique identifiers to the Member Nodes that hold data
* Authentication against a shared set of accounts based on CILogon_ and InCommon_
* Replication management services to reliably replicate data according to
  policies set by the Member Nodes
* Fixity checking to ensure that preserved objects remain valid
* Member Node registration and management
* Aggregated logging for data access across the whole federation

    
Three geographically distributed Coordinating Nodes replicate these coordinating
services at UC Santa Barbara, the University of New Mexico, and the Oak Ridge Campus.
Coordinating Nodes are set up in a fully redundant manner, such that any one of the
Coordinating Nodes can be offline and the others will continue to provide the services
without interruption.  The Coordinating Nodes expose their services at::

  https://cn.dataone.org/cn

And the DataONE search portal is available at::

  https://cn.dataone.org/
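
As a rough sketch of how a client might address these services, the following Python builds request URLs against the Coordinating Node base URL given above. The ``v1`` version segment and the ``node`` and ``resolve`` service paths are assumptions drawn from the DataONE architecture documentation, the identifier is made up, and ``cn_service_url`` is a hypothetical helper. No request is actually sent.

```python
from urllib.parse import quote

CN_BASE_URL = "https://cn.dataone.org/cn"  # base URL from the text above
API_VERSION = "v1"                         # assumed version path segment


def cn_service_url(service, identifier=None):
    """Build the URL for a Coordinating Node REST service.

    Identifiers may contain characters such as '/' and ':', so the whole
    identifier is percent-encoded before it is placed in the URL path.
    """
    parts = [CN_BASE_URL, API_VERSION, service]
    if identifier is not None:
        parts.append(quote(identifier, safe=""))
    return "/".join(parts)


# List of registered nodes (assumed service path).
print(cn_service_url("node"))

# Resolve a made-up identifier to the nodes that hold the object.
print(cn_service_url("resolve", "doi:10.5063/EXAMPLE"))
```

Encoding the slash inside the identifier is what makes the ``AllowEncodedSlashes`` style settings described later in this chapter necessary on the server side.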

    
.. _CILogon: http://www.cilogon.org

.. _InCommon: http://incommon.org

    
Investigator Toolkit
--------------------
To provide scientists with convenient access to the data and metadata in
DataONE, the Investigator Toolkit offers a library of software tools that have been
adapted to work with DataONE via the service interface and can be used to
discover, manage, analyze, and visualize data in DataONE.  For example, DataONE
plans to release metadata editors (e.g., Morpho), data search tools (e.g., Mercury),
data access tools (e.g., ONEDrive), and data analysis tools (e.g., R) that all
know how to interact with DataONE Member Nodes and Coordinating Nodes.  Consequently,
scientists will be able to access data from any DataONE Member Node, such as a Metacat
node, directly from within the R environment.  In addition, software tools that
are written to work with one Member Node should also work with others, thereby
greatly increasing the efficiency of creating an entire toolkit of software that
is useful to investigators.

    
Because DataONE services are REST web services, software written in any
programming language can be adapted to interact with DataONE.
In addition, to ease the process of adapting tools to work with DataONE, libraries
are provided for common programming languages such as Java (d1-libclient-java)
and Python (d1_libclient-python) that allow simple function calls
to be used to access any DataONE service.

    
Configuring Metacat as a Member Node
------------------------------------
Configuring Metacat as a DataONE Member Node is accomplished with the standard
Metacat Administrative configuration utility. To access the utility, visit the
following URL::

  http://<yourhost.org>/<context>/admin

where ``<yourhost.org>`` represents the hostname of your web server running Metacat,
and ``<context>`` is the name of the web context in which Metacat was installed.
Once at the administrative utility, click on the DataONE configuration link, which
should show the following screen:

    
.. figure:: images/screenshots/screen-dataone-config.png
   :align: center

   The configuration screen for configuring Metacat as a DataONE node.

    
To configure Metacat as a node in the DataONE network, configure the properties shown
in the figure above.  The Node Name should be a short name for the node that can
be used in user interface displays that list the node.  For example, one node in
DataONE is the 'Knowledge Network for Biocomplexity'.  Also provide a brief sentence
or two describing the node, including its intended scope and purpose.

    
The Node Identifier field is a unique identifier assigned by DataONE to identify
this node even when the node changes physical locations over time.  After Metacat
registers with the DataONE Coordinating Nodes (when you click 'Register' at the
bottom of this form), the Node Identifier should not be changed.  **It is critical that
you not change the Node Identifier after registration**, as that will break the connection with the
DataONE network.  Changing this field should only happen in the rare case
in which a new Metacat instance is being established to act as the provider for an
existing DataONE Member Node, in which case the field can be edited to set it to
the value of a valid, existing Node Identifier.

    
The Node Subject and Node Certificate Path are linked fields that are critical for
proper operation of the node.  To act as a Member Node in DataONE, you must obtain
an X.509 certificate that can be used to identify this node and allow it to securely
communicate using SSL with other nodes and client applications.  This certificate can
be obtained from the DataONE Certificate Authority.
Once you have the certificate in hand, use a tool such
as ``openssl`` to determine the exact subject distinguished name in the
certificate, and use that to set the Node Subject field.  Set the Node
Certificate Path to the location on the system in which you stored the
certificate file. Be sure to protect the certificate file, as it contains the
private key that is used to authenticate this node within DataONE.
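
One possible way to read the subject from the certificate is to wrap the ``openssl x509`` command in a small Python helper, sketched below. The helper names are hypothetical, and the ``-nameopt RFC2253`` option, which prints the name in comma-separated form, is an assumption about the format wanted in the Node Subject field; check the value against what DataONE issued before using it.

```python
import subprocess


def subject_command(cert_path):
    """Build the openssl command line that prints a certificate's subject."""
    return [
        "openssl", "x509",
        "-in", cert_path,       # certificate file in PEM format
        "-noout",               # do not echo the encoded certificate
        "-subject",             # print only the subject distinguished name
        "-nameopt", "RFC2253",  # comma-separated form (an assumption)
    ]


def read_subject(cert_path):
    """Run openssl and return the line it prints, which starts with 'subject='."""
    result = subprocess.run(
        subject_command(cert_path),
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```

Calling ``read_subject`` with the same path entered in the Node Certificate Path field keeps the two linked fields consistent.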

    
.. Note::

   For Tier 2 deployments and above, the Metacat Member Node must have Apache configured to request
   client certificates. Detailed instructions are included at the end of this chapter.

    
The ``Enable DataONE Services`` checkbox allows the administrator to decide whether to
turn on synchronization with the DataONE network.  When this box is unchecked, the
DataONE Coordinating Nodes will not attempt to synchronize at all; when it is checked,
DataONE will periodically contact the node to synchronize all metadata content.
To be part of the DataONE network, this box must be checked, as that allows
DataONE to receive a copy of the metadata associated with each object in the Metacat
system.  The switch is provided for those rare cases when a node needs to be disconnected
from DataONE for maintenance or service outages.  When the box is checked, DataONE
contacts the node using the schedule provided in the ``Synchronization Schedule``
fields.  The example in the dialog above has synchronization occurring once every three
minutes, at the 10 second mark of those minutes.  The syntax for these schedules
follows the Quartz Crontab Entry syntax, which provides for many flexible schedule
configurations.  If the administrator desires a less frequent schedule, such as daily,
that can be configured by changing the ``*`` in the ``Hours`` field to be a concrete
hour (such as ``11``) and the ``Minutes`` field to a concrete value like ``15``,
which would change the schedule to synchronize at 11:15 am daily.
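
To make the schedule syntax concrete, the sketch below joins individual field values into a Quartz cron expression. The field order (seconds, minutes, hours, day of month, month, day of week) is standard Quartz; how the Metacat form maps its fields onto the expression, the ``0/3`` value used for "every third minute", and the helper name ``quartz_schedule`` are assumptions for illustration.

```python
def quartz_schedule(seconds="0", minutes="*", hours="*",
                    day_of_month="*", month="*", day_of_week="?"):
    """Join schedule fields into a Quartz cron expression.

    Quartz orders the fields: seconds, minutes, hours, day of month,
    month, day of week. Quartz expects '?' in one of the two day fields.
    """
    return " ".join([seconds, minutes, hours, day_of_month, month, day_of_week])


# Every third minute, at the 10 second mark of those minutes.
print(quartz_schedule(seconds="10", minutes="0/3"))

# Once a day at 11:15 am.
print(quartz_schedule(seconds="0", minutes="15", hours="11"))
```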

    
The Replication section is used to configure replication options for the node
overall and for objects stored in Metacat.  The ``Accept and Store Replicas``
checkbox is used to indicate that the administrator of this node is willing to allow
replica data and metadata from other Member Nodes to be stored on this node.  We
encourage people to allow replication to their nodes, as this increases the
scalability and flexibility of the network overall.  The three "Default" fields set
the default values for the replication policies for data and metadata on this node
that are generated when System Metadata is not available for an object (such as when
it originates from a client that is not DataONE compliant).  The ``Default Number of
Replicas`` determines how many replica copies of the object should be stored on
other Member Nodes.  A value of 0 or less indicates that no replicas should be
stored.  In addition, you can specify a list of nodes that are either preferred for
use when choosing replica nodes, or that are blocked from use as replica nodes.
This allows Member Nodes to set up bidirectional agreements with partner nodes to
replicate data across their sites. The values for both ``Default Preferred Nodes``
and ``Default Blocked Nodes`` are comma-separated lists of NodeReference identifiers
that were assigned to the target nodes by DataONE.

    
Once these parameters have been properly set, use the ``Register`` button to
request to register with the DataONE Coordinating Node.  This will generate a
registration document describing this Metacat instance and send it to the
Coordinating Node registration service.  At that point, all that remains is to wait for
the DataONE administrators to approve the node registration.  Details of the approval
process can be found on the `DataONE web site`_.

    
.. _DataONE web site: http://www.dataone.org

    
Access Control Policies
-----------------------
Metacat has supported fine-grained access control for objects in the system since
its inception.  DataONE has devised a simple but effective access control system
that is compatible with the prior system in Metacat.  For each object in the DataONE
system (including data objects, scientific metadata objects, and resource maps),
a SystemMetadata_ document describes the critical metadata needed to manage that
object in the system.  This metadata includes a ``RightsHolder`` field and an
``AuthoritativeMemberNode`` field that are used to list the people and node that
have ultimate control over the disposition of the object.  In addition, a separate
AccessPolicy_ can be included in the ``SystemMetadata`` for the object.  This ``AccessPolicy``
consists of a set of rules that grant additional permissions to other people,
groups, and systems in DataONE.  For example, for one data file, two users
(Alice and Bob) may be able to make changes to the object, and the general public may
be allowed to read the object.  In the absence of explicit rules extending these permissions,
Metacat enforces the rule that only the ``RightsHolder`` and ``AuthoritativeMemberNode`` have
rights to the object, and that the Coordinating Node can manage ``SystemMetadata``
for the object.  An example AccessPolicy that might be submitted with a dataset
(giving Alice and Bob permission to read and write the object) follows:

    
::

  ...
  <accessPolicy>
      <allow>
        <subject>/C=US/O=SomeIdP/CN=Alice</subject>
        <subject>/C=US/O=SomeIdP/CN=Bob</subject>
        <permission>read</permission>
        <permission>write</permission>
      </allow>
  </accessPolicy>
  ...
  
These AccessPolicies can be embedded inside of the ``SystemMetadata`` that accompanies
submission of an object through the `MNStorage.create`_ and `MNStorage.update`_ services,
or can be set using the `CNAuthorization.setAccessPolicy`_ service.
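
The rule evaluation described above can be illustrated with a short Python sketch that parses an ``accessPolicy`` fragment like the example and checks whether a subject has been granted a permission. This is a simplified illustration, not Metacat's implementation: it matches subjects and permissions literally, ignores groups and any permissions implied by others, and the function name ``is_allowed`` and its ``rights_holder`` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET

# The example policy from this section, without the surrounding elisions.
ACCESS_POLICY = """
<accessPolicy>
  <allow>
    <subject>/C=US/O=SomeIdP/CN=Alice</subject>
    <subject>/C=US/O=SomeIdP/CN=Bob</subject>
    <permission>read</permission>
    <permission>write</permission>
  </allow>
</accessPolicy>
"""


def is_allowed(policy_xml, subject, permission, rights_holder=None):
    """Return True if the subject holds the permission under the policy.

    The rights holder always has access; everyone else needs an explicit
    <allow> rule naming both the subject and the permission.
    """
    if rights_holder is not None and subject == rights_holder:
        return True
    root = ET.fromstring(policy_xml.strip())
    for rule in root.findall("allow"):
        subjects = {(s.text or "").strip() for s in rule.findall("subject")}
        permissions = {(p.text or "").strip() for p in rule.findall("permission")}
        if subject in subjects and permission in permissions:
            return True
    return False


print(is_allowed(ACCESS_POLICY, "/C=US/O=SomeIdP/CN=Alice", "write"))
print(is_allowed(ACCESS_POLICY, "/C=US/O=SomeIdP/CN=Carol", "read"))
```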

    
.. _SystemMetadata: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/Types.html#Types.SystemMetadata

.. _AccessPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/Types.html#Types.AccessPolicy

.. _MNStorage.create: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/MN_APIs.html#MNStorage.create

.. _MNStorage.update: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/MN_APIs.html#MNStorage.update

.. _CNAuthorization.setAccessPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/CN_APIs.html#CNAuthorization.setAccessPolicy

    
Configuration as a replication target
-------------------------------------
DataONE is designed to enable a robust preservation environment through replication
of digital objects at multiple Member Nodes.  Any Member Node in DataONE that implements
the Tier 4 Service interface can offer to act as a target for object replication.
Currently, Metacat configuration supports turning this replication function on or off.
When the 'Act as a replication target' checkbox is checked, Metacat will notify
the Coordinating Nodes in DataONE that it is available to house replicas of objects
from other nodes.  Shortly thereafter, the Coordinating Nodes may notify Metacat to
replicate objects from throughout the system, which it will start to do.  These objects
will begin to be listed in the Metacat catalog.

    
.. Note::

  Future versions of Metacat will allow finer specification of the Node
  Replication Policy, which determines the set of objects
  that it is willing to replicate, using constraints on object size, total objects,
  source nodes, and object format types.

    
Object Replication Policies
---------------------------
In addition to access control, each object also can have a ``ReplicationPolicy``
associated with it that determines whether DataONE should attempt to replicate the
object for failover and backup purposes to other Member Nodes in the federation.
Both the ``RightsHolder`` and ``AuthoritativeMemberNode`` for an object can set the
``ReplicationPolicy``, which consists of fields that describe how many replicas
should be maintained, and any nodes that are preferred for housing those replicas, or
that should be blocked from housing replicas.

    
These ReplicationPolicies can be embedded inside of the ``SystemMetadata`` that accompanies
submission of an object through the `MNStorage.create`_ and `MNStorage.update`_ services,
or can be set using the `CNReplication.setReplicationPolicy`_ service.
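
As an illustration of what such a policy might look like, the Python sketch below builds a ``replicationPolicy`` fragment from a replica count and two comma-separated node lists, mirroring the default fields in the configuration screen. The element and attribute names (``replicationAllowed``, ``numberReplicas``, ``preferredMemberNode``, ``blockedMemberNode``) are assumptions based on the DataONE type documentation, the node identifiers are made up, and both helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_node_list(value):
    """Split a comma-separated list of node identifiers, dropping blanks."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_replication_policy(number_replicas, preferred="", blocked=""):
    """Build a replicationPolicy element (names assumed, see above).

    A replica count of 0 or less means no replicas should be stored.
    """
    policy = ET.Element("replicationPolicy")
    policy.set("replicationAllowed", "true" if number_replicas > 0 else "false")
    policy.set("numberReplicas", str(max(number_replicas, 0)))
    for node in parse_node_list(preferred):
        ET.SubElement(policy, "preferredMemberNode").text = node
    for node in parse_node_list(blocked):
        ET.SubElement(policy, "blockedMemberNode").text = node
    return policy


# Made-up node identifiers, for illustration only.
policy = build_replication_policy(
    2,
    preferred="urn:node:EXAMPLE_A, urn:node:EXAMPLE_B",
    blocked="urn:node:EXAMPLE_C",
)
print(ET.tostring(policy, encoding="unicode"))
```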

    
.. _CNReplication.setReplicationPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/CN_APIs.html#CNReplication.setReplicationPolicy

    
Generating DataONE System Metadata
----------------------------------
When a Metacat instance becomes a Member Node, System Metadata must be generated for the existing content.
This can be invoked in the Replication configuration screen of the Metacat administration interface. Initially,
Metacat instances will only need to generate System Metadata for their local content (the ``localhost`` entry).
In cases where Metacat has participated in replication with other Metacat servers, it may be useful to generate System Metadata
for those replica records as well. Please consult both the replication partner's administrator and the DataONE administrators before
generating System Metadata for replica content.

    
.. figure:: images/screenshots/screen-dataone-replication.png
   :align: center

   The replication configuration screen for generating System Metadata.

    
Apache configuration details
----------------------------
These Apache directives are crucial for Metacat to function as a Tier 2+ Member Node:

    
::

  ...
  AllowEncodedSlashes On
  AcceptPathInfo      On
  JkOptions +ForwardURICompatUnparsed
  SSLEngine on
  SSLOptions +StrictRequire +StdEnvVars +ExportCertData
  SSLVerifyClient optional
  SSLVerifyDepth 10
  SSLCertificateFile /etc/ssl/certs/<your_server_certificate>
  SSLCertificateKeyFile /etc/ssl/private/<your_server_key>
  SSLCACertificatePath /etc/ssl/certs/
  ...
  
Where ``<your_server_certificate>`` and ``<your_server_key>`` are the certificate/key pair used by Apache
to identify the server to clients. The DataONE Certificate Authority certificate, available from the DataONE administrators,
will also need to be added to the directory specified by ``SSLCACertificatePath``
in order to validate client certificates signed by that authority. DataONE has also provided a CA chain file that may be used in lieu of directory-based CA
configuration. The ``SSLCACertificateFile`` directive should be used when configuring your member node with the DataONE CA chain.
When these changes have been applied, refresh the certificate hash links and restart Apache:

    
::

  cd /etc/ssl/certs
  sudo c_rehash
  sudo /etc/init.d/apache2 restart

    
Configure Tomcat to allow DataONE identifiers
----------------------------------------------
Edit ``/etc/tomcat/catalina.properties`` to include:

    
::

  org.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true
  org.apache.catalina.connector.CoyoteAdapter.ALLOW_BACKSLASH=true