DataONE Member Node Support
===========================
DataONE_ is a federation of data repositories that aims to improve
interoperability among data repository software systems and advance the
preservation of scientific data for future use.
Metacat deployments can be configured to participate in DataONE_. This
chapter describes the DataONE_ data federation, its architecture, and
how Metacat can participate as a node in the DataONE system.

.. _DataONE: http://dataone.org/

What is DataONE?
----------------
The DataONE_ project is a collaboration among scientists, technologists, librarians,
and social scientists to build a robust, interoperable, and sustainable system for
preserving and accessing Earth observational data at national and global scales.
Supported by the U.S. National Science Foundation, DataONE partners focus on
technological, financial, and organizational sustainability approaches to
building a distributed network of data repositories that are fully interoperable,
even when those repositories use divergent underlying software and support different
data and metadata content standards. DataONE defines a common web service
programming interface that allows the main software components of the DataONE system
to communicate seamlessly. The components of the DataONE system include:

* DataONE Service Interface
* Member Nodes
* Coordinating Nodes
* Investigator Toolkit

Metacat implements the services needed to operate as a DataONE Member Node,
as described below. The service interface then allows many different scientific
software tools for data management, analysis, visualization, and other parts of
the scientific lifecycle to communicate directly with Metacat without being
further specialized beyond the support needed for DataONE. This streamlines the
process of writing scientific software both for servers and client tools.

The DataONE Service Interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
DataONE achieves interoperability by defining a lightweight but powerful set of
REST_ web services that can be implemented by various data management software
systems to allow those systems to communicate effectively with one another and
to exchange data, metadata, and other scientific objects. This `DataONE Service Interface`_
is an open standard that defines the communication protocols and technical
expectations for software components that wish to participate in the DataONE
federation. This service interface is divided into `four distinct tiers`_, with the
intention that any given software system may implement only those tiers that are
relevant to its repository; for example, a data aggregator might only implement
the Tier 1 interfaces that provide anonymous access to public data sets, while
a complete data management system like Metacat can implement all four tiers:

1. **Tier 1:** Read-only, anonymous data access
2. **Tier 2:** Read-only, with authentication and access control
3. **Tier 3:** Full write access
4. **Tier 4:** Replication target services

.. _REST: http://en.wikipedia.org/wiki/Representational_state_transfer

.. _DataONE Service Interface: http://releases.dataone.org/online/d1-architecture-1.0.0

.. _four distinct tiers: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/index.html

Member Nodes
~~~~~~~~~~~~
In DataONE, Member Nodes are the core of the network: they represent
particular scientific communities, manage and preserve their data and metadata, and
provide tools to their community for contributing, managing, and accessing data.
DataONE provides a standard way for these individual repositories to interact, and helps
to coordinate among the Member Nodes in the federation. This allows Member Nodes
to provide services to each other, such as replication of data for backup and failover.
To be a Member Node, a repository must implement the Member Node service interface
and then register with DataONE. Metacat provides this implementation automatically,
and provides an easy configuration option to register a Metacat instance as a
DataONE Member Node (see the configuration section below). If you are deploying a Metacat
instance, it is relatively simple to become a Member Node, but keep in mind that
DataONE is aiming for longevity and preservation, and so is selecting for nodes
that have long-term data preservation as part of their mission.

Coordinating Nodes
~~~~~~~~~~~~~~~~~~
The DataONE Coordinating Nodes provide a set of services to Member Nodes that
allow Member Nodes to easily interact with one another and to provide a unified
view of the whole DataONE federation. The main services provided by Coordinating
Nodes are:

* Global search index for all metadata and web portal for data discovery
* Resolution service to map unique identifiers to the Member Nodes that hold data
* Authentication against a shared set of accounts based on CILogon_ and InCommon_
* Replication management services to reliably replicate data according to
  policies set by the Member Nodes
* Fixity checking to ensure that preserved objects remain valid
* Member Node registration and management
* Aggregated logging for data access across the whole federation

Three geographically distributed Coordinating Nodes replicate these coordinating
services at UC Santa Barbara, the University of New Mexico, and the Oak Ridge Campus.
Coordinating Nodes are set up in a fully redundant manner, such that any of the
Coordinating Nodes can be offline and the others will continue to provide the services
without interruption. DataONE exposes these services at:

And the DataONE search portal is available at:

.. _CILogon: http://www.cilogon.org

.. _InCommon: http://incommon.org

Investigator Toolkit
~~~~~~~~~~~~~~~~~~~~
In order to provide scientists with convenient access to the data and metadata in
DataONE, the Investigator Toolkit is a library of software tools that have been
adapted to work with DataONE via the service interface and can be used to
discover, manage, analyze, and visualize data in DataONE. For example, DataONE
plans to release metadata editors (e.g., Morpho), data search tools (e.g., Mercury),
data access tools (e.g., ONEDrive), and data analysis tools (e.g., R) that all
know how to interact with DataONE Member Nodes and Coordinating Nodes. Consequently,
scientists will be able to access data from any DataONE Member Node, such as a Metacat
node, directly from within the R environment. In addition, software tools that
are written to work with one Member Node should also work with others, thereby
greatly increasing the efficiency of creating an entire toolkit of software that
is useful to investigators.

Because DataONE services are REST web services, software written in any
programming language can be adapted to interact with DataONE.
In addition, to ease the process of adapting tools to work with DataONE, libraries
are provided for common programming languages such as Java (d1-libclient-java)
and Python (d1_libclient-python) that allow simple function calls
to be used to access any DataONE service.

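As a rough illustration of how little is needed to call the REST interface
directly, the sketch below builds request URLs for a few Member Node read
services using only the Python standard library. The base URL and identifier
are hypothetical, and the resource paths (``object``, ``meta``, and
``monitor/ping``) follow the version 1 Member Node API as we understand it;
check them against the DataONE API documentation before relying on them.

.. code-block:: python

    from urllib.parse import quote

    # Hypothetical Member Node base URL; a real Metacat deployment will differ.
    BASE_URL = "https://metacat.example.org/metacat/d1/mn/v1"


    def service_url(base_url, resource, pid=None):
        """Build the URL for a Member Node REST resource.

        Identifiers may contain slashes and other reserved characters, so the
        identifier is percent-encoded as a single path segment (safe="").
        """
        url = base_url.rstrip("/") + "/" + resource.strip("/")
        if pid is not None:
            url += "/" + quote(pid, safe="")
        return url


    # Assumed v1 resource paths; the identifier is made up for illustration.
    ping_url = service_url(BASE_URL, "monitor/ping")
    object_url = service_url(BASE_URL, "object", "doi:10.1234/EXAMPLE")
    sysmeta_url = service_url(BASE_URL, "meta", "doi:10.1234/EXAMPLE")

The resulting URLs can be passed to any HTTP client, such as
``urllib.request.urlopen``.
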
Configuring Metacat as a Member Node
------------------------------------
Configuring Metacat as a DataONE Member Node is accomplished with the standard
Metacat Administrative configuration utility. To access the utility, visit the
following URL:

where ``<yourhost.org>`` represents the hostname of your web server running Metacat,
and ``<context>`` is the name of the web context in which Metacat was installed.
Once at the administrative utility, click on the DataONE configuration link, which
should show the following screen:

.. figure:: images/screenshots/screen-dataone-config.png
   :align: center

   The configuration screen for configuring Metacat as a DataONE node.

To configure Metacat as a node in the DataONE network, configure the properties shown
in the figure above. The Node Name should be a short name for the node that can
be used in user interface displays that list the node. For example, one node in
DataONE is the 'Knowledge Network for Biocomplexity'. Also provide a brief description,
a sentence or two long, of the node's intended scope and purpose.

The Node Identifier field is a unique identifier assigned by DataONE to identify
this node even when the node changes physical locations over time. Metacat
registers with the DataONE Coordinating Nodes when you click 'Register' at the
bottom of this form. **It is critical that you not change the Node Identifier
after registration**, as that will break the connection with the
DataONE network. Changing this field should only happen in the rare case
in which a new Metacat instance is being established to act as the provider for an
existing DataONE Member Node, in which case the field can be edited to set it to
the value of a valid, existing Node Identifier.

The Node Subject and Node Certificate Path are linked fields that are critical for
proper operation of the node. To act as a Member Node in DataONE, you must obtain
an X.509 certificate that can be used to identify this node and allow it to securely
communicate using SSL with other nodes and client applications. This certificate can
be obtained from the DataONE Certificate Authority.
Once you have the certificate in hand, use a tool such
as ``openssl`` to determine the exact subject distinguished name in the
certificate, and use that to set the Node Subject field. Set the Node
Certificate Path to the location on the system in which you stored the
certificate file. Be sure to protect the certificate file, as it contains the
private key that is used to authenticate this node within DataONE.

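One way to read the subject is to call ``openssl x509`` from a small script.
The following is a minimal sketch, assuming ``openssl`` is on the ``PATH`` and
the certificate is PEM encoded. The helper names are our own, and the name
format that DataONE expects in the Node Subject field should be confirmed with
the DataONE administrators.

.. code-block:: python

    import subprocess


    def subject_command(cert_path):
        """Return the openssl command that prints a certificate's subject."""
        # -noout suppresses the encoded certificate; -nameopt RFC2253 prints
        # the distinguished name in RFC 2253 form.
        return ["openssl", "x509", "-in", cert_path,
                "-noout", "-subject", "-nameopt", "RFC2253"]


    def read_subject(cert_path):
        """Run openssl and return the subject line it prints."""
        result = subprocess.run(subject_command(cert_path), check=True,
                                capture_output=True, text=True)
        return result.stdout.strip()

For example, ``read_subject("/var/metacat/certs/node.pem")`` (a hypothetical
path) returns a line beginning with ``subject=``.
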
.. Note::

   For Tier 2 deployments and above, the Metacat Member Node must have Apache configured to request
   client certificates. Detailed instructions are included at the end of this chapter.

The ``Enable DataONE Services`` checkbox lets the administrator decide whether to
turn on synchronization with the DataONE network. When this box is unchecked, the
DataONE Coordinating Nodes will not attempt to synchronize at all; when it is checked,
DataONE periodically contacts the node to synchronize all metadata content.
To be part of the DataONE network, this box must be checked, as that allows
DataONE to receive a copy of the metadata associated with each object in the Metacat
system. The switch is provided for those rare cases when a node needs to be disconnected
from DataONE for maintenance or service outages. When the box is checked, DataONE
contacts the node using the schedule provided in the ``Synchronization Schedule``
fields. The example in the dialog above has synchronization occurring once every three
minutes, at the 10 second mark of those minutes. The syntax for these schedules
follows the Quartz Crontab Entry syntax, which provides for many flexible schedule
configurations. If the administrator desires a less frequent schedule, such as daily,
that can be configured by changing the ``*`` in the ``Hours`` field to a concrete
hour (such as ``11``) and the ``Minutes`` field to a concrete value like ``15``,
which would change the schedule to synchronize at 11:15 am daily.

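To make the field layout concrete, the sketch below assembles the two schedules
mentioned above into Quartz cron expressions. It assumes the standard Quartz
field order (seconds, minutes, hours, day of month, month, day of week) and that
the every-three-minutes schedule is written as ``0/3`` in the minutes field; the
exact values shown in the configuration form may differ.

.. code-block:: python

    def quartz_expression(seconds="0", minutes="*", hours="*",
                          day_of_month="*", month="*", day_of_week="?"):
        """Join Quartz cron fields into a single expression string.

        Quartz requires "?" in either day-of-month or day-of-week, so
        day-of-week defaults to "?" here.
        """
        fields = [seconds, minutes, hours, day_of_month, month, day_of_week]
        return " ".join(str(field) for field in fields)


    # Every third minute, at the 10 second mark (assumed "0/3" notation).
    every_three_minutes = quartz_expression(seconds=10, minutes="0/3")

    # Once a day at 11:15 am, still at the 10 second mark.
    daily = quartz_expression(seconds=10, minutes=15, hours=11)

Keeping the seconds field at a fixed value other than ``*`` matters: a ``*`` in
the seconds field would trigger the schedule every second of each matching minute.
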
The Replication section is used to configure replication options for the node
overall and for objects stored in Metacat. The ``Accept and Store Replicas``
checkbox is used to indicate that the administrator of this node is willing to allow
replica data and metadata from other Member Nodes to be stored on this node. We
encourage people to allow replication to their nodes, as this increases the
scalability and flexibility of the network overall. The three "Default" fields set
the default values for the replication policies for data and metadata on this node
that are generated when System Metadata is not available for an object (such as when
it originates from a client that is not DataONE compliant). The ``Default Number of
Replicas`` determines how many replica copies of the object should be stored on
other Member Nodes. A value of 0 or less indicates that no replicas should be
stored. In addition, you can specify a list of nodes that are either preferred for
use when choosing replica nodes, or that are blocked from use as replica nodes.
This allows Member Nodes to set up bidirectional agreements with partner nodes to
replicate data across their sites. The value of each of ``Default Preferred Nodes``
and ``Default Blocked Nodes`` is a comma-separated list of the NodeReference identifiers
that were assigned to the target nodes by DataONE.

Once these parameters have been properly set, use the ``Register`` button to
request registration with the DataONE Coordinating Node. This will generate a
registration document describing this Metacat instance and send it to the
Coordinating Node registration service. At that point, all that remains is to wait for
the DataONE administrators to approve the node registration. Details of the approval
process can be found on the `DataONE web site`_.

.. _DataONE web site: http://www.dataone.org

Access Control Policies
-----------------------
Metacat has supported fine-grained access control for objects in the system since
its inception. DataONE has devised a simple but effective access control system
that is compatible with the prior system in Metacat. For each object in the DataONE
system (including data objects, scientific metadata objects, and resource maps),
a SystemMetadata_ document describes the critical metadata needed to manage that
object in the system. This metadata includes a ``RightsHolder`` field and an
``AuthoritativeMemberNode`` field that are used to list the people and node that
have ultimate control over the disposition of the object. In addition, a separate
AccessPolicy_ can be included in the ``SystemMetadata`` for the object. This ``AccessPolicy``
consists of a set of rules that grant additional permissions to other people,
groups, and systems in DataONE. For example, for one data file, two users
(Alice and Bob) may be able to make changes to the object, and the general public may
be allowed to read the object. In the absence of explicit rules extending these permissions,
Metacat enforces the rule that only the ``RightsHolder`` and ``AuthoritativeMemberNode`` have
rights to the object, and that the Coordinating Node can manage ``SystemMetadata``
for the object. An example AccessPolicy that might be submitted with a dataset
(giving Alice and Bob permission to read and write the object) follows:

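As a rough sketch of what such a policy can look like, the following Python
builds an ``accessPolicy`` element that lets two subjects read and write an
object. The element names follow the DataONE ``AccessPolicy`` type as we
understand it; the function name and subject strings are invented for
illustration, and namespace handling is omitted.

.. code-block:: python

    import xml.etree.ElementTree as ET


    def build_access_policy(subjects, permissions):
        """Build an accessPolicy element with a single allow rule."""
        policy = ET.Element("accessPolicy")
        allow = ET.SubElement(policy, "allow")
        for subject in subjects:
            ET.SubElement(allow, "subject").text = subject
        for permission in permissions:
            ET.SubElement(allow, "permission").text = permission
        return policy


    # Hypothetical subjects standing in for Alice and Bob.
    policy = build_access_policy(
        ["uid=alice,o=EXAMPLE,dc=example,dc=org",
         "uid=bob,o=EXAMPLE,dc=example,dc=org"],
        ["read", "write"],
    )
    policy_xml = ET.tostring(policy, encoding="unicode")

Each ``allow`` rule pairs a set of subjects with a set of permissions, so
granting the general public read access would be a second ``allow`` rule
rather than an extra subject in this one.
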
These AccessPolicies can be embedded inside of the ``SystemMetadata`` that accompany
submission of an object through the `MNStorage.create`_ and `MNStorage.update`_ services,
or can be set using the `CNAuthorization.setAccessPolicy`_ service.

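The default rule described in this section can be sketched in a few lines. This
is only an illustration of the logic, not Metacat's implementation: the rule
structure is simplified, the special ``public`` subject is assumed to stand for
the general public, and the privileges of the authoritative Member Node and the
Coordinating Node are left out.

.. code-block:: python

    def is_allowed(subject, permission, rights_holder, rules):
        """Return True if subject holds permission on an object.

        rules is a list of (subjects, permissions) pairs, one per allow rule.
        """
        # The rights holder always has full control of the object.
        if subject == rights_holder:
            return True
        for rule_subjects, rule_permissions in rules:
            applies = subject in rule_subjects or "public" in rule_subjects
            if applies and permission in rule_permissions:
                return True
        # No rule grants the permission, so access is denied.
        return False


    # Hypothetical policy: Alice and Bob may read and write, anyone may read.
    rules = [
        ({"alice", "bob"}, {"read", "write"}),
        ({"public"}, {"read"}),
    ]

With these rules, ``is_allowed("carol", "read", "dave", rules)`` is true because
of the public rule, while the same call with ``"write"`` is false.
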
.. _SystemMetadata: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/Types.html#Types.SystemMetadata

.. _AccessPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/Types.html#Types.AccessPolicy

.. _MNStorage.create: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/MN_APIs.html#MNStorage.create

.. _MNStorage.update: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/MN_APIs.html#MNStorage.update

.. _CNAuthorization.setAccessPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/CN_APIs.html#CNAuthorization.setAccessPolicy

Configuration as a replication target
-------------------------------------
DataONE is designed to enable a robust preservation environment through replication
of digital objects at multiple Member Nodes. Any Member Node in DataONE that implements
the Tier 4 service interface can offer to act as a target for object replication.
Currently, Metacat configuration supports turning this replication function on or off.
When the 'Act as a replication target' checkbox is checked, Metacat will notify
the Coordinating Nodes in DataONE that it is available to house replicas of objects
from other nodes. Shortly thereafter, the Coordinating Nodes may notify Metacat to
replicate objects from throughout the system, which it will start to do. These objects
will begin to be listed in the Metacat catalog.

.. Note::

   Future versions of Metacat will allow finer specification of the Node
   Replication Policy, which determines the set of objects
   that it is willing to replicate, using constraints on object size, total objects,
   source nodes, and object format types.

Object Replication Policies
---------------------------
In addition to access control, each object also can have a ``ReplicationPolicy``
associated with it that determines whether DataONE should attempt to replicate the
object for failover and backup purposes to other Member Nodes in the federation.
Both the ``RightsHolder`` and ``AuthoritativeMemberNode`` for an object can set the
``ReplicationPolicy``, which consists of fields that describe how many replicas
should be maintained, and any nodes that are preferred for housing those replicas, or
that should be blocked from housing replicas.

These ReplicationPolicies can be embedded inside of the ``SystemMetadata`` that accompany
submission of an object through the `MNStorage.create`_ and `MNStorage.update`_ services,
or can be set using the `CNReplication.setReplicationPolicy`_ service.

.. _CNReplication.setReplicationPolicy: http://releases.dataone.org/online/d1-architecture-1.0.0/apis/CN_APIs.html#CNReplication.setReplicationPolicy

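A minimal sketch of such a policy, built with the Python standard library, is
shown below. The attribute and element names follow the DataONE
``ReplicationPolicy`` type as we understand it, the function name and node
identifiers are hypothetical, and namespace handling is omitted.

.. code-block:: python

    import xml.etree.ElementTree as ET


    def build_replication_policy(number_replicas, preferred=(), blocked=()):
        """Build a replicationPolicy element from simple Python values."""
        policy = ET.Element("replicationPolicy", {
            # Replication is treated as disallowed when no replicas are wanted.
            "replicationAllowed": "true" if number_replicas > 0 else "false",
            "numberReplicas": str(max(number_replicas, 0)),
        })
        for node in preferred:
            ET.SubElement(policy, "preferredMemberNode").text = node
        for node in blocked:
            ET.SubElement(policy, "blockedMemberNode").text = node
        return policy


    # Hypothetical node identifiers, for illustration only.
    policy = build_replication_policy(
        3,
        preferred=["urn:node:EXAMPLE_A"],
        blocked=["urn:node:EXAMPLE_B"],
    )
    policy_xml = ET.tostring(policy, encoding="unicode")

The three inputs mirror the fields described above: the number of replicas to
maintain, the preferred nodes, and the blocked nodes.
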
Generating DataONE System Metadata
----------------------------------
When a Metacat instance becomes a Member Node, System Metadata must be generated for the existing content.
This can be invoked in the Replication configuration screen of the Metacat administration interface. Initially,
Metacat instances will only need to generate System Metadata for their local content (the ``localhost`` entry).
In cases where Metacat has participated in replication with other Metacat servers, it may be useful to generate System Metadata
for those replica records as well. Please consult both the replication partner's administrator and the DataONE administrators before
generating System Metadata for replica content.

.. figure:: images/screenshots/screen-dataone-replication.png
   :align: center

   The replication configuration screen for generating System Metadata.

Apache configuration details
----------------------------
These Apache directives are crucial for Metacat to function as a Tier 2+ Member Node::

    # Pass encoded slashes and extra path information through to Metacat
    AllowEncodedSlashes On
    AcceptPathInfo On
    JkOptions +ForwardURICompatUnparsed
    # Enable SSL and request, but do not require, client certificates
    SSLEngine on
    SSLOptions +StrictRequire +StdEnvVars +ExportCertData
    SSLVerifyClient optional
    SSLVerifyDepth 10
    SSLCertificateFile /etc/ssl/certs/<your_server_certificate>
    SSLCertificateKeyFile /etc/ssl/private/<your_server_key>
    # Directory of trusted CA certificates used to validate client certificates
    SSLCACertificatePath /etc/ssl/certs/

Here, ``<your_server_certificate>`` and ``<your_server_key>`` are the certificate/key pair used by Apache
to identify the server to clients. The DataONE Certificate Authority certificate, available from the DataONE administrators,
will also need to be added to the directory specified by ``SSLCACertificatePath``
in order to validate client certificates signed by that authority. DataONE has also provided a CA chain file that may be used in lieu of directory-based CA
configuration. The ``SSLCACertificateFile`` directive should be used when configuring your Member Node with the DataONE CA chain.
When these changes have been applied, rehash the certificate directory and restart Apache::

    cd /etc/ssl/certs
    sudo c_rehash
    sudo /etc/init.d/apache2 restart

Configure Tomcat to allow DataONE identifiers
---------------------------------------------
Edit ``/etc/tomcat/catalina.properties`` to include:
