After reviewing CNodeService and D1NodeService prompted by Robert comparing the Hazelcast locking with the d1_synchronization locking, I've made a number of changes that will prevent locking problems:
1) Multiple methods contained try/catch blocks that would:...
only attempt to unlock a lock if it was created (in the finally block)
new jars with many changes -- including new CN methods: ping, describe, listChecksumAlgorithm. Removed MN.setAccessPolicy. Refactored CN.setOwner() to CN.setRightsHolder().
Change setReplicationStatus() to drop serialVersion and report the failure exception message in the CN log.
updated D1 API -- removed Permission.REPLICATE and associated parameters
If a member node cannot be found in the node list matching the targetNodeSubject given in isNodeAuthorized(), throw a ServiceFailure exception.
Add log statements for each call to ILock.unlock() for debugging.
When using ILock.lock(), get a lock on the string value of the Identifier, not the Identifier object itself. Hazelcast locking won't work otherwise.
Use the Hazelcast ILock mechanism to lock the system metadata identifier rather than using IMap.lock(pid).
when comparing D1 Subject objects, use the equals() method not direct string comparisonhttps://redmine.dataone.org/issues/2050
access nodeList list correctlyhttps://redmine.dataone.org/issues/2049
Use Subject.equals() when comparing DNs rather than CertificateManager.equalsDN(). Don't lock the pid in isNodeAuthorized() to debug for timeout issues. Minor debugging changes.
Minor logging for isNodeAuthorized(), and compare subjects properly. Change this to Subject.compareTo() when it is vetted.
Catch RuntimeExceptions thrown by Hazelcast as opposed to general Exceptions to we don't catch exceptions we're trying to throw.
generalize exception handling -- add cause detail
Changes to setReplicationStatus and isNodeAuthorized(), working out minor bugs in replication.
include exception cause when throwing new exception (combine RuntimeException in Exception handling -- they are almst identical)
Calls to setReplicationStatus() can only be made by a CN or the MN that is the target replica node. Implement this service restriction in CNodeService using CertificateManager's equalsDN() method.
Added stack trace debugging for CNodeService.isNodeAuthorized() for tracking down replication issues.
Fix cast to List<Node> in isNodeAuthorized().
upgrade to 1.0.1-SNAPSHOT DataONE jars
Update CNodeService to use the serialVersion parameter and compare it to the current serialVersion of the system metadata found in the hzSystemMetadata map. Throw an InvalidRequest exception if they are not equal. This affects updateReplicationMetadata(), setReplicationStatus(), setReplicationPolicy(), setAccessPolicy(), and setOwner().
Add updateReplicationMetadata() to the CN service implementation. This was missing from the API, and likely never called. It fully replaces the given replica item in the list of replicas in system metadata.
Minor indentation cleanup.
Add setAccessPolicy() to CNodeService since the CN should only make changes to access policies for objects registered with the D1 system. Increment the serial version after locling and getting the most up to fdate system metadata. Note: CCIT meeting decision says the serial version of the system metadata (during the change) should equal the current serial version, but setAccessPolicy() does not pass in the entire system metadata object, so there's no way to check. For now, increment the latest system metadata from the hzSystemMetadata map.
In CNodeService, separate the CN.create() functionality from the MN.create() functionality while still using the superclass to call create(). Deal with Hazelcast locks and setting serial versions only in the CN implementation.
Change updateSystemMetadata() to evaluate the incoming system metadata serial version against that found in the hzSystemMetadata map. If they are the same, do the update. If not, throw an InvalidRequest explaining that they need the most current version.
Modify CNodeService's registerSystemMetadata() with support for SystemMetadata's serialVersion field. Also, use the hzSystemMetadata map for all system metadata reads using a lock on the pid in order to get the very latest version. This affected isNodeAuthorized(), getChecksum(), and assertRelation(). Since we're using Hazelcast, exceptions are masked as RuntimeException, so throw a ServiceFailure with the underlying message.
Modify CNodeService's updateSystemMetadata(), setReplicationStatus(), setReplicationPolicy(), and setOwner() with support for SystemMetadata's serialVersion field. Other methods still pending an update. Use the hzSystemMetadata map for all system metadata reads using a lock on the pid in order to get the very latest version.
add User-Agent logging to support D1 requirements
log errors on create() and registerSM
Don't use the hzNodes map yet (as a hazelcast client). Use D1Client instead to get the node list in isNodeAuthorized().
Reverting previous @Overrides chanrge from r6470, as that is the desiredbehavior under Java 1.6 -- previous versions of Java (e.g., 1.5) will notcomile with this usage of the @Overrides annotation, but the currentlysupported version will. So reverting to the 1.6 convention.
Removing incorrect @Override annotations that were preventing compilation. The methods marked did not actually override a method in the superclass, so they were not compiling. I think @Overrides was being mistaken for methods that implement an interface but aren't actually in the superclass.
catch runtime exceptions that arise from hazelcast storage errors in the system metadata map
Lock the system metadata entry in hzSystemMetadata when calling setReplicationPolicy().
Lock the system metadata entry in hzSystemMetadata when calling registerSystemMetadata().
Remove references to CNReplicationTask.
Change isNodeAuthorized() to query the hzSystemMetadata map rather than the hzPendingreplicationTasks map. The latter isn't needed for authorization since the ReplicationStatus for each Replica in SystemMetadata lists the status of the replica and can be queried.
move CNReplicationTask to the hazelcast package
only "save" to the shared system metadata map - not directly to the table store.
rely on Hazelcast to store the SystemMetadata locally for the node. Entry event listeners store the shared system metadata on their local node when alerted. TODO: remove old replication code that included system metadata xml when replicating scimeta and data
move bulk of the Hazelcast code into HazelcastService from CNodeService so that it is centrall located - easier to manage and configure
initialize Hazelcast from the custom configuration when initializing the Metacat service.
handle entryAdded and entryUpdated the same - update the entry if it exists, otherwise create it
handle entryAdded (to hzSystemMetadata) to store sysmeta to local store when it is not already present
add code to handle new entry when it is not on the local member of the sysmeta cluster
comment out processCluster connections that use hzClient until that is finalized
use pending replication task queue to check if node is authorized for replication. moved from old ReplicationService code
Add stub methods in CNodeService that implement the Hazelcast EntryListener interface: entryAdded(), entryRemoved(), entryUpdated(), and entryEvicted(). Add a listener to the hzSystemMetadata map so the CNodeService can respond to those events and create appropriate CNReplicationTask objects for distributed execution across the CN cluster. Again, stubs only so far.
Minor cleanup - tabs to spaces.
Enable CNodeService to access 1) the hzNodes map defined in the DataONE process cluster by becoming a Hazelcast client (hzClient) to that cluster and 2) the hzSystemMetadata map defined in the DataONE storage cluster by becoming a member to that cluster (using direct Hazelcast calls). Added fields for maintaining the DataONE cluster properties.
changes for schema update (d1_common)
Update classes to use the DataONE 0.6.4 schema and types. Major changes involve using BigInteger vs long in SystemMetadata.size, and using ObjectFormatIdentifier rather than Object format.
latest D1 jars - changes include:updateSystemMetadata() impl for CNnew identifier methods (generate is its own method)removal of the resourceMap pointer from system metadata
use new "v1" types from DataONE
add hasReservation() method (NotImplemented, however)
use objectFormatIdentifier for listObjects()remove provisional system metadata indicator - Metacat will not implement reserveIdentifier()
allow very minimal system metadata for provisional entries (CN.reserveIdentifier)
remove resolve() test -- not implemented in Metacat
catch exceptions from system meta data query and throw service failure rather than swallowing them with an error msg
use super class' create() methoduse string comparison for assertRelation method
return all public objects for the search() method [for now]
add the old ecogrid query code (still commented out) from the old Rest handler
implement reserveIdentifier() and check whether the id is reserved when creating records (only allow the create when the Subject creating matches the Subject who reserved it -- currently stored in rightsHolder)
remove extraneous update() call when create() does the call for us
At Ben's suggestion, add metacatUrl to D1NodeService and make it available to subclasses. Set the metacatUrl in the constructor using SystemUtil rather than all roll your own PropertyService calls. More concise. Also, log the delete event in MNodeService.delete().
Implement [MN|CN]Storage.create() in D1NodeService. Since MetacatHandler requires an IP for event logging, we pass in the metacat URL (hold over from CrudService). To do this in the abstract D1NodeService, change the constructors to take metacatUrl as a parameter and get the URL from the metacat properties file in getInstance() of the subclasses. Needs testing.
include URL in resolve() method as well as placeholder for preference
Metacat does not implement CNRegister
-use every Subject in the session (alt Ids and Group membership)-consolidate to single isAuthorized method
throw exception for unimplemented methods
implement resolve() method
implement assertRelation
implement CNReplication.setReplicationStatus() but with a note about selecting which replica's status should be set (right now it is all)
implement CNReplication.setReplicationPolicy
correction: implementation is CN-specific
implement getChecksum() in the superclass
implement getChecksum (retrieves from system metadata)
use shared get() method from superclass
use shared getLogRecords method
implement CNAuthorization
Metacat does not implement CNIdentity - it is a stand-alone service
implement registerSystemMetadata
implement object format methods - using a separate class to do the actual metacat lookup/caching so that teh CN implementation looks cleaner
Initial check in of the MNodeService stub methods that implement the D1 MN* interfaces. CrudService methods will be transitioned into this class. The methods follow the D1 0.6.2 API thus far.
Also changed CNodeService to reflect minor changes to the D1NodeService class.
Add a static getInstance() method to CNodeService and make CNodeService a singleton.
Initial check in of the CNodeService stub methods that implement the D1 CN* interfaces. CNCoreImpl methods will be transitioned into this class. The methods follow the D1 0.6.2 API thus far.