Bug #7188
closed
MNodeService.replicate() is failing
0%
Description
Laura Moyers reported that she is seeing many failed replication attempts in the Coordinating Node index. In particular, KNB, GOA, UIC, ARCTIC, mnUCSB1, and mnORC1 are all affected, and are all running Metacat.
After looking at catalina.out on the MNs, we're seeing errors in MNodeService.replicate():
20170508-06:59:14: [ERROR]: Error computing checksum on replica: mark/reset not supported [edu.ucsb.nceas.metacat.dataone.MNodeService]
Here's the number of requests and failures per host:
host        requests   failures   failures_since
-----------------------------------------------------------
mn-orc-1    145        25         20170511-01:23:48
mn-ucsb-1   105        56         20170508-06:59:14
mn-unm-1    0          0          -
knb         71         28         20170509-16:57:53
uic         no log access
I'm pretty sure the failures represent 100% of the requests since the failures began, but we'd need to confirm this. Basically, MN replication looks to be entirely broken in Metacat.
The error reported above comes from line 866 of MNodeService.java, where the checksum of the bytes of the object from the source MN (to be replicated) is calculated. Once the checksum is calculated, we call object.reset() on the input stream so it can be read again when writing to disk. This is throwing the exception above.
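For reference, the checksum-then-rewind pattern described above looks roughly like the following. This is a minimal sketch, not the actual code at line 866: the class name ChecksumThenReset, the method checksumAndRewind, and the buffer size are made up for illustration, and only standard java.io and java.security calls are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not taken from MNodeService.java.
class ChecksumThenReset {

    // Reads the stream to the end to compute a hex digest, then rewinds it
    // so the same bytes can be read again (e.g. when writing to disk).
    static String checksumAndRewind(InputStream object, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        if (!object.markSupported()) {
            throw new IOException("stream cannot be rewound; buffer it first");
        }
        object.mark(Integer.MAX_VALUE); // remember the current position

        MessageDigest digest = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[8192];
        int n;
        while ((n = object.read(buffer)) != -1) {
            digest.update(buffer, 0, n);
        }

        object.reset(); // this is the call that fails in the log above

        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

With a plain ByteArrayInputStream this works as expected; the pattern depends entirely on reset() still being honoured after the stream has been read to the end.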
So what's changed? The last change regarding the InputStream was that Jing wrapped the calls in a try { } finally { } block to ensure the input stream gets closed after use, preventing memory leaks. This doesn't seem like an issue at all, although the finally { } block could have been added to the existing try { } block instead of creating three levels of try nesting. This seems inconsequential though.
The other change is that d1_libclient_java is now using the Apache Commons IO AutoCloseInputStream. Looking at the documentation there, it seems to delegate to the underlying input stream implementation. We know that not all input streams support the mark() method and therefore can't be reset(), which is why we call markSupported() before attempting to calculate the checksum. So why is markSupported() succeeding, but then reset() failing after the input stream has been read? It seems like we need to track this down in the interaction between MNodeService and MultipartMNode.getReplica().
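One possible explanation, not confirmed here: "mark/reset not supported" is the message thrown by the default java.io.InputStream.reset(). If a wrapper closes and discards its underlying stream on reaching end of input, then markSupported() asked before the read is answered by the real stream, while reset() asked after the read lands on a placeholder that only has the default behaviour. The sketch below reproduces that sequence with a hypothetical stand-in class (AutoClosingStream, written here for illustration; it is not the Commons IO class and its internals are an assumption).

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class MarkResetAfterAutoClose {

    // Hypothetical stand-in for an auto-closing wrapper (assumption, not Commons IO code):
    // on end of input it closes the delegate and swaps in an empty placeholder stream.
    static class AutoClosingStream extends FilterInputStream {
        AutoClosingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            int n = in.read(b, off, len);
            if (n == -1) {
                close();
            }
            return n;
        }

        @Override
        public void close() throws IOException {
            in.close();
            // The placeholder inherits InputStream's default mark()/reset().
            in = new InputStream() {
                @Override
                public int read() {
                    return -1;
                }
            };
        }
    }

    // Mirrors the sequence in the ticket: markSupported(), mark, read to EOF, reset.
    static String probe() throws IOException {
        InputStream object = new AutoClosingStream(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        boolean supported = object.markSupported(); // answered by ByteArrayInputStream
        object.mark(Integer.MAX_VALUE);
        byte[] buffer = new byte[8192];
        while (object.read(buffer) != -1) {
            // drain the stream, as the checksum calculation would
        }
        try {
            object.reset(); // now answered by the placeholder
            return supported + " / reset ok";
        } catch (IOException e) {
            return supported + " / " + e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(probe());
    }
}
```

If this is what is happening, checking markSupported() up front cannot protect the later reset(); the object would need to be buffered (or written to a temporary file) before the checksum is computed, or the checksum computed while the bytes are being written.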