Bug #7188

closed

MNodeService.replicate() is failing

Added by Chris Jones almost 7 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
metacat
Target version:
Start date:
05/11/2017
Due date:
% Done:

0%

Estimated time:
Bugzilla-Id:

Description

Laura Moyers reported that she is seeing many failed replication attempts in the Coordinating Node index. In particular, KNB, GOA, UIC, ARCTIC, mnUCSB1, and mnORC1 are all affected, and are all running Metacat.

After looking at catalina.out on the MNs, we're seeing errors in MNodeService.replicate():

20170508-06:59:14: [ERROR]: Error computing checksum on replica: mark/reset not supported [edu.ucsb.nceas.metacat.dataone.MNodeService]

Here are the request and failure counts per host:

host        requests        failures      failures_since
-----------------------------------------------------------
mn-orc-1    145             25            20170511-01:23:48
mn-ucsb-1   105             56            20170508-06:59:14
mn-unm-1    0               0             -
knb         71              28            20170509-16:57:53
uic         no log access

I'm pretty sure the failures represent 100% of the requests since the failures began, but we'd need to confirm this. Basically, MN replication looks to be entirely broken in Metacat.

The error reported above comes from line 866 of MNodeService.java, where the checksum of the bytes of the object from the source MN (to be replicated) is calculated. Once the checksum is calculated, we call object.reset() on the input stream so it can be read again when writing to disk. This is throwing the exception above.
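To make the failing step concrete, here is a minimal sketch of the checksum-then-reset pattern described above. It is not the Metacat code: the class and method names are hypothetical, MD5 is an arbitrary choice of algorithm, and the explicit mark() call is an assumption, since only markSupported() and reset() are mentioned in this report.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; not taken from MNodeService.java.
public class ChecksumThenReset {

    /** Reads the stream to EOF and returns the hex MD5 digest of its bytes. */
    static String md5Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Checksums the stream, then rewinds it so the same bytes can be read again. */
    static String checksumAndRewind(InputStream object)
            throws IOException, NoSuchAlgorithmException {
        if (!object.markSupported()) {
            throw new IOException("stream cannot be rewound");
        }
        object.mark(Integer.MAX_VALUE);   // assumption: remember the start position
        String checksum = md5Hex(object); // consumes the stream to EOF
        object.reset();                   // the step that fails in this report
        return checksum;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        InputStream object = new ByteArrayInputStream(data);
        System.out.println(checksumAndRewind(object));
        // After a successful reset, all bytes are available again.
        System.out.println(object.available());
    }
}
```

With a ByteArrayInputStream the rewind succeeds. For comparison, the default java.io.InputStream.reset() throws an IOException with the message "mark/reset not supported", which matches the text in the log line above.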

So what's changed? The most recent change involving the InputStream was Jing wrapping the calls in a try { } finally { } block to ensure the input stream is closed after use, preventing memory leaks. That doesn't look like the problem, although the finally { } block could have been attached to the existing try { } block instead of adding a third level of try nesting. That seems inconsequential, though.

The other change is that d1_libclient_java now uses the Apache Commons IO AutoCloseInputStream. According to its documentation, it seems to delegate to the underlying input stream implementation. We know that not all input streams support mark(), and those that don't can't be reset(), which is why we call markSupported() before attempting to calculate the checksum. So why is markSupported() succeeding while reset() fails after the input stream has been read? It seems we need to track this down in the interaction between MNodeService and MultipartMNode.getReplica().
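One way markSupported() could succeed while a later reset() fails is sketched below. This is a hypothesis to check, not a confirmed diagnosis: it assumes the wrapper closes its delegate and swaps in an empty stub once EOF is reached, which is how I understand AutoCloseInputStream to behave, but that needs to be verified against the Commons IO version in use. AutoClosingStream, ClosedStub, and probe() are hypothetical stand-ins built only on java.io.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; not the Commons IO or Metacat code.
public class MarkResetAfterAutoClose {

    /** Empty stub; inherits InputStream's default markSupported() and reset(). */
    static class ClosedStub extends InputStream {
        @Override
        public int read() {
            return -1;
        }
    }

    /** Wrapper that closes and replaces its delegate as soon as EOF is seen. */
    static class AutoClosingStream extends FilterInputStream {
        AutoClosingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            int n = in.read(b, off, len);
            if (n == -1) {
                in.close();
                in = new ClosedStub(); // mark/reset now delegate to the stub
            }
            return n;
        }
    }

    /** Returns "markSupportedBefore,markSupportedAfter,resetOutcome". */
    static String probe(InputStream object) throws IOException {
        boolean before = object.markSupported();
        object.mark(Integer.MAX_VALUE);
        byte[] buf = new byte[1024];
        while (object.read(buf, 0, buf.length) != -1) {
            // drain to EOF, as a checksum calculation would
        }
        boolean after = object.markSupported();
        String outcome;
        try {
            object.reset();
            outcome = "ok";
        } catch (IOException e) {
            outcome = e.getMessage();
        }
        return before + "," + after + "," + outcome;
    }

    public static void main(String[] args) throws IOException {
        InputStream plain = new ByteArrayInputStream(new byte[] {1, 2, 3});
        InputStream wrapped = new AutoClosingStream(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        System.out.println(probe(plain));
        System.out.println(probe(wrapped));
    }
}
```

In this sketch the plain stream rewinds normally, while the wrapped stream reports markSupported() as true before the read and false afterwards, and reset() throws. If the real wrapper behaves this way, checking markSupported() before computing the checksum cannot guard against the failure; buffering the bytes or re-fetching the object before writing to disk would be options to evaluate.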
