Project

General

Profile

Actions

Bug #3296

closed

Replication: Many EML documents fail to replicate

Added by Duane Costa about 16 years ago. Updated over 12 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
metacat
Target version:
Start date:
05/13/2008
Due date:
% Done:

0%

Estimated time:
Bugzilla-Id:
3296

Description

Over 1000 EML documents fail to replicate from LTER Metacat to KNB Metacat during every timed replication, currently scheduled once per week. Jing and I did some initial investigation of this issue last year. Jing suspects that the problem might be related to normalization (and/or lack of normalization) of character entities such as '<'. (See below.)

A secondary bug seems to be that each of the failed documents is sent twice, not once. (Also described below.) Thus, for each timed replication, LTER attempts to send over 2000 documents to KNB, KNB fails to write them into its Metacat, and the process is repeated again the following week. Because of the large number of documents sent during each timed replication, the process requires about eleven hours to complete.

We were previously scheduling timed replications once every 48 hours, but as an interim measure we reduced this to once every 7 days to reduce the load on both Metacat servers.

----
On Tue, 17 Jul 2007, Jing Tao wrote:

Here is something on the top of my head (if anybody know it is wrong, please correct me):

When you send xml document containing something like "<" in text part, xerces will automatically transform it to "<" during our metacat inserting process. So "<" will be stored in our db. When we try to read the xml file, metacat should transform the stored "<" to "<". Otherwise, the xml file will be not well-formed. The details of the transform (normalize) is in MetacatUtil.normalize().

The documents that are failing to be replicated are being accessed twice >every forty-eight hours. This is strange, because we expected them to be >accessed only once every forty-eight hours. Here's an example:

grep knb-lter-and.3190.4 metacatreplication.log

07-07-09 01:06:01 :: document knb-lter-and.3190.4 sent
07-07-09 06:29:32 :: document knb-lter-and.3190.4 sent
07-07-11 01:06:39 :: document knb-lter-and.3190.4 sent
07-07-11 06:25:31 :: document knb-lter-and.3190.4 sent
07-07-13 01:07:34 :: document knb-lter-and.3190.4 sent
07-07-13 07:03:49 :: document knb-lter-and.3190.4 sent
07-07-15 01:09:06 :: document knb-lter-and.3190.4 sent
07-07-15 07:34:07 :: document knb-lter-and.3190.4 sent

You can see how LNO attempts to send the same document twice, not once, about six hours apart every two days. Why is this happening, shouldn't it only be attempted once every forty-eight hours?

I took a look at the replication log files and found the time interval is 48 hours, but somehow the replication request are called twice in one replication! It is a bug.

Jing Tao wrote:

Hi, Duane:

Thank you for the info. I took a look at the error log in knb site. Yeah, knb rejected lots of documents from LNO. From the error message, it seems those documents have some problem in format. Since LNO can accecept them, I don't think there is any problem in oringin docs. The most possibile problme, which I am guessing, is the normalization. Xerces parsor automatically normailize some stuff. However, metacat failed to denormalize the doc when we try to read it. How do you think? I will take a look soon. The attached files are replication error and log in knb.

Thanks,

Jing

Hi Jing,

I've attached the LNO replication log files from the past few days. It looks like LNO is replicating many files to KNB with every timed replication, but the KNB server is not accepting them. Also, replication attempt seems to take long, about four or five minutes per document. I think maybe the timed replications were beginning to take so long that they even started to overlap, i.e. a new timed replication would begin even though the previous one from forty-eight hours ago hadn't completed yet. When you get a chance could you please take a look at the KNB logs to match this up and find out why the documents are not being accepted on the KNB side?
Thanks,
Duane

On 7/11/2007, Duane Costa wrote:

Hi Jing,

I was just checking the replication logs on the LNO metacat, and it looks like there is an unusually large amount of replication happening with the timed replications (once every 48 hours), with many documents being sent from LNO to KNB. Do you know of any changes (on the KNB side) that might be causing this, and could you check the replication logs at the KNB metacat to see if you agree that it looks unusual?

Thanks,
Duane

Actions

Also available in: Atom PDF