/xmldbms/SAXToDBMS.txt - Metacat - Ecoinformatics Redmine

metacat/xmldbms/SAXToDBMS.txt @ 38

       I originally tried to write DOMToDBMS as a SAX application but concluded
       that it was impossible. However, this conclusion was based on the
       assumption that the class did no caching of data and, in this case, it
       was impossible. However, if some data is cached as Row objects, it is
       quite possible that SAXToDBMS is possible. The problem that needs to
       be solved is as follows.
       Consider the sample Sales Order document. This has data for four tables:
       Sales, Customers, Lines, and Parts.  Now consider the key relationships:
       Sales has a foreign key that points to Customers and a primary key that
       points to Lines, and Lines has a foreign key that points to Parts.  The
       trick is that, to meet referential integrity constraints in the database,
       you need to add the rows with primary keys first. Thus, you need to add the
       Customers row before you can add the Sales row, you need to add the Sales
       row before you can add the Lines row, and you need to add the Parts row
       before you can add the Lines row.
       The tree-walking algorithm I use is as follows. Start from the root. When
       you encounter a node mapped to a table, such as the SalesOrder node, create
       a buffer for a row in that table, then processes the children of that node.
       If a child is mapped to a column, such as the OrderDate node, store the
       node's value in the buffer. If a child is mapped to a table that stores a
       foreign key in the current table, such as the Customer node, process the
       node immediately and store the foreign key in the current row buffer. If a
       child is mapped to a table that uses the primary key of the current table,
       such as the Line nodes, save the node is for later processing, which you do
       after inserting the parent node, so you can pass the primary key to the
       child node.
       In my original SAX implementation, I did not use row buffers. Thus, the
       implementation was clearly impossible -- I would, for example, have to
       immediately process (and insert) the sales data, which comes before the
       customer data. This leads to referential integrity problems. However,
       you've got me thinking now and it's quite possible you might be able to
       build the row buffers as you go, then insert them in the correct order. If
       so, this would be terrific.
       Anyway, the general strategy is something along the lines of:
       **********
       In startElement().
 ) Get the map for the element.
 ) If the element is mapped as a class:
       a) create a new Row for the element.
       b) read the attributes linearly, getting each name and value.
       c) get the property map for each attribute. if it is stored as TOCOLUMN,
       store the value in the current Row. if it is stored in as TOPROPERTYTABLE,
       create a Row for the property table and store the value in it.
 ) If the element is mapped as a property then, if the element is mapped as
       TOCOLUMN, do nothing now. If it is mapped as TOPROPERTYTABLE, create a Row
       for the property table.
       ********
       In characters():
 ) If you are processing an element mapped as a property then:
       a) If the element is mapped as TOCOLUMN, store the value in the parent
       element's Row.
       b) If the element is mapped as TOPROPERTYTABLE, store the value in the
       property table Row.
 ) If you are processing an element mapped as a class, get the map for the
       PCDATA. If this is mapped as TOCOLUMN, store the value in the element's
       Row. If this is mapped as TOPROPERTYTABLE, create a Row for the property
       table and store the value in it.
 ) IMPORTANT! A single piece of PCDATA can result in multiple calls to
       characters, so you need to deal with this somehow -- for example, appending
       data to column values rather than just creating a string and storing it in
       the Row.
       *********
       In endElement():
 ) If the element is mapped as a property, I don't think there is anything
       to do.
 ) If the element is mapped as a class, then (I'm thinking out loud here),
       by reaching the end of the element, you know that all child elements are
       done. You don't need to worry about child elements-as-properties, as
       processing for them is finished.  All you care about is child
       elements-as-classes and their offspring.  I think what you do is
       recursively process these as follows:
       a) First process all child elements-as-classes that contain primary keys.
       These need to be inserted and their foreign keys stored in the parent row.
       b) Next insert the parent row.
       c) Finally, insert the primary key from the parent row into the children
       skipped in (a) and insert these children.
       Note that when I talk about child elements-as-classes, I think I am also
       referring to property tables for the current parent.
       This is the only really tricky part of the code, as these elements could be
       nested quite deep and you need to keep track of which have been inserted
       and which haven't. However, the more I talk about this, the more I believe
       it is possible. Furthermore, although you end up caching a certain amount
       of data, I think it's a good bet that this approaches the same memory
       requirements as the DOM for trees in which there are no siblings, only
       parents and children. (For example, if your tree has a root with 10000
       children, each of which has three children and no grandchildren, then
       assuming the root is ignored, you will, at most, only cache the original
       child and its three children at any one point in time. On the other hand,
       if your tree has only parents and children, then depending on the PK/FK
       relationships, you might have to cache the whole thing and insert the most
       deeply nested child first.  In any case, caching the rows and keeping their
       maps around is likely to be far less expensive than the equivalent DOM.)
       The last thing to consider is that you need to keep order information about
       the children of each element-as-class. Because elements can be nested,
       there will be a stack of order information, as you need to maintain this on
       a per-level basis.  I don't think this will be too hard -- just a push
       every time you start an element-as-class, a pop every you end an
       element-as-class, and an increment every time you start an
       element-as-property or call to characters (remembering, of course, the
       problem with multiple calls to characters for a single piece of PCDATA).

(3-3/9)

Project

General

Profile

Metacat