Project

General

Profile

1
I originally tried to write DOMToDBMS as a SAX application but concluded
2
that it was impossible. However, this conclusion was based on the
3
assumption that the class did no caching of data and, in this case, it
4
was impossible. However, if some data is cached as Row objects, it is
5
quite possible that SAXToDBMS is possible. The problem that needs to
6
be solved is as follows.
7

    
8
Consider the sample Sales Order document. This has data for four tables: 
9
Sales, Customers, Lines, and Parts.  Now consider the key relationships: 
10
Sales has a foreign key that points to Customers and a primary key that 
11
points to Lines, and Lines has a foreign key that points to Parts.  The 
12
trick is that, to meet referential integrity constraints in the database, 
13
you need to add the rows with primary keys first. Thus, you need to add the 
14
Customers row before you can add the Sales row, you need to add the Sales 
15
row before you can add the Lines row, and you need to add the Parts row 
16
before you can add the Lines row.
17

    
18
The tree-walking algorithm I use is as follows. Start from the root. When 
19
you encounter a node mapped to a table, such as the SalesOrder node, create 
20
a buffer for a row in that table, then processes the children of that node. 
21
If a child is mapped to a column, such as the OrderDate node, store the 
22
node's value in the buffer. If a child is mapped to a table that stores a 
23
foreign key in the current table, such as the Customer node, process the 
24
node immediately and store the foreign key in the current row buffer. If a 
25
child is mapped to a table that uses the primary key of the current table, 
26
such as the Line nodes, save the node is for later processing, which you do 
27
after inserting the parent node, so you can pass the primary key to the 
28
child node.
29

    
30
In my original SAX implementation, I did not use row buffers. Thus, the 
31
implementation was clearly impossible -- I would, for example, have to 
32
immediately process (and insert) the sales data, which comes before the 
33
customer data. This leads to referential integrity problems. However, 
34
you've got me thinking now and it's quite possible you might be able to 
35
build the row buffers as you go, then insert them in the correct order. If 
36
so, this would be terrific.
37

    
38
Anyway, the general strategy is something along the lines of:
39

    
40
**********
41
In startElement().
42
1) Get the map for the element.
43

    
44
2) If the element is mapped as a class:
45
a) create a new Row for the element.
46
b) read the attributes linearly, getting each name and value.
47
c) get the property map for each attribute. if it is stored as TOCOLUMN, 
48
store the value in the current Row. if it is stored in as TOPROPERTYTABLE, 
49
create a Row for the property table and store the value in it.
50

    
51
3) If the element is mapped as a property then, if the element is mapped as 
52
TOCOLUMN, do nothing now. If it is mapped as TOPROPERTYTABLE, create a Row 
53
for the property table.
54

    
55
********
56
In characters():
57

    
58
1) If you are processing an element mapped as a property then:
59
a) If the element is mapped as TOCOLUMN, store the value in the parent 
60
element's Row.
61
b) If the element is mapped as TOPROPERTYTABLE, store the value in the 
62
property table Row.
63
2) If you are processing an element mapped as a class, get the map for the 
64
PCDATA. If this is mapped as TOCOLUMN, store the value in the element's 
65
Row. If this is mapped as TOPROPERTYTABLE, create a Row for the property 
66
table and store the value in it.
67
3) IMPORTANT! A single piece of PCDATA can result in multiple calls to 
68
characters, so you need to deal with this somehow -- for example, appending 
69
data to column values rather than just creating a string and storing it in 
70
the Row.
71

    
72
*********
73
In endElement():
74
1) If the element is mapped as a property, I don't think there is anything 
75
to do.
76

    
77
2) If the element is mapped as a class, then (I'm thinking out loud here), 
78
by reaching the end of the element, you know that all child elements are 
79
done. You don't need to worry about child elements-as-properties, as 
80
processing for them is finished.  All you care about is child 
81
elements-as-classes and their offspring.  I think what you do is 
82
recursively process these as follows:
83
a) First process all child elements-as-classes that contain primary keys. 
84
These need to be inserted and their foreign keys stored in the parent row.
85
b) Next insert the parent row.
86
c) Finally, insert the primary key from the parent row into the children 
87
skipped in (a) and insert these children.
88

    
89
Note that when I talk about child elements-as-classes, I think I am also 
90
referring to property tables for the current parent.
91

    
92
This is the only really tricky part of the code, as these elements could be 
93
nested quite deep and you need to keep track of which have been inserted 
94
and which haven't. However, the more I talk about this, the more I believe 
95
it is possible. Furthermore, although you end up caching a certain amount 
96
of data, I think it's a good bet that this approaches the same memory 
97
requirements as the DOM for trees in which there are no siblings, only 
98
parents and children. (For example, if your tree has a root with 10000 
99
children, each of which has three children and no grandchildren, then 
100
assuming the root is ignored, you will, at most, only cache the original 
101
child and its three children at any one point in time. On the other hand, 
102
if your tree has only parents and children, then depending on the PK/FK 
103
relationships, you might have to cache the whole thing and insert the most 
104
deeply nested child first.  In any case, caching the rows and keeping their 
105
maps around is likely to be far less expensive than the equivalent DOM.)
106

    
107
The last thing to consider is that you need to keep order information about 
108
the children of each element-as-class. Because elements can be nested, 
109
there will be a stack of order information, as you need to maintain this on 
110
a per-level basis.  I don't think this will be too hard -- just a push 
111
every time you start an element-as-class, a pop every you end an 
112
element-as-class, and an increment every time you start an 
113
element-as-property or call to characters (remembering, of course, the 
114
problem with multiple calls to characters for a single piece of PCDATA).
(3-3/9)