1
|
I originally tried to write DOMToDBMS as a SAX application but concluded
|
2
|
that it was impossible. However, this conclusion was based on the
|
3
|
assumption that the class did no caching of data and, in this case, it
|
4
|
was impossible. However, if some data is cached as Row objects, it is
|
5
|
quite possible that SAXToDBMS is possible. The problem that needs to
|
6
|
be solved is as follows.
|
7
|
|
8
|
Consider the sample Sales Order document. This has data for four tables:
|
9
|
Sales, Customers, Lines, and Parts. Now consider the key relationships:
|
10
|
Sales has a foreign key that points to Customers and a primary key that
|
11
|
points to Lines, and Lines has a foreign key that points to Parts. The
|
12
|
trick is that, to meet referential integrity constraints in the database,
|
13
|
you need to add the rows with primary keys first. Thus, you need to add the
|
14
|
Customers row before you can add the Sales row, you need to add the Sales
|
15
|
row before you can add the Lines row, and you need to add the Parts row
|
16
|
before you can add the Lines row.
|
17
|
|
18
|
The tree-walking algorithm I use is as follows. Start from the root. When
|
19
|
you encounter a node mapped to a table, such as the SalesOrder node, create
|
20
|
a buffer for a row in that table, then processes the children of that node.
|
21
|
If a child is mapped to a column, such as the OrderDate node, store the
|
22
|
node's value in the buffer. If a child is mapped to a table that stores a
|
23
|
foreign key in the current table, such as the Customer node, process the
|
24
|
node immediately and store the foreign key in the current row buffer. If a
|
25
|
child is mapped to a table that uses the primary key of the current table,
|
26
|
such as the Line nodes, save the node is for later processing, which you do
|
27
|
after inserting the parent node, so you can pass the primary key to the
|
28
|
child node.
|
29
|
|
30
|
In my original SAX implementation, I did not use row buffers. Thus, the
|
31
|
implementation was clearly impossible -- I would, for example, have to
|
32
|
immediately process (and insert) the sales data, which comes before the
|
33
|
customer data. This leads to referential integrity problems. However,
|
34
|
you've got me thinking now and it's quite possible you might be able to
|
35
|
build the row buffers as you go, then insert them in the correct order. If
|
36
|
so, this would be terrific.
|
37
|
|
38
|
Anyway, the general strategy is something along the lines of:
|
39
|
|
40
|
**********
|
41
|
In startElement().
|
42
|
1) Get the map for the element.
|
43
|
|
44
|
2) If the element is mapped as a class:
|
45
|
a) create a new Row for the element.
|
46
|
b) read the attributes linearly, getting each name and value.
|
47
|
c) get the property map for each attribute. if it is stored as TOCOLUMN,
|
48
|
store the value in the current Row. if it is stored in as TOPROPERTYTABLE,
|
49
|
create a Row for the property table and store the value in it.
|
50
|
|
51
|
3) If the element is mapped as a property then, if the element is mapped as
|
52
|
TOCOLUMN, do nothing now. If it is mapped as TOPROPERTYTABLE, create a Row
|
53
|
for the property table.
|
54
|
|
55
|
********
|
56
|
In characters():
|
57
|
|
58
|
1) If you are processing an element mapped as a property then:
|
59
|
a) If the element is mapped as TOCOLUMN, store the value in the parent
|
60
|
element's Row.
|
61
|
b) If the element is mapped as TOPROPERTYTABLE, store the value in the
|
62
|
property table Row.
|
63
|
2) If you are processing an element mapped as a class, get the map for the
|
64
|
PCDATA. If this is mapped as TOCOLUMN, store the value in the element's
|
65
|
Row. If this is mapped as TOPROPERTYTABLE, create a Row for the property
|
66
|
table and store the value in it.
|
67
|
3) IMPORTANT! A single piece of PCDATA can result in multiple calls to
|
68
|
characters, so you need to deal with this somehow -- for example, appending
|
69
|
data to column values rather than just creating a string and storing it in
|
70
|
the Row.
|
71
|
|
72
|
*********
|
73
|
In endElement():
|
74
|
1) If the element is mapped as a property, I don't think there is anything
|
75
|
to do.
|
76
|
|
77
|
2) If the element is mapped as a class, then (I'm thinking out loud here),
|
78
|
by reaching the end of the element, you know that all child elements are
|
79
|
done. You don't need to worry about child elements-as-properties, as
|
80
|
processing for them is finished. All you care about is child
|
81
|
elements-as-classes and their offspring. I think what you do is
|
82
|
recursively process these as follows:
|
83
|
a) First process all child elements-as-classes that contain primary keys.
|
84
|
These need to be inserted and their foreign keys stored in the parent row.
|
85
|
b) Next insert the parent row.
|
86
|
c) Finally, insert the primary key from the parent row into the children
|
87
|
skipped in (a) and insert these children.
|
88
|
|
89
|
Note that when I talk about child elements-as-classes, I think I am also
|
90
|
referring to property tables for the current parent.
|
91
|
|
92
|
This is the only really tricky part of the code, as these elements could be
|
93
|
nested quite deep and you need to keep track of which have been inserted
|
94
|
and which haven't. However, the more I talk about this, the more I believe
|
95
|
it is possible. Furthermore, although you end up caching a certain amount
|
96
|
of data, I think it's a good bet that this approaches the same memory
|
97
|
requirements as the DOM for trees in which there are no siblings, only
|
98
|
parents and children. (For example, if your tree has a root with 10000
|
99
|
children, each of which has three children and no grandchildren, then
|
100
|
assuming the root is ignored, you will, at most, only cache the original
|
101
|
child and its three children at any one point in time. On the other hand,
|
102
|
if your tree has only parents and children, then depending on the PK/FK
|
103
|
relationships, you might have to cache the whole thing and insert the most
|
104
|
deeply nested child first. In any case, caching the rows and keeping their
|
105
|
maps around is likely to be far less expensive than the equivalent DOM.)
|
106
|
|
107
|
The last thing to consider is that you need to keep order information about
|
108
|
the children of each element-as-class. Because elements can be nested,
|
109
|
there will be a stack of order information, as you need to maintain this on
|
110
|
a per-level basis. I don't think this will be too hard -- just a push
|
111
|
every time you start an element-as-class, a pop every you end an
|
112
|
element-as-class, and an increment every time you start an
|
113
|
element-as-property or call to characters (remembering, of course, the
|
114
|
problem with multiple calls to characters for a single piece of PCDATA).
|