1 |
14
|
jones
|
I originally tried to write DOMToDBMS as a SAX application but concluded
|
2 |
|
|
that it was impossible. However, this conclusion was based on the
|
3 |
|
|
assumption that the class did no caching of data and, in this case, it
|
4 |
|
|
was impossible. However, if some data is cached as Row objects, it is
|
5 |
|
|
quite possible that SAXToDBMS is possible. The problem that needs to
|
6 |
|
|
be solved is as follows.
|
7 |
|
|
|
8 |
|
|
Consider the sample Sales Order document. This has data for four tables:
|
9 |
|
|
Sales, Customers, Lines, and Parts. Now consider the key relationships:
|
10 |
|
|
Sales has a foreign key that points to Customers and a primary key that
|
11 |
|
|
points to Lines, and Lines has a foreign key that points to Parts. The
|
12 |
|
|
trick is that, to meet referential integrity constraints in the database,
|
13 |
|
|
you need to add the rows with primary keys first. Thus, you need to add the
|
14 |
|
|
Customers row before you can add the Sales row, you need to add the Sales
|
15 |
|
|
row before you can add the Lines row, and you need to add the Parts row
|
16 |
|
|
before you can add the Lines row.
|
17 |
|
|
|
18 |
|
|
The tree-walking algorithm I use is as follows. Start from the root. When
|
19 |
|
|
you encounter a node mapped to a table, such as the SalesOrder node, create
|
20 |
|
|
a buffer for a row in that table, then processes the children of that node.
|
21 |
|
|
If a child is mapped to a column, such as the OrderDate node, store the
|
22 |
|
|
node's value in the buffer. If a child is mapped to a table that stores a
|
23 |
|
|
foreign key in the current table, such as the Customer node, process the
|
24 |
|
|
node immediately and store the foreign key in the current row buffer. If a
|
25 |
|
|
child is mapped to a table that uses the primary key of the current table,
|
26 |
|
|
such as the Line nodes, save the node is for later processing, which you do
|
27 |
|
|
after inserting the parent node, so you can pass the primary key to the
|
28 |
|
|
child node.
|
29 |
|
|
|
30 |
|
|
In my original SAX implementation, I did not use row buffers. Thus, the
|
31 |
|
|
implementation was clearly impossible -- I would, for example, have to
|
32 |
|
|
immediately process (and insert) the sales data, which comes before the
|
33 |
|
|
customer data. This leads to referential integrity problems. However,
|
34 |
|
|
you've got me thinking now and it's quite possible you might be able to
|
35 |
|
|
build the row buffers as you go, then insert them in the correct order. If
|
36 |
|
|
so, this would be terrific.
|
37 |
|
|
|
38 |
|
|
Anyway, the general strategy is something along the lines of:
|
39 |
|
|
|
40 |
|
|
**********
|
41 |
|
|
In startElement().
|
42 |
|
|
1) Get the map for the element.
|
43 |
|
|
|
44 |
|
|
2) If the element is mapped as a class:
|
45 |
|
|
a) create a new Row for the element.
|
46 |
|
|
b) read the attributes linearly, getting each name and value.
|
47 |
|
|
c) get the property map for each attribute. if it is stored as TOCOLUMN,
|
48 |
|
|
store the value in the current Row. if it is stored in as TOPROPERTYTABLE,
|
49 |
|
|
create a Row for the property table and store the value in it.
|
50 |
|
|
|
51 |
|
|
3) If the element is mapped as a property then, if the element is mapped as
|
52 |
|
|
TOCOLUMN, do nothing now. If it is mapped as TOPROPERTYTABLE, create a Row
|
53 |
|
|
for the property table.
|
54 |
|
|
|
55 |
|
|
********
|
56 |
|
|
In characters():
|
57 |
|
|
|
58 |
|
|
1) If you are processing an element mapped as a property then:
|
59 |
|
|
a) If the element is mapped as TOCOLUMN, store the value in the parent
|
60 |
|
|
element's Row.
|
61 |
|
|
b) If the element is mapped as TOPROPERTYTABLE, store the value in the
|
62 |
|
|
property table Row.
|
63 |
|
|
2) If you are processing an element mapped as a class, get the map for the
|
64 |
|
|
PCDATA. If this is mapped as TOCOLUMN, store the value in the element's
|
65 |
|
|
Row. If this is mapped as TOPROPERTYTABLE, create a Row for the property
|
66 |
|
|
table and store the value in it.
|
67 |
|
|
3) IMPORTANT! A single piece of PCDATA can result in multiple calls to
|
68 |
|
|
characters, so you need to deal with this somehow -- for example, appending
|
69 |
|
|
data to column values rather than just creating a string and storing it in
|
70 |
|
|
the Row.
|
71 |
|
|
|
72 |
|
|
*********
|
73 |
|
|
In endElement():
|
74 |
|
|
1) If the element is mapped as a property, I don't think there is anything
|
75 |
|
|
to do.
|
76 |
|
|
|
77 |
|
|
2) If the element is mapped as a class, then (I'm thinking out loud here),
|
78 |
|
|
by reaching the end of the element, you know that all child elements are
|
79 |
|
|
done. You don't need to worry about child elements-as-properties, as
|
80 |
|
|
processing for them is finished. All you care about is child
|
81 |
|
|
elements-as-classes and their offspring. I think what you do is
|
82 |
|
|
recursively process these as follows:
|
83 |
|
|
a) First process all child elements-as-classes that contain primary keys.
|
84 |
|
|
These need to be inserted and their foreign keys stored in the parent row.
|
85 |
|
|
b) Next insert the parent row.
|
86 |
|
|
c) Finally, insert the primary key from the parent row into the children
|
87 |
|
|
skipped in (a) and insert these children.
|
88 |
|
|
|
89 |
|
|
Note that when I talk about child elements-as-classes, I think I am also
|
90 |
|
|
referring to property tables for the current parent.
|
91 |
|
|
|
92 |
|
|
This is the only really tricky part of the code, as these elements could be
|
93 |
|
|
nested quite deep and you need to keep track of which have been inserted
|
94 |
|
|
and which haven't. However, the more I talk about this, the more I believe
|
95 |
|
|
it is possible. Furthermore, although you end up caching a certain amount
|
96 |
|
|
of data, I think it's a good bet that this approaches the same memory
|
97 |
|
|
requirements as the DOM for trees in which there are no siblings, only
|
98 |
|
|
parents and children. (For example, if your tree has a root with 10000
|
99 |
|
|
children, each of which has three children and no grandchildren, then
|
100 |
|
|
assuming the root is ignored, you will, at most, only cache the original
|
101 |
|
|
child and its three children at any one point in time. On the other hand,
|
102 |
|
|
if your tree has only parents and children, then depending on the PK/FK
|
103 |
|
|
relationships, you might have to cache the whole thing and insert the most
|
104 |
|
|
deeply nested child first. In any case, caching the rows and keeping their
|
105 |
|
|
maps around is likely to be far less expensive than the equivalent DOM.)
|
106 |
|
|
|
107 |
|
|
The last thing to consider is that you need to keep order information about
|
108 |
|
|
the children of each element-as-class. Because elements can be nested,
|
109 |
|
|
there will be a stack of order information, as you need to maintain this on
|
110 |
|
|
a per-level basis. I don't think this will be too hard -- just a push
|
111 |
|
|
every time you start an element-as-class, a pop every you end an
|
112 |
|
|
element-as-class, and an increment every time you start an
|
113 |
|
|
element-as-property or call to characters (remembering, of course, the
|
114 |
|
|
problem with multiple calls to characters for a single piece of PCDATA).
|