Q: How do I handle large documents in XML-DBMS?
Applies to: 1.x, 2.0
Most large documents seem to fall into one of three categories:
Repeating data. For example, you have thousands or millions of sales orders, or rows of astronomical data, in the same document. In this case, there is generally a wrapper element around what is really a set of separate documents or "rows" of data. For example, the following document consists of multiple separate sales orders, each of which can be inserted separately:
<SalesOrders>
   <SalesOrder> ... </SalesOrder>
   <SalesOrder> ... </SalesOrder>
   <SalesOrder> ... </SalesOrder>
</SalesOrders>
Related data. In this case, you simply have a huge amount of related data. I have heard of financial transactions that require multiple MBs of XML because of all the contextual information that must be transmitted to process the actual transaction.
Documents. It is possible for documents (such as books) to be as large as 5 MB. However, for this to happen, the documents would probably need to include graphics encoded as Base64 -- 5 MB is a *lot* of text.
Because XML-DBMS uses DOM trees to represent XML documents, it has size limitations. DOM trees are kept in memory and are larger than the original document, so large documents can easily exceed available memory. I'm not sure that 5 MB documents would cause problems on a modern machine, though. Even if the DOM tree is 10 times larger than the original document, this is still only 50 MB.
Large documents with repeating data (case 1) can easily be processed by "cutting" them into separate documents, each of which is processed separately. The cutter uses SAX to read the document and creates a DOM tree for each occurrence of a particular element, such as the <SalesOrder> element in the example above. As long as the sub-documents are not too large, it can process documents of any size.
One way to do this is to write an application that uses SAXDOMIX to split the document into smaller documents and then makes a separate call to DOMToDBMS.storeDocument for each one. Your application needs to implement the SDXController interface in SAXDOMIX, which consists of two methods: wantDOM and handleDOM.
wantDOM returns true if you want a DOM tree returned for the element. In the above case, wantDOM would return true for <SalesOrder> elements and false for all other elements. When wantDOM returns true, SAXDOMIX passes the DOM tree to handleDOM. In the case of an XML-DBMS application, handleDOM would pass the DOM tree to DOMToDBMS.storeDocument.
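For illustration, here is a rough sketch of such a controller in Java. The method signatures shown for wantDOM and handleDOM, and the argument list of DOMToDBMS.storeDocument, are assumptions based on the description above rather than the actual SAXDOMIX and XML-DBMS APIs, so check the documentation for both before using it:

import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

// Sketch only: the signatures and the storeDocument argument list are assumed.
public class SalesOrderController implements SDXController
{
   private DOMToDBMS domToDBMS;   // configured elsewhere with the map, connection, etc.

   public SalesOrderController(DOMToDBMS domToDBMS)
   {
      this.domToDBMS = domToDBMS;
   }

   // Ask SAXDOMIX for a DOM tree only when it reaches a <SalesOrder> element.
   public boolean wantDOM(String namespaceURI, String localName, String qName, Attributes attrs)
   {
      return "SalesOrder".equals(qName);
   }

   // Called with the completed DOM subtree for each <SalesOrder>;
   // each subtree is stored as if it were a separate document.
   public void handleDOM(Element salesOrder) throws SAXException
   {
      try
      {
         // Argument list abbreviated here; the real method also takes the
         // mapping information you set up when creating the DOMToDBMS object.
         domToDBMS.storeDocument(salesOrder);
      }
      catch (Exception e)
      {
         throw new SAXException(e.getMessage());
      }
   }
}

The main program would create the SAXDOMIX splitter, register this controller with it, and run it over the large document. Each <SalesOrder> then arrives in handleDOM as its own small DOM tree and is inserted separately, so only one sales order is ever in memory at a time.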
SAXDOMIX is Open Source and is available from:
http://www.devsphere.com/xml/saxdomix/index.html
You can also find a simple, SAX-based "cutter" in section 7.1 of the IBM Redbook "XML for DB2 Information Integration":
http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246994.html?Open
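The Redbook's code is not reproduced here, but a rough sketch of the same idea using only standard JAXP and SAX classes (the element name SalesOrder and the store method are placeholders for your own logic) might look like this:

import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Reads a huge <SalesOrders> document with SAX and builds a small DOM tree
// for each <SalesOrder>, which can then be stored separately.
public class SalesOrderCutter extends DefaultHandler
{
   private SAXTransformerFactory factory =
      (SAXTransformerFactory) TransformerFactory.newInstance();
   private TransformerHandler builder;   // collects SAX events into a DOM tree
   private DOMResult result;
   private int depth;                    // element depth inside the current <SalesOrder>

   public void startElement(String uri, String localName, String qName, Attributes attrs)
      throws SAXException
   {
      if (builder == null && "SalesOrder".equals(qName))
      {
         try
         {
            builder = factory.newTransformerHandler();
            result = new DOMResult();
            builder.setResult(result);
            builder.startDocument();
         }
         catch (Exception e)
         {
            throw new SAXException(e.getMessage());
         }
         depth = 0;
      }
      if (builder != null)
      {
         builder.startElement(uri, localName, qName, attrs);
         depth++;
      }
   }

   public void characters(char[] ch, int start, int length) throws SAXException
   {
      if (builder != null) builder.characters(ch, start, length);
   }

   public void endElement(String uri, String localName, String qName) throws SAXException
   {
      if (builder != null)
      {
         builder.endElement(uri, localName, qName);
         if (--depth == 0)
         {
            builder.endDocument();
            store((Document) result.getNode());
            builder = null;
            result = null;
         }
      }
   }

   // Placeholder: this is where an XML-DBMS application would call
   // DOMToDBMS.storeDocument with the mapping information it set up earlier.
   protected void store(Document salesOrder)
   {
      System.out.println("Got one sales order");
   }

   public static void main(String[] args) throws Exception
   {
      SAXParserFactory spf = SAXParserFactory.newInstance();
      spf.newSAXParser().parse(new java.io.File(args[0]), new SalesOrderCutter());
   }
}

Because only one <SalesOrder> subtree is held in memory at a time, the size of the overall document no longer matters.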
Whether it is possible to process other types of large documents (cases 2 and 3) depends on whether they can be cut into separate pieces, each of which can be processed separately. (For example, is it acceptable to insert the sections of a book separately?) In any case, you would probably need custom code to pre-process the documents and break them into manageable pieces.