Defining XML Views over Relational Data
Copyright 2001 by Ronald Bourret
The current implementations of XML over relational databases follow two strategies:
- A canonical mapping, such as that described in Mapping DTDs to Databases.
- A query language, such as the FOR XML EXTENDED clause in Microsoft SQL Server.
The problem with the first strategy is that it limits the number of XML documents that can be constructed to those that match the canonical mapping. This means that many documents need to be transformed with XSLT before they are transferred to the database or after they are retrieved from the database. On the other hand, it has the advantage of being reversible. That is, it can be used to transfer data to and from the database. While the second strategy can construct virtually any XML document, it is not reversible. That is, in the general case, it cannot be used to transfer data from an XML document to the database.
One likely future of XML-enabled databases involves defining virtual XML documents (XML views) over a relational database, then issuing queries against these documents. The probable language will be XQuery, from the W3C. Initially, this will contain syntax for queries only -- that is, for retrieving documents from the database. However, a future release will contain syntax for updates, deletes, and inserts, such as that proposed by Igor Tatarinov, Zachary G. Ives, Alon Y. Halevy, and Daniel S. Weld.
One user-interface issue in building a query language over a relational database is how to effectively define virtual XML documents. While a few products use a hard-coded mapping, most products have some sort of mapping language that allows users to map a particular set of tables and columns to element types and attributes. Examples are DAD in IBM DB2, annotated schemas in Microsoft SQL Server, the mapping languages in XML-DBMS and Castor, and the languages behind the GUI tools in numerous products.
The problem with all of these languages is that they are verbose. Furthermore, one mapping document is required for each virtual XML document. For example, consider the following three tables:
Orders Items Customers ---------- --------- ---------- OrderID OrderID CustomerID CustomerID ItemNumber Name OrderDate PartID Address Quantity Telephone
Now suppose you wanted to define two different XML documents over these tables:
<Orders> <Customers> <Order> <Customer> <OrderID>...</OrderID> <CustomerID>...</CustomerID> <OrderDate>...</OrderDate> <Name>...</Name> <Customer> <Address>...</Address> <CustomerID>...</CustomerID> <Telephone>...</Telephone> <Name>...</Name> <Order> <Address>...</Address> <OrderID>...</OrderID> </Customer> <OrderDate>...</OrderDate> <Item> </Order> ... ... </Item> <Order> ... ... <Item> </Order> ... </Customer> </Item> ... </Order> </Customers> ... </Orders>
To do this, you would have to write two separate mapping documents. Even with a GUI-based mapping tool, constructing these documents can be a tedious task, especially if different names are used in the XML document and the database.
What is frustrating about this is that both documents are defined over the same set of tables and contain much of the same information; for example, both map the Orders table to the <Order> element type and the Orders.OrderID column to the <OrderID> element type. In fact, the mappings for the above two documents are the same except: (a) which tables are included, (b) which columns from each table are included, and (c) which tables are mapped to parent element types and which tables are mapped to child element types.
The last item points out a fundamental difference between XML documents and relational databases: The data in an XML document forms a directed graph (generally a tree), while the data in a relational database forms an undirected graph. The latter is true since unique key / foreign key links can be traversed in either direction and is the primary advantage that relational databases have over hierarchical databases. This suggests an easy way to define virtual XML documents over a relational database:
A canonical map. This maps the nodes (tables and columns) in the undirected graph in a relational database to element types, attributes, and PCDATA. It also provides information about how to traverse the edges in this graph that connect class tables. (A class table is the table mapped to a complex element type is mapped.) That is, it provides a list of unique key / foreign key pairs. However, it does not define directions for these edges and therefore does not define how to nest the corresponding element types in the XML document.
(Note that this does does map direction along edges that connect class tables to property tables. Property tables are used to store multi-valued attributes or multi-valued child elements. Property tables are an artifact of the (often by design) inability of many relational databases to store multi-valued attributes in a column and are best thought of as part of the class table.)
One or more XML views. These define a virtual XML document by specifying a wrapper element (if needed), a set of tables (as well as a root table), a set of columns in those tables, and a set of directions on the edges between class tables. The directions describe which way the unique key / foreign key links in the database are traversed and consequently how nesting is done in the XML document.
For example, the above tables can be mapped with a canonical map that provides the following information:
Table Element type name ------ ----------------- Orders Order Items Item Customers Customer Element type, Attribute, Table Column or PCDATA Name ----- ---------- ------------- -------------- Orders OrderID Element type OrderID OrderDate Element type OrderDate CustomerID Element type CustomerID Items OrderID Element type OrderID ItemNumber Element type ItemNumber PartID Element type PartID Quantity Element type Quantity Customers CustomerID Element type CustomerID Name Element type Name Address Element type Address Telephone Element type Telephone Unique key table Unique key columns Foreign key table Foreign key columns ---------------- ------------------ ----------------- ------------------- Customers CustomerID Orders CustomerID Orders OrderID Items OrderID
The first view can be specified with the following information:
Wrapper element: <Orders> Root table: Orders Tables and columns: Orders (OrderID, OrderDate) Items (ItemNumber, PartID, Quantity) Customers (CustomerID, Name, Address) Edge directions: Orders => Customers, Orders => Items
And the second view can be specified with the following information:
Wrapper element: <Customers> Root table: Customers Tables and columns: Orders (OrderID, OrderDate) Customers (CustomerID, Name, Address, Telephone) Edge directions: Customers => Orders
If you have played around with mapping relational data to XML, you will notice that the canonical map contains roughly the same amount of information as is currently needed to map a set of tables to an XML document. The advantage of the method shown here is that, once the canonical map is defined, it is easy to quickly define multiple views over this canonical map. Each of these contains vastly less information than is currently needed to map a set of tables to an XML document. While this is not a technical breakthrough of any sort, it does make life significantly easier for the user.
The only drawback of this method appears to be that once the user has chosen to map a column to an element type, attribute, or PCDATA, it must be mapped that way in all views defined over the canonical map. However, since the people using this strategy are likely to put the database first and XML second, it is probable that they will standardize on mapping columns either to element types or to attributes and that this will not be an issue.