Defining XML Views over Relational Data

The current implementations of XML over relational databases follow two strategies:

A canonical mapping, such as that described in Mapping DTDs to Databases.
A query language, such as the FOR XML EXTENDED clause in Microsoft SQL Server.

The problem with the first strategy is that it limits the number of XML documents that can be constructed to those that match the canonical mapping. This means that many documents need to be transformed with XSLT before they are transferred to the database or after they are retrieved from the database. On the other hand, it has the advantage of being reversible. That is, it can be used to transfer data to and from the database. While the second strategy can construct virtually any XML document, it is not reversible. That is, in the general case, it cannot be used to transfer data from an XML document to the database.

One likely future of XML-enabled databases involves defining virtual XML documents (XML views) over a relational database, then issuing queries against these documents. The probable language will be XQuery, from the W3C. Initially, this will contain syntax for queries only -- that is, for retrieving documents from the database. However, a future release will contain syntax for updates, deletes, and inserts, such as that proposed by Igor Tatarinov, Zachary G. Ives, Alon Y. Halevy, and Daniel S. Weld.

One user-interface issue in building a query language over a relational database is how to effectively define virtual XML documents. While a few products use a hard-coded mapping, most products have some sort of mapping language that allows users to map a particular set of tables and columns to element types and attributes. Examples are DAD in IBM DB2, annotated schemas in Microsoft SQL Server, the mapping languages in XML-DBMS and Castor, and the languages behind the GUI tools in numerous products.

The problem with all of these languages is that they are verbose. Furthermore, one mapping document is required for each virtual XML document. For example, consider the following three tables:

   Orders          Items          Customers
   ----------      ---------      ----------
   OrderID         OrderID        CustomerID
   CustomerID      ItemNumber     Name
   OrderDate       PartID         Address
                   Quantity       Telephone

Now suppose you wanted to define two different XML documents over these tables:

   <Orders>                                       <Customers>
      <Order>                                        <Customer>
         <OrderID>...</OrderID>                         <CustomerID>...</CustomerID>
         <OrderDate>...</OrderDate>                     <Name>...</Name>
         <Customer>                                     <Address>...</Address>
            <CustomerID>...</CustomerID>                <Telephone>...</Telephone>
            <Name>...</Name>                            <Order>
            <Address>...</Address>                         <OrderID>...</OrderID>
         </Customer>                                       <OrderDate>...</OrderDate>
         <Item>                                         </Order>
            ...                                         ...
         </Item>                                        <Order>
         ...                                                ...
         <Item>                                         </Order>
            ...                                      </Customer>
         </Item>                                     ...
      </Order>                                    </Customers>
      ...
   </Orders>

To do this, you would have to write two separate mapping documents. Even with a GUI-based mapping tool, constructing these documents can be a tedious task, especially if different names are used in the XML document and the database.

What is frustrating about this is that both documents are defined over the same set of tables and contain much of the same information; for example, both map the Orders table to the <Order> element type and the Orders.OrderID column to the <OrderID> element type. In fact, the mappings for the above two documents are the same except: (a) which tables are included, (b) which columns from each table are included, and (c) which tables are mapped to parent element types and which tables are mapped to child element types.

The last item points out a fundamental difference between XML documents and relational databases: The data in an XML document forms a directed graph (generally a tree), while the data in a relational database forms an undirected graph. The latter is true since unique key / foreign key links can be traversed in either direction and is the primary advantage that relational databases have over hierarchical databases. This suggests an easy way to define virtual XML documents over a relational database:

A canonical map. This maps the nodes (tables and columns) in the undirected graph in a relational database to element types, attributes, and PCDATA. It also provides information about how to traverse the edges in this graph that connect class tables. (A class table is the table mapped to a complex element type is mapped.) That is, it provides a list of unique key / foreign key pairs. However, it does not define directions for these edges and therefore does not define how to nest the corresponding element types in the XML document.

(Note that this does does map direction along edges that connect class tables to property tables. Property tables are used to store multi-valued attributes or multi-valued child elements. Property tables are an artifact of the (often by design) inability of many relational databases to store multi-valued attributes in a column and are best thought of as part of the class table.)
One or more XML views. These define a virtual XML document by specifying a wrapper element (if needed), a set of tables (as well as a root table), a set of columns in those tables, and a set of directions on the edges between class tables. The directions describe which way the unique key / foreign key links in the database are traversed and consequently how nesting is done in the XML document.

For example, the above tables can be mapped with a canonical map that provides the following information:

   Table        Element type name
   ------       -----------------
   Orders       Order
   Items        Item
   Customers    Customer


                             Element type,
                             Attribute,
   Table      Column         or PCDATA        Name
   -----      ----------     -------------    --------------
   Orders     OrderID        Element type     OrderID
              OrderDate      Element type     OrderDate
              CustomerID     Element type     CustomerID

   Items      OrderID        Element type     OrderID
              ItemNumber     Element type     ItemNumber
              PartID         Element type     PartID
              Quantity       Element type     Quantity

   Customers  CustomerID     Element type     CustomerID
              Name           Element type     Name
              Address        Element type     Address
              Telephone      Element type     Telephone

   
   Unique key table    Unique key columns    Foreign key table   Foreign key columns
   ----------------    ------------------    -----------------   -------------------
   Customers           CustomerID            Orders              CustomerID
   Orders              OrderID               Items               OrderID

The first view can be specified with the following information:

   Wrapper element:     <Orders>

   Root table:          Orders

   Tables and columns:  Orders    (OrderID, OrderDate)
                        Items     (ItemNumber, PartID, Quantity)
                        Customers (CustomerID, Name, Address)

   Edge directions:     Orders => Customers, Orders => Items

And the second view can be specified with the following information:

   Wrapper element:     <Customers>

   Root table:          Customers

   Tables and columns:  Orders    (OrderID, OrderDate)
                        Customers (CustomerID, Name, Address, Telephone)

   Edge directions:     Customers => Orders

If you have played around with mapping relational data to XML, you will notice that the canonical map contains roughly the same amount of information as is currently needed to map a set of tables to an XML document. The advantage of the method shown here is that, once the canonical map is defined, it is easy to quickly define multiple views over this canonical map. Each of these contains vastly less information than is currently needed to map a set of tables to an XML document. While this is not a technical breakthrough of any sort, it does make life significantly easier for the user.

The only drawback of this method appears to be that once the user has chosen to map a column to an element type, attribute, or PCDATA, it must be mapped that way in all views defined over the canonical map. However, since the people using this strategy are likely to put the database first and XML second, it is probable that they will standardize on mapping columns either to element types or to attributes and that this will not be an issue.

Ronald Bourret

Technical writing, Software development, and Consulting

Defining XML Views over Relational Data