Declaring Elements and Attributes in an XML DTD

Copyright 1999-2002 by Ronald Bourret

This paper is designed to introduce the reader to the grammar used in XML DTDs to declare elements and attributes. It does not rigorously define this grammar, nor does it define the entire grammar used in DTDs. Among other things, the grammar for notations and entities is omitted. For a complete definition of the DTD grammar, see the XML specification or the annotated XML specification.

NOTE: The examples below often redefine the same element. This is for simplicity only; it is an error to define an element more than once in an actual DTD.

XML Markup Languages

An XML document primarily consists of a strictly nested hierarchy of elements with a single root. Elements can contain character data, child elements, or a mixture of both. In addition, they can have attributes. Child character data and child elements are strictly ordered; attributes are not. For example:

<?xml version="1.0" ?>
<Book Author="Anonymous">
   <Title>Sample Book</Title>
   <Chapter id="c1">
       This is chapter 1. It is not very long or interesting.
   </Chapter>
   <Chapter id="c2">
       This is chapter 2. Although it is longer than chapter 1,
       it is not any more interesting.
   </Chapter>
</Book>

The names of the elements and attributes and their order in the hierarchy (among other things) form the XML markup language used by the document. This language can be defined by the document author or it can be inferred from the document's structure. In the example shown above, the language contains three elements: Book, Title, and Chapter. The Book element contains a single Title element and one or more Chapter elements. The Book element has an Author attribute and the Chapter element has an id attribute.

The main reason to define the language explicitly is so that documents can be checked for conformance to it. For example, if we defined a grammar for the Book language, authors using this grammar could use a validating parser to ensure that their documents conform to the language.

An XML markup language is defined in a Document Type Definition (DTD). The DTD can be contained in a <!DOCTYPE> tag, contained in an external file that is referenced from a <!DOCTYPE> tag, or both. For example, the document shown above could contain the following <!DOCTYPE> tag:

<!DOCTYPE Book [
   <!ELEMENT Book (Title, Chapter+)>
   <!ATTLIST Book Author CDATA #REQUIRED>
   <!ELEMENT Title (#PCDATA)>
   <!ELEMENT Chapter (#PCDATA)>
   <!ATTLIST Chapter id ID #REQUIRED>
]>
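
As a sketch of the external case, the same five declarations could be stored in a separate file and referenced by a system identifier. The file name book.dtd is an assumption made for illustration only:

```xml
<!-- book.dtd is a hypothetical file holding the declarations shown above -->
<!DOCTYPE Book SYSTEM "book.dtd">
```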

Elements

1) An element is defined as a group of one or more subelements/subgroups, character data, EMPTY, or ANY. For example:
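
A minimal sketch of the four forms. The element names A, B, C, and D are hypothetical and serve only as placeholders:

```xml
<!-- A group of subelements -->
<!ELEMENT A (B, C)>

<!-- Character data -->
<!ELEMENT B (#PCDATA)>

<!-- EMPTY -->
<!ELEMENT C EMPTY>

<!-- ANY -->
<!ELEMENT D ANY>
```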

2) Elements defined as groups of subelements/subgroups constitute non-terminals in the language. Elements defined as character data, EMPTY, or ANY constitute terminals. For example:
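
Using the Book language from the previous section as an illustration, Book is a non-terminal, while Title and Chapter are terminals:

```xml
<!-- Non-terminal: defined in terms of other elements -->
<!ELEMENT Book (Title, Chapter+)>

<!-- Terminals: defined as character data -->
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Chapter (#PCDATA)>
```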

Although it is legal to define a language containing non-terminals that never resolve to terminals, such as one with purely circular definitions, it is generally impossible to create a valid document that uses those elements, so such definitions are of little practical use.
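
A sketch of a purely circular definition, with hypothetical elements A and B. Every A must contain a B, which must contain another A, and so on without end:

```xml
<!-- Neither element ever resolves to a terminal -->
<!ELEMENT A (B)>
<!ELEMENT B (A)>
```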

3) Groups can be either a sequence or choice of subelements and/or subgroups. For example:
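
A sketch of each kind of group, again with hypothetical element names:

```xml
<!-- Sequence: A contains B followed by C -->
<!ELEMENT A (B, C)>

<!-- Choice: A contains either B or C -->
<!ELEMENT A (B | C)>

<!-- Sequence containing a subgroup: B followed by either C or D -->
<!ELEMENT A (B, (C | D))>

<!-- Choice containing a subgroup: either B, or C followed by D -->
<!ELEMENT A (B | (C, D))>
```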

4) Optional (?), one-or-more (+), and zero-or-more (*) operators can be applied to groups, subgroups, and subelements. For example:
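
A sketch of the operators applied at each level, with hypothetical element names:

```xml
<!-- Subelement operators: B is optional, C occurs one or more times,
     D occurs zero or more times -->
<!ELEMENT A (B?, C+, D*)>

<!-- Subgroup operator: B followed by one or more occurrences of C or D -->
<!ELEMENT A (B, (C | D)+)>

<!-- Group operator: zero or more occurrences of B or C, in any order -->
<!ELEMENT A (B | C)*>
```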

5) Elements containing character data can be declared as containing only character data:
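
A sketch, with a hypothetical element A:

```xml
<!ELEMENT A (#PCDATA)>
```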

or as containing a mixture of character data and elements in any order:
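
A sketch, with hypothetical elements A, B, and C:

```xml
<!-- Character data interleaved with any number of B and C elements -->
<!ELEMENT A (#PCDATA | B | C)*>
```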

In the latter case, the declaration must place #PCDATA first in the group, the group must be a choice, and the group must carry the zero-or-more (*) operator. Such groups are generally referred to as "mixed content" (as opposed to element-only groups or "element content"). Technically, mixed content refers to any element containing character data, including one declared as containing only character data. However, in common usage it refers only to the case in which character data is mixed with elements.

Note: "PCDATA" in the declarations is short for "Parsed Character DATA". The term is inherited from SGML and comes from the fact that the text in the XML document following the element tag is parsed looking for more markup tags. Although it is possible to include unparsed character data through the use of CDATA sections, these can occur only where PCDATA occurs. While this is of interest to parser writers, it does not affect the syntax of DTDs, nor does it affect the resulting elements -- they still contain character data.

6) EMPTY means that the element has no child elements or character data. Empty elements often have attributes -- see below.
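
A sketch of an empty element that carries an attribute. The names A and x are hypothetical:

```xml
<!ELEMENT A EMPTY>
<!ATTLIST A x CDATA #REQUIRED>
```

In a document, such an element could be written as <A x="abc"/>.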

7) ANY means that the element can contain zero or more child elements of any declared type, as well as character data. It is therefore a shorthand for mixed content containing all declared elements.
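
A sketch of the shorthand, assuming a DTD that declares only the hypothetical elements A, B, and C:

```xml
<!ELEMENT A ANY>
<!ELEMENT B (#PCDATA)>
<!ELEMENT C EMPTY>
```

Under that assumption, the declaration of A accepts the same content as <!ELEMENT A (#PCDATA | A | B | C)*>. Note that A itself is among the declared elements and may therefore appear as its own child.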

Attributes

1) Elements can have zero or more attributes. For example:
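
A sketch with hypothetical element and attribute names, showing elements with zero, one, and two attributes:

```xml
<!-- A has no ATTLIST and therefore no attributes -->
<!ELEMENT A (#PCDATA)>

<!-- B has one attribute -->
<!ELEMENT B (#PCDATA)>
<!ATTLIST B x CDATA #REQUIRED>

<!-- C has two attributes -->
<!ELEMENT C (#PCDATA)>
<!ATTLIST C
   x CDATA #REQUIRED
   y CDATA #IMPLIED>
```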

2) A single ATTLIST statement can declare multiple attributes for the same element. Multiple ATTLIST statements can declare attributes for the same element. That is, the following are equivalent:
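
A sketch of the two equivalent forms, with hypothetical names:

```xml
<!-- One ATTLIST declaring both attributes -->
<!ATTLIST A
   x CDATA #REQUIRED
   y CDATA #IMPLIED>

<!-- Two ATTLISTs, each declaring one attribute -->
<!ATTLIST A x CDATA #REQUIRED>
<!ATTLIST A y CDATA #IMPLIED>
```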

3) Attributes can be optional, required, or have a fixed value. Optional attributes can have a default; fixed attributes must have a default. For example:
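
A sketch of the four possibilities, using a hypothetical attribute x and an illustrative value "abc":

```xml
<!-- Optional, no default -->
<!ATTLIST A x CDATA #IMPLIED>

<!-- Optional, with a default of "abc" -->
<!ATTLIST A x CDATA "abc">

<!-- Required -->
<!ATTLIST A x CDATA #REQUIRED>

<!-- Fixed: if present, the value must be "abc" -->
<!ATTLIST A x CDATA #FIXED "abc">
```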

4) Each attribute has a type:
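
The types defined by XML 1.0 are CDATA, ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, NMTOKENS, enumerated, and NOTATION. The sketch below declares one hypothetical attribute of each type. The enumerated values and the notation names gif and jpeg are illustrative, and the notations are assumed to be declared elsewhere, since notation declarations are outside the scope of this paper:

```xml
<!-- String and name-token types -->
<!ATTLIST A a CDATA #IMPLIED>
<!ATTLIST A b NMTOKEN #IMPLIED>
<!ATTLIST A c NMTOKENS #IMPLIED>

<!-- Identifier types: an element type can have at most one ID attribute -->
<!ATTLIST A d ID #IMPLIED>
<!ATTLIST A e IDREF #IMPLIED>
<!ATTLIST A f IDREFS #IMPLIED>

<!-- Entity types: values name unparsed entities declared in the DTD -->
<!ATTLIST A g ENTITY #IMPLIED>
<!ATTLIST A h ENTITIES #IMPLIED>

<!-- Enumerated type: the value must be one of the listed tokens -->
<!ATTLIST A i (yes | no) #IMPLIED>

<!-- Notation type: the value must be one of the listed notation names -->
<!ATTLIST A j NOTATION (gif | jpeg) #IMPLIED>
```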

Comments

1) DTDs can contain comments. Comments are delimited by <!-- and -->. For example:
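
A sketch of a comment placed before a declaration. The element names are hypothetical:

```xml
<!-- A contains a B followed by a C. -->
<!ELEMENT A (B, C)>
```

A comment cannot appear inside a declaration, and its text cannot contain two consecutive hyphens.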