Volume 3 : DTD Basics
The Need for XML Document Schema
XML documents are used for many different purposes today. Order processing, invoices, estimates, travel expense reports, meeting minutes, accounting forms, manuals, and other data used on a daily basis at work are only the tip of the iceberg. Now, we see XML used for personal data such as journals, household finances, and other applications. Virtually any data can be created using XML format, since XML allows a user to freely define element names and hierarchical structure.
In this volume, we look at two ways of writing XML documents: writing them simply according to XML syntax (well-formed XML documents), and writing them to be used as business-to-business data, in other words, in a data format to be shared between and among different companies.
For example, if requested to create an XML document to serve as a purchase order to be sent to Mr. Y at Company X, what kind of XML document would you create? The following are three XML document examples, created by three different individuals. Ms. A has created very semantic element names. Mr. B has opted for rather abbreviated element names, and Ms. C has created elements having a hierarchical structure.
Even a simple request to create an XML document to serve as a purchase order to be sent to Mr. Y at Company X can result in a number of different XML document patterns. Any of the three examples above can be considered a proper XML document. As long as the information required for a purchase order is included, almost any XML document you could create would be acceptable for the purpose.
But what would change if you approached the task from Mr. Y's perspective? Assume that Mr. Y uses the three XML documents above, or your XML document, for order processing. Having received XML documents with different element names and hierarchical structures, Mr. Y would have to open each XML document in an editor, confirm whether all of the required information was present, and then process the purchase order. In this case, every purchase order would have to be processed by hand, and the entire system could never be automated.
But what would happen if all XML documents sent had the same element names and hierarchical structure? With standard element names and structures, a system could be created to handle all incoming XML documents, and order processing could be automated, without Mr. Y having to verify the content of each individual document.
A "schema" is what makes it possible to accept (or create) XML documents with standardized element names and hierarchical structure. In the RDB world, a schema is defined when designing tables: it stipulates column data types and data sizes, sets the primary key, associates tables with other tables, and so on. In an XML document schema, the author specifies element names, their order of occurrence, and their number of occurrences. When XML is used for a specific purpose, a schema is defined first, and XML documents are then created in accordance with that schema. In doing so, anyone can create an XML document having exactly the same element names and hierarchical structure.
Let's take another look at the task of creating an XML document to be used as a purchase order. Assume that Mr. Y sends Ms. A, Mr. B, and Ms. C a schema document for purchase orders. Ms. A, Mr. B, and Ms. C then each create an XML document based on Mr. Y's schema. The element names and hierarchical structure of the XML documents they send to Mr. Y are completely identical.
Mr. Y can now use an XML parser to verify whether the documents have been created according to the schema, so there is no need to open each file and check element names and hierarchy structures. This reduces Mr. Y's workload significantly.
Types of XML Document Schema
There are many different types of XML document schema. The following narrative format can be considered a type of schema, but different people may interpret a narrative differently. This is why, in general, an XML document schema is written in a Schema Definition Language. A Schema Definition Language is a specialized language for describing schemas, and leaves no room for interpretive differences.
Purchase Order XML Schema
- The root element is "orderform"
- The content of "orderform" is a "customer" element and a "product" element, in that order. "customer" occurs once, and "product" may occur zero or more times
- The content of "customer" is the "name", "address", and "tel" elements, each occurring once in order
- The content of "name" and "address" is a text string
- The content of "tel" is the "portable" and "home" elements, with either one or the other occurring
- The content of "portable" and "home" is a text string
- The content of "product" is the "product_name" and "num" elements, each occurring once in order
- The content of "product_name" is a text string
- The content of "num" is a numeric value
There is more than one Schema Definition Language out there. The Schema Definition Language defined under the XML 1.0 specification is the "DTD (Document Type Definition)." An even more strictly defined Schema Definition Language is the "XML Schema" determined by the W3C. Different vendors also have defined various Schema Definition Languages.
DTD Schema Definition
Under DTD, the main components of the XML document are declared. Each declaration falls into one of the following four categories:
- Element Type Declaration
- Attribute List Declaration
- Entity Declaration
- Notation Declaration
Here, we will discuss the most important of these, the "Element Type Declaration."
Element Type Declarations declare the elements contained within an XML document. The syntax for an Element Type Declaration is as follows, where "element_name" and "content_model" are placeholders:
<!ELEMENT element_name content_model>
The Content Model is very important in the Element Type Declaration. It defines whether the element content is a text string or numeric value (character data), whether only child elements occur (element content), etc.
When content is text string or numeric value
When the element content is a text string or numeric value, the Content Model is designated as #PCDATA. Under DTD, there is no difference between numeric type data and text type data. For example, the following describes the Element Type Declaration that designates the content of "product_name" as a text string:
<!ELEMENT product_name (#PCDATA)>
The correct element description that conforms to this Element Type Declaration is <product_name>television</product_name>. Describing a child element such as <product_name><abc/></product_name> will cause an error.
The following describes the Element Type Declaration that designates the content of "num" as a numeric value:
<!ELEMENT num (#PCDATA)>
The correct element description for this declaration is <num>10</num>. As discussed earlier, DTD designates both text strings and numeric values as #PCDATA, so <num>Jenny</num> is also accepted. The application itself must check whether the content of an element is actually a number.
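The application-side check mentioned above could look something like the following minimal sketch. It uses Python's standard library, and the function names `parse_quantity` and `is_valid_quantity` are hypothetical names chosen for illustration. It also assumes that "a number" means a non-negative integer, which the article does not specify.

```python
import xml.etree.ElementTree as ET


def parse_quantity(xml_text: str) -> int:
    """Parse a <num> element and return its content as an integer.

    The DTD only guarantees that <num> holds character data, so the
    numeric check has to be done here, in the application.
    """
    element = ET.fromstring(xml_text)
    if element.tag != "num":
        raise ValueError(f"expected <num>, got <{element.tag}>")
    text = (element.text or "").strip()
    # Assumption: a quantity is written with ASCII digits only.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"<num> content is not a number: {text!r}")
    return int(text)


def is_valid_quantity(xml_text: str) -> bool:
    """Return True when the <num> content is a non-negative integer."""
    try:
        parse_quantity(xml_text)
    except ValueError:
        return False
    return True


print(is_valid_quantity("<num>10</num>"))     # True
print(is_valid_quantity("<num>Jenny</num>"))  # False, although the DTD accepts it
```

Keeping the check in one small function makes it easy to tighten later, for example to reject a quantity of zero.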
When content is a child element
When a child element occurs as the content of an element, the element name of the child element occurring is designated in the Content Model. However, the order of occurrence and number of occurrences of the child element must also be defined.
Defining the order of occurrence
When there are multiple child elements, you must designate the order of occurrence. There are two ways to notate it. A comma (,) between one child element name and the next indicates that the child elements occur in the order given. A vertical line (|) means that either one or the other child element occurs.
|","||Occurs in the order given|
|"|"||Either one or the other child element occurs|
In the following example, the content of "product" is the "product_name" and "num" elements, occurring once each in that order.
<!ELEMENT product (product_name,num)>
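The following is a valid element description for this Element Type Declaration:
<product>
  <product_name>television</product_name>
  <num>2</num>
</product>
Because "," fixes the order of occurrence as the order in which the child elements are written, descriptions that reverse the order or omit a child element, such as the following, are invalid:
<product>
  <num>2</num>
  <product_name>television</product_name>
</product>
<product>
  <product_name>television</product_name>
</product>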
To describe an Element Type Declaration where either the "portable" or "home" element (child elements of "tel") occurs:
<!ELEMENT tel (portable|home)>
In this case, describing both the "portable" and "home" elements, as in the following, is an error:
<tel>
  <portable>**********</portable>
  <home>**************</home>
</tel>
Defining the number of occurrences
In addition to the order of occurrence for child element names, the number of occurrences is also defined in the Content Model. The number of occurrences is designated with one of three symbols: "*", "+" or "?". The "*" symbol means "may occur zero or more times." The "+" symbol means "may occur one or more times." The "?" symbol means "may occur zero times or one time."
As in the Element Type Declaration shown earlier (<!ELEMENT product (product_name,num)>), not providing a symbol for the number of occurrences means "must occur once."
|"*"||May occur 0 or more times|
|"+"||May occur one or more times|
|"?"||May occur zero times or once|
|No designation||One time|
Under DTD, you cannot designate a specific number of occurrences (e.g. three times, between two and five times, etc.).
For example, suppose the content of "orderform" is the "customer" and "product" elements, in that order. To write an Element Type Declaration designating one occurrence of "customer" and zero or more occurrences of "product", use the following notation:
<!ELEMENT orderform (customer,product*)>
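The symbols ",", "|", "*", "+" and "?" behave much like the operators of a regular expression applied to the sequence of child element names. The following minimal sketch illustrates that reading. It is an illustration only, not how an XML parser is implemented. The function names `content_model_to_regex` and `children_match` are hypothetical, and the sketch handles element content only (not #PCDATA).

```python
import re
from typing import List


def content_model_to_regex(model: str) -> "re.Pattern":
    """Translate a DTD element-content model into a regular expression.

    Each child element name becomes the token "name " (the name plus one
    space). "," becomes plain concatenation, while "|", "*", "+", "?" and
    parentheses keep the meaning they already have in regular expressions.
    """
    compact = re.sub(r"\s+", "", model)  # spaces inside a model are optional
    body = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: "(?:" + re.escape(m.group(0)) + " )",
        compact,
    )
    return re.compile(body.replace(",", ""))


def children_match(model: str, children: List[str]) -> bool:
    """Check a list of child element names against a content model."""
    sequence = "".join(f"{name} " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None


# "," : the children must occur in the order given
print(children_match("(product_name,num)", ["product_name", "num"]))  # True
print(children_match("(product_name,num)", ["num", "product_name"]))  # False

# "|" : either one or the other child occurs
print(children_match("(portable|home)", ["home"]))              # True
print(children_match("(portable|home)", ["portable", "home"]))  # False

# "*" : zero or more occurrences
print(children_match("(customer,product*)", ["customer"]))                        # True
print(children_match("(customer,product*)", ["customer", "product", "product"]))  # True
print(children_match("(customer,product*)", ["product"]))                         # False
```

The trailing space after each name keeps "product" from matching the beginning of "product_name".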
Now, let's describe all of the elements, referencing the notation examples above.
When actually creating the DTD file, use the ".dtd" extension. The following shows a file named "order.dtd", which contains the DTD for the Purchase Order XML schema:
<!ELEMENT orderform (customer,product*)> 
<!ELEMENT customer (name,address,tel)> 
<!ELEMENT name (#PCDATA)> 
<!ELEMENT address (#PCDATA)> 
<!ELEMENT tel (portable | home)> 
<!ELEMENT portable (#PCDATA)> 
<!ELEMENT home (#PCDATA)> 
<!ELEMENT product (product_name,num)> 
<!ELEMENT product_name (#PCDATA)> 
<!ELEMENT num (#PCDATA)> 
LIST1: Valid XML Document for DTD
orderform.xml
<!DOCTYPE orderform SYSTEM "order.dtd">
<orderform>
  <customer>
    <name>Jenny</name>
    <address>Tokyo</address>
    <tel>
      <portable>555-5555-5555</portable>
    </tel>
  </customer>
  <product>
    <product_name>washing machine</product_name>
    <num>1</num>
  </product>
  <product>
    <product_name>television</product_name>
    <num>2</num>
  </product>
</orderform>
Declaration to Associate an XML Document and Schema Document
The <!DOCTYPE ...> at the beginning of LIST1 is called the "Document Type Declaration," and designates the DTD that defines the structure of the XML document. There are two notation methods. An "internal subset" writes the Element Type Declarations and other declarations inside the Document Type Declaration itself. An "external subset" (used here) places them in an external file. In this volume, we will discuss the notation method for an external subset.
The location of the Document Type Declaration is predetermined: it comes before the start tag of the root element. The syntax is shown below, where the root element name and the file name of the DTD are designated ("root_element_name" and "file_name" are placeholders):
<!DOCTYPE root_element_name SYSTEM "file_name">
Validating the XML Document
Once the schema document and XML document have been created, we can verify whether the XML document has been created in accordance with the schema document. This validation can be performed using an XML parser, eliminating the need for manual verification or creating a separate validation program.
In the prior volume, we explained how to use Internet Explorer ("IE") to verify whether an XML document has been correctly written. However, the XML parser incorporated within IE cannot verify whether an XML document has been created in accordance with a particular schema document. Accordingly, we will use a validating XML processor.
Let's verify the XML document we created against the schema document.
Next, create the XML document as shown in LIST2, and conduct the same operation as before. An error message should result.
LIST2: Invalid XML Document with respect to a DTD
orderform_err.xml
<!DOCTYPE orderform SYSTEM "order.dtd">
<orderform>
  <customer>
    <name>Jenny</name>
    <address>Tokyo</address>
  </customer>
  <product>
    <product_name>washing machine</product_name>
    <num>1</num>
  </product>
  <product>
    <product_name>television</product_name>
    <num>2</num>
  </product>
</orderform>
The reason that this type of error occurred is that the tel element does not occur in the XML document in LIST2, while the schema document requires that the "name", "address" and "tel" elements (content of "customer") occur once in that order. Use the error message in the dialog box as a clue to check the line before and after the error, and make the necessary edits.
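To make the idea of locating such an error concrete, here is a rough sketch of a checker for this one DTD. It is not a real validating parser: Python's standard library does not validate against DTDs, so the content models from order.dtd are copied by hand into a dictionary. The names `ORDER_DTD`, `model_to_regex` and `first_violation` are hypothetical, and the two sample documents are abbreviated versions of LIST1 and LIST2.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Content models copied by hand from order.dtd (assumption: kept in sync manually).
ORDER_DTD: Dict[str, str] = {
    "orderform": "(customer,product*)",
    "customer": "(name,address,tel)",
    "name": "(#PCDATA)",
    "address": "(#PCDATA)",
    "tel": "(portable|home)",
    "portable": "(#PCDATA)",
    "home": "(#PCDATA)",
    "product": "(product_name,num)",
    "product_name": "(#PCDATA)",
    "num": "(#PCDATA)",
}


def model_to_regex(model: str) -> "re.Pattern":
    """Turn an element-content model into a regex over 'name ' tokens."""
    compact = re.sub(r"\s+", "", model)
    body = re.sub(r"[A-Za-z_][\w.-]*",
                  lambda m: "(?:" + re.escape(m.group(0)) + " )", compact)
    return re.compile(body.replace(",", ""))


def first_violation(element: ET.Element) -> Optional[str]:
    """Return a message for the first element that breaks its model, else None."""
    model = ORDER_DTD.get(element.tag)
    if model is None:
        return f"<{element.tag}> is not declared"
    children = [child.tag for child in element]
    if model == "(#PCDATA)":
        return f"<{element.tag}> must contain only text" if children else None
    # Element content: only whitespace may appear between child elements.
    if (element.text or "").strip() or any((c.tail or "").strip() for c in element):
        return f"<{element.tag}> must not contain text"
    sequence = "".join(f"{tag} " for tag in children)
    if model_to_regex(model).fullmatch(sequence) is None:
        return f"<{element.tag}> content {children} does not match {model}"
    for child in element:
        problem = first_violation(child)
        if problem is not None:
            return problem
    return None


VALID = """<orderform><customer><name>Jenny</name><address>Tokyo</address>
<tel><portable>555-5555-5555</portable></tel></customer>
<product><product_name>television</product_name><num>2</num></product></orderform>"""

MISSING_TEL = """<orderform><customer><name>Jenny</name><address>Tokyo</address>
</customer>
<product><product_name>television</product_name><num>2</num></product></orderform>"""

print(first_violation(ET.fromstring(VALID)))        # None
print(first_violation(ET.fromstring(MISSING_TEL)))  # reports the <customer> element
```

Like the parser's error message, the returned text names the parent element whose content is wrong ("customer"), not the missing "tel" element itself.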
In the prior volume, we discussed using predefined entity references, since the "<" and "&" characters cannot be used directly as the content of an element. Since the <calculation>a1<b2</calculation> statement causes an error to occur, we rewrote the statement to read <calculation>a1&lt;b2</calculation>. There are five predefined entity references provided under the XML 1.0 specification.
<Table> Predefined Entity References
|Entity Reference||Entity Name||Character|
|&lt;||lt||<|
|&gt;||gt||>|
|&amp;||amp||&|
|&apos;||apos||'|
|&quot;||quot||"|
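The effect of the predefined entity references can be tried out with a short script. The sketch below uses Python's standard library. The helper names `is_well_formed` and `calculation_element` are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text: str) -> bool:
    """Return True when the text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def calculation_element(expression: str) -> str:
    """Wrap an expression in <calculation>, escaping special characters."""
    # escape() rewrites "&", "<" and ">" as "&amp;", "&lt;" and "&gt;".
    return f"<calculation>{escape(expression)}</calculation>"


# A raw "<" inside element content is not well-formed.
print(is_well_formed("<calculation>a1<b2</calculation>"))  # False

# With the predefined entity reference, the document parses, and the
# parser hands the original character back to the application.
safe = calculation_element("a1<b2")
print(safe)                       # <calculation>a1&lt;b2</calculation>
print(ET.fromstring(safe).text)   # a1<b2
```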
When using a DTD entity declaration, you can define your own entity references in addition to the five types above.
Advantage of Using Entity Declarations
Using an entity declaration allows you to accomplish the following two goals:
Improved efficiency in creating/ editing XML documents using replacement text string definitions
One advantage is the ability to define your own replacement text string, just like a predefined entity reference (an "internal entity" under the XML 1.0 specification). For example, when a long character data string occurs many times in an XML document, you can define an entity declaration and use the defined entity name instead. This reduces the amount of typing required and helps eliminate typographical errors. Also, if the character data changes, you only need to change the entity declaration, and every place that references the entity changes with it. There is no need to locate and retype each individual line affected by the change.
Distribute workload by loading external files
Another advantage is that you can use an entity declaration to load an external file ("external entity" under XML 1.0 specification). For example, there are times when several engineers work together on a large XML document. In this situation, there is a possible conflict over who gets access to the file when. While the file is being used by one engineer, the other engineers cannot get any work done.
But using an entity declaration allows each section to be created as its own discrete file and then incorporated into the original XML document. While the term "incorporated" can bring to mind a copy/paste type of operation, an entity declaration eliminates the need for one.
Here, we will explain the method for defining a replacement text string. The entity declaration syntax is as follows, where "entityname" and "replacement text" are placeholders:
<!ENTITY entityname "replacement text">
When used within an XML document, designate "&entityname;", similar to using a predefined entity reference.
The following shows the entity declaration definition and how it is used in an XML document:
<!ELEMENT product_name (#PCDATA)>
<!ENTITY TV "television">
<!DOCTYPE product_name SYSTEM =>
（=> used for line breaks due to space constraints）
Once you have created the document, open it using IE. The XML parser replaces "&TV;" with its replacement text, so "television" is displayed in your browser as the content of the product_name element.
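The same replacement can be observed outside a browser. The sketch below is a minimal illustration using Python's standard library. Because that parser does not load external DTD files, the declarations are placed in an internal subset here, which differs from the external subset used in this volume. The sample sentence, the template and the function name `displayed_text` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The DTD is written as an internal subset so the example is self-contained.
# Assumption: the replacement text contains no quotation marks, "&" or "%".
DOCUMENT_TEMPLATE = """<!DOCTYPE product_name [
  <!ELEMENT product_name (#PCDATA)>
  <!ENTITY TV "{replacement}">
]>
<product_name>&TV; and &TV; stand</product_name>"""


def displayed_text(replacement: str) -> str:
    """Parse the document and return the element content after expansion."""
    document = DOCUMENT_TEMPLATE.format(replacement=replacement)
    return ET.fromstring(document).text


# The parser expands every &TV; reference before the application sees the text.
print(displayed_text("television"))  # television and television stand

# Changing the single declaration changes every place that references it.
print(displayed_text("TV set"))      # TV set and TV set stand
```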
Assume you want either "A", "B", "AB" or "O" to occur as a child element of "BloodType". Select which of the following is a correct Element Type Declaration.
- A. <!ELEMENT BloodType (A|B|AB|O)>
- B. <!ELEMENT BloodType (A?B?AB?O)>
- C. <!ELEMENT BloodType (A,B,AB,O)>
- D. <!ELEMENT BloodType (A + B + AB + O)>
We use the "|" character when we want one or the other child element to occur. "," defines an occurrence in the order given, while "?" and "+" define the number of occurrences. Accordingly, the correct answer is A.
Select which of the following is a correct statement with respect to the Element Type Declaration.
- A. <!ELEMENT sample (data1|data2)>
Both data1 and data2 have to occur as child elements of the sample element.
- B. <!ELEMENT sample (data1,data2)>
Both data1 and data2 occur as child elements of the sample element; any order of occurrence is allowed.
- C. <!ELEMENT sample (data1|data2*)>
Both data1 and data2 may occur zero or more times as child elements of the sample element.
- D. <!ELEMENT sample (data1, data2*)>
As child elements of the sample element, data1 will first occur once, and then data2 will occur zero or more times.
A uses "|" to designate that either data1 or data2 occurs, so the statement that both child elements must occur is wrong. B uses "," to designate that the child elements occur in the order described, so the statement that any order is allowed is wrong. In C, there is no symbol defining the number of occurrences for data1, so its number of occurrences defaults to one. The statement that both data1 and data2 may occur zero or more times is therefore wrong. D uses "," to correctly indicate that the child elements occur in the order written, and "*" to correctly indicate that data2 occurs zero or more times. Accordingly, the correct answer is D.
Select which of the following is a correct description for the results displayed in a browser, given the DTD and XML document below:
<!ENTITY DB "DB Magazine">
- A. Using an Entity Declaration allows you only to call an external file. Replacement text strings are limited to the five predefined entity references, and cannot be created arbitrarily by the programmer.
- B. An error occurs because the Element Type Declaration DB and Entity Declaration DB names are duplicated.
- C. <DB>DB Magazine</DB>
With an Entity Declaration, a programmer can define his or her own replacement text string. Accordingly, A is incorrect. An Element Type Declaration and an Entity Declaration are different types of declaration, so using the same name in both does not cause an error. As such, B is incorrect. When the document is displayed in a browser, the replacement text of the entity reference is shown. Accordingly, C is the correct answer.
A DTD also includes an "Attribute List Declaration" for defining attributes, and "Notation Declaration" that provides for entities that allow the usage of images and other data that cannot be processed by an XML parser. While we did not cover these areas in this volume, a programmer should study and understand these techniques as well.
Employee Training Department, Hitachi Systems & Services, Ltd. Morita currently serves as a lecturer on Java within Hitachi Systems, in addition to being an Infoteria Certified Trainer and XML lecturer. Mr. Morita looks forward to his daily after-work beer, while very strictly (no, that should be gently) guiding his subordinates in their duties.
The content presented here is an HTML version of an article that originally appeared in the April 2007 issue of DB Magazine published by Shoeisya.