XML Tutorial
Volume 2 : Creating XML Documents

Hiromi Morita

Index

Elements: The Basic Unit of an XML Document

Points to Remember when Creating XML Documents

Review Questions (Section 2: Creating XML Documents)

Elements: The Basic Unit of an XML Document

In this volume, we will discuss Section 2 of the Basic V2 Exam, "Creating XML Documents."

The focus of this article is the "element."
As mentioned many times in the previous volume, an "element" is the basic unit used to express data in an XML document. An element consists of three parts: the "start tag," the "content," and the "end tag."

The start tag is written as <element name>, and the end tag as </element name>. In general, the element name can be chosen freely, and text, numeric values, other elements, and so on can be written as element content. The smallest XML document consists of a single element, such as the following:

<name>Jenny</name>
(start tag: <name>; content: Jenny; end tag: </name>)

In this way, <name>Jenny</name> by itself can form a single XML document.
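As a quick check of the single-element document, the following minimal sketch uses Python's standard xml.etree.ElementTree module (a parser chosen here for illustration; the article itself uses a browser). The variable names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Parse the smallest document from the text: a single element.
root = ET.fromstring("<name>Jenny</name>")

element_name = root.tag      # the name written in the start and end tags
element_content = root.text  # the text between the start tag and the end tag

print(element_name, element_content)
```

If the document were not well-formed, ET.fromstring would raise ET.ParseError instead of returning an element.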

Important Points When Describing Elements

There are several important points to note when writing elements. An element that does not follow these rules cannot be regarded as "correct."

The element name for each start and end tag must agree

The element names in the start tag and the end tag must match. Element names are case-sensitive: uppercase and lowercase characters are treated as different characters. Half-width and full-width characters are also treated as different characters.

[Figure: examples of element names]
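The matching rule can be tried out with a short sketch. It assumes Python's standard xml.etree.ElementTree parser, and is_well_formed is a hypothetical helper name introduced only for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# Start and end tag names agree exactly.
matching = is_well_formed("<name>Jenny</name>")

# "Name" and "name" differ only in case, so the tags do not match.
case_mismatch = is_well_formed("<Name>Jenny</name>")

print(matching, case_mismatch)
```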

Be careful of blank spaces in start and end tags!

Always write the element name immediately after the "<" character. No blank space, tab, carriage return, or line feed may appear between the "<" character and the element name. In an end tag, the same restriction applies between the "<" character and the "/" character, and between the "/" character and the element name that follows it. An element name itself may not contain spaces.

However, a blank space, tab, carriage return, or line feed may be placed before the ">" character that closes a start or end tag.

[Figure: example of blank spaces in start and end tags]
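The whitespace rules can be checked case by case. This is a sketch under the assumption that Python's standard xml.etree.ElementTree parser is used; the accepts helper and the sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def accepts(document: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

checks = {
    "space after '<' in start tag": accepts("< name>Jenny</name>"),
    "space after '<' in end tag": accepts("<name>Jenny< /name>"),
    "space after '/' in end tag": accepts("<name>Jenny</ name>"),
    "space inside the element name": accepts("<na me>Jenny</na me>"),
    # Whitespace (including a line feed) before '>' is allowed.
    "whitespace before '>'": accepts("<name >Jenny</name\n>"),
}

for description, ok in checks.items():
    print(f"{description}: {'accepted' if ok else 'rejected'}")
```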

Avoid using a number as the first character in an element name

While element names may be freely chosen by the programmer, there are restrictions on the types of characters that may be used.

Character Location             Valid Characters
First character of the name    Alphabetic characters, underscore (_)
Second character and beyond    Alphabetic characters, underscore (_), numerals, period (.), hyphen (-)

Numerals, periods, etc. may not be used as the first character in an element name.

[Figure: element name whose first character is a numeral]
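The naming rule above can be expressed as a pattern and compared with what a real parser accepts. This is a simplified sketch: the pattern covers ASCII letters only (an assumption; the XML specification permits many more characters), and the helper names and sample names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Simplified rule from the text, limited to ASCII letters (an assumption):
# first character: letter or underscore; later characters may also be
# numerals, periods, or hyphens.
SIMPLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")

def follows_simple_rule(name: str) -> bool:
    return SIMPLE_NAME.fullmatch(name) is not None

def parser_accepts(name: str) -> bool:
    """Build an empty element with the given name and try to parse it."""
    try:
        ET.fromstring(f"<{name}/>")
        return True
    except ET.ParseError:
        return False

names = ["book1", "_book", "book.v2", "book-list", "1_book", ".book", "-book"]
results = {n: (follows_simple_rule(n), parser_accepts(n)) for n in names}

for name, (by_rule, by_parser) in results.items():
    print(f"{name}: rule={by_rule}, parser={by_parser}")
```

For these sample names, the simplified rule and the parser agree; names starting with a numeral, period, or hyphen are rejected by both.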

Element Content

Alphanumeric characters may be written as element content.

[Figure: alphanumeric characters as element content]

Another element (a child element) may also be written as the content of an element.

[Figure: another element (child element) as element content]

Here, we see a feature of XML, its hierarchical data structure, in action. The example above shows a two-level hierarchy, but you can create three, four, or any number of levels.
Alternatively, you may leave the content of an element out altogether. For example, writing <element_name></element_name> expresses an element with no content. To achieve the same effect more simply, you can just write <element_name/>. With this notation, neither the start tag nor the end tag needs to be written, which reduces the amount of markup and also clearly shows the intention to create an element without content. This simplified tag (<element_name/>) is called an "empty element tag."
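The following sketch illustrates child elements and the two ways of writing an element without content. It assumes Python's standard xml.etree.ElementTree parser; the element names book, title, and note are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# A two-level hierarchy: book contains two child elements.
nested = ET.fromstring("<book><title>XML Master Basic</title><note/></book>")
child_names = [child.tag for child in nested]

# An element with no content, written both ways.
long_form = ET.fromstring("<note></note>")
short_form = ET.fromstring("<note/>")

# Both forms yield an element with no text and no child elements.
both_empty = (
    long_form.text is None and len(long_form) == 0
    and short_form.text is None and len(short_form) == 0
)

print(child_names, both_empty)
```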

Above, we have briefly covered important points to remember when writing elements. From here on, we will discuss points to remember when creating XML documents.

Points to Remember when Creating XML Documents

To use a symbol in element content

We have discussed how text strings and values can be written as element content. Next, let’s look at what we can do when we want to write special characters as the content of an element.

For example, suppose we have the following XML document:

<sample>
    <calculation>a1*b2</calculation>
    <calculation>a1<b2</calculation>
    <calculation>a1>b2</calculation>
</sample>

In the example above, there is one incorrect calculation element. Let’s see which of the calculation elements is in error.

Use Notepad or another text editor to write this XML document, and save the document with an ".xml" extension. Next, open this file in Microsoft Internet Explorer or Mozilla Firefox. When the browser tries to display the XML document, it reports a syntax error in the second calculation element.

Let’s start our discussion with <calculation>a1*b2</calculation>, the first element in our example. We all know that in a SQL statement, the "*" character is used as a wildcard. In XML, a "*" character written as the content of an element is interpreted simply as a text character. Accordingly, the character is not treated like a SQL wildcard; it is just one of the characters that can be used in element content. As such, this is a correct element.
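The same check can also be made from a program rather than a browser. The sketch below assumes Python's standard xml.etree.ElementTree parser; the wording of the error message comes from the underlying parser and may differ from what a browser displays.

```python
import xml.etree.ElementTree as ET

sample = """<sample>
    <calculation>a1*b2</calculation>
    <calculation>a1<b2</calculation>
    <calculation>a1>b2</calculation>
</sample>"""

try:
    ET.fromstring(sample)
    error_line = None  # no error: the document is well-formed
except ET.ParseError as err:
    # position is a (line, column) tuple reported by the parser
    error_line, _column = err.position
    print(f"not well-formed at line {error_line}: {err}")
```

The reported line is the one holding the second calculation element, counting the <sample> start tag as line 1.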

So what about the third element description, <calculation>a1>b2</calculation>? The "<" and ">" characters are used to describe start and end tags, and as such, there may be many out there who think that these characters can’t be used to describe element content.

XML Parsing Sequence

The key to unraveling this mystery is in the XML parsing procedure. Parsing of an XML document is performed by a piece of software called an XML parser. An XML parser determines whether an XML document is correctly written according to XML syntax (the XML specification calls this a "well-formed" XML document), or whether the XML document contains incorrect XML syntax. We just asked you to create an actual XML document and check it using Internet Explorer or Mozilla Firefox. An XML parser comes standard with Internet Explorer and Mozilla Firefox, so these programs automatically parse an XML document when they open it. When the document is a correct XML document, the content is displayed in the browser; in the event of an error, the browser displays an error message.

Other XML parsers are available from a variety of vendors, in case you wish to parse XML from within a program rather than using a browser.

An XML parser parses an XML document in order from top to bottom, left to right. When the parser encounters a "<" character in the XML document, it checks whether the character immediately following is a valid character for an element name. If the character is valid, the parser determines that the "<" character is the beginning of a start tag. Next, the parser checks whether the characters that follow are valid for use in an element name, until the ">" character is reached. When the parser encounters a ">" character, it interprets the character as the end of the start tag, and then continues the checking process, this time determining whether characters are valid characters for element content. The parser interprets a "</" combination of characters as the beginning of an end tag and, as with the start tag, verifies all of the characters that follow until the ">" character is reached, which is interpreted as the close of the end tag.

Now, let’s talk about the calculation elements from above. In the case of the second element, <calculation>a1<b2</calculation>, the first "<" character indicates the beginning of the start tag, and the character immediately following is "c," a valid character for an element name. Accordingly, the parser recognizes the beginning of a start tag, and then checks each character coming after the "c" character in the element name, until it reaches the ">" character. To this point, the parser determines that the document is correct.

Next, the parser checks the element content, and determines that all is well up to "a1," but when it encounters the following "<" character, the parser interprets that character as the beginning of another start tag. You see? The parser interpreted the "<" character as the beginning of a start tag. Accordingly, the following two characters, b2, are interpreted as part of the element name. When the parser reaches the next "<" character (the "<" in </calculation>), it throws an error, since a "<" character cannot be used as part of an element name.
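The scanning steps described above can be mimicked with a toy function. This is only a sketch of the idea, not how a real XML parser is implemented: it handles ASCII names only, ignores attributes and whitespace inside tags, and read_tag_name and the character sets are names invented for this example.

```python
import string
from typing import Tuple

NAME_START = set(string.ascii_letters + "_")
NAME_CHARS = NAME_START | set(string.digits + ".-")

def read_tag_name(text: str, lt_index: int) -> Tuple[str, int]:
    """Read a start-tag name beginning at text[lt_index] == '<'.

    Returns (name, index of the closing '>').
    Raises ValueError when an invalid character is found.
    """
    i = lt_index + 1
    if i >= len(text) or text[i] not in NAME_START:
        raise ValueError(f"invalid first character of element name at index {i}")
    while i < len(text) and text[i] != ">":
        if text[i] not in NAME_CHARS:
            raise ValueError(f"invalid character {text[i]!r} in element name at index {i}")
        i += 1
    if i == len(text):
        raise ValueError("start tag is never closed")
    return text[lt_index + 1:i], i

text = "<calculation>a1<b2</calculation>"

# The first '<' opens a valid start tag.
first = read_tag_name(text, 0)

# The second '<' (inside the content) is also read as a start tag,
# and scanning fails at the '<' that follows "b2".
second_lt = text.index("<", 1)
try:
    read_tag_name(text, second_lt)
    second_error = None
except ValueError as err:
    second_error = str(err)

print(first, second_error)
```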

So, what about the ">" character in the third element, <calculation>a1>b2</calculation>? The ">" character is interpreted simply as the content of the element. When ">" occurs in the content of the element, there is no open "<" or "</" for it to pair with, so the character is not interpreted as the close of a start or end tag. As a result, the correct answer is that the "<" character is the symbol that causes an error when used as the content of an element. Understanding the XML parsing sequence allows us to solve the mystery of this error.
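To confirm the conclusion for each of the three elements separately, each one can be parsed on its own. This sketch assumes Python's standard xml.etree.ElementTree parser; the results dictionary is a name introduced for this example.

```python
import xml.etree.ElementTree as ET

results = {}
for content in ["a1*b2", "a1<b2", "a1>b2"]:
    document = f"<calculation>{content}</calculation>"
    try:
        # On success, store the element content as the parser read it.
        results[content] = ET.fromstring(document).text
    except ET.ParseError:
        # None marks a document the parser rejected.
        results[content] = None

for content, parsed in results.items():
    print(f"{content!r}: {'rejected' if parsed is None else 'accepted'}")
```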

Technique for using "<" in the Content of an Element

To this point, we have discussed how using the "<" character in the content of an element will cause an error. But what can we do when we really want to use the "<" character? In that case, we can use the "predefined entity references" defined in the XML specification. The "<" character is represented by the predefined entity reference "&lt;".

The "lt" of "&lt;" is called the "entity name," which is predefined in the XML specification. By placing a "&" character before and a ";" character after the entity name, we can represent the desired character as a predefined entity reference.

<calculation>a1&lt;b2</calculation>

If we entered the characters "lt" by themselves, the XML parser would have no way of knowing whether "lt" is a text string or an entity name. Accordingly, when writing a predefined entity reference, the "lt" must be preceded by the "&" character and followed by the ";" character.

Now, the "<" character is interpreted not as the beginning of a start or end tag, but rather as a text character. In addition to "lt," there are four other predefined entity references.

Character   Entity Name   Notation
<           lt            &lt;
>           gt            &gt;
&           amp           &amp;
"           quot          &quot;
'           apos          &apos;
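The effect of the predefined entity references can be observed by parsing them and looking at the resulting text. This sketch assumes Python's standard xml.etree.ElementTree parser; the element name text in the second document is hypothetical.

```python
import xml.etree.ElementTree as ET

# The parser replaces &lt; with the "<" character in the element content.
decoded = ET.fromstring("<calculation>a1&lt;b2</calculation>").text

# All five predefined entity references in one element.
all_five = ET.fromstring("<text>&lt; &gt; &amp; &quot; &apos;</text>").text

print(decoded)
print(all_five)
```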

Caution when using "&" in Element Content

What happens when we want to use the "&" character in our element content? If we simply enter the character "&" as it is, the XML parser will interpret the character as the beginning of an entity reference. When writing the character "&" in element content, then, we use "&amp;", which is its predefined entity reference.
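When element content is generated by a program, the replacement can be automated. The sketch below uses escape from Python's standard xml.sax.saxutils module, which replaces "&", "<", and ">" with their entity references; the sample text and the element name c are made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Q&A: a1<b2"

# Written as-is, the "&" and "<" make the document not well-formed.
try:
    ET.fromstring(f"<c>{raw}</c>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False

# After escaping, the parser accepts the document and returns the original text.
escaped = escape(raw)
round_trip = ET.fromstring(f"<c>{escaped}</c>").text

print(raw_accepted, escaped, round_trip)
```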

Review Questions (Section 2: Creating XML Documents)

Question 1

Select which of the following is an "element" in the XML document below.

<book>XML Master Basic</book>

  1. <book>
  2. </book>
  3. <book></book>
  4. <book>XML Master Basic</book>

Comments

An element consists of a start tag, content, and an end tag. Answer 1 is a start tag. Answer 2 is an end tag. Answer 3 is an element, but one that is not contained in the example XML document above. Accordingly, Answer 4 is the correct answer.

Question 2

Select which of the following XML documents is a well-formed XML document.

  1. <book1>
         <isbn>1234567890</isbn>
         <name>XML Master Basic >2006&lt;</name>
     </book1>
  2. <book1>
         <Isbn>1234567890</isbn>
         <Name>XML Master Basic &gt;2006&lt;</Name>
     </book1>
  3. <1_book>
         <isbn>1234567890</isbn>
         <name>XML Master Basic>2006<</name>
     </1_book>
  4. <book>
         <isbn>1234567890
         <name>
         </isbn>XML Master Basic>2006&lt;
         </name>
     </book>

Comments

In Answer 2, the uppercase and lowercase characters in the start tag and end tag names of the isbn element do not match. Answer 3 has a number as the first character of the element name, and it also uses the "<" character directly as element content, which is not allowed. As for Answer 4, an element included as the content of another element must be contained in it as a complete unit. Here, the isbn element contains the start tag of the name element but not its end tag, which is an error. Accordingly, Answer 1 is the correct answer.
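The answer to Question 2 can be double-checked by handing each candidate document to a parser. This sketch assumes Python's standard xml.etree.ElementTree parser; the candidates are listed in the same order as in the question, and the variable names are introduced for this example.

```python
import xml.etree.ElementTree as ET

# The four candidate documents from Question 2, in the order listed.
candidates = [
    "<book1>\n<isbn>1234567890</isbn>\n"
    "<name>XML Master Basic >2006&lt;</name>\n</book1>",
    "<book1>\n<Isbn>1234567890</isbn>\n"
    "<Name>XML Master Basic &gt;2006&lt;</Name>\n</book1>",
    "<1_book>\n<isbn>1234567890</isbn>\n"
    "<name>XML Master Basic>2006<</name>\n</1_book>",
    "<book>\n<isbn>1234567890\n<name>\n"
    "</isbn>XML Master Basic>2006&lt;\n</name>\n</book>",
]

well_formed = []
for document in candidates:
    try:
        ET.fromstring(document)
        well_formed.append(True)
    except ET.ParseError:
        well_formed.append(False)

print(well_formed)
```

Only the first candidate is accepted, which agrees with the explanation above.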


Hiromi Morita

Employee Training Department, Hitachi Systems & Services, Ltd. Morita currently serves as a lecturer on Java within Hitachi Systems, in addition to being an Infoteria Certified Trainer and XML lecturer. Mr. Morita looks forward to his daily after-work beer, while very strictly—no, that should be gently—guiding his subordinates in their duties.


The content presented here is an HTML version of an article that originally appeared in the March 2007 issue of DB Magazine published by Shoeisya.
