XML Fundamentals | XML, Web Services, and the Data Revolution



	XML, Web Services, and the Data Revolution By Frank P. Coyle

	Appendix A. XML Language Basics

Although XML includes several language components, most individual XML vocabularies can be read and understood by focusing on three commonly used XML structures, which we will explore in this overview: elements, attributes, and entities. Elements and attributes are used to describe content, while entities are substitutes for special or commonly used character strings.

Elements

Elements are the primary means for describing data in XML. The rules for composing elements are flexible, allowing for different combinations of text content, attributes, and other elements. In practice, elements are used in XML documents in three main ways.

Simple Content

Text or other data appears between start and end tags. The start tag has the same name as the end tag except that the end tag begins with a slash. The following element has a start tag, content, and an end tag.

 <author>Stephen Hawking</author>

Element as Container for Other Elements

An element may contain other elements, providing a hierarchical or tree data structure. The following book element contains the author and title elements.

 <book>
     <author>Stephen Hawking</author>
     <title>A brief history of time</title>
 </book>
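To see the tree structure from a program's point of view, the following is a minimal sketch that parses the book element with Python's standard xml.etree.ElementTree module and lists its children. The choice of Python and the variable names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# The book element from the text, containing two child elements.
BOOK_XML = """
<book>
    <author>Stephen Hawking</author>
    <title>A brief history of time</title>
</book>
"""

book = ET.fromstring(BOOK_XML)

# Iterating over an element yields its child elements in document order.
children = [(child.tag, child.text) for child in book]

for tag, text in children:
    print(f"{tag}: {text}")
```

The parent element acts purely as a container here; all of the data lives in the leaf elements.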

Empty Element as Container for Attributes

When an element has no content but only attributes, there is a shorthand way of writing the element that bypasses the need for both a start and end tag. An element written with a slash following the tag name indicates an empty element, as in

 <book/>

which is shorthand for

 <book></book>

The most common use of empty elements is to hold attribute data.

 <book title="A brief history of time" author="Stephen Hawking"/>
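As a rough check of the shorthand, the sketch below (Python's standard xml.etree.ElementTree module, with variable names chosen only for this example) parses both spellings of the empty element and then reads the attribute data from the last example.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<book/>")
long_form = ET.fromstring("<book></book>")

# Both spellings produce an element with no text and no children.
same_structure = (
    short_form.tag == long_form.tag
    and short_form.text is None and long_form.text is None
    and len(short_form) == 0 and len(long_form) == 0
)

# Attributes of an element are exposed through a dictionary-like interface.
book = ET.fromstring('<book title="A brief history of time" author="Stephen Hawking"/>')
title = book.get("title")
author = book.get("author")

print(same_structure, title, author)
```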

Technically, an element can contain it all (other elements, text, and attributes), but packing both elements and text into a single element is considered poor style because, as the following example shows, it makes it difficult for a reader to understand what the author is trying to express.

 <book isbn="0-102-9393-3">
   A brief history of time
   <author>Stephen Hawking</author>
 </book>
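Mixed content is also awkward for programs, not just for human readers. The sketch below, which assumes Python's xml.etree.ElementTree module and illustrative variable names, shows that the title arrives as loose text attached to the parent element rather than as a named piece of data.

```python
import xml.etree.ElementTree as ET

MIXED_XML = """<book isbn="0-102-9393-3">
   A brief history of time
   <author>Stephen Hawking</author>
</book>"""

book = ET.fromstring(MIXED_XML)

# The title has no element of its own, so it shows up as the text that
# precedes the first child, surrounded by layout whitespace.
loose_text = book.text.strip()

# The author, by contrast, can be looked up by name.
author = book.find("author").text

print(repr(loose_text), repr(author))
```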

Because readability is an important aspect of XML design, many industry groups charged with defining XML vocabularies specify naming conventions to make it easier to read the XML and distinguish data elements from structure elements.

For example, the Mortgage Industry Standards Maintenance Organization uses all capitals to distinguish elements that contain other elements and initial capitals for words in element names that contain only data. Applying this convention to our book example would give us

 <BOOK>
     <Author>Stephen Hawking</Author>
     <Title>A brief history of time</Title>
 </BOOK>

Element Naming Rules

While the decision to capitalize or not is up to the XML language designer, there are official naming rules for XML that must be followed.

Names can contain letters, numbers, and other characters.
Names must not begin with a number or a punctuation character (an underscore is allowed).
Names must not start with the string "xml" in any upper- or lowercase form.
Names must not contain spaces.

Also, when designing element names, one should not use the colon, since it is reserved for use with XML namespaces.
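The naming rules above can be approximated in a few lines of code. The following Python sketch is a deliberate simplification: it accepts only ASCII letters, digits, hyphens, underscores, and periods, whereas the real XML grammar allows many more characters. The function name is_acceptable_name is hypothetical.

```python
import re

# Simplified, ASCII-only pattern: a letter or underscore first, then letters,
# digits, periods, underscores, or hyphens. The colon is left out on purpose
# because it is reserved for namespaces.
_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_acceptable_name(name: str) -> bool:
    """Return True if name follows the simplified rules listed above."""
    if name.lower().startswith("xml"):
        return False  # names starting with "xml" in any case are reserved
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["author", "2author", "XmlBook", "book title", "ns:book"]:
    print(candidate, is_acceptable_name(candidate))
```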

Attributes

Attributes provide additional information about elements. In HTML, for example, attributes are used to specify the name of an image file when loading an HTML document.

 <img src="computer.gif">

Attributes are often used to indicate information that is not part of the data described within an element. Often an attribute is used to describe something about the data itself. For example, in the following XML the attribute use might tell a program handling the data that the file is not required.

 <file use="optional">computer.gif</file>

In XML, attribute values must always be enclosed in quotation marks, either single or double.
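The quoting rule is easy to confirm with a parser. In this minimal sketch, which assumes Python's xml.etree.ElementTree module and a hypothetical helper named parses, both quoting styles are accepted while the unquoted HTML-style attribute is rejected.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


double_quoted = parses('<file use="optional">computer.gif</file>')
single_quoted = parses("<file use='optional'>computer.gif</file>")
unquoted = parses("<file use=optional>computer.gif</file>")

print(double_quoted, single_quoted, unquoted)
```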

Elements versus Attributes

The question whether to use elements or attributes to represent data has been widely debated. XML allows data to be stored either as the content of an element or as an attribute. The following two examples show the same data represented in different ways.

 <book isbn="092373637">
    <title>Anna Karenina</title>
    <author>Tolstoy</author>
 </book>

 <book>
    <isbn>092373637</isbn>
    <title>Anna Karenina</title>
    <author>Tolstoy</author>
 </book>

In the first example isbn is an attribute; in the second, isbn is a child element of book. There are no official rules about when to use attributes. The general consensus is to use elements if the information seems like data and to use attributes when describing something about the data. Reasons for not using attributes to store data include the following:

Attributes cannot contain multiple values, while elements can have multiple subelements.
Attributes are not easily expandable to account for future changes.
Attributes are more difficult than elements to manipulate with programs.
Attribute values are not easy to check against a document type definition (DTD).
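The first point in the list can be illustrated with a short sketch. It assumes Python's xml.etree.ElementTree module, and the second author name is made up for the example: an element may be repeated to hold several values, while repeating an attribute on one element is an error.

```python
import xml.etree.ElementTree as ET

# Hypothetical record with two author elements; "A. Translator" is invented
# for this example.
BOOK_XML = """
<book isbn="092373637">
    <title>Anna Karenina</title>
    <author>Tolstoy</author>
    <author>A. Translator</author>
</book>
"""

book = ET.fromstring(BOOK_XML)
authors = [a.text for a in book.findall("author")]

# An attribute name may appear only once per element, so this document
# is rejected by the parser.
try:
    ET.fromstring('<book author="Tolstoy" author="A. Translator"/>')
    duplicate_attribute_ok = True
except ET.ParseError:
    duplicate_attribute_ok = False

print(authors, duplicate_attribute_ok)
```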

Entities

Entities are used to substitute one string for another in an XML document. For example, if a phrase such as "XML and the Data Revolution" is repeated frequently in a document, one can define a shortcut entity declaration in the DTD.

 <!ENTITY xdr "XML and the Data Revolution">

Then, when you want to use the full phrase, you use &xdr; and it will be substituted in the XML document. Using entities can help avoid misspellings and the tediousness of typing the same thing over and over.
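The substitution can be observed by parsing a small document. The sketch below assumes Python's xml.etree.ElementTree module and places the entity declaration in an internal DTD subset; the note element and its sentence are invented for the example.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the document's internal DTD subset and then
# referenced inside the element content.
DOC = """<!DOCTYPE note [
  <!ENTITY xdr "XML and the Data Revolution">
]>
<note>I am reading &xdr; this week.</note>"""

note = ET.fromstring(DOC)

# By the time the program sees the text, the reference has been replaced.
print(note.text)
```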

Predefined Entities

XML has adopted five predefined entities from the HTML world. The ampersand ( & ), less-than ( < ), greater-than ( > ), double-quote ( " ), and apostrophe ( ' ) characters are represented within XML documents as "&amp;", "&lt;", "&gt;", "&quot;", and "&apos;", respectively.
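A brief sketch of the predefined entities in both directions, assuming Python's standard xml.etree.ElementTree module and the escape function from xml.sax.saxutils. The sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Reading: the five predefined references turn back into plain characters.
element = ET.fromstring("<t>&amp; &lt; &gt; &quot; &apos;</t>")
decoded = element.text

# Writing: escape() replaces &, <, and > by default; additional characters
# can be supplied through its second argument.
encoded = escape("if a < b & c > d", {'"': "&quot;", "'": "&apos;"})

print(decoded)
print(encoded)
```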

If the entities are long, it's possible to store the information separately in another file. This can be accomplished through an external entity reference, which uses the XML keyword SYSTEM between the entity name and URL of the file.

 <!ENTITY text SYSTEM "http://my.url.here">

Parameter Entities

While entities are useful for creating substitution strings within XML documents, it's often useful to define shortcuts in a DTD to make writing a DTD easier. This is where parameter entities come in. A parameter entity is defined by inserting a percent sign prior to the entity name. Once defined, a parameter entity can be substituted by surrounding the parameter name with a percent sign and semicolon.
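To make the syntax concrete, the following sketch shows a hypothetical DTD fragment, of the kind that would live in an external DTD file, in which a parameter entity named common is declared once and referenced twice. The small Python function imitates the substitution textually; it is only a toy illustration of what an XML processor does internally, and every name in it is an assumption.

```python
import re

# Hypothetical DTD fragment: "common" is declared with a percent sign before
# the name and referenced as %common; inside two element declarations.
DTD = """<!ENTITY % common "title, author">
<!ELEMENT book (%common;, isbn)>
<!ELEMENT article (%common;, journal)>"""


def expand_parameter_entities(dtd_text: str) -> str:
    """Toy textual expansion of %name; references in a DTD string."""
    declarations = dict(
        re.findall(r'<!ENTITY\s+%\s+(\w+)\s+"([^"]*)">', dtd_text)
    )
    # Replace each %name; reference with its declared replacement text.
    return re.sub(r"%(\w+);", lambda m: declarations[m.group(1)], dtd_text)


expanded = expand_parameter_entities(DTD)
print(expanded)
```

Declaring the shared content model once means a later change to it has to be made in only one place.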

CDATA

The XML CDATA section is used to prevent the processing of a portion of data. When an XML document is parsed, all the XML is processed except the data inside a CDATA section. This allows the inclusion of content that may confuse an XML processor.

For example, if an XML document contains less-than or ampersand characters, as many programs or scripts do, one can define a CDATA section to contain this data. A CDATA section starts with " <![CDATA[ " and ends with " ]]> ". The following example shows how a CDATA section may be used to include script code within an XML element named script.

 <script>
 <![CDATA[
   function compare(a, b) {
     if (a < b) {
       return 1
     }
     else {
       return 0
     }
   }
 ]]>
 </script>
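The effect of a CDATA section can be checked with a parser. This sketch assumes Python's xml.etree.ElementTree module and a shortened, invented script body: the protected text comes through with its special characters intact, while the same kind of text without CDATA is rejected.

```python
import xml.etree.ElementTree as ET

# Inside CDATA, the < and & characters are treated as ordinary text.
DOC = """<script><![CDATA[
  if (a < b && b > 0) { return 1 }
]]></script>"""

script = ET.fromstring(DOC)
body = script.text.strip()

# Without CDATA (or entity references), a bare < in content is an error.
try:
    ET.fromstring("<script>if (a < b) { return 1 }</script>")
    unprotected_ok = True
except ET.ParseError:
    unprotected_ok = False

print(body, unprotected_ok)
```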

Processing Instructions

XML allows the use of special instructions in order to pass information to programs that may read the document. A processing instruction begins with " <? " and ends with " ?> ". Immediately after the " <? " is a target name that is used to let a program know who the content of the processing instruction is intended for. For example, the following is a processing instruction intended for a program that is looking for the name agent.

 <?agent process="yes" priority="high"?>
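A program looking for its own instructions might pick them out as in the sketch below. It assumes Python's standard xml.dom.minidom module, and the job root element is hypothetical. Note that the parser hands over everything after the target name as a single uninterpreted string.

```python
from xml.dom import minidom

# A processing instruction for the target "agent", followed by a
# hypothetical root element.
DOC = '<?agent process="yes" priority="high"?><job/>'
document = minidom.parseString(DOC)

# Collect (target, data) for every top-level processing instruction.
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
```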

XML Declaration

Most XML documents begin with an XML declaration of the form:

 <?xml version="1.0"?>

Although the XML declaration looks like a processing instruction, it is technically not. If present, it must be the first thing in a document.
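The placement rule can be confirmed with a parser. The sketch below assumes Python's xml.etree.ElementTree module and a hypothetical helper named parses: a declaration at the very start is accepted, a declaration preceded by even one space is rejected, and a document with no declaration at all is still accepted.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


at_start = parses('<?xml version="1.0"?><book/>')
after_space = parses(' <?xml version="1.0"?><book/>')
absent = parses("<book/>")

print(at_start, after_space, absent)
```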

