XML Basics | Extensible Markup Language (XML)

XML permits document authors to create markup (i.e., a text-based notation for describing data) for virtually any type of information. This enables document authors to create entirely new markup languages for describing any type of data, such as mathematical formulas, software-configuration instructions, chemical molecular structures, music, news, recipes and financial reports. XML describes data in a way that both human beings and computers can understand.

Figure 19.1 is a simple XML document that describes information for a baseball player. We focus on lines 5-11 to introduce basic XML syntax. You will learn about the other elements of this document in Section 19.3.

Figure 19.1. XML that describes a baseball player's information.

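 1  <?xml version = "1.0"?>
 2
 3
 4
 5  <player>
 6     <firstName>John</firstName>
 7
 8     <lastName>Doe</lastName>
 9
10     <battingAverage>0.375</battingAverage>
11  </player>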

XML documents contain text that represents content (i.e., data), such as John (line 6 of Fig. 19.1), and elements that specify the document's structure, such as firstName (line 6 of Fig. 19.1). XML documents delimit elements with start tags and end tags. A start tag consists of the element name in angle brackets (e.g., <player> and <firstName> in lines 5 and 6, respectively). An end tag consists of the element name preceded by a forward slash (/) in angle brackets (e.g., </firstName> and </player> in lines 6 and 11, respectively). An element's start and end tags enclose text that represents a piece of data (e.g., the player's firstName, John, in line 6, which is enclosed by the <firstName> start tag and </firstName> end tag). Every XML document must have exactly one root element that contains all the other elements. In Fig. 19.1, player (lines 5-11) is the root element.
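As a minimal sketch of how a program might read the elements described above, the following Python snippet parses the player markup with the standard-library xml.etree.ElementTree module. Python and the variable names are assumptions made for illustration; this chapter itself works with MSXML.

```python
import xml.etree.ElementTree as ET

# The player markup from Fig. 19.1, embedded as a string for illustration.
PLAYER_XML = """<?xml version = "1.0"?>
<player>
   <firstName>John</firstName>
   <lastName>Doe</lastName>
   <battingAverage>0.375</battingAverage>
</player>"""

# fromstring returns the root element, here the single <player> element.
root = ET.fromstring(PLAYER_XML)

# Map each child element's name to the text its tags enclose.
player = {child.tag: child.text for child in root}

print(root.tag)
for name, value in player.items():
    print(f"{name}: {value}")
```

Iterating over the root element visits its child elements in document order, so the element names and their enclosed text come out in the same order as in the figure.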

Some XML-based markup languages include XHTML (Extensible HyperText Markup Language, HTML's replacement for marking up Web content), MathML (for mathematics), VoiceXML™ (for speech), CML (Chemical Markup Language, for chemistry) and XBRL (Extensible Business Reporting Language, for financial data exchange). These markup languages are called XML vocabularies and provide a means for describing particular types of data in standardized, structured ways.

Massive amounts of data are currently stored on the Internet in a variety of formats (e.g., databases, Web pages, text files). Based on current trends, it is likely that much of this data, especially that which is passed between systems, will soon take the form of XML. Organizations see XML as the future of data encoding. Information technology groups are planning ways to integrate XML into their systems. Industry groups are developing custom XML vocabularies for most major industries that will allow computer-based business applications to communicate in common languages. For example, Web services, which we discuss in Chapter 22, allow Web-based applications to exchange data seamlessly through standard protocols based on XML.

The next generation of the Internet and World Wide Web will almost certainly be built on a foundation of XML, which will permit the development of more sophisticated Web-based applications. As is discussed in this chapter, XML allows you to assign meaning to what would otherwise be random pieces of data. As a result, programs can "understand" the data they manipulate. For example, a Web browser might view a street address listed on a simple HTML Web page as a string of characters without any real meaning. In an XML document, however, this data can be clearly identified (i.e., marked up) as an address. A program that uses the document can recognize this data as an address and provide links to a map of that location, driving directions from that location or other location-specific information. Likewise, an application can recognize names of people, dates, ISBN numbers and any other type of XML-encoded data. Based on this data, the application can present users with other related information, providing a richer, more meaningful user experience.
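The address example can be sketched in a few lines of Python. The element names (contact, address, street and so on), the sample data and the find_addresses helper are all hypothetical, chosen only to show how marked-up data lets a program pick out an address by name rather than guess at it from a string of characters.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these element names are illustrative assumptions,
# not part of any standard XML vocabulary.
CONTACT_XML = """<contact>
   <name>John Doe</name>
   <address>
      <street>123 Main St.</street>
      <city>Anytown</city>
      <state>MA</state>
   </address>
</contact>"""

def find_addresses(xml_text):
    """Return each <address> element in the document as a one-line string."""
    root = ET.fromstring(xml_text)
    addresses = []
    for address in root.iter("address"):
        # Join the text of the address's child elements in document order.
        parts = [child.text.strip() for child in address if child.text]
        addresses.append(", ".join(parts))
    return addresses

print(find_addresses(CONTACT_XML))
```

A program could pass each returned string to a mapping or directions service, which is the kind of location-specific follow-up the paragraph above describes.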

Viewing and Modifying XML Documents

XML documents are highly portable. Viewing or modifying an XML document, which is a text file that ends with the .xml filename extension, does not require special software, although many software tools exist, and new ones are frequently released, that make it more convenient to develop XML-based applications. Any text editor that supports ASCII/Unicode characters can open XML documents for viewing and editing. Also, most Web browsers can display XML documents in a formatted manner that makes it easier to see the XML's structure. We demonstrate this using Internet Explorer in Section 19.3. One important characteristic of XML is that it is both human readable and machine readable.

Processing XML Documents

Processing an XML document requires software called an XML parser (or XML processor). A parser makes the document's data available to applications. While reading the contents of an XML document, a parser checks that the document follows the syntax rules specified by the W3C's XML Recommendation (www.w3.org/XML). XML syntax requires a single root element, a start tag and end tag for each element, and properly nested tags (i.e., the end tag for a nested element must appear before the end tag of the enclosing element). Furthermore, XML is case sensitive, so the proper capitalization must be used in elements. A document that conforms to this syntax is a well-formed XML document, and is syntactically correct. We present fundamental XML syntax in Section 19.3. If an XML parser can process an XML document successfully, that XML document is well formed. Parsers can provide access to XML-encoded data in well-formed documents only.
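The syntax rules above can be checked with any parser. As a rough sketch, the following Python snippet uses the standard-library xml.etree.ElementTree module, which is built on the Expat parser, and treats a document as well formed if parsing succeeds. The is_well_formed helper and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# One well-formed sample and one sample per syntax rule being broken.
samples = {
    "well formed": "<player><firstName>John</firstName></player>",
    "two root elements": "<player></player><player></player>",
    "missing end tag": "<player><firstName>John</player>",
    "improper nesting": "<player><firstName>John</player></firstName>",
    "case mismatch": "<player><firstName>John</firstname></player>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first sample parses successfully. The parser rejects each of the others, so an application never receives data from them.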

Often, XML parsers are built into software such as Visual Studio or available for download over the Internet. Popular parsers include Microsoft XML Core Services (MSXML), the Apache Software Foundation's Xerces (xml.apache.org) and the open-source Expat XML Parser (expat.sourceforge.net). In this chapter, we use MSXML.

Validating XML Documents

An XML document can optionally reference a Document Type Definition (DTD) or a schema that defines the proper structure of the XML document. When an XML document references a DTD or a schema, some parsers (called validating parsers) can read the DTD/schema and check that the XML document follows the structure defined by the DTD/schema. If the XML document conforms to the DTD/schema (i.e., the document has the appropriate structure), the XML document is valid. For example, if in Fig. 19.1 we were referencing a DTD that specifies that a player element must have firstName, lastName and battingAverage elements, then omitting the lastName element (line 8 in Fig. 19.1) would cause the XML document player.xml to be invalid. However, the XML document would still be well formed, because it follows proper XML syntax (i.e., it has one root element, and each element has a start tag and an end tag). By definition, a valid XML document is well formed. Parsers that cannot check for document conformity against DTDs/schemas are nonvalidating parsers; they determine only whether an XML document is well formed, not whether it is valid.
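The difference between well formed and valid can be illustrated with a small sketch. Python's xml.etree.ElementTree is a nonvalidating parser, so the code below does not read a DTD or schema. Instead, a hand-written rule (REQUIRED_CHILDREN, an assumption that also fixes the order of the child elements) stands in for the structural definition a validating parser would obtain from the DTD/schema.

```python
import xml.etree.ElementTree as ET

# Assumed structural rule standing in for a DTD or schema: a <player>
# must contain exactly these child elements, in this order.
REQUIRED_CHILDREN = ["firstName", "lastName", "battingAverage"]

def check(xml_text):
    """Classify xml_text as 'not well formed', 'well formed but invalid' or 'valid'."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "not well formed"
    children = [child.tag for child in root]
    if root.tag == "player" and children == REQUIRED_CHILDREN:
        return "valid"
    return "well formed but invalid"

complete = ("<player><firstName>John</firstName><lastName>Doe</lastName>"
            "<battingAverage>0.375</battingAverage></player>")
no_last_name = ("<player><firstName>John</firstName>"
                "<battingAverage>0.375</battingAverage></player>")

print(check(complete))       # valid
print(check(no_last_name))   # well formed but invalid
print(check("<player>"))     # not well formed
```

The document without lastName still parses, which is why a nonvalidating parser alone cannot detect the omission.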

We discuss validation, DTDs and schemas, as well as the key differences between these two types of structural specifications, in Sections 19.5 and 19.6. For now, note that schemas are XML documents themselves, whereas DTDs are not. As you will learn in Section 19.6, this difference presents several advantages in using schemas over DTDs.

Software Engineering Observation 19.1

DTDs and schemas are essential for business-to-business (B2B) transactions and mission-critical systems. Validating XML documents ensures that disparate systems can manipulate data structured in standardized ways and prevents errors caused by missing or malformed data.

Formatting and Manipulating XML Documents

XML documents contain only data, not formatting instructions, so applications that process XML documents must decide how to manipulate or display each document's data. For example, a PDA (personal digital assistant) may render an XML document differently than a wireless phone or a desktop computer. You can use Extensible Stylesheet Language (XSL) to specify rendering instructions for different platforms. We discuss XSL in Section 19.7.

XML-processing programs can also search, sort and manipulate XML data using technologies such as XSL. Some other XML-related technologies are XPath (XML Path Language, a language for accessing parts of an XML document), XSL-FO (XSL Formatting Objects, an XML vocabulary used to describe document formatting) and XSLT (XSL Transformations, a language for transforming XML documents into other documents). We present XSLT in Section 19.7. We also introduce XPath in Section 19.7, then discuss it in greater detail in Section 19.8.
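To give a feel for searching and sorting XML data, here is a minimal Python sketch. It relies on the limited XPath subset supported by the standard-library xml.etree.ElementTree module, which is not a full XPath implementation and does not perform XSLT. The team wrapper element and the two additional players are hypothetical data invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the <team> element and the extra players
# are assumptions made for illustration.
TEAM_XML = """<team>
   <player><firstName>John</firstName><lastName>Doe</lastName>
      <battingAverage>0.375</battingAverage></player>
   <player><firstName>Jane</firstName><lastName>Roe</lastName>
      <battingAverage>0.410</battingAverage></player>
   <player><firstName>Sam</firstName><lastName>Poe</lastName>
      <battingAverage>0.298</battingAverage></player>
</team>"""

root = ET.fromstring(TEAM_XML)

# Search: a path expression selecting every lastName under a player.
last_names = [e.text for e in root.findall("./player/lastName")]

# Search with a predicate: the player whose lastName is "Doe".
doe = root.find("./player[lastName='Doe']")

# Sort: order the players by batting average, highest first.
ranked = sorted(root.findall("player"),
                key=lambda p: float(p.findtext("battingAverage")),
                reverse=True)
ranking = [p.findtext("lastName") for p in ranked]

print(last_names)                 # ['Doe', 'Roe', 'Poe']
print(doe.findtext("firstName"))  # John
print(ranking)                    # ['Roe', 'Doe', 'Poe']
```

Here the sorting is done in Python rather than in a stylesheet; the path expressions only select the elements to work on.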
