This section gives you a broad overview of the Extensible Markup Language (XML). It discusses XML's history, need, and rationale, together with a quick look at some basic XML constructs and applications.

XML is a subset of the Standard Generalized Markup Language (SGML). SGML is an internationally accepted standard for describing just about any type of information; however, it's far too complex for the relatively simple world of the web. And so, the World Wide Web Consortium (W3C) created a simplified version of SGML specifically for the web, named it XML, and released it on an unsuspecting public in February 1998.

Over the past few years, XML has received more than its allotted fifteen minutes of fame, with technology pundits and business leaders alike singing its praises. It's been crowned the Next Big Thing, both on account of its ease of use and its potential for revolutionizing the way information is exchanged and used. Some of this is hype and some of it isn't; either way, it's quite clear that XML is going to be around for a while, and that, wisely used, it can indeed be a powerful tool for the management and effective exploitation of information.

XML works by "marking up" data with descriptive tags, in much the same way HTML does. The difference is that HTML was designed specifically to format data for web browsers and, as such, is limited to a predefined set of tags and functions. XML, by contrast, was designed as a web-friendly metalanguage and, therefore, merely lays down the rules for document markup (leaving it to the document author to define his or her own tags). As an example, consider the following block of text:
Sure, you can tell that it's a newspaper report, primarily because you read newspapers, can see the similarities, and can draw a conclusion based on those similarities. You can even break it up conceptually into the headline, the byline, and the body of the article. But a computer can't do those things; to a computer, the block of text above is simply a bunch of alphanumeric characters, with very little to distinguish the headline from the body.

That's where XML comes in. It can be used to transform the anonymous block of text above into something that even a computer can make sense of (see Listing 1.1).

Listing 1.1 A Simple XML Document

<?xml version="1.0"?>
<story id="34" category="weird">
  <slug>A Man And His Mouse</slug>
  <author>J. Gilbert Gumpfinch III</author>
  <date>11-23-2001</date>
  <body>
    <para>In a development many consider to be the first of its kind, the Hungarian scientist <person type="scientist">Professor Haarbert Floopshot</person> today announced that he had succeeded in inventing "a better mouse." The new mouse, created using advanced genetic splicing techniques and "some good old-fashioned SuperGlue," can emit ultrasonic squeals to frighten off predators twice its size, leap tall mousetraps in a single bound, and comes equipped with a built-in CatDetector to detect approaching felines.</para>
  </body>
</story>

By marking up the data with descriptive tags, XML makes it easy to distinguish between different types of information, even for a computer. In today's wired world, this capability is more valuable than you might surmise; with many of today's business decisions handled by computer, XML can significantly improve the accuracy of information processing, thereby increasing overall business efficiency, streamlining business processes, and (ultimately) fattening the bottom line, which probably also explains why the industry's so enthusiastic about it.

Features

XML was designed to incorporate the following features:
Basic Concepts

XML documents come in two flavors: well-formed and valid. A well-formed document is one that adheres to the basic rules laid down in the XML specification; for example, all elements must be properly nested, attribute values must be enclosed within quotation marks, and the document must contain exactly one root element. A valid document is one that, in addition to being well-formed, meets the requirements and constraints laid down in a Document Type Definition (DTD) or XML schema. This DTD or schema is an additional ruleset an author can use to specify the element names and data types that are allowed in the document. This helps to reduce the risk of corrupted or invalid data.

Listing 1.2 shows a well-formed XML document, which describes an invoice for materials purchased. It's a slightly contrived example, but it will serve to illustrate XML's most commonly used constructs.

Listing 1.2 An XML Document Demonstrating Most of the Language's Basic Constructs

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE invoice [
  <!ENTITY terms "Payment on delivery. Cash only, no checks or money orders accepted. Returns within 15 days.">
  <!ENTITY logo SYSTEM "images/logo.gif" NDATA gif>
  <!NOTATION gif SYSTEM "gifviewer.exe">
]>
<invoice id="AS2354R">
  <logo src="logo" />
  <date>03-12-2000</date>
  <!-- consignor address -->
  <vendor>
    <name>Bobby's Biotech Bazaar</name>
    <address>
      <street>25, Main Street</street>
      <city>Purple River Town</city>
      <zip>457929</zip>
      <country>South America</country>
    </address>
  </vendor>
  <!-- consignee address -->
  <customer>
    <name>Floopshot, Haarbert.</name>
    <address>
      <street>15, Booger Avenue</street>
      <city>Nowheresville</city>
      <zip>64732</zip>
      <state>TY</state>
      <country>US</country>
    </address>
  </customer>
  <!-- consignment details -->
  <material>
    <item id="23">
      <name>Gray mouse</name>
      <quantity>12</quantity>
      <price currency="USD">11.99</price>
    </item>
    <item id="23">
      <name>Vampire bat (DNA sequence)</name>
      <quantity>1</quantity>
      <price currency="USD">34.99</price>
      <dna_sequence>
        <![CDATA[ 1627 3494 #190 2230 8549 0732 #232 8238 2833 01&6 9929 ]]>
      </dna_sequence>
    </item>
    <item id="856">
      <name>Australian kangaroo (DNA sequence)</name>
      <quantity>1</quantity>
      <price currency="USD">34.99</price>
      <dna_sequence>
        <![CDATA[ 0101 9394 %42$ 9393 0209 2020 8348 #577 7493 2543 &646 ]]>
      </dna_sequence>
    </item>
  </material>
  <terms>&terms;</terms>
  <note>Include current product catalog with purchase</note>
  <?sales_agent include_promotional_message="1"?>
</invoice>

As you can see, an XML document is merely plain text, broken into separate sections by markup. This markup has several components, each with its own distinct role: the document prolog, elements, attributes, entities, notations, CDATA blocks, processing instructions, and comments.
These components are explained in the sections that follow.

Document Prolog

An XML document begins with a special section called the document prolog, which typically opens with an XML declaration such as

<?xml version="1.0" encoding="UTF-8"?>

This prolog appears at the top of an XML document and specifies things like the XML version and type of encoding used, the location of any DTD that may be used to validate the document, and one or more entity definitions. In Listing 1.2, the prolog contained two entity definitions and one notation:

<!DOCTYPE invoice [
  <!ENTITY terms "Payment on delivery. Cash only, no checks or money orders accepted. Returns within 15 days.">
  <!ENTITY logo SYSTEM "images/logo.gif" NDATA gif>
  <!NOTATION gif SYSTEM "gifviewer.exe">
]>

Elements

The document prolog is followed by the document's elements. Elements are the most basic units of XML data; they consist of attributes and content (or character data) surrounded by descriptive tags (or markup). Here are three examples:

<street>15, Booger Avenue</street>
<rate>9.99</rate>
<name>Vampire bat (DNA sequence)</name>

To be well-formed, an XML document must contain exactly one outermost element. This element, referred to as the root element, serves as the container for the remainder of the document. Elements can be empty, contain only character data, contain other elements nested within them, or enclose a combination of both character data and elements.

Attributes

Elements can be enhanced further by the addition of attributes, which are name-value pairs that can be used to attach any type of additional descriptive information to an element. Here are two examples:

<invoice id="AS2354R">
<price currency="USD">9.99</price>

In order to be well-formed, attribute values must be enclosed within quotation marks, and attribute names cannot be repeated within the same element.

Entities

Entities serve as placeholders for frequently used pieces of text within an XML document.
They provide a convenient shortcut for document authors to store and easily update commonly used text snippets. Entities consist of two components:
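When an XML parser processes an XML document, entity references are automatically replaced with their actual values. XML comes with five predefined entities. You might already be familiar with them if you have worked with HTML (see Table 1.1).

Table 1.1. XML's Five Predefined Entities

Entity reference    Character
&lt;                <  (less-than)
&gt;                >  (greater-than)
&amp;               &  (ampersand)
&apos;              '  (apostrophe)
&quot;              "  (double quotation mark)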
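To make the replacement of entity references concrete, here is a minimal sketch in Python, using the standard library's xml.etree.ElementTree parser. The sample document, the DOCUMENT constant, and the expanded_text helper are all assumptions invented for this illustration; the terms entity is a shortened variant of the one in Listing 1.2.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not from the book): one author-defined
# entity, "terms", declared in the internal DTD subset, plus three of the
# predefined entities used inside <note>.
DOCUMENT = """<!DOCTYPE invoice [
  <!ENTITY terms "Payment on delivery. Returns within 15 days.">
]>
<invoice>
  <terms>&terms;</terms>
  <note>Bats &amp; mice: quantity &lt; 20, price &gt; 10</note>
</invoice>"""


def expanded_text(xml_source, tag):
    """Parse xml_source and return the text of the first child element
    of the root whose name is tag. By the time the text is returned,
    the parser has already replaced every entity reference."""
    root = ET.fromstring(xml_source)
    return root.findtext(tag)


if __name__ == "__main__":
    print(expanded_text(DOCUMENT, "terms"))
    print(expanded_text(DOCUMENT, "note"))
```

Run as a script, this should print the text of the two elements with the references already expanded: the full payment terms for the first, and a plain ampersand, less-than sign, and greater-than sign in the second. The application code never sees &terms; or &amp; at all, which is the point of the mechanism.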
A variant of the regular entity just described is the unparsed entity, typically used to reference data that should not be processed by the XML parser. This is usually binary data: images, audio files, video streams, and the like. Listing 1.2 includes one such unparsed entity, which holds the path to the company logo:

<!ENTITY logo SYSTEM "images/logo.gif" NDATA gif>

Note the NDATA keyword, which tells the parser that it should look up the appropriate notation to find out how to handle this data (notations are discussed next).

In order for a document to be well-formed, entities cannot contain references to themselves (think infinite loop and you'll understand why).

Notations

A notation is an XML construct designed to help the parser identify non-XML data (for example, images or sound files) and typically goes hand in hand with unparsed entities. A notation is always enclosed within a notation declaration, which appears either within a DTD or the document prolog, and looks like this:

<!NOTATION notation-name notation-identifier>

The notation name is a unique identifier used within unparsed entities, while the notation identifier is a string that tells the XML processor how to handle that particular entity. This string could be anything from a URL that identifies the data type to the location of a program that can decode the data. Here's an example:

<!NOTATION gif SYSTEM "gifviewer.exe">

CDATA Blocks

CDATA blocks are "boxes" within an XML document, identified by special opening and closing delimiters. The text within these boxes is treated by the parser as character data, not markup, and can therefore contain special characters which would normally cause the parser to generate an error. CDATA blocks begin with the special sequence <![CDATA[ and end with the sequence ]]>.
For example,

<dna_sequence>
  <![CDATA[ 0101 9394 %42$ 9393 0209 2020 8348 #577 7493 2543 &646 ]]>
</dna_sequence>

The alternative to CDATA blocks is, of course, using the predefined entities discussed earlier to represent special characters like the less-than (<), greater-than (>), and ampersand (&) symbols. Because entities allow for reusability within the document, using entities is sometimes preferable to using CDATA blocks.

Processing Instructions

Processing instructions (PIs) are special instructions embedded within an XML document. These PIs are not usually intended for human readers; rather, they provide special information or commands to the XML application responsible for processing the document. Applications that do not recognize these instructions will simply ignore them. PIs are enclosed within <? ... ?> delimiters, as demonstrated here:

<?sales_agent include_promotional_message="1"?>

Notice that the very first line in an XML document, the XML declaration, uses the same syntax to indicate the version number and encoding to the parser:

<?xml version="1.0" encoding="UTF-8"?>

The parser can use this information to make decisions on how to process the XML; for example, it can reject the document if the XML version is unsupported, or switch its internal character handling routines to use the encoding specified in the prolog.

Comments

Finally, comments provide a simple and convenient way for document authors to include human-readable notes within their XML markup. Comments must be placed within <!-- ... --> markers, and they are usually ignored by the parser.

Ancillary Technologies

As XML's popularity has grown, so has an understanding of its capabilities and potential; and this, in turn, has spawned a new generation of related technologies. Together, they are an oft-confusing morass of acronyms and buzzwords; individually, they each make an important contribution to the overall picture. Here's a brief list of the better-known XML development efforts underway at the W3C:
Applications

As the preceding list demonstrates, XML has the potential to change the way we deal with web-based content. Here are four of XML's most important applications:
Of course, this is just the tip of the iceberg. XML and its related technologies are still coming to full fruition, and new applications for this family of powerful technologies appear all the time. If you'd like to learn more about XML, there are a number of very good books available to get you up to speed. The book's companion web site (http://www.xmlphp.com or http://www.newriders.com) has details, together with a list of useful web sites and mailing lists.