This powerful new edition provides developers with a comprehensive guide to the rapidly evolving XML space. Serious users of XML will find topics on just about everything they need, from fundamental syntax rules, to details of DTD and XML Schema creation, to XSLT transformations, to APIs used for processing XML documents. Simply put, this is the only reference of its kind among XML books.
What This Book Covers
What's New in the Second Edition
Organization of the Book
Conventions Used in This Book
Request for Comments
Part I: XML Concepts
Chapter 1. Introducing XML
1.1 The Benefits of XML
1.2 Portable Data
1.3 How XML Works
1.4 The Evolution of XML
Chapter 2. XML Fundamentals
2.1 XML Documents and XML Files
2.2 Elements, Tags, and Character Data
2.4 XML Names
2.5 Entity References
2.6 CDATA Sections
2.8 Processing Instructions
2.9 The XML Declaration
2.10 Checking Documents for Well-Formedness
Chapter 3. Document Type Definitions (DTDs)
3.2 Element Declarations
3.3 Attribute Declarations
3.4 General Entity Declarations
3.5 External Parsed General Entities
3.6 External Unparsed Entities and Notations
3.7 Parameter Entities
3.8 Conditional Inclusion
3.9 Two DTD Examples
3.10 Locating Standard DTDs
Chapter 4. Namespaces
4.1 The Need for Namespaces
4.2 Namespace Syntax
4.3 How Parsers Handle Namespaces
4.4 Namespaces and DTDs
Chapter 5. Internationalization
5.1 Character-Set Metadata
5.2 The Encoding Declaration
5.3 Text Declarations
5.4 XML-Defined Character Sets
5.6 ISO Character Sets
5.7 Platform-Dependent Character Sets
5.8 Converting Between Character Sets
5.9 The Default Character Set for XML Documents
5.10 Character References
Part II: Narrative-Centric Documents
Chapter 6. XML as a Document Format
6.1 SGML's Legacy
6.2 Narrative Document Structures
6.5 Document Permanence
6.6 Transformation and Presentation
Chapter 7. XML on the Web
7.2 Direct Display of XML in Browsers
7.3 Authoring Compound Documents with Modular XHTML
7.4 Prospects for Improved Web-Search Methods
Chapter 8. XSL Transformations (XSLT)
8.1 An Example Input Document
8.2 xsl:stylesheet and xsl:transform
8.3 Stylesheet Processors
8.4 Templates and Template Rules
8.5 Calculating the Value of an Element with xsl:value-of
8.6 Applying Templates with xsl:apply-templates
8.7 The Built-in Template Rules
8.9 Attribute Value Templates
8.10 XSLT and Namespaces
8.11 Other XSLT Elements
Chapter 9. XPath
9.1 The Tree Structure of an XML Document
9.2 Location Paths
9.3 Compound Location Paths
9.5 Unabbreviated Location Paths
9.6 General XPath Expressions
9.7 XPath Functions
Chapter 10. XLinks
10.1 Simple Links
10.2 Link Behavior
10.3 Link Semantics
10.4 Extended Links
10.6 DTDs for XLinks
Chapter 11. XPointers
11.1 XPointers on URLs
11.2 XPointers in Links
11.3 Bare Names
11.4 Child Sequences
Chapter 12. Cascading Style Sheets (CSS)
12.1 The Three Levels of CSS
12.2 CSS Syntax
12.3 Associating Stylesheets with XML Documents
12.5 The Display Property
12.6 Pixels, Points, Picas, and Other Units of Length
12.7 Font Properties
12.8 Text Properties
Chapter 13. XSL Formatting Objects (XSL-FO)
13.1 XSL Formatting Objects
13.2 The Structure of an XSL-FO Document
13.3 Laying Out the Master Pages
13.4 XSL-FO Properties
13.5 Choosing Between CSS and XSL-FO
Chapter 14. Resource Directory Description Language (RDDL)
14.1 What's at the End of a Namespace URL?
14.2 RDDL Syntax
Part III: Data-Centric XML
Chapter 15. XML as a Data Format
15.1 Why Use XML for Data?
15.2 Developing Data-Oriented XML Formats
15.3 Sharing Your XML Format
Chapter 16. XML Schemas
16.2 Schema Basics
16.3 Working with Namespaces
16.4 Complex Types
16.5 Empty Elements
16.6 Simple Content
16.7 Mixed Content
16.8 Allowing Any Content
16.9 Controlling Type Derivation
Chapter 17. Programming Models
17.1 Common XML Processing Models
17.2 Common XML Processing Issues
Chapter 18. Document Object Model (DOM)
18.1 DOM Foundations
18.2 Structure of the DOM Core
18.3 Node and Other Generic Interfaces
18.4 Specific Node-Type Interfaces
18.5 The DOMImplementation Interface
18.6 Parsing a Document with DOM
18.7 A Simple DOM Application
Chapter 19. Simple API for XML (SAX)
19.1 The ContentHandler Interface
19.2 SAX Features and Properties
Part IV: Reference
Chapter 20. XML 1.0 Reference
20.1 How to Use This Reference
20.2 Annotated Sample Documents
20.3 XML Syntax
20.5 XML Document Grammar
Chapter 21. Schemas Reference
21.1 The Schema Namespaces
21.2 Schema Elements
21.3 Primitive Types
21.4 Instance Document Attributes
Chapter 22. XPath Reference
22.1 The XPath Data Model
22.2 Data Types
22.3 Location Paths
22.5 XPath Functions
Chapter 23. XSLT Reference
23.1 The XSLT Namespace
23.2 XSLT Elements
23.3 XSLT Functions
Chapter 24. DOM Reference
24.1 Object Hierarchy
24.2 Object Reference
Chapter 25. SAX Reference
25.1 The org.xml.sax Package
25.2 The org.xml.sax.helpers Package
25.3 SAX Features and Properties
25.4 The org.xml.sax.ext Package
Chapter 26. Character Sets
26.1 Character Tables
26.2 HTML4 Entity Sets
26.3 Other Unicode Blocks
Copyright 2002, 2001 O'Reilly & Associates, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly & Associates books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles. For more information contact our corporate/institutional sales department: 800-998-9938 or firstname.lastname@example.org.
Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. The association between the image of a peafowl and the topic of XML is a trademark of O'Reilly & Associates, Inc. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc., in the United States and other countries. O'Reilly & Associates, Inc. is independent of Sun Microsystems.
While every precaution has been taken in the preparation of this book, the publisher and the author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
XML is one of the most important developments in document syntax in the history of computing. In the last few years it has been adopted in fields as diverse as law, aeronautics, finance, insurance, robotics, multimedia, hospitality, travel, art, construction, telecommunications, software, agriculture, physics, journalism, theology, retail, and comics. XML has become the syntax of choice for newly designed document formats across almost all computer applications. It's used on Linux, Windows, Macintosh, and many other computer platforms. Mainframes on Wall Street trade stocks with one another by exchanging XML documents. Children playing games on their home PCs save their documents in XML. Sports fans receive real-time game scores on their cell phones in XML. XML is simply the most robust, reliable, and flexible document syntax ever invented.
XML in a Nutshell is a comprehensive guide to the rapidly growing world of XML. It covers all aspects of XML, from the most basic syntax rules, to the details of DTD and schema creation, to the APIs you can use to read and write XML documents in a variety of programming languages.
There are hundreds of formally established XML applications from the W3C and other standards bodies, such as OASIS and the Object Management Group. There are even more informal, unstandardized applications from individuals and corporations, such as Microsoft's Channel Definition Format and John Guajardo's Mind Reading Markup Language. This book cannot cover them all, any more than a book on Java could discuss every program that has ever been or might ever be written in Java. This book focuses primarily on XML itself. It covers the fundamental rules that all XML documents and authors must adhere to, whether a web designer uses SMIL to add animations to web pages or a C++ programmer uses SOAP to exchange serialized objects with a remote database.
This book also covers generic supporting technologies that have been layered on top of XML and are used across a wide range of XML applications. These technologies include:
XLinks
An attribute-based syntax for hyperlinks between XML and non-XML documents that provides the simple, one-directional links familiar from HTML, multidirectional links between many documents, and links between documents to which you don't have write access.
XSLT
An XML application that describes transformations from one document to another, in either the same or different XML vocabularies.
XPointers
A syntax for URI fragment identifiers that selects particular parts of the XML document referred to by the URI; often used in conjunction with an XLink.
XPath
A non-XML syntax used by both XPointer and XSLT for identifying particular pieces of XML documents. For example, an XPath can locate the third address element in the document, or all elements with an email attribute whose value is email@example.com.
Namespaces
A means of distinguishing between elements and attributes from different XML vocabularies that have the same name; for instance, the title of a book and the title of a web page in a web page about books.
Schemas
An XML vocabulary for describing the permissible contents of XML documents from other XML vocabularies.
SAX
The Simple API for XML, an event-based application programming interface implemented by many XML parsers.
DOM
The Document Object Model, a language-neutral tree-oriented API that treats an XML document as a set of nested objects with various properties.
XHTML
An XMLized version of HTML that can be extended with other XML applications such as MathML and SVG.
RDDL
The Resource Directory Description Language, an XML application based on XHTML for documents placed at the end of namespace URLs.
All these technologies, whether defined in XML (XLinks, XSLT, Namespaces, Schemas, XHTML, and RDDL) or in another syntax (XPointers, XPath, SAX, and DOM), are used in many different XML applications.
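To make the XPath description above a little more concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The sample document and its element names are invented for illustration. ElementTree implements only a subset of XPath, so this shows the idea rather than the full language: the expression address[3] is evaluated relative to the root element, and it picks out the third address in the document only because every address in this sample is a child of the root.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
DOC = """<contacts>
  <address email="a@example.com">First</address>
  <address email="email@example.com">Second</address>
  <address>Third</address>
  <person email="email@example.com">Fourth</person>
</contacts>"""

root = ET.fromstring(DOC)

# The third address element among the root's children.
third = root.find("address[3]")

# Every descendant element, whatever its name, whose email
# attribute has the value email@example.com.
matches = root.findall(".//*[@email='email@example.com']")

print(third.text)
print([m.tag for m in matches])
```

A processor that supports all of XPath 1.0 would also accept expressions that ElementTree cannot evaluate, such as functions and arithmetic.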
This book does not specifically cover XML applications that are relevant to only some users of XML, such as:
SVG
Scalable Vector Graphics, a W3C-endorsed standard XML encoding of line art.
MathML
The Mathematical Markup Language, a W3C-endorsed standard XML application used for embedding equations in web pages and other documents.
RDF
The Resource Description Framework, a W3C-standard XML application used for describing resources, with a particular focus on the sort of metadata one might find in a library card catalog.
Occasionally we use one or more of these applications in an example, but we do not cover all aspects of the relevant vocabulary in depth. While interesting and important, these applications (and hundreds more like them) are intended primarily for use with special software that knows their format intimately. For instance, most graphic designers do not work directly with SVG. Instead, they use their customary tools, such as Adobe Illustrator, to create SVG documents. They may not even know they're using XML.
This book focuses on standards that are relevant to almost all developers working with XML. We investigate XML technologies that span a wide range of XML applications, not those that are relevant only within a few restricted domains.
XML has hardly stood still in the 18 months since the first edition of XML in a Nutshell was published. To answer the most frequent request from readers of the first edition, there are now two new chapters covering schemas. Furthermore, other chapters throughout the book have been rewritten to reflect the impact of schemas on their subject matter. We added several other new topics as well, including the RDDL, the Transformations API for XML (TrAX), the Java API for XML Processing (JAXP), and SAX filters.
In addition, the treatment of many topics has been upgraded to the latest versions of various specifications, including:
XSL Formatting Objects 1.0
XPointer 2nd Candidate Recommendation
Finally, many small errors and omissions were corrected throughout the book.
Part I, XML Concepts, introduces you to the fundamental standards that form the essential core of XML to which all XML applications and software must adhere. It teaches you about well-formed XML, DTDs, namespaces, and Unicode as quickly as possible.
Part II, Narrative-Centric Documents, explores technologies that are used mostly for narrative XML documents, such as web pages, books, articles, diaries, and plays. You'll learn about XSLT, CSS, XSL-FO, XLinks, XPointers, XPath, and RDDL.
One of the most unexpected developments in XML was its enthusiastic adoption for data-heavy structured documents such as spreadsheets, financial statistics, mathematical tables, and software file formats. Part III, Data-Centric XML, explores the use of XML for such record-like documents. This part focuses on the tools and APIs needed to write software that processes XML, including SAX, DOM, and schemas.
Finally, Part IV, Reference, is a series of quick-reference chapters that form the core of any Nutshell Handbook. These chapters give you detailed syntax rules for the core XML technologies, including XML, DTDs, schemas, XPath, XSLT, SAX, and DOM. Turn to this section when you need to quickly find the precise syntax for something you know you can do but don't remember exactly how to do.
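As a rough illustration of the difference between the two processing APIs named above, the following sketch reads the same small document first with SAX and then with DOM. It uses Python's standard xml.sax and xml.dom.minidom modules, chosen here only for convenience. The sample document and the PersonCounter class are hypothetical and exist only for this example.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, invented for this sketch.
DOC = "<people><person>Alan Turing</person><person>Grace Hopper</person></people>"


class PersonCounter(xml.sax.ContentHandler):
    """SAX style: the parser calls back as it meets each start-tag."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "person":
            self.count += 1


# Event-based processing: no tree is built, the handler just counts.
handler = PersonCounter()
xml.sax.parseString(DOC.encode("utf-8"), handler)

# Tree-based processing: the whole document becomes nested node objects.
dom = minidom.parseString(DOC)
names = [p.firstChild.data for p in dom.getElementsByTagName("person")]

print(handler.count)
print(names)
```

The SAX handler never holds more than the current event, which suits large inputs. The DOM version keeps the whole document in memory, so it can be navigated and queried after parsing.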
Constant width is used for:
Code examples and fragments.
Anything that might appear in an XML document, including element names, tags, attribute values, entity references, and processing instructions.
Anything that might appear in a program, including keywords, operators, method names, class names, and literals.
Constant-width bold is used for:
Signifying emphasis in code examples and fragments.
Constant-width italic is used for:
Replaceable elements in code statements.
Italic is used for:
New terms where they are defined.
Signifying emphasis in body text.
Pathnames, filenames, and program names. (However, if the program name is also the name of a Java class, it is written in constant-width font, like other class names.)
Host and domain names (cafeconleche.org).
Significant code fragments, complete programs, and documents are generally placed into a separate paragraph like this:
<?xml version="1.0"?>
<?xml-stylesheet href="person.css" type="text/css"?>
<person>
  Alan Turing
</person>
XML is case sensitive. The PERSON element is not the same thing as the person or Person element. Case-sensitive languages do not always allow authors to adhere to standard English grammar. It is usually possible to rewrite the sentence so the two do not conflict, and when possible we have endeavored to do so. However, on rare occasions when there is simply no way around the problem, we let standard English come up the loser.
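The case-sensitivity rule can be checked with a few lines of code. The sketch below uses Python's standard xml.etree.ElementTree module. The wrapper element people is an assumption added so that both spellings can appear in one document.

```python
import xml.etree.ElementTree as ET

# Two elements whose names differ only in case. The <people> wrapper
# is hypothetical and exists only to hold them both.
root = ET.fromstring(
    "<people><person>Alan Turing</person><PERSON>Alan Turing</PERSON></people>"
)

lower = root.findall("person")  # matches only <person>
upper = root.findall("PERSON")  # matches only <PERSON>
mixed = root.findall("Person")  # matches nothing

# A start-tag and end-tag that differ in case do not match each other,
# so the parser rejects the document as not well-formed.
try:
    ET.fromstring("<person>Alan Turing</PERSON>")
    well_formed = True
except ET.ParseError:
    well_formed = False

print(len(lower), len(upper), len(mixed), well_formed)
```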
Finally, although most of the examples used here are toy examples unlikely to be reused, a few have real value. Please feel free to reuse them or any parts of them in your own code. No special permission is required. As far as we are concerned, they are in the public domain (though the same is definitely not true of the explanatory text).
We enjoy hearing from readers with general comments about how this book could be better, specific corrections, or topics you would like to see covered. You can reach the authors by sending email to firstname.lastname@example.org and email@example.com. Please realize, however, that we each receive several hundred pieces of email a day and cannot respond to everyone personally. For the best chance of getting a personal response, please identify yourself as a reader of this book. And please send the message from the account you want us to reply to and make sure that your reply-to address is properly set. There's nothing so frustrating as spending an hour or more carefully researching the answer to an interesting question and composing a detailed response, only to have it bounce because the correspondent sent the message from a public terminal and neglected to set the browser preferences to include his actual email address.
The information in this book has been tested and verified, but you may find that features have changed (or you may even find mistakes). We believe the old saying, "If you like this book, tell your friends. If you don't like it, tell us." We're especially interested in hearing about mistakes. As hard as the authors and editors worked on this book, inevitably there are a few mistakes and typographical errors that slipped by us. If you find a mistake or a typo, please let us know so we can correct it in a future printing. Please send any errors you find directly to the authors at the previously listed email addresses.
You can also address comments and questions concerning this book to the publisher.
The publisher maintains a web site for the book that lists errata, examples, and any additional information.
Before reporting errors, please check this web site to see if we have already posted a fix. To ask technical questions or comment on the book, you can send email to the authors directly or send your questions to the publisher.
For more information about other O'Reilly books, conferences, software, Resource Centers, and the O'Reilly Network, see the O'Reilly web sites.
Many people were involved in the production of this book. The original editor, John Posner, got this book rolling and provided many helpful comments that substantially improved the book. When John moved on, Laurie Petrycki shepherded this book to its completion. The eagle-eyed Jeni Tennison read the entire manuscript from start to finish and caught many errors large and small. Without her attention, this book would not be nearly as accurate. Stephen Spainhour deserves special thanks for his work on the reference section. His efforts in organizing and reviewing material helped create a better book. We'd like to thank Matt Sergeant and Didier P. H. Martin for their thorough technical review of the manuscript and thoughtful suggestions. James Kass's Code2000 font was invaluable in producing Chapter 26.
We'd also like to thank everyone who has worked so hard to make XML such a success over the last few years and thereby given us something to write about. There are so many of these people that we can only list a few. In alphabetical order we'd like to thank Tim Berners-Lee, Jonathan Borden, Jon Bosak, Tim Bray, David Brownell, Mike Champion, James Clark, Charles Goldfarb, Jason Hunter, Arnaud Le Hors, Michael Kay, Keiron Liddle, Murata Makoto, Eve Maler, Brett McLaughlin, David Megginson, David Orchard, Walter E. Perry, Simon St.Laurent, C. M. Sperberg-McQueen, Jonathan Robie, Arved Sandstrom, James Tauber, Henry S. Thompson, B. Tommie Usdin, Daniel Veillard, Norm Walsh, Lauren Wood, and Mark Wutka. Our apologies to everyone we unintentionally omitted.
Elliotte would like to thank his agent, David Rogelberg, who convinced him that it was possible to make a living writing books like this rather than working in an office. The entire Sunsite crew (now ibiblio.org) has also helped him to communicate better with his readers in a variety of ways over the last several years. All these people deserve much thanks and credit. Finally, as always, he offers his largest thanks to his wife, Beth, without whose love and support this book would never have happened.
Scott would most like to thank his lovely wife, Celia, who has already spent way too much time as a "computer widow." He would also like to thank his daughter Selene for understanding why Daddy can't play with her when he's "working" and Skyler for just being himself. Also, he'd like to thank the team at Enterprise Web Machines for helping him make time to write. Finally, he would like to thank John Posner for getting him into this and Laurie Petrycki for working with him when things got tough.
Elliotte Rusty Harold, firstname.lastname@example.org
W. Scott Means, email@example.com