Summary | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

XML is a standard textual markup language suitable for encoding almost any sort of data. It works very well, both for unstructured narrative data written by people and for the record-oriented data common in computer applications. About the only data for which it's not really suited are bitmapped things such as photographs and recorded sound.

Logically, an XML document is made up of nested elements. Each element has a name, a set of attributes, and some content. The content can include plain text, other elements, or both. The attributes are name-value pairs associated with the element. Each document has a single topmost element called the root or document element. Because all nonroot elements nest completely inside other elements, an XML document has a natural tree structure. In addition to elements and text nodes, XML documents can contain comments, processing instructions, an XML declaration, and a document type declaration.
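The tree structure described above can be made visible by parsing a small document with JAXP and walking the resulting DOM. What follows is only a minimal sketch: the `TreeOutlineDemo` class, its `outline` method, and the `Order`, `Item`, and `Customer` element names are hypothetical and chosen purely for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of any library.
class TreeOutlineDemo {

    // Returns one line per element, indented two spaces per nesting level.
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample document", e);
        }
    }

    // Depth-first traversal that records element nodes and skips text nodes.
    static void walk(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getNodeName()).append('\n');
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        // Sample document; every name except Quantity is invented for this sketch.
        String xml = "<Order><Item><Quantity number=\"17\"/></Item>"
                   + "<Customer>Ann</Customer></Order>";
        System.out.print(outline(xml));
    }
}
```

Each level of indentation in the output corresponds to one level of element nesting, with the root element at the left margin.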

Syntactically, elements are delimited by tags that look like <Quantity>, </Quantity>, and <Quantity/>. <Quantity> is a start-tag that must be matched by the corresponding end-tag </Quantity>. The content of the Quantity element comes between these two tags. <Quantity/> is an empty-element tag that represents a Quantity element with no content. Attributes are indicated by name="value" pairs inside start-tags and empty-element tags. For example, <Quantity number="17"/> is an empty Quantity element that has a number attribute with the value 17.
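To see how this syntax surfaces in a program, the sketch below parses the two forms of the Quantity element with a JAXP `DocumentBuilder` and reads back the attribute value and the text content. The class and method names (`QuantityDemo`, `rootAttribute`, `rootText`) are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of any library.
class QuantityDemo {

    // Parses the string and returns the root (document) element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
    }

    // Value of the named attribute on the root element ("" if it is absent).
    static String rootAttribute(String xml, String attributeName) {
        return parseRoot(xml).getAttribute(attributeName);
    }

    // Text found between the root element's start-tag and end-tag.
    static String rootText(String xml) {
        return parseRoot(xml).getTextContent();
    }

    public static void main(String[] args) {
        // Empty-element tag carrying an attribute.
        System.out.println(rootAttribute("<Quantity number=\"17\"/>", "number"));
        // Start-tag and end-tag with content between them.
        System.out.println(rootText("<Quantity>17</Quantity>"));
    }
}
```

Both calls report 17, but the first value comes from an attribute and the second from element content. Choosing between those two encodings is a design decision left to the document author.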

Physically, an XML document is divided into storage units called entities. These entities can be files, database records, data structures in memory, or something else. The document entity contains the root element of the document. Parsed entities contain XML markup, which will be merged to form the entire document. Parsed entities are located via general entity references such as &anaconda; in the document entity or another parsed entity. Unparsed entities contain non-XML, possibly binary data that will be identified by ENTITY type attributes in the document.
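As a small illustration of a parsed entity, the sketch below declares an internal general entity named `anaconda` in the document type declaration and lets a JAXP DOM parser merge its replacement text into the document. The replacement text, the `Snake` element name, and the `EntityDemo` class are all invented for this example. An external parsed entity would be declared with a system identifier instead of a literal value.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of any library.
class EntityDemo {

    // Parses the document and returns the root element's text after entity expansion.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        // The internal DTD subset declares the entity; the body refers to it.
        String xml = "<!DOCTYPE Snake [<!ENTITY anaconda \"a large constrictor\">]>"
                   + "<Snake>&anaconda;</Snake>";
        System.out.println(rootText(xml));
    }
}
```

The program never sees the `&anaconda;` reference itself in the text it retrieves, only the replacement text that the parser substituted for it.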

Every XML document must be well-formed. Among other things, this means that every start-tag must have a matching end-tag, every attribute value must be quoted, and only certain characters can be used in element names. If a document is not well-formed, it is not an XML document, and XML parsers will not accept it. Beyond well-formedness, documents that have a schema may be valid (but aren't necessarily). A valid document adheres to all the constraints listed in the schema. Schema languages include document type definitions (DTDs), the W3C XML Schema Language, and the XPath-based Schematron.
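One simple way to test well-formedness from Java is to hand the document to a SAX parser and see whether it raises an error. The sketch below assumes a hypothetical `WellFormednessDemo` class with an `isWellFormed` method. It checks well-formedness only and makes no attempt at validation against a schema.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of any library.
class WellFormednessDemo {

    // Returns true if a SAX parser reads the whole string without a fatal error.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal errors, which signal a well-formedness violation.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser could not be set up or input could not be read", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<Quantity number=\"17\"/>")); // matching tag, quoted value
        System.out.println(isWellFormed("<Quantity>17"));              // missing end-tag
        System.out.println(isWellFormed("<Quantity number=17/>"));     // unquoted attribute value
    }
}
```

Only the first sample satisfies the rules listed above. The other two break the end-tag and quoting requirements respectively, so the parser rejects them.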

Because XML markup normally focuses on the structure and semantics of the contained information, before a document can be shown to a human reader, it must first be associated with a stylesheet that tells the browser or other tool how to format the document for display. The two most popular style languages are Cascading Style Sheets (CSS) and the Extensible Stylesheet Language (XSL). CSS is a non-XML declarative language for applying simple styles such as font-weight to elements of certain types. XSL is actually two separate XML applications, the XSL-FO page description language and the XSLT Turing-complete functional language. An XSLT stylesheet is used to transform a source XML document into the XSL-FO vocabulary. However, XSLT can also be used to transform a source XML document to other XML vocabularies such as XHTML.
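The transformation step can be sketched with the TrAX API in `javax.xml.transform`. The stylesheet below is a made-up example that turns a Quantity element into a single `p` element. For brevity it omits the XHTML namespace, so the result is only XHTML-like. The `XsltDemo` class and its `transform` method are likewise assumptions for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; not part of any library.
class XsltDemo {

    // Illustrative stylesheet: maps <Quantity number="..."/> to a paragraph.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/Quantity\">"
        + "<p>Quantity: <xsl:value-of select=\"@number\"/></p>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies STYLESHEET to the source document and returns the serialized result.
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<Quantity number=\"17\"/>"));
    }
}
```

Swapping in a different stylesheet changes the output vocabulary without touching the Java code, which is the main appeal of keeping formatting rules outside the document itself.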