XML is a project of the World Wide Web Consortium (W3C). XML is actually a metalanguage: a facility for creating new languages using markup.
5.8.1 A Brief History of XML
The idea of markup languages is to include datatype information along with data in a human-readable form. Markup originated in printed form, when editors literally marked up printed pages, indicating changes and printing information, such as whether a word should be in bold. Carrying over this idea to computers seems natural: just write something like <bold>this</bold> to tell the computer to write a word in bold.
Over time, developers found that they wanted to be able to set up templates so that two different documents would have the same look. A template is essentially a grammar: a definition of a set of strings, namely all strings whose contents match the template. Thus, each new template defines a new language, and a facility for creating new templates is a metalanguage. This is a strong argument that a markup language should be a "metamarkup" language, a facility for creating new templates. This line of thinking, in part, led to the creation of SGML, the Standard Generalized Markup Language.
SGML is an unusually powerful language that includes the ability to define new document templates and many other facilities. This power comes at the price of complexity, so much complexity that some developers feel that people with little computer expertise will never learn SGML. This line of thinking leads us back to simpler languages, such as HTML (Hypertext Markup Language). HTML is not a metalanguage for creating new templates; rather it is a language for describing a Web page, and it is much simpler than SGML.
If you wanted to restore some of the power of SGML, including at least the ability to define new templates (and thus new languages), you might invent something like XML. Many people see XML as striking the right compromise between power and simplicity. XML makes it comparatively easy to create new markup languages and to create documents that are elements of those languages. You can get an XML parser for free and start creating your own languages within a few hours.
5.8.2 The Evolution of XML
XML standardizes the definition of new languages enough that parsers for XML itself need only be written once. This has led to the free availability of XML parsers. Instead of writing your own parser, you can use a free, existing parser to read a data file into Java.
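As a minimal sketch of what using an existing parser can look like, the following assumes the JAXP SAX parser that ships with current JDKs (package javax.xml.parsers). That is only one of several freely available parsers and not necessarily the one this book relies on. The class name ElementLister and the method elementNames are hypothetical, chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the name does not come from the book.
class ElementLister {

    // Returns the element names in the order the parser encounters them.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                names.add(qName);
            }
        };
        try {
            SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrap the parser's checked exceptions to keep the sketch short.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<coffee><name>Brimful</name><price>6.95</price></coffee>";
        System.out.println(elementNames(xml)); // prints [coffee, name, price]
    }
}
```

The point of the sketch is that no parsing code appears in it: the application supplies only a handler, and the existing parser does the recognition.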
At the time of this writing, XML is evolving rapidly, so this book is limited in its ability to advise you where to get XML tools and even which ones to get. For example, during the writing of this book, IBM donated its popular (and free) XML parser to Apache. Another important recent change is the movement toward stricter definitions of grammars. Initially, grammars appeared in XML document type definition files. Now there is work to allow XML developers to specify grammars in XML schemas, which allow a tighter mapping to Java and other languages. It is predictable that other, unpredictable advances will have occurred by the time you read this. It is also predictable that XML will still be evolving, so good advice is to keep watching XML.
5.8.3 An XML Example
You can parse the marketing department's coffee input file by converting it to XML. Imagine that the old director leaves and the new director demands to know why the information technology group is behind the times and not using XML. To catch up, you can create an XML version of the coffee input file and use an XML parser to read the file. In a markup format, the example of coffees to feature in a given month's brochure might look like this:
<?xml version="1.0"?>
<brochureCoffees>
    <coffee>
        <name>Brimful</name>
        <roast>Regular</roast>
        <price>6.95</price>
        <country>Kenya</country>
    </coffee>
    <coffee>
        <name>Caress</name>
        <formerName>Smackin</formerName>
        <roast>French</roast>
        <price>7.95</price>
        <country>Sumatra</country>
    </coffee>
    <coffee>
        <name>Fragrant Delicto</name>
        <roast>Regular</roast>
        <orFrench/>
        <price>9.95</price>
        <country>Peru</country>
    </coffee>
    <coffee>
        <name>Havalavajava</name>
        <roast>Regular</roast>
        <price>11.95</price>
        <country>Hawaii</country>
    </coffee>
    <coffee>
        <name>Launch Mi</name>
        <roast>French</roast>
        <price>6.95</price>
        <country>Kenya</country>
    </coffee>
    <coffee>
        <name>Roman Spur</name>
        <formerName>Revit</formerName>
        <roast>Italian</roast>
        <price>7.95</price>
        <country>Guatemala</country>
    </coffee>
    <coffee>
        <name>Simplicity House</name>
        <roast>Regular</roast>
        <orFrench/>
        <price>5.95</price>
        <country>Colombia</country>
    </coffee>
</brochureCoffees>
We can use a SAX parser to read in the markup version of the coffee file. SAX is the Simple Application Programming Interface for XML. SAX provides some callback features that let us build coffee objects as the parser sees them. An alternative to using SAX is to use a parser that builds a document object model (DOM) and then walk over the document building Java objects. Walking over documents can be tricky, however, and in general this book promotes building objects during recognition rather than building some kind of tree and walking over it.
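For comparison, here is a rough sketch of the DOM alternative: build the whole document tree first, then walk it. It assumes the JAXP DocumentBuilder bundled with current JDKs. The class DomCoffeeNames and its method coffeeNames are hypothetical names, and the sketch extracts only coffee names rather than building full coffee objects.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the name does not come from the book.
class DomCoffeeNames {

    // Builds a DOM tree for the whole document, then walks it for names.
    static List<String> coffeeNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList coffees = doc.getElementsByTagName("coffee");
            for (int i = 0; i < coffees.getLength(); i++) {
                Element coffee = (Element) coffees.item(i);
                // Look only at <name> elements inside this <coffee>.
                NodeList nameNodes = coffee.getElementsByTagName("name");
                if (nameNodes.getLength() > 0) {
                    names.add(nameNodes.item(0).getTextContent().trim());
                }
            }
        } catch (Exception e) {
            // Wrap the parser's checked exceptions to keep the sketch short.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<brochureCoffees>"
            + "<coffee><name>Brimful</name><roast>Regular</roast></coffee>"
            + "<coffee><name>Caress</name><formerName>Smackin</formerName></coffee>"
            + "</brochureCoffees>";
        System.out.println(coffeeNames(xml)); // prints [Brimful, Caress]
    }
}
```

Even in this small case the walking code has to cast nodes, guard against missing children, and index into node lists, which hints at why walking larger documents can become tricky.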
Our strategy is to create a hierarchy of Helper classes, similar to assemblers. In this example, the helpers take data that the parser sees and apply it to the creation of coffee objects. A SAX parser makes two basic callbacks: when the parser sees an element, such as <roast>, and when it sees data, such as Italian. When the parser tells you that it has seen a new element, you set a Helper object to be the right kind of helper for the element. Then, when the parser sees the data for the element, you pass the data to the helper, which knows how to apply the data to a Coffee object. Figure 5.5 shows the classes in sjm.examples.coffee that help a SAX parser to build a list of coffees.
Figure 5.5. Support for SAX. This diagram shows classes in the coffee package that help a SAX parser to build a list of coffees.
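The helper strategy can be sketched as follows. This is a simplified illustration, not the classes of Figure 5.5: CoffeeSketch, TextHelper, CoffeeHandler, and CoffeeSaxDemo are hypothetical names, a map of lambdas stands in for a class hierarchy, and the JAXP SAX parser bundled with current JDKs is assumed. One deliberate difference from the description above is that the sketch buffers text and hands it to the helper when the element closes, because a SAX parser may deliver one element's text in several pieces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the book's Coffee class.
class CoffeeSketch {
    String name;
    String roast;
    double price;
    String country;
}

// A helper knows how to apply one element's text to a coffee.
interface TextHelper {
    void apply(String text, CoffeeSketch coffee);
}

class CoffeeHandler extends DefaultHandler {
    final List<CoffeeSketch> coffees = new ArrayList<>();
    private final Map<String, TextHelper> helpers = new HashMap<>();
    private final StringBuilder text = new StringBuilder();
    private TextHelper current; // helper for the open element, or null

    CoffeeHandler() {
        helpers.put("name", (s, c) -> c.name = s);
        helpers.put("roast", (s, c) -> c.roast = s);
        helpers.put("price", (s, c) -> c.price = Double.parseDouble(s));
        helpers.put("country", (s, c) -> c.country = s);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        if (qName.equals("coffee")) {
            coffees.add(new CoffeeSketch()); // start building a new coffee
        }
        current = helpers.get(qName); // null for elements this sketch ignores
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called more than once
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && !coffees.isEmpty()) {
            CoffeeSketch latest = coffees.get(coffees.size() - 1);
            current.apply(text.toString().trim(), latest);
        }
        current = null;
    }
}

class CoffeeSaxDemo {
    static List<CoffeeSketch> parse(String xml) {
        CoffeeHandler handler = new CoffeeHandler();
        try {
            SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrap the parser's checked exceptions to keep the sketch short.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return handler.coffees;
    }

    public static void main(String[] args) {
        String xml = "<brochureCoffees><coffee><name>Brimful</name>"
            + "<roast>Regular</roast><price>6.95</price>"
            + "<country>Kenya</country></coffee></brochureCoffees>";
        for (CoffeeSketch c : parse(xml)) {
            System.out.println(c.name + ", " + c.roast + ", " + c.country);
        }
    }
}
```

Elements without a registered helper, such as <formerName> and <orFrench/>, are simply skipped here; a fuller version would register helpers for them as well.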