19.1 The Basic Concepts | PHP and PostgreSQL. Advanced Web Programming (2002)

The most important thing to know about XML is that it is a metamarkup language for text documents. XML is not a programming language: you cannot write standalone applications in XML alone, and it has nothing in common with languages such as C and Perl. Like HTML or SGML, XML is a language used to describe documents.

One of the biggest advantages of XML over other document description languages is that it does not have a fixed set of tags. A fixed set would be far too inflexible, because real-world data cannot be modeled with just a few static tags.

19.1.1 XML Technologies

Many technologies have been developed on top of XML. Let's take a look at the most important concepts and technologies you will have to deal with when working with XML and XML-based applications:

XLinks: XLink is an attribute-based syntax for hyperlinks between XML-based and non-XML-based documents. It resembles the links you have already dealt with in HTML, but whereas HTML links are one-directional, XLinks can also be bidirectional or connect more than two resources.
XSLT: XSL is short for Extensible Stylesheet Language and is divided into two parts: XSL Transformations (XSLT) and XSL Formatting Objects. With XSLT, you define rules for transforming one XML document into another document. The processor analyzes the XML file used as input, matches its components against the templates in a stylesheet, and generates the output accordingly.
XPointer: XPointers are often used in combination with XLinks. The idea of XPointer is to provide a syntax for referring to parts of an XML document.
XPath: In XPath, an XML document is seen as a tree of nodes. Every document contains exactly one root node. Every element in the tree has a name, a parent node, a namespace URI, and a set of child nodes. In addition to elements, the tree contains attribute, text, namespace, processing instruction, and comment nodes. XPath is used by XPointer, XSLT, and a set of proposed standards for XML-based query languages.
Namespaces: Names in a document do not have to be unique. Namespaces make it possible to distinguish elements and attributes that share a name but belong to different vocabularies.
SAX: SAX is short for Simple API for XML. It is an event-based programming interface that was originally developed for Java and is widely used when working with XML.
DOM: DOM is short for Document Object Model. It is an API that treats an XML document as a tree of nested objects having various properties and attributes.
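To make the event-based nature of SAX more concrete, here is a minimal sketch in Python (the book itself works with PHP; Python is used here only for illustration). It relies on the standard library module xml.sax. The sample document and the class name EventLogger are invented for this example. The handler merely records each callback in the order in which the parser fires it; no tree is built in memory, which is the main difference from DOM.

```python
import xml.sax

# Hypothetical sample document, made up for this illustration.
DOC = b"<person><firstname>John</firstname><surname>Jackson</surname></person>"


class EventLogger(xml.sax.ContentHandler):
    """Records each SAX callback in the order the parser fires it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # Text may arrive in several chunks; skip whitespace-only ones.
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


handler = EventLogger()
xml.sax.parseString(DOC, handler)
for event in handler.events:
    print(event)
```

Because the parser hands over one event at a time, this style suits large documents that should not be loaded completely, at the price of having to track the current position in the document yourself.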

In addition to these standard technologies, many specialized XML applications have been developed. Most of them focus on a very specific subject. One example is MathML (Mathematical Markup Language), a W3C-endorsed XML application for describing mathematical statements and equations in documents.

19.1.2 XML Basics

Just like HTML, XML is based on tags. In contrast to HTML, XML does not offer a fixed set of tags. This has many advantages: users can define their own tags, so XML offers great flexibility and can easily be adapted to an application's needs.

An XML file is plain text. Unlike many other file formats, XML is not a binary format, so an XML file can easily be read by people and programs alike.

Although XML is a flexible and powerful language, it is much stricter than HTML. Every opening tag must be closed and tags must be properly nested, or a parser will reject the document. The reason for that is simple: because XML does not prescribe a fixed set of tags, strict syntax rules are the only way to retrieve data from a document efficiently and reliably.
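This strictness is easy to observe with any XML parser. The following sketch uses Python's standard xml.etree.ElementTree module purely for illustration; the helper name parser_accepts and the two sample strings are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def parser_accepts(text):
    """Return True if the parser can read text without a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested: tags are closed in the reverse order of opening.
good = "<name><firstname>John</firstname></name>"
# Overlapping tags: <firstname> is still open when </name> appears.
bad = "<name><firstname>John</name></firstname>"

print(parser_accepts(good))  # True
print(parser_accepts(bad))   # False
```

A browser would typically render the second fragment anyway if it were HTML; an XML parser stops at the first mismatched tag.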

If a document satisfies the syntax rules of XML, it is considered to be "well formed"; otherwise, it is not. To find out if a document is well formed, you can use a parser. Let's look at a simple XML document:

 <?xml version="1.0"?>
 <!DOCTYPE person [
         <!ELEMENT person (name, birthday)>
         <!ELEMENT name (academic_title, firstname, surname)>
         <!ELEMENT birthday (#PCDATA)>
         <!ELEMENT academic_title (#PCDATA)>
         <!ELEMENT firstname (#PCDATA)>
         <!ELEMENT surname (#PCDATA)>
 ]>
 <person>
         <name>
                 <academic_title>Dr. med.</academic_title>
                 <firstname>John</firstname>
                 <surname>Jackson</surname>
         </name>
         <birthday>1978/08/09</birthday>
 </person>

The first thing you can do is to validate the document. The easiest way to do that is to use the XML validation tool at http://www.stg.brown.edu/service/xmlvalid/. This tool checks whether the document is well formed and, beyond that, whether it is valid, meaning that it conforms to its DTD. The document you have just seen passes both checks, as you can see in the next listing:

 Validation Results for file.xml
 Document validates OK.

After this brief overview, we will take a closer look at what the content of the document is all about. First, the XML declaration specifies the version of XML in use. The syntax is similar to the syntax of HTML. The block starting with <!DOCTYPE person marks the beginning of an internal DTD. A DTD (Document Type Definition) defines the permitted structure of an XML document. To use the words of a database developer: With the help of DTDs, it is possible to define data structures. In most cases, the data and the DTD are stored in separate files because data and definition should be kept apart.

In this example, the data structure consists of various nested elements. The element called person consists of two further elements called name and birthday. The element called name consists of three further elements whose names are academic_title, firstname, and surname. As you probably noticed, the child elements of an element are listed in parentheses, in the order in which they must appear. The remaining lines declare the elements that have no children of their own. Their content is declared as #PCDATA (parsed character data), which means that these elements contain the actual data.
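To illustrate what such declarations demand, the following sketch transcribes the content models by hand into a Python dictionary and checks a document against them. It is a hand-rolled illustration and not a real DTD validator: it only compares child order and ignores everything else a DTD can express. The names CONTENT_MODEL and conforms are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Content models from the DTD in the text, transcribed by hand.
# A tuple lists the required children in order; None stands for #PCDATA.
CONTENT_MODEL = {
    "person": ("name", "birthday"),
    "name": ("academic_title", "firstname", "surname"),
    "birthday": None,
    "academic_title": None,
    "firstname": None,
    "surname": None,
}


def conforms(element):
    """Check an element and its descendants against CONTENT_MODEL."""
    if element.tag not in CONTENT_MODEL:
        return False
    expected = CONTENT_MODEL[element.tag]
    children = list(element)
    if expected is None:
        # #PCDATA: character data only, no child elements.
        return not children
    if tuple(child.tag for child in children) != expected:
        return False
    return all(conforms(child) for child in children)


doc = ET.fromstring(
    "<person><name><academic_title>Dr. med.</academic_title>"
    "<firstname>John</firstname><surname>Jackson</surname></name>"
    "<birthday>1978/08/09</birthday></person>"
)
print(conforms(doc))  # True
```

Swapping firstname and surname in the input makes the check fail, because the content model of name fixes the order of its children.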

After the inline DTD, the data is listed. Each piece of data, or a group of further elements, is enclosed in a pair of tags. This way, a tree structure can be built. Every leaf of the tree contains a piece of data. Because an XML document can be seen as a tree, it contains hierarchical data. This is different from the relational concept, but it also has some advantages (just as a relational data structure has some advantages).
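The tree view can be made visible with a short sketch that walks the sample document and prints the path to every leaf together with its data. It uses Python's standard xml.etree.ElementTree module for illustration; the function name leaves and the path notation are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

DOC = """<person>
  <name>
    <academic_title>Dr. med.</academic_title>
    <firstname>John</firstname>
    <surname>Jackson</surname>
  </name>
  <birthday>1978/08/09</birthday>
</person>"""


def leaves(element, path=""):
    """Yield (path, text) for every element that has no child elements."""
    here = f"{path}/{element.tag}"
    children = list(element)
    if not children:
        yield here, (element.text or "").strip()
    for child in children:
        yield from leaves(child, here)


for path, value in leaves(ET.fromstring(DOC)):
    print(path, "=", value)
# /person/name/academic_title = Dr. med.
# /person/name/firstname = John
# /person/name/surname = Jackson
# /person/birthday = 1978/08/09
```

Each printed path corresponds to one route from the root node to a leaf, which is exactly the hierarchy the DTD describes.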

The main challenge when working with PHP, PostgreSQL, and XML is finding an efficient way of extracting data from the XML file. One way to access data is DOM. PHP provides various functions for working with DOM, but these functions are still said to be experimental.
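As a language-neutral illustration of DOM-style access, the following sketch uses Python's standard xml.dom.minidom module rather than PHP's DOM functions, whose names and signatures are not shown here. The calls getElementsByTagName and firstChild belong to the W3C DOM interface; the helper text_of and the choice of fields are assumptions made for this example.

```python
from xml.dom import minidom

DOC = (
    "<person><name><academic_title>Dr. med.</academic_title>"
    "<firstname>John</firstname><surname>Jackson</surname></name>"
    "<birthday>1978/08/09</birthday></person>"
)


def text_of(document, tag):
    """Return the text of the first element named tag, or None."""
    nodes = document.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return None
    return nodes[0].firstChild.data


dom = minidom.parseString(DOC)
record = {tag: text_of(dom, tag) for tag in ("firstname", "surname", "birthday")}
print(record)
# {'firstname': 'John', 'surname': 'Jackson', 'birthday': '1978/08/09'}
```

The resulting dictionary is a flat record of the kind that could be handed to a database layer, which is the step where the hierarchical XML data meets the relational model.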