2.10 XML

The data sets discussed so far are only a few of those available, and an investigation is likely to need others that have not been covered. This brings us to the next step in the investigative data mining process: accessing and preparing the data. Because so much of the data being created today is Web-based, a brief discussion of XML is appropriate, for it provides a standard by which different data sets can be merged and used for analysis.

XML can play a central role in allowing agencies and departments to share information. XML, the Extensible Markup Language, is a standard that can provide interoperability among disparate systems. It is designed to improve the functionality of the Web by providing more flexible and adaptable ways of identifying information. It is called extensible because, unlike HTML, which is a single, predefined markup language, it is not a fixed format.

XML is actually a metalanguage, a language for describing other languages, which lets the user design customized markup languages for a virtually unlimited range of document types. In principle, XML can be used to link disparate databases and to build composites from different data sources for investigative data mining analyses. The following are six key points about XML, a technology that can facilitate data sharing and interoperability among multiple government agencies and private industry:

XML is for structuring data. Structured data includes objects, such as spreadsheets, most-wanted lists, configuration parameters, financial transactions, and technical drawings. XML is a set of rules, guidelines, or conventions for designing text formats that let users describe their data. XML is not a programming language; it is a standard for generating, reading, and structuring data. XML is extensible and platform-independent.
It is not HTML. Like HTML, XML makes use of tags (words bracketed by < and >) and attributes (of the form name="value"). While HTML specifies what each tag and attribute means and often how the text between them will look in a browser, XML uses the tags only to delimit pieces of data and leaves the interpretation of the data completely to the application that reads it. So, for example, a <p> in an XML file is not a paragraph; it could be a price, a parameter, or a person.
XML has a family. XML 1.0 is the specification that defines what tags and attributes are. Beyond XML 1.0 is a growing set of modules that accomplish other tasks. XLink describes a standard way to add hyperlinks to an XML file. XPointer and XFragments are syntaxes under development for pointing to parts of an XML document. An XPointer, instead of pointing to documents on the Web, points to pieces of data inside an XML file. XSL is the advanced language for expressing style sheets. It is based on XSLT, a transformation language used for rearranging, adding, and deleting tags and attributes.
XML leads HTML to XHTML. There is an important XML application that is a document format: World Wide Web Consortium's XHTML, the successor to HTML. XHTML has many of the same elements as HTML. The syntax has been changed slightly to conform to the rules of XML. A document that is XML-based inherits the syntax from XML and restricts it in certain ways (e.g., XHTML allows <p>, but not <r>); it also adds meaning to that syntax (XHTML declares that <p> stands for "paragraph," and not for "price," "person," or anything else).
XML is modular. XML allows the user to define a new document format by combining and reusing other formats. Since two formats developed independently may have elements or attributes with the same name, care must be taken when combining those formats (does <p> mean "paragraph" from this format or "person" from that one?). To eliminate name confusion when combining formats, XML provides a namespace mechanism.
XML is the basis for RDF and the Semantic Web. W3C's Resource Description Framework (RDF) is an XML text format that supports resource description and metadata applications, such as collections of mug shots, playlists, or bibliographies. For example, RDF might allow a user to identify certain suspects in a set of photos using information from a wanted list; then a mail client could automatically start a message to investigators alerting them that these photos are on the Web. Just as HTML integrated documents, menu systems, and forms applications to launch the original Web, RDF integrates applications and agents into one Semantic Web. Just as people need to agree on the meanings of the words they employ in their communication, computers need mechanisms for agreeing on the meanings of terms in order to communicate effectively. Formal descriptions of terms in a certain area, such as manufacturing or law enforcement, are called ontologies and are a part of the Semantic Web envisioned by the W3C, the body that develops standards for the Web (www.w3.org).
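
Two of the points above, that XML leaves the meaning of a tag to the application and that namespaces resolve name clashes when formats are combined, can be illustrated with a short sketch. The example below uses Python's standard xml.etree.ElementTree module. The namespace URIs, the element names, and the sample records are all hypothetical and were invented for this illustration; they do not come from any real agency format.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, invented for this sketch only.
WANTED_NS = "http://example.org/wanted-list"
REPORT_NS = "http://example.org/report"

# One document that mixes two vocabularies, each defining its own <p>:
# in the report format <p> is a paragraph, in the wanted list a person.
SAMPLE = """
<case xmlns:w="http://example.org/wanted-list"
      xmlns:r="http://example.org/report">
  <r:p>Subject was last seen near the harbor.</r:p>
  <w:p id="1042">J. Doe</w:p>
  <w:p id="1043">R. Roe</w:p>
</case>
"""


def split_by_vocabulary(xml_text):
    """Separate the two kinds of <p> element by their namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree exposes namespaced tags as "{namespace-uri}localname".
    persons = {el.get("id"): el.text
               for el in root.iter(f"{{{WANTED_NS}}}p")}
    paragraphs = [el.text for el in root.iter(f"{{{REPORT_NS}}}p")]
    return persons, paragraphs


persons, paragraphs = split_by_vocabulary(SAMPLE)
print(persons)
print(paragraphs)
```

The parser itself attaches no meaning to either <p>; it is the application code that decides one is a person and the other a paragraph, and the namespace prefix is what keeps the two apart once the formats share a document.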

More information will be provided about how such an ontology would work via a Web service in the context of a real-time data mining system proposed in Chapter 11.