XML...Describing It All | Inescapable Data: Harnessing the Power of Convergence (paperback)

The data communication world has seen so many challenges over the decades that it is difficult to point out any single one as the most troublesome. If we were to choose one, however, it would be a lack of interoperability between devices, characterized by dissimilar computing machinery, incompatible wiring types and protocols, and dissimilar data protocols. Luckily, the computing world has now embraced the notion that nonproprietary standards for physical networking are good and that we all win when devices talk to each other no matter what they do or who made them. Even if new networking technologies come forward, the people designing and building the gear are now more willing to work together to standardize on their interface than ever before. But what about the actual data being exchanged?

The last element in the Inescapable Data world is intelligent and simple information exchange by using self-describing data techniques.

Until fairly recently, storing, processing, and managing large amounts of data was expensive. It was expensive to store on disk, expensive to move through networks, and expensive to process. In the early 2000s, we saw a dramatic drop in the cost of data storage because disk-drive manufacturers could double the capacity of a single drive without substantially increasing the cost to manufacture that drive. (Laptops, a case in point, will soon be sold with 1-terabyte, or 1,000-gigabyte, disks.) We similarly saw networking performance go from 10 megabits per second to 100 to 1,000 and now 10,000 megabits per second with no increase in device cost. Of course, CPU processing power has followed Moore's law all the while (doubling every 18 months) without doubling in cost. We think that these advances, when combined, have enabled something magical to occur: We can now be more "verbose" in our data usage. We have the processing power, the data storage, and the network bandwidth to actually describe data as it is used and transferred. This capability would have been unheard of prior to the dawn of the twenty-first century.

So what does it mean to describe data? Typically, data in a computer is stored in binary form, in its most compact state. The numeral 5, for example, might be stored in as few as 8 bits (1 byte). With such a tight format, there is little room left to include information to tell what that 5 might actually represent. Does it mean 5 dollars? Five cents? Is it your balance due, or the cost of a particular item?
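To make the point concrete, here is a minimal Python sketch of the compact form. It assumes the 5 is held as a single unsigned byte, which is only one of several possible binary encodings; the variable names are illustrative.

```python
import struct

# Pack the numeral 5 into one unsigned byte (8 bits).
raw = struct.pack("B", 5)

print(raw)       # b'\x05'
print(len(raw))  # 1

# Unpacking recovers the number, but nothing in the byte itself says
# whether it means dollars, cents, a balance due, or an item price.
(value,) = struct.unpack("B", raw)
print(value)     # 5
```

The meaning of the byte lives entirely in the program that wrote it, which is why another program cannot interpret it without outside knowledge.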

For years, computer systems could easily talk to each other electrically (via Ethernet) but labored to exchange meaningful information. It took teams of well-trained people to write special software that could decompose business databases created for a single use into data that could be used in other contexts and by other software applications. To some extent, this was acceptable because business systems were fairly customized to a particular business or business process. However, the Web dissolved business barriers and created a need for business information exchange. All of a sudden, the need to exchange information between millions of computers materialized, almost overnight.

At the inception of the Web, people first thought that the Web would be so vast that we would need special sites just to collect other sites and help us navigate through the maze. History has often taught us that hierarchical organizations are what we deploy to solve problems of complexity. Look, for example, at your local library: painstakingly, every book is cross-referenced in three directions and stored in a massive index. Hierarchical and index solutions work as an organizational tool for large data sets, but fail for massive ones, such as the Web, which need ad-hoc and quasi relationships.

The Web dictated that to display information, you had to first format it in a simple text-describing nomenclature called Hypertext Markup Language (HTML). A Web page is a collection of text and pictures with various formatting information. Unlike databases, the language of the formatting is human readable and human understandable. For example, a Web page may have such statements as <Title>This is my title</Title><Body>This is the main text area</Body>, where <Title> and <Body> are known as tags.

HTML represents a special kind of magic: a blending of human and machine intelligence. Web "pages" written in HTML are readable by machines and humans alike. As such, it is simple to create search tools that just run around surfing the Web much like we do, but they can read and index the content they find far faster and present it back to us in human-readable form. As an added bonus, they do so continuously without breaks for meals, sleep, or days off from work. Every Web page contains vast amounts of associated information, such as the author, the hosting company site, adjacent pages, pages it references, and pages that reference it. Google and other search engines use all this plus the embedded "tag" information to provide a detailed inventory of this massive resource.
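The claim that a program can read the same tags a person can is easy to sketch. The following is a small illustration only, not a description of how any real search engine works: it uses Python's standard `html.parser` module on the sample tags shown earlier, and the class name `TagTextCollector` is hypothetical.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collects the text found inside each tag, keyed by tag name."""

    def __init__(self):
        super().__init__()
        self._open_tags = []    # stack of tags currently open
        self.text_by_tag = {}   # tag name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        # The parser reports tag names in lowercase.
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if self._open_tags and self._open_tags[-1] == tag:
            self._open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._open_tags:
            self.text_by_tag.setdefault(self._open_tags[-1], []).append(text)


page = "<Title>This is my title</Title><Body>This is the main text area</Body>"

collector = TagTextCollector()
collector.feed(page)
collector.close()

for tag, fragments in collector.text_by_tag.items():
    print(tag, "->", fragments)
```

Because the tags name what each piece of text is, even a toy program like this can file the title separately from the body.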

All this leads us to eXtensible Markup Language (XML), which is much like the Web language HTML. Documents are not binary; instead, they are human-readable text, and every element is encapsulated within human-readable tags. We might have a brief XML document such as the following:

 <CustomerRecord>
   <ItemPurchased>
     <ItemType> Shoes </ItemType>
     <ItemPrice> 5.00 </ItemPrice>
   </ItemPurchased>
   <ItemPurchased>
     <ItemType> Toys </ItemType>
     <ItemPrice> 3.45 </ItemPrice>
   </ItemPurchased>
 </CustomerRecord>

XML formatting gives proprietary databases and records a nearly universal method for describing their contents. The 5.00 from our example, once just an unlabeled binary number, is now clearly a price, and specifically the price of an item of type Shoes. You do not need to be a sophisticated programmer who understands how to read a "schema" document or how to encode SQL statements to make sense of XML statements. Your 13-year-old could happen upon such an XML fragment and derive some value from it. He or she could likely import it into a favorite spreadsheet package and sort or average or trend it with a few keystrokes. We refer to this capability often in later chapters.
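The same sort-and-average exercise can be sketched in a few lines of code instead of a spreadsheet. This example assumes Python and its standard `xml.etree.ElementTree` module, which the text does not mention; the variable names are illustrative, and the document is the customer record shown above.

```python
import xml.etree.ElementTree as ET

document = """
<CustomerRecord>
  <ItemPurchased>
    <ItemType> Shoes </ItemType>
    <ItemPrice> 5.00 </ItemPrice>
  </ItemPurchased>
  <ItemPurchased>
    <ItemType> Toys </ItemType>
    <ItemPrice> 3.45 </ItemPrice>
  </ItemPurchased>
</CustomerRecord>
"""

root = ET.fromstring(document.strip())

# Read each purchase by the names of its tags, not by byte position.
purchases = [
    (item.findtext("ItemType").strip(), float(item.findtext("ItemPrice")))
    for item in root.findall("ItemPurchased")
]

# Sort by price, cheapest first, then average the prices.
purchases.sort(key=lambda purchase: purchase[1])
average = sum(price for _, price in purchases) / len(purchases)

for item_type, price in purchases:
    print(item_type, price)
print("average price:", average)
```

Nothing here depends on knowing how the originating database laid out its records; the tag names carry that information.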

Suppose, for example, that airline landing data was available nationwide in XML format (listing the airline, flight number, time of arrival, arrival airport, and so forth). A college student in Ludwig, Texas, with an interest in statistics and the correlation of flights to weather to economic conditions could, without ever writing any specialty software, correlate massive tables of flight data and massive tables of weather data along with published economic data from the Federal Reserve, all without changing out of his or her pajamas. Using some macros in an Excel spreadsheet and some cross-tabulation tricks, the student might tease out a relationship that was somewhat counterintuitive. This "tidbit" then becomes a tool for investment transactions or a tip back to an airline for an efficiency-consulting arrangement.

Note that this brief XML document is perhaps 100 times thicker (data-wise) than a simple 5, and therefore has a comparatively huge impact on the amount of disk space required to store the document and the bandwidth required to send it from one computer to another, not to mention the processing power needed to translate the human-readable statements into machine-readable form, and back again. This is why XML was not practical until available networking bandwidth, CPU horsepower, and storage densities hit their current levels.
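A rough way to see the overhead is to count bytes. The sketch below assumes the compact form is a single byte and the XML is plain ASCII text written on one line; the exact ratio depends on whitespace and encoding, so treat the figures as illustrative.

```python
# The numeral 5 in its most compact form: one byte.
compact = bytes([5])

# The same value wrapped in its self-describing tag.
element = "<ItemPrice> 5.00 </ItemPrice>".encode("ascii")

# The whole customer record, written on one line.
document = (
    "<CustomerRecord>"
    "<ItemPurchased><ItemType> Shoes </ItemType>"
    "<ItemPrice> 5.00 </ItemPrice></ItemPurchased>"
    "<ItemPurchased><ItemType> Toys </ItemType>"
    "<ItemPrice> 3.45 </ItemPrice></ItemPurchased>"
    "</CustomerRecord>"
).encode("ascii")

print("compact bytes:", len(compact))   # 1
print("element bytes:", len(element))   # 29
print("document bytes:", len(document))
print("element-to-compact ratio:", len(element) // len(compact))
```

Even the single price element is 29 times the size of the bare byte, and the full record is larger still.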

Business back ends are now XML crazy. Any information that needs to be expressed to another computer system is now expressed in some XML format. (Data might still be stored in databases in a more native format, but we predict that these formats will eventually disappear.) Most significantly, XML enables far higher business-to-business cooperation squarely aligned with the Web's chief goal: information exchange (as opposed to data exchange). XML has been wholeheartedly embraced by business and is allowing for significant efficiency gains and better customer experiences. We are now finding XML reaching into the consumer world and our homes for many of the same values. To the Inescapable Data world, XML is the magic glue that allows all the vast sources of data and internetworking to now have real value through information sharing.