Chapter 4. Converting Flat Files to XML


Relatively little of the world's data is currently stored in XML. Much of it is stored in flat files as tab-delimited text, comma-separated values, or some similar format. More is locked up in databases of one kind or another, whether relational, hierarchical, or object based. Even more is hidden inside unstructured documents, including Microsoft Word files, HTML documents, and plain text. XML tools cannot work with any of this data until it has been converted to XML.

There are no magic bullets that will convert all of your data to semantically tagged XML. There are a few specialized programs that convert certain formats such as Word documents to particular XML applications such as XHTML. However, the output from even the best of these tools often needs to be cleaned up by hand. How much clean-up work you need to do generally depends on how structured the data format is to start with and how clean the data is. It's relatively easy to encode a relational table from a DB2 database as XML because it already has a lot of structure and a mandatory schema. It's a lot harder in practice to convert tab-delimited text files because those tend to be full of mistakes and dirty data. Records are missing fields. Fields get swapped with each other. A field that is supposed to contain a number between 1 and 12 may contain a list of foodstuffs the data entry clerk was supposed to buy on his way home one day. All of these things can and do happen, and you have to account for them regardless of what you're doing with such data, whether that's converting it to XML or summarizing it for an annual report.
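To make the dirty-data problem concrete, here is a minimal sketch of the kind of check you might run on each line of a tab-delimited file before converting it. The record layout (name, month, amount) and the `RecordChecker` class are assumptions invented for this illustration; a real file will need its own rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical checker for a record with three tab-separated fields:
// name, month (1-12), amount. The layout is assumed for illustration only.
class RecordChecker {

    static final int EXPECTED_FIELDS = 3;

    // Returns the problems found in one line; an empty list means the line looks clean.
    static List<String> check(String line) {
        List<String> problems = new ArrayList<>();
        // The -1 limit keeps trailing empty fields so they are still counted.
        String[] fields = line.split("\t", -1);
        if (fields.length != EXPECTED_FIELDS) {
            problems.add("expected " + EXPECTED_FIELDS + " fields but found " + fields.length);
            return problems; // field positions are unreliable, so stop here
        }
        if (fields[0].trim().isEmpty()) {
            problems.add("name is empty");
        }
        try {
            int month = Integer.parseInt(fields[1].trim());
            if (month < 1 || month > 12) {
                problems.add("month out of range: " + month);
            }
        } catch (NumberFormatException e) {
            problems.add("month is not a number: " + fields[1]);
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(check("Smith\t7\t12.50"));                  // clean record
        System.out.println(check("Smith\teggs, milk, bread\t12.50")); // shopping list in the month field
        System.out.println(check("Smith\t7"));                        // missing field
    }
}
```

Collecting problems in a list rather than throwing on the first one lets you log every bad record and decide later whether to repair it, skip it, or stop the conversion.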

When you're tasked with converting legacy data to XML, you just have to roll up your sleeves and attack the problem. You need to understand the current structure of the data. You need to write a program that reads the input format and writes out XML. You need to debug the inevitable problems that arise when the data in the input isn't exactly what you thought it was or what it was supposed to be. By far the hardest part of this problem is parsing the input data, in whatever form it takes. Once you've loaded the data into your program, writing it back out again in XML is a cakewalk.
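The output half of the job might be sketched as follows, assuming the input is tab-delimited text with one record per line. The `FlatFileToXml` class and the element names `records`, `record`, and `field` are hypothetical, chosen only for this example. The sketch escapes the characters that are significant in character data but does not filter control characters that are illegal in XML, which real input may well contain.

```java
// Hypothetical converter: one <record> per input line, one <field> per tab-separated value.
class FlatFileToXml {

    // Escapes the characters that would otherwise be read as markup.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Converts tab-delimited text to an XML document held in a string.
    static String convert(String input) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\"?>\n<records>\n");
        for (String line : input.split("\r?\n")) {
            if (line.isEmpty()) continue; // skip blank lines
            xml.append("  <record>\n");
            // The -1 limit keeps trailing empty fields.
            for (String field : line.split("\t", -1)) {
                xml.append("    <field>").append(escape(field)).append("</field>\n");
            }
            xml.append("  </record>\n");
        }
        xml.append("</records>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        String input = "Smith\t7\tAT&T\nJones\t12\t<none>\n";
        System.out.print(convert(input));
    }
}
```

Nearly all of the code is string concatenation. The only XML-specific care needed here is the escaping step, which is why the parsing and cleanup of the input dominate the work.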



Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX
ISBN: 0201771861
Year: 2001
Pages: 191
