4.2 Tenet 5: Store data in flat text files

"Flat text files" means that you store all data as text. Period. Binary format files are verboten. No special file-system formats are allowed. This rules out a host of interesting, but nonportable, formats invented by vendors for proprietary purposes. Data files should consist only of a stream of bytes separated by line feed characters, or "newlines" in the lingo of the Unix world.

Many consider this a bitter pill to swallow, but Unix programmers swear that this is the best way. Here is the secret: While data is kept on any kind of storage media, eventually it must go somewhere. Data sitting on a disk diminishes in value. For data to remain alive and valuable, it must move occasionally. Otherwise it should be archived and deleted.

Data that goes nowhere is dead data.

If you expect to move your data easily, you must make it portable. Any impediments to data movement, whether unintentional or by design, place limits on your data's potential value. The longer your data must sit somewhere, the less it will be worth when it finally arrives. The problem is, if your data is not in a format that is useful at its destination, it must be converted. That conversion process takes time. Every second spent in data conversion eats away at your data's value.

The Cable News Network (CNN) won top honors in 1991 for its coverage of the Persian Gulf War. CNN provided the world with graphic scenes of the conflict and did it quickly. Many people rushed home to their television sets every night to watch the events unfold. Would the CNN coverage have been as riveting if the production staff had spent several days converting the videotape from Beta to VHS, airmailed the tapes to Atlanta, and shown them only during prime time?

So it is with your data. If it takes extra time to convert your data from a nonportable format to move it, the data will not be worth as much when it gets there. The world moves much too quickly to wait for your data.

4.2.1 Text is a common interchange format

Text is not necessarily the highest performing format; it's only the most common one. Other formats have been used in some applications, but none has found such wide acceptance as text. In nearly all cases, data encoded in text can be handled by target platforms.

By using text, you eliminate the difficulties of converting your data from one binary format to another. Few binary formats are standardized. Each vendor has defined its own binary encoding, and most of them are different. Converting from one vendor's format to another's can be an arduous task requiring anywhere from several days to several months. This time would be much better spent using the data.

4.2.2 Text is easily read and edited

It is possible to examine text data without conversion tools. If the data doesn't look right, you can use a standard text editor to modify it. Specialized tools are not required. You don't need a separate editor for each kind of data file. One size fits all.

The real power of text files becomes apparent when developing programs that use pipes under Unix. The pipe is a mechanism for passing one program's output to another program as input without using a temporary file. Many Unix programs are little more than a collection of smaller programs joined by a series of pipes. As developers prototype a program, they can easily check the data for accuracy at each point along the pipeline. If there is a problem, they can interrupt the flow through the pipeline and figure out whether the data or its manipulator is the problem. This greatly speeds up the development process, giving the Unix programmer a significant edge over programmers on other operating systems.
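This workflow is easy to sketch. In the pipeline below (the data file and its name-and-score layout are invented for illustration), tee captures the intermediate text so it can be inspected at that point without breaking the flow:

```shell
# Sample data, invented for illustration: one name and score per line.
printf 'alice 42\nbob 17\ncarol 99\n' > scores.txt

# Build the pipeline a stage at a time. tee writes the intermediate
# result to sorted.txt while still passing it along, so you can check
# the data at that point in the pipeline if something looks wrong.
sort -k2 -n scores.txt | tee sorted.txt | tail -n 1
```

If the final output looks wrong, a glance at sorted.txt shows whether the fault lies in the sort stage or in the tail stage.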

Text files also simplify the Unix user's interface with the system. Most administrative information under Unix is kept in flat text files and made available for universal inspection. This significantly reduces the amount of time spent by individuals in accessing the information to accomplish their daily work. Information about other users, systems on the network, and general statistics can be gleaned with minimal effort. Ironically, portable data here results in greater efficiency.
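For instance, the account database on most Unix systems lives in /etc/passwd, one colon-separated line per user, so ordinary text tools answer administrative questions directly. A small sketch, assuming the conventional seven-field /etc/passwd layout:

```shell
# Fields in /etc/passwd: login, password, UID, GID, comment, home, shell.

# List every login name together with its shell:
cut -d: -f1,7 /etc/passwd

# Show the superuser's entry:
grep '^root:' /etc/passwd
```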

4.2.3 Textual data files simplify the use of Unix text tools

Most Unix environments contain dozens of utilities for transmitting, modifying, and filtering text. Unix users employ these utilities in many combinations to do their daily work. The following list names some of the more popular ones, with a brief description of each:

awk

Perform functions on text arranged in fields

cut

Extract specific columns from lines of text

diff

Perform a line-by-line comparison of two text files

expand

Convert tab stops to spaces

expr

Extract part of a string from another string

fmt

A simple paragraph formatter

grep

Extract lines from a file containing a specified text string

head

Display the first n lines of a file

lex

Perform lexical analysis of a text stream

more

Display a text file one screen at a time

paste

Convert a single text column into multiple columns

roff

A comprehensive text formatter and typesetter

sed

A noninteractive text line editor

sort

Sort lines of text

tail

Display the last n lines of a file

test

Compare two strings for equality

tr

Replace selected characters in a file

wc

Count the number of lines, words, or characters in a file

Many of these utilities have other features besides those mentioned in the above list. For example, awk can mix alphabetical and numeric text interchangeably. Test can check the modes of files to learn whether they are writable by the user. Lex provides an interface to the C programming language driven by matching string expressions in the input stream. Sed by itself is powerful enough to replace commands like grep, head, and tail.
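To illustrate that last point (the file name here is invented), a few one-line sed commands reproduce the behavior of grep, head, and tail:

```shell
# Sample input, one word per line.
printf 'one\ntwo\nthree\nfour\n' > lines.txt

sed -n '/t/p' lines.txt   # like grep t: print only lines containing "t"
sed '2q' lines.txt        # like head -n 2: print two lines, then quit
sed -n '$p' lines.txt     # like tail -n 1: print only the last line
```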

The mixed-mode capabilities of these commands tend to blur the line between text and what is traditionally thought of as data. Hence, it is easier to represent in textual form that which was formerly stored in binary files. Unix programmers usually store numerical data in text files because the Unix environment provides a rich set of tools to manipulate those files.

Storing data as text and then manipulating it with a diverse set of small, text-oriented tools makes Unix systems formidable data processors. With tools such as awk, sed, and grep available on virtually all Unix and Linux systems, the ability to select, modify, and move data becomes more accessible to everyone. Even people who aren't programmers find it easy to read and interpret data stored in flat text files.
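As a small illustration (the figures are invented), totaling a column of numbers stored as plain text takes a single awk command:

```shell
# amounts.txt holds one decimal number per line, as plain text.
printf '3.50\n1.25\n4.00\n' > amounts.txt

# awk reads the text, treats field 1 as a number, and prints the total.
awk '{ sum += $1 } END { print sum }' amounts.txt
```

The same file remains readable in any editor and sortable, searchable, and countable with the other tools above; no dump utility or format converter stands between the user and the data.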

The developers of Hewlett-Packard's OpenVMS [1] operating system may be right in thinking that most people are afraid of the computer. Instead of shielding users from the system, though, Unix takes them inside it. It leads them through the labyrinthine logic trails, while they hold onto their last vestige of familiarity—namely, their data in a format that can be read and understood. For all the criticism of the "unfriendly Unix user interface," Unix may well be the friendliest system of all. Users can always look at their data without having to be system gurus skilled at interpreting complex binary file formats.

4.2.4 Increased portability overcomes the lack of speed

Throughout this discussion, you've probably been thinking, "Yeah, portability is nice, but what about performance?" It's true that using flat text files slows things down a bit. It takes two or three characters to represent the contents of one binary byte. So you're potentially talking about a 3:1 reduction in performance. This sounds significant, but it really isn't at all, except in high-resolution real-time applications or the rare multiterabyte data warehouse application. Even in, say, a huge data warehouse application, however, the user has the ability to absorb and analyze only a minuscule (by human standards, anyway) abstraction of the data. So at some point, the amount of data is reduced to an amount that is insignificant by CPU-cycle standards.
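The expansion is easy to measure. A 32-bit integer occupies four bytes in binary form; written out as decimal text it can take as many as eleven bytes, counting the newline:

```shell
# The largest 32-bit signed integer, rendered as text with a newline,
# occupies 11 bytes -- versus 4 bytes in its binary representation.
printf '%d\n' 2147483647 | wc -c
```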

Eventually, every application program is ported to a new system, or else it becomes extinct. The unrelenting progress of computer manufacturers assures us that what may have been prohibitively expensive today will be dirt cheap tomorrow. It doesn't pay to run an application on a slow system that is becoming increasingly costly to maintain.

The payoff in using text comes when you must port your application to a new architecture. If you had enough foresight to make your program portable, then with text it becomes a trivial matter to move your data to a new platform as well. Woe to software engineers who must port both data and program code. The data will be stale by the time it ever sees the new memory boards. The cumulative time lost by a 3:1 performance reduction pales in comparison with the weeks or months lost in moving the data to the new platform.

4.2.5 The lack of speed is overcome by next year's machine

We've acknowledged that text files impose a drag on performance. You could possibly realize up to a 3:1 reduction in speed. However, if the application meets today's minimum performance requirements, you can expect that next year's machine will yield a dramatic improvement—if your data can be ported.

As of this writing, next year's machine usually offers enough additional computing power to render any of today's performance concerns about text files superfluous. In other words, if your application barely performs adequately today, its speed will be ample tomorrow. In a few years you may even have to start thinking about how to slow it down so people can use it!

Case Study: One Unix Philosopher's Bag Of Tricks

We have seen that given a choice between high efficiency and high portability, Unix programmers' preference weighs heavily with the latter. As a result, their applications are often among the first to run on new platforms as they become available. This gives their software a definite edge in the marketplace. In a world where windows of opportunity open overnight and slam shut as soon as a month later, pursuing the portability priority can mean the difference between being an industry leader or being one of the others that wish they were.

How did Unix programmers come to embrace such beliefs? Most early software engineers weren't taught the importance of portability in school, at least not with any sense of conviction. More likely, they learned the value of portable code and data the best way: through firsthand experience.

Most Unix "gurus," as they're called, carry a collection of programs and shell scripts that make up their personal tool kit. These tools have followed them as they've moved from machine to machine, job to job, and company to company. For purposes of illustration, let's look at a Unix philosopher's bag of tricks.

My personal collection of tools has varied through the years. Here is a partial sample of those that have stood the tests of time and portability:

cal

A shell script front end to the Unix cal program that allows you to specify textual names for months instead of numbers. Borrowed from The Unix Programming Environment by Brian Kernighan and Rob Pike. [a]

cp

A "fumble finger" front end to the Unix cp program that prevents you from unintentionally overwriting an existing file

l

Runs the ls command with the -F switch specified

ll

Runs the ls command with the -l switch specified

mv

Similar to the cp script, it prevents you from unintentionally overwriting an existing file when renaming another file to the same name.

vit

Invokes the vi editor with the -t flag for use with tags and tag files. Tags make it easy to locate subroutines in a collection of files.

[a]© 1984, Bell Telephone Laboratories, Inc.
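The original cp front end is not reproduced here, but the idea behind it is simple enough to sketch as a small shell function. This is one plausible implementation under an invented name (safecp); the original would have shadowed cp itself on the user's PATH:

```shell
# A sketch of a "fumble finger" wrapper: copy only if the destination
# does not already exist. (Modern cp offers -i and -n for similar
# protection; early Unix versions did not.)
safecp() {
    for dest; do :; done            # leaves $dest set to the last argument
    if [ -e "$dest" ]; then
        echo "safecp: $dest exists; not overwritten" >&2
        return 1
    fi
    cp "$@"
}
```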

I have converted some scripts into aliases for use with the C shell, an interactive command interpreter originally found on Berkeley Unix systems. Aliases allow you to specify alternative forms of heavily used commands without having to resort to putting everything into shell scripts. Like shell scripts, they, too, are portable.
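As a sketch (these particular aliases mirror the l and ll scripts above), the C shell form goes in ~/.cshrc; Bourne-family shells use an equals sign instead:

```shell
# C shell (~/.cshrc):
alias l  'ls -F'
alias ll 'ls -l'

# Bourne/POSIX shell equivalent (~/.profile):
# alias l='ls -F'
# alias ll='ls -l'
```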

I originally built these tools under Unix Version 7 on a Digital PDP-11/70 at a small company engaged in the manufacture of newspaper composition systems. As the company added new systems for software development, I moved them to PDP-11/34, PDP-11/44, and LSI-11/23 systems also running Unix. This doesn't sound like a grand feat, given the renowned compatibility of the PDP-11 line, but wait. It gets better.

Eventually, I left the company in pursuit of other career opportunities, taking my tools with me on a nine-track tape. The C programs and shell scripts soon found a home on a Digital VAX-11/750. The VAX-11/750 had more horsepower than the smaller PDP-11s I'd been using. Consequently, they ran a bit faster at the new company. The tools picked up even more speed when the company replaced the VAX-11/750 with a VAX-11/780. All this happened without any modifications to the tools whatsoever.

About that time, workstations—those wondrous you-mean-I-can-have-the-whole-darn-computer-to-myself boxes—vaulted onto the scene. Everyone flocked to buy the hot new machines from Sun Microsystems, my employer included. So the tools that had been moved from the PDP-11 to the VAX line suddenly found themselves running without modifications on Sun workstations.

Having spent the greater part of my software engineering career in New England, I found the latest equipment from the original digital equipment maker to be fairly common within a hundred miles of Boston. Again, I ported my old reliable C programs and shell scripts to the Digital line, this time to the VAX 8600 series and later to the VAX 8800 series. Again, the tools ran without modification.

Necessity is the mother of midnight invention. A software engineer I was working with had noticed a large cache of Digital Professional 350s collecting dust in a warehouse. An enterprising individual concluded that these 350s would make fine personal computers for us at home, especially if they were running Unix. So he proceeded to port a version of Unix to the 350. My tools soon followed.

Then along came the parade of VAXstations and the early versions of the X Window System. A portable window system was a major step in the evolution of useful user interfaces. Despite all the whiz-bang effects of a window system, I still found that my favorite tools were very helpful in an xterm (terminal emulator) window.

But the computer business is a risky business. You must remain flexible to stay on top. To a software engineer, flexibility translates into portability. If your software does not port to the newest machine, it will die. Period. So when the RISC-based DECstation 3100s and 5000s came along, it was a port-or-die situation. My tools displayed their penchant for portability again.

Today my little bag of tricks runs on a variety of boxes, large and small, under various Linux distributions—without modification.

These C programs and shell scripts have seen more than twenty years of daily use, on different vendors' machines, under a variety of Unix versions, on everything from 16-bit to 32-bit to 64-bit CPU architectures running the gamut from PCs to microcomputers to mainframes. How many other programs do you know of that have been used for so long in so many environments?

My experience with these tools is hardly unusual. Unix and Linux programmers the world over have similar stories to tell. Nearly everyone who has used Unix or Linux on more than a casual basis has his or her own collection of goodies. Some undoubtedly have far more comprehensive tool kits than mine. Others have probably ported their software to even more platforms, all without modification and with virtually no user retraining.

The record of portability speaks for itself. By making your programs and data easy to port, you build a long-lasting, tangible value into your software. It's that simple. Code and data that opt for efficiency lock themselves into the architecture for which they were designed. With the onrush of new platforms, obsolescence preys on the nonportable. Instead of watching the worth of your software investment shrink to zero with each industry announcement, plan ahead. Design with portability in mind.

On the day you hold in your hands your first 100 PHz (petahertz) laptop with 500 exabytes of storage [2]—and this may be sooner than you think—be sure your software will be ready for it.


[1]OpenVMS was originally developed by Digital Equipment Corporation. Digital Equipment Corporation was later acquired by Compaq, which was later acquired by Hewlett-Packard.

[2]One petahertz equals one quadrillion (10^15) hertz. An exabyte is approximately one quintillion bytes, or about one billion gigabytes. And consider that next year's model will be even faster!