"Flat text files" means that you store all data as text. Period. Binary format files are
verboten. No special
Many consider this a bitter pill to swallow, but Unix programmers swear that this is the best way. Here is the secret: While data is kept on any kind of storage media, eventually it must go somewhere. Data sitting on a disk diminishes in value. For data to
Data that goes nowhere is dead data.
If you expect to move your data easily, you must make it portable. Any impediments to data movement, whether unintentional or by design, place limits on your data's potential value. The longer your data must sit somewhere, the less it will be worth when it finally arrives. The problem is, if your data is not in a format that is useful at its destination, it must be converted. That conversion process takes time. Every second spent in data conversion eats away at your data's value.
The Cable News Network (CNN) won top honors in 1991 for its coverage of the Persian Gulf War. CNN provided the world with graphic scenes of the conflict and did it quickly. Many people rushed home to their television sets every night to watch the events unfold. Would the CNN coverage have been as riveting if the production staff had spent several days converting the videotape from Beta to VHS, airmailed the tapes to Atlanta, and shown them only during prime time?
So it is with your data. If it takes extra time to convert your data from a nonportable format to move it, the data will not be worth as much when it gets there. The world moves much too quickly to wait for your data.
Text is not
By using text, you eliminate the difficulties of converting your data from one binary format to another. Few binary formats are standardized. Each vendor has defined its own binary encoding, and most of them are different. Converting from one vendor's format to another's can be an arduous task requiring
It is possible to examine text data without conversion tools. If the data doesn't look right, you can use a standard text editor to modify it. Specialized tools are not required. You don't need a separate editor for each kind of data file. One
The real power of text files becomes apparent when developing programs that use pipes under Unix. The pipe is a mechanism for passing one program's output to another program as input without using a temporary file. Many Unix programs are little more than a collection of smaller programs joined by a series of pipes. As developers prototype a program, they can easily check the data for accuracy at each point along the pipeline. If there is a problem, they can interrupt the flow through the pipeline and figure out whether the data or its
Text files also simplify the Unix
Most Unix environments contain dozens of utilities for transmitting, modifying, and filtering text. Unix users
| awk | Perform functions on text arranged in fields |
| cut | Extract specific |
| diff | Perform a |
| expand | Convert tab stops to spaces |
| expr | Extract part of a string from another string |
| fmt | A simple paragraph formatter |
| grep | Extract lines from a file containing a specified text string |
| head | Display the first n lines of a file |
| lex | Perform lexical analysis of a text stream |
| more | Display a text file one screen at a time |
| paste | Convert a single text column into multiple columns |
| roff | A comprehensive text formatter and typesetter |
| sed | A |
| sort | Sort a column of text |
| tail | Display the last n lines of a file |
| test | Compare two strings for equality |
| tr | Replace selected characters in a file |
| wc | Count the number of lines, words, or characters in a file |
Many of these utilities have other features besides those mentioned in the above list. For example, awk can mix alphabetical and numeric text interchangeably. Test can check the modes of files to learn whether they are writable by the user. Lex provides an interface to the C programming language driven by matching string expressions in the input stream. Sed by itself is powerful enough to replace commands like grep, head, and tail.
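To make the claim about sed concrete, here is a minimal sketch of sed standing in for grep, head, and tail. The sample file and its contents are made up for illustration, and the variable names are my own; only standard sed behavior is assumed.

```shell
#!/bin/sh
# Build a small sample file (hypothetical data) to work on.
sample=$(mktemp)
printf '%s\n' 'alpha one' 'beta two' 'gamma three' 'beta four' 'delta five' > "$sample"

# Like "grep beta FILE": suppress default output, print only matching lines.
sed_grep=$(sed -n '/beta/p' "$sample")

# Like "head -n 2 FILE": print lines as usual, then quit after line 2.
sed_head=$(sed 2q "$sample")

# Like "tail -n 1 FILE": suppress default output, print only the last line.
sed_tail=$(sed -n '$p' "$sample")

printf 'grep-like:\n%s\n' "$sed_grep"
printf 'head-like:\n%s\n' "$sed_head"
printf 'tail-like:\n%s\n' "$sed_tail"

rm -f "$sample"
```

The dedicated commands remain more convenient for everyday use; the point is only that one general text tool can cover the same ground.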
The mixed-mode capabilities of these commands tend to blur the line between text and what is traditionally thought of as data. Hence, it is easier to represent in textual form that which was formerly stored in binary files. Unix programmers usually store numerical data in text files because the Unix environment provides a rich set of tools to manipulate those files.
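As a rough illustration of numbers living comfortably in a text file, the sketch below totals revenue per region with awk and sort. The file layout (region, units, unit price) and the figures are hypothetical, invented only for this example.

```shell
#!/bin/sh
# Hypothetical sales figures stored as plain text: region, units, unit price.
data=$(mktemp)
cat > "$data" <<'EOF'
north 12 2.50
south 7 4.00
north 3 2.50
east 10 1.25
EOF

# awk uses field 1 as a string key and fields 2 and 3 as numbers in the
# same program; sort then orders the per-region totals by region name.
# LC_ALL=C keeps the decimal point and sort order independent of locale.
totals=$(LC_ALL=C awk '
    { revenue[$1] += $2 * $3 }
    END { for (r in revenue) printf "%s %.2f\n", r, revenue[r] }
' "$data" | LC_ALL=C sort)

printf '%s\n' "$totals"
rm -f "$data"
```

Because the data is text, the same file can be opened in any editor, corrected by hand, and fed straight back through the pipeline.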
Storing data as text and then manipulating it with a diverse set of small, text-oriented tools makes Unix systems formidable data processors. With tools such as awk, sed, and grep available on virtually all Unix and Linux systems, the ability to select, modify, and move data becomes more accessible to everyone. Even people who aren't programmers find it easy to read and interpret data stored in flat text files.
The developers of Hewlett-Packard's OpenVMS [1] operating system may be right in thinking that most people are afraid of the computer. Instead of shielding users from the system, though, Unix takes them inside it. It leads them through the labyrinthine logic trails, while they hold onto their last vestige of
Throughout this discussion, you've probably been thinking, "Yeah, portability is nice, but what about performance?" It's true that using flat text files slows things down a bit. It takes two or three characters to represent the contents of one binary byte. So you're
Eventually, every application program is ported to a new system, or else it becomes extinct. The unrelenting progress of computer manufacturers assures us that what is prohibitively expensive today will be dirt cheap tomorrow. It doesn't pay to run an application on a slow system that is becoming increasingly costly to maintain.
The payoff in using text comes when you must port your application to a new architecture. If you had enough foresight to make your program portable, then with text it becomes a trivial matter to move your data to a new platform as well. Woe to software
We've
As of this writing, next year's machine usually offers enough additional computing power to render any of today's performance concerns about text files superfluous. In other words, if your application
Case Study: One Unix Philosopher's Bag Of Tricks
We have seen that, given a choice between high efficiency and high portability, Unix programmers lean heavily toward the latter. As a result, their applications are often among the first to run on new platforms as they become available. This gives their software a definite edge in the
How did Unix programmers come to embrace such beliefs? Most early software engineers weren't taught the importance of portability in school, at least not with any sense of
Most Unix "gurus," as they're called, carry a collection of programs and shell scripts that make up their personal tool kit. These tools have followed them as they've moved from machine to machine, job to job, and company to company. For purposes of illustration, let's look at a Unix philosopher's bag of tricks.
My personal collection of tools has varied through the years. Here is a partial sample of those that have stood the tests of time and portability:
| cal | A shell script front end to the Unix cal program that allows you to specify textual |
| cp | A "fumble finger" front end to the Unix cp program that |
| l | Runs the ls command with the -F switch specified |
| ll | Runs the ls command with the -l switch specified |
| mv | Similar to the cp script, it prevents you from unintentionally overwriting an existing file by renaming another file to a file with the same |
| vit | Invokes the vi editor with the -t flag for use with tags and tag files. Tags make it easy to locate subroutines in a collection of files. |
[a] 1984, Bell Telephone Laboratories, Inc.
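The scripts themselves are not reproduced here, so the following is only a guess at the general idea, written as portable shell functions. The l and ll wrappers follow the descriptions in the table; safe_cp is a hypothetical name and a simplified stand-in for a "fumble finger" front end, handling only the plain two-argument form of cp.

```shell
#!/bin/sh
# l and ll: thin wrappers that add a single switch to ls.
l()  { ls -F "$@"; }
ll() { ls -l "$@"; }

# safe_cp (hypothetical name): copy one file, but refuse to clobber an
# existing destination. Only "safe_cp SOURCE DEST" is supported here.
safe_cp() {
    if [ "$#" -ne 2 ]; then
        echo "usage: safe_cp source dest" >&2
        return 2
    fi
    dest=$2
    # If the destination is a directory, the copy would land inside it,
    # so check for a same-named file there instead.
    if [ -d "$dest" ]; then
        dest=${dest%/}/$(basename "$1")
    fi
    if [ -e "$dest" ]; then
        echo "safe_cp: $dest already exists; not overwritten" >&2
        return 1
    fi
    cp "$1" "$dest"
}
```

Wrappers this small depend on nothing beyond the shell and the standard commands, which is a large part of why tools like them tend to survive a move from one machine to the next.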
I have converted some scripts into aliases for use with the C shell, an interactive command interpreter originally found on Berkeley Unix systems. Aliases allow you to specify alternative forms of heavily used commands without having to resort to
I originally built these tools under Unix Version 7 on a Digital PDP-11/70 at a small company engaged in the manufacture of newspaper composition systems. As the company added new systems for software development, I moved them to PDP-11/34, PDP-11/44, and LSI-11/23 systems also running Unix. This doesn't sound like a grand feat, given the renowned compatibility of the PDP-11 line, but wait. It gets better.
Eventually, I left the company in pursuit of other career opportunities, taking my tools with me on a nine-track tape. The C programs and shell scripts soon found a home on a Digital VAX-11/750. The VAX-11/750 had more horsepower than the smaller PDP-11s I'd been using. Consequently, the tools ran a bit faster at the new company. The tools picked up even more speed when the company
About that time, workstations—those wondrous you-mean-I-can-have-the-whole-darn-computer-to-
Having spent the greater part of my software engineering career in New England, I found the latest equipment from the original digital equipment maker to be
Necessity is the mother of midnight invention. A software engineer I was working with had noticed a large cache of Digital Professional 350s collecting dust in a warehouse. An enterprising individual concluded that these 350s would make fine personal computers for us at home,
Then along came the parade of VAXstations and the early versions of the X Window System. A portable window system was a major step in the evolution of useful user interfaces. Despite all the whiz-bang effects of a window system, I still found that my favorite tools were very helpful in an
But the computer business is a risky business. You must remain flexible to stay on top. To a software engineer, flexibility
Today my little bag of tricks runs on a variety of boxes, large and small, under various Linux distributions—without modification.
These C programs and shell scripts have seen more than twenty years of daily use, on different vendors' machines, under a variety of Unix versions, on everything from 16-bit to 32-bit to 64-bit CPU architectures running the
My experience with these tools is hardly unusual. Unix and Linux programmers the world over have similar stories to tell. Nearly everyone who has used Unix or Linux on more than a casual basis has his or her own collection of
The record of portability speaks for itself. By making your programs and data easy to port, you build a long-
On the day you hold in your hands your first 100 PHz (petahertz) laptop with 500 exabytes of storage [2] —and this may be sooner than you think—be sure your software will be ready for it.
[1] OpenVMS was originally developed by Digital Equipment Corporation. Digital Equipment Corporation was later acquired by Compaq, which was later
[2] One petahertz equals one quadrillion (10^15) hertz. An exabyte is approximately one quintillion bytes or about one billion gigabytes. And consider that next year's model will be even faster!