The Power of Plain Text

As Pragmatic Programmers, our base material isn't wood or iron, it's knowledge. We gather requirements as knowledge, and then express that knowledge in our designs, implementations, tests, and documents. And we believe that the best format for storing knowledge persistently is plain text. With plain text, we give ourselves the ability to manipulate knowledge, both manually and programmatically, using virtually every tool at our disposal.

What Is Plain Text?

Plain text is made up of printable characters in a form that can be read and understood directly by people. For example, although the following snippet is made up of printable characters, it is meaningless.

 Field19=467abe

The reader has no idea what the significance of 467abe may be. A better choice would be to make it understandable to humans.

 DrawingType=UMLActivityDrawing

Plain text doesn't mean that the text is unstructured; XML, SGML, and HTML are great examples of plain text that has a well-defined structure. You can do everything with plain text that you could do with some binary format, including versioning.

Plain text tends to be at a higher level than a straight binary encoding, which is usually derived directly from the implementation. Suppose you wanted to store a property called uses_menus that can be either TRUE or FALSE. Using text, you might write this as

 myprop.uses_menus=FALSE

Contrast this with 0010010101110101.

The problem with most binary formats is that the context necessary to understand the data is separate from the data itself. You are artificially divorcing the data from its meaning. The data may as well be encrypted; it is absolutely meaningless without the application logic to parse it. With plain text, however, you can achieve a self-describing data stream that is independent of the application that created it.
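As a rough illustration of what "self-describing" buys you, the sketch below reads the two property lines shown earlier without any knowledge of the application that wrote them. The helper names parse_properties and as_bool, the "#" comment convention, and the sample file contents are assumptions made for this example, not part of any particular format.

```python
def parse_properties(text):
    """Parse 'key=value' lines into a dict, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line; ignore it rather than fail
        props[key.strip()] = value.strip()
    return props


def as_bool(value):
    """Interpret TRUE/FALSE (any case) as a Python bool."""
    return value.strip().upper() == "TRUE"


# Hypothetical settings file built from the examples in the text.
sample = """
# hypothetical settings file
myprop.uses_menus=FALSE
DrawingType=UMLActivityDrawing
"""

props = parse_properties(sample)
print(props["DrawingType"])                 # UMLActivityDrawing
print(as_bool(props["myprop.uses_menus"]))  # False
```

Because the key travels with the value, the reader can ignore lines it does not understand and still recover the ones it does.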

Tip 20

Keep Knowledge in Plain Text

Drawbacks

There are two major drawbacks to using plain text: (1) It may take more space to store than a compressed binary format, and (2) it may be computationally more expensive to interpret and process a plain text file.

Depending on your application, either or both of these situations may be unacceptable, for example when storing satellite telemetry data, or as the internal format of a relational database.

But even in these situations, it may be acceptable to store metadata about the raw data in plain text (see Metaprogramming).

Some developers may worry that by putting metadata in plain text, they're exposing it to the system's users. This fear is misplaced. Binary data may be more obscure than plain text, but it is no more secure. If you worry about users seeing passwords, encrypt them. If you don't want them changing configuration parameters, include a secure hash ^[1] of all the parameter values in the file as a checksum.

^[1] MD5 is often used for this purpose. For an excellent introduction to the wonderful world of cryptography, see [Sch95].
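One way the checksum idea might look in practice is sketched below. It uses SHA-256 from Python's standard hashlib module rather than MD5; the function names, the canonical "sorted key=value" layout, and the parameter values are all assumptions chosen for illustration.

```python
import hashlib


def config_digest(params):
    """Return a SHA-256 hex digest over the sorted key=value pairs."""
    canonical = "\n".join(f"{k}={params[k]}" for k in sorted(params))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_unmodified(params, stored_digest):
    """True if the parameters still match the digest recorded earlier."""
    return config_digest(params) == stored_digest


params = {"uses_menus": "FALSE", "max_users": "25"}  # hypothetical values
digest = config_digest(params)  # would be written into the file

print(is_unmodified(params, digest))  # True
params["max_users"] = "9999"          # someone edits the file by hand
print(is_unmodified(params, digest))  # False
```

A plain digest stored next to the data mainly catches casual or accidental edits, since anyone who knows the scheme can recompute it. If deliberate tampering is a concern, a keyed hash (for instance via Python's hmac module) would be one option.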

The Power of Text

Since larger and slower aren't the most frequently requested features from users, why bother with plain text? What are the benefits?

Insurance against obsolescence
Leverage
Easier testing

Insurance Against Obsolescence

Human-readable forms of data, and self-describing data, will outlive all other forms of data and the applications that created them. Period.

As long as the data survives, you will have a chance to be able to use it, potentially long after the original application that wrote it is defunct.

You can parse such a file with only partial knowledge of its format; with most binary files, you must know all the details of the entire format in order to parse it successfully.

Consider a data file from some legacy system ^[2] that you are given. You know little about the original application; all that's important to you is that it maintained a list of clients' Social Security numbers, which you need to find and extract. Among the data, you see

^[2] All software becomes legacy as soon as it's written.

 <FIELD10>123-45-6789</FIELD10>
 ...
 <FIELD10>567-89-0123</FIELD10>
 ...
 <FIELD10>901-23-4567</FIELD10>

Recognizing the format of a Social Security number, you can quickly write a small program to extract that data, even if you have no information on anything else in the file.
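Such a "small program" could be as short as the sketch below, which relies only on the ddd-dd-dddd shape of the numbers. The FIELD7 line in the sample is made up to stand in for the parts of the file we know nothing about, and extract_ssns is a name invented for this example.

```python
import re

# Three digits, two digits, four digits, separated by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def extract_ssns(text):
    """Return every substring that looks like a Social Security number."""
    return SSN_PATTERN.findall(text)


# Sample data modeled on the snippet above; FIELD7 is a made-up extra field.
data = """
<FIELD10>123-45-6789</FIELD10>
<FIELD7>unrelated</FIELD7>
<FIELD10>567-89-0123</FIELD10>
<FIELD10>901-23-4567</FIELD10>
"""

print(extract_ssns(data))
# ['123-45-6789', '567-89-0123', '901-23-4567']
```

Note that the program never needs to know what FIELD10 means or how the rest of the file is laid out.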

But imagine if the file had been formatted this way instead:

 AC27123456789B11P
 ...
 XY43567890123QTYL
 ...
 6T2190123456788AM

You may not have recognized the significance of the numbers quite as easily. This is the difference between human readable and human understandable.

While we're at it, FIELD10 doesn't help much either. Something like

 <SSNO>123-45-6789</SSNO>

makes the exercise a no-brainer, and ensures that the data will outlive any project that created it.

Leverage

Virtually every tool in the computing universe, from source code management systems to compiler environments to editors and stand-alone filters, can operate on plain text.

The Unix Philosophy

Unix is famous for being designed around the philosophy of small, sharp tools, each intended to do one thing well. This philosophy is enabled by using a common underlying format: the line-oriented, plain text file. Databases used for system administration (users and passwords, networking configuration, and so on) are all kept as plain text files. (Some systems, such as Solaris, also maintain binary forms of certain databases as a performance optimization. The plain text version is kept as an interface to the binary version.)

When a system crashes, you may be faced with only a minimal environment to restore it (you may not be able to access graphics drivers, for instance). Situations such as this can really make you appreciate the simplicity of plain text.

For instance, suppose you have a production deployment of a large application with a complex site-specific configuration file (sendmail comes to mind). If this file is in plain text, you could place it under a source code control system (see Source Code Control), so that you automatically keep a history of all changes. File comparison tools such as diff and fc allow you to see at a glance what changes have been made, while sum allows you to generate a checksum to monitor the file for accidental (or malicious) modification.
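The same kind of comparison that diff performs is available from scripting languages too. The sketch below uses Python's standard difflib module on two versions of a configuration file; the file names and settings are invented for the example and do not correspond to any real sendmail configuration.

```python
import difflib

# Two hypothetical versions of a plain text configuration file.
old = """max_connections=100
relay_host=mail.example.com
log_level=info
""".splitlines(keepends=True)

new = """max_connections=250
relay_host=mail.example.com
log_level=info
""".splitlines(keepends=True)

diff = list(
    difflib.unified_diff(old, new, fromfile="config.old", tofile="config.new")
)
print("".join(diff), end="")  # full unified diff, as the diff tool would show

# Keep only the added/removed lines, dropping the ---/+++ file headers.
changed = [
    line.rstrip("\n")
    for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print(changed)  # ['-max_connections=100', '+max_connections=250']
```

None of this would be possible without first writing a format-specific tool if the configuration were stored in an opaque binary form.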

Easier Testing

If you use plain text to create synthetic data to drive system tests, then it is a simple matter to add, update, or modify the test data without having to create any special tools to do so. Similarly, plain text output from regression tests can be trivially analyzed (with diff, for instance) or subjected to more thorough scrutiny with Perl, Python, or some other scripting tool.
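To make the point about synthetic test data concrete, here is a minimal sketch in which the test cases live in a plain text table. The add function is only a stand-in for the code under test, and the "a b expected" column layout and "#" comment convention are assumptions made for this example.

```python
def parse_cases(text):
    """Each non-blank, non-comment line holds three integers: a b expected."""
    cases = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        a, b, expected = (int(field) for field in line.split())
        cases.append((a, b, expected))
    return cases


def add(a, b):
    """Stand-in for the code under test."""
    return a + b


TEST_DATA = """
# a  b  expected
1    2   3
-4   4   0
10   5   15   # adding a case is a one-line edit
"""

failures = [
    (a, b, expected)
    for a, b, expected in parse_cases(TEST_DATA)
    if add(a, b) != expected
]
print(failures)  # []
```

Adding, changing, or removing a case needs nothing more than a text editor, and the data file itself can be versioned and diffed like any other source.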

Lowest Common Denominator

Even in the future of XML-based intelligent agents that travel the wild and dangerous Internet autonomously, negotiating data interchange among themselves, the ubiquitous text file will still be there. In fact, in heterogeneous environments the advantages of plain text can outweigh all of the drawbacks. You need to ensure that all parties can communicate using a common standard. Plain text is that standard.

Related sections include:

Source Code Control
Code Generators
Metaprogramming
Blackboards
Ubiquitous Automation
It's All Writing

Challenges

Design a small address book database (name, phone number, and so on) using a straightforward binary representation in your language of choice. Do this before reading the rest of this challenge.
1. Translate that format into a plain text format using XML.
2. For each version, add a new, variable-length field called directions in which you might enter directions to each person's house.
What issues come up regarding versioning and extensibility? Which form was easier to modify? What about converting existing data?