0.1 What Is Text Processing? | Text Processing in Python

At the broadest level text processing is simply taking textual information and doing something with it. This doing might be restructuring or reformatting it, extracting smaller bits of information from it, algorithmically modifying the content of the information, or performing calculations that depend on the textual information. The lines between "text" and the even more general term "data" are extremely fuzzy; at an approximation, "text" is just data that lives in forms that people can themselves read at least in principle, and maybe with a bit of effort. Most typically computer "text" is composed of sequences of bits that have a "natural" representation as letters, numerals, and symbols; most often such text is delimited (if delimited at all) by symbols and formatting that can be easily pronounced as "next datum."

The lines are fuzzy, but the data that seems least like text and that, therefore, this particular book is least concerned with is the data that makes up "multimedia" (pictures, sounds, video, animation, etc.) and data that makes up UI "events" (draw a window, move the mouse, open an application, etc.). Like I said, the lines are fuzzy, and some representations of the most nontextual data are themselves pretty textual. But in general, the subject of this book is all the stuff on the near side of that fuzzy line.

Text processing is arguably what most programmers spend most of their time doing. The information that lives in business software systems mostly comes down to collections of words about the application domain maybe with a few special symbols mixed in. Internet communications protocols consist mostly of a few special words used as headers, a little bit of constrained formatting, and message bodies consisting of additional wordish texts. Configuration files, log files, CSV and fixed-length data files, error files, documentation, and source code itself are all just sequences of words with bits of constraint and formatting applied.

Programmers and developers spend so much time with text processing that it is easy to forget that that is what we are doing. The most common text processing application is probably your favorite text editor. Beyond simple entry of new characters, text editors perform such text processing tasks as search/replace and copy/paste, which given guided interaction with the user accomplish sophisticated manipulation of textual sources. Many text editors go farther than these simple capabilities and include their own complete programming systems (usually called "macro processing"); in those cases where editors include "Turing-complete" macro languages, text editors suffice, in principle, to accomplish anything that the examples in this book can.

After text editors, a variety of text processing tools are widely used by developers. Tools like "File Find" under Windows, or "grep" on Unix (and other platforms), perform the basic chore of locating text patterns. "Little languages" like sed and awk perform basic text manipulation (or even nonbasic). A large number of utilities especially in Unix-like environments perform small custom text processing tasks: wc, sort, tr, md5sum, uniq, split, strings, and many others.

At the top of the text processing food chain are general-purpose programming languages, such as Python. I wrote this book on Python in large part because Python is such a clear, expressive, and general-purpose language. But for all Python's virtues, text editors and "little" utilities will always have an important place for developers "getting the job done." As simple as Python is, it is still more complicated than you need to achieve many basic tasks. But once you get past the very simple, Python is a perfect language for making the difficult things possible (and it is also good at making the easy things simple).