Project 60. Learn the awk Text Processor | Mac OS X Unix 101 Byte-Sized Projects

Project 60. Learn the awk Text Processor

"How do I write a script to perform the same sequence of editing and processing commands on multiple text files?"

This project shows you how to use the awk text processing language to change text files by reading edit commands from a script. Project 58 shows how to apply such commands to a batch of files. Project 62 covers more advanced use of awk, and Projects 59 and 61 cover the sed command.

An Editing Language

The awk text processor was written to scan text files for matching patterns of text. Each pattern can have an associated action that prints or edits the text, or perhaps increments a counter or calls a function. It's not an interactive editor like nano; instead, it reads editing commands from a script. It's most often used to apply the same set of edits to many files, sometimes as a one-time operation, sometimes at regular intervals. You might want to reformat hundreds of files of source code or generate daily statistics by reading log files. The awk command is similar to the sed command (discussed in Project 59) but has a more powerful, C-like processing language.

Because awk is like a programming language, it's not as easy to master as sed. If you've had no experience writing code, you'll probably find it difficult to pick up without a more in-depth tutorial. You might want to skip the theoretical treatment until you've tried some of the examples in "Scripts for awk" later in this project.

The awk Basics

The awk command writes its output to standard out, so it can easily create new files as it processes existing ones. An awk script consists of editing commands; each command describes a pattern and an action. awk reads its input file line by line. If a line matches one (or more) of the patterns, awk applies the corresponding action(s) to that line. An input line may match many patterns and, therefore, will have many actions applied to it.

You may write the script directly on the command line or in a file. This project considers simple scripts of just a few lines. Scripts that are more complex are usually written to files and are the subject of Project 62.

The awk command sees each line of input it reads as a series of fields separated by white space. It defines its own special variables, $1 to $n, to represent field numbers 1 to n; $0 represents the entire line. These field variables are used just like any other variables: You can compare them; examine their contents by using string functions; or, most commonly, print them by using the print and printf statements.
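To see the field variables at work, here is a minimal sketch. The sample line is invented for illustration, and NF (awk's built-in count of fields on the current line) is a standard awk variable that this project doesn't otherwise cover.

```shell
# Hypothetical sample line; note the extra spaces between beta and gamma.
line='alpha beta   gamma'

# $1 and $3 are the first and third white-space-separated fields.
# NF is awk's built-in count of fields on the current line.
fields=$(printf '%s\n' "$line" | awk '{print $3, $1, NF}')
printf '%s\n' "$fields"

# $0 is the entire input line, spacing included.
whole=$(printf '%s\n' "$line" | awk '{print $0}')
printf '%s\n' "$whole"
```

Runs of white space count as a single separator, so the line has three fields even though the gaps between them differ in width.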

Patterns

The most basic pattern is an empty one that matches all lines in the input file. When not empty, a pattern may be a regular expression (such as /^$/ to select all empty lines) or plain text (such as /Sophie/ to select all lines containing the text Sophie).

More-complex patterns involve many regular expressions connected by the Boolean operators && (AND), || (OR), and ! (NOT). As an example, the pattern /Sophie/&&/rescue/ matches all lines that contain both Sophie and rescue, whereas the pattern /Sophie/&&!/rescue/ matches all lines that contain Sophie but not rescue.
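The two Boolean patterns just described can be tried on a small sample. The three input lines below are invented for this sketch.

```shell
# Hypothetical three-line input.
text='Sophie rushed to my rescue
Sophie drove home
A rescue at sea'

# Lines that contain both Sophie and rescue.
both=$(printf '%s\n' "$text" | awk '/Sophie/ && /rescue/ {print}')
printf '%s\n' "$both"

# Lines that contain Sophie but not rescue.
sophie_only=$(printf '%s\n' "$text" | awk '/Sophie/ && !/rescue/ {print}')
printf '%s\n' "$sophie_only"
```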

A pattern need not involve matching text against the input line but may be any expression allowed in the awk language. The pattern rand() < 0.5, for example, has a 50-50 chance of matching each line, and (jumping ahead a little) the pattern length($0) > 59 matches input lines longer than 59 characters.
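As a rough illustration of an expression used as a pattern, the sketch below keeps only lines longer than 59 characters. Both sample lines are made up for the purpose.

```shell
# Two hypothetical lines: one short, one well past 59 characters.
short='A short line.'
long='This line is deliberately padded out so that it runs past fifty-nine characters.'

# The pattern is an ordinary expression; no regular expression is involved.
kept=$(printf '%s\n%s\n' "$short" "$long" | awk 'length($0) > 59 {print}')
printf '%s\n' "$kept"
```

Only the long line is printed, because the expression is false for the short one.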

The awk Command Is Comprehensive

The awk command provides a powerful C-like processing language that includes features such as conditional statements, loops, variables, and functions. It would take a whole book to teach awk, and certainly more than two projects. To give you the most benefit from limited coverage, the two awk projects here illustrate the most useful awk techniques. Think of them as a sampling that reflects the potential of awk.

Tip

You can instruct awk to recognize field separators other than white space by using the special awk variable FS. To see how this works when commas are used as separators, see "Process a CSV File" in Project 62.
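As a quick taste, here is a minimal sketch of two common ways to set the separator: the standard -F command-line option and an assignment to FS in a BEGIN action (BEGIN is explained under "Print Fields"). The comma-separated record is invented for illustration.

```shell
# Hypothetical comma-separated record.
record='Janet,Forbes,7 am'

# -F sets the field separator from the command line.
with_F=$(printf '%s\n' "$record" | awk -F, '{print $2}')
printf '%s\n' "$with_F"

# Assigning FS in a BEGIN action has the same effect.
with_FS=$(printf '%s\n' "$record" | awk 'BEGIN {FS = ","} {print $3}')
printf '%s\n' "$with_FS"
```

With a comma as the separator, the space inside "7 am" no longer splits the field.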

Two patterns separated by a comma select all lines from the first line that matches the first pattern to the first subsequent line that matches the second pattern. To select just those lines in Chapter One, for example, we might specify the pattern /Chapter One/,/Chapter Two/ (assuming that Chapter One starts with the text "Chapter One," and similarly for Chapter Two).

Actions

Immediately following a pattern, awk expects to see an action delimited by braces. An action is a sequence of statements executed against each line that matches the pattern. Here are some of the most useful awk statements:

if(expression) statement [ else statement ]
while(expression) statement
for(expression ; expression ; expression) statement
for(var in array) statement
variable assignment
print [ expression-list ] [ > expression ]
printf format [, expression-list ] [ > expression ]
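Here is a small sketch that combines several of these statements: variable assignment, a for loop, if/else, and print. The numeric input is invented, and NF (the number of fields on the line) is a standard awk built-in not otherwise discussed here.

```shell
# Hypothetical input: two lines of numbers.
nums='3 4 5
10 20'

# For each line, add up the fields and label the total.
sums=$(printf '%s\n' "$nums" | awk '{
    total = 0                                  # variable assignment
    for (i = 1; i <= NF; i++) {                # loop over every field
        total = total + $i
    }
    if (total > 20) {
        print total, "big"
    } else {
        print total, "small"
    }
}')
printf '%s\n' "$sums"
```

The action has an empty pattern, so it runs once for every input line.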

Scripts for awk

An awk script consists of a list of pattern-action pairs. An action is a list of statements in braces to be applied to input lines that match its pattern. The following sections present some simple awk scripts.

Delete Blank Lines

Suppose that you wish to remove blank lines from a file. We construct a pattern matched by a nonblank line and an action that prints the line. Let's make our pattern the regular expression .+, which is matched by lines that consist of one or more characters, and let's make our action the statement print $0, which prints the line. Putting these together, we get

$ awk '/.+/{print $0}' sophie.txt > sophie2.txt

Learn More

Project 6 covers input/output redirection.

We read sophie.txt, detect nonblank lines, and print them to sophie2.txt by redirecting standard out.

It's not possible to write output straight back to the file being read: the shell empties the output file when it sets up the redirection, before awk has read any of it. The following trick produces the desired effect. The command before the semicolon redirects the filtered output from sophie.txt to a new file, tmp. When that command completes, the mv command renames tmp to sophie.txt, overwriting the original file with the filtered replacement.

$ awk '/.+/{print $0}' sophie.txt > tmp; mv tmp sophie.txt
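A slightly more cautious variant of the same trick is sketched below. It joins the two commands with && instead of a semicolon, so the original is replaced only if awk succeeds, and it works in a scratch directory created by mktemp. The file name notes.txt and its contents are invented for the example.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
printf 'first\n\nsecond\n' > "$workdir/notes.txt"

# Filter into a temporary file, then rename it over the original.
# With &&, mv runs only if awk exits successfully.
awk '/.+/{print $0}' "$workdir/notes.txt" > "$workdir/tmp" &&
    mv "$workdir/tmp" "$workdir/notes.txt"

result=$(cat "$workdir/notes.txt")
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -r "$workdir"
```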

Tip

If you don't specify $0, it's often assumed. The statement print $0 is equivalent to print. In fact, if no action is specified, the default action is to print the input line. Therefore, the command awk '/.+/{print $0}' sophie.txt is equivalent to awk '/.+/' sophie.txt.
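The equivalence is easy to check with a short sketch; the three-line sample (with a blank line in the middle) is invented.

```shell
# Hypothetical input with one blank line.
text='alpha

beta'

explicit=$(printf '%s\n' "$text" | awk '/.+/{print $0}')   # print $0 spelled out
implied=$(printf '%s\n' "$text" | awk '/.+/{print}')       # $0 assumed
default=$(printf '%s\n' "$text" | awk '/.+/')              # default action

printf '%s\n' "$explicit"
```

All three variables end up holding the same two nonblank lines.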

Make awk grep

The example above has awk behave like grep (see Project 23 for information on grep). Here's another example that searches the file vodka.txt for six.

$ awk '/six/{print}' vodka.txt

Print Fields

A very common use for awk is to scan an input file, perhaps matching specific lines, and print just certain fields of each line. Suppose that we wish to filter a long listing produced by typing

$ ls -l *.txt
-rw-r--r-- 1 saruman saruman 468 Aug 3 21:19 biff.txt
-rw-r--r-- 1 saruman saruman 37080 Aug 5 15:42 big-file.txt
...

We wish to display just the filename followed by the file size. We note that the filename is field 9, which awk recognizes as $9, and that the size is field 5, which awk recognizes as $5.

To realize this, type the following command.

$ ls -l *.txt | awk '$5>400 {print $9, $5}'
biff.txt 468
big-file.txt 37080
mark.txt 402

Tip

You'll often see a command combining grep and awk, such as

$ ps | grep 'bash' | awk '{print $1}'

With our newfound awk powers, we can eliminate grep from the pipe by typing

$ ps | awk '/bash/{print $1}'

In the ls -l command above, we've thrown in a pattern, too.

$5>400

Tip

The printf statement implemented by awk uses the printf library documented in Section 3 of the Unix manual. To read all about what printf can do for you, type

$ man 3 printf

The pattern $5>400 says to match a line if the value of field 5 is more than 400. Our filtered list, therefore, displays the names and sizes of files whose size is more than 400 bytes.

An alternative to print is the printf statement, which formats and embellishes its output according to a format string. The format string is a sequence of characters to be displayed onscreen, interspersed with special placeholder sequences. One such placeholder sequence is %ns, which displays a value space-padded so that it's n characters wide. Following the format string, we must provide the value required by each placeholder sequence.

We'll illustrate printf by repeating the example above but making the output more informative.

$ ls -l *.txt | awk 'BEGIN {print "Formatted Listing"}; $5>400 {printf "File: %-15s Size= %10s bytes\n", $9, $5}'
Formatted Listing
File: biff.txt        Size=        468 bytes
File: big-file.txt    Size=      37080 bytes
File: mark.txt        Size=        402 bytes

We threw in another trick, to generate the "Formatted Listing" header. The special pattern BEGIN matches the start of the input file, and we used it to perform a one-off action that is executed before any lines of input are read.
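The padding behavior can be tried in isolation. This sketch runs printf from a BEGIN action, so awk needs no input file. The square brackets are there only to make the padding visible, and the two values are borrowed from the listing above.

```shell
# %10s right-justifies in a 10-character field;
# %-10s left-justifies (the minus sign flips the padding to the right).
padded=$(awk 'BEGIN {printf "[%10s] [%-10s]\n", "468", "biff.txt"}')
printf '%s\n' "$padded"
```

The first value is preceded by seven spaces and the second is followed by two, so each fills exactly ten characters.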

Print and Skip Blocks

We can build a pattern to specify a range of lines, should we wish to process a block of text. Suppose that we wish to display just those lines between clearly attempting... and helped me up..., inclusive, from the text file sophie.txt. Here's the original file.


$ cat sophie.txt
I hopped out of the car and promptly ate gravel.
The non-retracting seat belt had wrapped itself around my ankle
clearly attempting to do what Sophie failed to do during
the drive home - kill me. :-)

Sophie rushed to my rescue,
helped me up, and brushed off the stones from my dress.
There are better ways to get "stoned"!

The pattern we use consists of two regular expressions separated by a comma, and we print the matched range of lines with a command such as

$ awk '/^clearly/,/^helped/' sophie.txt
clearly attempting to do what Sophie failed to do during
the drive home - kill me. :-)

Sophie rushed to my rescue,
helped me up, and brushed off the stones from my dress.

Next, let's employ a trick to skip the selected range of lines and print the rest of the file instead. To the previous pattern, we apply the action next, which says to skip to the next input line without further processing. We also specify a second pattern-action combination (with an empty pattern), separated from the first by a semicolon, whose action is print.

Normally, every pattern-action pair would be applied to every input line. The action next, however, specifically skips the rest of the script for input lines that match. The net result is that matching lines are skipped and nonmatching lines are printed.

$ awk '/^clearly/,/^helped/{next}; {print}' sophie.txt
I hopped out of the car and promptly ate gravel.
The non-retracting seat belt had wrapped itself around my ankle
There are better ways to get "stoned"!
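If you don't have sophie.txt at hand, the same two techniques can be tried on inline text. In this sketch the marker words START and STOP and the surrounding lines are invented.

```shell
# Hypothetical five-line input with a marked block in the middle.
text='keep 1
START
drop me
STOP
keep 2'

# Range pattern: print from the START line through the STOP line.
range=$(printf '%s\n' "$text" | awk '/^START/,/^STOP/')
printf '%s\n' "$range"

# Same range with next: skip the block and print everything else.
rest=$(printf '%s\n' "$text" | awk '/^START/,/^STOP/{next} {print}')
printf '%s\n' "$rest"
```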

Substitute with awk

Here's an example demonstrating text replacement (or substitution) in awk. The awk substitute function sub is similar to the s function of sed. Let's substitute lunch for breakfast in the file wakes.txt. Here's the original file.

$ cat wakes.txt
Sorry, breakfast is cancelled.
Nobody wakes Janet Forbes at 7 am and lives long enough to see breakfast.

We call the function sub and pass two parameters; the first is a regular expression to match against, and the second is the substitute text. On each line, the first piece of text that matches the regular expression is replaced with the substitute text (the related gsub function replaces every match), as in this example.

Tip

All awk functions are documented in the awk man pages.

$ awk '{sub("breakfast", "lunch"); print}' wakes.txt
Sorry, lunch is cancelled.
Nobody wakes Janet Forbes at 7 am and lives long enough to see lunch.
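Each line of wakes.txt contains breakfast only once, so the example doesn't show what happens when a line has several matches. The sketch below uses an invented line with two matches to contrast sub with the standard awk function gsub, which this project doesn't otherwise cover.

```shell
# Hypothetical line containing the search text twice.
line='breakfast, second breakfast'

# sub replaces only the first match on the line.
first_only=$(printf '%s\n' "$line" | awk '{sub("breakfast", "lunch"); print}')
printf '%s\n' "$first_only"

# gsub ("global sub") replaces every match on the line.
every=$(printf '%s\n' "$line" | awk '{gsub("breakfast", "lunch"); print}')
printf '%s\n' "$every"
```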

Because awk is essentially a programming language, you'll find that all features are exposed as traditional functions that take parameters enclosed in parentheses. Contrast this with the sed command, in which substitution uses the syntax s/breakfast/lunch/.

Like all awk scripts, this one has a pattern (empty) followed by an action. To be more selective about which lines we execute sub on, specify a nonempty pattern:

$ awk '/cancelled/{sub("breakfast", "lunch"); print}' wakes.txt
Sorry, lunch is cancelled.

Finally, if you also wish to print all lines in the file, move the print statement into its own action, preceded by an empty pattern.

$ awk '/cancelled/{sub("breakfast", "lunch")} {print}' wakes.txt
Sorry, lunch is cancelled.
Nobody wakes Janet Forbes at 7 am and lives long enough to see breakfast.