Specifying Actions | UNIX: The Complete Reference, Second Edition (Complete Reference Series)

The preceding sections have illustrated some of the patterns you can use. This section gives you a brief introduction to the kinds of actions that awk can take when it matches a pattern. An action can be as simple as printing a line or changing the value of a variable, or as complex as invoking control structures and user-defined functions.

Variables

The awk program allows you to create variables, assign values to them, and perform operations on them. Variables can contain strings or numbers. A variable name can be any sequence of letters and digits, beginning with a letter. Underscores are permitted as part of a variable name, for example, old_price. Unlike many programming languages, awk doesn’t require you to declare variables as numeric or string; they are assigned a type depending on how they are used. The type of a variable may change if it is used in a different way. All variables are initially set to null (or for numbers, 0). Variables are global throughout an awk program; the only exceptions are the parameters of user-defined functions, which are local to the function.
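As a minimal sketch of these typing rules, the following can be run from the shell. The variable names n and v are illustrative only and do not come from the examples in this chapter.

```shell
# Unset variables act as 0 in arithmetic and as the empty string in string
# contexts; a variable can hold a number and later a string.
out=$(awk 'BEGIN {
    print n + 1          # n was never set, so it counts as 0 here
    print "[" n "]"      # the same unset variable is the empty string here
    v = 5                # v holds a number
    v = v " apples"      # concatenation turns v into the string "5 apples"
    print v
}')
echo "$out"
```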

Built-in Variables

Table 21–2 shows the awk built-in variables. These variables either are set automatically or have a standard default value. For example, FILENAME is set to the name of the current input file as soon as the file is read. FS, the field separator, has a default value. Other commonly used built-in variables are NF, the number of fields in the current record (by default, each line is considered a record), and NR, the number of records read so far (which we used in the preceding example to count the number of lines in a file). ARGV is an array of the command-line arguments to your awk program.

Table 21–2: awk Built-in Variables
Variable	Meaning	Variable	Meaning
FS	Input field separator	NF	Number of fields in this record
OFS	Output field separator	NR	Number of records read so far
RS	Input record separator	FNR	Number of records from this file
ORS	Output record separator	RSTART	Set by match to the match index
ARGC	Number of arguments	RLENGTH	Set by match to the match length
ARGV	Array of arguments	OFMT	Output format for numbers
FILENAME	Name of input file	SUBSEP	Subscript separator for arrays

Built-in variables have uppercase names. They may contain string values (FILENAME, FS, OFS), or numeric values (NR, NF). You can reset the values of these variables. For example, you can change the default field separator by changing the value of FS.
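For instance, a short sketch of resetting FS, assuming a colon-separated input line made up for this illustration:

```shell
# Set FS in BEGIN so it takes effect before the first record is split.
out=$(printf 'root:x:0:0\n' | awk 'BEGIN { FS = ":" } { print NR, NF, $1 }')
echo "$out"   # record number, field count, first field
```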

Actions Involving Fields

You have already seen the field identifiers $1, $2, and so on. These are a special kind of built-in variable. You can assign values to them; change their values; and compare them to other variables, strings, or numbers. These operations allow you to create new fields, erase a field, or change the order of two or more fields.

For example, recall the inventory file, which contained the name of each item, the number on hand, the price paid for each, and the selling price. The entry for pencils is

 pencils 108 .11 .15

The following awk program calculates the total value of each item in the file:

 {
   $5 = $2 * $4
   print $0
 }

This program multiplies field 2 times field 4 and puts the result in a new field ($5), which is added at the end of the record. (By default, a record is one line.) The program also prints the new record with $0.
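Run against the pencils entry alone, the program behaves roughly as sketched here. The statements are separated by a semicolon only so that the program fits on one line.

```shell
# Append the total selling value (number on hand * selling price) as field 5.
out=$(printf 'pencils 108 .11 .15\n' | awk '{ $5 = $2 * $4; print $0 }')
echo "$out"
```

Assigning to a field makes awk rebuild $0 with the fields joined by the output field separator, which is why the new value appears at the end of the printed line.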

You can use the NF variable to access the last field in the current record. For example, suppose that some lines have four fields while others have five. Since NF is the number of fields, $NF is the field identifier for the last field in the record (just as, in a line with four fields, $4 is the identifier for the last field). You can add a new field at the end of each record by increasing the value of NF by one and assigning the new data to $NF. For example,

 /pencil/ {              # search for lines containing "pencil"
   NF += 1               # increase the number of fields
   $NF = "Empire"        # give the new last field the value "Empire"
 }
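The program above changes the record but does not print it. A sketch with a print statement added (an addition made here for illustration) shows the effect on the pencils entry:

```shell
# Grow the record by one field, fill it in, then print the rebuilt record.
out=$(printf 'pencils 108 .11 .15\n' |
      awk '/pencil/ { NF += 1; $NF = "Empire"; print }')
echo "$out"
```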

Record Separators

You have already seen many examples in which awk gets its input from a file. It normally reads one line at a time and treats each input line as a separate record. However, you might have a file with multiline records, such as a mailing list with separate lines for name, street, city, and state. To make it easier to read a file like this, you can change the record separator character.

The default separator is a newline. To change this, set the variable RS to an alternate separator. For example, to tell awk to use a blank line as a record separator, set the record separator to null in the BEGIN section of your program, like this:

 BEGIN   {RS=""}                # break records at blank lines

Now all of the lines up to the next blank line will be read in at once as a single record. You can use the variables $1, $2, and so on to work with the fields, just as you normally would.

When working with multiline records, you may wish to leave the field separator as a space (the default value), or you may wish to change it to a newline, with a statement such as

 BEGIN {RS=""; FS="\n"}         # separate fields at newlines

Then you can use the field identifiers to refer to complete lines of the record.
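A small sketch of this setup follows. The two-line address records are invented for the illustration.

```shell
# Records are separated by a blank line; each line of a record is one field.
out=$(printf 'Ann Lee\n12 Oak St\n\nBob Ray\n9 Elm Ave\n' |
      awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " $1 " / " $2 }')
echo "$out"
```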

Working with Strings

awk provides a full range of functions and operations for working with strings. For example, you can assign strings to variables, concatenate strings, extract substrings, and find the length of a string.

You already know how to assign a string to a variable:

 class = "music151"

Don’t forget the quotes around music151. If you do, awk will try to assign class the value of a variable named music151. Since you probably don’t have a variable by that name, class will end up set to null.

You can also combine several strings into one variable by writing them next to each other. For example, you could enter this at the command line (the comma in the input is part of the first field):

 $ awk '{studentID = $1 $3
 > print studentID}'
 Long, Adam 2008
 Long,2008

Similarly, you could use print $3 $2 with that input to print 2008Adam.

Some of the most useful string functions are length, which returns the length of a string; match, which searches for a regular expression within a string; and sub, which replaces the first match of a regular expression with a new string. You can use gsub to perform a “global” string substitution, in which everything in the line that matches a target regular expression is replaced by a new string. substr takes a string and returns the substring that starts at a given position and has a given length. The functions toupper and tolower, which change the case of a string, were not part of the original awk but are provided by gawk and other POSIX-compliant versions.

This program shows how you can use some of the string functions:

 length($0) > 10 {             # pattern matches any line longer than 10 characters
   gsub(/[0-9]+/, "---")       # replace all strings of digits with ---
   print substr($0, 1, 10)     # print the first ten characters of the new string
 }
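To see the effect, here is a sketch that feeds the program two sample lines invented for the purpose. Only the first is longer than 10 characters.

```shell
# The second input line is too short to match the pattern and is skipped.
out=$(printf 'item42 costs 7 dollars\nabc 1\n' | awk '
    length($0) > 10 {
        gsub(/[0-9]+/, "---")      # with no third argument, gsub works on $0
        print substr($0, 1, 10)    # first ten characters after substitution
    }')
echo "$out"
```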

Working with Numbers

awk includes the usual arithmetic operators +, -, *, and /. (Unlike in shell scripting, you do not need to quote * when multiplying in an awk program.) The % operator calculates the modulus of two numbers (the remainder from integer division), and the ^ operator is used for exponentiation.

In addition to =, you can use the assignment operators +=, -=, *=, /=, %=, and ^= as shortcuts. For example,

 { total += $1 }                       # add the value of $1 to total
 END { print "Average = " total/NR }   # divide total by the number of lines

will find the average of the numbers in the first field of the input.
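A quick check of this program, assuming three input lines chosen for the illustration:

```shell
# total accumulates 10 + 20 + 60; NR is 3 when the END action runs.
out=$(printf '10\n20\n60\n' |
      awk '{ total += $1 } END { print "Average = " total/NR }')
echo "$out"
```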

You can also use the C-style shortcuts ++ and -- to increment or decrement the value of a variable. For example,

 x++

is the same as x += 1 (or x=x+1).

awk provides a number of built-in arithmetic functions. These include trigonometric functions such as cos, the cosine function, and atan2, the arctangent function, as well as log, the natural logarithm, and exp, the exponential function. Other useful functions are int, which returns the integral part of a number, and rand, which generates a random number between 0 and 1. For example, you can compute the value of pi with

 atan2(1, 1) * 4                           # four times arctan of 1/1
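A brief sketch of these operators and functions, wrapped in a BEGIN action so that no input file is needed:

```shell
# print shows non-integer numbers with six significant digits by default (OFMT).
out=$(awk 'BEGIN {
    print atan2(1, 1) * 4            # pi
    print int(7 / 2), 7 % 2, 2 ^ 10  # integer part, remainder, exponentiation
}')
echo "$out"
```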

Arrays

It is particularly easy to create and use arrays in awk. Instead of declaring or defining an array, you define the individual array elements as needed and awk creates the array automatically. One feature of awk is that it uses associative arrays, that is, arrays that can use strings as well as numbers for subscripts. For example, votes["republican"] and votes["democratic"] could be two elements of an associative array.

You may be familiar with associative arrays from some other language, but by a different name. In Perl, they are called hashes, and in Python they are dictionaries. There is no built-in data type for associative arrays in C, but they are sometimes implemented with hash tables.

You define an element of an array by assigning a value to it. For example,

 stock[1] = $2

assigns the value of field 2 to the first element of the array stock. You do not need to define or declare an array before assigning its elements.

You can use a string as the element identifier. For example,

 number[$1] = $2

If the first field ($1) is pencil, and the second field ($2) is 108, this creates an array element:

 number["pencil"] = 108

When an element of an array has been defined, it can be used like any other variable. You can change it, use it in comparisons and expressions, and set variables or fields equal to it. For example, you could print the value of number["pencil"] with

 print number["pencil"]

You can delete an element of an array with

 delete array[subscript]

and you can test whether a particular subscript occurs in an array with

 subscript in array

where this expression will return a value of 1 if array[subscript] exists and 0 if it does not.
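The array operations above can be sketched together as follows. The eraser line and the "stapler" subscript are made up for the illustration.

```shell
# Build number[] from the input, then look up, test, and delete elements.
out=$(printf 'pencil 108\neraser 36\n' | awk '
    { number[$1] = $2 }
    END {
        print number["pencil"]
        print ("eraser" in number), ("stapler" in number)
        delete number["eraser"]
        print ("eraser" in number)   # 0 now that the element is gone
    }')
echo "$out"
```

Testing with in does not create the element, whereas merely referring to number["stapler"] in an expression would.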

Control Statements

awk provides control flow statements that allow you to test logical conditions (with if statements) or loop through blocks of code (with for and while statements). The syntax is similar to that used in C.

if... then Statements

The if statement evaluates an expression and performs an action if the expression was true. It has the form

 if (condition) action

For example, this statement checks the number of pencils in an inventory and alerts you if you are running low:

 /pencil/ {if ($2 < 144) print "Order more pencils"}

You can add an else clause to an if statement. For example,

 if (length(input) > 0)
    print "Good, we have input"
 else
    print "Nope, no input here"

awk provides a similar conditional form that can be used in an expression. The form is

 expression1 ? expression2 : expression3

If expression1 is true, the whole expression has the value of expression2; otherwise, it has the value of expression3. For example,

 rank = ($1 > 50000) ? "high" : "low"

sets rank to "high" if the first field is greater than 50000 and to "low" otherwise.
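A short sketch of the conditional expression at work, with two sample values assumed for the illustration:

```shell
# rank is chosen per record by the conditional expression.
out=$(printf '72000\n31000\n' |
      awk '{ rank = ($1 > 50000) ? "high" : "low"; print $1, rank }')
echo "$out"
```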

while Loops

A while loop is used to repeat a statement as long as some condition is met. The form is

 while (condition) {
     action
 }

For example, suppose you have a file in which different records contain different numbers of fields, such as a list of the test scores for each student in a school, where some students have more test scores than others, like this:

 Gignoux, Chris      97      88      95      92
 Landfield, Ryan     75      93      99      94      89

You could use while to loop through every field in each record, add up the total score, and print the average for each student:

 {
   sum = 0
   i = 2
   while (i <= NF) {
     sum += $i
     i++
   }
   average = sum / (NF - 1)
   print "The average for " $1 " is " average
 }

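In this program, i is a counter for each field in the record after the first field, which contains the student’s name. While i is less than or equal to NF (the number of fields in the record), sum is incremented by the contents of field i. The average is the sum divided by the number of fields containing numbers. Note that the program treats the whole name as field 1, so it assumes that the fields are separated by tabs and that FS has been set to a tab, for example, with BEGIN {FS="\t"}. With the default field separator, "Gignoux," and "Chris" would be two separate fields.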
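Here is a runnable sketch of the averaging program. It assumes that the name and the scores are separated by tabs and sets FS accordingly, so that the whole name is field 1.

```shell
# Tab-separated input keeps "Gignoux, Chris" together as a single field.
out=$(printf 'Gignoux, Chris\t97\t88\t95\t92\nLandfield, Ryan\t75\t93\t99\t94\t89\n' |
      awk 'BEGIN { FS = "\t" }
      {
          sum = 0
          i = 2
          while (i <= NF) {      # visit every score field
              sum += $i
              i++
          }
          average = sum / (NF - 1)
          print "The average for " $1 " is " average
      }')
echo "$out"
```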

The do-while statement is like the while statement, except that it executes the action first and then tests the condition, so the action always runs at least once. It has the form

 do action while(condition)

The break command is used to exit from a surrounding loop early. It can be included in a while loop or a for loop.
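The following sketch illustrates both points. The counters i and j and the limit of 20 are arbitrary values chosen for the illustration.

```shell
# A do-while body runs once even if its condition is false from the start;
# break leaves the enclosing loop immediately.
out=$(awk 'BEGIN {
    i = 5
    do {
        print "do-while ran with i = " i
        i++
    } while (i < 3)
    for (j = 1; j <= 10; j++) {
        if (j * j > 20)
            break              # stop at the first square above 20
    }
    print "stopped at j = " j
}')
echo "$out"
```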

for Loops

The for statement repeats an action as long as a condition is satisfied. The for statement includes an initial statement that is executed once before the loop begins, a test that is evaluated before each pass through the loop, and a statement that is executed at the end of each pass. It has the form

 for(initial statement; test; increment) statement

The for statement is usually used to repeat an action some number of times. The following example uses for to total the scores for each student and find the average, exactly like the while example just shown:

 {
   sum = 0
   for (i = 2; i <= NF; i++)
     sum += $i
   average = sum / (NF - 1)
   print "The average for " $1 " is " average
 }

You can use for loops to step through the elements of an array. For example, to count the number of tables in an HTML document, and the number of rows and cells in the tables, use this:

 /<TABLE>/ {count["table"]++}
 /<TR>/ {count["tablerow"]++}
 /<TD>/ {count["tablecell"]++}
 END {for (s in count) print s, count[s]}

The array is called count. As you find each pattern, you increment the counter with the appropriate subscript. After reading the file, you print out the totals.
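A sketch of this program run on a tiny stand-in for an HTML file. The order in which for (s in count) visits the subscripts is not specified, so the output is piped through sort here to make it predictable.

```shell
# Count tags, one per input line, then list the totals in sorted order.
out=$(printf '<TABLE>\n<TR>\n<TD>\n<TD>\n<TR>\n<TD>\n' | awk '
    /<TABLE>/ { count["table"]++ }
    /<TR>/    { count["tablerow"]++ }
    /<TD>/    { count["tablecell"]++ }
    END { for (s in count) print s, count[s] }' | LC_ALL=C sort)
echo "$out"
```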

Ending a Program

The exit command tells awk to stop reading input. When awk comes to an exit statement, it immediately executes the END action, if there is one, and then terminates. You might use this command to end a program if you discover an error in the input file, such as a missing field.
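As a sketch of the missing-field case, the program below stops at the first record that does not have four fields. The erasers and pens lines and the wording of the messages are assumptions made for the illustration.

```shell
# The third record is never read because exit stops input at the second.
out=$(printf 'pencils 108 .11 .15\nerasers 36\npens 12 .20 .35\n' | awk '
    NF != 4 { print "line " NR ": expected 4 fields, found " NF; exit }
    { print $1 }
    END { print "records read: " NR }')
echo "$out"
```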

User-Defined Functions

Like many programming languages, awk allows you to define your own functions within a program. Your functions may take parameters (arguments) and may return a value.

Once a function has been defined, it may be used in a pattern or action, in any place where you could use a built-in function.

To define a function, you specify its name, the parameters it takes, and the actions to perform. A function is defined by a statement of the form

 function function_name(list of parameters) {action_list}

For example, you can define a function called in_range, which takes the value of a field and returns 1 if the value is within a certain range and 0 otherwise, as follows:

 function in_range(testval, lower, upper) {
   if (testval > lower && testval < upper)
     return 1
   else
     return 0
 }

Make sure that there is no space between the function name and the parenthesis for the parameter list. The return statement is optional, but the function will not return a value if it is missing.

How to Call a Function

Once you have defined your function, you use it just like a built-in awk function. For example, you can use in_range as follows:

 if (in_range($5, 10, 15))
   print "Found a match!"

This lets you know when the value of the fifth field lies strictly between 10 and 15; the values 10 and 15 themselves are not in range.
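Putting the definition and the call together gives a complete program, sketched here with two made-up five-field records. Only the first has a fifth field inside the range.

```shell
# The function definition sits alongside the pattern-action rules.
out=$(printf 'a b c d 12\na b c d 15\n' | awk '
    function in_range(testval, lower, upper) {
        if (testval > lower && testval < upper)
            return 1
        else
            return 0
    }
    { if (in_range($5, 10, 15)) print "Found a match on line " NR }')
echo "$out"
```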

Functions may be recursive; that is, they may call themselves. A simple example of a recursive function is the factorial function:

 function factorial(n) {
   if (n <= 1)
     return 1
   else
     return n * factorial(n - 1)
 }

If you call this function in a program like this:

 print factorial(4)

it calculates and prints the value, which in this case would be 24 (because 4*3*2*1 is 24).
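A complete, runnable sketch of the factorial example, called from a BEGIN action so that no input is needed. The second call, factorial(10), is an extra value added for illustration.

```shell
# Each call multiplies n by the factorial of n - 1 until n reaches 1.
out=$(awk '
    function factorial(n) {
        if (n <= 1)
            return 1
        else
            return n * factorial(n - 1)
    }
    BEGIN { print factorial(4), factorial(10) }')
echo "$out"
```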