Chapter 7. The awk Utility: awk Programming

CONTENTS

7.1 Variables
7.2 Redirection and Pipes
7.3 Pipes
7.4 Closing Files and Pipes
7.5 Review
UNIX TOOLS LAB EXERCISE
7.6 Conditional Statements
7.7 Loops
7.8 Program Control Statements
7.9 Arrays
7.10 awk Built-In Functions
7.11 Built-In Arithmetic Functions
7.12 User-Defined Functions (nawk)
7.13 Review
UNIX TOOLS LAB EXERCISE
7.14 Odds and Ends
7.15 Review
UNIX TOOLS LAB EXERCISE

7.1 Variables

7.1.1 Numeric and String Constants

Numeric constants can be represented as integers, such as 243; floating point numbers, such as 3.14; or numbers in scientific notation, such as .723E-1 or 3.4e7. Strings, such as "Hello world", are enclosed in double quotes.

Initialization and Type Coercion. Just mentioning a variable in your awk program causes it to exist. A variable can be a string, a number, or both. When it is set, it becomes the type of the expression on the right-hand side of the equal sign.

Uninitialized variables have the value zero or the null string (""), depending on the context in which they are used.

name = "Nancy"     name is a string
x++                x is a number; x is initialized to zero and incremented by 1
number = 35        number is a number

To coerce a string to be a number:

name + 0

To coerce a number to be a string:

number ""

All fields and array elements created by the split function are considered strings, unless they contain only a numeric value. If a field or array element is null, it has the string value of null. An empty line is also considered to be a null string.

7.1.2 User-Defined Variables

User-defined variables consist of letters, digits, and underscores, and cannot begin with a digit. Variables in awk are not declared. Awk infers data type by the context of the variable in the expression. If the variable is not initialized, awk initializes string variables to null and numeric variables to zero. If necessary, awk will convert a string variable to a numeric variable, and vice versa. Variables are assigned values with awk's assignment operators. See Table 7.1.

Table 7.1. Assignment Operators
Operator	Meaning	Equivalence
=	a = 5	a = 5
+=	a = a + 5	a += 5
-=	a = a - 5	a -= 5
*=	a = a * 5	a *= 5
/=	a = a / 5	a /= 5
%=	a = a % 5	a %= 5
^=	a = a ^ 5	a ^= 5

The simplest assignment takes the result of an expression and assigns it to a variable.

FORMAT

variable = expression

Example 7.1

% nawk '$1 ~  /Tom/ {wage = $2 * $3; print wage}'  filename

EXPLANATION

Awk will scan the first field for Tom and when there is a match, it will multiply the value of the second field by the value of the third field and assign the result to the user-defined variable wage. Since the multiplication operation is arithmetic, awk assigns wage an initial value of zero. (The % is the UNIX prompt and filename is an input file.)

Increment and Decrement Operators. To add one to an operand, the increment operator is used. The expression x++ is equivalent to x = x + 1. Similarly, the decrement operator subtracts one from its operand. The expression x-- is equivalent to x = x - 1. This notation is useful in looping operations when you simply want to increment or decrement a counter. You can place the increment and decrement operators either before the operand, as in ++x, or after the operand, as in x++. If these expressions are used in assignment statements, their placement will make a difference in the result of the operation.

{x = 1;  y = x++ ; print x, y}

The ++ here is called a post-increment operator; y is assigned the value of one, and then x is increased by one, so that when all is said and done, y will equal one, and x will equal two.

{x = 1; y = ++x;  print x, y}

The ++ here is called a pre-increment operator; x is incremented first, and the value of two is assigned to y, so that when this statement is finished, y will equal two, and x will equal two.
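Both statements can be run from a BEGIN block, since no input is needed. This is a minimal sketch that assumes awk is a POSIX awk such as nawk.

```shell
# Post-increment: y receives the old value of x, then x is incremented.
post=$(awk 'BEGIN { x = 1; y = x++; print x, y }')
# Pre-increment: x is incremented first, then y receives the new value.
pre=$(awk 'BEGIN { x = 1; y = ++x; print x, y }')
echo "post-increment: $post"
echo "pre-increment:  $pre"
```

The first command prints 2 1 and the second prints 2 2.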

User-Defined Variables at the Command Line. A variable can be assigned a value at the command line and passed into an awk script. For more on processing arguments, see "Processing Command Arguments (nawk)" and ARGV, the array of command line arguments.

Example 7.2

% nawk -F: -f awkscript month=4 year=2001 filename

EXPLANATION

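The user-defined variables month and year are assigned the values 4 and 2001, respectively. In the awk script, these variables may be used as though they were created in the script. Note: The assignments are processed in order with the other command line arguments, after the BEGIN block has run, so the variables are not available in the BEGIN statements. If filename precedes the assignments, the variables will not be set while that file is being read. (See "BEGIN Patterns".)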

The -v Option (nawk). The -v option provided by nawk allows command line arguments to be processed within a BEGIN statement. For each argument passed at the command line, there must be a -v option preceding it.
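The difference between the two ways of passing a variable can be seen in a small sketch. It assumes a POSIX awk and the mktemp utility; the temporary input file is created only for this illustration.

```shell
# One-line input file, created only for this illustration.
input=$(mktemp)
printf 'one line\n' > "$input"

# With -v, the variable is already set when the BEGIN block runs.
with_v=$(awk -v month=4 'BEGIN { print "BEGIN sees month=" month }')

# As an ordinary argument, the assignment is processed after BEGIN,
# just before the file that follows it is read.
without_v=$(awk 'BEGIN { print "BEGIN sees month=" month }
                 { print "body sees month=" month }' month=4 "$input")
rm -f "$input"

echo "$with_v"
echo "$without_v"
```

In the second command, the BEGIN block prints an empty value for month, while the main action block prints 4.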

Field Variables. Field variables can be used like user-defined variables, except they reference fields. New fields can be created by assignment. A field value that is referenced and has no value will be assigned the null string. If a field value is changed, the $0 variable is recomputed using the current value of OFS as a field separator. The number of fields allowed is usually limited to 100.

Example 7.3

% nawk ' { $5 = 1000 * $3 / $2;  print } '  filename

EXPLANATION

If $5 does not exist, awk will create it and assign the result of the expression 1000 * $3 / $2 to the fifth field ($5). If the fifth field exists, the result will be assigned to it, overwriting what is there.
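How $0 is rebuilt after a field assignment can be sketched with a one-line input. The colon-separated record and the OFS value of a hyphen are made up for this illustration, and awk is assumed to be a POSIX awk.

```shell
# Assigning to a field past NF creates the fields in between as null
# and rebuilds $0 with OFS between the fields.
rebuilt=$(printf 'a:b:c\n' | awk 'BEGIN { FS = ":"; OFS = "-" }
{
    print            # record as read: a:b:c
    $5 = "e"         # creates $4 (null) and $5
    print            # record rebuilt with OFS
    print NF         # the number of fields is now 5
}')
echo "$rebuilt"
```

The two adjacent hyphens in the rebuilt record mark the null fourth field.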

Example 7.4

% nawk ' $4 == "CA" { $4  = "California"; print}'  filename

EXPLANATION

If the fourth field ($4) is equal to the string CA, awk will reassign the fourth field to California. The double quotes are essential. Without them, the strings become user-defined variables with an initial value of null.

Built-In Variables. Built-in variables have uppercase names. They can be used in expressions and can be reset. See Table 7.2 for a list of built-in variables.

Table 7.2. Built-In Variables
Variable Name	Variable Contents
ARGC	Number of command line arguments.
ARGV	Array of command line arguments.
FILENAME	Name of current input file.
FNR	Record number in current file.
FS	The input field separator, by default a space.
NF	Number of fields in current record.
NR	Number of records so far.
OFMT	Output format for numbers.
OFS	Output field separator.
ORS	Output record separator.
RLENGTH	Length of string matched by match function.
RS	Input record separator.
RSTART	Offset of string matched by match function.
SUBSEP	Subscript separator.

Example 7.5

(The Employees Database)
% cat employees2
Tom Jones:4423:5/12/66:543354
Mary Adams:5346:11/4/63:28765
Sally Chang:1654:7/22/54:650000
Mary Black:1683:9/23/44:336500

(The Command Line)
% nawk -F: '$1 == "Mary Adams"{print NR, $1, $2, $NF}' employees2

(The Output)
2 Mary Adams 5346 28765

EXPLANATION

The -F option sets the field separator to a colon. The print function prints the record number, the first field, the second field, and the last field ($NF).

7.1.3 BEGIN Patterns

The BEGIN pattern is followed by an action block that is executed before awk processes any lines from the input file. In fact, a BEGIN block can be tested without any input file, since awk does not start reading input until the BEGIN action block has completed. The BEGIN action is often used to change the value of the built-in variables, OFS, RS, FS, and so forth, to assign initial values to user-defined variables and to print headers or titles as part of the output.

Example 7.6

% nawk 'BEGIN{FS=":"; OFS="\t"; ORS="\n\n"}{print $1,$2,$3}' file

EXPLANATION

Before the input file is processed, the field separator (FS) is set to a colon, the output field separator (OFS) to a tab, and the output record separator (ORS) to two newlines. If there are two or more statements in the action block, they should be separated with semicolons or placed on separate lines (use a backslash to escape the newline character if at the shell prompt).

Example 7.7

% nawk 'BEGIN{print "MAKE YEAR"}'
MAKE YEAR

EXPLANATION

Awk will display MAKE YEAR. The print function is executed before awk opens the input file, and even though the input file has not been assigned, awk will still print MAKE and YEAR. When debugging awk scripts, you can test the BEGIN block actions before writing the rest of the program.

7.1.4 END Patterns

END patterns do not match any input lines, but execute any actions that are associated with the END pattern. END patterns are handled after all lines of input have been processed.

Example 7.8

% nawk 'END{print "The number of records is " NR }' filename
The number of records is 4

EXPLANATION

The END block is executed after awk has finished processing the file. The value of NR is the number of the last record read.

Example 7.9

% nawk '/Mary/{count++}END{print "Mary was found " count " times."}'\
  employees
Mary was found 2 times.

EXPLANATION

For every line that contains the pattern Mary, the value of the count variable is incremented by one. After awk has processed the entire file, the END block prints the string Mary was found, the value of count, and the string times.
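If no line matches, count is never set, and in the string context of the print statement it is the null string rather than 0. A common idiom, sketched here with made-up input and assuming a POSIX awk, is to add zero to force a numeric value.

```shell
# No input line contains Mary, so count stays uninitialized.
# (count + 0) forces numeric context, so 0 is printed instead of nothing.
none=$(printf 'Tom\nSally\n' |
    awk '/Mary/ { count++ }
         END { print "Mary was found " (count + 0) " times." }')
echo "$none"
```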

7.2 Redirection and Pipes

7.2.1 Output Redirection

When redirecting output from within awk to a UNIX file, the shell redirection operators are used. The filename must be enclosed in double quotes. When the > symbol is used, the file is opened and truncated. Once the file is opened, it remains open until explicitly closed or the awk program terminates. Output from subsequent print statements to that file will be appended to the file.

The >> symbol is used to open the file, but does not clear it out; instead it simply appends to it.

Example 7.10

% nawk '$4 >= 70 {print $1, $2  > "passing_file" }' filename

EXPLANATION

If the value of the fourth field is greater than or equal to 70, the first and second fields will be printed to the file passing_file.
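The truncate-once behavior of > can be checked with a short sketch. The input scores and the temporary file are made up for illustration; the sketch assumes a POSIX awk and the mktemp utility.

```shell
# The output file starts with old contents to show that > truncates it.
out=$(mktemp)
printf 'old contents\n' > "$out"

# The file is truncated when first opened; later prints in the same
# run are appended because the file stays open until close or exit.
printf 'Ann 82\nBob 65\nCy 70\n' |
    awk -v out="$out" '$2 >= 70 { print $1 > out } END { close(out) }'

passing=$(cat "$out")
rm -f "$out"
echo "$passing"
```

The old contents are gone and both matching names are present, which shows that the second print statement did not truncate the file again.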

7.2.2 Input Redirection (getline)

The getline Function. The getline function is used to read input from the standard input, a pipe, or a file other than the current file being processed. It gets the next line of input and sets the NF, NR, and the FNR built-in variables. The getline function returns one if a record is found and zero if EOF (end of file) is reached. If there is an error, such as failure to open a file, the getline function returns a value of -1.

Example 7.11

% nawk 'BEGIN{ "date" | getline d; print d}' filename
Thu Jan 14 11:24:24 PST 2001

EXPLANATION

Awk will execute the UNIX date command, pipe the output to getline, assign it to the user-defined variable d, and then print d.

Example 7.12

% nawk 'BEGIN{ "date " | getline d; split( d, mon) ; print mon[2]}'\
  filename
Jan

EXPLANATION

Awk will execute the date command and pipe the output to getline. The getline function will read from the pipe and store the input in a user-defined variable, d. The split function will create an array called mon out of variable d, and then the second element of the array mon will be printed.

Example 7.13

% nawk 'BEGIN{while("ls" | getline) print}'
a.out
db
dbook
getdir
file
sortedf

EXPLANATION

Awk will send the output of the ls command to getline; for each iteration of the loop, getline will read one more line of the output from ls and then print it to the screen. An input file is not necessary, since the BEGIN block is processed before awk attempts to open input.

Example 7.14

(The Command Line)
1  % nawk 'BEGIN{ printf "What is your name?" ;\
       getline name < "/dev/tty"}\
2  $1 ~ name {print "Found " name " on line " NR "."}\
3  END{print "See ya, " name "."}' filename

(The Output)
What is your name?  Ellie     < Waits for input from user >
Found Ellie on line 5.
See ya, Ellie.

EXPLANATION

1. Awk will print What is your name? to the screen and wait for the user's response; the getline function will accept input from the terminal (/dev/tty) until a newline is entered, and then store the input in the user-defined variable name.
2. If the first field matches the value assigned to name, the print function is executed.
3. The END statement prints out See ya, and then the value Ellie, stored in variable name, is displayed, followed by a period.

Example 7.15

(The Command Line)
% nawk 'BEGIN{while ((getline < "/etc/passwd") > 0) lc++; print lc}'\
  file

(The Output)
16

EXPLANATION

Awk will read each line from the /etc/passwd file, increment lc until EOF is reached, then print the value of lc, which is the number of lines in the passwd file.

Note

The value returned by getline is negative one if the file does not exist. If the end of file is reached, the return value is zero, and if a line was read, the return value is one. Therefore, the command

while ( getline < "/etc/junk")

would start an infinite loop if the file /etc/junk did not exist, since the return value of negative one yields a true condition.
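A defensive form of the loop tests explicitly for a return value greater than zero, so that both end of file (0) and an open failure (-1) end the loop. The sketch below assumes a POSIX awk and the mktemp utility; the path /nonexistent/junk is a hypothetical name assumed not to exist.

```shell
# Three-line file, created only for this illustration.
tmp=$(mktemp)
printf 'a\nb\nc\n' > "$tmp"

counts=$(awk -v file="$tmp" 'BEGIN {
    # Parentheses keep the < redirection separate from the > comparison.
    while ((getline line < file) > 0) n++
    close(file)
    # Missing file: getline returns -1, which is not > 0, so the loop ends.
    while ((getline line < "/nonexistent/junk") > 0) m++
    print n + 0, m + 0
}')
rm -f "$tmp"
echo "$counts"
```

The first loop counts three lines; the second loop ends immediately instead of looping forever.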

7.3 Pipes

If you open a pipe in an awk program, you must close it before opening another one. The command on the right-hand side of the pipe symbol is enclosed in double quotes. Only one pipe can be open at a time.

Example 7.16

(The Database)
% cat names
john smith
alice cheba
george goldberg
susan goldberg
tony tram
barbara nguyen
elizabeth lone
dan savage
eliza goldberg
john goldenrod

(The Command Line)
% nawk '{print $1, $2 | "sort -r +1 -2 +0 -1"}' names

(The Output)
tony tram
john smith
dan savage
barbara nguyen
elizabeth lone
john goldenrod
susan goldberg
george goldberg
eliza goldberg
alice cheba

EXPLANATION

Awk will pipe the output of the print statement as input to the UNIX sort command, which does a reversed sort using the second field as the primary key and the first field as the secondary key. The UNIX command must be enclosed in double quotes. (See "sort" in Appendix A.)
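Many current versions of sort no longer accept the +pos -pos key notation. The sketch below uses the POSIX -k notation for the same keys (second field first, then first field, both reversed). It assumes a POSIX awk and sort; LC_ALL=C is set only to make the ordering independent of the locale, and the four names are a subset of the database above.

```shell
# -k2,2 corresponds to +1 -2 and -k1,1 corresponds to +0 -1.
sorted=$(printf 'john smith\nalice cheba\nsusan goldberg\neliza goldberg\n' |
    awk '{ print $1, $2 | "LC_ALL=C sort -r -k2,2 -k1,1" }')
echo "$sorted"
```

The lines come out ordered by last name in reverse, with the two goldberg entries ordered by first name in reverse.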

7.4 Closing Files and Pipes

If you plan to use a file or pipe in an awk program again for reading or writing, you may want to close it first, since it remains open until the script ends. Once opened, a pipe remains open until awk exits, so statements in the END block are also affected by the pipe. In Example 7.17, the first line in the END block closes the pipe.

Example 7.17

(In Script)
1   { print $1, $2, $3 | "sort -r +1 -2 +0 -1" }
    END{
2       close("sort -r +1 -2 +0 -1")
        <rest of statements>
    }

EXPLANATION

1. Awk pipes each line from the input file to the UNIX sort utility.
2. When the END block is reached, the pipe is closed. The string enclosed in double quotes must be identical to the pipe string where the pipe was initially opened.
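The effect of closing the pipe can be seen with a small sketch, assuming a POSIX awk and sort; the three input words are made up for illustration.

```shell
report=$(printf 'pear\napple\nfig\n' | awk '
    { print | "sort" }           # each record goes to the sort pipe
    END {
        close("sort")            # sort runs to completion and writes its output here
        print NR " items"        # printed after the sorted lines
    }')
echo "$report"
```

Because close waits for sort to finish, the summary line appears after the sorted records.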

The system Function. The built-in system function takes a UNIX system command as its argument, executes the command, and returns the exit status to the awk program. It is similar to the C standard library function, also called system(). The UNIX command must be enclosed in double quotes.

FORMAT

system( "UNIX Command")

Example 7.18

(In Script)
    {
1       system ( "cat " $1 )
2       system ( "clear" )
    }

EXPLANATION

1. The system function takes the UNIX cat command and the value of the first field in the input file as its arguments. The cat command takes the value of the first field, a filename, as its argument. The UNIX shell causes the cat command to be executed.
2. The system function takes the UNIX clear command as its argument. The shell executes the command, causing the screen to be cleared.
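The exit status returned by system can be captured and tested. This sketch assumes a POSIX awk and uses the standard true and false utilities, which exit with status 0 and 1, respectively.

```shell
status=$(awk 'BEGIN {
    ok  = system("true")     # command succeeded: exit status 0
    bad = system("false")    # command failed: nonzero exit status
    print ok, bad
}')
echo "$status"
```

A script can use the returned value in a conditional, for example to report a failed command.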

7.5 Review

7.5.1 Increment and Decrement Operators

% cat datafile
northwest     NW    Joel Craig        3.0    .98    3    4
western       WE    Sharon Kelly      5.3    .97    5    23
southwest     SW    Chris Foster      2.7    .8     2    18
southern      SO    May Chin          5.1    .95    4    15
southeast     SE    Derek Johnson     4.0    .7     4    17
eastern       EA    Susan Beal        4.4    .84    5    20
northeast     NE    TJ Nichols        5.1    .94    3    13
north         NO    Val Shultz        4.5    .89    5    9
central       CT    Sheri Watson      5.7    .94    5    13

Example 7.19

% nawk '/^north/{count += 1; print count}' datafile
1
2
3

EXPLANATION

If the record begins with the regular expression north, a user-defined variable, count, is created; count is incremented by 1 and its value is printed.

Example 7.20

% nawk '/^north/{count++; print count}' datafile
1
2
3

EXPLANATION

The auto-increment operator increments the user-defined variable count by 1. The value of count is printed.

Example 7.21

% nawk '{x = $7--; print "x = "x ", $7 = "$7}' datafile
x = 3, $7 = 2
x = 5, $7 = 4
x = 2, $7 = 1
x = 4, $7 = 3
x = 4, $7 = 3
x = 5, $7 = 4
x = 3, $7 = 2
x = 5, $7 = 4
x = 5, $7 = 4

EXPLANATION

After the value of the seventh field ($7) is assigned to the user-defined variable x, the auto-decrement operator decrements the seventh field by one. The value of x and the seventh field are printed.

7.5.2 Built-In Variables

Example 7.22

% nawk '/^north/{print "The record number is " NR}' datafile
The record number is 1
The record number is 7
The record number is 8

EXPLANATION

If the record begins with the regular expression north, the string The record number is and the value of NR (record number) are printed.

Example 7.23

% nawk '{print NR, $0}' datafile
1 northwest     NW    Joel Craig        3.0    .98    3    4
2 western       WE    Sharon Kelly      5.3    .97    5    23
3 southwest     SW    Chris Foster      2.7    .8     2    18
4 southern      SO    May Chin          5.1    .95    4    15
5 southeast     SE    Derek Johnson     4.0    .7     4    17
6 eastern       EA    Susan Beal        4.4    .84    5    20
7 northeast     NE    TJ Nichols        5.1    .94    3    13
8 north         NO    Val Shultz        4.5    .89    5    9
9 central       CT    Sheri Watson      5.7    .94    5    13

EXPLANATION

The value of NR, the number of the current record, and the value of $0, the entire record, are printed.

Example 7.24

% nawk 'NR==2,NR==5{print NR, $0}' datafile
2 western       WE    Sharon Kelly      5.3    .97    5    23
3 southwest     SW    Chris Foster      2.7    .8     2    18
4 southern      SO    May Chin          5.1    .95    4    15
5 southeast     SE    Derek Johnson     4.0    .7     4    17

EXPLANATION

If the value of NR is in the range between 2 and 5 (record numbers 2 through 5), the number of the record (NR) and the record ($0) are printed.

Example 7.25

% nawk '/^north/{print NR, $1, $2, $NF, RS}' datafile
1 northwest NW 4

7 northeast NE 13

8 north NO 9


EXPLANATION

If the record begins with the regular expression north, the number of the record (NR), followed by the first field, the second field, the value of the last field (NF preceded by a dollar sign), and the value of RS (a newline) are printed. Since the print function generates a newline by default, RS will generate another newline, resulting in double spacing between records.

% cat datafile2
Joel Craig:northwest:NW:3.0:.98:3:4
Sharon Kelly:western:WE:5.3:.97:5:23
Chris Foster:southwest:SW:2.7:.8:2:18
May Chin:southern:SO:5.1:.95:4:15
Derek Johnson:southeast:SE:4.0:.7:4:17
Susan Beal:eastern:EA:4.4:.84:5:20
TJ Nichols:northeast:NE:5.1:.94:3:13
Val Shultz:north:NO:4.5:.89:5:9
Sheri Watson:central:CT:5.7:.94:5:13

Example 7.26

% nawk -F: 'NR == 5{print NF}' datafile2
7

EXPLANATION

The field separator is set to a colon at the command line with the -F option. If the number of the record (NR) is 5, the number of fields (NF) is printed.

Example 7.27

% nawk 'BEGIN{OFMT="%.2f";print 1.2456789,12E-2}' datafile2
1.25 0.12

EXPLANATION

The OFMT, output format variable for the print function, is set so that floating point numbers will be printed with a decimal-point precision of two digits. The numbers 1.2456789 and 12E-2 are printed in the new format.
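OFMT applies only to numbers that are not integers; integer values and strings are printed unchanged. A minimal sketch, assuming a POSIX awk:

```shell
# 1.2456789 and 12E-2 are formatted with OFMT;
# the integer 17 and the string "3.14159" are printed as they are.
ofmt=$(awk 'BEGIN { OFMT = "%.2f"; print 1.2456789, 12E-2, 17, "3.14159" }')
echo "$ofmt"
```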

Example 7.28

% nawk '{$9 = $6 * $7; print $9}' datafile
2.94
4.85
1.6
3.8
2.8
4.2
2.82
4.45
4.7

EXPLANATION

The result of multiplying the sixth field ($6) and the seventh field ($7) is stored in a new field, $9, and printed. There were eight fields; now there are nine.

Example 7.29

% nawk '{$10 = 100; print NF, $9, $0}' datafile
10  northwest NW Joel Craig 3.0 .98 3 4  100
10  western WE Sharon Kelly 5.3 .97 5 23  100
10  southwest SW Chris Foster 2.7 .8 2 18  100
10  southern SO May Chin 5.1 .95 4 15  100
10  southeast SE Derek Johnson 4.0 .7 4 17  100
10  eastern EA Susan Beal 4.4 .84 5 20  100
10  northeast NE TJ Nichols 5.1 .94 3 13  100
10  north NO Val Shultz 4.5 .89 5 9  100
10  central CT Sheri Watson 5.7 .94 5 13  100

EXPLANATION

The tenth field ($10) is assigned 100 for each record. This is a new field. The ninth field ($9) does not exist, so it will be considered a null field. The number of fields is printed (NF), followed by the value of $9, the null field, and the entire record ($0). The value of the tenth field is 100.

7.5.3 BEGIN Patterns

Example 7.30

% nawk 'BEGIN{print "---------EMPLOYEES---------"}'
---------EMPLOYEES---------

EXPLANATION

The BEGIN pattern is followed by an action block. The action is to print out the string ---------EMPLOYEES--------- before opening the input file. Note that an input file has not been provided and awk does not complain.

Example 7.31

% nawk 'BEGIN{print "\t\t---------EMPLOYEES-------\n"}\
  {print $0}' datafile
                ---------EMPLOYEES-------

northwest     NW    Joel Craig        3.0    .98    3    4
western       WE    Sharon Kelly      5.3    .97    5    23
southwest     SW    Chris Foster      2.7    .8     2    18
southern      SO    May Chin          5.1    .95    4    15
southeast     SE    Derek Johnson     4.0    .7     4    17
eastern       EA    Susan Beal        4.4    .84    5    20
northeast     NE    TJ Nichols        5.1    .94    3    13
north         NO    Val Shultz        4.5    .89    5    9
central       CT    Sheri Watson      5.7    .94    5    13

EXPLANATION

The BEGIN action block is executed first. The title ---------EMPLOYEES------- is printed. The second action block prints each record in the input file. When breaking lines at the shell prompt, the backslash is used to escape the newline. Lines can be broken at a semicolon or a curly brace.

Example 7.32

% nawk 'BEGIN{ FS=":";OFS="\t"};/^Sharon/{print $1, $2, $7 }'\
  datafile2
Sharon Kelly    western    23

EXPLANATION

The BEGIN action block is used to initialize variables. The FS variable (field separator) is assigned a colon. The OFS variable (output field separator) is assigned a tab (\t). If a record begins with the regular expression Sharon, the first, second, and seventh fields ($1, $2, $7) are printed. (Records in datafile2 have seven fields, so the seventh field is the last one.) Each field in the output is separated by a tab.

7.5.4 END Patterns

Example 7.33

% nawk 'END{print "The total number of records is " NR}' datafile
The total number of records is 9

EXPLANATION

After awk has finished processing the input file, the statements in the END block are executed. The string The total number of records is is printed, followed by the value of NR, the number of the last record.

Example 7.34

% nawk '/^north/{count++}END{print count}' datafile
3

EXPLANATION

If the record begins with the regular expression north, the user-defined variable count is incremented by one. When awk has finished processing the input file, the value stored in the variable count is printed.

7.5.5 awk Script with BEGIN and END

Example 7.35

    # Second awk script-- awk.sc2
1   BEGIN{ FS=":"; OFS="\t"
       print "  NAME\t\tDISTRICT\tQUANTITY"
       print "___________________________________________\n"
    }
2      {print $1"\t  " $3"\t\t" $7}
       {total+=$7}
       /north/{count++}
3   END{
       print "---------------------------------------------"
       print "The total quantity is " total
       print "The number of northern salespersons is " count "."
    }

(The Output)
4  % nawk -f awk.sc2 datafile2
     NAME          DISTRICT   QUANTITY
   ___________________________________________

   Joel Craig        NW       4
   Sharon Kelly      WE       23
   Chris Foster      SW       18
   May Chin          SO       15
   Derek Johnson     SE       17
   Susan Beal        EA       20
   TJ Nichols        NE       13
   Val Shultz        NO       9
   Sheri Watson      CT       13
   ---------------------------------------------
   The total quantity is 132
   The number of northern salespersons is 3.

EXPLANATION

1. The BEGIN block is executed first. The field separator (FS) and the output field separator (OFS) are set. Header output is printed.
2. The body of the awk script contains statements that are executed for each line of input coming from datafile2.
3. Statements in the END block are executed after the input file has been closed, i.e., before awk exits.
4. At the command line, the nawk program is executed. The -f option is followed by the script name, awk.sc2, and then by the input file, datafile2.

7.5.6 The printf Function

Example 7.36

% nawk '{printf "$%6.2f\n",$6 * 100}' datafile
$ 98.00
$ 97.00
$ 80.00
$ 95.00
$ 70.00
$ 84.00
$ 94.00
$ 89.00
$ 94.00

EXPLANATION

The printf function formats a floating point number to be right-justified (the default) in a field six characters wide; the width counts the digits to the left of the decimal point, the decimal point itself, and the two digits to its right. The number is rounded to two decimal places and printed.
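The effect of the field width and of the minus flag can be sketched with a few fixed values. The square brackets are added only to make the padding visible, and awk is assumed to be a POSIX awk.

```shell
formatted=$(awk 'BEGIN {
    printf "[%6.2f]\n", 98        # right-justified, padded to six characters
    printf "[%6.2f]\n", 0.846     # rounded to two decimal places
    printf "[%-6.2f]\n", 98       # the minus flag left-justifies the number
    printf "[%6.2f]\n", 1234.5    # wider than six characters: the field grows
}')
echo "$formatted"
```

The width is a minimum, not a maximum: a value that needs more than six characters is printed in full.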

Example 7.37

% nawk '{printf "|%-15s|\n",$4}' datafile
|Craig          |
|Kelly          |
|Foster         |
|Chin           |
|Johnson        |
|Beal           |
|Nichols        |
|Shultz         |
|Watson         |

EXPLANATION

A left-justified, 15-space string is printed. The fourth field ($4) is printed enclosed in vertical bars to illustrate the spacing.

7.5.7 Redirection and Pipes

Example 7.38

% nawk '/north/{print $1, $3, $4 > "districts"}' datafile
% cat districts
northwest Joel Craig
northeast TJ Nichols
north Val Shultz

EXPLANATION

If the record contains the regular expression north, the first, third, and fourth fields ($1, $3, $4) are printed to an output file called districts. Once the file is opened, it remains open until closed or the program terminates. The filename "districts" must be enclosed in double quotes.

Example 7.39

% nawk '/south/{print $1, $2, $3 >> "districts"}' datafile
% cat districts
northwest Joel Craig
northeast TJ Nichols
north Val Shultz
southwest SW Chris
southern SO May
southeast SE Derek

EXPLANATION

If the record contains the pattern south, the first, second, and third fields ($1, $2, $3) are appended to the output file districts.

7.5.8 Opening and Closing a Pipe

Example 7.40

# awk script using pipes -- awk.sc3
1   BEGIN{
2       printf " %-22s%s\n", "NAME", "DISTRICT"
        print "--------------------------------------"
3   }
4   /west/{count++}
5   {printf "%s %s\t\t%-15s\n", $3, $4, $1 | "sort +1" }
6   END{
7       close("sort +1")
        printf "The number of sales persons in the western "
        printf "region is " count "."
    }

(The Output)
% nawk -f awk.sc3 datafile
 NAME                  DISTRICT
--------------------------------------
Susan Beal             eastern
May Chin               southern
Joel Craig             northwest
Chris Foster           southwest
Derek Johnson          southeast
Sharon Kelly           western
TJ Nichols             northeast
Val Shultz             north
Sheri Watson           central
The number of sales persons in the western region is 3.

EXPLANATION

1. The special BEGIN pattern is followed by an action block. The statements in this block are executed first, before awk processes the input file.
2. The printf function displays the string NAME as a 22-character, left-justified string, followed by the string DISTRICT.
3. The BEGIN block ends.
4. Now awk will process the input file, one line at a time. If the pattern west is found, the action block is executed, i.e., the user-defined variable count is incremented by one. The first time awk encounters the count variable, it will be created and given an initial value of zero.
5. The printf function formats its output and sends it to a pipe. After all of the output has been collected, it will be sent to the sort command.
6. The END block is started.
7. The pipe (sort +1) must be closed with exactly the same command that opened it; in this example, "sort +1". Otherwise, the END statements will be sorted with the rest of the output.

UNIX TOOLS LAB EXERCISE

Lab 5: nawk Exercise

Mike Harrington:(510) 548-1278:250:100:175

Christian Dobbins:(408) 538-2358:155:90:201

Susan Dalsass:(206) 654-6279:250:60:50

Archie McNichol:(206) 548-1348:250:100:175

Jody Savage:(206) 548-1278:15:188:150

Guy Quigley:(916) 343-6410:250:100:175

Dan Savage:(406) 298-7744:450:300:275

Nancy McNeil:(206) 548-1278:250:80:75

John Goldenrod:(916) 348-4278:250:100:175

Chet Main:(510) 548-5258:50:95:135

Tom Savage:(408) 926-3456:250:168:200

Elizabeth Stachelin:(916) 440-1763:175:75:300

(Refer to the database called lab5.data on the CD.)

The database above contains the names, phone numbers, and money contributions to the party campaign for the past three months.

Write a nawk script to produce the following output:

% nawk -f nawk.sc db
                    ***CAMPAIGN 1998 CONTRIBUTIONS***
----------------------------------------------------------------------
NAME                 PHONE            Jan  |  Feb  |  Mar | Total Donated
----------------------------------------------------------------------
Mike Harrington     (510) 548-1278    250.00   100.00  175.00   525.00
Christian Dobbins   (408) 538-2358    155.00    90.00  201.00   446.00
Susan Dalsass       (206) 654-6279    250.00    60.00   50.00   360.00
Archie McNichol     (206) 548-1348    250.00   100.00  175.00   525.00
Jody Savage         (206) 548-1278     15.00   188.00  150.00   353.00
Guy Quigley         (916) 343-6410    250.00   100.00  175.00   525.00
Dan Savage          (406) 298-7744    450.00   300.00  275.00  1025.00
Nancy McNeil        (206) 548-1278    250.00    80.00   75.00   405.00
John Goldenrod      (916) 348-4278    250.00   100.00  175.00   525.00
Chet Main           (510) 548-5258     50.00    95.00  135.00   280.00
Tom Savage          (408) 926-3456    250.00   168.00  200.00   618.00
Elizabeth Stacheli  (916) 440-1763    175.00    75.00  300.00   550.00
----------------------------------------------------------------------
                              SUMMARY
----------------------------------------------------------------------
The campaign received a total of $6137.00 for this quarter.
The average donation for the 12 contributors was $511.42.
The highest contribution was $450.00.
The lowest contribution was $15.00.

7.6 Conditional Statements

The conditional statements in awk were borrowed from the C language. They are used to control the flow of the program in making decisions.

7.6.1 if Statements

Statements beginning with the if construct are action statements. With conditional patterns, the if is implied; with a conditional action statement, the if is explicitly stated, and followed by an expression enclosed in parentheses. If the expression evaluates true (nonzero or non-null), the statement or block of statements following the expression is executed. If there is more than one statement following the conditional expression, the statements are separated either by semicolons or a newline, and the group of statements must be enclosed in curly braces so that the statements are executed as a block.

FORMAT

if (expression) {
    statement; statement; ...
}

Example 7.41

1   % nawk '{if ( $6 > 50 ) print $1 " Too high"}' filename
2   % nawk '{if ($6 > 20 && $6 <= 50){safe++; print "OK"}}' filename

EXPLANATION

In the if action block, the expression is tested. If the value of the sixth field is greater than 50, the print statement is executed. Since the statement following the expression is a single statement, curly braces are not required. (filename represents the input file.)
In the if action block, the expression is tested. If the sixth field is greater than 20 and the sixth field is less than or equal to 50, the statements following the expression are executed as a block and must be enclosed in curly braces.

7.6.2 if/else Statements

The if/else statement allows a two-way decision. If the expression after the if keyword is true, the block of statements associated with that expression are executed. If the first expression evaluates to false or zero, the block of statements after the else keyword is executed. If multiple statements are to be included with the if or else, they must be blocked with curly braces.

FORMAT

{if (expression) {
    statement; statement; ...
    }
else{
    statement; statement; ...
    }
}

Example 7.42

1   % nawk '{if( $6 > 50) print $1 " Too high" ;\
      else print "Range is OK"}' filename
2   % nawk '{if ( $6 > 50 ) { count++; print $3 } \
      else { x+=5; print $2 } }' filename

EXPLANATION

If the first expression is true, that is, the sixth field ($6) is greater than 50, the print function prints the first field and Too high; otherwise, the statement after the else, Range is OK, is printed.
If the first expression is true, that is, the sixth field ($6) is greater than 50, the block of statements is executed; otherwise, the block of statements after the else is executed. Note that the blocks are enclosed in curly braces.

7.6.3 if/else and else if Statements

The if/else and else if statements allow a multiway decision. If the expression following the keyword if is true, the block of statements associated with that expression is executed and control starts again after the last closing curly brace associated with the final else. Otherwise, control goes to the else if and that expression is tested. When the first else if condition is true, the statements following the expression are executed. If none of the conditional expressions test true, control goes to the else statements. The else is called the default action because if none of the other statements are true, the else block is executed.

FORMAT

{if (expression) {
    statement; statement; ...
    }
else if (expression){
    statement; statement; ...
    }
else if (expression){
    statement; statement; ...
    }
else{
    statement
    }
}

Example 7.43

(In the Script)
1   {if ( $3 > 89 && $3 < 101 ) Agrade++
2   else if ( $3 > 79 ) Bgrade++
3   else if ( $3 > 69 ) Cgrade++
4   else if ( $3 > 59 ) Dgrade++
5   else Fgrade++
    }
    END{print "The number of failures is " Fgrade }

EXPLANATION

The if statement is an action and must be enclosed in curly braces. The expression is evaluated from left to right. If the first expression is false, the whole expression is false; if the first expression is true, the expression after the logical and (&&) is evaluated. If it is true, the variable Agrade is incremented by one.
If the first expression following the if keyword evaluates to false (0), the else if expression is tested. If it evaluates to true, the statement following the expression is executed; that is, if the third field ($3) is greater than 79, Bgrade is incremented by one.
If the first two statements are false, the else if expression is tested, and if the third field ($3) is greater than 69, Cgrade is incremented.
If the first three statements are false, the else if expression is tested, and if the third field is greater than 59, Dgrade is incremented.
If none of the expressions tested above is true, the else block is executed. The curly brace ends the action block. Fgrade is incremented.
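The multiway decision can be tried on a few records. The sketch below assumes a POSIX awk installed as awk and uses four invented records with a score in the third field; the count array and the grade variable are illustrative names, not part of the example above.

```shell
# Hypothetical input: name, id, score.
out=$(printf 'ann 1 95\nbob 2 83\ncy 3 41\ndee 4 77\n' |
awk '{
    if ($3 > 89 && $3 < 101) grade = "A"
    else if ($3 > 79)        grade = "B"
    else if ($3 > 69)        grade = "C"
    else if ($3 > 59)        grade = "D"
    else                     grade = "F"    # default action
    count[grade]++
    print $1, grade
}
END { print "failures: " count["F"] + 0 }')
printf '%s\n' "$out"
```

Only the first branch whose expression is true is executed for each record; the remaining tests are skipped.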

7.7 Loops

Loops are used to repeatedly execute the statements following the test expression if a condition is true. Loops are often used to iterate through the fields within a record and to loop through the elements of an array in the END block. Awk has three types of loops: the while loop, the for loop, and the special for loop, which will be discussed later when working with awk arrays.

7.7.1 while Loop

The first step in using a while loop is to set a variable to an initial value. The value is then tested in the while expression. If the expression evaluates to true (nonzero), the body of the loop is entered and the statements within that body are executed. If there is more than one statement within the body of the loop, those statements must be enclosed in curly braces. Before ending the loop block, the variable controlling the loop expression must be updated or the loop will continue forever. In the following example, the variable is reinitialized each time a new record is processed.

The do/while loop is similar to the while loop, except that the expression is not tested until the body of the loop is executed at least once.

Example 7.44

% nawk '{ i  = 1; while ( i <= NF ) { print NF, $i ; i++ } }' filename

EXPLANATION

The variable i is initialized to one; while i is less than or equal to the number of fields (NF) in the record, the print statement will be executed, then i will be incremented by one. The expression will then be tested again, until the variable i is greater than the value of NF. The variable i is not reinitialized until awk starts processing the next record.
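The do/while loop mentioned earlier has no example of its own, so here is a minimal sketch, assuming a POSIX awk installed as awk and a two-line input whose second line is empty. Because the test comes after the body, the body runs once even for the empty record, where NF is zero.

```shell
out=$(printf 'a b c\n\n' | awk '
{
    i = 1
    do {                                   # body runs before the test
        print NR ": field " i " = [" $i "]"
        i++
    } while (i <= NF)
}')
printf '%s\n' "$out"
```

With a while loop in its place, nothing would be printed for the second record.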

7.7.2 for Loop

The for loop and while loop are essentially the same, except the for loop requires three expressions within the parentheses: the initialization expression, the test expression, and the expression to update the variables within the test expression. In awk, the first statement within the parentheses of the for loop can perform only one initialization. (In C, you can have multiple initializations separated by commas.)

Example 7.45

% nawk '{ for( i = 1; i <= NF; i++) print NF,$i }' filex

EXPLANATION

The variable i is initialized to one and tested to see whether it is less than or equal to the number of fields (NF) in the record. If so, the print function prints the value of NF and the value of $i (that is, the contents of the ith field), then i is incremented by one. (Frequently the for loop is used with arrays in an END action to loop through the elements of an array.) See "Arrays".

7.7.3 Loop Control

break and continue Statements. The break statement lets you break out of a loop if a certain condition is true. The continue statement causes the loop to skip any statements that follow if a certain condition is true, and returns control to the top of the loop, starting at the next iteration.

Example 7.46

(In the Script)
1   {for ( x = 3; x <= NF; x++ )
        if ( $x < 0 ){ print "Bottomed out!"; break}
        # breaks out of for loop
    }
2   {for ( x = 3; x <= NF; x++ )
        if ( $x == 0 ) { print "Get next item"; continue}
        # starts next iteration of the for loop
    }

EXPLANATION

If the value of the field $x is less than zero, the break statement causes control to go to the statement after the closing curly brace of the loop body; i.e., it breaks out of the loop.
If the value of the field $x is equal to zero, the continue statement skips the rest of the loop body and passes control to the third expression of the for loop, x++, after which the test expression is evaluated again.
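Both statements can be combined in one loop. The following sketch assumes a POSIX awk installed as awk and a single invented record; the variable sum is an illustrative addition. Zero fields are skipped with continue, and the first negative field ends the loop with break.

```shell
out=$(printf 'x y 5 0 7 -2 9\n' | awk '{
    for (x = 3; x <= NF; x++) {
        if ($x == 0) continue              # skip this field, go on to x++
        if ($x < 0) { print "Bottomed out at field " x; break }
        sum += $x
    }
    print "sum = " sum                     # the 9 in field 7 is never added
}')
printf '%s\n' "$out"
```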

7.8 Program Control Statements

7.8.1 next Statement

The next statement gets the next line of input from the input file, restarting execution at the top of the awk script.

Example 7.47

(In Script)
{ if ($1 ~ /Peter/){next}
  else {print}
}

EXPLANATION

If the first field contains Peter, awk skips over this line and gets the next line from the input file. The script resumes execution at the beginning.
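The same effect can be written with a pattern in front of next. This is a sketch under the assumption that a POSIX awk is installed as awk; the four input lines are made up. Records that match are dropped before the second rule is reached.

```shell
out=$(printf 'Peter Pan\nTom Jones\nPeter Piper\nMary Adams\n' | awk '
$1 ~ /Peter/ { next }        # remaining rules are skipped for this record
{ print NR, $0 }')
printf '%s\n' "$out"
```

NR keeps counting the skipped records, so the surviving lines are numbered 2 and 4.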

7.8.2 exit Statement

The exit statement is used to terminate the awk program. It stops processing records, but does not skip over an END statement. If the exit statement is given a value between 0 and 255 as an argument (exit 1), this value can be printed at the command line to indicate success or failure by typing:

Example 7.48

(In Script)
    {exit (1) }

(The Command Line)
% echo $status     (csh)
1
$ echo $?          (sh/ksh)
1

EXPLANATION

An exit status of zero indicates success, and an exit value of nonzero indicates failure (a convention in UNIX). It is up to the programmer to provide the exit status in a program. The exit value returned in this example is 1.
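A short sketch, assuming a Bourne-style shell and a POSIX awk installed as awk, shows both points made above: the END block still runs after exit, and the exit value reaches the shell. The input lines and the variable names out and status are illustrative.

```shell
status=0
out=$(printf 'ok\nbad\nok\n' | awk '
$1 == "bad" { exit 1 }                  # stop reading input; END still runs
END { print "records read: " NR }') || status=$?
printf '%s (exit status %s)\n' "$out" "$status"
```

The third record is never read, so NR is 2 when the END block prints it.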

7.9 Arrays

Arrays in awk are called associative arrays because the subscripts can be either numbers or strings. The subscript is often called the key and is associated with the value assigned to the corresponding array element. The keys and values are stored internally in a table where a hashing algorithm is applied to the value of the key in question. Due to the techniques used for hashing, the array elements are not stored in a sequential order, and when the contents of the array are displayed, they may not be in the order you expected.

An array, like a variable, is created by using it, and awk can infer whether it is used to store numbers or strings. Array elements are initialized with numeric value zero and string value null, depending on the context. You do not have to declare the size of an awk array. Awk arrays are used to collect information from records and may be used for accumulating totals, counting words, tracking the number of times a pattern occurred, and so forth.

7.9.1 Subscripts for Associative Arrays

Using Variables as Array Indexes

Example 7.49

(The Input File)
% cat employees
Tom Jones      4424   5/12/66   543354
Mary Adams     5346   11/4/63   28765
Sally Chang    1654   7/22/54   650000
Billy Black    1683   9/23/44   336500

(The Command Line)
1  % nawk '{name[x++]=$2};END{for(i=0; i<NR; i++)\
     print i, name[i]}' employees
   0 Jones
   1 Adams
   2 Chang
   3 Black
2  % nawk '{id[NR]=$3};END{for(x = 1; x <= NR; x++)\
     print id[x]}' employees
   4424
   5346
   1654
   1683

EXPLANATION

The subscript in array name is a user-defined variable, x. The ++ indicates a numeric context. Awk initializes x to zero and increments x by one after (post-increment operator) it is used. The value of the second field is assigned to each element of the name array. In the END block, the for loop is used to loop through the array, printing the value that was stored there, starting at subscript zero. Since the subscript is just a key, it does not have to start at zero. It can start at any value, either a number or a string.
The awk variable NR contains the number of the current record. By using NR as a subscript, the value of the third field is assigned to each element of the array for each record. At the end, the for loop will loop through the array, printing out the values that were stored there.

The Special for Loop. The special for loop is used to read through an associative array in cases where the for loop is not practical; that is, when strings are used as subscripts or the subscripts are not consecutive numbers. The special for loop uses the subscript as a key into the value associated with it.

FORMAT

{for(item in arrayname){
    print arrayname[item]
    }
}

Example 7.50

(The Input File)
% cat db
Tom Jones
Mary Adams
Sally Chang
Billy Black
Tom Savage
Tom Chung
Reggie Steel
Tommy Tucker

(The Command Line, for Loop)
1  % nawk '/^Tom/{name[NR]=$1};\
     END{for( i = 1; i <= NR; i++ )print name[i]}' db
   Tom



   Tom
   Tom

   Tommy

(The Command Line, Special for Loop)
2  % nawk '/^Tom/{name[NR]=$1};\
     END{for(i in name){print name[i]}}' db
   Tom
   Tommy
   Tom
   Tom

EXPLANATION

If the regular expression Tom is matched against an input line, the name array is assigned a value. Since the subscript used is NR, the number of the current record, the subscripts in the array will not be consecutive. Therefore, when printing the array with the traditional for loop, there will be null values printed where an array element has no value.
The special for loop iterates through the array, printing only values where there was a subscript associated with that value. The order of the printout is unpredictable because of the way the associative arrays are stored (hashed).

Using Strings as Array Subscripts. A subscript may consist of a variable containing a string or literal string. If the string is a literal, it must be enclosed in double quotes.

Example 7.51

(The Input File)
% cat datafile3
tom
mary
sean
tom
mary
mary
bob
mary
alex

(The Script)
    # awk.sc script
1   /tom/ { count["tom"]++ }
2   /mary/ { count["mary"]++ }
3   END{print "There are " count["tom"] " Toms in the file and " \
        count["mary"] " Marys in the file."}

(The Command Line)
    % nawk -f awk.sc datafile3
    There are 2 Toms in the file and 4 Marys in the file.

EXPLANATION

An array called count consists of two elements, count["tom"] and count["mary"]. The initial value of each of the array elements is zero. Every time tom is matched, the value of the array is incremented by one.
The same procedure applies to count["mary"]. Note: Only one tom is recorded for each line, even if there are multiple occurrences on the line.
The END pattern prints the value stored in each of the array elements.

Using Field Values as Array Subscripts. Any expression can be used as a subscript in an array. Therefore, fields can be used. The program in Example 7.52 counts the frequency of all names appearing in the second field and introduces a new form of the for loop.

Example 7.52

(The Input File)
% cat datafile4
4234  Tom    43
4567  Arch   45
2008  Eliza  65
4571  Tom    22
3298  Eliza  21
4622  Tom    53
2345  Mary   24

(The Command Line)
% nawk '{count[$2]++}END{for(name in count)print name,count[name] }'\
  datafile4
Tom 3
Arch 1
Eliza 2
Mary 1

EXPLANATION

The awk statement first will use the second field as an index in the count array. The index varies as the second field varies, thus the first index in the count array is Tom and the value stored in count["Tom"] is one.

Next, count["Arch"] is set to one, count["Eliza"] to one, and count["Mary"] to one. When awk finds the next occurrence of Tom in the second field, count["Tom"] is incremented, now containing the value 2. The same thing happens for each occurrence of Arch, Eliza, and Mary.

Figure 7.1. Using strings as subscripts in an array (Example 7.51).

graphics/07fig01.gif

for( index_value in array ) statement

The for loop found in the END block of the previous example works as follows: The variable name is set to the index value of the count array. After each iteration of the for loop, the print action is performed, first printing the value of the index, and then the value stored in that element. (The order of the printout is not guaranteed.)

Example 7.53

(The Input File)
% cat datafile4
4234  Tom    43
4567  Arch   45
2008  Eliza  65
4571  Tom    22
3298  Eliza  21
4622  Tom    53
2345  Mary   24

(The Command Line)
% nawk '{dup[$2]++; if (dup[$2] > 1){name[$2]++ }}\
  END{print "The duplicates were";\
  for (i in name){print i, name[i]}}' datafile4

(The Output)
The duplicates were
Tom 2
Eliza 1

EXPLANATION

The subscript for the dup array is the value in the second field, that is, the name of a person. The value stored there is initially zero, and it is incremented by one each time a record with that name is processed. If the name is a duplicate, the value stored for that subscript will go up to two, and so forth. Whenever the value in the dup array is greater than one, a second array called name, which also uses the second field as a subscript, is incremented; it records how many times each name was repeated after its first appearance. Tom appears three times, so name["Tom"] is 2; Eliza appears twice, so name["Eliza"] is 1.

Arrays and the split Function. Awk's built-in split function allows you to split a string into words and store them in an array. You can define the field separator or use the value currently stored in FS.

FORMAT

split(string, array, field separator)
split(string, array)

Example 7.54

(The Command Line)
% nawk 'BEGIN{ split( "3/15/2001", date, "/");\
  print "The month is " date[1] " and the year is " date[3] "."}' \
  filename

(The Output)
The month is 3 and the year is 2001.

EXPLANATION

The string 3/15/2001 is split on the forward slash, and the pieces are stored in the array date. Now date[1] contains 3, date[2] contains 15, and date[3] contains 2001. The field separator is specified in the third argument; if not specified, the value of FS is used as the separator.
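Both forms of split can be compared side by side. This sketch assumes a POSIX awk installed as awk; the second string and the array name rec are invented for illustration. The value returned by split is the number of elements stored.

```shell
out=$(awk 'BEGIN {
    n = split("3/15/2001", date, "/")      # explicit separator
    print n, date[1], date[2], date[3]
    n = split("Tom  Jones   4424", rec)    # no third argument: FS is used
    print n, rec[2]
}')
printf '%s\n' "$out"
```

With the default FS, runs of spaces count as a single separator, so the second call also yields three elements.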

The delete Statement. The delete statement removes an array element.

Example 7.55

% nawk '{line[x++]=$2}END{for(x in line) delete line[x]}'\
  filename

EXPLANATION

The value assigned to the array line is the value of the second field. After all the records have been processed, the special for loop will go through each element of the array, and the delete statement will in turn remove each element.

Multidimensional Arrays (nawk). Although awk does not officially support multidimensional arrays, a syntax is provided that gives the appearance of a multidimensional array. This is done by concatenating the indices into a string separated by the value of a special built-in variable, SUBSEP. The SUBSEP variable contains the value "\034", an unprintable character that is so unusual that it is unlikely to be found as an index character. The expression matrix[2,8] is really the array matrix[2 SUBSEP 8], which evaluates to matrix["2\0348"]. The index becomes a unique string for an associative array.

Example 7.56

(The Input File)
1 2 3 4 5
2 3 4 5 6
6 7 8 9 10

(The Script)
1   {nf=NF
2   for(x = 1; x <= NF; x++ ){
3       matrix[NR, x] = $x
    }
    }
4   END { for (x=1; x <= NR; x++ ){
            for (y = 1; y <= nf; y++ )
                printf "%d ", matrix[x,y]
            printf"\n"
        }
    }

(The Output)
1 2 3 4 5
2 3 4 5 6
6 7 8 9 10

EXPLANATION

The variable nf is assigned the value of NF, the number of fields. (This program assumes a fixed number of five fields per record.)
The for loop is entered, storing the number of each field on the line in the variable x.
The matrix array is a two-dimensional array. The element indexed by NR (the number of the current record) and x is assigned the value of field $x.
In the END block, the two for loops are used to iterate through the matrix array, printing out the values stored there. This example does nothing more than demonstrate that multidimensional arrays can be simulated.
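Because the two indices are really one string joined by SUBSEP, the combined key can be tested with in and taken apart again with split. A minimal sketch, assuming a POSIX awk installed as awk; the array idx and the stored value are illustrative.

```shell
out=$(awk 'BEGIN {
    matrix[2, 8] = "x"
    if ((2, 8) in matrix) print "found"    # membership test on a pair
    for (key in matrix) {
        split(key, idx, SUBSEP)            # recover the two indices
        print idx[1], idx[2], matrix[key]
    }
    print (matrix[2 SUBSEP 8] == matrix[2, 8])   # same element, prints 1
}')
printf '%s\n' "$out"
```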

7.9.2 Processing Command Arguments (nawk)

ARGV. Command line arguments are available to nawk (the new version of awk) with the built-in array called ARGV. These arguments include the command nawk, but not any of the options passed to nawk. The index of the ARGV array starts at zero. (This works only for nawk.)

ARGC. ARGC is a built-in variable that contains the number of command line arguments.

Example 7.57

(The Script)
# This script is called argvs
BEGIN{
    for ( i=0; i < ARGC; i++ ){
        printf("argv[%d] is %s\n", i, ARGV[i])
    }
    printf("The number of arguments, ARGC=%d\n", ARGC)
}

(The Output)
% nawk -f argvs datafile
argv[0] is nawk
argv[1] is datafile
The number of arguments, ARGC=2

EXPLANATION

In the for loop, i is set to zero, i is tested to see if it is less than the number of command line arguments (ARGC), and the printf function displays each argument encountered, in turn. When all of the arguments have been processed, the last printf statement outputs the number of arguments, ARGC. The example demonstrates that awk does not count command line options as arguments.

Example 7.58

(The Command Line)
% nawk -f argvs datafile "Peter Pan" 12
argv[0] is nawk
argv[1] is datafile
argv[2] is Peter Pan
argv[3] is 12
The number of arguments, ARGC=4

EXPLANATION

As in the last example, each of the arguments is printed. The nawk command is considered the first argument, whereas the -f option and script name, argvs, are excluded.

Example 7.59

(The Datafile)
% cat datafile5
Tom Jones:123:03/14/56
Peter Pan:456:06/22/58
Joe Blow:145:12/12/78
Santa Ana:234:02/03/66
Ariel Jones:987:11/12/66

(The Script)
% cat arging.sc
# This script is called arging.sc
1   BEGIN{FS=":"; name=ARGV[2]
2       print "ARGV[2] is "ARGV[2]
    }
    $1 ~ name { print $0 }

(The Command Line)
% nawk -f arging.sc datafile5 "Peter Pan"
ARGV[2] is Peter Pan
Peter Pan:456:06/22/58
nawk: can't open Peter Pan
 input record number 5, file Peter Pan
 source line number 2

EXPLANATION

In the BEGIN block, the variable name is assigned the value of ARGV[2], Peter Pan.
Peter Pan is printed, but then awk tries to open Peter Pan as an input file after it has processed and closed the datafile. Awk treats arguments as input files.

Example 7.60

(The Script)
% cat arging2.sc
BEGIN{FS=":"; name=ARGV[2]
    print "ARGV[2] is " ARGV[2]
    delete ARGV[2]
}
$1 ~ name { print $0 }

(The Command Line)
% nawk -f arging2.sc datafile5 "Peter Pan"
ARGV[2] is Peter Pan
Peter Pan:456:06/22/58

EXPLANATION

Awk treats the elements of the ARGV array as input files; after an argument is used, it is shifted to the left and the next one is processed, until the ARGV array is empty. If the argument is deleted immediately after it is used, it will not be processed as the next input file.
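The technique can be tried without the book's data files. The sketch below assumes a POSIX awk installed as awk and the mktemp command; the temporary file and its two records are invented. The second argument is copied into name and then deleted from ARGV in the BEGIN block, so awk never tries to open it as a file.

```shell
tmp=$(mktemp)                               # stand-in for datafile5
printf 'Tom Jones:123\nPeter Pan:456\n' > "$tmp"
out=$(awk 'BEGIN { FS = ":"; name = ARGV[2]; delete ARGV[2] }
$1 ~ name { print $2 }' "$tmp" "Peter Pan")
rm -f "$tmp"
printf '%s\n' "$out"
```

Assigning the empty string to ARGV[2] is another way to make awk skip the argument.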

7.10 awk Built-In Functions

7.10.1 String Functions

The sub and gsub Functions. The sub function matches the regular expression for the largest and leftmost substring in the record, and then replaces that substring with the substitution string. If a target string is specified, the regular expression is matched for the largest and leftmost substring in the target string, and the substring is replaced with the substitution string. If a target string is not specified, the entire record is used.

FORMAT

sub (regular expression, substitution string)
sub (regular expression, substitution string, target string)

Example 7.61

1  % nawk '{sub(/Mac/, "MacIntosh"); print}' filename
2  % nawk '{sub(/Mac/, "MacIntosh", $1); print}' filename

EXPLANATION

The first time the regular expression Mac is matched in the record ($0), it will be replaced with the string MacIntosh. The replacement is made only on the first occurrence of a match on the line. (See gsub for multiple occurrences.)
The first time the regular expression Mac is matched in the first field of the record, it will be replaced with the string MacIntosh. The replacement is made only on the first occurrence of a match on the line for the target string.

The gsub function substitutes a regular expression with a string globally, that is, for every occurrence where the regular expression is matched in each record ($0).

FORMAT

gsub(regular expression, substitution string)
gsub(regular expression, substitution string, target string)

Example 7.62

1  % nawk '{ gsub(/CA/, "California"); print }' datafile
2  % nawk '{ gsub(/[Tt]om/, "Thomas", $1 ); print }' filename

EXPLANATION

Everywhere the regular expression CA is found in the record ($0), it will be replaced with the string California.
Everywhere the regular expression Tom or tom is found in the first field, it will be replaced with the string Thomas.
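The difference between the two functions shows up when a record contains the pattern more than once. This sketch assumes a POSIX awk installed as awk and a one-line invented input. It also relies on two details not covered above: both functions return the number of replacements made, and an ampersand in the substitution string stands for the matched text.

```shell
out=$(printf 'Mac Mac CA\n' | awk '{
    line = $0
    n = gsub(/Mac/, "MacIntosh", line)     # every match in the target string
    sub(/Mac/, "[&]")                      # first match in $0 only
    print n, line
    print
}')
printf '%s\n' "$out"
```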

The index Function. The index function returns the first position where a substring is found in a string. Offset starts at position 1.

FORMAT

index(string, substring)

Example 7.63

% nawk '{ print index("hollow", "low") }' filename
4

EXPLANATION

The number returned is the position where the substring low is found in hollow, with the offset starting at one.

The length Function. The length function returns the number of characters in a string. Without an argument, the length function returns the number of characters in a record.

FORMAT

length ( string )
length

Example 7.64

% nawk '{ print length("hello") }' filename
5

EXPLANATION

The length function returns the number of characters in the string hello.

The substr Function. The substr function returns the substring of a string starting at a given position, where the first position is one. If a length is given, at most that many characters are returned. If the specified length runs past the end of the string, the rest of the string from the starting position is returned.

FORMAT

substr(string, starting position)
substr(string, starting position, length of string)

Example 7.65

% nawk ' { print substr("Santa Claus", 7, 6 )} ' filename
Claus

EXPLANATION

In the string Santa Claus, print the substring starting at position 7 with a length of 6 characters. Only five characters remain from position 7, so Claus is returned.
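The index, length, and substr functions are often used together. A minimal sketch, assuming a POSIX awk installed as awk; the variable names s and pos are illustrative.

```shell
out=$(awk 'BEGIN {
    s = "Santa Claus"
    pos = index(s, "Claus")                # 7
    print pos, length(s)                   # 7 11
    print substr(s, pos)                   # no length: rest of the string
    print substr(s, 1, pos - 2)            # Santa
    print "[" substr(s, pos, 20) "]"       # a length past the end is clipped
}')
printf '%s\n' "$out"
```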

The match Function. The match function returns the index where the regular expression is found in the string, or zero if not found. The match function sets the built-in variable RSTART to the starting position of the substring within the string, and RLENGTH to the length of the matched substring. These variables can be used with the substr function to extract the pattern. (Works only with nawk.)

FORMAT

match(string, regular expression)

Example 7.66

% nawk 'END{start=match("Good ole USA", /[A-Z]+$/); print start}'\
  filename
10

EXPLANATION

The regular expression /[A-Z]+$/ says search for consecutive uppercase letters at the end of the string. The substring USA is found starting at the tenth character of the string Good ole USA. If the string cannot be matched, 0 is returned.

Example 7.67

1   % nawk 'END{start=match("Good ole USA", /[A-Z]+$/);\
      print RSTART, RLENGTH}' filename
    10 3
2   % nawk 'BEGIN{ line="Good ole USA"}; \
      END{ match( line, /[A-Z]+$/);\
      print substr(line, RSTART,RLENGTH)}' filename
    USA

EXPLANATION

The RSTART variable is set by the match function to the starting position of the regular expression matched. The RLENGTH variable is set to the length of the substring.
The substr function is used to find a substring in the variable line, and uses the RSTART and RLENGTH values (set by the match function) as the beginning position and length of the substring.
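Since match returns zero when nothing is found, its return value can guard the call to substr. The sketch assumes a POSIX awk installed as awk. The second test shows the values left in RSTART and RLENGTH after a failed match.

```shell
out=$(awk 'BEGIN {
    line = "Good ole USA"
    if (match(line, /[A-Z]+$/))
        print RSTART, RLENGTH, substr(line, RSTART, RLENGTH)
    if (!match(line, /[0-9]+/))
        print RSTART, RLENGTH              # 0 and -1 when there is no match
}')
printf '%s\n' "$out"
```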

The split Function. The split function splits a string into an array using whatever field separator is designated as the third parameter. If the third parameter is not provided, awk will use the current value of FS.

FORMAT

split (string, array, field separator)
split (string, array)

Example 7.68

% awk 'BEGIN{split("12/25/2001",date,"/");print date[2]}' filename
25

EXPLANATION

The split function splits the string 12/25/2001 into an array, called date, using the forward slash as the separator. The array subscript starts at 1. The second element of the date array is printed.

The sprintf Function. The sprintf function returns an expression in a specified format. It allows you to apply the format specifications of the printf function.

FORMAT

variable=sprintf("string with format specifiers ", expr1, expr2, ... , exprn)

Example 7.69

% awk '{line = sprintf ( "%-15s %6.2f ", $1 , $3 );\
  print line}' filename

EXPLANATION

The first and third fields are formatted according to the printf specifications (a left-justified, 15-space string and a right-justified, 6-character floating point number). The result is assigned to the user-defined variable line. See "The printf Function".
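The padding produced by the two format specifiers is easier to see with visible delimiters. This is a sketch only, assuming a POSIX awk installed as awk and one invented input record; the vertical bars are added to mark the edges of each field.

```shell
out=$(printf 'Tom x 3.14159\n' | awk '{
    line = sprintf("%-15s|%6.2f|", $1, $3) # nothing is printed yet
    print line
}')
printf '%s\n' "$out"
```

The name is padded on the right to 15 characters, and the number is rounded to two decimal places and padded on the left to 6 characters.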

7.11 Built-In Arithmetic Functions

Table 7.3 lists the built-in arithmetic functions, where x and y are arbitrary expressions.

Table 7.3. Arithmetic Functions
Name	Value Returned
atan2(y,x)	Arctangent of y/x in the range -π to π.
cos(x)	Cosine of x, with x in radians.
exp(x)	Exponential function of x, e^x.
int(x)	Integer part of x; truncated toward 0 when x > 0.
log(x)	Natural (base e) logarithm of x.
rand( )	Random number r, where 0 ≤ r < 1.
sin(x)	Sine of x, with x in radians.
sqrt(x)	Square root of x.
srand(x)	x is a new seed for rand( ).^[a]

^[a] From Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger, The AWK Programming Language (Boston: Addison-Wesley, 1988). 1988 Bell Telephone Laboratories, Inc. Reprinted by permission of Pearson Education, Inc.

7.11.1 Integer Function

The int function truncates any digits to the right of the decimal point to create a whole number. There is no rounding off.

Example 7.70

1   % awk 'END{print 31/3}' filename
    10.3333
2   % awk 'END{print int(31/3)}' filename
    10

EXPLANATION

In the END block, the result of the division is printed as a floating point number.
In the END block, the int function causes the result of the division to be truncated at the decimal point. A whole number is displayed.
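Because int only truncates, rounding has to be done separately. The following sketch assumes a POSIX awk installed as awk; adding 0.5 before truncating is one common idiom for rounding positive values, and printf with a precision is another.

```shell
out=$(awk 'BEGIN {
    print int(31/3), int(-31/3)            # truncation is toward zero
    x = 10.5
    print int(x + 0.5)                     # rounds a positive value
    printf "%.2f\n", 31/3                  # rounds for display only
}')
printf '%s\n' "$out"
```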

7.11.2 Random Number Generator

The rand Function. The rand function generates a pseudorandom floating point number greater than or equal to zero and less than one.

Example 7.71

% nawk '{print rand()}' filename
0.513871
0.175726
0.308634

% nawk '{print rand()}' filename
0.513871
0.175726
0.308634

EXPLANATION

Each time the program runs, the same set of numbers is printed. The srand function can be used to seed the rand function with a new starting value. Otherwise, as in this example, the same sequence is repeated each time rand is called.

The srand Function. The srand function without an argument uses the time of day to generate the seed for the rand function. With an argument, srand(x) uses x as the seed. Normally, x should vary from one run of the program to the next.

Example 7.72

% nawk 'BEGIN{srand()};{print rand()}' filename
0.508744
0.639485
0.657277
% nawk 'BEGIN{srand()};{print rand()}' filename
0.133518
0.324747
0.691794

EXPLANATION

The srand function sets a new seed for rand. The starting point is the time of day. Each time the program is run, a different sequence of numbers is printed.

Example 7.73

% nawk 'BEGIN{srand()};{print 1 + int(rand() * 25)}' filename
6
24
14

EXPLANATION

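The srand function sets a new seed for rand. The starting point is the time of day. The expression rand() * 25 yields a number greater than or equal to 0 and less than 25; the int function truncates it to an integer from 0 to 24, and adding 1 gives a random whole number from 1 to 25.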
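The same arithmetic works for any upper limit. The sketch below assumes a POSIX awk installed as awk; the function name roll and its two arguments (how many numbers to print, and the largest value allowed) are invented for this illustration.

```shell
# roll is a hypothetical helper: print COUNT random integers from 1 to MAX.
roll() {
    awk -v count="$1" -v max="$2" 'BEGIN {
        srand()                      # seed from the time of day
        for (i = 1; i <= count; i++)
            # rand() * max is in [0, max); int() gives 0..max-1; +1 shifts to 1..max
            print 1 + int(rand() * max)
    }'
}

roll 3 25
```

Since the seed comes from the clock, the three numbers will normally differ from one run to the next, but each is always a whole number from 1 to 25.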

7.12 User-Defined Functions (nawk)

A user-defined function can be placed anywhere in the script that a pattern-action rule can.

FORMAT

function name ( parameter, parameter, parameter, ... ) {
   statements
   return expression
   (The return statement and expression are optional)
}

Variables are passed by value and are local to the function where they are used. Only copies of the variables are used. Arrays are passed by address or by reference, so array elements can be directly changed within the function. Any variable used within the function that has not been passed in the parameter list is considered a global variable; that is, it is visible to the entire awk program, and if changed in the function, is changed throughout the program. The only way to provide local variables within a function is to include them in the parameter list. Such parameters are usually placed at the end of the list. If the function call does not supply an argument for a formal parameter, that parameter is initially set to null. The return statement returns control and possibly a value to the caller.
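These scoping rules can be observed directly. The following sketch assumes a POSIX awk installed as awk; the names scope_demo, sum, nums, and count are invented for the illustration. The extra parameters total and i exist only to serve as local variables.

```shell
# scope_demo is a hypothetical wrapper so the program can be run as one command.
scope_demo() {
    awk '
    # total and i are extra parameters used only as local variables;
    # the caller never supplies them.
    function sum(arr, n,    total, i) {
        for (i = 1; i <= n; i++)
            total += arr[i]
        arr[1] = "changed"     # arrays are passed by reference
        n = 0                  # scalars are passed by value
        return total
    }
    BEGIN {
        i = 99                 # global i is untouched by the local i in sum
        count = split("10 20 30", nums, " ")
        print sum(nums, count)
        print nums[1], count, i
    }'
}

scope_demo
```

The first line printed is 60. The second line is changed 3 99: the array element was modified by the function, while count and the global i kept their values.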

Example 7.74

(The Command Line Display of grades File before Sort)
% cat grades
44 55 66 22 77 99
100 22 77 99 33 66
55 66 100 99 88 45

(The Script)
% cat sorter.sc
    # Script is called sorter
    # It sorts numbers in ascending order
1   function sort ( scores, num_elements, temp, i, j ) {
        # temp, i, and j will be local and private,
        # with an initial value of null.
2       for( i = 2; i <= num_elements ; ++i ) {
3          for ( j = i; j > 1 && scores[j-1] > scores[j]; --j ){
              temp = scores[j]
              scores[j] = scores[j-1]
              scores[j-1] = temp
           }
4       }
5   }
6   {for ( i = 1; i <= NF; i++)
       grades[i]=$i
7   sort(grades, NF)    # Two arguments are passed
8   for( j = 1; j <= NF; ++j )
       printf( "%d ", grades[j] )
    printf("\n")
    }

(After the Sort)
% nawk -f sorter.sc grades
22 44 55 66 77 99
22 33 66 77 99 100
45 55 66 88 99 100

EXPLANATION

1. The function called sort is defined. The function can be defined anywhere in the script. All variables, except those passed as parameters, are global in scope. If changed in the function, they will be changed throughout the awk script. Arrays are passed by reference. Five formal arguments are enclosed within the parentheses. The array scores will be passed by reference, so that if any of the elements of the array are modified within the function, the original array will be modified. The variable num_elements is a local variable, a copy of the original. The variables temp, i, and j are local variables in the function.
2. The outer for loop will iterate through an array of numbers, as long as there are at least two numbers to compare.
3. The inner for loop will compare the current number with the previous number (scores[j-1]). If the previous array element is larger than the current one, the two elements are swapped: temp is assigned the value of the current array element, the current array element is assigned the value of the previous element, and the previous element is assigned the value of temp.
4. The outer loop block ends.
5. This is the end of the function definition.
6. The first action block of the script starts here. The for loop iterates through each field of the current record, creating an array of numbers.
7. The sort function is called, passing the array of numbers from the current record and the number of fields in the current record.
8. When the sort function has completed, program control starts here. The for loop prints the elements in the sorted array.

7.13 Review

% cat datafile
northwest          NW   Joel Craig          3.0    .98    3    4
western            WE   Sharon Kelly        5.3    .97    5    23
southwest          SW   Chris Foster        2.7    .8     2    18
southern           SO   May Chin            5.1    .95    4    15
southeast          SE   Derek Johnson       4.0    .7     4    17
eastern            EA   Susan Beal          4.4    .84    5    20
northeast          NE   TJ Nichols          5.1    .94    3    13
north              NO   Val Shultz          4.5    .89    5    9
central            CT   Sheri Watson        5.7    .94    5    13

Example 7.75

% nawk '{if ( $8 > 15 ){ print $3 " has a high rating"}\
  else print $3 "---NOT A COMPETITOR---"}' datafile
Joel---NOT A COMPETITOR---
Sharon has a high rating
Chris has a high rating
May---NOT A COMPETITOR---
Derek has a high rating
Susan has a high rating
TJ---NOT A COMPETITOR---
Val---NOT A COMPETITOR---
Sheri---NOT A COMPETITOR---

EXPLANATION

The if statement is an action statement. If there is more than one statement following the expression, the statements must be enclosed in curly braces. (Curly braces are not required in this example, since there is only one statement following the expression.) The expression reads: if the eighth field is greater than 15, print the third field and the string has a high rating; else print the third field and ---NOT A COMPETITOR---.

Example 7.76

% nawk '{i=1; while(i<=NF && NR < 2){print $i; i++}}' datafile
northwest
NW
Joel
Craig
3.0
.98
3
4

EXPLANATION

The user-defined variable i is assigned 1. The while loop is entered and the expression tested. If the expression evaluates true, the print statement is executed and the value of the ith field is printed. Next, the value of i is incremented by 1, and the loop is reentered. The loop expression becomes false when the value of i is greater than NF or the value of NR is two or more. The variable i will not be reinitialized until the next record is entered.

Example 7.77

% nawk '{ for( i=3 ; i <= NF && NR == 3 ; i++ ){ print $i }}' datafile
Chris
Foster
2.7
.8
2
18

EXPLANATION

This is similar to the while loop in functionality. The initialization, test, and loop control statements are all in one expression. The value of i (i = 3) is initialized once for the current record. The expression is then tested. If i is less than or equal to NF, and NR is equal to 3, the print block is executed. After the value of the ith field is printed, control is returned to the loop expression. The value of i is incremented and the test is repeated.

Example 7.78

(The Command Line)
% cat nawk.sc4
# Awk script illustrating arrays
BEGIN{OFS="\t"}
{ list[NR] = $1 }   # The array is called list. The index is the
                    # number of the current record. The value of the
                    # first field is assigned to the array element.
END{ for( n = 1; n <= NR; n++){
         print list[n]} # for loop is used to loop
                        # through the array.
}

(The Command Line)
% nawk -f nawk.sc4 datafile
northwest
western
southwest
southern
southeast
eastern
northeast
north
central

EXPLANATION

The array, list, uses NR as an index value. Each time a line of input is processed, the first field is assigned to the list array. In the END block, the for loop iterates through each element of the array.

Example 7.79

(The Command Line)
% cat nawk.sc5
# Awk script with special for loop
/north/{name[count++]=$3}
END{ print "The number living in a northern district: " count
     print "Their names are: "
     for ( i in name )       # Special nawk for loop is used to
           print name[i]     # iterate through the array.
}

% nawk -f nawk.sc5 datafile
The number living in a northern district: 3
Their names are:
Joel
TJ
Val

EXPLANATION

Each time the regular expression north appears on the line, the name array is assigned the value of the third field. The index count is incremented each time a matching record is processed, thus producing another element in the array. In the END block, the special for loop is used to iterate through the array.

Example 7.80

(The Command Line)
% cat nawk.sc6
# Awk and the special for loop
{region[$1]++}   # The index is the first field of each record
END{for(item in region){
        print region[item], item
    }
}

% nawk -f nawk.sc6 datafile
1 central
1 northwest
1 western
1 southeast
1 north
1 southern
1 northeast
1 southwest
1 eastern

% nawk -f nawk.sc6 datafile3
4 Mary
2 Tom
1 Alax
1 Bob
1 Sean

EXPLANATION

The region array uses the first field as an index. The value stored is the number of times each region was found. The END block uses the special awk for loop to iterate through the array called region.
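The special for loop visits the array elements in no particular order, which is why the regions above are not listed in the order they were read. When a predictable order matters, one option is to pipe the output to the UNIX sort command. The sketch below assumes a POSIX awk installed as awk; the function name count_first_field and the sample names are made up for the illustration.

```shell
# count_first_field is a hypothetical helper: count how often each
# first field occurs, then list the counts from highest to lowest.
count_first_field() {
    awk '{ seen[$1]++ }
         END { for (key in seen) print seen[key], key }' "$@" |
    sort -k1,1nr -k2,2
}

printf 'Mary\nTom\nMary\nBob\nTom\nMary\n' | count_first_field
```

With this input the output is 3 Mary, then 2 Tom, then 1 Bob, each on its own line.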

UNIX TOOLS LAB EXERCISE

Lab 6: nawk Exercise

Mike Harrington:(510) 548-1278:250:100:175

Christian Dobbins:(408) 538-2358:155:90:201

Susan Dalsass:(206) 654-6279:250:60:50

Archie McNichol:(206) 548-1348:250:100:175

Jody Savage:(206) 548-1278:15:188:150

Guy Quigley:(916) 343-6410:250:100:175

Dan Savage:(406) 298-7744:450:300:275

Nancy McNeil:(206) 548-1278:250:80:75

John Goldenrod:(916) 348-4278:250:100:175

Chet Main:(510) 548-5258:50:95:135

Tom Savage:(408) 926-3456:250:168:200

Elizabeth Stachelin:(916) 440-1763:175:75:300

(Refer to the database called lab6.data on the CD.)

The database above contains the names, phone numbers, and money contributions to the party campaign for the past three months.

1. Write a nawk script that will produce the following report:

                      ***FIRST QUARTERLY REPORT***
                    ***CAMPAIGN 1998 CONTRIBUTIONS***
----------------------------------------------------------------------
   NAME                 PHONE     Jan  |  Feb  |  Mar  |  Total Donated
----------------------------------------------------------------------
Mike Harrington     (510) 548-1278  250.00  100.00  175.00   525.00
Christian Dobbins   (408) 538-2358  155.00   90.00  201.00   446.00
Susan Dalsass       (206) 654-6279  250.00   60.00   50.00   360.00
Archie McNichol     (206) 548-1348  250.00  100.00  175.00   525.00
Jody Savage         (206) 548-1278   15.00  188.00  150.00   353.00
Guy Quigley         (916) 343-6410  250.00  100.00  175.00   525.00
Dan Savage          (406) 298-7744  450.00  300.00  275.00  1025.00
Nancy McNeil        (206) 548-1278  250.00   80.00   75.00   405.00
John Goldenrod      (916) 348-4278  250.00  100.00  175.00   525.00
Chet Main           (510) 548-5258   50.00   95.00  135.00   280.00
Tom Savage          (408) 926-3456  250.00  168.00  200.00   618.00
Elizabeth Stachelin (916) 440-1763  175.00   75.00  300.00   550.00
----------------------------------------------------------------------
                                    SUMMARY
----------------------------------------------------------------------
The campaign received a total of $6137.00 for this quarter.
The average donation for the 12 contributors was $511.42.
The highest total contribution was $1025.00 made by Dan Savage.
                      ***THANKS Dan***
The following people donated over $500 to the campaign.
They are eligible for the quarterly drawing!!
Listed are their names (sorted by last names) and phone numbers:
   John Goldenrod--(916) 348-4278
   Mike Harrington--(510) 548-1278
   Archie McNichol--(206) 548-1348
   Guy Quigley--(916) 343-6410
   Dan Savage--(406) 298-7744
   Tom Savage--(408) 926-3456
   Elizabeth Stachelin--(916) 440-1763
      Thanks to all of you for your continued support!!

7.14 Odds and Ends

Some data (e.g., that read in from tape or from a spreadsheet) may not have obvious field separators but may instead have fixed-width columns. To preprocess this type of data, the substr function is useful.

7.14.1 Fixed Fields

In the following example, the fields are of a fixed width, but are not separated by a field separator. The substr function is used to create fields.

Example 7.81

% cat fixed
031291ax5633(408)987 0124
021589bg2435(415)866 1345
122490de1237(916)933 1234
010187ax3458(408)264 2546
092491bd9923(415)134 8900
112990bg4567(803)234 1456
070489qr3455(415)899 1426

% nawk '{printf substr($0,1,6)" ";printf substr($0,7,6)" ";\
  print substr($0,13,length)}' fixed
031291  ax5633  (408)987 0124
021589  bg2435  (415)866 1345
122490  de1237  (916)933 1234
010187  ax3458  (408)264 2546
092491  bd9923  (415)134 8900
112990  bg4567  (803)234 1456
070489  qr3455  (415)899 1426

EXPLANATION

The first field is obtained by getting the substring of the entire record, starting at the first character and extending for a length of 6 characters. Next, a space is printed. The second field is obtained by getting the substring of the record, starting at position 7 and extending for 6 characters, followed by a space. The last field is obtained by getting the substring of the entire record starting at position 13. The third argument, length, is larger than the number of characters remaining, so the substring runs to the end of the line. (The length function returns the length of the current line, $0, if it does not have an argument.)
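The same split can be written with named variables and a printf format string, which keeps the data out of the format argument. This is a minimal sketch, assuming a POSIX awk installed as awk; the function name split_fixed and the variable names date, code, and phone are invented labels for the three columns. When the third argument of substr is omitted, the substring runs to the end of the string.

```shell
# split_fixed is a hypothetical helper for 6 + 6 + remainder columns.
split_fixed() {
    awk '{
        date  = substr($0, 1, 6)    # characters 1-6
        code  = substr($0, 7, 6)    # characters 7-12
        phone = substr($0, 13)      # character 13 to the end of the record
        printf "%s %s %s\n", date, code, phone
    }' "$@"
}

echo "031291ax5633(408)987 0124" | split_fixed
```

For this input line the command prints 031291 ax5633 (408)987 0124.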

Empty Fields. If the data is stored in fixed-width fields, it is possible that some of the fields are empty. In the following example, the substr function is used to preserve the fields, regardless of whether they contain data.

Example 7.82

1   % cat db
    xxx xxx
    xxx abc xxx
    xxx a   bbb
    xxx     xx

    % cat awkfix
    # Preserving empty fields. Field width is fixed.
    {
2   f[1]=substr($0,1,3)
3   f[2]=substr($0,5,3)
4   f[3]=substr($0,9,3)
5   line=sprintf("%-4s%-4s%-4s\n", f[1],f[2], f[3])
6   print line
    }

    % nawk -f awkfix db
    xxx xxx
    xxx abc xxx
    xxx a   bbb
    xxx     xx

EXPLANATION

1. The contents of the file db are printed. There are empty fields in the file.
2. The first element of the f array is assigned the substring of the record, starting at position 1 and extending for 3 characters.
3. The second element of the f array is assigned the substring of the record, starting at position 5 and extending for 3 characters.
4. The third element of the f array is assigned the substring of the record, starting at position 9 and extending for 3 characters.
5. The elements of the array are assigned to the user-defined variable line after being formatted by the sprintf function.
6. The value of line is printed and the empty fields are preserved.

Numbers with $, Commas, or Other Characters. In the following example, the price field contains a dollar sign and comma. The script must eliminate these characters to add up the prices to get the total cost. This is done using the gsub function.

Example 7.83

% cat vendor
access tech:gp237221:220:vax789:20/20:11/01/90:$1,043.00
alisa systems:bp262292:280:macintosh:new updates:06/30/91:$456.00
alisa systems:gp262345:260:vax8700:alisa talk:02/03/91:$1,598.50
apple computer:zx342567:240:macs:e mail:06/25/90:$575.75
caci:gp262313:280:sparc station:network11.5:05/12/91:$1,250.75
datalogics:bp132455:260:microvax2:pagestation maint:07/01/90:$1,200.00
dec:zx354612:220:microvax2:vms sms:07/20/90:$1,350.00

% nawk -F: '{gsub(/\$/,"");gsub(/,/,""); cost +=$7};\
END{print "The total is $" cost}' vendor
The total is $7474

EXPLANATION

The first gsub function globally substitutes the literal dollar sign (\$) with the null string, and the second gsub function substitutes commas with a null string. The user-defined cost variable is then totalled by adding the seventh field to cost and assigning the result back to cost. In the END block, the string The total is $ is printed, followed by the value of cost.^[1]
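A variation on this technique cleans a copy of the price field instead of the whole record, so that dollar signs or commas elsewhere on the line are left alone. The sketch below assumes a POSIX awk installed as awk and that the price is always the last colon-separated field; the function name sum_prices and the three sample records are invented for the illustration.

```shell
# sum_prices is a hypothetical helper: total the last field of each record.
sum_prices() {
    awk -F: '{
        price = $NF                 # copy the last field; leave $0 alone
        gsub(/[$,]/, "", price)     # delete every dollar sign and comma
        total += price
    }
    END { printf "The total is $%.2f\n", total }' "$@"
}

printf 'a:$1,043.00\nb:$456.00\nc:$1,598.50\n' | sum_prices
```

For the three sample records the command prints The total is $3097.50. Using printf with %.2f keeps two decimal places in the result.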

7.14.2 Bundling and Unbundling Files

The Bundle Program. In The AWK Programming Language, the program to bundle files together is very short and to the point. We are trying to combine several files into one file to save disk space, to send files through electronic mail, and so forth. The following awk command will print every line of each file, preceded with the filename.

Example 7.84

% nawk '{ print FILENAME, $0 }' file1 file2 file3 > bundled

EXPLANATION

The name of the current input file, FILENAME, is printed, followed by the record ($0) for each line of input in file1. After file1 has reached the end of file, awk will open the next file, file2, and do the same thing, and so on. The output is redirected to a file called bundled.

Unbundle. The following example displays how to unbundle files, or put them back into separate files.

Example 7.85

% nawk '$1 != previous { close(previous); previous=$1};\
  {print substr($0, index($0, " ") + 1) > $1}' bundled

EXPLANATION

The first field is the name of the file. If the name of the file is not equal to the value of the user-defined variable previous (initially null), the action block is executed. The file assigned to previous is closed, and previous is assigned the value of the first field. Then the substr function extracts the rest of the record, starting one character past the first space (the position returned by the index function, plus 1), and that text is redirected to the file named in the first field.
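The bundle and unbundle commands can be tried together as a round trip. The following is a sketch under stated assumptions: a POSIX awk is installed as awk, the mktemp command is available, and the file names contain no spaces. The scratch directory and the sample files file1 and file2 are created only for this illustration. The unbundle step also skips the close call while previous is still null.

```shell
workdir=$(mktemp -d)              # scratch directory for the demonstration
cd "$workdir" || exit 1
printf 'alpha\nbeta\n' > file1
printf 'gamma delta\n' > file2

# Bundle: prefix every line with the name of the file it came from.
awk '{ print FILENAME, $0 }' file1 file2 > bundled
rm file1 file2

# Unbundle: the first field names the output file; the rest is the line.
awk '$1 != previous { if (previous != "") close(previous); previous = $1 }
     { print substr($0, index($0, " ") + 1) > $1 }' bundled

cat file1 file2
```

After the second awk command, file1 and file2 exist again with their original contents.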

To bundle the files so that the filename appears on a line by itself, above the contents of the file, use the following command:

% nawk '{if(FNR==1){print FILENAME;print $0}\
  else print $0}'  file1 file2 file3 > bundled

The following command will unbundle the files:

% nawk 'NF==1{filename=$NF} ;\
  NF != 1{print $0 > filename}' bundled

7.14.3 Multiline Records

In the sample data files used so far, each record is on a line by itself. In the following sample datafile, called checkbook, the records are separated by blank lines and the fields are separated by newlines. To process this file, the record separator (RS) is assigned a value of null, and the field separator (FS) is assigned the newline.

Example 7.86

(The Input File)
% cat checkbook
1/1/01
#125
-695.00
Mortgage

1/1/01
#126
-56.89
PG&E

1/2/01
#127
-89.99
Safeway

1/3/01
+750.00
Pay Check

1/4/01
#128
-60.00
Visa

(The Script)
   % cat awkchecker
1  BEGIN{RS=""; FS="\n";ORS="\n\n"}
2  {print  NR, $1,$2,$3,$4}

(The Output)
% nawk -f awkchecker checkbook
1 1/1/01  #125  -695.00  Mortgage

2 1/1/01  #126  -56.89  PG&E

3 1/2/01  #127  -89.99  Safeway

4 1/3/01  +750.00  Pay Check

5 1/4/01  #128  -60.00  Visa

EXPLANATION

1. In the BEGIN block, the record separator (RS) is assigned null so that blank lines separate the records, the field separator (FS) is assigned a newline, and the output record separator (ORS) is assigned two newlines. Now each line is a field and each output record is separated by two newlines.
2. The number of the record is printed, followed by each of the fields.
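The effect of these settings on NF can be checked with a small test. This is a minimal sketch, assuming a POSIX awk installed as awk; the function name show_records and the two sample records are made up for the illustration.

```shell
# show_records is a hypothetical helper for blank-line-separated records.
show_records() {
    awk 'BEGIN { RS = ""; FS = "\n" }
         { print NR ": " $1 " | " $2 " | " $NF " (" NF " fields)" }' "$@"
}

printf '1/1/01\n#125\nMortgage\n\n1/2/01\n#127\nSafeway\n' | show_records
```

Each three-line record is reported as one record with three fields, so the first output line is 1: 1/1/01 | #125 | Mortgage (3 fields).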

7.14.4 Generating Form Letters

The following example is modified from a program in The AWK Programming Language.^[2] The tricky part of this is keeping track of what is actually being processed. The input file is called data.form. It contains just the data. Each field in the input file is separated by colons. The other file is called form.letter. It is the actual form that will be used to create the letter. This file is loaded into awk's memory with the getline function. Each line of the form letter is stored in an array. The program gets its data from data.form, and the letter is created by substituting real data for the special strings preceded by # and @ found in form.letter. A temporary variable, temp, holds the actual line that will be displayed after the data has been substituted. This program allows you to create personalized form letters for each person listed in data.form.

Example 7.87

(The Awk Script)
% cat form.awk
# form.awk is an awk script that requires access to 2 files: The
# first file is called "form.letter." This file contains the
# format for a form letter. The awk script uses another file,
# "data.form," as its input file. This file contains the
# information that will be substituted into the form letters in
# the place of the numbers preceded by pound signs. Today's date
# is substituted in the place of "@date" in "form.letter."
1   BEGIN{ FS=":"; n=1
2   while((getline < "form.letter") > 0)
3       form[n++] = $0   # Store lines from form.letter in an array
4   "date" | getline d; split(d, today, " ")
        # Output of date is Fri Mar 2 14:35:50 PST 2001
5   thisday=today[2]". "today[3]", "today[6]
6   }
7   { for( i = 1; i < n; i++ ){
8       temp=form[i]
9       for ( j = 1; j <=NF; j++ ){
           gsub("@date", thisday, temp)
10         gsub("#" j, $j , temp )
        }
11  print temp
    }
    }

% cat form.letter
   The form letter, form.letter, looks like this:
*********************************************************
   Subject: Status Report for Project "#1"
   To: #2
   From: #3
   Date: @date
   This letter is to tell you, #2, that project "#1" is up to
   date.
   We expect that everything will be completed and ready for
   shipment as scheduled on #4.
   Sincerely,
   #3
**********************************************************

The file, data.form, is awk's input file containing the data that will replace the #1-4 and the @date in form.letter.

% cat data.form
   Dynamo:John Stevens:Dana Smith, Mgr:4/12/2001
   Gallactius:Guy Sterling:Dana Smith, Mgr:5/18/2001

(The Command Line)
   % nawk -f form.awk data.form
   *********************************************************
   Subject: Status Report for Project "Dynamo"
   To: John Stevens
   From: Dana Smith, Mgr
   Date: Mar. 2, 2001
   This letter is to tell you, John Stevens, that project
   "Dynamo" is up to date.
   We expect that everything will be completed and ready for
   shipment as scheduled on 4/12/2001.
   Sincerely,
   Dana Smith, Mgr
   Subject: Status Report for Project "Gallactius"
   To: Guy Sterling
   From: Dana Smith, Mgr
   Date: Mar. 2, 2001
   This letter is to tell you, Guy Sterling, that project "Gallactius"
   is up to date.
   We expect that everything will be completed and ready for
   shipment as scheduled on 5/18/2001.
   Sincerely,
   Dana Smith, Mgr

EXPLANATION

1. In the BEGIN block, the field separator (FS) is assigned a colon; a user-defined variable n is assigned 1.
2. In the while loop, the getline function reads a line at a time from the file called form.letter. If getline fails to find the file, it returns -1. When it reaches the end of file, it returns zero. Therefore, by testing for a return value greater than zero, we know that the function has read in a line from the input file.
3. Each line from form.letter is assigned to an array called form.
4. The output from the UNIX date command is piped to the getline function and assigned to the user-defined variable d. The split function then splits up the variable d with whitespace, creating an array called today.
5. The user-defined variable thisday is assigned the month, day, and year.
6. The BEGIN block ends.
7. The for loop runs once for each line stored in the form array, that is, n - 1 times.
8. The user-defined variable temp is assigned a line from the form array.
9. The nested for loop is looping through a line from the input file, data.form, NF number of times. Each line stored in the temp variable is checked for the string @date. If @date is matched, the gsub function replaces it with today's date (the value stored in thisday).
10. If a # and a number are found in the line stored in temp, the gsub function will replace the # and number with the value of the corresponding field in the input file, data.form. For example, while the first line of data.form is being processed, #1 would be replaced with Dynamo, #2 with John Stevens, #3 with Dana Smith, Mgr, #4 with 4/12/2001, and so forth.
11. The line stored in temp is printed after the substitutions.
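The getline test and the substitution loop can be tried on a much smaller template. The following is a sketch, assuming a POSIX awk installed as awk and the mktemp command; the function name fill_template, the variable names tmpl, line, and out, and the two-line template are all invented for the illustration. Parentheses around the getline expression make the comparison with zero unambiguous.

```shell
tmpl=$(mktemp)                    # hypothetical two-line template file
printf 'Dear #1,\nYour order ships on #2.\n' > "$tmpl"

# fill_template is a hypothetical helper; $1 is the template file name.
fill_template() {
    awk -F: -v tmpl="$1" '
    BEGIN {
        # getline returns 1 for a line read, 0 at end of file, -1 on error,
        # so only a result greater than zero means a line was stored.
        while ((getline line < tmpl) > 0)
            form[++n] = line
        close(tmpl)
    }
    {
        for (i = 1; i <= n; i++) {
            out = form[i]
            for (j = 1; j <= NF; j++)
                gsub("#" j, $j, out)    # replace #1, #2, ... with the fields
            print out
        }
    }'
}

echo 'Pat:4/12/2001' | fill_template "$tmpl"
```

For the single input record the command prints Dear Pat, followed by Your order ships on 4/12/2001. on the next line.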

7.14.5 Interaction with the Shell

Now that you have seen how awk works, you will find that awk is a very powerful utility when writing shell scripts. You can embed one-line awk commands or awk scripts within your shell scripts. The following is a sample of a Korn shell program embedded with awk commands.

Example 7.88

#!/bin/ksh
# This korn shell script will collect data for awk to use in
# generating form letter(s). See above.
    print "Hello $LOGNAME. "
    print "This report is for the month and year:"
1   cal | nawk 'NR==1{print $0}'
    if [[ -f data.form || -f formletter? ]]
    then
       rm data.form formletter?  2> /dev/null
    fi
    integer num=1
    while true
    do
       print "Form letter #$num:"
       read project?"What is the name of the project? "
       read sender?"Who is the status report from? "
       read recipient?"Who is the status report to? "
       read due_date?"What is the completion date scheduled? "
       echo $project:$recipient:$sender:$due_date > data.form
       print -n "Do you wish to generate another form letter? "
       read answer
       if [[ "$answer" != [Yy]* ]]
       then
              break
       else
2             nawk -f form.awk  data.form  > formletter$num
       fi
       (( num+=1 ))
    done
    nawk -f form.awk data.form > formletter$num

EXPLANATION

1. The output of the UNIX cal command is piped to awk. The first line, which contains the current month and year, is printed.
2. The nawk script form.awk generates form letters, which are redirected to a UNIX file.

7.15 Review

7.15.1 String Functions

% cat datafile
northwest          NW   Joel Craig          3.0    .98    3    4
western            WE   Sharon Kelly        5.3    .97    5    23
southwest          SW   Chris Foster        2.7    .8     2    18
southern           SO   May Chin            5.1    .95    4    15
southeast          SE   Derek Johnson       4.0    .7     4    17
eastern            EA   Susan Beal          4.4    .84    5    20
northeast          NE   TJ Nichols          5.1    .94    3    13
north              NO   Val Shultz          4.5    .89    5    9
central            CT   Sheri Watson        5.7    .94    5    13

Example 7.89

% nawk 'NR==1{gsub(/northwest/,"southeast", $1) ;print}' datafile
southeast     NW     Joel Craig         3.0  .98  3   4

EXPLANATION

If this is the first record (NR == 1), globally substitute the regular expression northwest with southeast, if northwest is found in the first field.

Example 7.90

% nawk 'NR==1{print substr($3, 1, 3)}' datafile
Joe

EXPLANATION

If this is the first record, display the substring of the third field, starting at the first character, and extracting a length of 3 characters. The substring Joe is printed.

Example 7.91

% nawk 'NR==1{print length($1)}' datafile
9

EXPLANATION

If this is the first record, the length (number of characters) in the first field is printed.

Example 7.92

% nawk 'NR==1{print index($1,"west")}' datafile
6

EXPLANATION

If this is the first record, print the first position where the substring west is found in the first field. The string west starts at the sixth position (index) in the string northwest.

Example 7.93

% nawk '{if(match($1,/^no/)){print substr($1,RSTART,RLENGTH)}}'\
  datafile
no
no
no

EXPLANATION

If the match function finds the regular expression /^no/ in the first field, the index position of the leftmost character is returned. The built-in variable RSTART is set to the index position and the RLENGTH variable is set to the length of the matched substring. The substr function returns the string in the first field starting at position RSTART, RLENGTH number of characters.
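RSTART and RLENGTH are most useful when the matched text can appear anywhere in the string. The sketch below assumes a POSIX awk installed as awk; the function name area_code and the sample lines are invented for the illustration. It extracts a three-digit area code enclosed in parentheses and reports when none is found.

```shell
# area_code is a hypothetical helper: print the digits inside (ddd).
area_code() {
    awk '{
        # match() returns the starting position, or 0 when nothing matches.
        if (match($0, /\([0-9][0-9][0-9]\)/))
            print substr($0, RSTART + 1, RLENGTH - 2)   # drop the parentheses
        else
            print "no area code"
    }' "$@"
}

printf 'Mike Harrington:(510) 548-1278\nno phone here\n' | area_code
```

The first line of input yields 510 and the second yields no area code.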

Example 7.94

% nawk 'BEGIN{split("10/14/01",now,"/");print now[1],now[2],now[3]}'
10 14 01

EXPLANATION

The string 10/14/01 is split into an array called now. The delimiter is the forward slash. The elements of the array are printed, starting at the first element of the array.

% cat datafile2
Joel Craig:northwest:NW:3.0:.98:3:4
Sharon Kelly:western:WE:5.3:.97:5:23
Chris Foster:southwest:SW:2.7:.8:2:18
May Chin:southern:SO:5.1:.95:4:15
Derek Johnson:southeast:SE:4.0:.7:4:17
Susan Beal:eastern:EA:4.4:.84:5:20
TJ Nichols:northeast:NE:5.1:.94:3:13
Val Shultz:north:NO:4.5:.89:5:9
Sheri Watson:central:CT:5.7:.94:5:13

Example 7.95

% nawk -F: '/north/{split($1, name, " ");\
  print "First name: "name[1];\
  print "Last name: " name[2];\
  print "\n--------------------"}' datafile2
First name: Joel
Last name: Craig
--------------------
First name: TJ
Last name: Nichols
--------------------
First name: Val
Last name: Shultz
--------------------

EXPLANATION

The input field separator is set to a colon (-F:). If the record contains the regular expression north, the first field is split into an array called name, where a space is the delimiter. The elements of the array are printed.

% cat datafile
northwest          NW   Joel Craig          3.0    .98    3    4
western            WE   Sharon Kelly        5.3    .97    5    23
southwest          SW   Chris Foster        2.7    .8     2    18
southern           SO   May Chin            5.1    .95    4    15
southeast          SE   Derek Johnson       4.0    .7     4    17
eastern            EA   Susan Beal          4.4    .84    5    20
northeast          NE   TJ Nichols          5.1    .94    3    13
north              NO   Val Shultz          4.5    .89    5    9
central            CT   Sheri Watson        5.7    .94    5    13

Example 7.96

% nawk '{line=sprintf("%10.2f%5s\n",$7,$2); print line}' datafile
      3.00   NW
      5.00   WE
      2.00   SW
      4.00   SO
      4.00   SE
      5.00   EA
      3.00   NE
      5.00   NO
      5.00   CT

EXPLANATION

The sprintf function formats the seventh and the second fields ($7, $2) using the formatting conventions of the printf function. The formatted string is returned and assigned to the user-defined variable line and printed.

7.15.2 Command Line Arguments

Example 7.97

% cat argvs.sc
# Testing command line arguments with ARGV and ARGC using a for loop.
BEGIN{
    for(i=0;i < ARGC;i++)
       printf("argv[%d] is %s\n", i, ARGV[i])
    printf("The number of arguments, ARGC=%d\n", ARGC)
}

% nawk -f argvs.sc datafile
argv[0] is nawk
argv[1] is datafile
The number of arguments, ARGC=2

EXPLANATION

The BEGIN block contains a for loop to process the command line arguments. ARGC is the number of arguments and ARGV is an array that contains the actual arguments. Nawk does not count options as arguments. The only valid arguments in this example are the nawk command and the input file, datafile.

Example 7.98

% nawk 'BEGIN{name=ARGV[1]};\
  $0 ~ name {print $3 , $4}'  "Derek" datafile
nawk: can't open Derek
source line number 1

% nawk 'BEGIN{name=ARGV[1]; delete ARGV[1]};\
  $0 ~ name {print $3, $4}'  "Derek" datafile
Derek Johnson

EXPLANATION

The name "Derek" was assigned to the variable name in the BEGIN block. In the pattern-action block, nawk attempted to open "Derek" as an input file and failed.
After assigning "Derek" to the variable name, ARGV[1] is deleted. When starting the pattern-action block, nawk does not try to open "Derek" as the input file, but opens datafile instead.
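Another way to hand a value to the script without disturbing the list of input files is the -v option, which assigns a variable before the program starts. The sketch below assumes an awk that supports -v and is installed as awk; the function name find_name and the two sample records are invented for the illustration.

```shell
# find_name is a hypothetical helper: $1 is the pattern, and any
# remaining arguments are input files (standard input if there are none).
find_name() {
    pattern=$1
    shift
    awk -v name="$pattern" '$0 ~ name { print $3, $4 }' "$@"
}

printf 'southeast SE Derek Johnson 4.0\neastern EA Susan Beal 4.4\n' |
    find_name Derek
```

Only the record containing Derek matches, so the command prints Derek Johnson. Since the pattern never appears in ARGV, nothing has to be deleted.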

7.15.3 Reading Input (getline)

Example 7.99

% nawk 'BEGIN{ "date" | getline d; print d}' datafile
Mon Jan 15 11:24:24 PST 2001

EXPLANATION

The UNIX date command is piped to the getline function. The results are stored in the variable d and printed.

Example 7.100

% nawk 'BEGIN{ "date " | getline d; split( d, mon) ;print mon[2]}'\
  datafile
Jan

EXPLANATION

The UNIX date command is piped to the getline function and the results are stored in d. The split function splits the string d into an array called mon. The second element of the array is printed.
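
The pipe-to-getline pattern can be sketched with a repeatable command. Here echo stands in for date so the result does not change from run to run; the variable names cmd and parts are illustrative, and a POSIX awk installed as awk is assumed. The sketch also checks the return value of getline and closes the pipe afterward.

```shell
# echo stands in for date so the result is repeatable.
month=$(awk 'BEGIN {
    cmd = "echo Mon Jan 15 11:24:24 PST 2001"
    if ((cmd | getline d) > 0) {   # 1 means a line was read
        split(d, parts)            # split on whitespace into parts[1], parts[2], ...
        print parts[2]             # the month is the second word
    }
    close(cmd)                     # release the pipe
}')

echo "$month"
```

Storing the command in a variable makes it easy to pass exactly the same string to close.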

Example 7.101

% nawk 'BEGIN{ printf "Who are you looking for?" ; \
  getline name < "/dev/tty"};\

EXPLANATION

Input is read from the terminal, /dev/tty, and stored in the variable called name.

Example 7.102

% nawk 'BEGIN{while(getline < "/etc/passwd"  > 0 ){lc++}; print lc}'\
  datafile
16

EXPLANATION

The while loop is used to loop through the /etc/passwd file one line at a time. Each time the loop is entered, a line is read by getline and the value of the variable lc is incremented. When the loop exits, the value of lc is printed, i.e., the number of lines in the /etc/passwd file. The looping continues as long as the return value from getline is greater than 0, i.e., a line has been read; getline returns 0 at end of file and -1 if the file cannot be read.
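
The line-counting loop can be sketched against a small scratch file so the count is known in advance. This assumes a POSIX awk installed as awk and the mktemp utility; the names file, line, and count are illustrative. The getline expression is parenthesized to make the comparison with 0 unambiguous.

```shell
# Three-line scratch file standing in for /etc/passwd.
tmp=$(mktemp)
printf 'one\ntwo\nthree\n' > "$tmp"

count=$(awk -v file="$tmp" 'BEGIN {
    # getline returns 1 for a line, 0 at end of file, -1 on error.
    while ((getline line < file) > 0)
        lc++
    close(file)
    print lc
}')

echo "$count"
rm -f "$tmp"
```

Testing for a value greater than 0 matters: if the file were missing, getline would return -1, and a loop written as while (getline < file) would never end.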

7.15.4 Control Functions

Example 7.103

% nawk '{if ( $5 > 4.5) next; print $1}' datafile
northwest
southwest
southeast
eastern
north

EXPLANATION

If the fifth field is greater than 4.5, the next line is read from the input file (datafile) and processing starts at the beginning of the awk script (after the BEGIN block). Otherwise, the first field is printed.
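
A minimal sketch of next on three inline records, assuming a POSIX awk installed as awk. It writes the test as a pattern instead of an if statement, which is an equivalent and common way to express the same skip.

```shell
names=$(printf '%s\n' \
    'northwest NW Joel Craig 3.0 .98 3 4' \
    'western WE Sharon Kelly 5.3 .97 5 23' \
    'north NO Val Shultz 4.5 .89 5 9' |
  awk '$5 > 4.5 { next }   # skip the remaining rules for this record
       { print $1 }')

echo "$names"
```

The western record is skipped because 5.3 is greater than 4.5; the north record is printed because 4.5 is not greater than 4.5.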

Example 7.104

% nawk '{if ($2 ~ /S/){print ; exit 0}}' datafile
southwest       SW      Chris   Foster  2.7     .8      2       18

% echo $status  (csh)   or   echo $? (sh or ksh)
0

EXPLANATION

If the second field contains an S, the record is printed and the awk program exits. The C shell status variable contains the exit value. If using the Bourne or Korn shells, the $? variable contains the exit status.
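
To see the exit value reach the shell, the sketch below exits with 3 instead of 0. The value 3 is arbitrary and chosen only so it can be told apart from a normal exit; a Bourne-style shell and a POSIX awk installed as awk are assumed.

```shell
status=0
# awk stops at the first record whose second field contains an S.
first=$(printf '%s\n' 'northwest NW' 'southwest SW' 'southern SO' |
  awk '$2 ~ /S/ { print; exit 3 }') || status=$?

echo "$first"
echo "exit status: $status"
```

Only the southwest record is printed; the southern record is never examined because exit stops the reading of input.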

7.15.5 User-Defined Functions

Example 7.105

(The Command Line)
% cat nawk.sc7
1  BEGIN{largest=0}
2  {maximum=max($5)}
3  function max ( num ) {
4     if ( num > largest){ largest=num }
      return largest
5  }
6  END{ print "The maximum is " maximum "."}

% nawk -f nawk.sc7 datafile
The maximum is 5.7.

EXPLANATION

1. In the BEGIN block, the user-defined variable largest is initialized to zero.
2. For each line in the file, the variable maximum is assigned the value returned from the function max. The function max is given $5 as its argument.
3. The user-defined function max is defined. The function statements are enclosed in curly braces. Each time a new record is read from the input file, datafile, the function max will be called.
4. It will compare the values in num and largest and return the larger of the two numbers.
5. The function definition block ends.
6. The END block prints the final value in maximum.
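
A variation on the same idea, offered as a sketch and not as the book's method: here max takes both values as parameters and keeps no global state, and the running maximum is seeded from the first record so the result does not depend on the data being positive. A POSIX awk installed as awk is assumed; adding 0 to $5 forces a numeric value.

```shell
result=$(printf '%s\n' \
    'northwest NW Joel Craig 3.0 .98 3 4' \
    'central CT Sheri Watson 5.7 .94 5 13' \
    'north NO Val Shultz 4.5 .89 5 9' |
  awk 'function max(a, b) { return (a > b) ? a : b }
       NR == 1 { largest = $5 + 0 }          # seed from the first record
       { largest = max(largest, $5 + 0) }    # keep the larger value
       END { print "The maximum is " largest "." }')

echo "$result"
```

Because the function depends only on its arguments, it can be reused for other fields without resetting a shared variable.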

UNIX TOOLS LAB EXERCISE

Lab 7: nawk Exercise

Mike Harrington:(510) 548-1278:250:100:175
Christian Dobbins:(408) 538-2358:155:90:201
Susan Dalsass:(206) 654-6279:250:60:50
Archie McNichol:(206) 548-1348:250:100:175
Jody Savage:(206) 548-1278:15:188:150
Guy Quigley:(916) 343-6410:250:100:175
Dan Savage:(406) 298-7744:450:300:275
Nancy McNeil:(206) 548-1278:250:80:75
John Goldenrod:(916) 348-4278:250:100:175
Chet Main:(510) 548-5258:50:95:135
Tom Savage:(408) 926-3456:250:168:200
Elizabeth Stachelin:(916) 440-1763:175:75:300

(Refer to the database called lab7.data on the CD.)

The database above contains the names, phone numbers, and money contributions to the party campaign for the past three months.

1:	Write a user-defined function to return the average of all the contributions for a given month. The month will be passed in at the command line.

[1] For details on how commas are added back into the program, see Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger, The AWK Programming Language (Boston: Addison-Wesley, 1988), p. 72.
[2] Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger, The AWK Programming Language (Boston: Addison-Wesley, 1988). 1988 Bell Telephone Laboratories, Inc. Reprinted by permission of Pearson Education, Inc.
