Examples

cars data file

Many of the examples in this section work with the cars data file. From left to right the columns in the file contain each car's make, model, year of manufacture, mileage in thousands of miles, and price. All whitespace in this file is composed of single TABs (the file does not contain any SPACEs).

 $ cat cars
 plym    fury    1970    73      2500
 chevy   malibu  1999    60      3000
 ford    mustang 1965    45      10000
 volvo   s80     1998    102     9850
 ford    thundbd 2003    15      10500
 chevy   malibu  2000    50      3500
 bmw     325i    1985    115     450
 honda   accord  2001    30      6000
 ford    taurus  2004    10      17000
 toyota  rav4    2002    180     750
 chevy   impala  1985    85      1550
 ford    explor  2003    25      9500

Missing pattern

A simple gawk program is

 { print }

This program consists of one program line that is an action. Because the pattern is missing, gawk selects all lines of input. When used without any arguments the print command displays each selected line in its entirety. This program copies the input to standard output.

 $ gawk '{ print }' cars
 plym    fury    1970    73      2500
 chevy   malibu  1999    60      3000
 ford    mustang 1965    45      10000
 volvo   s80     1998    102     9850
 ...

Missing action

The next program has a pattern but no explicit action. The slashes indicate that chevy is a regular expression.

 /chevy/

In this case gawk selects from the input all lines that contain the string chevy. When you do not specify an action, gawk assumes that the action is print. The following example copies to standard output all lines from the input that contain the string chevy:

 $ gawk '/chevy/' cars
 chevy   malibu  1999    60      3000
 chevy   malibu  2000    50      3500
 chevy   impala  1985    85      1550

Single quotation marks

Although neither gawk nor shell syntax requires single quotation marks on the command line, it is still a good idea to use them because they can prevent problems. If the gawk program you create on the command line includes SPACEs or special shell characters, you must quote them. Always enclosing the program in single quotation marks is the easiest way of making sure that you have quoted any characters that need to be quoted.
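
To see what the single quotation marks protect against, the following sketch runs the same one-line program twice, once inside double quotation marks and once inside single quotation marks. The positional parameter set with set -- and the input line a b are assumptions made up for the illustration. The sketch calls awk so that it also runs where gawk is not installed; gawk behaves the same way here.

```shell
# Give the shell a first positional parameter so that $1 means something to it.
set -- 2

# Double quotation marks: the shell replaces $1 with 2 before awk starts,
# so awk receives the program {print 2} and prints the number 2.
double=$(echo 'a b' | awk "{print $1}")

# Single quotation marks: awk receives {print $1} and prints the first field.
single=$(echo 'a b' | awk '{print $1}')

echo "double quotes: $double"
echo "single quotes: $single"
```

Only the single-quoted version displays the first field of the input line.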

Fields

The next example selects all lines from the file (it has no pattern). The braces enclose the action; you must always use braces to delimit the action so that gawk can distinguish it from the pattern. This example displays the third field ($3), a SPACE (the output field separator, indicated by the comma), and the first field ($1) of each selected line:

 $ gawk '{print $3, $1}' cars
 1970 plym
 1999 chevy
 1965 ford
 1998 volvo
 ...

The next example, which includes both a pattern and an action, selects all lines that contain the string chevy and displays the third and first fields from the lines it selects:

 $ gawk '/chevy/ {print $3, $1}' cars
 1999 chevy
 2000 chevy
 1985 chevy

Next gawk selects lines that contain a match for the regular expression h. Because there is no explicit action, gawk displays all the lines it selects:

 $ gawk '/h/' cars
 chevy   malibu  1999    60      3000
 ford    thundbd 2003    15      10500
 chevy   malibu  2000    50      3500
 honda   accord  2001    30      6000
 chevy   impala  1985    85      1550

~ (matches operator)

The next pattern uses the matches operator (~) to select all lines that contain the letter h in the first field:

 $ gawk '$1 ~ /h/' cars
 chevy   malibu  1999    60      3000
 chevy   malibu  2000    50      3500
 honda   accord  2001    30      6000
 chevy   impala  1985    85      1550

The caret (^) in a regular expression forces a match at the beginning of the line (page 830) or, in this case, the beginning of the first field:

 $ gawk '$1 ~ /^h/' cars
 honda   accord  2001    30      6000

Brackets surround a character-class definition (page 829). In the next example, gawk selects lines that have a second field that begins with t or m and displays the third and second fields, a dollar sign, and the fifth field. Because there is no comma between the "$" and the $5, gawk does not put a SPACE between them in the output.

 $ gawk '$2 ~ /^[tm]/ {print $3, $2, "$" $5}' cars
 1999 malibu $3000
 1965 mustang $10000
 2003 thundbd $10500
 2000 malibu $3500
 2004 taurus $17000

Dollar signs

The next example shows three roles a dollar sign can play in a gawk program. A dollar sign followed by a number names a field. Within a regular expression a dollar sign forces a match at the end of a line or field (5$). Within a string a dollar sign represents itself.

 $ gawk '$3 ~ /5$/ {print $3, $1, "$" $5}' cars
 1965 ford $10000
 1985 bmw $450
 1985 chevy $1550

In the next example, the equal-to relational operator (==) causes gawk to perform a numeric comparison between the third field in each line and the number 1985. The gawk command takes the default action, print, on each line where the comparison is true.

 $ gawk '$3 == 1985' cars
 bmw     325i    1985    115     450
 chevy   impala  1985    85      1550

The next example finds all cars priced at or less than $3,000:

 $ gawk '$5 <= 3000' cars
 plym    fury    1970    73      2500
 chevy   malibu  1999    60      3000
 bmw     325i    1985    115     450
 toyota  rav4    2002    180     750
 chevy   impala  1985    85      1550

Textual comparisons

When you use double quotation marks, gawk performs textual comparisons by using the ASCII (or other local) collating sequence as the basis of the comparison. In the following example, gawk shows that the strings 450 and 750 fall in the range that lies between the strings 2000 and 9000, which is probably not the intended result:

 $ gawk '"2000" <= $5 && $5 < "9000"' cars
 plym    fury    1970    73      2500
 chevy   malibu  1999    60      3000
 chevy   malibu  2000    50      3500
 bmw     325i    1985    115     450
 honda   accord  2001    30      6000
 toyota  rav4    2002    180     750

When you need to perform a numeric comparison, do not use quotation marks. The next example gives the intended result. It is the same as the previous example except that it omits the double quotation marks.

 $ gawk '2000 <= $5 && $5 < 9000' cars
 plym    fury    1970    73      2500
 chevy   malibu  1999    60      3000
 chevy   malibu  2000    50      3500
 honda   accord  2001    30      6000
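
The difference between the two comparisons can be reproduced on a smaller scale. The following is a minimal sketch that assumes a three-record sample modeled on the cars file; the temporary file and the variable names are made up for the illustration, and awk is used so the sketch runs where gawk is not installed.

```shell
# Three TAB-separated records modeled on the cars file, in a temporary file.
sample=$(mktemp)
printf 'bmw\t325i\t1985\t115\t450\n'      >  "$sample"
printf 'plym\tfury\t1970\t73\t2500\n'     >> "$sample"
printf 'ford\tmustang\t1965\t45\t10000\n' >> "$sample"

# Quoted constants: the prices are compared as strings, character by character.
textual=$(awk '"2000" <= $5 && $5 < "9000" {print $5}' "$sample")

# Unquoted constants: the prices are compared as numbers.
numeric=$(awk '2000 <= $5 && $5 < 9000 {print $5}' "$sample")

echo "textual comparison selects:" $textual
echo "numeric comparison selects:" $numeric
rm -f "$sample"
```

The textual comparison selects 450 as well as 2500 because the character 4 sorts between 2 and 9; the numeric comparison selects only 2500.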

, (range operator)

The range operator ( , ) selects a group of lines. The first line it selects is the one specified by the pattern before the comma. The last line is the one selected by the pattern after the comma. If no line matches the pattern after the comma, gawk selects every line through the end of the input. The next example selects all lines, starting with the line that contains volvo and concluding with the line that contains bmw:

 $ gawk '/volvo/ , /bmw/' cars
 volvo   s80     1998    102     9850
 ford    thundbd 2003    15      10500
 chevy   malibu  2000    50      3500
 bmw     325i    1985    115     450
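
The case in which no line matches the pattern after the comma can be sketched as follows. The four-record sample is an assumption made up for the illustration (it is not the cars file), and awk stands in for gawk so the sketch runs on systems without gawk.

```shell
# A short sample in which toyota appears but no later line contains bmw.
sample=$(mktemp)
printf 'plym\tfury\ntoyota\trav4\nchevy\timpala\nford\texplor\n' > "$sample"

# The range opens at toyota and never closes, so it runs to the end of input.
selected=$(awk '/toyota/ , /bmw/ {print $1}' "$sample")

echo "$selected"
rm -f "$sample"
```

The output lists toyota, chevy, and ford: every line from the first match through the end of the input.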

After the range operator finds its first group of lines, it begins the process again, looking for a line that matches the pattern before the comma. In the following example, gawk finds three groups of lines that fall between chevy and ford.

Although the fifth line of input contains ford, gawk does not select it because at the time it is processing the fifth line, it is searching for chevy.

 $ gawk '/chevy/ , /ford/' cars
 chevy   malibu  1999    60      3000
 ford    mustang 1965    45      10000
 chevy   malibu  2000    50      3500
 bmw     325i    1985    115     450
 honda   accord  2001    30      6000
 ford    taurus  2004    10      17000
 chevy   impala  1985    85      1550
 ford    explor  2003    25      9500

file option

When you are writing a longer gawk program, it is convenient to put the program in a file and reference the file on the command line. Use the -f or --file option followed by the name of the file containing the gawk program.

BEGIN

The following gawk program, stored in a file named pr_header, has two actions and uses the BEGIN pattern. The gawk utility performs the action associated with BEGIN before processing any lines of the data file: It displays a header. The second action, {print}, has no pattern part and displays all the lines from the input.

 $ cat pr_header
 BEGIN   {print "Make    Model   Year    Miles   Price"}
         {print}
 $ gawk -f pr_header cars
 Make    Model   Year    Miles   Price
 plym    fury    1970    73      2500
 chevy   malibu  1999    60      3000
 ford    mustang 1965    45      10000
 volvo   s80     1998    102     9850
 ...

The next example expands the action associated with the BEGIN pattern. In the previous and following examples, the whitespace in the headers is composed of single TABs, so that the titles line up with the columns of data.

 $ cat pr_header2
 BEGIN   {
 print "Make    Model   Year    Miles   Price"
 print "----------------------------------------"
 }
         {print}
 $ gawk -f pr_header2 cars
 Make    Model   Year    Miles   Price
 ----------------------------------------
 plym    fury    1970    73      2500
 chevy   malibu  1999    60      3000
 ford    mustang 1965    45      10000
 volvo   s80     1998    102     9850
 ...

length function

When you call the length function without an argument, it returns the number of characters in the current line, including field separators. The $0 variable always contains the value of the current line. In the next example, gawk prepends the line length to each line and then a pipe sends the output from gawk to sort (the -n option specifies a numeric sort; page 762) so that the lines of the cars file appear in order of length:

 $ gawk '{print length, $0}' cars | sort -n
 21 bmw  325i    1985    115     450
 22 plym fury    1970    73      2500
 23 volvo        s80     1998    102     9850
 24 ford explor  2003    25      9500
 24 toyota       rav4    2002    180     750
 25 chevy        impala  1985    85      1550
 25 chevy        malibu  1999    60      3000
 25 chevy        malibu  2000    50      3500
 25 ford taurus  2004    10      17000
 25 honda        accord  2001    30      6000
 26 ford mustang 1965    45      10000
 26 ford thundbd 2003    15      10500

The formatting of this report depends on TABs for horizontal alignment. The three extra characters at the beginning of each line throw off the format of several lines. A remedy for this situation is covered shortly.

NR (record number)

The NR variable contains the record (line) number of the current line. The following pattern selects all lines that contain more than 24 characters. The action displays the line number of each of the selected lines.

 $ gawk 'length > 24 {print NR}' cars
 2
 3
 5
 6
 8
 9
 11

You can combine the range operator ( , ) and the NR variable to display a group of lines of a file based on their line numbers. The next example displays lines 2 through 4:

 $ gawk 'NR == 2 , NR == 4' cars
 chevy   malibu  1999    60      3000
 ford    mustang 1965    45      10000
 volvo   s80     1998    102     9850
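
A relational expression on NR selects the same group of lines without the range operator. The following is a minimal sketch, assuming five made-up input lines in place of the cars file; awk is used so the sketch runs where gawk is not installed.

```shell
# Select records 2 through 4 by testing NR directly instead of using a range.
middle=$(printf 'one\ntwo\nthree\nfour\nfive\n' | awk 'NR >= 2 && NR <= 4')

echo "$middle"
```

Both forms display lines 2 through 4; which one reads better is a matter of taste.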

END

The END pattern works in a manner similar to the BEGIN pattern, except that gawk takes the actions associated with it after processing the last line of input. The following report displays information only after it has processed all the input. The NR variable retains its value after gawk finishes processing the data file, so that an action associated with an END pattern can use it:

 $ gawk 'END {print NR, "cars for sale." }' cars
 12 cars for sale.

The next example uses if control structures to expand the abbreviations used in some of the first fields. As long as gawk does not change a record, it leaves the entire record including separators intact. Once it makes a change to a record, gawk changes all separators in that record to the value of the output field separator. The default output field separator is a SPACE.

 $ cat separ_demo
         {
         if ($1 ~ /ply/)  $1 = "plymouth"
         if ($1 ~ /chev/) $1 = "chevrolet"
         print
         }
 $ gawk -f separ_demo cars
 plymouth fury 1970 73 2500
 chevrolet malibu 1999 60 3000
 ford    mustang 1965    45      10000
 volvo   s80     1998    102     9850
 ford    thundbd 2003    15      10500
 chevrolet malibu 2000 50 3500
 bmw     325i    1985    115     450
 honda   accord  2001    30      6000
 ford    taurus  2004    10      17000
 toyota  rav4    2002    180     750
 chevrolet impala 1985 85 1550
 ford    explor  2003    25      9500
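
The separator change can be isolated in a smaller example. In the sketch below, which assumes a single made-up TAB-separated record, assigning a field to itself ($1 = $1) counts as a change to the record even though no value differs, so the record is rebuilt with the output field separator. The sketch uses awk so that it runs where gawk is not installed.

```shell
# Untouched record: print reproduces the original TABs.
untouched=$(printf 'ford\tmustang\t1965\n' | awk '{print}')

# Assigning to any field rebuilds the record with single SPACEs (the default OFS).
rebuilt=$(printf 'ford\tmustang\t1965\n' | awk '{$1 = $1; print}')

echo "untouched: $untouched"
echo "rebuilt:   $rebuilt"
```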

Stand-alone script

Instead of calling gawk from the command line with the -f option and the name of the program you want to run, you can write a script that calls gawk with the commands you want to run. The next example is a stand-alone script that runs the same program as the previous example. The #!/bin/gawk -f line (page 265) runs the gawk utility directly. You need both read and execute permission to the file holding the script (page 263).

 $ chmod u+rx separ_demo2
 $ cat separ_demo2
 #!/bin/gawk -f
         {
         if ($1 ~ /ply/)  $1 = "plymouth"
         if ($1 ~ /chev/) $1 = "chevrolet"
         print
         }
 $ separ_demo2 cars
 plymouth fury 1970 73 2500
 chevrolet malibu 1999 60 3000
 ford    mustang 1965    45      10000
 ...

OFS variable

You can change the value of the output field separator by assigning a value to the OFS variable. The following example assigns a TAB character to OFS, using the backslash escape sequence \t. This fix improves the appearance of the report but does not line up the columns properly.

 $ cat ofs_demo
 BEGIN   {OFS = "\t"}
         {
         if ($1 ~ /ply/)  $1 = "plymouth"
         if ($1 ~ /chev/) $1 = "chevrolet"
         print
         }
 $ gawk -f ofs_demo cars
 plymouth        fury    1970    73      2500
 chevrolet       malibu  1999    60      3000
 ford    mustang 1965    45      10000
 volvo   s80     1998    102     9850
 ford    thundbd 2003    15      10500
 chevrolet       malibu  2000    50      3500
 bmw     325i    1985    115     450
 honda   accord  2001    30      6000
 ford    taurus  2004    10      17000
 toyota  rav4    2002    180     750
 chevrolet       impala  1985    85      1550
 ford    explor  2003    25      9500

printf

You can use printf (page 534) to refine the output format. The following example uses a backslash at the end of several program lines to quote the following NEWLINE. You can use this technique to continue a long line over one or more lines without affecting the outcome of the program.

 $ cat printf_demo
 BEGIN   {
     print "                                 Miles"
     print "Make       Model       Year      (000)       Price"
     print \
     "--------------------------------------------------"
     }
     {
     if ($1 ~ /ply/)  $1 = "plymouth"
     if ($1 ~ /chev/) $1 = "chevrolet"
     printf "%-10s %-8s    %2d   %5d     $ %8.2f\n",\
         $1, $2, $3, $4, $5
     }
 $ gawk -f printf_demo cars
                                  Miles
 Make       Model       Year      (000)       Price
 --------------------------------------------------
 plymouth   fury        1970      73     $  2500.00
 chevrolet  malibu      1999      60     $  3000.00
 ford       mustang     1965      45     $ 10000.00
 volvo      s80         1998     102     $  9850.00
 ford       thundbd     2003      15     $ 10500.00
 chevrolet  malibu      2000      50     $  3500.00
 bmw        325i        1985     115     $   450.00
 honda      accord      2001      30     $  6000.00
 ford       taurus      2004      10     $ 17000.00
 toyota     rav4        2002     180     $   750.00
 chevrolet  impala      1985      85     $  1550.00
 ford       explor      2003      25     $  9500.00
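
The conversion specifications in printf_demo are easier to read in isolation. The following sketch formats one made-up set of values and places vertical bars between the fields so that the padding is visible; the bars and the values are assumptions for the illustration, and awk is used so the sketch runs where gawk is not installed.

```shell
# %-10s  left-justifies a string in 10 columns
# %5d    right-justifies an integer in 5 columns
# %8.2f  right-justifies a number in 8 columns with 2 decimal places
formatted=$(awk 'BEGIN { printf "%-10s|%5d|%8.2f\n", "ford", 45, 10000 }')

echo "$formatted"    # ford      |   45|10000.00
```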

Redirecting output

The next example creates two files: one with the lines that contain chevy and one with the lines that contain ford:

 $ cat redirect_out
 /chevy/    {print > "chevfile"}
 /ford/     {print > "fordfile"}
 END        {print "done."}
 $ gawk -f redirect_out cars
 done.
 $ cat chevfile
 chevy   malibu  1999    60      3000
 chevy   malibu  2000    50      3500
 chevy   impala  1985    85      1550
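
Inside a gawk program the > redirection behaves differently from the shell's: the file is emptied only when the program first opens it, and every later print to the same filename adds a line. That is why chevfile ends up holding all three chevy lines. The sketch below illustrates this with a made-up three-record sample and a temporary directory, both assumptions for the example; awk is used so the sketch runs where gawk is not installed.

```shell
dir=$(mktemp -d)
(
    cd "$dir" || exit 1
    # Pre-existing contents are discarded when awk first opens the file.
    printf 'old contents\n' > chevfile
    printf 'chevy\tmalibu\nford\tmustang\nchevy\timpala\n' |
        awk '/chevy/ {print > "chevfile"}'
)

# Both chevy records are in the file; the old contents are gone.
contents=$(cat "$dir/chevfile")
echo "$contents"
rm -rf "$dir"
```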

The summary program produces a summary report on all cars and newer cars. Although they are not required, the initializations at the beginning of the program represent good programming practice; gawk automatically declares and initializes variables as you use them. After reading all the input data, gawk computes and displays averages.

 $ cat summary
 BEGIN   {
         yearsum = 0 ; costsum = 0
         newcostsum = 0 ; newcount = 0
         }
         {
         yearsum += $3
         costsum += $5
         }
 $3 > 2000 {newcostsum += $5 ; newcount ++}
 END     {
         printf "Average age of cars is %4.1f years\n",\
             2006 - (yearsum/NR)
         printf "Average cost of cars is $%7.2f\n",\
             costsum/NR
         printf "Average cost of newer cars is $%7.2f\n",\
             newcostsum/newcount
         }
 $ gawk -f summary cars
 Average age of cars is 13.1 years
 Average cost of cars is $6216.67
 Average cost of newer cars is $8750.00
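
The summary program divides by newcount, which is zero when the input contains no car newer than 2000; awk treats division by zero as a fatal error. One possible guard is sketched below. The report function, its message, and the two sample files are assumptions made up for this illustration and are not part of the summary program; awk is used so the sketch runs where gawk is not installed.

```shell
# report FILE: average price of cars newer than 2000 in a cars-style file
# (year in field 3, price in field 5). Hypothetical helper for this sketch.
report() {
    awk '
    $3 > 2000 {newcostsum += $5 ; newcount++}
    END {
        if (newcount > 0)
            printf "Average cost of newer cars is $%7.2f\n", newcostsum/newcount
        else
            print "No newer cars found."
    }' "$1"
}

old=$(mktemp)      # no car newer than 2000
printf 'plym\tfury\t1970\t73\t2500\nbmw\t325i\t1985\t115\t450\n' > "$old"

mixed=$(mktemp)    # two cars newer than 2000
printf 'plym\tfury\t1970\t73\t2500\n'       >  "$mixed"
printf 'ford\tthundbd\t2003\t15\t10500\n'   >> "$mixed"
printf 'honda\taccord\t2001\t30\t6000\n'    >> "$mixed"

old_report=$(report "$old")
mixed_report=$(report "$mixed")
echo "$old_report"
echo "$mixed_report"
rm -f "$old" "$mixed"
```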

The following gawk command shows the format of a line from the passwd file that the next example uses:

 $ awk '/mark/ {print}' /etc/passwd
 mark:x:107:100:ext 112:/home/mark:/bin/tcsh

The next example demonstrates a technique for finding the largest number in a field. Because it works with the passwd file, which delimits fields with colons (:), the example changes the input field separator (FS) before reading any data. It reads the passwd file and determines the next available user ID number (field 3). The numbers do not have to be in order in the passwd file for this program to work.

The pattern ($3 > saveit) causes gawk to select records that contain a user ID number greater than any previous user ID number that it has processed. Each time it selects a record, gawk assigns the value of the new user ID number to the saveit variable. Then gawk uses the new value of saveit to test the user IDs of all subsequent records. Finally gawk adds 1 to the value of saveit and displays the result.

 $ cat find_uid
 BEGIN           {FS = ":"
                 saveit = 0}
 $3 > saveit     {saveit = $3}
 END             {print "Next available UID is " saveit + 1}
 $ gawk -f find_uid /etc/passwd
 Next available UID is 192
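
For a program this short, the field separator can also be set on the command line with the -F option instead of assigning FS in a BEGIN block. The listing above does not use -F; the sketch below is an alternative form under that assumption. It reads a three-line sample in passwd format rather than the real /etc/passwd file, and it calls awk so that it runs where gawk is not installed.

```shell
# Three sample records in passwd format (temporary file, not /etc/passwd).
sample=$(mktemp)
printf '%s\n' \
    'bill::102:100:ext 123:/home/bill:/bin/bash' \
    'lynn:x:166:100:ext 500:/home/lynn:/bin/bash' \
    'mark:x:107:100:ext 112:/home/mark:/bin/bash' > "$sample"

# -F: sets the input field separator to a colon before any data is read.
next_uid=$(awk -F: '
$3 > saveit     {saveit = $3}
END             {print "Next available UID is " saveit + 1}' "$sample")

echo "$next_uid"
rm -f "$sample"
```

The largest user ID in the sample is 166, so the sketch reports 167.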

The next example produces another report based on the cars file. This report uses nested if...else control structures to substitute values based on the contents of the price field. The program has no pattern part; it processes every record.

 $ cat price_range
     {
     if             ($5 <= 5000)               $5 = "inexpensive"
         else if    (5000 < $5 && $5 < 10000)  $5 = "please ask"
         else if    (10000 <= $5)              $5 = "expensive"
     #
     printf "%-10s %-8s    %2d    %5d    %-12s\n",\
     $1, $2, $3, $4, $5
     }
 $ gawk -f price_range cars
 plym       fury         1970       73    inexpensive
 chevy      malibu       1999       60    inexpensive
 ford       mustang      1965       45    expensive
 volvo      s80          1998      102    please ask
 ford       thundbd      2003       15    expensive
 chevy      malibu       2000       50    inexpensive
 bmw        325i         1985      115    inexpensive
 honda      accord       2001       30    please ask
 ford       taurus       2004       10    expensive
 toyota     rav4         2002      180    inexpensive
 chevy      impala       1985       85    inexpensive
 ford       explor       2003       25    please ask

Associative arrays

Next the manuf associative array uses the contents of the first field of each record in the cars file as an index. The array is composed of the elements manuf[plym], manuf[chevy], manuf[ford], and so on. Each new element is initialized to 0 (zero) as it is created. The C language operator ++ increments the variable that it follows.

The action following the END pattern is the special for structure that loops through the elements of an associative array. A pipe sends the output through sort to produce an alphabetical list of cars and the quantities in stock. Because it is a shell script and not a gawk program file, you must have both read and execute permission to the manuf file to execute it as a command. Depending on how the PATH variable (page 284) is set, you may have to execute the script as ./manuf.

 $ cat manuf
 gawk '      {manuf[$1]++}
 END    {for (name in manuf) print name, manuf[name]}
 ' cars |
 sort
 $ manuf
 bmw 1
 chevy 3
 ford 4
 honda 1
 plym 1
 toyota 1
 volvo 1
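
An associative array can accumulate a sum as easily as a count. The following sketch totals the price field for each make; the total array and the three-record sample are assumptions made up for the illustration (the sample is a subset of the cars file), and awk is used so the sketch runs where gawk is not installed.

```shell
sample=$(mktemp)
printf 'chevy\tmalibu\t1999\t60\t3000\n'   >  "$sample"
printf 'ford\tmustang\t1965\t45\t10000\n'  >> "$sample"
printf 'chevy\tmalibu\t2000\t50\t3500\n'   >> "$sample"

# total[make] starts at 0 the first time a make is seen, then grows by each price.
totals=$(awk '
        {total[$1] += $5}
END     {for (name in total) print name, total[name]}' "$sample" | sort)

echo "$totals"
rm -f "$sample"
```

As in manuf, the for loop visits the elements in no particular order, so sort puts the report in alphabetical order.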

The next program, named manuf.sh, is a more general shell script that includes some error checking. This script lists and counts the contents of a column in a file, with both the column number and the name of the file being specified on the command line.

The first action (the one that starts with {count) uses the shell variable $1 in the middle of the gawk program to specify an array index. Because of the way the single quotation marks are paired, the $1 that appears to be within single quotation marks is actually not quoted: The two quoted strings in the gawk program surround, but do not include, the $1. Because the $1 is not quoted, and because this is a shell script, the shell substitutes the value of the first command line argument in place of $1 (page 481). As a result the $1 is interpreted before the gawk command is invoked. The leading dollar sign (the one before the first single quotation mark on that line) causes gawk to interpret what the shell substitutes as a field number.

 $ cat manuf.sh
 if [ $# != 2 ]
     then
         echo "Usage: manuf.sh field file"
         exit 1
 fi
 gawk < $2 '
         {count[$'$1']++}
 END     {for (item in count) printf "%-20s%-20s\n",\
             item, count[item]}' |
 sort
 $ manuf.sh
 Usage: manuf.sh field file
 $ manuf.sh 1 cars
 bmw                 1
 chevy               3
 ford                4
 honda               1
 plym                1
 toyota              1
 volvo               1
 $ manuf.sh 3 cars
 1965                1
 1970                1
 1985                2
 1998                1
 1999                1
 2000                1
 2001                1
 2002                1
 2003                2
 2004                1

A way around the tricky use of quotation marks that allows parameter expansion within the gawk program is to use the -v option on the command line to pass the field number to gawk as a variable. This change makes it easier for someone else to read and debug the script. You call the manuf2.sh script the same way you call manuf.sh:

 $ cat manuf2.sh
 if [ $# != 2 ]
     then
         echo "Usage: manuf.sh field file"
         exit 1
 fi
 gawk -v "field=$1" < $2 '
         {count[$field]++}
 END     {for (item in count) printf "%-20s%-20s\n",\
             item, count[item]}' |
 sort

The word_usage script displays a word usage list for a file you specify on the command line. The tr utility (page 804) lists the words from standard input, one to a line. The sort utility orders the file, with the most frequently used words first. This script sorts groups of words that are used the same number of times in alphabetical order.

 $ cat word_usage
 tr -cs 'a-zA-Z' '[\n*]' < $1 |
 gawk    '
         {count[$1]++}
 END     {for (item in count) printf "%-15s%3s\n", item, count[item]}' |
 sort +1nr +0f -1
 $ word_usage textfile
 the             42
 file            29
 fsck            27
 system          22
 you             22
 to              21
 it              17
 SIZE            14
 and             13
 MODE            13
 ...

Following is a similar program in a different format. The style mimics that of a C program and may be easier to read and work with for more complex gawk programs:

 $ cat word_count
 tr -cs 'a-zA-Z' '[\n*]' < $1 |
 gawk '  {
         count[$1]++
 }
 END     {
         for (item in count)
             {
             if (count[item] > 4)
                 {
                 printf "%-15s%3s\n", item, count[item]
                 }
             }
 }
 ' |
 sort +1nr +0f -1

The tail utility displays the last ten lines of output, illustrating that words occurring fewer than five times are not listed:

 $ word_count textfile | tail
 directories      5
 if               5
 information      5
 INODE            5
 more             5
 no               5
 on               5
 response         5
 this             5
 will             5

The next example shows one way to put a date on a report. The first line of input to the gawk program comes from date. The program reads this line as record number 1 (NR == 1), processes it accordingly, and processes all subsequent lines with the action associated with the next pattern (NR > 1).

 $ cat report
 if (test $# = 0) then
     echo "You must supply a filename."
     exit 1
 fi
 (date; cat $1) |
 gawk '
 NR == 1    {print "Report for", $1, $2, $3 ", " $6}
 NR >  1    {print $5 "\t" $1}'
 $ report cars
 Report for Mon Jan 31, 2005
 2500    plym
 3000    chevy
 10000   ford
 9850    volvo
 10500   ford
 3500    chevy
 450     bmw
 6000    honda
 17000   ford
 750     toyota
 1550    chevy
 9500    ford

The next example sums each of the columns in a file; as written, the script takes its input from the numbers file. The program performs error checking, reporting on and discarding rows that contain nonnumeric entries. It uses the next command (with the comment skip bad records) to skip the rest of the commands for the current record if the record contains a nonnumeric entry. The accepted rows, the column totals, and a grand total for the file go to the tally.out file.

 $ cat numbers
 10      20      30.3    40.5
 20      30      45.7    66.1
 30      xyz     50      70
 40      75      107.2   55.6
 50      20      30.3    40.5
 60      30      45.O    66.1
 70      1134.7  50      70
 80      75      107.2   55.6
 90      176     30.3    40.5
 100     1027.45 45.7    66.1
 110     123     50      57a.5
 120     75      107.2   55.6

 $ cat tally
 gawk ' BEGIN    {
                 ORS = ""
                 }

 NR == 1 {                                        # first record only
     nfields = NF                                 # set nfields to number of
     }                                            # fields in the record (NF)
     {
     if ($0 ~ /[^0-9. \t]/)                       # check each record to see if it contains
         {                                        # any characters that are not numbers,
         print "\nRecord " NR " skipped:\n\t"     # periods, spaces, or TABs
         print $0 "\n"
         next                                     # skip bad records
         }
     else
         {
         for (count = 1; count <= nfields; count++)   # for good records loop through fields
             {
             printf "%10.2f", $count > "tally.out"
             sum[count] += $count
             gtotal += $count
             }
         print "\n" > "tally.out"
         }
     }

 END        {                                     # after processing last record
     for (count = 1; count <= nfields; count++)   # print summary
         {
         print "   -------" > "tally.out"
         }
     print "\n" > "tally.out"
     for (count = 1; count <= nfields; count++)
         {
         printf "%10.2f", sum[count] > "tally.out"
         }
     print "\n\n        Grand Total " gtotal "\n" > "tally.out"
 }
 ' < numbers

 $ tally

 Record 3 skipped:
         30      xyz     50      70

 Record 6 skipped:
         60      30      45.O    66.1

 Record 11 skipped:
         110     123     50      57a.5

 $ cat tally.out
      10.00     20.00     30.30     40.50
      20.00     30.00     45.70     66.10
      40.00     75.00    107.20     55.60
      50.00     20.00     30.30     40.50
      70.00   1134.70     50.00     70.00
      80.00     75.00    107.20     55.60
      90.00    176.00     30.30     40.50
     100.00   1027.45     45.70     66.10
     120.00     75.00    107.20     55.60
    -------   -------   -------   -------
     580.00   2633.15    553.90    490.50

         Grand Total 4257.55

The next example reads the passwd file, listing users who do not have passwords and users who have duplicate user ID numbers. (The pwck utility performs similar checks.)

 $ cat /etc/passwd
 bill::102:100:ext 123:/home/bill:/bin/bash
 roy:x:104:100:ext 475:/home/roy:/bin/bash
 tom:x:105:100:ext 476:/home/tom:/bin/bash
 lynn:x:166:100:ext 500:/home/lynn:/bin/bash
 mark:x:107:100:ext 112:/home/mark:/bin/bash
 sales:x:108:100:ext 102:/m/market:/bin/bash
 anne:x:109:100:ext 355:/home/anne:/bin/bash
 toni::164:100:ext 357:/home/toni:/bin/bash
 ginny:x:115:100:ext 109:/home/ginny:/bin/bash
 chuck:x:116:100:ext 146:/home/chuck:/bin/bash
 neil:x:164:100:ext 159:/home/neil:/bin/bash
 rmi:x:118:100:ext 178:/home/rmi:/bin/bash
 vern:x:119:100:ext 201:/home/vern:/bin/bash
 bob:x:120:100:ext 227:/home/bob:/bin/bash
 janet:x:122:100:ext 229:/home/janet:/bin/bash
 maggie:x:124:100:ext 244:/home/maggie:/bin/bash
 dan::126:100::/home/dan:/bin/bash
 dave:x:108:100:ext 427:/home/dave:/bin/bash
 mary:x:129:100:ext 303:/home/mary:/bin/bash

 $ cat passwd_check
 gawk < /etc/passwd '     BEGIN   {
     uid[void] = ""                            # tell gawk that uid is an array
     }
     {                                         # no pattern indicates process all records
     dup = 0                                   # initialize duplicate flag
     split($0, field, ":")                     # split into fields delimited by ":"
     if (field[2] == "")                       # check for null password field
         {
         if (field[5] == "")                   # check for null info field
             {
             print field[1] " has no password."
             }
         else
             {
             print field[1] " ("field[5]") has no password."
             }
         }
     for (name in uid)                         # loop through uid array
         {
         if (uid[name] == field[3])            # check for second use of UID
             {
             print field[1] " has the same UID as " name " : UID = " uid[name]
             dup = 1                           # set duplicate flag
             }
         }
     if (!dup)                                 # same as if (dup == 0)
                                               # assign UID and login name to uid array
         {
         uid[field[1]] = field[3]
         }
     }'

 $ passwd_check
 bill (ext 123) has no password.
 toni (ext 357) has no password.
 neil has the same UID as toni : UID = 164
 dan has no password.
 dave has the same UID as sales : UID = 108
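
The split function that passwd_check relies on returns the number of elements it stored, and an empty field becomes an empty array element rather than being skipped. The sketch below applies split to one line in passwd format; the variable n and the square brackets around the second element are assumptions added to make the result visible, and awk is used so the sketch runs where gawk is not installed.

```shell
result=$(awk 'BEGIN {
    # Split one passwd-style line on colons; n receives the element count.
    n = split("bill::102:100:ext 123:/home/bill:/bin/bash", field, ":")
    # field[2] is the empty password field, shown here between brackets.
    print n, field[1], "[" field[2] "]", field[3]
}')

echo "$result"    # 7 bill [] 102
```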

The next example shows a complete interactive shell script that uses gawk to generate a report on the cars file based on price ranges:

 $ cat list_cars
 trap 'rm -f $$.tem > /dev/null;echo $0 aborted.;exit 1' 1 2 15
 echo -n "Price range (for example, 5000 7500):"
 read lowrange hirange

 echo '
                                Miles
 Make       Model       Year    (000)         Price
 --------------------------------------------------' > $$.tem
 gawk < cars '
 $5 >= '$lowrange' && $5 <= '$hirange' {
         if ($1 ~ /ply/)  $1 = "plymouth"
         if ($1 ~ /chev/) $1 = "chevrolet"
         printf "%-10s %-8s    %2d    %5d    $ %8.2f\n", $1, $2, $3, $4, $5
         }' | sort -n +5 >> $$.tem
 cat $$.tem
 rm $$.tem

 $ list_cars
 Price range (for example, 5000 7500):3000 8000

                                Miles
 Make       Model       Year    (000)         Price
 --------------------------------------------------
 chevrolet  malibu      1999       60    $  3000.00
 chevrolet  malibu      2000       50    $  3500.00
 honda      accord      2001       30    $  6000.00

 $ list_cars
 Price range (for example, 5000 7500):0 2000

                                Miles
 Make       Model       Year    (000)         Price
 --------------------------------------------------
 bmw        325i        1985      115    $   450.00
 toyota     rav4        2002      180    $   750.00
 chevrolet  impala      1985       85    $  1550.00

 $ list_cars
 Price range (for example, 5000 7500):15000 100000

                                Miles
 Make       Model       Year    (000)         Price
 --------------------------------------------------
 ford       taurus      2004       10    $ 17000.00

Optional: Advanced gawk Programming

This section discusses some of the advanced features that the GNU developers added when they rewrote awk to create gawk. It covers how to control input using the getline statement, how to use a coprocess to exchange information between gawk and a program running in the background, and how to use a coprocess to exchange data over a network.

getline: CONTROLLING INPUT

The getline statement gives you more control over the data gawk reads than other methods of input do. When you give a variable name as an argument to getline, it reads data into that variable. The BEGIN block of the g1 program uses getline to read one line into the variable aa from standard input:

 $ cat g1
 BEGIN   {
         getline aa
         print aa
         }
 $ echo aaaa | gawk -f g1
 aaaa

The alpha file is used in the next few examples:

 $ cat alpha
 aaaaaaaaa
 bbbbbbbbb
 ccccccccc
 ddddddddd

Even when g1 is given more than one line of input, it processes only the first line:

 $ gawk -f g1 < alpha
 aaaaaaaaa

When getline is not given an argument, it reads into $0 and modifies the field variables ($1, $2, . . .):

 $ gawk 'BEGIN {getline;print $1}' < alpha
 aaaaaaaaa
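Because a plain getline in the BEGIN block fills $0 and the field variables, it offers one way to consume a header line before the automatic reads begin. The following sketch assumes input whose first line is a header. The function name skip_header and the message text are illustrative, and AWK is a hypothetical override variable that defaults to gawk.

```shell
#!/bin/bash
# Hypothetical helper: report on the first (header) line, then number
# the remaining lines. AWK is an assumed override; it defaults to gawk.
skip_header() {
    "${AWK:-gawk}" '
    BEGIN {
        if ((getline) > 0)               # read the header into $0; NF is set
            print "header has", NF, "fields"
    }
    { print NR ": " $0 }                 # automatic reads resume at line 2
    ' "$@"
}
```

The numbering starts at 2 because the getline in the BEGIN block has already counted the header as record 1.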

The g2 program uses a while loop in the BEGIN block to loop over the lines in standard input. The getline statement reads each line into holdme and print outputs each value of holdme.

 $ cat g2
 BEGIN   {
         while (getline holdme)
             print holdme
         }
 $ gawk -f g2 < alpha
 aaaaaaaaa
 bbbbbbbbb
 ccccccccc
 ddddddddd
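The loop in g2 works because getline returns 1 for each line it reads and 0 at the end of the input. It also returns -1 on an error, such as an attempt to read a file that cannot be opened, and -1 counts as true, so a bare while (getline ...) can loop without end in that case. The sketch below compares the return value with 0 explicitly. Reading from a named file, the function name count_lines, and the AWK override variable (default gawk) are additions made for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: count the lines in the file named by $1 with
# getline in a BEGIN block; print -1 if the file cannot be read.
count_lines() {
    "${AWK:-gawk}" -v file="$1" 'BEGIN {
        n = 0
        # getline returns 1 per line read, 0 at end of file, -1 on error
        while ((status = (getline line < file)) > 0)
            n++
        if (status < 0)
            n = -1
        print n
    }'
}
```

With the alpha file shown above, count_lines alpha displays 4.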

The g3 program demonstrates that gawk automatically reads each line of input into $0 when it has statements in its body (and not just a BEGIN block). This program outputs the record number (NR), the string $0:, and the value of $0 (the current record) for each line of input.

 $ cat g3
         {print NR, "$0:", $0}
 $ gawk -f g3 < alpha
 1 $0: aaaaaaaaa
 2 $0: bbbbbbbbb
 3 $0: ccccccccc
 4 $0: ddddddddd

Next g4 demonstrates that getline works independently of gawk's automatic reads and $0. When getline reads into a variable, it does not modify $0 nor does it modify any of the fields in the current record ($1, $2, . . .). The first statement in g4 is the same as the statement in g3 and outputs the line that gawk has automatically read. The getline statement reads the next line of input into the variable named aa. The third statement outputs the record number, the string aa:, and the value of aa. The output from g4 shows that getline processes records independently of gawk's automatic reads.

 $ cat g4
         {
         print NR, "$0:", $0
         getline aa
         print NR, "aa:", aa
         }
 $ gawk -f g4 < alpha
 1 $0: aaaaaaaaa
 2 aa: bbbbbbbbb
 3 $0: ccccccccc
 4 aa: ddddddddd

The g5 program outputs each line of input except the line that follows a line beginning with the letter b; getline consumes that line instead. The first print statement outputs each line that gawk reads automatically. Next the /^b/ pattern selects all lines that begin with b for special processing. The action uses getline to read the next line of input into the variable hold, outputs the string skip this line: followed by the value of hold, and outputs the value of $1. The $1 holds the value of the first field of the record that gawk read automatically, not the record read by getline. The final statement displays a string and the value of NR, the current record number. Even though getline does not change $0 when it reads into a variable, gawk increments NR, which is why the output reports finishing line # 3 immediately after displaying line # 2.

 $ cat g5
         # print all lines except those read with getline
         {print "line #", NR, $0}
 # if line begins with "b" process it specially
 /^b/    {
         # use getline to read the next line into variable named hold
         getline hold
         # print value of hold
         print "skip this line:", hold
         # $0 is not affected when getline reads into a variable
         # $1 still holds previous value
         print "previous line began with:", $1
         }
         {
         print ">>>> finished processing line #", NR
         print ""
         }
 $ gawk -f g5 < alpha
 line # 1 aaaaaaaaa
 >>>> finished processing line # 1

 line # 2 bbbbbbbbb
 skip this line: ccccccccc
 previous line began with: bbbbbbbbb
 >>>> finished processing line # 3

 line # 4 ddddddddd
 >>>> finished processing line # 4
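Because getline can read the next record into a variable without disturbing $0, a program can handle its input two lines at a time. The following sketch relies only on that behavior. The function name pair_lines and the AWK override variable (default gawk) are invented for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: join each automatically read line with the line
# that follows it. AWK is an assumed override; it defaults to gawk.
pair_lines() {
    "${AWK:-gawk}" '{
        first = $0                       # line read automatically by gawk
        if ((getline second) > 0)        # next line, read into a variable
            print first, second
        else
            print first                  # odd line count: no partner left
    }' "$@"
}
```

Given the alpha file, pair_lines alpha displays aaaaaaaaa bbbbbbbbb on one line and ccccccccc ddddddddd on the next.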

COPROCESS: TWO-WAY I/O

A coprocess is a process that runs in parallel with another process. Starting with version 3.1, gawk can invoke a coprocess to exchange information directly with a background process. A coprocess can be useful when you are working in a client/server environment, setting up an SQL (page 902) front end/back end, or exchanging data with a remote system over a network. The gawk syntax identifies a coprocess by preceding the name of the program that starts the background process with a |& operator.

The coprocess command must be a filter (i.e., it reads from standard input and writes to standard output) and must flush its output whenever it has a complete line rather than accumulating lines for subsequent output. When a command is invoked as a coprocess, it is connected via a two-way pipe to a gawk program so that you can read from and write to the coprocess.

to_upper

When used alone the tr utility (page 804) does not flush its output after each line. The to_upper shell script is a wrapper for tr that does flush its output; this filter can be run as a coprocess. For each line read, to_upper writes the line, translated to uppercase, to standard output. Remove the # before set -x if you want to_upper to display debugging output.

 $ cat to_upper
 #!/bin/bash
 #set -x
 while read arg
 do
     echo "$arg" | tr '[a-z]' '[A-Z]'
 done
 $ echo abcdef | to_upper
 ABCDEF

The g6 program invokes to_upper as a coprocess. This gawk program reads standard input or a file specified on the command line, translates the input to uppercase, and writes the translated data to standard output.

 $ cat g6
     {
     print $0 |& "to_upper"
     "to_upper" |& getline hold
     print hold
     }
 $ gawk -f g6 < alpha
 AAAAAAAAA
 BBBBBBBBB
 CCCCCCCCC
 DDDDDDDDD

The g6 program has one compound statement, enclosed within braces, comprising three statements. Because there is no pattern, gawk executes the compound statement once for each line of input.

In the first statement, print $0 sends the current record to standard output. The |& operator redirects standard output to the program named to_upper, which is running as a coprocess. The quotation marks around the name of the program are required. The second statement redirects standard output from to_upper to a getline statement, which copies its standard input to the variable named hold. The third statement, print hold, sends the contents of the hold variable to standard output.
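The to_upper wrapper is necessary because g6 reads a reply after every line it sends. A filter such as sort, which cannot write anything until it has seen all of its input, calls for a different pattern: send every line first, close only the write side of the two-way pipe with close(command, "to"), and then read the results. The following sketch assumes gawk 3.1 or later; the function name sort_via_coprocess is invented for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: sort the input by passing it through sort
# running as a gawk coprocess. Requires gawk (|& is a gawk extension).
sort_via_coprocess() {
    gawk '
    BEGIN { cmd = "sort" }
          { print $0 |& cmd }                # send each record to the coprocess
    END   {
          if (NR == 0) exit                  # nothing was sent; no coprocess to read
          close(cmd, "to")                   # close write side so sort sees end of input
          while ((cmd |& getline line) > 0)
              print line                     # read back the sorted lines
          close(cmd)                         # close the read side as well
          }' "$@"
}
```

Without the call to close(cmd, "to"), sort would keep waiting for more input and the while loop would wait for output that never arrives.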

GETTING INPUT FROM A NETWORK

Building on the concept of a coprocess, gawk can exchange information with a process on another system via an IP network connection. When you specify one of the special filenames that begins with /inet/, gawk processes your request using a network connection. The format of these special filenames is

 /inet/protocol/local-port/remote-host/remote-port

where protocol is usually tcp but can be udp, local-port is 0 (zero) if you want gawk to pick a port (otherwise it is the number of the port you want to use), remote-host is the IP address (page 882) or fully qualified domain name (page 876) of the remote host, and remote-port is the port number on the remote host. Instead of a port number in local-port and remote-port, you can specify a service name such as http or ftp.

The g7 program reads the cars file from the server at www.sobell.com; the author has set up this file for you to experiment with. On www.sobell.com the file is located at /CMDREF1/code/chapter_12/cars. The first statement in g7 assigns the special filename to the server variable. The filename specifies a TCP connection, allows the local system to select an appropriate port, and connects to www.sobell.com on port 80. You can use http in place of 80 to specify the standard HTTP port.

The second statement uses a coprocess to send a GET request to the remote server. This request includes the pathname of the file gawk is requesting. A while loop uses a coprocess to redirect lines from the server to getline. Because getline has no variable name as an argument, it saves its input in the current record buffer $0. The final print statement sends each record to standard output. Experiment with this script, replacing the final print statement with gawk statements that process the file.

 $ cat g7
 BEGIN   {
         # set variable named server
         # to special networking filename
         server = "/inet/tcp/0/www.sobell.com/80"
         # use coprocess to send GET request to remote server
         print "GET /CMDREF1/code/chapter_12/cars" |& server
         # while loop uses coprocess to redirect
         # output from server to getline
         while (server |& getline)
             print $0
         }
 $ gawk -f g7
 plym    fury    1970    73      2500
 chevy   malibu  1999    60      3000
 ford    mustang 1965    45      10000
 volvo   s80     1998    102     9850
 ...