cars data file
Many of the examples in this section work with the cars data file. From left to right the columns in the file contain each car's make, model, year of manufacture, mileage in thousands of miles, and price. All whitespace in this file is composed of single TABs (the file does not contain any SPACEs).

$ cat cars
plym fury 1970 73 2500
chevy malibu 1999 60 3000
ford mustang 1965 45 10000
volvo s80 1998 102 9850
ford thundbd 2003 15 10500
chevy malibu 2000 50 3500
bmw 325i 1985 115 450
honda accord 2001 30 6000
ford taurus 2004 10 17000
toyota rav4 2002 180 750
chevy impala 1985 85 1550
ford explor 2003 25 9500

Missing pattern
A simple awk program is

{ print }

This program consists of one program line that is an action. Because the pattern is missing, awk selects all lines of input. When used without any arguments the print command displays each selected line in its entirety. This program copies the input to standard output.

$ awk '{ print }' cars
plym fury 1970 73 2500
chevy malibu 1999 60 3000
ford mustang 1965 45 10000
volvo s80 1998 102 9850
...

Missing action
The next program has a pattern but no explicit action. The slashes indicate that chevy is a regular expression.

/chevy/

In this case awk selects from the input all lines that contain the string chevy. When you do not specify an action, awk assumes that the action is print. The following example copies to standard output all lines from the input that contain the string chevy:

$ awk '/chevy/' cars
chevy malibu 1999 60 3000
chevy malibu 2000 50 3500
chevy impala 1985 85 1550

Single quotation marks
Although neither awk nor shell syntax requires single quotation marks on the command line, it is still a good idea to use them because they can prevent problems. If the awk program you create on the command line includes SPACEs or special shell characters, you must quote them.
Always enclosing the program in single quotation marks is the easiest way of making sure that you have quoted any characters that need to be quoted.

Fields
The next example selects all lines from the file (it has no pattern). The braces enclose the action; you must always use braces to delimit the action so that awk can distinguish it from the pattern. This example displays the third field ($3), a SPACE (the output field separator, indicated by the comma), and the first field ($1) of each selected line:

$ awk '{print $3, $1}' cars
1970 plym
1999 chevy
1965 ford
1998 volvo
...

The next example, which includes both a pattern and an action, selects all lines that contain the string chevy and displays the third and first fields from the lines it selects:

$ awk '/chevy/ {print $3, $1}' cars
1999 chevy
2000 chevy
1985 chevy

Next awk selects lines that contain a match for the regular expression h. Because there is no explicit action, awk displays all the lines it selects:

$ awk '/h/' cars
chevy malibu 1999 60 3000
ford thundbd 2003 15 10500
chevy malibu 2000 50 3500
honda accord 2001 30 6000
chevy impala 1985 85 1550

~ (matches operator)
The next pattern uses the matches operator (~) to select all lines that contain the letter h in the first field:

$ awk '$1 ~ /h/' cars
chevy malibu 1999 60 3000
chevy malibu 2000 50 3500
honda accord 2001 30 6000
chevy impala 1985 85 1550

The caret (^) in a regular expression forces a match at the beginning of the line (page 899) or, in this case, the beginning of the first field:

$ awk '$1 ~ /^h/' cars
honda accord 2001 30 6000

Brackets surround a character-class definition (page 899). In the next example, awk selects lines that have a second field that begins with t or m and displays the third and second fields, a dollar sign, and the fifth field. Because there is no comma between the "$" and the $5, awk does not put a SPACE between them in the output.
$ awk '$2 ~ /^[tm]/ {print $3, $2, "$" $5}' cars
1999 malibu $3000
1965 mustang $10000
2003 thundbd $10500
2000 malibu $3500
2004 taurus $17000

Dollar signs
The next example shows three roles a dollar sign can play in an awk program. A dollar sign followed by a number names a field. Within a regular expression a dollar sign forces a match at the end of a line or field (5$). Within a string a dollar sign represents itself.

$ awk '$3 ~ /5$/ {print $3, $1, "$" $5}' cars
1965 ford $10000
1985 bmw $450
1985 chevy $1550

In the next example, the equal-to relational operator (==) causes awk to perform a numeric comparison between the third field in each line and the number 1985. The awk command takes the default action, print, on each line where the comparison is true.

$ awk '$3 == 1985' cars
bmw 325i 1985 115 450
chevy impala 1985 85 1550

The next example finds all cars priced at or less than $3,000:

$ awk '$5 <= 3000' cars
plym fury 1970 73 2500
chevy malibu 1999 60 3000
bmw 325i 1985 115 450
toyota rav4 2002 180 750
chevy impala 1985 85 1550

Textual comparisons
When you use double quotation marks, awk performs textual comparisons by using the ASCII (or other local) collating sequence as the basis of the comparison. In the following example, awk shows that the strings 450 and 750 fall in the range that lies between the strings 2000 and 9000, which is probably not the intended result:

$ awk '"2000" <= $5 && $5 < "9000"' cars
plym fury 1970 73 2500
chevy malibu 1999 60 3000
chevy malibu 2000 50 3500
bmw 325i 1985 115 450
honda accord 2001 30 6000
toyota rav4 2002 180 750

When you need to perform a numeric comparison, do not use quotation marks. The next example gives the intended result. It is the same as the previous example except that it omits the double quotation marks.
$ awk '2000 <= $5 && $5 < 9000' cars
plym fury 1970 73 2500
chevy malibu 1999 60 3000
chevy malibu 2000 50 3500
honda accord 2001 30 6000

, (range operator)
The range operator (,) selects a group of lines. The first line it selects is the one specified by the pattern before the comma. The last line is the one selected by the pattern after the comma. If no line matches the pattern after the comma, awk selects every line through the end of the input. The next example selects all lines, starting with the line that contains volvo and concluding with the line that contains bmw:

$ awk '/volvo/ , /bmw/' cars
volvo s80 1998 102 9850
ford thundbd 2003 15 10500
chevy malibu 2000 50 3500
bmw 325i 1985 115 450

After the range operator finds its first group of lines, it begins the process again, looking for a line that matches the pattern before the comma. In the following example, awk finds three groups of lines that fall between chevy and ford. Although the fifth line of input contains ford, awk does not select it because at the time it is processing the fifth line, it is searching for chevy.

$ awk '/chevy/ , /ford/' cars
chevy malibu 1999 60 3000
ford mustang 1965 45 10000
chevy malibu 2000 50 3500
bmw 325i 1985 115 450
honda accord 2001 30 6000
ford taurus 2004 10 17000
chevy impala 1985 85 1550
ford explor 2003 25 9500

-f option
When you are writing a longer awk program, it is convenient to put the program in a file and reference the file on the command line. Use the -f option followed by the name of the file containing the awk program.

BEGIN
The following awk program, stored in a file named pr_header, has two actions and uses the BEGIN pattern. The awk utility performs the action associated with BEGIN before processing any lines of the data file: It displays a header. The second action, {print}, has no pattern part and displays all the lines from the input.
$ cat pr_header
BEGIN {print "Make Model Year Miles Price"}
{print}

$ awk -f pr_header cars
Make Model Year Miles Price
plym fury 1970 73 2500
chevy malibu 1999 60 3000
ford mustang 1965 45 10000
volvo s80 1998 102 9850
...

The next example expands the action associated with the BEGIN pattern. In the previous and following examples, the whitespace in the headers is composed of single TABs, so that the titles line up with the columns of data.

$ cat pr_header2
BEGIN {
    print "Make Model Year Miles Price"
    print "----------------------------------------"
}
{print}

$ awk -f pr_header2 cars
Make Model Year Miles Price
----------------------------------------
plym fury 1970 73 2500
chevy malibu 1999 60 3000
ford mustang 1965 45 10000
volvo s80 1998 102 9850
...

length function
When you call the length function without an argument, it returns the number of characters in the current line, including field separators. The $0 variable always contains the value of the current line. In the next example, awk prepends the line length to each line and then a pipe sends the output from awk to sort (the -n option specifies a numeric sort; page 837) so that the lines of the cars file appear in order of length:

$ awk '{print length, $0}' cars | sort -n
21 bmw 325i 1985 115 450
22 plym fury 1970 73 2500
23 volvo s80 1998 102 9850
24 ford explor 2003 25 9500
24 toyota rav4 2002 180 750
25 chevy impala 1985 85 1550
25 chevy malibu 1999 60 3000
25 chevy malibu 2000 50 3500
25 ford taurus 2004 10 17000
25 honda accord 2001 30 6000
26 ford mustang 1965 45 10000
26 ford thundbd 2003 15 10500

The formatting of this report depends on TABs for horizontal alignment. The three extra characters at the beginning of each line throw off the format of several lines. A remedy for this situation is covered shortly.

NR (record number)
The NR variable contains the record (line) number of the current line. The following pattern selects all lines that contain more than 24 characters.
The action displays the line number of each of the selected lines.

$ awk 'length > 24 {print NR}' cars
2
3
5
6
8
9
11

You can combine the range operator (,) and the NR variable to display a group of lines of a file based on their line numbers. The next example displays lines 2 through 4:

$ awk 'NR == 2 , NR == 4' cars
chevy malibu 1999 60 3000
ford mustang 1965 45 10000
volvo s80 1998 102 9850

END
The END pattern works in a manner similar to the BEGIN pattern, except that awk takes the actions associated with it after processing the last line of input. The following report displays information only after it has processed all the input. The NR variable retains its value after awk finishes processing the data file, so that an action associated with an END pattern can use it:

$ awk 'END {print NR, "cars for sale." }' cars
12 cars for sale.

The next example uses if control structures to expand the abbreviations used in some of the first fields. As long as awk does not change a record, it leaves the entire record, including separators, intact. Once it makes a change to a record, awk changes all separators in that record to the value of the output field separator. The default output field separator is a SPACE.

$ cat separ_demo
{
    if ($1 ~ /ply/) $1 = "plymouth"
    if ($1 ~ /chev/) $1 = "chevrolet"
    print
}

$ awk -f separ_demo cars
plymouth fury 1970 73 2500
chevrolet malibu 1999 60 3000
ford mustang 1965 45 10000
volvo s80 1998 102 9850
ford thundbd 2003 15 10500
chevrolet malibu 2000 50 3500
bmw 325i 1985 115 450
honda accord 2001 30 6000
ford taurus 2004 10 17000
toyota rav4 2002 180 750
chevrolet impala 1985 85 1550
ford explor 2003 25 9500

Stand-alone script
Instead of calling awk from the command line with the -f option and the name of the program you want to run, you can write a script that calls awk with the commands you want to run. The next example is a stand-alone script that runs the same program as the previous example.
The #!/usr/bin/awk -f command (page 267) runs the awk utility directly. You need both read and execute permission to the file holding the script (page 265).

$ chmod u+rx separ_demo2
$ cat separ_demo2
#!/usr/bin/awk -f
{
    if ($1 ~ /ply/) $1 = "plymouth"
    if ($1 ~ /chev/) $1 = "chevrolet"
    print
}

$ separ_demo2 cars
plymouth fury 1970 73 2500
chevrolet malibu 1999 60 3000
ford mustang 1965 45 10000
...

OFS variable
You can change the value of the output field separator by assigning a value to the OFS variable. The following example assigns a TAB character to OFS, using the backslash escape sequence \t. This fix improves the appearance of the report but does not line up the columns properly.

$ cat ofs_demo
BEGIN {OFS = "\t"}
{
    if ($1 ~ /ply/) $1 = "plymouth"
    if ($1 ~ /chev/) $1 = "chevrolet"
    print
}

$ awk -f ofs_demo cars
plymouth fury 1970 73 2500
chevrolet malibu 1999 60 3000
ford mustang 1965 45 10000
volvo s80 1998 102 9850
ford thundbd 2003 15 10500
chevrolet malibu 2000 50 3500
bmw 325i 1985 115 450
honda accord 2001 30 6000
ford taurus 2004 10 17000
toyota rav4 2002 180 750
chevrolet impala 1985 85 1550
ford explor 2003 25 9500

printf
You can use printf (page 615) to refine the output format. The following example uses a backslash at the end of several program lines to quote the following NEWLINE. You can use this technique to continue a long line over one or more lines without affecting the outcome of the program.
$ cat printf_demo
BEGIN {
    print " Miles"
    print "Make Model Year (000) Price"
    print \
    "--------------------------------------------------"
}
{
    if ($1 ~ /ply/) $1 = "plymouth"
    if ($1 ~ /chev/) $1 = "chevrolet"
    printf "%-10s %-8s %2d %5d $ %8.2f\n",\
        $1, $2, $3, $4, $5
}

$ awk -f printf_demo cars
 Miles
Make Model Year (000) Price
--------------------------------------------------
plymouth fury 1970 73 $ 2500.00
chevrolet malibu 1999 60 $ 3000.00
ford mustang 1965 45 $ 10000.00
volvo s80 1998 102 $ 9850.00
ford thundbd 2003 15 $ 10500.00
chevrolet malibu 2000 50 $ 3500.00
bmw 325i 1985 115 $ 450.00
honda accord 2001 30 $ 6000.00
ford taurus 2004 10 $ 17000.00
toyota rav4 2002 180 $ 750.00
chevrolet impala 1985 85 $ 1550.00
ford explor 2003 25 $ 9500.00

Redirecting output
The next example creates two files: one with the lines that contain chevy and one with the lines that contain ford:

$ cat redirect_out
/chevy/ {print > "chevfile"}
/ford/ {print > "fordfile"}
END {print "done."}

$ awk -f redirect_out cars
done.

$ cat chevfile
chevy malibu 1999 60 3000
chevy malibu 2000 50 3500
chevy impala 1985 85 1550

The summary program produces a summary report on all cars and newer cars. Although they are not required, the initializations at the beginning of the program represent good programming practice; awk automatically declares and initializes variables as you use them. After reading all the input data, awk computes and displays averages.
$ cat summary
BEGIN {
    yearsum = 0 ; costsum = 0
    newcostsum = 0 ; newcount = 0
}
{
    yearsum += $3
    costsum += $5
}
$3 > 2000 {newcostsum += $5 ; newcount ++}
END {
    printf "Average age of cars is %4.1f years\n",\
        2006 - (yearsum/NR)
    printf "Average cost of cars is $%7.2f\n",\
        costsum/NR
    printf "Average cost of newer cars is $%7.2f\n",\
        newcostsum/newcount
}

$ awk -f summary cars
Average age of cars is 13.1 years
Average cost of cars is $6216.67
Average cost of newer cars is $8750.00

The following awk command shows the format of a line from the passwd database that the next example uses:

$ nidump passwd / | awk '/mark/ {print}'
mark:********:107:100::0:0:ext 112:/Users/mark:/bin/tcsh

In this example the nidump utility (page 796) sends the passwd database to standard output.

The next example demonstrates a technique for finding the largest number in a field. Because it works with the passwd database, which delimits fields with colons (:), the example changes the input field separator (FS) before reading any data. It reads the passwd database and determines the next available user ID number (field 3). The numbers do not have to be in order in the passwd database for this program to work.

The pattern ($3 > saveit) causes awk to select records that contain a user ID number greater than any previous user ID number that it has processed. Each time it selects a record, awk assigns the value of the new user ID number to the saveit variable. Then awk uses the new value of saveit to test the user IDs of all subsequent records. Finally awk adds 1 to the value of saveit and displays the result.

$ cat find_uid
BEGIN {FS = ":"
       saveit = 0}
$3 > saveit {saveit = $3}
END {print "Next available UID is " saveit + 1}

$ nidump passwd / | awk -f find_uid
Next available UID is 1029

The next example produces another report based on the cars file. This report uses nested if...else control structures to substitute values based on the contents of the price field.
The program has no pattern part; it processes every record.

$ cat price_range
{
    if ($5 <= 5000) $5 = "inexpensive"
    else if (5000 < $5 && $5 < 10000) $5 = "please ask"
    else if (10000 <= $5) $5 = "expensive"
#
    printf "%-10s %-8s %2d %5d %-12s\n",\
        $1, $2, $3, $4, $5
}

$ awk -f price_range cars
plym fury 1970 73 inexpensive
chevy malibu 1999 60 inexpensive
ford mustang 1965 45 expensive
volvo s80 1998 102 please ask
ford thundbd 2003 15 expensive
chevy malibu 2000 50 inexpensive
bmw 325i 1985 115 inexpensive
honda accord 2001 30 please ask
ford taurus 2004 10 expensive
toyota rav4 2002 180 inexpensive
chevy impala 1985 85 inexpensive
ford explor 2003 25 please ask

Associative arrays
Next the manuf associative array uses the contents of the first field of each record in the cars file as an index. The array is composed of the elements manuf[plym], manuf[chevy], manuf[ford], and so on. Each new element is initialized to 0 (zero) as it is created. The C language operator ++ increments the variable that it follows. The action following the END pattern is the special for structure that loops through the elements of an associative array. A pipe sends the output through sort to produce an alphabetical list of cars and the quantities in stock. Because it is a shell script and not an awk program file, you must have both read and execute permission to the manuf file to execute it as a command. Depending on how the PATH variable (page 285) is set, you may have to execute the script as ./manuf.

$ cat manuf
awk ' {manuf[$1]++}
END {for (name in manuf) print name, manuf[name]}
' cars |
sort

$ manuf
bmw 1
chevy 3
ford 4
honda 1
plym 1
toyota 1
volvo 1

The next program, named manuf.sh, is a more general shell script that includes some error checking. This script lists and counts the contents of a column in a file, with both the column number and the name of the file being specified on the command line.
The first action (the one that starts with {count) uses the shell variable $1 in the middle of the awk program to specify an array index. Because of the way the single quotation marks are paired, the $1 that appears to be within single quotation marks is actually not quoted: The two quoted strings in the awk program surround, but do not include, the $1. Because the $1 is not quoted, and because this is a shell script, the shell substitutes the value of the first command line argument in place of $1 (page 565). As a result the $1 is interpreted before the awk command is invoked. The leading dollar sign (the one before the first single quotation mark on that line) causes awk to interpret what the shell substitutes as a field number.

$ cat manuf.sh
if [ $# != 2 ]
    then
        echo "Usage: manuf.sh field file"
        exit 1
fi
awk < $2 '
    {count[$'$1']++}
END {for (item in count) printf "%-20s%-20s\n",\
        item, count[item]}' |
sort

$ manuf.sh
Usage: manuf.sh field file

$ manuf.sh 1 cars
bmw 1
chevy 3
ford 4
honda 1
plym 1
toyota 1
volvo 1

$ manuf.sh 3 cars
1965 1
1970 1
1985 2
1998 1
1999 1
2000 1
2001 1
2002 1
2003 2
2004 1

A way around the tricky use of quotation marks that allow parameter expansion within the awk program is to use the -v option on the command line to pass the field number to awk as a variable. This change makes it easier for someone else to read and debug the script. You call the manuf2.sh script the same way you call manuf.sh:

$ cat manuf2.sh
if [ $# != 2 ]
    then
        echo "Usage: manuf.sh field file"
        exit 1
fi
awk -v "field=$1" < $2 '
    {count[$field]++}
END {for (item in count) printf "%-20s%-20s\n",\
        item, count[item]}' |
sort

The word_usage script displays a word usage list for a file you specify on the command line. The tr utility (page 879) lists the words from standard input, one to a line. The sort utility orders the file, with the most frequently used words first. This script sorts groups of words that are used the same number of times in alphabetical order.
$ cat word_usage
tr -cs 'a-zA-Z' '[\n*]' < $1 |
awk '
    {count[$1]++}
END {for (item in count) printf "%-15s%3s\n", item, count[item]}' |
sort +1nr +0f -1

$ word_usage textfile
the 42
file 29
fsck 27
system 22
you 22
to 21
it 17
SIZE 14
and 13
MODE 13
...

Following is a similar program in a different format. The style mimics that of a C program and may be easier to read and work with for more complex awk programs:

$ cat word_count
tr -cs 'a-zA-Z' '[\n*]' < $1 |
awk ' {
    count[$1]++
}
END {
    for (item in count) {
        if (count[item] > 4) {
            printf "%-15s%3s\n", item, count[item]
        }
    }
} ' |
sort +1nr +0f -1

The tail utility displays the last ten lines of output, illustrating that words occurring fewer than five times are not listed:

$ word_count textfile | tail
directories 5
if 5
information 5
INODE 5
more 5
no 5
on 5
response 5
this 5
will 5

The next example shows one way to put a date on a report. The first line of input to the awk program comes from date. The program reads this line as record number 1 (NR == 1), processes it accordingly, and processes all subsequent lines with the action associated with the next pattern (NR > 1).

$ cat report
if (test $# = 0) then
    echo "You must supply a filename."
    exit 1
fi
(date; cat $1) |
awk '
NR == 1 {print "Report for", $1, $2, $3 ", " $6}
NR > 1 {print $5 "\t" $1}'

$ report cars
Report for Mon Jan 31, 2005
2500 plym
3000 chevy
10000 ford
9850 volvo
10500 ford
3500 chevy
450 bmw
6000 honda
17000 ford
750 toyota
1550 chevy
9500 ford

The next example sums each of the columns in a file you specify on the command line; it takes its input from the numbers file. The program performs error checking, reporting on and discarding rows that contain nonnumeric entries. It uses the next command (with the comment skip bad records) to skip the rest of the commands for the current record if the record contains a nonnumeric entry. At the end of the program, awk displays a grand total for the file.
$ cat numbers
10 20 30.3 40.5
20 30 45.7 66.1
30 xyz 50 70
40 75 107.2 55.6
50 20 30.3 40.5
60 30 45.O 66.1
70 1134.7 50 70
80 75 107.2 55.6
90 176 30.3 40.5
100 1027.45 45.7 66.1
110 123 50 57a.5
120 75 107.2 55.6

$ cat tally
awk ' BEGIN {
    ORS = ""
}

NR == 1 {                   # first record only
    nfields = NF            # set nfields to number of
}                           # fields in the record (NF)

{
    if ($0 ~ /[^0-9. \t]/)  # check each record to see if it contains
    {                       # any characters that are not numbers,
        print "\nRecord " NR " skipped:\n\t"    # periods, spaces, or TABs
        print $0 "\n"
        next                # skip bad records
    }
    else
    {
        for (count = 1; count <= nfields; count++)  # for good records loop through fields
        {
            printf "%10.2f", $count > "tally.out"
            sum[count] += $count
            gtotal += $count
        }
        print "\n" > "tally.out"
    }
}

END {                       # after processing last record
    for (count = 1; count <= nfields; count++)      # print summary
    {
        print " -------" > "tally.out"
    }
    print "\n" > "tally.out"
    for (count = 1; count <= nfields; count++)
    {
        printf "%10.2f", sum[count] > "tally.out"
    }
    print "\n\n Grand Total" gtotal "\n" > "tally.out"
} ' < numbers

$ tally

Record 3 skipped:
        30 xyz 50 70

Record 6 skipped:
        60 30 45.O 66.1

Record 11 skipped:
        110 123 50 57a.5

$ cat tally.out
10.00 20.00 30.30 40.50
20.00 30.00 45.70 66.10
40.00 75.00 107.20 55.60
50.00 20.00 30.30 40.50
70.00 1134.70 50.00 70.00
80.00 75.00 107.20 55.60
90.00 176.00 30.30 40.50
100.00 1027.45 45.70 66.10
120.00 75.00 107.20 55.60
------- ------- ------- -------
580.00 2633.15 553.90 490.50

Grand Total 4257.55

The next example shows a complete interactive shell script that uses awk to generate a report on the cars file based on price ranges:

$ cat list_cars
trap 'rm -f $$.tem > /dev/null;echo $0 aborted.;exit 1' 1 2 15
echo -n "Price range (for example, 5000 7500):"
read lowrange hirange

echo '
 Miles
Make Model Year (000) Price
--------------------------------------------------' > $$.tem

awk < cars '
$5 >= '$lowrange' && $5 <= '$hirange' {
    if ($1 ~ /ply/) $1 = "plymouth"
    if ($1 ~ /chev/) $1 = "chevrolet"
    printf "%-10s %-8s %2d %5d $ %8.2f\n", $1, $2, $3, $4, $5
}' | sort -n +5 >> $$.tem

cat $$.tem
rm $$.tem

$ list_cars
Price range (for example, 5000 7500):3000 8000

 Miles
Make Model Year (000) Price
--------------------------------------------------
chevrolet malibu 1999 60 $ 3000.00
chevrolet malibu 2000 50 $ 3500.00
honda accord 2001 30 $ 6000.00

$ list_cars
Price range (for example, 5000 7500):0 2000

 Miles
Make Model Year (000) Price
--------------------------------------------------
bmw 325i 1985 115 $ 450.00
toyota rav4 2002 180 $ 750.00
chevrolet impala 1985 85 $ 1550.00

$ list_cars
Price range (for example, 5000 7500):15000 100000

 Miles
Make Model Year (000) Price
--------------------------------------------------
ford taurus 2004 10 $ 17000.00
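The list_cars script splices $lowrange and $hirange into the awk program with the same paired-quotation-mark technique that manuf.sh uses, so the -v option shown in manuf2.sh can be applied to it as well. The following is a minimal sketch under stated assumptions, not the original script: the function name list_range and the awk variable names low and high are hypothetical, the header and temporary file are omitted, and sort -n -k 6 is assumed to be the POSIX key syntax that corresponds to the older sort -n +5 form.

```shell
#!/bin/sh
# Sketch only: a variant of the awk step in list_cars that passes the
# price limits with -v instead of splicing shell variables into the program.
# list_range, low, and high are hypothetical names chosen for illustration.

# Usage: list_range LOW HIGH FILE
list_range() {
    awk -v low="$1" -v high="$2" '
    # Adding 0 forces a numeric comparison even if a limit arrives as a string.
    $5 >= low + 0 && $5 <= high + 0 {
        if ($1 ~ /ply/)  $1 = "plymouth"
        if ($1 ~ /chev/) $1 = "chevrolet"
        printf "%-10s %-8s %2d %5d $ %8.2f\n", $1, $2, $3, $4, $5
    }' "$3" |
    sort -n -k 6        # sort on the price, the sixth blank-separated field
}
```

Called as list_range 3000 8000 cars, the function should select the same three cars as the first list_cars run. Because the whole awk program sits inside one pair of single quotation marks, nothing the user types at the prompt can change the program text itself.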