Section 4.8. Programmable Text Processing: gawk


The gawk utility scans one or more files and performs an action on all of the lines that match a particular condition. The actions and conditions are described by a gawk program, and range from the very simple to the complex.

gawk is a GNU reimplementation of the UNIX awk command. awk got its name from the combined first letters of its authors' surnames: Aho, Weinberger, and Kernighan. It borrows its control structures and expression syntax from the C language. If you already know C, then learning awk/gawk is quite straightforward.

awk/gawk is a comprehensive utility, so comprehensive, in fact, that there's a book on it! Because of this, I've attempted to describe only the main features and options of gawk; however, the material in this section will allow you to write a good number of useful applications. Figure 4-20 provides a synopsis of gawk.

Figure 4-20. Description of the gawk command.

Utility: gawk [ -Fc ] [ -f fileName | program ] { variable=value }* { fileName }*

gawk is a programmable text-processing utility that scans the lines of its input and performs actions on every line that matches a particular criterion. A gawk program may be included on the command line, in which case it should be surrounded by single quotes; alternatively, it may be stored in a file and specified using the -f option. The initial values of variables may be specified on the command line. The default field separators are tabs and spaces. To override this, use the -F option followed by the new field separator. If no filenames are specified, gawk reads from standard input.


The next few subsections describe the various gawk features and include many examples.

4.8.1. gawk Programs

A gawk program may be supplied on the command line, but it's much more common to place it in a text file and specify the file using the -f option. If you decide to place a gawk program on the command line, surround it by single quotes.

When gawk reads a line, it breaks it into fields that are separated by tabs and/or spaces. The field separator may be overridden by using the -F option, as you'll see later in this section. A gawk program is a list of one or more commands of the form:

[ condition ] [ { action } ]


where condition is one of the following:

  • the special token BEGIN or END

  • an expression involving logical operators, relational operators, and/or regular expressions

and action is a list of one or more of the following kinds of C-like statements, terminated by semicolons:

    • if (conditional) statement [ else statement ]

    • while (conditional) statement

    • for (expression; conditional; expression) statement

    • break

    • continue

    • variable=expression

    • print [ list of expressions ] [ > expression ]

    • printf format [, list of expressions ] [ > expression ]

    • next (skips the remaining patterns for the current line of input and moves on to the next line)

    • exit (stops reading input; the END action, if any, is performed and gawk then terminates)

    • { list of statements }


action is performed on every line that matches condition. If condition is missing, action is performed on every line. If action is missing, then all matching lines are simply sent to standard output. The statements in a gawk program may be indented and formatted using spaces, tabs, and newlines.
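The three forms a command can take (condition only, action only, and both) can be sketched as follows. This is only an illustration: the three-line input and the AWK helper variable are made up for the demo, and the helper simply picks gawk when it is installed and the system awk otherwise, since only standard awk features are used.

```shell
# Pick gawk if present, otherwise fall back to the system awk (assumption:
# at least one of the two is installed).
AWK=$(command -v gawk || command -v awk)

# Hypothetical three-line input, held in a variable for the demo.
input='alpha 1
beta 22
gamma 333'

# Condition only: matching lines are copied to standard output.
cond_only=$(printf '%s\n' "$input" | "$AWK" '/^b/')

# Action only: the action is performed on every line.
action_only=$(printf '%s\n' "$input" | "$AWK" '{ print $2 }')

# Condition and action: the action is performed on matching lines only.
both=$(printf '%s\n' "$input" | "$AWK" '$2 > 10 { print $1 }')

printf '%s\n' "$cond_only" "$action_only" "$both"
```

The first command prints the line beginning with "b", the second prints the second field of every line, and the third prints the first field of the lines whose second field exceeds 10.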

4.8.2. Accessing Individual Fields

The first field of the current line may be accessed by $1, the second by $2, etc. $0 stands for the entire line. The built-in variable NF is equal to the number of fields in the current line. In the following example, I ran a simple gawk program on the text file "float" to insert the number of fields into each line:

$ cat float                       ...look at the original file.
Wish I was floating in blue across the sky,
My imagination is strong,
And I often visit the days
When everything seemed so clear.
Now I wonder what I'm doing here at all...
$ gawk '{ print NF, $0 }' float    ...execute the command.
9 Wish I was floating in blue across the sky,
4 My imagination is strong,
6 And I often visit the days
5 When everything seemed so clear.
9 Now I wonder what I'm doing here at all...
$ _
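Because NF holds the number of fields, $NF is the last field, and a parenthesized expression after the $ can select a field by computed position. The one-line input below is made up for the demo, and the $(NF-1) form is standard awk syntax rather than something described in this section, so treat this as a sketch:

```shell
# Pick gawk if present, otherwise fall back to the system awk.
AWK=$(command -v gawk || command -v awk)

# Print the second field, the last field, the next-to-last field,
# and the field count of a single hypothetical input line.
out=$(printf 'Wish I was floating\n' | "$AWK" '{ print $2, $NF, $(NF-1), NF }')
printf '%s\n' "$out"
```

This should print "I floating was 4".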


4.8.3. BEGIN and END

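The special condition BEGIN is triggered before the first line is read, and the special condition END is triggered after the last line has been read. When the expressions in a print statement are listed one after another without commas, their values are concatenated with no space between them; print always ends its output with a newline. The built-in variable FILENAME is equal to the name of the current input file. Note that no file has been opened yet when BEGIN runs, so POSIX leaves FILENAME undefined there and current versions of gawk treat it as an empty string; the "Start of file: float" line in the output below may therefore show no file name on your system. In the following example, I ran a program that displayed the first, third, and last fields of every line: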


$ cat gawk2                    ...look at the gawk script.
BEGIN { print "Start of file:", FILENAME }
{ print $1 $3 $NF }    ...print 1st, 3rd, and last field.
END { print "End of file" }
$ gawk -f gawk2 float          ...execute the script.
Start of file: float
Wishwassky,
Myisstrong,
Andoftendays
Whenseemedclear.
Nowwonderall...
End of file
$ _
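If the file name must appear reliably at the start of the output, one possible approach is to print it when the first line of the file is read rather than in BEGIN. The sketch below assumes the standard built-in variable FNR (the line number within the current file), which this section does not cover; the file demo.txt and its contents are made up for the demo.

```shell
# Pick gawk if present, otherwise fall back to the system awk.
AWK=$(command -v gawk || command -v awk)

# Hypothetical two-line input file created just for this demo.
tmpdir=$(mktemp -d)
printf 'one two\nthree four\n' > "$tmpdir/demo.txt"

# FNR == 1 is true on the first line of each input file, at which point
# FILENAME has already been set.
out=$(cd "$tmpdir" && "$AWK" '
FNR == 1 { print "Start of file:", FILENAME }
         { print $1 $NF }
END      { print "End of file" }' demo.txt)

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

With this input the output should start with "Start of file: demo.txt", followed by "onetwo", "threefour", and "End of file".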


4.8.4. Operators

When commas are placed between the expressions in a print statement, a space is printed between their values. All of the usual C operators are available in gawk. The built-in variable NR contains the line number of the current line. In the next example, I ran a program that displayed the line number and the first, third, and last fields of lines 2 and 3 of "float":

$ cat gawk3               ...look at the gawk script.
NR > 1 && NR < 4 { print NR, $1, $3, $NF }
$ gawk -f gawk3 float     ...execute the script.
2 My is strong,
3 And often days
$ _
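As a further sketch of C-style operators in a condition and an action, the following combines the modulus, equality, logical OR, and multiplication operators. The four-line input is made up for the demo.

```shell
# Pick gawk if present, otherwise fall back to the system awk.
AWK=$(command -v gawk || command -v awk)

# Select even-numbered lines, plus any line with exactly one field,
# and print the line number followed by ten times the field count.
out=$(printf 'a b\nc d e\nf\ng h i j\n' | "$AWK" 'NR % 2 == 0 || NF == 1 { print NR, NF * 10 }')
printf '%s\n' "$out"
```

Line 1 is skipped because it is odd-numbered and has two fields; the remaining lines should produce "2 30", "3 10", and "4 40".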


4.8.5. Variables

gawk supports user-defined variables. There is no need to declare a variable. A variable's initial value is a null string or zero, depending on how you use it. In the next example, the program counted the number of lines and words in a file as it echoed the lines to standard output:

$ cat gawk4               ...look at the gawk script.
BEGIN { print "Scanning file" }
{
  printf "line %d: %s\n", NR, $0;
  lineCount++;
  wordCount += NF;
}
END { printf "lines = %d, words = %d\n", lineCount, wordCount }
$ gawk -f gawk4 float     ...execute the script.
Scanning file
line 1: Wish I was floating in blue across the sky,
line 2: My imagination is strong,
line 3: And I often visit the days
line 4: When everything seemed so clear.
line 5: Now I wonder what I'm doing here at all...
lines = 5, words = 33
$ _
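The synopsis in Figure 4-20 also allows a variable to be given its initial value on the command line. A minimal sketch, assuming a made-up variable name (min) and a made-up three-line input file: the assignment is placed after the program and before the file name. Such assignments are processed when gawk reaches them in the argument list, so the value is not yet available inside a BEGIN action.

```shell
# Pick gawk if present, otherwise fall back to the system awk.
AWK=$(command -v gawk || command -v awk)

# Hypothetical three-line input file created just for this demo.
tmpfile=$(mktemp)
printf 'a b c\nd\ne f\n' > "$tmpfile"

# "min" is a hypothetical variable name. It is set by the min=2 argument,
# which comes after the program text and before the file name.
out=$("$AWK" '
NF >= min { count++ }
END       { print count, "lines have at least", min, "fields" }' min=2 "$tmpfile")

printf '%s\n' "$out"
rm -f "$tmpfile"
```

Two of the three lines have two or more fields, so this should print "2 lines have at least 2 fields".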


4.8.6. Control Structures

gawk supports most of the standard C control structures. In the following example, I printed the fields in each line backward:

$ cat gawk5              ...look at the gawk script.
{
  for (i = NF; i >= 1; i--)
    printf "%s ", $i;
  printf "\n";
}
$ gawk -f gawk5 float     ...execute the script.
sky, the across blue in floating was I Wish
strong, is imagination My
days the visit often I And
clear. so seemed everything When
all... at here doing I'm what wonder I Now
$ _
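The while and if/else statements listed in Section 4.8.1 follow the same C-like shape. As a sketch, the program below adds up the fields of each line with a while loop and then labels the total with if/else; the numeric input and the threshold of 10 are made up for the demo.

```shell
# Pick gawk if present, otherwise fall back to the system awk.
AWK=$(command -v gawk || command -v awk)

out=$(printf '1 2 3\n10 20\n' | "$AWK" '
{
  sum = 0;
  i = 1;
  while (i <= NF) {        # add up the fields of the current line
    sum += $i;
    i++;
  }
  if (sum > 10)
    label = "big";
  else
    label = "small";
  printf "line %d: sum = %d (%s)\n", NR, sum, label;
}')
printf '%s\n' "$out"
```

This should print "line 1: sum = 6 (small)" and "line 2: sum = 30 (big)".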


4.8.7. Extended Regular Expressions

The condition for line matching can be an extended regular expression, which is defined in the Appendix of this book. Regular expressions must be placed between / characters. In the next example, I displayed all of the lines that contained a "t" followed by an "e," with any number of characters in between. In the first output line, for instance, the sequence that satisfies the condition runs from the "t" in "floating" to the "e" in "the."

$ cat gawk6               ...look at the script.
/t.*e/ { print $0 }
$ gawk -f gawk6 float     ...execute the script.
Wish I was floating in blue across the sky,
And I often visit the days
When everything seemed so clear.
Now I wonder what I'm doing here at all...
$ _
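Since a condition is an expression, a regular expression can be combined with the logical and relational operators. The sketch below recreates the "float" file in a temporary location (an assumption made so the example is self-contained) and shows a regular expression joined to a field-count test with &&, and one negated with !.

```shell
# Pick gawk if present, otherwise fall back to the system awk.
AWK=$(command -v gawk || command -v awk)

# Recreate the "float" text in a temporary file for the demo.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
Wish I was floating in blue across the sky,
My imagination is strong,
And I often visit the days
When everything seemed so clear.
Now I wonder what I'm doing here at all...
EOF

# Lines that contain "the" AND have more than six fields.
long_the=$("$AWK" '/the/ && NF > 6' "$tmpfile")

# Lines that do NOT contain a "t" followed later by an "e".
no_match=$("$AWK" '!/t.*e/' "$tmpfile")

printf '%s\n' "$long_the" "$no_match"
rm -f "$tmpfile"
```

The first command should print only the first line of the file, and the second should print only "My imagination is strong," which is the one line missing from the gawk6 output.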



4.8.8. Condition Ranges

A condition may be two expressions separated by a comma. In this case, gawk performs action on every line from the first line that matches the first condition to the next line that satisfies the second condition, inclusive:

$ cat gawk7               ...look at the gawk script.
/strong/ , /clear/ { print $0 }
$ gawk -f gawk7 float     ...execute the script.
My imagination is strong,
And I often visit the days
When everything seemed so clear.
$ _
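A range is not limited to one use per file: once the second condition has been satisfied, the range can be triggered again by a later line that matches the first condition. A small sketch, using made-up "start" and "stop" marker lines:

```shell
# Pick gawk if present, otherwise fall back to the system awk.
AWK=$(command -v gawk || command -v awk)

# Eight hypothetical input lines containing two start/stop blocks.
out=$(printf 'a\nstart\nb\nstop\nc\nstart\nd\nstop\n' | "$AWK" '/start/ , /stop/ { print NR ": " $0 }')
printf '%s\n' "$out"
```

Lines 2 through 4 and 6 through 8 should be printed, each prefixed with its line number; lines 1 and 5 fall outside both ranges.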


4.8.9. Field Separators

If the fields are separated by something other than spaces or tabs, use the -F option to specify the separator character. In the next example, I processed a file whose fields were separated by colons:

$ cat gawk3                    ...look at the gawk script.
NR > 1 && NR < 4 { print $1, $3, $NF }
$ cat float2                   ...look at the input file.
Wish:I:was:floating:in:blue:across:the:sky,
My:imagination:is:strong,
And:I:often:visit:the:days
When:everything:seemed:so:clear.
Now:I:wonder:what:I'm:doing:here:at:all...
$ gawk -F: -f gawk3 float2      ...execute the script.
My is strong,
And often days
$ _
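The separator can also be set from inside the program by assigning to the standard built-in variable FS in a BEGIN action. FS is not described in this section, so take this as a sketch of an alternative to -F rather than the approach used above; the two input lines are taken from "float2".

```shell
# Pick gawk if present, otherwise fall back to the system awk.
AWK=$(command -v gawk || command -v awk)

# Set the field separator to a colon before any input is read.
out=$(printf 'My:imagination:is:strong,\nAnd:I:often:visit:the:days\n' | "$AWK" 'BEGIN { FS = ":" } { print $1, $3, $NF }')
printf '%s\n' "$out"
```

This should print "My is strong," and "And often days", the same fields the -F: run selected.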


4.8.10. Built-in Functions

gawk supports several built-in functions, including exp (), log (), sqrt (), int (), and substr (). The first three work just like their standard C counterparts, and int () truncates its argument to an integer, much like a C cast to int. The substr (str, x, y) function returns the substring of str that starts at the xth character (counting from 1) and extends for at most y characters. Here's an example of these functions:

$ cat test              ...look at the input file.
1.1 a
2.2 at
3.3 eat
4.4 beat
$ cat gawk8             ...look at the gawk script.
{
  printf "$1 = %g ", $1;
  printf "exp = %.2g ", exp ($1);
  printf "log = %.2g ", log ($1);
  printf "sqrt = %.2g ", sqrt ($1);
  printf "int = %d ", int ($1);
  printf "substr (%s, 2, 2) = %s\n", $2, substr($2, 2, 2);
}
$ gawk -f gawk8 test    ...execute the script.
$1 = 1.1 exp = 3 log = 0.095 sqrt = 1 int = 1 substr (a, 2, 2) =
$1 = 2.2 exp = 9 log = 0.79 sqrt = 1.5 int = 2 substr (at, 2, 2) = t
$1 = 3.3 exp = 27 log = 1.2 sqrt = 1.8 int = 3 substr (eat, 2, 2) = at
$1 = 4.4 exp = 81 log = 1.5 sqrt = 2.1 int = 4 substr (beat, 2, 2) = ea
$ _
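To make the position counting in substr () and the truncation done by int () concrete, here is a short sketch that needs no input file, because the whole program is a BEGIN action. The string "floating" and the numbers are arbitrary demo values.

```shell
# Pick gawk if present, otherwise fall back to the system awk.
AWK=$(command -v gawk || command -v awk)

out=$("$AWK" 'BEGIN {
  s = "floating";
  print substr(s, 1, 5);       # five characters starting at the first
  print substr(s, 6, 3);       # three characters starting at the sixth
  print int(3.9), int(-3.9);   # the fractional part is discarded
}')
printf '%s\n' "$out"
```

This should print "float", then "ing", then "3 -3".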





Linux for Programmers and Users
ISBN: 0131857487
Year: 2007
Pages: 339
