sed and awk

So far, we've looked at some fairly simple examples of text processing. However, the power of Solaris-style text processing lies with advanced tools like sed and awk. sed is a command-line editing program that can be used to perform search-and-replace operations on very large files, as well as other kinds of noninteractive editing. awk, on the other hand, is a complete text processing programming language with a C-like syntax, and it can be used in conjunction with sed to program repetitive text processing and editing operations on large files. These combined operations include double and triple spacing files, printing line numbers, left- and right-justifying text, performing field extraction and field substitution, and filtering on specific strings and pattern specifications. We'll examine some of these applications below.

To start this example, we'll create a set of customer address records stored in a flat, tab-delimited text database file called test.dat:

 $ cat test.dat
 Bloggs  Joe     24 City Rd      Richmond        VA      23227
 Lee     Yat Sen 72 King St     Amherst         MA      01002
 Rowe    Sarah   3454 Capitol St Los Angeles     CA      90074
 Sakura  Akira   1 Madison Ave   New York        NY      10017

This is a fairly common type of record, storing a customer's surname, first name, street address, city, state, and ZIP code. For presentation, we can double-space the records in this file by redirecting the contents of the test.dat file through the sed command with the G command, which appends the (empty) hold space, plus a newline, after each line:

 $ sed G < test.dat
 Bloggs  Joe     24 City Rd      Richmond        VA      23227

 Lee     Yat Sen 72 King St     Amherst         MA      01002

 Rowe    Sarah   3454 Capitol St Los Angeles     CA      90074

 Sakura  Akira   1 Madison Ave   New York        NY      10017

The power of sed lies in its ability to be used in pipelines; thus, an action can be performed in conjunction with many other operations. For example, to insert double spacing and then immediately remove it, we simply invoke sed twice with the appropriate commands:

 $ sed G < test.dat | sed 'n;d'
 Bloggs  Joe     24 City Rd      Richmond        VA      23227
 Lee     Yat Sen 72 King St     Amherst         MA      01002
 Rowe    Sarah   3454 Capitol St Los Angeles     CA      90074
 Sakura  Akira   1 Madison Ave   New York        NY      10017

When printing reports, you'll probably be using line numbering at some point to uniquely identify records. You can generate line numbers dynamically for display by using sed:

 $ sed '/./=' test.dat | sed '/./N; s/\n/ /'
 1 Bloggs  Joe     24 City Rd      Richmond        VA      23227
 2 Lee     Yat Sen 72 King St     Amherst         MA      01002
 3 Rowe    Sarah   3454 Capitol St Los Angeles     CA      90074
 4 Sakura  Akira   1 Madison Ave   New York        NY      10017

For large files, it's often useful to be able to count the number of lines. While the wc command can be used for this purpose, sed can also be used in situations where wc is not available:

 $ cat test.dat | sed -n '$='
 4
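
For comparison, awk can produce the same count, since its built-in NR variable holds the number of records read so far; in an END block it equals the total. The following is a minimal sketch, using a scratch copy of some sample data under an illustrative /tmp path:

```shell
# Recreate a small scratch data file (the path and records are illustrative)
printf 'Bloggs\tJoe\nLee\tYat Sen\nRowe\tSarah\nSakura\tAkira\n' > /tmp/test.dat
# NR holds the current record number; in the END block it equals the total line count
awk 'END { print NR }' /tmp/test.dat
```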

When you're printing databases for display, you might want comments and titles left-justified, but all records displayed with two blank spaces before each line. This can be achieved by using sed:

 $ cat test.dat | sed 's/^/  /'
   Bloggs  Joe     24 City Rd      Richmond        VA      23227
   Lee     Yat Sen 72 King St     Amherst         MA      01002
   Rowe    Sarah   3454 Capitol St Los Angeles     CA      90074
   Sakura  Akira   1 Madison Ave   New York        NY      10017

Imagine that, due to some municipal reorganization, all cities currently located in MA were being reassigned to CT. sed would be the perfect tool to identify all instances of MA in the data file and replace them with CT:

 $ cat test.dat | sed 's/MA/CT/g'
 Bloggs  Joe     24 City Rd      Richmond        VA      23227
 Lee     Yat Sen 72 King St     Amherst         CT      01002
 Rowe    Sarah   3454 Capitol St Los Angeles     CA      90074
 Sakura  Akira   1 Madison Ave   New York        NY      10017
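
Note that s/MA/CT/g rewrites every occurrence of MA on a line, including any that happen to appear inside a street or city name. If the substitution should apply only to the State field, awk can target that field directly. A sketch, using an illustrative scratch file:

```shell
# One sample record in a scratch file (illustrative path and data)
printf 'Lee\tYat Sen\t72 King St\tAmherst\tMA\t01002\n' > /tmp/sub.dat
# Set both the input (FS) and output (OFS) separators to tab, then rewrite only field 5
awk 'BEGIN { FS = OFS = "\t" } $5 == "MA" { $5 = "CT" } { print }' /tmp/sub.dat
```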

If a data file has been entered as a first in, first out (FIFO) queue, then you'll generally be reading records from the file from top to bottom. However, if the data file is to be treated as a last in, first out (LIFO) stack, then it would be useful to be able to reorder the records from last to first:

 $ cat test.dat | sed '1!G;h;$!d'
 Sakura  Akira   1 Madison Ave   New York        NY      10017
 Rowe    Sarah   3454 Capitol St Los Angeles     CA      90074
 Lee     Yat Sen 72 King St     Amherst         MA      01002
 Bloggs  Joe     24 City Rd      Richmond        VA      23227
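
The same last-to-first reordering can be expressed in awk by buffering each record in an array and printing the array backward in an END block. A minimal sketch with illustrative data:

```shell
# Three sample records (illustrative)
printf 'one\ntwo\nthree\n' > /tmp/stack.dat
# Store each line under its record number, then print from the last record to the first
awk '{ line[NR] = $0 } END { for (i = NR; i >= 1; i--) print line[i] }' /tmp/stack.dat
```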

Some data-hiding applications require that data be encoded in a way that makes a file's contents nontrivial for another application to detect. One way to foil such programs is to reverse the character strings that comprise each record, which can be achieved by using sed:

 $ cat test.dat | sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'
 72232   AV      dnomhciR        dR ytiC 42      eoJ     sggolB
 20010   AM      tsrehmA tS gniK 27      neS taY eeL
 47009   AC      selegnA soL     tS lotipaC 4543 haraS   ewoR
 71001   YN      kroY weN        evA nosidaM 1   arikA   arukaS

Some reporting applications might require that only the first line of a file be processed. Although the head command can be used for this purpose, sed can also be used:

 $ sed q < test.dat
 Bloggs  Joe     24 City Rd      Richmond        VA      23227

Alternatively, if a certain number of lines are to be printed, sed can be used to extract the first n lines by prefixing the q command with a line number:

 $ sed 2q < test.dat
 Bloggs  Joe     24 City Rd      Richmond        VA      23227
 Lee     Yat Sen 72 King St     Amherst         MA      01002

The grep command is often used to detect strings within files. However, sed can also be used for this purpose, as shown in the following example where the string CA (representing California) is searched for:

 $ cat test.dat | sed '/CA/!d'
 Rowe    Sarah   3454 Capitol St Los Angeles     CA      90074

However, this is a fairly gross and inaccurate method, because CA might match a street address like 1 CALGARY Rd, or 23 Green CAPE. Thus, it's necessary to use the field extraction features of awk. In the following example, we use awk to extract and print the fifth column in the data file, representing the state:

 $ cat test.dat | awk 'BEGIN {FS = "\t"}{print $5}'
 VA
 MA
 CA
 NY

Note that the tab character (\t) is specified as the field delimiter. Now, if we combine the field extraction capability of awk with the string searching facility of sed, we should be able to print out a list of all occurrences of the state CA:

 $ cat test.dat | awk 'BEGIN {FS = "\t"}{print $5}' | sed '/CA/!d'
 CA

Alternatively, we could simply count the number of records that contained CA in the State field:

 $ cat test.dat | awk 'BEGIN {FS = "\t"}{print $5}' | sed '/CA/!d' | sed -n '$='
 1
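
The same count can be produced by awk alone, and more safely: comparing the State field ($5) for exact equality avoids false matches such as CALGARY in a street address. A sketch with illustrative records:

```shell
# Two sample records: one with CA in the State field, one with "CA" only inside the street
printf 'Rowe\tSarah\t3454 Capitol St\tLos Angeles\tCA\t90074\n' > /tmp/state.dat
printf 'Smith\tAnn\t1 CALGARY Rd\tDenver\tCO\t80201\n' >> /tmp/state.dat
# Exact comparison on field 5; n + 0 forces numeric output even when no record matches
awk 'BEGIN { FS = "\t" } $5 == "CA" { n++ } END { print n + 0 }' /tmp/state.dat
```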

When producing reports, it's useful to be able to selectively display fields in a different order. For example, while the surname is typically used as a primary key and is generally the first field, most reports would display the first name before the surname, which can be achieved by using awk:

 $ cat test.dat | awk 'BEGIN {FS = "\t"}{print $2,$1}'
 Joe Bloggs
 Yat Sen Lee
 Sarah Rowe
 Akira Sakura

It's also possible to split such reordered fields across different lines and use different format specifiers. For example, the following script prints the first name and surname on one line and the state on the following line. Such code is the basis of many mail merge and bulk printing programs:

 $ cat test.dat | awk 'BEGIN {FS = "\t"}{print $2,$1,"\n"$5}'
 Joe Bloggs
 VA
 Yat Sen Lee
 MA
 Sarah Rowe
 CA
 Akira Sakura
 NY
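
For finer control of such layouts, awk's printf statement takes explicit format specifiers, so field order and newlines are stated rather than implied. A sketch producing a two-line mailing label from one illustrative record:

```shell
# One sample record in a scratch file (illustrative path and data)
printf 'Bloggs\tJoe\t24 City Rd\tRichmond\tVA\t23227\n' > /tmp/merge.dat
# Each %s substitutes one string; newlines appear only where the format string says so
awk 'BEGIN { FS = "\t" } { printf "%s %s\n%s, %s %s\n", $2, $1, $4, $5, $6 }' /tmp/merge.dat
```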

Since awk is a complete programming language, it contains many common constructs, like if/then/else evaluation of logical states, which can be used to test business logic cases. For example, in a mailing program, the bounds of valid ZIP codes could be checked by determining whether the ZIP code lies within a valid range. The following routine treats a ZIP code as valid if its numeric value is less than 9999 and rejects it as invalid otherwise:

 $ cat test.dat | awk 'BEGIN {FS = "\t"}{print $2,$1}{if($6<9999) {print "Valid zipcode"} else {print "Invalid zipcode"}}'
 Joe Bloggs
 Invalid zipcode
 Yat Sen Lee
 Valid zipcode
 Sarah Rowe
 Invalid zipcode
 Akira Sakura
 Invalid zipcode
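
Note that the numeric comparison above treats a ZIP code such as 01002 as the number 1002, so leading zeros are lost. A stricter sketch matches exactly five digits with a regular expression (the scratch file and the deliberately short second ZIP are illustrative):

```shell
# Two sample records: one well-formed ZIP, one with a digit missing
printf 'Bloggs\tJoe\t24 City Rd\tRichmond\tVA\t23227\n' > /tmp/zip.dat
printf 'Lee\tYat Sen\t72 King St\tAmherst\tMA\t1002\n' >> /tmp/zip.dat
# A ZIP is treated as valid only if field 6 is exactly five digits
awk 'BEGIN { FS = "\t" }
     { if ($6 ~ /^[0-9][0-9][0-9][0-9][0-9]$/) print $2, $1, "valid"
       else print $2, $1, "invalid" }' /tmp/zip.dat
```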

sed

The standard options for sed are shown here:

  • -n Suppresses automatic printing of the pattern space.

  • -e script Executes the editing commands given in script on the command line.

  • -f filename Executes the script contained in the file filename.

  • -V Displays the version number.

awk

The standard POSIX options for awk are shown here:

  • -f filename Where filename is the name of the awk file to process.

  • -F field Where field is the field separator.

  • -v x=y Where x is a variable and y is a value.

  • -W lint Turns on lint checking.

  • -W lint-old Uses old-style lint checking.

  • -W traditional Enforces traditional usage.

  • -W version Displays the version number.

 
 
   


Sun Certified Solaris(tm) 9 System and Network Administrator All-in-One Exam Guide
ISBN: 0072225300
Year: 2003
Pages: 265
Authors: Paul Watters