Using awk

     

Programming for NDS disaster recovery need not always involve the Novell Developer Kit (NDK) or any form of NDS or the NetWare application programming interface (API) set. Some of the most effective programming techniques available are nothing more than text file manipulations. This type of manipulation is good for data conversions such as converting NList output to a format suitable for UImport or ICE to use for input.

Several programming languages include text (or string) manipulation. BASIC, C, Pascal, Perl, and even assembly language include interfaces and libraries for performing string manipulation. One of the best languages for this sort of work, however, is a programming language called awk.

The awk language was developed in 1977 by Alfred Aho, Brian Kernighan, and Peter Weinberger at AT&T Bell Labs; the name of the language comes from the last-name initials of the three authors. The original development of awk was done in a Unix environment, and because of this, many of the concepts are familiar if you have had exposure to utilities such as grep or sed. grep is very similar to the Windows find utility because it searches one or more input files for lines containing a match to a specified pattern. By default, grep prints the matching lines. sed (stream editor) is a non-interactive editor that works from a command line; it's not a GUI Windows application. sed changes blocks of text on-the-fly, without requiring you to open a screen, push a cursor around, and press Delete, Insert, Enter, or function keys.
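If you have not worked with these utilities before, a quick sketch may help; the sample lines here are invented for illustration:

```shell
# grep prints the lines of its input that match a pattern,
# much like the Windows FIND utility:
printf 'User: JimH\nLast Name: Henderson\nUser: PeterK\n' | grep 'User:'
# User: JimH
# User: PeterK

# sed edits the stream as it passes through, with no interactive editor:
printf 'User: JimH\nUser: PeterK\n' | sed 's/User:/Login:/'
# Login: JimH
# Login: PeterK
```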

awk is an interpreted scripting language; this means that there is no compiler or other means to turn an awk program into a self-sufficient executable program as you would a C or Pascal program.

NOTE

The examples in this book are all interpreted using the GNU version of awk (called gawk), available from the Free Software Foundation. (GNU is a recursive acronym for "GNU's Not Unix"; it is pronounced "guh-noo.") The discussion presented in this section is based on version 3.0.3, but newer versions are available for DOS, Windows 32-bit operating systems, Unix/Linux, and other operating system platforms. An updated list of sites that maintain awk source code and binaries is available from the comp.lang.awk FAQ at ftp://rtfm.mit.edu/pub/usenet/comp.lang.awk/faq.


Why awk?

You might be wondering why we recommend awk for string manipulation.

There are several reasons awk might be a better choice than other languages for the sort of rapid development that may be necessary in a disaster situation:

  • awk interpreters are available for many operating system platforms, including DOS, Windows 32-bit operating systems, and Linux.

  • awk interpreters do not require any sort of special memory manager. Many Intel-platform Perl interpreters require DOS extenders of different sorts.

  • awk is not a compiled language; therefore, you can readily read and modify its script.

  • The awk interpreter is very small (typically around 30KB or so), and it can be put on a disk along with a number of standard scripts for disaster recovery purposes.

  • Unlike C or C++, awk does not require that you understand pointers when manipulating strings.

  • The awk user interface is very straightforward. If you are recovering from a very serious disaster, a minimal workstation configuration can be used: DOS 6.22 with a Novell Client installed is sufficient to start parsing NList outputs.

  • awk's regular expression parsing capabilities exceed those of many traditional programming languages, such as C and Pascal.

How Does awk Work?

An awk program takes input a line at a time, parses and processes the input, and produces an output file. This is usually done through command-line pipes or redirection . Normally, this process involves the use of three files: an awk script, an input file, and an output file.

The awk script itself is a set of rules for processing lines of text from the input file. These rules are written using a pattern-matching technique that is common in the Unix world, called a regular expression (regex). Regex pattern matching enables you to specify the format of a line of text in the input file; if a line matches the regex, the text is processed in a manner described by that portion of the script.

Regex pattern matching uses this basic format:

 /pattern/ 

where pattern is replaced by a string that represents an input format. Table 10.1 shows special sequences of characters that can be used in the pattern.

Table 10.1. Sample Regular Expressions

/User:/
    Match lines containing the case-sensitive string User:
    Example matches: User: Jim

/L.* Name:/
    Match lines containing an L, any other characters, and then the string Name:
    (In a regex, the dot matches any single character and * means zero or more of the preceding item.)
    Example matches: Last Name: Henderson

/[JK][iu][mo]/
    Match lines containing the letter J or K, then the letter i or u, and then the letter m or o
    Example matches: Jim, Kuo, Juo, Kim, Jio

/[Jj][Ii][Mm]/
    Match lines containing J or j, then I or i, and then M or m
    Example matches: Jim, jIm, JIm, jIM

/^Jim[0-9][0-9]/
    Match lines starting with Jim followed by a two-digit number
    Example matches: Jim00, Jim01, Jim90, Jim42

/Jim$/
    Match lines ending with Jim
    Example matches: Hi, Jim

/^$/
    Match blank lines

For example, if you execute this command:

 NLIST USER SHOW Name /S /R /C > OUTPUT.TXT 

and use the file OUTPUT.TXT as the input file to the script:

 /User:/ { print "Found a user ID" } 

awk searches the input file for the case-sensitive string User: , and if this text is found, it prints the string "Found a user ID."
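Before running a pattern against a large NLIST output file, you can try it interactively against a couple of invented lines. Note that a rule with a pattern but no action uses awk's default action, which is to print the matching line:

```shell
# Only the User: line triggers the rule.
printf 'User: JimH\nLast Name: Henderson\n' | \
awk '/User:/ { print "Found a user ID" }'
# Found a user ID

# A pattern with no action prints the matching lines themselves.
printf 'Jim42\nSay hi to Jim later\n' | awk '/^Jim[0-9][0-9]/'
# Jim42
```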

awk supports two special commands: BEGIN and END. These are not really patterns but are used to include special instructions, such as variable initialization and final output options, in the script. Here's an example:

 BEGIN   { count = 0 }
 /User:/ { count++ }
 END     { printf("Total users found: %d\n", count) } 

NOTE

If you are familiar with the C/C++, Perl, or Java programming languages, resist the urge to put a semicolon (;) at the end of each statement line; awk does not require it!


This script searches for instances of the case-sensitive string User: (as in the previous example), but rather than print a string, it increments the variable count; after the last line has been processed, the END block prints a total count of the user objects listed in the input file.
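You can watch this flow with a couple of invented input lines; BEGIN runs before the first line is read, and END runs after the last:

```shell
printf 'User: JimH\nLast Name: Henderson\nUser: PeterK\n' | \
awk 'BEGIN   { count = 0 }
     /User:/ { count++ }
     END     { printf("Total users found: %d\n", count) }'
# Total users found: 2
```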

An awk script parses the input line based on a field separator. By default, the field separator is whitespace. Whitespace includes any number of spaces or tabs between the data. The line is then split out into internal variables based on the field separator found.

For example, this line consists of a total of four fields:

 Full Name:   Jim Henderson 

These fields are referred to by the names $1, $2, $3, and $4, with these values:

Field   Value
$1      Full
$2      Name:
$3      Jim
$4      Henderson


The entire line of text is referred to by the variable $0. This variable always represents the entire line up to the record separator, which is by default a newline character.

TIP

Another special value is the NF value. This value reports the number of fields in the line. If you are uncertain about the number of fields but need the last value from the line, you can reference it as $NF. In the preceding example, $4 has the value Henderson, as does $NF; NF itself has the value 4.
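A one-line check, using the example line above, shows NF and $NF in action:

```shell
# NF is the number of fields; $NF is the last field itself.
echo "Full Name:   Jim Henderson" | awk '{ print NF, $1, $NF }'
# 4 Full Henderson
```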


You can change the defaults for the various separators, such as field separator and record separator. For example, if you have a tab-delimited file, you would want the field separator (FS) to be the tab character rather than a space. You typically make this change in the beginning of the script, in the BEGIN segment:

 BEGIN { FS = "\t" } 

NOTE

As with C, C++, and other programming languages, awk uses escaped characters for nonprintable characters. The \t sequence refers to the tab character, the \n sequence refers to the newline character, and so on.


The output field separator (OFS) denotes the character used to separate the fields when awk prints them. The default is a blank or space character. To output a comma-separated value (CSV) formatted file, you can change the OFS to a comma as follows:

 BEGIN { OFS = "," } 

You can also change the record separator (RS) by using the RS variable:

 BEGIN { RS = "\t" } 

Typically, you do not want to change this, but at times it might make sense to do so.
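A short sketch ties FS and OFS together: reading a tab-delimited line and writing it back out as CSV. One non-obvious detail is that awk only rebuilds $0 with the new OFS when a field is assigned to, hence the $1 = $1 idiom below:

```shell
printf 'JimH\tHenderson\tJim\n' | \
awk 'BEGIN { FS = "\t"; OFS = "," }
     { $1 = $1; print }'
# JimH,Henderson,Jim
```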

awk also has a number of string manipulation functions. Table 10.2 shows some of the commonly used functions and what they can be used for.

Table 10.2. awk String Manipulation Functions

gsub(SearchFor, Replace)
    Replace all occurrences of SearchFor with Replace in $0 (the input line).
    Returns: the number of replacements made.

gsub(SearchFor, Replace, SearchIn)
    Replace all occurrences of SearchFor with Replace in SearchIn.
    Returns: the number of replacements made.

index(String, Text)
    Locate the first occurrence of Text in String.
    Returns: the character position of the occurrence; if not found, returns 0.

length(String)
    Determine the length (in characters) of String.
    Returns: the number of characters in String.

match(String, Text)
    Locate the first occurrence of the regular expression Text in String (case-sensitive).
    Returns: the character position of the occurrence; if not found, returns 0. This function also sets the variables RSTART and RLENGTH, which are the start index and length of the matched substring.

split(String, Array)
    Break String into the array Array on the current field separator (specified by FS).
    Returns: the number of fields. Values in the array can be referred to with a subscript; if the array name is A, the first element is A[1], the second is A[2], and so on.

split(String, Array, FieldSeparator)
    Break String into the array Array on the specified field separator, FieldSeparator.
    Returns: the number of fields.

sprintf(Format, ExpressionList)
    Print output using a specified output format. This is similar to the C function sprintf(), except that the formatted result is returned (typically assigned to a variable on the left side of an equals sign) rather than written into a buffer argument.
    Returns: the formatted string.

sub(SearchFor, Replace)
    Substitute the first instance of SearchFor with Replace in $0.
    Returns: the number of replacements made (1 if a substitution occurred, 0 otherwise).

sub(SearchFor, Replace, Input)
    Substitute the first instance of SearchFor with Replace in Input.
    Returns: the number of replacements made (1 if a substitution occurred, 0 otherwise).

substr(String, Position)
    Return the suffix of String, starting at position Position.
    Returns: a string value with the result.

substr(String, Position, Length)
    Return the substring of String that starts at Position and is Length characters long.
    Returns: a string value with the result.
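The following sketch exercises several of the functions from Table 10.2 on a single invented line:

```shell
echo "Last Name: Henderson" | awk '{
    print length($0)           # 20 characters in the line
    print index($0, "Name")    # "Name" starts at position 6
    n = split($0, parts)       # split on whitespace (default FS)
    print n, parts[3]          # 3 Henderson
    gsub(/Name/, "NAME")       # replace in $0
    print $0                   # Last NAME: Henderson
    print substr($0, 1, 4)     # Last
}'
```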


In addition to doing string manipulation, you can also use awk for numeric manipulation. This type of manipulation of data is done in the same manner as C/C++ numeric manipulation. If you need to manipulate a number with an initial value, you can initialize it in the BEGIN section of the script.

When you're scanning a line and breaking it into the initial subcomponents ( $1 through $NF ), or when you're breaking it down using the split() function, if a numeric value is found, it is automatically treated as a number.

However, if you need to perform string manipulations on it, you are also able to do this. In this respect, awk provides greater flexibility than most other programming languages.
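For example, a field that looks like a number can be used both ways without any explicit conversion:

```shell
echo "Grace 42" | awk '{ print $2 + 8 }'      # used as a number: 50
echo "Grace 42" | awk '{ print length($2) }'  # used as a string: 2
```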

awk also supports the use of several other statements and structures. Table 10.3 lists the most common of these.

Table 10.3. awk Language Actions

Assignment
    x = 25

print
    print "The user name is username"

printf(format, expression list)
    printf("The user name is %s\n", username)

if (expression) statement
    if (username == "Jim") print "Found Jim!"

if (expression) statement else statement
    if (username == "Jim") print "Found Jim!"; else print "Found someone else"
    (A semicolon is needed to separate the first statement from else when both appear on one line.)

while (expression) statement
    while (NF > 10) printf("Too many fields, line %d, %d fields\n", NR, NF)

for (initialization; condition; increment) statement
    for (x = 0; x < 10; x++) printf("x=%d\n", x)

exit
    Exits the interpreter (any END block is still executed)

break
    Breaks out of the current for, while, or do loop
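These actions combine in the obvious C-like way. A small sketch with invented names (NR is the current line number):

```shell
printf 'Jim\nPeter\n' | awk '{
    if ($1 == "Jim")
        printf("Line %d: Found Jim!\n", NR)
    else
        printf("Line %d: Found someone else\n", NR)
}'
# Line 1: Found Jim!
# Line 2: Found someone else
```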


TIP

awk uses the pound sign ( # ) to indicate the start of a comment. Anything after a # will be ignored. Therefore, you can include inline comments anywhere within your script.


NOTE

awk resources and documentation are available on the Internet, and the following are two examples of good Web sites for this type of information: www.ling.ed.ac.uk/facilities/help/gawk/gawk_13.html#SEC97 and www.sunsite.ualberta.ca/Documentation/Gnu/gawk-3.1.0/html_chapter/gawk_10.html.


An additional feature of the awk language that is occasionally useful is the capability to create specialized functions for repeated operations. If you are coding an involved script, you can package the code so as to minimize your coding time. However, in disaster recovery situations, you will generally find that the scripts you write will perform very specific manipulations, and as a general rule, you do not need to reuse code within such scripts.

Creating functions within awk is a simple matter. For example, the following would be a function to return the minimum value of two passed-in parameters:

 function min(a, b) {
   if (a < b)
     return a
   else
     return b
 } 

Because awk supports the abbreviated if structure (the ?: conditional operator) that C provides, the preceding can also be coded as follows:

 function min(a, b) {
   return a < b ? a : b
 } 

This code sample provides the same functionality as the previous sample.
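A quick way to see the function in use, with invented input; function definitions can appear anywhere at the top level of the script:

```shell
printf '3 7\n9 4\n' | awk '
function min(a, b) { return a < b ? a : b }
{ print min($1, $2) }'
# 3
# 4
```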

After you have written an awk script, you need to give it an input file. When you're using the gawk interpreter, you do this by using a command in the following format:

 gawk -f  awkfile.awk  <  inputfile  [>  outputfile  ] 

This tells gawk to use awkfile.awk as the script (the -f option names the script file) and to pipe the contents of inputfile into the script. The input redirection can be left out, but if it is, you must type the input by hand at the keyboard. The output file contains the results of the script; if an output file is not specified, the output is written to the screen.
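A complete round trip looks like this sketch (the file names are invented; gawk users would type gawk where awk appears):

```shell
# Save a two-rule script to a file...
cat > count.awk <<'EOF'
/User:/ { count++ }
END     { printf("Total users found: %d\n", count) }
EOF

# ...create some invented input...
printf 'User: JimH\nUser: PeterK\n' > input.txt

# ...and run it, capturing the results in an output file.
awk -f count.awk < input.txt > output.txt
cat output.txt
# Total users found: 2
```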

TIP

Writing the output to the screen can be very useful during the development process. By exiting the script with the exit command after processing the first line of text, you can get an idea as to whether the program is working the way you want it to work without processing the entire input file.


In a disaster-recovery situation, it is not absolutely necessary that the script work completely correctly when you finish it. Using awk with a combination of other tools may get you to the end you need more quickly. Consider this example. Say you have an awk script that outputs a UImport file with group membership information in it. It is missing a few of the members because of an extra embedded space for those entries. Instead of trying to perfect the script to handle that small number of exceptions, you can use your favorite text editor to either remove those entries from the file or to make the correction using a search-and-replace function. Remember that during disaster recovery, it doesn't matter if your script is elegant or pretty; it just matters that you get the job done quickly and correctly.

Creating a UImport Data File by Using awk

The example in this section shows a full awk program designed to convert the output from the NLIST command into a format suitable for importing into a spreadsheet or database program. For simplicity, the scope of this example is limited to a single context. (Chapter 11, "Examples from the Real World," examines a case study that builds a UImport file based on information from the entire tree.)

For this example, the input file is generated by using the following command:

 NLIST USER SHOW SURNAME, "FULL NAME", "GIVEN NAME" > USERS.TXT 

The output file USERS.TXT contains the same information you would normally see on the screen. The contents of the file in this example are as follows:

 Object Class: User
 Current context: east.xyzcorp
 User: JimH
    Last Name: Henderson
    Full Name: Jim Henderson
    Given Name: Jim
 User: PeterK
    Last Name: Kuo
    Full Name: Peter Kuo
    Given Name: Peter
 User: PKuo
    Last Name: Kuo
    Full Name: Peter Kuo
    Given Name: Peter
 User: JHenderson
    Full Name: Jim Henderson
    Given Name: Jim
    Last Name: Henderson
 A total of 4 User objects was found in this context.
 A total of 4 User objects was found. 

From this information, we want to generate a comma-delimited file with the fields User ID, Context, First Name, Last Name, and Full Name. The output file also contains a header line with the field names in it.

This awk script performs the conversion to a comma-delimited file:

 BEGIN { flag = 0
         printf("\"User ID\",\"Context\",\"First Name\",\"Last Name\",\"Full Name\"\n")
 }
 /User:/ {
     if (flag == 1) {
         printf("\"%s\",\"%s\",\"%s\",\"%s\",\"%s\"\n",
                user, context, gn, ln, fn)
         gn = ""
         ln = ""
         fn = ""
     }
     user = $2
     gsub(" ", "", user)
     flag = 1
 }
 /Full Name:/ {
     gsub(/Full Name: /, "")
     gsub(/\t/, "")
     fn = $0
     # Cleans up leading blanks (ONLY) from the full name
     nfn = length(fn)
     for (i = 1; i < nfn; i++) {
         if (match(fn, " ") == 1)
             sub(" ", "", fn)
         else
             break
     }
 }
 /Last Name:/ {
     gsub(/Last Name: /, "")
     gsub(/\t/, "")
     ln = $0
     gsub(" ", "", ln)
 }
 /Given Name:/ {
     gsub(/Given Name: /, "")
     gsub(/\t/, "")
     gn = $0
     gsub(" ", "", gn)
 }
 /Current context:/ {
     gsub(/Current context: /, "")
     context = $0
     gsub(" ", "", context)
 }
 END {
     printf("\"%s\",\"%s\",\"%s\",\"%s\",\"%s\"\n",
            user, context, gn, ln, fn)
 } 

The script starts with the BEGIN statement, which executes before any lines of the input file USERS.TXT are read. It sets the flag variable to 0 in order to avoid printing a blank first line. The field headers are then printed to the output device.

Next, the first line is read. This line contains the object class information. Each pattern in the script is evaluated against the line, in order. First, the line is checked for the presence of the User: string. This line does not contain that string, so the next pattern is evaluated. The line also does not contain any of the other specified strings (Full Name:, Last Name:, Given Name:, or Current context:). As a result, the line is not processed and nothing is sent to the output.

The second line contains the string Current context:, so the code written to handle that is used to process the line. The first line of code (the gsub line) replaces the string Current context: with nothing, effectively removing it from the $0 variable. The variable context is then assigned the remaining text of the line, which is the actual context. This variable is preserved from one line to the next and is printed each time a new user is read in, as well as at the end of the script.

After that line is processed, the next line is handled similarly. It contains a user ID, which is assigned to the user variable ($2 on that line). The rule also sets the flag variable to 1; because flag was 0 when the script started, no information is printed for this first user. Once flag is set to 1, each subsequent time a user ID is found, the previous user's information is printed, and all variables except context are reset to empty strings.

After the last line of the file is read, the END block prints the last user's information. The reason for waiting until the next User: line (or the END of input) before printing is that NLIST may return the attributes in arbitrary order; a user's record is complete only when the next User: data block begins.

The result of this script, using the output from the earlier NLIST command, is as follows:

 "User ID","Context","First Name","Last Name","Full Name" "JimH","east.xyzcorp","Jim","Henderson","Jim Henderson" "PeterK","east.xyzcorp","Peter","Kuo","Peter Kuo" "PKuo","east.xyzcorp","Peter","Kuo","Peter Kuo" "JHenderson","east.xyzcorp","Jim","Henderson","Jim Henderson" 



Novell's Guide to Troubleshooting eDirectory
ISBN: 0789731460
Year: 2003
Pages: 173
