4.3. String Operators


The curly-bracket syntax allows for the shell's string operators. String operators allow you to manipulate values of variables in various useful ways without having to write full-blown programs or resort to external UNIX utilities. You can do a lot with string-handling operators even if you haven't yet mastered the programming features we'll see in later chapters.

In particular, string operators let you do the following:

Ensure that variables exist (i.e., are defined and have non-null values)
Set default values for variables
Catch errors that result from variables not being set
Remove portions of variables' values that match patterns

4.3.1. Syntax of String Operators

The basic idea behind the syntax of string operators is that special characters that denote operations are inserted between the variable's name and the right curly bracket. Any argument that the operator may need is inserted to the operator's right.

The first group of string-handling operators tests for the existence of variables and allows substitutions of default values under certain conditions. These are listed in Table 4-1.^[4]

^[4] The colon (:) in all but the last of these operators is actually optional. If the colon is omitted, then change "exists and isn't null" to "exists" in each definition, i.e., the operator tests for existence only.

Table 4-1. Substitution operators
Operator	Substitution
${varname:-word}	If varname exists and isn't null, return its value; otherwise return word. Purpose: Returning a default value if the variable is undefined. Example: ${count:-0} evaluates to 0 if count is undefined.
${varname:=word}	If varname exists and isn't null, return its value; otherwise set it to word and then return its value. Positional and special parameters cannot be assigned this way. Purpose: Setting a variable to a default value if it is undefined. Example: ${count:=0} sets count to 0 if it is undefined.
${varname:?message}	If varname exists and isn't null, return its value; otherwise print varname: followed by message, and abort the current command or script (non-interactive shells only). Omitting message produces the default message parameter null or not set. Purpose: Catching errors that result from variables being undefined. Example: ${count:?"undefined!"} prints "count: undefined!" and exits if count is undefined.
${varname:+word}	If varname exists and isn't null, return word; otherwise return null. Purpose: Testing for the existence of a variable. Example: ${count:+1} returns 1 (which could mean "true") if count is defined.
${varname:offset:length}	Performs substring expansion.^[5] It returns the substring of $varname starting at offset and up to length characters. The first character in $varname is position 0. If length is omitted, the substring starts at offset and continues to the end of $varname. If offset is less than 0 then the position is taken from the end of $varname. If varname is @, the length is the number of positional parameters starting at parameter offset. Purpose: Returning parts of a string (substrings or slices). Example: If count is set to frogfootman, ${count:4} returns footman. ${count:4:4} returns foot.

^[5] The substring expansion operator is not available in versions of bash prior to 2.0.
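The following sketch exercises each operator in Table 4-1, including the colon-less form described in the footnote. It assumes bash 2.0 or later for the substring lines; the variable names other than count are ours, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Illustrative only: one line per operator from Table 4-1.

unset count
a=${count:-0}          # count is unset, so the default 0 is returned
b=${count:+1}          # count is unset, so this expands to nothing
: "${count:=5}"        # count is unset, so it is assigned 5
c=${count:+1}          # count is now set and non-null, so this yields 1

# Colon versus no colon: an empty (null) variable still "exists".
empty=""
d=${empty:-fallback}   # with the colon, null counts as missing
e=${empty-fallback}    # without the colon, empty exists, so we get ""

name=frogfootman
f=${name:4}            # from position 4 to the end
g=${name:4:4}          # four characters starting at position 4

echo "a=$a b=$b count=$count c=$c d=$d e=$e f=$f g=$g"
```

The `:` command on the third assignment line does nothing itself; it is there only so that the expansion, and therefore the assignment performed by :=, takes place.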

The first of these operators is ideal for setting defaults for command-line arguments in case the user omits them. We'll use this technique in our first programming task.

Task 4-1

You have a large album collection, and you want to write some software to keep track of it. Assume that you have a file of data on how many albums you have by each artist. Lines in the file look like this:

5        Depeche Mode
2        Split Enz
3        Simple Minds
1        Vivaldi, Antonio

Write a program that prints the N highest lines, i.e., the N artists by whom you have the most albums. The default for N should be 10. The program should take one argument for the name of the input file and an optional second argument for how many lines to print.

By far the best approach to this type of script is to use built-in UNIX utilities, combining them with I/O redirectors and pipes. This is the classic "building-block" philosophy of UNIX that is another reason for its great popularity with programmers. The building-block technique lets us write a first version of the script that is only one line long:

sort -nr $1 | head -${2:-10}

Here is how this works: the sort program sorts the data in the file whose name is given as the first argument ($1). The -n option tells sort to interpret the first word on each line as a number (instead of as a character string); the -r tells it to reverse the comparisons, so as to sort in descending order.

The output of sort is piped into the head utility, which, when given the argument -N, prints the first N lines of its input on the standard output. The expression -${2:-10} evaluates to a dash (-) followed by the second argument if it is given, or to -10 if it's not; notice that the variable in this expression is 2, which is the second positional parameter.

Assume the script we want to write is called highest. Then if the user types highest myfile, the line that actually runs is:

sort -nr myfile | head -10

Or if the user types highest myfile 22, the line that runs is:

sort -nr myfile | head -22

Make sure you understand how the :- string operator provides a default value.
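To see the default at work, here is a minimal sketch that wraps the one-liner in a function and runs it on a small sample. The function wrapper, the temporary file, and the sample data are our own assumptions; the sketch also quotes the filename and uses the head -n N spelling rather than head -N.

```shell
#!/usr/bin/env bash
# Hypothetical test harness for the one-line "highest" script.

highest() {
    # $1 = input file, $2 = optional number of lines (defaults to 10)
    sort -nr "$1" | head -n "${2:-10}"
}

datafile=$(mktemp)
printf '%s\n' '5 Depeche Mode' '2 Split Enz' \
              '3 Simple Minds' '1 Vivaldi, Antonio' > "$datafile"

top_two=$(highest "$datafile" 2)   # second argument given: two lines
all_rows=$(highest "$datafile")    # second argument omitted: up to ten lines

echo "$top_two"
rm -f "$datafile"
```

Because the sample has only four lines, the default of 10 simply returns all of them in descending order.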

This is a perfectly good, runnable script but it has a few problems. First, its one line is a bit cryptic. While this isn't much of a problem for such a tiny script, it's not wise to write long, elaborate scripts in this manner. A few minor changes will make the code more readable.

First, we can add comments to the code; anything between # and the end of a line is a comment. At a minimum, the script should start with a few comment lines that indicate what the script does and what arguments it accepts. Second, we can improve the variable names by assigning the values of the positional parameters to regular variables with mnemonic names. Finally, we can add blank lines to space things out; blank lines, like comments, are ignored. Here is a more readable version:

#
#        highest filename [howmany]
#
#        Print howmany highest-numbered lines in file filename.
#        The input file is assumed to have lines that start with
#        numbers.  Default for howmany is 10.
#

filename=$1
howmany=${2:-10}

sort -nr $filename | head -$howmany

The square brackets around howmany in the comments adhere to the convention in UNIX documentation that square brackets denote optional arguments.

The changes we just made improve the code's readability but not how it runs. What if the user were to invoke the script without any arguments? Remember that positional parameters default to null if they aren't defined. If there are no arguments, then $1 and $2 are both null. The variable howmany ($2) is set up to default to 10, but there is no default for filename ($1). The result would be that this command runs:

sort -nr | head -10

As it happens, if sort is called without a filename argument, it expects input to come from standard input, e.g., a pipe (|) or a user's terminal. Since it doesn't have the pipe, it will expect the terminal. This means that the script will appear to hang! Although you could always hit CTRL-D or CTRL-C to get out of the script, a naive user might not know this.

Therefore we need to make sure that the user supplies at least one argument. There are a few ways of doing this; one of them involves another string operator. We'll replace the line:

filename=$1

with:

filename=${1:?"filename missing."}

This will cause two things to happen if a user invokes the script without any arguments: first the shell will print the somewhat unfortunate message:

highest: 1: filename missing.

to the standard error output. Second, the script will exit without running the remaining code. With a somewhat "kludgy" modification, we can get a slightly better error message.

Consider this code:

filename=$1
filename=${filename:?"missing."}

This results in the message:

highest: filename: missing.

(Make sure you understand why.) Of course, there are ways of printing whatever message is desired; we'll find out how in Chapter 5.
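If you want to watch :? abort without losing your own shell, one approach is to run the two lines in a subshell and inspect the exit status. This is only a sketch: require_filename is a hypothetical helper, and the exact prefix of the error message depends on how the code is invoked.

```shell
#!/usr/bin/env bash
# Sketch: observe the effect of :? from a safe distance.

require_filename() {
    # The parentheses start a subshell, so the abort caused by :?
    # ends only the subshell, not the caller.
    (
        filename=$1
        filename=${filename:?"missing."}
        echo "would sort $filename"
    )
}

ok_output=$(require_filename myfile 2>&1);  ok_status=$?
err_output=$(require_filename 2>&1);        err_status=$?

echo "status with an argument:    $ok_status"
echo "status without an argument: $err_status"
echo "error text: $err_output"
```

The echo inside the subshell never runs in the second call, which is the "exit without running the remaining code" behavior described above.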

Before we move on, we'll look more closely at the three remaining operators in Table 4-1 and see how we can incorporate them into our task solution. The := operator does roughly the same thing as :-, except that it has the "side effect" of setting the value of the variable to the given word if the variable doesn't exist.

Therefore we would like to use := in our script in place of :-, but we can't; we'd be trying to set the value of a positional parameter, which is not allowed. But if we replaced:

howmany=${2:-10}

with just:

howmany=$2

and moved the substitution down to the actual command line (as we did at the start), then we could use the := operator:

sort -nr $filename | head -${howmany:=10}
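The difference between the two operators is easiest to see by checking the variable afterwards. A minimal sketch, using helper variables of our own (first, second, after_minus, after_equals) alongside howmany:

```shell
#!/usr/bin/env bash
# Sketch: :- only returns a default, := also stores it.

unset howmany

first=${howmany:-10}             # expands to 10 ...
after_minus=${howmany-nothing}   # ... but howmany is still unset

second=${howmany:=10}            # expands to 10 and assigns it
after_equals=$howmany            # howmany now holds 10

echo "after :- howmany holds: $after_minus"
echo "after := howmany holds: $after_equals"
```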

The operator :+ substitutes a value if the given variable exists and isn't null. Here is how we can use it in our example: let's say we want to give the user the option of adding a header line to the script's output. If she types the option -h, then the output will be preceded by the line:

ALBUMS  ARTIST

Assume further that this option ends up in the variable header, i.e., $header is -h if the option is set or null if not. (Later we will see how to do this without disturbing the other positional parameters.)

The following expression yields null if the variable header is null, or ALBUMS  ARTIST\n if it is non-null:

${header:+"ALBUMS  ARTIST\n"}

This means that we can put the line:

echo -e -n ${header:+"ALBUMS  ARTIST\n"}

right before the command line that does the actual work. The -n option to echo causes it not to print a LINEFEED after printing its arguments. Therefore this echo statement will print nothing, not even a blank line, if header is null; otherwise it will print the header line and a LINEFEED (\n). The -e option makes echo interpret the \n as a LINEFEED rather than literally.

The final operator, substring expansion, returns sections of a string. We can use it to "pick out" parts of a string that are of interest. Assume that our script is able to assign lines of the sorted list, one at a time, to the variable album_line. If we want to print out just the artist name and ignore the number of albums, we can use substring expansion:

echo ${album_line:8}

This prints everything from character position 8, which is the start of each artist name, onwards.

If we just want to print the numbers and not the artist names, we can do so by supplying the length of the substring:

echo ${album_line:0:7}

Although this example may seem rather useless, it should give you a feel for how to use substrings. When combined with some of the programming features discussed later in the book, substrings can be extremely useful.
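As a rough illustration, the sketch below builds one line with the count padded to a fixed width of eight characters (an assumption about the file layout, made so that the name really does start at position 8) and then slices it. The variable names artist, count_field, count, and last_four are ours.

```shell
#!/usr/bin/env bash
# Sketch: slicing a fixed-width line with substring expansion.

# Pad the count to 8 characters so the name begins at position 8.
printf -v album_line '%-8s%s' 5 'Depeche Mode'

artist=${album_line:8}          # everything from position 8 onwards
count_field=${album_line:0:7}   # the first 7 characters, padding included
count=${count_field%% *}        # drop the padding, leaving just the number

# A negative offset counts from the end; the space before the minus
# sign keeps bash from reading it as the :- operator.
last_four=${album_line: -4}

echo "artist=[$artist] count=[$count] last_four=[$last_four]"
```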

4.3.2. Patterns and Pattern Matching

We'll continue refining our solution to Task 4-1 later in this chapter. The next type of string operator is used to match portions of a variable's string value against patterns. Patterns, as we saw in Chapter 1, are strings that can contain wildcard characters (*, ?, and [] for character sets and ranges).

Table 4-2 lists bash's pattern-matching operators.

Table 4-2. Pattern-matching operators
Operator	Meaning
${variable#pattern}	If the pattern matches the beginning of the variable's value, delete the shortest part that matches and return the rest.
${variable##pattern}	If the pattern matches the beginning of the variable's value, delete the longest part that matches and return the rest.
${variable%pattern}	If the pattern matches the end of the variable's value, delete the shortest part that matches and return the rest.
${variable%%pattern}	If the pattern matches the end of the variable's value, delete the longest part that matches and return the rest.
${variable/pattern/string}, ${variable//pattern/string}	The longest match to pattern in variable is replaced by string. In the first form, only the first match is replaced. In the second form, all matches are replaced. If the pattern begins with a #, it must match at the start of the variable. If it begins with a %, it must match with the end of the variable. If string is null, the matches are deleted. If variable is @ or *, the operation is applied to each positional parameter in turn and the expansion is the resultant list.^[6]

^[6] The pattern-matching and replacement operator is not available in versions of bash prior to 2.0.

These can be hard to remember; here's a handy mnemonic device: # matches the front because number signs precede numbers; % matches the rear because percent signs follow numbers.

The classic use for pattern-matching operators is in stripping off components of pathnames, such as directory prefixes and filename suffixes. With that in mind, here is an example that shows how all of the operators work. Assume that the variable path has the value /home/cam/book/long.file.name; then:

Expression         Result
${path##/*/}       long.file.name
${path#/*/}        cam/book/long.file.name
$path              /home/cam/book/long.file.name
${path%.*}         /home/cam/book/long.file
${path%%.*}        /home/cam/book/long

The two patterns used here are /*/, which matches anything between two slashes, and .*, which matches a dot followed by anything.

The longest and shortest pattern-matching operators produce the same output unless they are used with the * wildcard operator. As an example, if filename had the value alicece, then both ${filename%ce} and ${filename%%ce} would produce the result alice. This is because ce is an exact match; for a match to occur, the string ce must appear at the end of $filename. Both the short and long matches will then match the last grouping of ce and delete it. If, however, we had used the * wildcard, then ${filename%ce*} would produce alice because it matches the shortest occurrence of ce followed by anything else. ${filename%%ce*} would return ali because it matches the longest occurrence of ce followed by anything else; in this case the first and second ce.
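You can check these results yourself with a few assignments. The sketch below simply replays the examples from the text; the result variables (front_long, back_short, and so on) are names we made up.

```shell
#!/usr/bin/env bash
# Sketch: the four pattern-matching operators on the examples above.

path=/home/cam/book/long.file.name
front_long=${path##/*/}     # longest /*/ prefix removed
front_short=${path#/*/}     # shortest /*/ prefix removed
back_short=${path%.*}       # shortest .* suffix removed
back_long=${path%%.*}       # longest .* suffix removed

filename=alicece
exact_short=${filename%ce}    # no wildcard: same result either way
exact_long=${filename%%ce}
wild_short=${filename%ce*}    # shortest "ce followed by anything"
wild_long=${filename%%ce*}    # longest "ce followed by anything"

printf '%-12s %s\n' '##/*/' "$front_long"  '#/*/'  "$front_short" \
                    '%.*'   "$back_short"  '%%.*'  "$back_long"   \
                    '%ce'   "$exact_short" '%%ce'  "$exact_long"  \
                    '%ce*'  "$wild_short"  '%%ce*' "$wild_long"
```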

The next task will incorporate one of these pattern-matching operators.

Task 4-2

You are writing a graphics file conversion utility for use in creating a web page. You want to be able to take a PCX file and convert it to a JPEG file for use on the web page.^[7]

^[7] PCX is a popular graphics file format under Microsoft Windows. JPEG (Joint Photographic Expert Group) is a common graphics format on the Internet and is used to a great extent on web pages.

Graphics file conversion utilities are quite common because of the plethora of different graphics formats and file types. They allow you to specify an input file, usually from a range of different formats, and convert it to an output file of a different format. In this case, we want to take a PCX file, which can't be displayed with a web browser, and convert it to a JPEG which can be displayed by nearly all browsers. Part of this process is taking the filename of the PCX file, which ends in .pcx, and changing it to one ending in .jpg for the output file. In essence, you want to take the original filename and strip off the .pcx, then append .jpg. A single shell statement will do this:

outfile=${filename%.pcx}.jpg

The shell takes the filename and looks for .pcx on the end of the string. If it is found, .pcx is stripped off and the rest of the string is returned. For example, if filename had the value alice.pcx, the expression ${filename%.pcx} would return alice. The .jpg is appended to form the desired alice.jpg, which is then stored in the variable outfile.

If filename had an inappropriate value (without the .pcx) such as alice.xpm, the above expression would evaluate to alice.xpm.jpg: since there was no match, nothing is deleted from the value of filename, and .jpg is appended anyway. Note, however, that if filename contained more than one dot (e.g., if it were alice.1.pcx), the expression would still produce the desired value alice.1.jpg.
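One possible way to guard against the alice.xpm.jpg outcome is to test the suffix before stripping it. The sketch below uses a case statement, a construct covered later in the book; to_jpg_name is a hypothetical helper and not part of the task's solution.

```shell
#!/usr/bin/env bash
# Sketch: only rename files that really end in .pcx.

to_jpg_name() {
    case $1 in
        *.pcx) printf '%s\n' "${1%.pcx}.jpg" ;;
        *)     printf 'not a .pcx file: %s\n' "$1" >&2
               return 1 ;;
    esac
}

to_jpg_name alice.pcx      # alice.jpg
to_jpg_name alice.1.pcx    # alice.1.jpg

# The unguarded expression from the text, for comparison:
filename=alice.xpm
outfile=${filename%.pcx}.jpg
echo "$outfile"            # alice.xpm.jpg
```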

The next task uses the longest pattern-matching operator.

Task 4-3

You are implementing a filter that prepares a text file for printer output. You want to put the file's name without any directory prefix on the "banner" page. Assume that, in your script, you have the pathname of the file to be printed stored in the variable pathname.

Clearly, the objective is to remove the directory prefix from the pathname. The following line will do it:

bannername=${pathname##*/}

This solution is similar to the first line in the examples shown before. If pathname were just a filename, the pattern */ (anything followed by a slash) would not match and the value of the expression would be pathname untouched. If pathname were something like book/wonderland, the prefix book/ would match the pattern and be deleted, leaving just wonderland as the expression's value. The same thing would happen if pathname were something like /home/cam/book/wonderland: since the ## deletes the longest match, it deletes the entire /home/cam/book/.

If we used #*/ instead of ##*/, the expression would have the incorrect value home/cam/book/wonderland, because the shortest instance of "anything followed by a slash" at the beginning of the string is just a slash (/).

For typical pathnames, the construct ${variable##*/} gives the same result as the UNIX utility basename. basename takes a pathname as argument and returns the filename only; it is meant to be used with the shell's command substitution mechanism (see the following explanation). basename is less efficient than ${variable##*/} because it runs in its own separate process rather than within the shell. Another utility, dirname, does essentially the opposite of basename: it returns the directory prefix only. For pathnames that include a directory prefix, it gives the same result as the bash expression ${variable%/*} and is less efficient for the same reason.
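The sketch below compares the operators with the utilities on the pathname from the text, and then on two edge cases where they part ways: a name with no slash at all and a name with a trailing slash. The variable names are ours, and the sketch assumes the standard basename and dirname utilities are installed.

```shell
#!/usr/bin/env bash
# Sketch: string operators versus basename and dirname.

pathname=/home/cam/book/wonderland
base_op=${pathname##*/};  base_util=$(basename "$pathname")
dir_op=${pathname%/*};    dir_util=$(dirname "$pathname")
echo "basename: operator=$base_op utility=$base_util"
echo "dirname:  operator=$dir_op utility=$dir_util"

# Edge case 1: no slash, so %/* has nothing to remove.
bare=wonderland
bare_dir_op=${bare%/*}              # wonderland
bare_dir_util=$(dirname "$bare")    # . (the current directory)

# Edge case 2: trailing slash, so ##*/ removes everything.
trailing=book/
trailing_op=${trailing##*/}             # empty string
trailing_util=$(basename "$trailing")   # book
```

If your script may receive such pathnames, the utilities are the more forgiving choice; otherwise the operators avoid the cost of an extra process.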

The last operator in the table matches patterns and performs substitutions. Task 4-4 is a simple task where it comes in useful.

Task 4-4

The directories in PATH can be hard to distinguish when printed out as one line with colon delimiters. You want a simple way to display them, one to a line.

As directory names are separated by colons, the easiest way would be to replace each colon with a LINEFEED:

$ echo -e ${PATH//:/'\n'}
/home/cam/bin
/usr/local/bin
/bin
/usr/bin
/usr/X11R6/bin

Each occurrence of the colon is replaced by \n. As we saw earlier, the -e option allows echo to interpret \n as a LINEFEED. In this case we used the second of the two substitution forms. If we'd used the first form, only the first colon would have been replaced with a \n.
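To compare the two forms directly, the sketch below applies both to a made-up path list, storing a real LINEFEED in a variable (nl, our own name) instead of relying on echo -e. It also shows the % anchor, which ties the match to the end of the value.

```shell
#!/usr/bin/env bash
# Sketch: first-match versus all-matches replacement.

sample_path=/home/cam/bin:/usr/local/bin:/bin   # illustrative value
nl=$'\n'                                        # a literal LINEFEED

all_lines=${sample_path//:/$nl}    # every colon replaced
first_only=${sample_path/:/$nl}    # only the first colon replaced

echo "$all_lines"
echo "---"
echo "$first_only"

# A pattern starting with % must match at the end of the value.
picture=alice.pcx
renamed=${picture/%.pcx/.jpg}      # alice.jpg
echo "$renamed"
```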

4.3.3. Length Operator

There is one remaining operator on variables. It is ${#varname}, which returns the length of the value of the variable as a character string. (In Chapter 6, we will see how to treat this and similar values as actual numbers so they can be used in arithmetic expressions.) For example, if filename has the value alice.c, then ${#filename} would have the value 7.
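As a small illustration of where the length can be handy, the sketch below uses it as a field width to draw an underline as long as a title. The title text and the variable names are assumptions of ours.

```shell
#!/usr/bin/env bash
# Sketch: the length operator, alone and as a printf field width.

filename=alice.c
len=${#filename}
echo "$filename is $len characters long"    # 7

title="ALBUMS  ARTIST"
# Print as many spaces as the title has characters, then turn them
# into dashes.
underline=$(printf '%*s' "${#title}" '' | tr ' ' '-')
echo "$title"
echo "$underline"
```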

4.3.4. Extended Pattern Matching

Bash provides a further set of pattern matching operators if the shopt option extglob is switched on. Each operator takes one or more patterns, normally strings, separated by the vertical bar ( | ). The extended pattern matching operators are given in Table 4-3.^[8]

^[8] Be aware that these are not available in early releases of bash 2.0.

Table 4-3. Extended pattern-matching operators
Operator	Meaning
*(patternlist)	Matches zero or more occurrences of the given patterns.
+(patternlist)	Matches one or more occurrences of the given patterns.
?(patternlist)	Matches zero or one occurrences of the given patterns.
@(patternlist)	Matches exactly one of the given patterns.
!(patternlist)	Matches anything except one of the given patterns.

Some examples of these include:

*(alice|hatter|hare) would match zero or more occurrences of alice, hatter, and hare. So it would match the null string, alice, alicehatter, etc.
+(alice|hatter|hare) would do the same except not match the null string.
?(alice|hatter|hare) would only match the null string, alice, hatter, or hare.
@(alice|hatter|hare) would only match alice, hatter, or hare.
!(alice|hatter|hare) matches everything except alice, hatter, and hare.

The values provided can contain shell wildcards too. So, for example, +([0-9]) matches a number of one or more digits. The patterns can also be nested, so you could remove all files except those beginning with vt followed by a number by doing rm !(vt+([0-9])).
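A safe way to experiment with these operators, without deleting anything, is to test strings against them. The sketch below is one way to do that; matches and is_number are hypothetical helpers, and the [[ ... ]] test they rely on is a bash construct.

```shell
#!/usr/bin/env bash
# Sketch: trying out the extended operators on plain strings.

shopt -s extglob

# matches STRING PATTERN: succeeds if the whole STRING fits PATTERN.
# $2 is deliberately unquoted so that it is treated as a pattern.
matches() {
    [[ $1 == $2 ]]
}

# is_number STRING: succeeds for one or more digits and nothing else.
is_number() {
    matches "$1" '+([0-9])'
}

characters='*(alice|hatter|hare)'
for candidate in '' alice alicehatter queen; do
    if matches "$candidate" "$characters"; then
        echo "'$candidate' matches $characters"
    else
        echo "'$candidate' does not match $characters"
    fi
done

is_number 2004 && echo "2004 is a number"
matches vt100 'vt+([0-9])' && echo "vt100 would be kept by the rm example"
```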