5.8. Objective 7: Search Text Files Using Regular Expressions

Linux offers many tools for system administrators to use for processing text. Many, such as sed, awk, and perl, are capable of automatically editing multiple files, providing you with a wide range of text-processing capability. To harness that capability, you need to be able to define and delineate specific text segments from within files, text streams, and string variables. Once the text you're after is identified, you can use one of these tools or languages to do useful things to it.

These tools and others understand a loosely defined pattern language. The language and the patterns themselves are collectively called regular expressions (often abbreviated just regexp or regex). While regular expressions are similar in concept to file globs, many more special characters exist for regular expressions, extending the utility and capability of tools that understand them.

Two tools that are important for the LPIC Level 1 exams and that make use of regular expressions are grep and sed. These tools are useful for text searches. There are many other tools that make use of regular expressions, including the awk, Perl, and Python languages and other utilities, but you don't need to be concerned with them for the purpose of the LPIC Level 1 exams.

Regular expressions are the topic of entire books, such as Mastering Regular Expressions (O'Reilly). Exam 101 requires the use of simple regular expressions and related tools, specifically to perform searches from text sources. This section covers only the basics of regular expressions, but it goes without saying that their power warrants a full understanding. Digging deeper into the regular expression world is highly recommended when you have the chance.

5.8.1. Regular Expression Syntax

It would not be unreasonable to assume that a single specification has always defined how regular expressions are constructed. Unfortunately, that was not the case. Regular expressions were incorporated as a feature in a number of tools over the years, with varying degrees of consistency and completeness. The result is a cart-before-the-horse scenario, in which utilities and languages defined their own flavors of regular expression syntax, each with its own extensions and idiosyncrasies. Formal definitions of the syntax came later, as did efforts to make it more consistent.

Regular expressions are defined by arranging strings of text, or patterns. Those patterns are composed of two types of characters: literals (plain text or literal text) and metacharacters.

Like the special file globbing characters, regular expression metacharacters take on a special meaning in the context of the tool in which they're used. A few metacharacters are generally considered part of the "extended set," specifically those introduced in egrep after grep was created.

The egrep command on Linux systems is simply a wrapper that runs grep -E. Examples of metacharacters include the ^ symbol, which means "the beginning of a line," and the $ symbol, which means "the end of a line." A complete listing of metacharacters follows in Tables 5-6, 5-7, and 5-8.


Tip: The backslash character (\) turns off (escapes) the special meaning of the character that follows, turning metacharacters into literals. For non-metacharacters, it often turns on some special meaning.

Table 5-6. Regular expression position anchors

Regular expression    Description
^                     Match at the beginning of a line. This interpretation makes sense only when the ^ character is at the left-hand side of the regex.
$                     Match at the end of a line. This interpretation makes sense only when the $ character is at the right-hand side of the regex.
\< \>                 Match word boundaries: \< matches at the start of a word and \> at the end of a word. Word boundaries are defined as whitespace, the start of line, the end of line, or punctuation marks. The backslashes are required and enable this interpretation of < and >.


Table 5-7. Regular expression character sets

Regular expression    Description
[abc]  [a-z]          Single-character groups and ranges. In the first form, match any single character from among the enclosed characters a, b, or c. In the second form, match any single character from among the range of characters bounded by a and z (POSIX character classes can also be used, so [a-z] can be replaced with [[:lower:]]). The brackets are for grouping only and are not matched themselves.
[^abc]  [^a-z]        Inverse match. Match any single character not among the enclosed characters a, b, and c or in the range a-z. Be careful not to confuse this inversion with the anchor character ^, described earlier.
.                     Match any single character except a newline.


Table 5-8. Regular expression modifiers

Basic regular expression    Extended regular expression (egrep)    Description
*                           *                                      Match any number (zero or more) of occurrences of the single character (or single-character regex) that precedes it.
\?                          ?                                      Match zero or one instance of the preceding regex.
\+                          +                                      Match one or more instances of the preceding regex.
\{n,m\}                     {n,m}                                  Match a range of occurrences of the single character or regex that precedes this construct. \{n\} matches n occurrences, \{n,\} matches at least n occurrences, and \{n,m\} matches any number of occurrences from n to m, inclusively.
\|                          |                                      Alternation. Match either the regex specified before or after the vertical bar.
\(regex\)                   (regex)                                Grouping. Matches regex, but it can be modified as a whole and used in back-references. (\1 expands to the contents of the first \(\) and so on up to \9.)

Note: In basic regular expressions, the \?, \+, and \| forms are GNU extensions and may not be available in non-GNU versions of grep and sed.
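To make the two syntax columns of Table 5-8 concrete, the following sketch runs one search in basic syntax and again in extended syntax. The sample text and the variable names are invented for illustration, and the backslashed \+ and \| forms assume GNU grep.

```shell
# Sample input, invented for this illustration.
sample='ac
abc
abbc
xyz'

# Basic syntax: + and | need a backslash to act as metacharacters (GNU grep).
basic=$(printf '%s\n' "$sample" | grep 'ab\+c\|xyz')

# Extended syntax: the same metacharacters are written without backslashes.
extended=$(printf '%s\n' "$sample" | grep -E 'ab+c|xyz')

printf '%s\n' "$basic"
[ "$basic" = "$extended" ] && echo "both forms select the same lines"
```

Both commands should select abc, abbc, and xyz, and skip ac, because + requires at least one b.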


It is often helpful to consider regular expressions as their own language, where literal text acts as words and phrases. The "grammar" of the language is defined by the use of metacharacters. The two are combined according to specific rules (which, as mentioned earlier, may differ slightly among various tools) to communicate ideas and get real work done. When you construct regular expressions, you use metacharacters and literals to specify three basic ideas about your input text:


Position anchors

A position anchor is used to specify the position of one or more character sets in relation to the entire line of text (such as the beginning of a line).


Character sets

A character set matches text. It could be a series of literals, metacharacters that match individual or multiple characters, or combinations of these.


Quantity modifiers

Quantity modifiers follow a character set and indicate the number of times the set should be repeated.


Using grep

A long time ago, as the idea of regular expressions was catching on, the line editor ed contained a command to display lines of a file being edited that matched a given regular expression. The command is:

 g/regular expression/p 

That is, "on a global basis, print the current line when a match for regular expression is found," or more simply, "global regular expression print." This function was so useful that it was made into a standalone utility named, appropriately, grep. Later, the regular expression grammar of grep was expanded in a new command called egrep (for "extended grep"). You'll find both commands on your Linux system today, and they differ slightly in the way they handle regular expressions. For the purposes of Exam 101, we'll stick with grep, which can also make use of the "extended" regular expressions when used with the -E option. You will find some form of grep on just about every Unix or Unix-like system available.


Syntax

 grep [options] regex [files] 


Description

Search files or standard input for lines containing a match to regular expression regex. By default, matching lines will be displayed and nonmatching lines will not be displayed. When multiple files are specified, grep displays the filename as a prefix to the output lines (use the -h option to suppress filename prefixes).


Frequently used options


-c

Display only a count of matched lines, but not the lines themselves.


-h

Display matched lines, but do not include filenames for multiple file input.


-i

Ignore uppercase and lowercase distinctions, allowing abc to match both abc and ABC.


-n

Display matched lines prefixed with their line numbers. When used with multiple files, both the filename and line number are prefixed.


-v

Print all lines that do not match regex. This is an important and useful option. You'll want to use regular expressions, not only to select information but also to eliminate information. Using -v inverts the output this way.


-E

Interpret regex as an extended regular expression. This makes grep behave as if it were egrep.
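The options above can be tried out with a short session such as the following. This is only a sketch: the scratch directory, the file contents, and the second file name file2 are assumptions made for the example.

```shell
# Create two small sample files in a scratch directory (contents are invented).
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'Linux is a kernel\nlinux distributions vary\nBSD is different\n' > file1
printf 'I run Linux\nnothing here\n' > file2

grep -c 'Linux' file1          # count of matching lines in file1
grep -i -n 'linux' file1       # ignore case, prefix each line with its number
grep -v -i 'linux' file1       # lines that do NOT match
grep 'Linux' file1 file2       # multiple files: filename prefixes appear
grep -h 'Linux' file1 file2    # same search with the prefixes suppressed
```

Comparing the last two commands shows the effect of -h: the matched lines are identical, but only the first form labels each line with the file it came from.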

5.8.1.1. Examples

Since regular expressions can contain both metacharacters and literals, grep can be used with an entirely literal regex. For example, to find all lines in file1 that contain either Linux or linux, you could use grep like this:

 $ grep -i linux file1 

In this example, the regex is simply linux. The uppercase L in Linux will match since the command-line option -i was specified. This is fine for literal expressions that are common. However, in situations in which regex includes regular expression metacharacters that are also shell special characters (such as $ or *), the regex must be quoted to prevent shell expansion and pass the metacharacters on to grep.

As a simplistic example of this, suppose you have files in your local directory named abc, abc1, and abc2. When combined with bash's echo built-in command, the abc* wildcard expression lists all files that begin with abc, as follows:

 $ echo abc* 

abc abc1 abc2

Now, suppose that these files contain lines with the strings abc, abcc, abccc, and so on, and you wish to use grep to find them. You can use the shell wildcard expression abc* to expand to all the files that start with abc as displayed with echo above, and you'd use an identical regular expression abc* to find all occurrences of lines containing abc, abcc, abccc, etc. Without using quotes to prevent shell expansion, the command would be:

 $ grep abc* abc* 

After shell expansion, this yields:

 grep abc abc1 abc2 abc abc1 abc2    no! 

This is not what you intended! grep would search for the literal expression abc, because it appears as the first command argument. Instead, quote the regular expression with single or double quotes to protect it (the difference between single quotes and double quotes on the command line is subtle and is explained later in this section):

 $ grep 'abc*' abc* 

or:

 $ grep "abc*" abc* 

After expansion, both examples yield the same results:

 grep abc* abc abc1 abc2 

Now this is what you're after. The three files abc, abc1, and abc2 will be searched for the regular expression abc*. It is good to stay in the habit of quoting regular expressions on the command line to avoid these problems; they won't be at all obvious, because the shell expansion is invisible to you unless you use the echo command.
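The effect of quoting can be reproduced in a scratch directory. In this sketch the three file names come from the text above, while their contents and the variable names are invented for the demonstration.

```shell
# Scratch directory holding the three files from the text (contents invented).
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'abc\n'  > abc
printf 'abcc\n' > abc1
printf 'xyz\n'  > abc2

# Unquoted: the shell expands both patterns before grep would ever run.
echo grep abc* abc*

# Quoted: the regex is passed through intact; only the file glob is expanded.
echo grep 'abc*' abc*

# The quoted form actually searches all three files for the regex abc*.
quoted=$(grep 'abc*' abc*)
printf '%s\n' "$quoted"
```

Prefixing a command with echo, as done here, is a simple way to see what the shell will hand to grep before running the real search.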

On the Exam

The use of grep and its options is common. You should be familiar with what each option does, as well as the concept of piping the results of other commands into grep for matching.


5.8.2. Using sed

sed, the stream editor, is a powerful filtering program found on nearly every Unix system. The sed utility is usually used either to automate repetitive editing tasks or to process text in pipes of Unix commands (see "Objective 4: Use Streams, Pipes, and Redirects," earlier in this chapter). The scripts that sed executes can be single commands or more complex lists of editing instructions.


Syntax

 sed [options] 'command1' [files]
 sed [options] -e 'command1' [-e 'command2'...] [files]
 sed [options] -f script [files]


Description

The first form invokes sed with a one-line command1. The second form invokes sed with two (or more) commands. Note that in this case the -e parameter is required for each command specified. The commands are specified in quotes to prevent the shell from interpreting and expanding them. The last form instructs sed to take editing commands from file script (which does not need to be executable). In all cases, if files are not specified, input is taken from standard input. If multiple files are specified, the edited output of each successive file is concatenated.
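The three invocation forms can be compared side by side. The following is a minimal sketch; the sample file, the script file name, and the substitutions themselves are made up for the example.

```shell
dir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$dir/file1"

# Form 1: a single command given directly.
sed 's/one/1/' "$dir/file1"

# Form 2: several commands, each introduced by -e.
sed -e 's/one/1/' -e 's/two/2/' "$dir/file1"

# Form 3: commands read from a script file (it need not be executable).
printf 's/one/1/\ns/two/2/\ns/three/3/\n' > "$dir/script"
sed -f "$dir/script" "$dir/file1"

# No file argument: input is taken from standard input.
printf 'one\n' | sed 's/one/1/'
```

In every case the input file itself is left unchanged; sed writes the edited text to standard output.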


Frequently used options


-e cmd

The -e option specifies that the next argument (cmd) is a sed command (or a series of commands). When specifying only one string of commands, the -e is optional.


-f file

file is a sed script.


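-n

Suppress the automatic printing of each input line, so that only explicitly printed lines (for example, those selected by the p flag of the s command) appear in the output.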

The sed utility operates on text through the use of addresses and editing commands. The address is used to locate lines of text to be operated on, and editing commands modify text. During operation, each line (that is, text separated by newline characters) of input to sed is processed individually and without regard to adjacent lines. If multiple editing commands are to be used (through the use of a script file or multiple -e options), they are all applied in order to each line before moving on to the next line.


Addressing

Addresses in sed locate lines of text to which commands will be applied. The addresses can be:

  • A line number (note that sed counts lines continuously across multiple input files). The symbol $ can be used to indicate the last line of input. A range of line numbers can be given by separating the starting and ending lines with a comma (start,end), so for example the address for all input would be 1,$.

  • A regular expression delimited by forward slashes (/regex/).

  • A line number with an interval. The form is n~s, where n is the starting line number and s is the step, or interval, to apply. For example, to match every odd line in the input, the address specification would be 1~2 (start at line 1 and match every two lines thereafter). This feature is a GNU extension to sed.

If no address is given, commands are applied to all input lines by default. Any address may be followed by the ! character, in which case commands are applied to the lines that do not match the address.
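Each address form can be observed by pairing it with the d (delete) command, so that the lines an address selects are the ones missing from the output. The five-line input below is invented for illustration, and the n~s step form assumes GNU sed.

```shell
# Five numbered lines of sample input (invented for illustration).
input=$(printf 'line1\nline2\nline3\nline4\nline5')

printf '%s\n' "$input" | sed '2d'        # single line number: drop line 2
printf '%s\n' "$input" | sed '2,4d'      # range: drop lines 2 through 4
printf '%s\n' "$input" | sed '$d'        # $ addresses the last line of input
printf '%s\n' "$input" | sed '/3$/d'     # regex address: drop lines ending in 3
printf '%s\n' "$input" | sed '1~2d'      # GNU step address: drop the odd lines
printf '%s\n' "$input" | sed '/3$/!d'    # ! negates: drop everything except matches
```

The last command leaves only line3, which shows that ! turns a deletion of matching lines into a deletion of everything else.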

5.8.2.1. Commands

The sed command immediately follows the address specification if present. Commands generally consist of a single letter or symbol, unless they have arguments. Following are some basic sed editing commands to get you started.


d

Delete lines.


s

Make substitutions. This is a very popular sed command. The syntax is as follows:

s/pattern/replacement/[flags]

The following flags can be specified for the s command:


g

Replace all instances of pattern, not just the first.


n

Replace nth instance of pattern; the default is 1.


p

Print the line if a successful substitution is done. Generally used with the -n command-line option.


w file

Print the line to file if a successful substitution is done.


y

Translate characters. This command works in a fashion similar to the tr command, described earlier.
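The s flags and the y command are easiest to tell apart on a tiny input. The sketch below uses invented strings and a throwaway temporary file for the w flag; it also relies on the -n command-line option, which suppresses sed's normal output so that only lines printed by the p flag appear.

```shell
line='a-b-c-d'

echo "$line" | sed 's/-/+/'       # no flag: first instance only
echo "$line" | sed 's/-/+/g'      # g: every instance
echo "$line" | sed 's/-/+/2'      # numeric flag: second instance only

# p prints a line after a successful substitution; with -n, only the
# lines that were actually changed are shown.
printf 'a-b\nplain\n' | sed -n 's/-/+/p'

# w writes changed lines to a file (a throwaway temporary file here).
out=$(mktemp)
printf 'a-b\nplain\n' | sed "s/-/+/w $out" > /dev/null
cat "$out"

# y maps characters one for one, much like tr.
echo 'aabbcc' | sed 'y/abc/xyz/'
```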


Example 1

Delete lines 3 through 5 of file1:

 $ sed '3,5d' file1 


Example 2

Delete lines of file1 that contain a # at the beginning of the line:

 $ sed '/^#/d' file1 


Example 3

Translate characters:

 y/abc/xyz/ 

Every instance of a is translated to x, b to y, and c to z.


Example 4

Write the @ symbol for all empty lines in file1 (that is, lines with only a newline character but nothing more):

 $ sed 's/^$/@/' file1 


Example 5

Remove all double quotation marks from all lines in file1:

 $ sed 's/"//g' file1 


Example 6

Using sed commands from external file sedcmds, replace the third and fourth double quotation marks with ( and ) on lines 1 through 10 in file1. Make no changes from line 11 to the end of the file. Script file sedcmds contains:

 1,10{
 s/"/(/3
 s/"/)/4
 }

The command is executed using the -f option:

 $ sed -f sedcmds file1 

This example employs the positional flag for the s (substitute) command. The first of the two commands substitutes ( for the third double-quote character. The next command substitutes ) for the fourth double-quote character. Note, however, that the position count is interpreted independently for each subsequent command in the script. This is important because each command operates on the results of the commands preceding it. In this example, since the third double quote has been replaced with (, it is no longer counted as a double quote by the second command. Thus, the second command will operate on the fifth double quote character in the original file1. If the input line starts out with the following:

 """""" 

after the first command, which operates on the third double quote, the result is this:

 ""(""" 

At this point, the numbering of the double-quote characters has changed, and the fourth double quote in the line is now the fifth character. Thus, after the second command executes, the output is as follows:

 ""(")" 

As you can see, creating scripts with sed requires that the sequential nature of the command execution be kept in mind.
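The sequential behavior described above can be checked directly from the command line. This sketch applies the same two substitutions to the six-quote line, first one at a time and then together; the 1,10 line range is left out because the input is a single line, and the variable names are arbitrary.

```shell
line='""""""'

# Apply the two substitutions one after the other, as the script would.
step1=$(printf '%s\n' "$line" | sed 's/"/(/3')
step2=$(printf '%s\n' "$step1" | sed 's/"/)/4')
echo "$step1"
echo "$step2"

# The same two commands in a single sed invocation give the same result.
both=$(printf '%s\n' "$line" | sed -e 's/"/(/3' -e 's/"/)/4')
echo "$both"
```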

If you find yourself making repetitive changes to many files on a regular basis, a sed script is probably warranted. Many more commands are available in sed than are listed here.

5.8.3. Examples

Now that the gory details are out of the way, here are some examples of simple regular expression usage that you may find useful.

5.8.3.1. Anchors

Anchors are used to describe position information. Table 5-6 lists anchor characters.


Example 1

Display all lines from file1 where the string Linux appears at the start of the line:

 $ grep '^Linux' file1 


Example 2

Display lines in file1 where the last character is an x:

 $ grep 'x$' file1 


Example 3

Display the number of empty lines in file1 by finding lines with nothing between the beginning and the end:

 $ grep -c '^$' file1 


Example 4

Display all lines from file1 containing only the word null by itself:

 $ grep '^null$' file1 
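To see the anchor examples produce output, they can be run against a small sample file. The file contents below are invented for this sketch and include two empty lines.

```shell
dir=$(mktemp -d)
printf 'Linux rocks\nI use Linux\nunix\n\nnull\nnot null\n\n' > "$dir/file1"

grep '^Linux' "$dir/file1"       # only the line that starts with Linux
grep 'x$' "$dir/file1"           # lines whose last character is x
grep -c '^$' "$dir/file1"        # number of empty lines
grep '^null$' "$dir/file1"       # the line consisting of null and nothing else
```

Note that the line "not null" is not selected by the last command, because both anchors must be satisfied at once.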

5.8.3.2. Groups and ranges

Characters can be placed into groups and ranges to make regular expressions more efficient, as shown in Table 5-7.


Example 1

Display all lines from file1 containing Linux, linux, TurboLinux, and so on:

 $ grep '[Ll]inux' file1 


Example 2

Display all lines from file1 which contain three adjacent digits:

 $ grep '[0-9][0-9][0-9]' file1 


Example 3

Display all lines from file1 beginning with any single character other than a digit:

 $ grep '^[^0-9]' file1 


Example 4

Display all lines from file1 that contain the whole word Linux or linux, but not LinuxOS or TurboLinux:

 $ grep '\<[Ll]inux\>' file1 


Example 5

Display all lines from file1 with five or more characters on a line (excluding the newline character):

 $ grep '.....' file1 


Example 6

Display all nonblank lines from file1 (i.e., that have at least one character):

 $ grep '.' file1 


Example 7

Display all lines from file1 that contain a period (normally a metacharacter) using an escape:

 $ grep '\.' file1 
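The group and range examples can be exercised together on one sample file. The contents below are invented for this sketch, and the \< and \> word anchors assume GNU grep.

```shell
dir=$(mktemp -d)
printf 'Linux\nlinux\nTurboLinux\nLinuxOS\n42 is short\nroom 101\nv1.2\n' > "$dir/file1"

grep '[Ll]inux' "$dir/file1"             # substring match: four lines
grep '\<[Ll]inux\>' "$dir/file1"         # whole word only: two lines
grep '[0-9][0-9][0-9]' "$dir/file1"      # three adjacent digits
grep '^[^0-9]' "$dir/file1"              # first character is not a digit
grep -c '\.' "$dir/file1"                # lines containing a literal period
grep -c '.....' "$dir/file1"             # lines with at least five characters
```

The first two commands differ only in the word anchors, which is what excludes TurboLinux and LinuxOS from the second result.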

5.8.3.3. Modifiers

Modifiers change the meaning of other characters in a regular expression. Table 5-8 lists these modifiers.


Example 1

Display all lines from file1 that contain ab, abc, abcc, abccc, and so on:

 $ grep 'abc*' file1 


Example 2

Display all lines from file1 that contain abc, abcc, abccc, and so on, but not ab:

 $ grep 'abcc*' file1 


Example 3

Display all lines from file1 that contain two or more adjacent digits:

 $ grep '[0-9][0-9][0-9]*' file1 

or:

 $ grep '[0-9]\{2,\}' file1 


Example 4

Display lines from file1 that contain file (because ? can match zero occurrences), file1, or file2:

 $ grep 'file[12]\?' file1 


Example 5

Display all lines from file1 containing at least one digit:

 $ grep '[0-9]\+' file1 


Example 6

Display all lines from file1 that contain 111, 1111, or 11111 on a line by itself:

 $ grep '^1\{3,5\}$' file1 


Example 7

Display all lines from file1 that contain any three-, four-, or five-digit number:

 $ grep '\<[0-9]\{3,5\}\>' file1 


Example 8

Display all lines from file1 that contain Happy, happy, Sad, sad, Angry, or angry:

 $ grep -E '[Hh]appy|[Ss]ad|[Aa]ngry' file1 


Example 9

Display all lines of file that contain any repeated sequence of abc (abcabc, abcabcabc, and so on):

 $ grep '\(abc\)\{2,\}' file 
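Several of the modifier examples can be checked against one piece of sample text. The input below is invented for this sketch and is fed to grep through a pipe; the \? form assumes GNU grep.

```shell
sample=$(printf 'ab\nabc\nabcc\nfile\nfile2\n7\n42\n111\n1111111\nabcabc\nsad day')

printf '%s\n' "$sample" | grep -c 'abc*'             # ab, abc, abcc, abcabc
printf '%s\n' "$sample" | grep -c 'abcc*'            # abc, abcc, abcabc (not ab)
printf '%s\n' "$sample" | grep '[0-9]\{2,\}'         # two or more adjacent digits
printf '%s\n' "$sample" | grep '^1\{3,5\}$'          # 111 only; 1111111 is too long
printf '%s\n' "$sample" | grep 'file[12]\?'          # file and file2
printf '%s\n' "$sample" | grep -E '[Hh]appy|[Ss]ad'  # alternation, extended syntax
printf '%s\n' "$sample" | grep '\(abc\)\{2,\}'       # abc repeated back to back
```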

5.8.3.4. Basic regular expression patterns

Example 1

Match any letter:

 [A-Za-z] 


Example 2

Match any symbol (not a letter or digit):

 [^0-9A-Za-z] 


Example 3

Match an uppercase letter, followed by zero or more lowercase letters:

 [A-Z][a-z]* 


Example 4

Match a U.S. Social Security Number (123-45-6789) by specifying groups of three, two, and four digits separated by dashes:

 [0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\} 


Example 5

Match a dollar amount, using an escaped dollar sign, zero or more spaces or digits, an escaped period, and two more digits:

 \$[ 0-9]*\.[0-9]\{2\} 


Example 6

Match the month of June and its abbreviation, Jun. The question mark matches zero or one instance of the e:

 June\? 
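One way to experiment with patterns like these is to wrap grep in a small helper that tests a single string. The matches function below is a hypothetical convenience written for this sketch, not a standard command; it anchors the pattern at both ends so the whole string must match, and the test strings are invented.

```shell
# Hypothetical helper: succeed if the whole string $2 matches basic regex $1.
matches() {
    printf '%s\n' "$2" | grep -q "^$1\$"
}

ssn='[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}'
dollars='\$[ 0-9]*\.[0-9]\{2\}'
month='June\?'

matches "$ssn" '123-45-6789'   && echo "SSN-shaped string matches"
matches "$ssn" '123-456-789'   || echo "wrong digit grouping is rejected"
matches "$dollars" '$ 19.99'   && echo "dollar amount matches"
matches "$month" 'Jun'         && echo "Jun matches"
matches "$month" 'June'        && echo "June matches"
```

The patterns are stored in single-quoted variables so that the shell leaves the backslashes and the dollar sign alone.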

On the Exam

Make certain you are clear about the difference between file globbing and the use of regular expressions.


5.8.3.5. Using regular expressions as addresses in sed

These examples are commands you would issue to sed. For example, the commands could take the place of command1 in this usage:

 $ sed [options] 'command1' [files] 

These commands could also appear in a standalone sed script.


Example 1

Delete blank lines:

 /^$/d 


Example 2

Delete any line that doesn't contain #keepme:

 /#keepme/!d 


Example 3

Delete lines containing only whitespace (spaces or tabs). In this example, Tab means the single tab character and is preceded by a single space:

 /^[ Tab]*$/d 

Because GNU sed also supports character classes, this example could be written as follows:

 /^[[:blank:]]*$/d 


Example 4

Delete lines beginning with periods or pound signs. Inside a bracket expression the period is already a literal character, so it needs no backslash:

 /^[.#]/d 


Example 5

Substitute a single space for any run of consecutive spaces, wherever it occurs on the line:

 s/  */ /g 

or

 s/ \{2,\}/ /g 


Example 6

Substitute def for abc from line 11 to 20, wherever it occurs on the line:

 11,20s/abc/def/g 


Example 7

Translate the characters a, b, and c to the @ character from line 11 to 20, wherever they occur on the line:

 11,20y/abc/@@@/ 
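Several of these commands can be combined in one invocation to clean up a small piece of text. The input below is invented for this sketch, the [[:blank:]] class assumes a sed that supports POSIX character classes (such as GNU sed), and the period inside the bracket expression is written without a backslash.

```shell
input=$(printf '# comment\n\nkeep   this  line\n.hidden\n \t \nabc abc')

# Delete empty and whitespace-only lines and lines starting with . or #,
# then squeeze runs of spaces and replace every abc with def.
result=$(printf '%s\n' "$input" | sed \
    -e '/^[[:blank:]]*$/d' \
    -e '/^[.#]/d' \
    -e 's/  */ /g' \
    -e 's/abc/def/g')
printf '%s\n' "$result"
```

Only two lines survive: the "keep" line with its spacing normalized, and the last line with both instances of abc replaced.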



LPI Linux Certification in a Nutshell
ISBN: 0596005288
EAN: 2147483647
Year: 2004
Pages: 257
