Finding Patterns in Files | UNIX: The Complete Reference, Second Edition (Complete Reference Series)

Among the most commonly used tools in the UNIX System are those for finding words in files, especially grep, fgrep, and egrep. These commands search for text that matches a target or pattern that you specify. You can use them to extract information from files, to search the output of a command for lines relating to a particular item, and to locate files containing a particular keyword.

The three commands in the grep family are very similar. All of them print lines matching a target. They differ, however, in how you specify the search targets.

grep is the most commonly used of the three commands. It lets you search for a target, which may be one or more words or a pattern containing wildcards and other regular expression elements.
fgrep (fixed grep) does not allow regular expressions but does allow you to search for multiple targets.
egrep (extended grep) takes a richer set of regular expressions, as well as allowing multiple target searches, and is considerably faster than grep.

grep

The grep command searches through one or more files for lines containing a target and then prints all of the matching lines it finds. For example, the following command prints all lines in the file mtg_note that contain the word “room”:

 $ grep room mtg_note
 will be held at 2:00 in room 1J303. We will discuss

Note that you specify the target as the first argument and follow it with the names of the files to search. Think of the command as “search for target in file.”

The target can be a phrase, that is, two or more words separated by spaces. If the target contains spaces, however, you have to enclose it in quotes to prevent the shell from treating the different words as separate arguments. The following searches for lines containing the phrase “boxing wizards” in the file pangrams:

 $ grep "boxing wizards" pangrams
 The five boxing wizards jump quickly.

Note that if the words “boxing” and “wizards” appear on different lines (separated by a newline character), grep will not find them, because it looks at only one line at a time.
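The line-at-a-time behavior can be seen in a small sketch. The temporary directory and the two extra lines in the sample file are made up for illustration and are not part of the original example:

```shell
# Build a throwaway copy of "pangrams" (hypothetical sample data).
tmpdir=$(mktemp -d)
printf '%s\n' \
    'The five boxing wizards jump quickly.' \
    'He took up boxing' \
    'wizards are rare' > "$tmpdir/pangrams"

# Only the first line holds the whole phrase; the phrase that is
# split across lines two and three is not found.
matches=$(grep "boxing wizards" "$tmpdir/pangrams")
echo "$matches"

rm -rf "$tmpdir"
```

Only the first line is printed, because neither of the other two lines contains the complete phrase on its own.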

If you give grep two or more files to search, it includes the name of the file before each line of output. For example, the following command searches for lines containing the string “vacation” in all of the files in the current directory:

 $ grep vacation *
 mbox: I'll be gone on vacation July 24–28, but we could meet
 mbox: so, the only week when we're all available for a vacation
 savemail: sounds like a great idea for a vacation. I'd love

The output lists the names of the two files that contain the target word “vacation” (mbox and savemail) and the line or lines containing the target in each file.
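To compare the output for one file with the output for several, a sketch along the following lines may help. The file contents and the extra file todo are assumptions invented for the demonstration:

```shell
# Three small files in a scratch directory (hypothetical contents).
tmpdir=$(mktemp -d)
echo "I'll be gone on vacation in July" > "$tmpdir/mbox"
echo "sounds like a great idea for a vacation" > "$tmpdir/savemail"
echo "nothing relevant here" > "$tmpdir/todo"

# One file: grep prints only the matching line.
one_file=$(cd "$tmpdir" && grep vacation mbox)

# Several files: each matching line is prefixed with "filename:".
many_files=$(cd "$tmpdir" && grep vacation *)

echo "$one_file"
echo "$many_files"

rm -rf "$tmpdir"
```

The file todo does not appear in the second result, because it contains no matching line.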

You can use this feature to locate a file when you have forgotten its name but remember a keyword that would identify it. For example, if you keep copies of your saved e-mail in a particular directory, you can use grep to find the one dealing with a particular subject by searching for a word or phrase that you know is contained in it. The following command shows how you can use grep to find mail from someone named Dan:

 $ grep Dan *
 savemail27: From: Dan N <dnidz>
 savemail43: well, sure. Dancing is pretty good exercise, so I

This shows you that the letter you were looking for is in the file savemail27.

Searching for Patterns Using Regular Expressions

The examples so far have used grep to search for specific words or strings of text, but grep also allows you to search for patterns that may match a number of different words or strings. The patterns for grep can be the same kinds of regular expressions that were described in Chapter 5. For example,

 $ grep 'ch.*se' recipes

will find entries containing “chinese” or “cheese”, or in fact any line that has a ch sometime before an se, including something like “Blanch for 45 seconds”.

In the preceding pattern, the dot (.) matches any character other than newline. The asterisk says that those characters may be repeated any number of times. Together, .* indicates any string of any characters. Note that in this example the target pattern “ch.*se” is enclosed in single quotation marks. This prevents the asterisk from being treated by the shell as a filename wildcard. In general, you need to use quotes around any regular expression containing a character that has special meaning for the shell. (Filename wildcards and other special shell symbols are discussed in Chapter 4.)
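The effect of the .* pattern can be checked with a short sketch. The recipes file here is a stand-in with invented lines:

```shell
# A made-up "recipes" file for illustration only.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'chinese dumplings' \
    'cheese sauce' \
    'Blanch for 45 seconds' \
    'chop the onions' > "$tmpdir/recipes"

# Single quotes keep the shell from expanding the asterisk.
matches=$(grep 'ch.*se' "$tmpdir/recipes")
echo "$matches"

rm -rf "$tmpdir"
```

The first three lines match, including the unintended “Blanch for 45 seconds”; the last line has a ch but no se after it.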

Other regular expression symbols that are often useful in specifying targets for grep include the caret (^) and dollar sign ($), which are used to anchor words to the beginning and end of lines, and brackets ([ ]), which are used to indicate a class of characters. The following example shows how these can be used to specify patterns as targets:

 $ grep '^Section [1-9]$' manuscript

This command finds all lines that contain just “Section n”, where n is a number from 1 to 9, in the file manuscript. The caret at the beginning and the dollar sign at the end indicate that the pattern must match the whole line. The brackets indicate that the target can include any one of the numbers from 1 to 9.
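A brief sketch shows which lines the anchored pattern accepts and rejects. The manuscript contents are hypothetical:

```shell
# A made-up "manuscript" with a few near misses.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'Section 1' \
    'Section 12' \
    'See Section 3' \
    'Section 7' \
    'Section 0' > "$tmpdir/manuscript"

# ^ and $ force the pattern to cover the whole line;
# [1-9] accepts exactly one digit from 1 through 9.
matches=$(grep '^Section [1-9]$' "$tmpdir/manuscript")
echo "$matches"

rm -rf "$tmpdir"
```

“Section 12” fails because of the extra digit before the end of the line, “See Section 3” fails because the line does not begin with “Section”, and “Section 0” fails because 0 is outside the bracketed range.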

Table 19–1 lists regular expression symbols that are useful in forming grep search patterns.

Table 19–1: grep Regular Expressions
Symbol	Definition	Example	Matches
.	Matches any single character.	th.nk	think, thank, thunk, etc.
\	Quotes the following character.	script\.py	script.py
*	Matches zero or more repetitions of the previous item.	ap*le	ale, apple, etc.
[ ]	Matches any one of the characters inside.	[QqXx]	Q, q, X, or x
[a-z]	Matches any one of the characters in the range.	[0-9]*	any number: 0110, 27, 9876, etc.
^	Matches the beginning of a line.	^If	any line beginning with If
$	Matches the end of a line.	\.$	any line ending in a period
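As one illustration of the table, the following sketch contrasts an unquoted dot with a dot quoted by a backslash. The file name names and its contents are assumptions:

```shell
# Hypothetical list of names, one per line.
tmpdir=$(mktemp -d)
printf '%s\n' 'script.py' 'scriptxpy' 'script_py' > "$tmpdir/names"

# Unquoted dot: matches any single character, so all three lines match.
unescaped=$(grep 'script.py' "$tmpdir/names")

# Backslash-quoted dot: matches only a literal period.
escaped=$(grep 'script\.py' "$tmpdir/names")

echo "$unescaped"
echo "$escaped"

rm -rf "$tmpdir"
```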

Options for grep

Normally, grep distinguishes between uppercase and lowercase. For example, the following command would find “Unix” but not “UNIX” or “unix”:

 $ grep Unix notes

You can use the -i (ignore case) option to find all lines containing a target regardless of uppercase and lowercase distinctions. This command finds all occurrences of the word “unix” regardless of capitalization:

 $ grep -i unix notes
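The difference between the two commands can be sketched as follows, with a notes file whose lines are invented for the example:

```shell
# Hypothetical "notes" file with mixed capitalization.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'Unix is fun' \
    'UNIX System V' \
    'unix tools' \
    'Linux' > "$tmpdir/notes"

# Case-sensitive search: only the exact spelling "Unix" matches.
exact=$(grep Unix "$tmpdir/notes")

# -i ignores case, so all three spellings match.
any_case=$(grep -i unix "$tmpdir/notes")

echo "$exact"
echo "$any_case"

rm -rf "$tmpdir"
```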

The -r option causes grep to search recursively through the files in all the subdirectories of the current directory.

 $ grep -r "\.p[ly]" *
 PerlScripts/quickmail.pl: # usage: quickmail.pl recipient subject contents
 PythonScripts/zwrite.py: # usage: zwrite.py username

The backslash (\) prevents the dot (.) from being treated as a regular expression character; it represents a literal period here, so grep searches for lines containing “.pl” or “.py”. Be careful: if the directory contains many subdirectories with many files in them, it can take a very long time for a command like this to complete.
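A recursive search can be tried safely in a scratch directory, as in this sketch. The directory layout mirrors the example, but the file contents and the extra file notes.txt are assumptions, and the -r option is assumed to be available (it is in GNU and BSD grep):

```shell
# Build a small directory tree (hypothetical contents).
tmpdir=$(mktemp -d)
mkdir "$tmpdir/PerlScripts" "$tmpdir/PythonScripts"
echo '# usage: quickmail.pl recipient subject contents' > "$tmpdir/PerlScripts/quickmail.pl"
echo '# usage: zwrite.py username' > "$tmpdir/PythonScripts/zwrite.py"
echo 'plain notes with no script names' > "$tmpdir/notes.txt"

# Search every file below the scratch directory for ".pl" or ".py".
matches=$(cd "$tmpdir" && grep -r '\.p[ly]' *)
echo "$matches"

rm -rf "$tmpdir"
```

Each matching line is prefixed with the path of the file in which it was found, relative to the directory where the search started.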

Another useful grep option, -n, prints the number of each line on which the target (here, while) is found. For example,

 $ grep -n while perlsample.pl
 4: while (<>){
 11: while ($n > -0) {

One of the common uses of grep is to find which of several files in a directory deals with a particular topic. If all you want is to identify the files that contain a particular word or pattern, there is no need to print out the matching lines. With the -l (list) option, grep suppresses the printing of matching lines and just prints the names of files that contain the target. The following example lists all files in the current directory that include the word “Duckpond”:

 $ grep -l Duckpond *
 about.html
 index.html
 report.cgi

You can use this option with the shell command substitution feature described in Chapter 4 to use these filenames as arguments to another UNIX System command. For example, the following command will use more to display the contents of all the files found by grep. Note that the grep command is enclosed in backquotes (`), not single quotation marks:

 $ more `grep -l Duckpond *`
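The same idea can be sketched in a script. This version uses the $( ) form of command substitution, which POSIX shells accept as an alternative to backquotes, and substitutes cat for more so that it runs without a terminal. The file contents are invented:

```shell
# Scratch directory with three files (hypothetical contents).
tmpdir=$(mktemp -d)
echo 'Welcome to Duckpond' > "$tmpdir/about.html"
echo 'Duckpond home' > "$tmpdir/index.html"
echo 'nothing to see here' > "$tmpdir/readme.txt"

# -l prints only the names of the files that contain the target.
files=$(cd "$tmpdir" && grep -l Duckpond *)

# Command substitution passes those names to another command.
# (Unquoted on purpose: this simple form breaks on names with spaces.)
contents=$(cd "$tmpdir" && cat $(grep -l Duckpond *))

echo "$files"
echo "$contents"

rm -rf "$tmpdir"
```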

By default, grep finds all lines that match the target pattern. Sometimes, though, it is useful to find the lines that do not match a particular pattern. You can do this with the -v option, which tells grep to print all lines that do not contain the specified target. This provides a quick way to find entries in a file that are missing a required piece of information. For example, suppose the file phonenums contains your personal phone book. The following command will print all lines in phonenums that do not contain numbers:

 $ grep -v '[0-9]' phonenums
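A short sketch of the inverted search, using a made-up phonenums file in which two entries are missing their numbers:

```shell
# Hypothetical phone book; two entries lack a number.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'saul 555-1122' \
    'michelle' \
    'anita 555-6677' \
    'ben' > "$tmpdir/phonenums"

# -v inverts the match: print lines with no digit anywhere.
missing=$(grep -v '[0-9]' "$tmpdir/phonenums")
echo "$missing"

rm -rf "$tmpdir"
```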

The -v option can also be useful for removing unwanted information from the output of another command. Chapter 3 described the file command and showed how you can use it to get a short description of the type of information contained in a file. Because the file command includes the word “directory” in its output for directories, you could list all files in the current directory that are not directories by piping the output of file to grep -v, as shown in the following example:

 $ file * | grep -v directory

fgrep

The fgrep command is similar to grep, but with three main differences: You can use it to search for several targets at once, it does not allow you to use regular expressions to search for patterns, and it is faster than grep. When you need to search many files or a very large file, the difference in speed can be significant.

With fgrep, you can search for lines containing any one of several targets. For example, the following command finds all entries in the phone_nums file that contain any of the words “saul”, “michelle”, or “anita”:

 $ fgrep "saul
 > michelle
 > anita" phone_nums

The output might look like this:

 saul           555-1122
 saul (home)    555-1100
 michelle       555-3344
 anita          555-6677

When you give fgrep multiple search targets, each one must be on a separate line, and the entire search string must be in quotation marks. In this example, if you didn’t put michelle on a separate line you would be searching for saul michelle, and if you left out the quotes, the command would execute as soon as you hit ENTER.
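In a script, the newline-separated targets can be written directly inside the quotes. The sketch below uses grep -F, which POSIX defines as the equivalent of fgrep, and a phone_nums file with invented entries:

```shell
# Hypothetical phone list.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'saul 555-1122' \
    'michelle 555-3344' \
    'anita 555-6677' \
    'ben 555-9999' > "$tmpdir/phone_nums"

# Each target sits on its own line inside one quoted argument.
# grep -F treats every target as a fixed string, as fgrep does.
matches=$(grep -F "saul
michelle
anita" "$tmpdir/phone_nums")
echo "$matches"

rm -rf "$tmpdir"
```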

With the -f (file) option, you can tell fgrep to take the search targets from a file, rather than having to enter them directly. If you had a file in your home directory named .friends containing the usernames of your friends on the system, you could use fgrep to search the output of the finger command for the names on your list, like this:

 $ finger | fgrep -f ~/.friends
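Because finger may not be installed everywhere, the following sketch simulates its output with printf and keeps the target list in a scratch file instead of ~/.friends. The usernames and listing format are assumptions, and grep -F stands in for fgrep:

```shell
# Target list, one username per line (stand-in for ~/.friends).
tmpdir=$(mktemp -d)
printf '%s\n' 'saul' 'anita' > "$tmpdir/friends"

# Simulated login listing (hypothetical; real finger output differs).
listing=$(printf '%s\n' 'saul Saul tty1' 'dnidz Dan tty2' 'anita Anita tty3')

# -f reads the fixed-string targets from the file.
matches=$(printf '%s\n' "$listing" | grep -F -f "$tmpdir/friends")
echo "$matches"

rm -rf "$tmpdir"
```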

egrep

The egrep command is the most powerful member of the grep command family. You can use it like fgrep to search for multiple targets, and it provides a larger set of regular expressions than grep. In fact, if you find yourself using the extended features of egrep often, you may want to add an alias that replaces grep with egrep in your shell configuration file. (For example, if you are using bash, you could add the line “alias grep=egrep” to your .bashrc.)

You can tell egrep to search for several targets in two ways: by putting them on separate lines as in fgrep, or by separating them with the vertical bar or pipe symbol (|). For example, the following command uses the pipe symbol to tell egrep to search for the words dan, robin, ben, and mari in the file phone_list:

 $ egrep "dan|robin|ben|mari" phone_list
 dan       dnidz       x1234
 robin     rpelc       x3141
 ben       bsquared    x9876
 marissa   mbaskett    x2718

Note that there are no spaces between the pipe symbol and the targets. If there were, egrep would consider the spaces part of the target string. Also note the use of quotation marks to prevent the shell from interpreting the pipe symbol as an instruction to create a pipeline.
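The effect of stray spaces around the pipe symbol can be seen in this sketch. It uses grep -E, which POSIX defines as the equivalent of egrep, and a phone_list file whose last entry is invented:

```shell
# Hypothetical phone list with single-space separators.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'dan dnidz x1234' \
    'robin rpelc x3141' \
    'ben bsquared x9876' \
    'marissa mbaskett x2718' \
    'jim jsmith x0000' > "$tmpdir/phone_list"

# No spaces around |: each name is a separate target.
tight=$(grep -E "dan|robin|ben|mari" "$tmpdir/phone_list")

# Spaces become part of the targets: "dan " and " robin".
spaced=$(grep -E "dan | robin" "$tmpdir/phone_list")

echo "$tight"
echo "$spaced"

rm -rf "$tmpdir"
```

With the spaces included, the line for robin is no longer found, because no space precedes “robin” at the start of its line.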

Table 19–2 summarizes the egrep extensions to the grep regular expression symbols.

Table 19–2: Additional egrep Regular Expressions
Symbol	Definition	Example	Matches
+	Matches one or more repetitions of the previous item.	.+	any non-empty line
?	Matches the previous item zero or one times.	index\.html?	index.htm, index.html
( )	Groups a portion of the pattern.	script(\.pl)?	script, script.pl
|	Matches either the value before or after the |.	(E|e)xit	Exit, exit
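Two of the extensions in the table can be tried with the following sketch. It again uses grep -E in place of egrep, adds ^ and $ anchors so that each pattern must cover the whole line, and works on word lists invented for the purpose:

```shell
# Hypothetical word lists.
tmpdir=$(mktemp -d)
printf '%s\n' 'index.htm' 'index.html' 'index.php' 'index.htmlx' > "$tmpdir/pages"
printf '%s\n' 'Exit' 'exit' 'quit' > "$tmpdir/words"

# ? makes the preceding "l" optional.
optional=$(grep -E '^index\.html?$' "$tmpdir/pages")

# ( ) groups the alternatives separated by |.
either=$(grep -E '^(E|e)xit$' "$tmpdir/words")

echo "$optional"
echo "$either"

rm -rf "$tmpdir"
```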

The egrep command provides most of the basic options of both grep and fgrep. You can tell it to ignore uppercase and lowercase distinctions (-i), search recursively through subdirectories (-r), print the line number of each match (-n), print only the names of files containing target lines (-l), print lines that do not contain the target (-v), and take the list of targets from a file (-f).