Searching for Text Inside Files

When using Unix, you will often want to search for specific words or strings of characters inside text files, or to search the long output of some command. The main Unix command for this is grep.

Using grep

To search for a string in a text file:

  • grep string file

    For example,

    grep memory /etc/rc

    finds all the lines in the file /etc/rc that contain the string memory (Figure 4.22). In this case only one line matched, so only one line was returned.

    Figure 4.22. Using grep to find the string memory in a file.
     localhost:~ vanilla$ grep memory /etc/rc
     echo "Starting virtual memory"
     localhost:~ vanilla$

To make the search case insensitive:

  • Use the -i option. For example,

    grep -i apple /etc/services

    Figure 4.23 shows part of the result. Note how both apple and Apple are found.

    Figure 4.23. Performing a case-insensitive search with grep. Your output depends on your machine's configuration. (Partial output shown.)
     localhost:~ vanilla$ grep -i apple /etc/services
     . . .
     asip-webadmin    311/udp       # AppleShare IP WebAdmin
     asip-webadmin    311/tcp       # AppleShare IP WebAdmin
     #                              Ann Huang <annhuang@apple.com>
     aurp             387/udp       # Appletalk Update-Based Routing Pro.
     aurp             387/tcp       # Appletalk Update-Based Routing Pro.
     appleqtc         458/udp       # apple quick time
     appleqtc         458/tcp       # apple quick time
     #                              <murali_ranganathan@quickmail.apple.com>
     appleqtcsrvr     545/udp       # appleqtcsrvr
     appleqtcsrvr     545/tcp       # appleqtcsrvr
     . . .
     localhost:~ vanilla$
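    If you want to try the difference between a plain search and a case-insensitive one without depending on the contents of /etc/services, a small sketch like the following may help. The file name fruit.txt and its contents are made up for illustration; only grep and the -i option come from the text above.

```shell
# Create a throwaway sample file (hypothetical data, not a system file).
tmpdir=$(mktemp -d)
printf '%s\n' \
  'apple pie' \
  'Apple Computer' \
  'banana bread' > "$tmpdir/fruit.txt"

# Default search is case sensitive: only the lowercase "apple" line matches.
sensitive=$(grep apple "$tmpdir/fruit.txt")

# With -i, both "apple" and "Apple" match.
insensitive=$(grep -i apple "$tmpdir/fruit.txt")

echo "$sensitive"
echo "---"
echo "$insensitive"

# Clean up the sample file.
rm -rf "$tmpdir"
```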

Tips

  • To see the output one screenful at a time, pipe the output of grep through the less pager:

    grep -i apple /etc/services | less

  • You can press Control-C to interrupt the command and get back to the prompt.


Not All greps Are the Same

We cover the version of grep that comes with Darwin/Mac OS X. Different flavors of Unix come with different versions of the grep family (grep, egrep, fgrep, agrep), so the exact behavior of each command will vary slightly depending on the version installed on your system. The best way to see the differences is to read the Unix man pages for each command.


To search for a string in multiple files:

  • Simply add more files to the argument list, perhaps by using wildcards. For example,

    grep -i network *

    As we discussed in Chapter 2, the * (asterisk) is a special character expanded by the shell to be a list of many files.

    This is probably the most common use of grep: finding all the occurrences of a string throughout multiple files. When searching multiple files, grep adds the filename at the beginning of each line of output so that you know which file each line came from:

    grep " boot" /etc/rc*

    gives output as shown in Figure 4.24.

    Notice the use of quotes to make the search string include a space character (searches for " boot" instead of just boot); try removing the quotes to see the difference.

    Figure 4.24. Using grep to search multiple files for a string that includes a space.
     localhost:~ vanilla$ grep " boot" /etc/rc*
     /etc/rc:    echo "CD-ROM boot procedure complete"
     /etc/rc:    echo "Configuring kernel extensions for safe boot"
     /etc/rc.netboot:# Prevent inadvertent problems caused by interrupting the shell during boot.
     localhost:~ vanilla$
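    To see the filename prefix and the effect of the quoted space for yourself, you could run something like the sketch below. The files rc.one and rc.two are hypothetical stand-ins for the /etc/rc* files, which may not exist on your system.

```shell
# Two hypothetical sample files in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'safe boot complete' 'reboot later' > "$tmpdir/rc.one"
printf '%s\n' 'nothing here' 'netboot starts' > "$tmpdir/rc.two"

# Pattern with a leading space: "reboot" and "netboot" do not match.
with_space=$(cd "$tmpdir" && grep " boot" rc.*)

# Pattern without the space: every line containing "boot" matches.
without_space=$(cd "$tmpdir" && grep boot rc.*)

# Each output line starts with the name of the file it came from.
echo "$with_space"
echo "---"
echo "$without_space"

rm -rf "$tmpdir"
```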

Tip

  • If you give an argument to grep that is a directory (instead of a regular file), grep gives you an error message saying that the file "is a directory." You can tell grep to skip directories by adding the -d skip option; for example,

    grep -d skip NETWORK */*


Where grep Gets Its Name

grep gets its name from g/RE/p, which is a representation of the commands in the old Unix editor ed to "globally search for a regular expression and print."

Regular expressions make up a complex and powerful system for matching patterns and are available in many Unix programs. We'll cover a small part of regular expressions in this chapter.

The grep program is the main Unix command for searching text files or a stream of text (such as the output of another program). The output of grep is every line that contains the search string. The default is for searches to be case sensitive.


To recursively search all the files in a directory:

  • Use the -r option. For example,

    grep -ri network /System/Library/StartupItems

    performs a case-insensitive search (note the -i option) of all the files inside /System/Library/StartupItems and all its subdirectories.

Tip

  • Using the -r option to grep can cause the command to take a long time to complete, so you might want to redirect the output to a file and put the job in the background (review "Redirecting stdout" and "Running a Command in the Background" in Chapter 2):

     grep -ri network  /System/Library/StartupItems >  found.txt & 
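    Here is a small sketch of a recursive search that you can run safely anywhere. The directory layout (items, items/sub) and file names are invented for the example, and the job is run in the foreground rather than with & so that the result is ready as soon as the command returns.

```shell
# Build a tiny hypothetical directory tree to search.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/items/sub"
printf '%s\n' 'Starting Network services' > "$tmpdir/items/top.txt"
printf '%s\n' 'network time' 'unrelated' > "$tmpdir/items/sub/deep.txt"

# -r descends into subdirectories, -i ignores case.
# The output is redirected to a file outside the tree being searched.
grep -ri network "$tmpdir/items" > "$tmpdir/found.txt"

# Count how many matching lines were saved.
found_count=$(wc -l < "$tmpdir/found.txt" | tr -d ' ')
echo "$found_count matching lines"

rm -rf "$tmpdir"
```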


To find the lines that do not match:

  • Use the -v option. For example,

    grep -v tcp /etc/services

    finds all the lines in the file /etc/services that do not contain the string tcp.
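    As a quick check of how -v inverts a search, you might try a sketch like this one. The file services.txt is a hypothetical three-line stand-in for /etc/services.

```shell
# Hypothetical miniature version of a services file.
tmpdir=$(mktemp -d)
printf '%s\n' 'ftp 21/tcp' 'ftp 21/udp' 'ssh 22/tcp' > "$tmpdir/services.txt"

# -v keeps only the lines that do NOT contain "tcp".
not_tcp=$(grep -v tcp "$tmpdir/services.txt")
echo "$not_tcp"

rm -rf "$tmpdir"
```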

To search the output of another command:

  • Pipe the output of the other command through grep.

    last | grep reboot

    shows all the reboots this month.

    (The last command shows a history of logins to your machine, as well as crashes and reboots. Type man last.)

Tip

  • Use multiple greps in the pipeline to narrow your request.

    last | grep reboot | grep "Feb 16"

    finds all the reboots on February 16. See how the first grep filters the output of the last command, and the second grep filters the output further, narrowing down the result.
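    Because the output of last differs on every machine, the following sketch uses a shell function named sample_last (a made-up stand-in that prints a few simplified, hypothetical lines) to show how each grep in the pipeline narrows the result.

```shell
# Hypothetical stand-in for `last`; the real output format may differ.
sample_last() {
  printf '%s\n' \
    'vanilla  ttyp1    Mon Feb 16 09:12   still logged in' \
    'reboot   ~        Mon Feb 16 09:10' \
    'reboot   ~        Fri Feb 13 18:45' \
    'vanilla  console  Fri Feb 13 18:46 - 19:02  (00:16)'
}

# First filter: keep only the reboot lines.
reboots=$(sample_last | grep reboot)

# Second filter: of those, keep only the ones from Feb 16.
feb16_reboots=$(sample_last | grep reboot | grep "Feb 16")

echo "$reboots"
echo "---"
echo "$feb16_reboots"
```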


The grep and egrep (e for extended) programs have a huge number of options. Table 4.1 lists some of the more common ones. See the Unix manual for more (type man grep).

Table 4.1. Options for grep and egrep

OPTION     MEANING
-i         Ignore case.
-v         Show only lines that do not match.
-n         Add line numbers.
-l         Show only the names of files in which matches were found.
-L         Show only the names of files without a match. [*]
-r         Recursively search directories. [*]
-d skip    Skip arguments that are directories. [*]

[*] These options are not as common as the others, but Mac OS X does have them.
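A short sketch of three of the options from Table 4.1 follows. The files a.txt and b.txt are hypothetical; as the footnote says, -L is not available in every version of grep, so treat that line as an assumption about your system.

```shell
# Two hypothetical files: only a.txt contains the word "network".
tmpdir=$(mktemp -d)
printf '%s\n' 'first line' 'network up' 'last line' > "$tmpdir/a.txt"
printf '%s\n' 'nothing relevant' > "$tmpdir/b.txt"

# -n prefixes each matching line with its line number.
numbered=$(grep -n network "$tmpdir/a.txt")

# -l lists only the names of files that contain a match.
with_match=$(cd "$tmpdir" && grep -l network a.txt b.txt)

# -L lists only the names of files that contain no match.
without_match=$(cd "$tmpdir" && grep -L network a.txt b.txt)

echo "$numbered"
echo "$with_match"
echo "$without_match"

rm -rf "$tmpdir"
```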

Using patterns in your search

You will often want to search for something more complicated than a literal string of characters. You might want to search only for lines that begin with a certain string, or for lines that contain a range of dates, such as Feb 15 or Feb 16.

The egrep command supports an extremely powerful (and complex) pattern-matching system called regular expressions. The re in egrep stands for regular expression. (The grep command also supports a small number of regular expressions. To avoid switching back and forth, we will stick with egrep here.)

Regular expressions are used in a large number of situations in Unix, not only with the grep and egrep commands. For example, the Unix programs sed , awk , and vi all use regular expressions, as do the C, Perl, Tcl, Python, and Java programming languages. The basic syntax of regular expressions is the same or very similar in a variety of situations, so once you learn how to use them in one area, you have a head start on using them in another.

Regular expressions are built up like mathematical formulas (see the sidebar "Learning More About Regular Expressions"). You can do a lot with the few rules we'll show you here.

An important concept to grasp in using regular expressions is that when you search for "hello," you are really searching for a pattern consisting of five atoms (h, e, l, l, and o). In regular expressions, an atom is a part of the overall expression that matches one character. The most common kind of atom is simply a literal character, so the atom h matches the letter h. But atoms don't stop there. For example, the atom [a-d] matches one letter from the range a, b, c, or d. (The [ and ] are used to define a set of characters.) So when you see the word atom used in the examples below, keep in mind that an atom can be as simple as one character, or it can be a more complex notation that matches one character from a list of possibilities.

Regular expressions have a few major rules, which are demonstrated in the examples below.

Also, note that you should always enclose the pattern inside single quotes; this prevents the shell from misinterpreting any of the characters used in regexes (as they're traditionally called) that also have special meaning to the shell, such as [ ] { } and *.

Compare with Aqua

Mac OS X provides a nice graphical interface for finding strings of text in files on your Mac, and in some ways it can do a better job than grep . Spotlight (introduced in Mac OS X 10.4) is a powerful and elegantly designed graphical interface for searching your hard disk. Spotlight searches the names and contents of files and presents the results grouped by types of files. It also builds indexes of files, which speeds up searching. And, of course, it is mostly a point-and-click interface.

In practice, grep is often easier to use than Spotlight (once you get used to the command line); for example, it is rather hard in Spotlight to focus your search on a few specific files, whereas with grep you can supply any arbitrary list of files as arguments on the command line.


To find lines starting with a specific string of characters:

  • Put a ^ at the beginning of the string. For example,

    grep ^# ~/bin/reverse

    finds only lines beginning with #, and grep -v ^# ~/bin/reverse finds all lines not beginning with #. Figure 4.25 shows the output from both command lines.

    Figure 4.25. Using a pattern that matches the character at the start of a line.
     localhost:~ vanilla$ cat ~/bin/reverse
     #!/usr/bin/perl
     # reverse cat script
     #
     @file = <>;
     while ( @file ) {
        print pop(@file);
     }
     localhost:~ vanilla$ grep ^# ~/bin/reverse
     #!/usr/bin/perl
     # reverse cat script
     #
     localhost:~ vanilla$ grep -v ^# ~/bin/reverse
     @file = <>;
     while ( @file ) {
        print pop(@file);
     }
     localhost:~ vanilla$

    The ^ character in a regex is called an anchor because it anchors the search string to the start of the line.

To find lines ending with a string:

  • Put a $ at the end of the string. For example,

    grep today$ filename

    finds only lines ending with today.

    The $ anchors the search string to the end of the line.

Tip

  • If you anchor a pattern to both the start and the end of the line (for example, ^word$), then only lines that exactly match will be found; that is, only lines that consist solely of the pattern, with nothing before or after.
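The two anchors, alone and combined, can be compared with a sketch like the one below. The file notes.txt and its five lines are invented for the example.

```shell
# Hypothetical sample file with lines chosen to exercise both anchors.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '# a comment' \
  'see you today' \
  'today is here' \
  'today' \
  'x # not at the start' > "$tmpdir/notes.txt"

# ^ anchors the pattern to the start of the line.
starts=$(grep '^#' "$tmpdir/notes.txt")

# $ anchors the pattern to the end of the line.
ends=$(grep 'today$' "$tmpdir/notes.txt")

# Both anchors together: the line must consist of the pattern and nothing else.
exact=$(grep '^today$' "$tmpdir/notes.txt")

echo "$starts"
echo "---"
echo "$ends"
echo "---"
echo "$exact"

rm -rf "$tmpdir"
```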


Testing regular expressions

Regular expressions can get quite complex, and learning how to use them takes practice. Luckily there is an easy way to test them to see if they match what you think they will match.

If you use one of the grep commands (grep or egrep) with a pattern but without giving it a filename or input from a pipe, then it waits for you to type input and repeats back to you any lines that match.

In most of our examples we use the egrep command because it supports extended regular expressions. In cases where we don't need extended regexes, we will use plain old grep.

To test a regular expression:

1.
Type egrep 'regular expression'

For example:

egrep '^[hH]ello'

(The list [hH] matches either an uppercase or a lowercase h.)

Notice you are not giving egrep a file to search. When you press Return, you get a blank line. egrep is waiting for you to type in a line of text, which it will check against the pattern. Figure 4.26 shows the next few steps in this task: each line you type appears once, and a line that matches is repeated back by egrep.

Figure 4.26. Testing a regular expression.
 localhost:~ vanilla$ egrep '^[hH]ello'
 Hello, nice to meet you.
 Hello, nice to meet you.
 Say, Hello world
 ^C
 localhost:~ vanilla$

2.
Type in a line of text that you think should (or should not) match the pattern, and press Return. For example:

Hello, nice to meet you.

If egrep displays (repeats back to you) the line you typed, then the expression matched (the example above should match). Otherwise, it did not match.

3.
Type in another line of text to check:

Say, Hello world

This does not match the pattern in step 1 because the line does not match the ^ anchor ("look at the beginning of the line").

4.
To exit from the test, press Control-C.
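If you would rather check several test lines in one go, a non-interactive variant of the same idea is to feed the candidate lines to the command through a pipe. This is only a sketch: it uses grep -E, which behaves like egrep, and the third test line is made up.

```shell
# Feed three candidate lines to the pattern; only matching lines come back.
matched=$(printf '%s\n' \
  'Hello, nice to meet you.' \
  'Say, Hello world' \
  'hello again' | grep -E '^[hH]ello')

# The middle line is dropped because "Hello" is not at the start of the line.
echo "$matched"
```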

Learning More About Regular Expressions

Regular expressions are used not only with the grep program but also with multiple Unix programs and programming languages.

Here are a few places to learn more about regexes (as they are known to Unix experts):

  • Learning to Use Regular Expressions (http://gnosis.cx/publish/programming/regular_expressions.html).

    A nice online tutorial, though it assumes you are working with regular expressions in one of the many programming languages that use them.

  • Electronic Text Center: Using Regular Expressions (http://etext.lib.virginia.edu/helpsheets/regex.html).

    An introduction to regular expressions that describes the history and main concepts, and gives examples of their use.

  • Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools, by Jeffrey E. F. Friedl (O'Reilly, 1997; www.oreilly.com/catalog/regex).

    Considered by many to be the standard in-depth work on regular expressions.


Tips

  • In each of the tasks below, try testing the regular expression with several different lines of input to see how each one behaves. Try to predict what will and will not match.

  • If you make a mistake while typing a test expression, you may notice that the Delete key doesn't work the way you expect. Instead, press Control-U, which will erase the whole line. This is a very useful trick that works in many command-line situations.


To find lines containing a string in which one character can vary:

  • Use square brackets to create an atom from a list of characters. For example:

    egrep '[FNW]orm A-100'

    The atom [FNW] means "one character that matches any of the three atoms F, N, or W." Figure 4.27 shows examples of testing this pattern with three different matching lines.

    Figure 4.27. Using an atom that matches any character in a list.
     localhost:~ vanilla$ egrep '[FNW]orm A-100'
     Worm A-100 is a very virulent worm.
     Worm A-100 is a very virulent worm.
     Norm A-100 isn't really normal.
     Norm A-100 isn't really normal.
     Form A-100 must be filled out in pink ink.
     Form A-100 must be filled out in pink ink.
     ^C
     localhost:~ vanilla$

To create an atom that is anything not in a list:

  • Use the ^ character as the first character in the list. For example:

    egrep '[^FNW]orm A-100'

    Figure 4.28 shows examples of testing this pattern (a line you type is repeated back only if it matches). The pattern does not match any of the following strings:

     Form A-100
     Norm A-100
     Worm A-100

    Figure 4.28. Using an atom that matches any character not in the list.
     localhost:~ vanilla$ egrep '[^FNW]orm A-100'
     Worm A-100 no longer matches.
     Either does Norm A-100.
     Nor even Form A-100.
     But form A-100 does because of the lowercase f.
     But form A-100 does because of the lowercase f.
     See you in Dorm A-100
     See you in Dorm A-100
     how about this?
     ^C
     localhost:~ vanilla$

    This line will match:

    Dorm A-100

Tips

  • Notice how the last example is different from the -v option (described earlier). The -v option finds all lines that do not match the whole pattern. The example here finds lines that contain 'orm A-100' but only if the first letter before 'orm' is not F, N, or W.

  • Notice that the use of ^ here is different from using ^ as the start-of-line anchor. The ^ behaves differently when it is the first character in a square-bracket list.
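The difference between a list and a negated list can be seen side by side in the following sketch. The helper function sample_lines and its five test lines are made up for the example, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical test lines; the last one has nothing before "orm".
sample_lines() {
  printf '%s\n' 'Worm A-100' 'Norm A-100' 'form A-100' 'Dorm A-100' 'orm A-100'
}

# [FNW] matches exactly one character that is F, N, or W.
in_list=$(sample_lines | grep -E '[FNW]orm A-100')

# [^FNW] matches exactly one character that is NOT F, N, or W.
# It still needs a character to match, so the bare "orm A-100" line fails.
not_in_list=$(sample_lines | grep -E '[^FNW]orm A-100')

echo "$in_list"
echo "---"
echo "$not_in_list"
```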


To create an atom from a range of numbers:

  • Use egrep and put square brackets around the range:

    egrep 'Feb 1[5-9]' mail.log

    The [5-9] means "5, 6, 7, 8, or 9" in regular-expression language.

Tip

  • You can use multiple lists, such as [2-3][0-5]. Each list matches one character, so this pattern matches 20 through 25 and 30 through 35 (but not 26 through 29).


To create an atom from a range of letters:

  • Use egrep and the square brackets, and a-z for lowercase, A-Z for uppercase. For example:

    egrep 'Appendix [B-D]' book.txt

Tip

  • You can make the range case insensitive by using lowercase and uppercase ranges in the atom:

    egrep 'Appendix [B-Db-d]' book.txt
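Here is a sketch that exercises number and letter ranges on a few made-up lines. It sets LC_ALL=C as an added assumption so that ranges follow plain ASCII order, and it uses grep -E as the equivalent of egrep.

```shell
# Assumption: force the C locale so bracket ranges follow ASCII order.
LC_ALL=C
export LC_ALL

# One list: the second digit of the day must be 5 through 9.
days=$(printf 'Feb %s\n' 14 15 19 20 | grep -E 'Feb 1[5-9]')

# Two lists: each matches ONE character, so 26 and 29 are excluded.
two_digit=$(printf '%s\n' 19 20 25 26 29 30 35 36 | grep -E '^[2-3][0-5]$')

# Upper- and lowercase ranges combined in a single list.
appendix=$(printf 'Appendix %s\n' A B d E | grep -E 'Appendix [B-Db-d]')

echo "$days"
echo "---"
echo "$two_digit"
echo "---"
echo "$appendix"
```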


To use a wildcard character:

  • Use the . character to mean "any single character." For example:

    egrep '.oy'

    behaves as shown in Figure 4.29. Notice that the line that begins with oy doesn't match, because there is no character before the oy.

    Figure 4.29. Using a . (period) to match any single character.
     localhost:~ vanilla$ egrep '.oy'
     toy should match
     toy should match
     so should boy
     so should boy
     even coy
     even coy
     but not this line
     oy this one doesn't match either!
     but oy, this one does.
     but oy, this one does.
     ^C
     localhost:~ vanilla$
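    The same behavior can be checked without typing each line by hand, as in this sketch (the three test lines are made up, and grep -E stands in for egrep):

```shell
# The . must match some character, so "oy" at the very start of a line fails.
dot=$(printf '%s\n' 'toy' 'oy vey' 'but oy, yes' | grep -E '.oy')

# "toy" matches (t before oy); "but oy, yes" matches (a space before oy).
echo "$dot"
```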

To find lines in which an atom is repeated zero or more times:

  • Use the * quantifier. (That's an asterisk on your keyboard, but it's generally referred to as a star in Unix speak.) For example,

    grep 'Form A-10*'

    behaves as shown in Figure 4.30.

    Figure 4.30. Using * to match zero or more of an atom.
     localhost:~ vanilla$ grep 'Form A-10*'
     Form A-100
     Form A-100
     Form A-10
     Form A-10
     Form A-1
     Form A-1
     ^C
     localhost:~ vanilla$

    In this case the atom is the 0. The * quantifier means "zero or more of the preceding atom." Notice how it found the line in which 0 did not appear at all.

Tip

  • Be careful when using the * character. If an argument contains a * and you don't want the shell to expand it to a list of filenames (see "Wildcards" in Chapter 2), then you must make sure that any use of * on the command line is enclosed in quotes or that you escape the * by preceding it with a backslash; for example:

    grep fo\*bar *.txt

    In that case the shell would expand the second * to match all the filenames that end in .txt, but the shell would pass the string fo*bar to grep as an argument without expanding the * .
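The sketch below puts the two roles of * next to each other: as a regex quantifier inside quotes, and as a shell wildcard for filenames. The files a.txt and b.log and their contents are hypothetical.

```shell
# Inside quotes, * is a regex quantifier: zero or more of the preceding 0.
star=$(printf '%s\n' 'Form A-100' 'Form A-10' 'Form A-1' 'Form B-1' | grep 'Form A-10*')

# Hypothetical files: one ends in .txt, the other does not.
tmpdir=$(mktemp -d)
printf '%s\n' 'foobar' 'food' > "$tmpdir/a.txt"
printf '%s\n' 'foobar' > "$tmpdir/b.log"

# The escaped \* reaches grep as part of the pattern fo*bar;
# the unescaped * in *.txt is expanded by the shell to a.txt only.
escaped=$(cd "$tmpdir" && grep fo\*bar *.txt)

echo "$star"
echo "---"
echo "$escaped"

rm -rf "$tmpdir"
```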


More rules and tools for building regex atoms

The regex examples shown above allow you to perform some fairly sophisticated matching, but there are a lot more ways to create atoms and patterns. Table 4.2 describes several additional tools you will find useful in constructing more complex patterns. All the tools and rules in Table 4.2 require egrep . The real key is to experiment using the testing approach described above.

Table 4.2. Rules and Tools for Regex Atoms

RULE                       TOOL/MEANING
Match 1 or more            Use the + quantifier, "one or more of the preceding atom."
Match 0 or 1               Use the ? quantifier, "zero or one of the preceding atom."
Exact number               Put the number in braces; [a-c]{3} means "any character from the list a-c repeated exactly three times."
Alternatives               Put each alternative in parentheses, and separate them with the pipe character; '(Fox)|(Hound)' means "match lines containing either Fox or Hound."
Match special characters   If you want to match characters that have special meanings in a regex, such as [ or ^, then escape (that is, remove any special meaning from) them with a \; for example, \[ will match a literal [. Inside a square-bracket list, you do not need to escape anything.
Match ^ inside a list      To include the ^ character in a square-bracket list, put it anywhere except first in the list; for example, [a-c^] matches a, b, c, or ^.
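The rules in Table 4.2 can be tried out with a sketch along these lines. All the test strings are made up, grep -E is used as the equivalent of egrep, and LC_ALL=C is an added assumption so that the a-c range follows ASCII order.

```shell
# Assumption: force the C locale so bracket ranges follow ASCII order.
LC_ALL=C
export LC_ALL

# + : one or more of the preceding atom (at least one 0 is required).
plus=$(printf '%s\n' 'A-1' 'A-10' 'A-100' | grep -E 'A-10+')

# ? : zero or one of the preceding atom (the u is optional).
optional=$(printf '%s\n' 'color' 'colour' 'colouur' | grep -E 'colou?r')

# {3} : exactly three consecutive characters from the list a-c.
exact=$(printf '%s\n' 'ab' 'abc' 'xbcax' | grep -E '[a-c]{3}')

# | : alternatives, each in parentheses.
either=$(printf '%s\n' 'Fox' 'Hound' 'Cat' | grep -E '(Fox)|(Hound)')

# \[ matches a literal [; a ^ that is not first in a list is a literal ^.
literal=$(printf '%s\n' 'a[1]' 'a1' | grep -E '\[')
caret=$(printf '%s\n' 'x^y' 'xay' 'xdy' | grep -E 'x[a-c^]y')

printf '%s\n' "$plus" "$optional" "$exact" "$either" "$literal" "$caret"
```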




Unix for Mac OS X 10.4 Tiger: Visual QuickPro Guide (2nd Edition)
ISBN: 0321246683
EAN: 2147483647
Year: 2004
Pages: 161
Authors: Matisse Enzer
