Project 23. Search File Content

"How do I find all files containing the text Dear Janet before my wife does?"

This project shows how to search a file, or many files, for particular text. The search term can be straight text or a regular expression. The project covers the commands grep, wc, awk, and sed.

Learn More

To unleash the full power of grep, you must master regular expressions. Read Projects 77 and 78 to become a black belt in the art of /re/.


Use grep

This chapter puts the spotlight on grep and friends. The grep command searches through files to find particular text that matches a search pattern. A file is searched line by line, and a match occurs when a line contains the search pattern. It's important to realize that the search is done line by line and that to match, a line need only contain the search pattern, not be identical to it.

Let's search all text files (*.txt) in the current directory for the words Dear Janet.

$ grep "Dear Janet" *.txt
hello.txt:Dear Janet,
lets-meet.txt:Dear Janet, sauciest of vixens
secret-liaison.txt:Dear Janet,


We see displayed all lines from all files that match, with the matching line of text preceded by the filename.

grep Options

The grep command has many options, the most useful of which are explained below.

To change the output format from filename:text of matching line, specify the following:

  • -l to display just filenames. Use this option when you are interested in which filenames match but not in the matching lines (when generating a list of files to process, for example).

  • -h to display just the matching line. Use this option when you want to process the lines of text and don't want filenames polluting the output.

  • -n to display line numbers too. This option is handy when you wish to edit the file later, as you can jump straight to the line in question.

  • -Cn to display n lines before and after the matching line. C is for context. This option is useful when you search text documents for information.
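The output options are easiest to compare side by side on a throwaway directory. The following is only a sketch: the filenames and their contents are made up for the illustration, and the scratch directory comes from mktemp.

```shell
# Build a scratch directory with two sample letters (hypothetical contents).
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'Dear Janet,\nSee you at noon.\nYours, A.\n' > hello.txt
printf 'Dear Bob,\nNothing to report.\n' > boring.txt

# -l: names of the matching files only.
names=$(grep -l "Dear Janet" *.txt)

# -h: matching lines only, with no filename prefix.
lines=$(grep -h "Dear" *.txt)

# -n: each matching line is prefixed by its line number.
numbered=$(grep -n "noon" hello.txt)

# -C1: one line of context on either side of the match.
context=$(grep -C1 "noon" hello.txt)

printf '%s\n' "$names" "$lines" "$numbered" "$context"
```

Because only one file is named in the last two commands, grep leaves the filename off and shows just the line number or the context lines.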

grep: It's an Odd Name!

The name grep comes from an ancient Unix text editor called ed, the forerunner of ex, which is the forerunner of vi, which is the forerunner of vim. To search for a regular expression from ed, you'd use the command sequence g/re/p. g is for global (the whole file), /re/ is the regular expression to search for, delimited by slashes, and p says to print matching lines (that is, display them onscreen). Back in the days of ed, CPU power and memory were expensive, so to avoid the overhead of running a general-purpose editor to perform what is a very common task, a new specialized command was written. It was called, as you have already guessed, grep.

And to prove it:

$ ed meeting.txt
110
g/Jan/p
Dear Janet,
Perhaps on Jan 31st?
q



To change the pattern-matching rules, specify the following:

  • -i to ignore case. Hello will match hello and Hello.

  • -v to invert the sense of the match. Lines that do not contain the pattern will be displayed.

  • -w to match complete words only. Jan will match Dear Jan, hello but not Dear Janet, hello.

  • -x to match whole lines only. The line must equal the pattern, not just contain it. This option has the same effect as specifying start- and end-of-line anchors in the regular expression.

  • -E to activate extended regular expressions. By default, grep matches against basic regular expressions. Invoking grep as the command egrep is the same as using grep -E.

  • -F to match fixed strings only, not regular expressions. The grep command operates faster on fixed strings than on regular-expression patterns. Also, in this mode it's not necessary to escape characters like star that would otherwise be interpreted as pattern-matching operators. Invoking grep as the command fgrep is the same as using grep -F.
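To see how each matching option changes what counts as a match, we can run them all against the same small file and compare the counts. This is a sketch under stated assumptions: the five sample lines are invented for the purpose, and -c is used only to keep the output short.

```shell
# Sample input (hypothetical), one entry per line.
sample=$(mktemp)
printf 'Dear Jan, hello\nDear Janet, hello\ndear janet\nJan\n2*3\n' > "$sample"

ci=$(grep -c -i "dear janet" "$sample")     # -i: upper- and lowercase are the same
inv=$(grep -c -v "Jan" "$sample")           # -v: lines that do NOT contain Jan
word=$(grep -c -w "Jan" "$sample")          # -w: Jan as a complete word
whole=$(grep -c -x "Jan" "$sample")         # -x: the line is exactly Jan
ext=$(grep -c -E "Janet|janet" "$sample")   # -E: | means "or" in extended syntax
fixed=$(grep -c -F "2*3" "$sample")         # -F: star is an ordinary character

echo "$ci $inv $word $whole $ext $fixed"
rm -f "$sample"
```

The counts come out as 2 2 2 1 2 1. Note that -w accepts "Jan," because the comma is not a word character, whereas "Janet" is rejected.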

Use recursive mode:

  • -r to recursively search directories listed on the command line. In the following example, grep searches all files and directories in the current directory.

Learn More

For an explanation of recursive processing, see "Recursion" in Project 2.


$ grep -r "Janet" *
archive/old-letter.txt:Dear Janet,
hello.txt:Dear Janet,
lets-meet.txt:Dear Janet, sauciest of vixens
secret-liaison.txt:Dear Janet,


Tip

Project 18 tells you all about find with xargs, and Projects 15 and 17 show you how to use find.


The next example of recursion doesn't work as expected. We intended to say, "Search the current directory recursively for all *.txt files." What actually happens is that the shell expands *.txt to include all matching filenames (which does not include the directory archive); grep then searches each filename in the expansion, and if it's a directory, grep does so recursively. We can't specify to grep both a directory to search recursively and at the same time which files to consider.

$ grep -r "Janet" *.txt
hello.txt:Dear Janet,
lets-meet.txt:Dear Janet, sauciest of vixens
secret-liaison.txt:Dear Janet,


The solution is to use find and xargs.

$ find . -iname "*.txt" -print0 | xargs -0 grep "Janet"
./archive/old-letter.txt:Dear Janet,
./hello.txt:Dear Janet,
./lets-meet.txt:Dear Janet, sauciest of vixens
./secret-liaison.txt:Dear Janet,
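The difference between the two approaches can be reproduced in a scratch directory. In this sketch the directory layout and file contents are made up; the point is only that the shell-expanded *.txt never reaches into archive, while find does.

```shell
# Hypothetical layout: one letter at the top level, two files in archive/.
top=$(mktemp -d)
mkdir "$top/archive"
printf 'Dear Janet,\n' > "$top/hello.txt"
printf 'Dear Janet,\n' > "$top/archive/old-letter.txt"
printf 'Dear Janet,\n' > "$top/archive/notes.log"   # not a .txt file

cd "$top" || exit 1

# The shell expands *.txt before grep runs, so archive/ is never visited.
shallow=$(grep -r -l "Janet" *.txt)

# find selects files by name at every depth; xargs hands them to grep.
deep=$(find . -iname "*.txt" -print0 | xargs -0 grep -l "Janet" | sort)

printf 'shallow:\n%s\ndeep:\n%s\n' "$shallow" "$deep"
```

The -print0 and -0 pair passes filenames separated by null characters, so names containing spaces survive the trip through the pipe.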


Some grep Examples

Mac OS X has a handy dictionary (a list of words, but bereft of definitions) located at /usr/share/dict/web2. Let's use grep to count how many words contain the sequence xy. We use option -c to count the number of matches instead of displaying them.

$ grep -c "xy" /usr/share/dict/web2
579


How many words start with xy? This requires a regular expression that says "a line that starts with xy".

$ grep -c "^xy" /usr/share/dict/web2
75


Name two of them! (Xylophone is the easy one.)
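If your system has no dictionary file, the same counting can be tried on a tiny word list. This is a sketch with a made-up five-word list; the end-of-line anchor $ is added here purely for comparison with ^.

```shell
# A tiny stand-in for the dictionary (hypothetical word list).
words=$(mktemp)
printf 'xylophone\noxygen\nproxy\nxylem\napple\n' > "$words"

anywhere=$(grep -c "xy" "$words")    # xy anywhere in the line
at_start=$(grep -c "^xy" "$words")   # ^ anchors the match to the start
at_end=$(grep -c 'xy$' "$words")     # $ anchors the match to the end

echo "$anywhere $at_start $at_end"
rm -f "$words"
```

Four words contain xy, two start with it, and one (proxy) ends with it.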

The grep command is often combined with command ps to look for specific processes. In the next example, grep filters the output from ps to display only those lines containing safari. (The ps command does not require its options to be preceded by dash.)

Tip

It's necessary to pass the option ww to ps; otherwise, long lines are truncated, possibly cutting off the command name we are searching for. Check the man page for ps for an explanation of all the options.


$ ps axww | grep -i safari
27946  ??  S     31:08.79 /Applications/Safari.app/Contents/MacOS/Safari -psn_0_1739980801
16705 std  R+     0:00.00 grep -i safari


Learn More

Read Project 39 to learn more about Unix processes and command ps.


If you want to use the results of this command to extract the process ID of Safari, for example, the second line of output is unwelcome. This can be eliminated in either of two ways.

Use grep -v.

$ ps axww | grep -i safari | grep -v grep
27946  ??  S     31:09.33 /Applications/Safari.app/Contents/MacOS/Safari -psn_0_1739980801


Learn More

Project 18 has examples of grep used in conjunction with find and xargs.


Employ some clever regular-expression trickery.

$ ps axww | grep -i "safar[i]"
27946 ?? S 31:09.50 /Applications/Safari.app/Contents/MacOS/Safari -psn_0_1739980801


How does this safar[i] trick work? It's a regular expression that's equivalent to "safari", so it still matches "Safari". The grep command line, however, no longer matches, because it contains the text "safar[i]" and not "safari". Think about it.
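The trick can be checked without a running Safari by feeding grep two lines shaped like ps output. This is a sketch only: the PIDs and paths are copied or invented for illustration and are not read from a live process list.

```shell
# Two lines shaped like ps output (illustrative values): the real process
# and the grep command that is doing the searching.
listing='27946 ?? S 31:09.50 /Applications/Safari.app/Contents/MacOS/Safari
16705 std R+ 0:00.00 grep -i safar[i]'

# As a pattern, safar[i] means "safar" followed by one character from the
# set {i}. The text of the grep command line has "[" after "safar", so it
# does not match its own pattern.
matches=$(printf '%s\n' "$listing" | grep -i 'safar[i]')

echo "$matches"
```

Only the Safari line is displayed; the line describing grep itself is filtered out.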

Escape and Double Escape

Remember to enclose a regular expression in single quotes to avoid interpretation by the shell. The regular-expression sequence .* matches any string of characters, for example, but it must be escaped from the shell to stop the shell from treating the star as a globbing character and potentially expanding it. To match "line" and then any character sequence and then "1", we would type:

$ grep 'line.*1' *.txt


If we wish to search for the star character itself, star must also be escaped from regular-expression interpretation. To search for "line *1", we would type:

$ grep 'line \*1' *.txt


The escape character ensures that star is matched literally rather than being interpreted as a regular-expression operator. Refer to Project 77 if you are unfamiliar with regular expressions.

The next line is equivalent.

$ grep line\ \\\*1 *.txt


Remember fgrep? It searches for fixed patterns and does not activate regular expressions, so we can type simply

$ fgrep 'line *1' *.txt
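To convince yourself of what each form matches, try them on a file whose lines differ only around the star. This is a sketch with four invented sample lines; grep -F is used in place of fgrep, which the options section describes as equivalent.

```shell
# Four sample lines (hypothetical) that differ only in what follows "line".
f=$(mktemp)
printf 'line 1\nline 21\nline *1\nline  1\n' > "$f"

any=$(grep -c 'line.*1' "$f")        # .* is any run of characters
star=$(grep -c 'line \*1' "$f")      # \* is a literal star
fixed=$(grep -c -F 'line *1' "$f")   # -F: no escaping needed
bare=$(grep -c 'line *1' "$f")       # unescaped: zero or more spaces, then 1

echo "$any $star $fixed $bare"
rm -f "$f"
```

The counts are 4 1 1 2. The last one shows what goes wrong when the star is left unescaped: it matches "line 1" and "line  1" but not the line that actually contains a star.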


Zipped Files

Use a grep-based command to examine the contents of a gzip- or bzip2-compressed file directly by using these commands:

  • zgrep

  • bzegrep

  • bzfgrep

  • bzgrep

These bz variants correspond to the versions of grep discussed in the "grep Options" section above.

Count Words

The wc command counts the number of characters, words, and lines in a text file. It's often used to count the number of results returned by a command or pipeline. We can repeat the dictionary example from earlier by using wc.

$ grep "xy" /usr/share/dict/web2 | wc -l
     579
$ grep "^xy" /usr/share/dict/web2 | wc -l
      75


Option -l says to count lines only, and you can guess at options -c and -w.
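For those who would rather not guess, here is a sketch that runs all three options on a two-line file. The file contents are made up, and tr strips the padding that some versions of wc place before the number.

```shell
# A two-line sample file (hypothetical contents).
f=$(mktemp)
printf 'Dear Janet,\nPerhaps on Jan 31st?\n' > "$f"

lines=$(wc -l < "$f" | tr -d ' ')   # -l: number of lines
words=$(wc -w < "$f" | tr -d ' ')   # -w: number of words
bytes=$(wc -c < "$f" | tr -d ' ')   # -c: number of bytes

echo "$lines $words $bytes"
rm -f "$f"
```

Option -c counts bytes, which for plain ASCII text like this is the same as the number of characters, newlines included.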

Note

An awk script takes print $0 to mean "display the whole line" and print $n to mean "display field n."


Note

Other shells, such as Tcsh, use the syntax `command` instead of $(command).


Use awk to Isolate and Format Text

The awk command (named after its authors, Aho, Weinberger, and Kernighan) is a powerful pattern-processing language. It's explored in detail in Projects 60 and 62, but one (very simple) way it can be used is to isolate a particular portion of each line of text it receives as input.

More specifically, this use of awk involves printing a selected field from the input text, a field in this instance meaning a sequence of characters separated by white space. We can use awk to isolate Safari's process ID (PID) from the results of our earlier grep/ps search, for example. This example extends the earlier command with a pipeline to awk. An awk script, enclosed in single quotes, tells awk to print the value of the first field (field #1) of each input line. Because the first text string in a line of ps output is always a PID, this yields the PID of process Safari.

$ ps axww | grep -i "safar[i]" | awk '{print $1}'
27946
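Field selection is easy to try on canned text. In this sketch the two input lines only imitate ps output, and the second command path is invented; notice that leading spaces before a short PID do not create an empty first field.

```shell
# Two lines imitating ps output; the second path is hypothetical.
listing='27946 ?? S 31:09.50 /Applications/Safari.app/Contents/MacOS/Safari
  470 ?? S 0:15.71 /usr/sbin/example-daemon'

# Field 1 is the PID; field 5 is the command path.
pids=$(printf '%s\n' "$listing" | awk '{print $1}')
cmds=$(printf '%s\n' "$listing" | awk '{print $5}')

printf '%s\n' "$pids" "$cmds"
```

By default awk splits on runs of spaces and tabs, so the column alignment that ps applies makes no difference to the field numbers.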


The number 27946 is the PID of Safari, and this number can be given as an argument to the kill command to abort the running process. We'll enclose the pipeline sequence in $(), which tells Bash to execute it, write the result back to the command line, and then execute the new command line.

Before we do any actual killing, use echo to demonstrate that the expression enclosed by $() outputs the Safari PID.

$ echo $(ps axww | grep -i "safar[i]" | awk '{print $1}')
27946


Learn More

Consult Project 52 if you wish to learn more about Bash functions.


Now run kill.

$ kill $(ps axww | grep -i "safar[i]" | awk '{print $1}')


For completeness, let's create a shell function killer to kill a given process by name.

$ killer () { kill $(ps axww | grep -i "$1" |
    grep -v "grep -i $1" | awk '{print $1}'); }
$ killer safari


Tip

To learn about printf, type

$ man 3 printf


The man page documents the library call printf, which awk uses to implement its own printf.


The awk statement printf prints a formatted, or embellished, version of each input line. Here's a quick example of what can be done.

$ ls -l | awk '{printf("Date: %s %s, File %s\n",$7,$6,$9)}'
Date: , File
Date: 13 Sep, File csv
Date: 13 Sep, File double-space
Date: 30 Aug, File script


The first line (Date: , File) results from the first line written by ls -l, which reports a total rather than a file. This can easily be removed with grep.
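One way to drop that line is to filter it out with grep -v before awk sees it. The sketch below works on canned text imitating ls -l output (the owner, sizes, and times are made up) so that the result does not depend on the directory you happen to be in.

```shell
# Text imitating ls -l output; owner, sizes, and times are hypothetical.
listing='total 24
-rw-r--r--  1 janet  staff  120 Sep 13 10:02 csv
-rw-r--r--  1 janet  staff  310 Aug 30 18:45 script'

# grep -v removes the line that starts with "total"; awk formats the rest.
formatted=$(printf '%s\n' "$listing" | grep -v '^total' |
    awk '{printf("Date: %s %s, File %s\n", $7, $6, $9)}')

echo "$formatted"
```

With real output you would replace the printf with ls -l itself. Filenames containing spaces would still be cut short, because awk sees each word as a separate field.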

Use sed

The sed command is a stream editor and, like awk, processes its input lines based on matching patterns. It's covered in detail in Projects 59 and 61, and we'll use it here simply to search text files for lines that match a given pattern (Jan). Here are a couple of examples equivalent to the grep examples shown earlier in this project.

Option -n stops sed from echoing every input line, which it usually does. The construct /re/p searches for a regular expression (re) and displays the lines that contain it.

$ sed -n '/Jan/p' *.txt
Dear Janet,
Dear Janet, sauciest of vixens
Dear Jan,
Dear Janet,
Perhaps on Jan 31st?


Next, we count the number of words starting with xy.

$ sed -n '/^xy/p' /usr/share/dict/web2 | wc -l
      75


To filter the output from ps:

$ ps axww | sed -n "/Safar[i]/p"
  470  ??  S      0:15.71 /Applications/Safari.app/Contents/MacOS/Safari -psn_0_3407873


Ignoring case is less elegant. One has to convert all uppercase letters to lowercase (or vice versa) by using the sed function y and then match the pattern.

$ ps axww | sed -n "y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/;/safar[i]/p"
  470  ??  s      0:15.71 /applications/safari.app/contents/macos/safari -psn_0_3407873
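Both sed techniques can be checked against grep on a small file. This is a sketch with four invented sample lines; it shows that sed -n '/re/p' selects the same lines as grep, and that the y function changes the text that is finally displayed.

```shell
# Four sample lines (hypothetical contents).
f=$(mktemp)
printf 'Dear Janet,\nSee you soon.\nPerhaps on Jan 31st?\nJANUARY SALE\n' > "$f"

# /re/p with -n displays only the matching lines, just as grep does.
with_sed=$(sed -n '/Jan/p' "$f")
with_grep=$(grep 'Jan' "$f")

# y/.../.../ lowercases every line first; the match then ignores case,
# but the displayed lines are lowercase too.
folded=$(sed -n 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/;/jan/p' "$f")

printf '%s\n' "$with_sed" "$folded"
rm -f "$f"
```

The case-folded search finds a third line, JANUARY SALE, but displays it as january sale. That side effect is the reason grep -i is usually the better tool when the original text matters.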





Mac OS X UNIX 101 Byte-Sized Projects
ISBN: 0321374118
Year: 2003
Pages: 153
Authors: Adrian Mayo
