The Basics of Searching Inside Text Files for Patterns


In the previous section, you learned that grep works by looking for the existence of a pattern in a group of files. Your first use of grep was extremely basic, but now you need to get a bit more complex, and to do that you need to understand the patterns for which grep searches. Those patterns are built using one of the most powerful tools in the Linux toolbox: regular expressions, or regex. To take full advantage of grep, you really need to grok regex; however, regex is a book all in itself, so we're only going to cover the basics here.

Tip

Want to learn more about regular expressions? You can search the Internet, and you'll find quite a bit of great stuff there, but Sams Teach Yourself Regular Expressions in 10 Minutes (by Ben Forta; ISBN: 0672325667) is a great book that'll really help you as you explore and learn regex.


One thing that confuses new users when they start playing with grep is that the command has several versions, as shown in Table 9.1.

Table 9.1. Different Versions of grep

Interpret Pattern As                                 grep Command Option              Separate Command
Basic regular expression                             grep -G (or --basic-regexp)      grep
Extended regular expression                          grep -E (or --extended-regexp)   egrep
List of fixed strings, any of which can be matched   grep -F (or --fixed-strings)     fgrep
Perl regular expression                              grep -P (or --perl-regexp)       Not applicable


To summarize this table, grep all by itself works with basic regex. If you use the -E (or --extended-regexp) option or the egrep command, you can use extended regex. Much of the time, this is probably what you'll want to do, unless you're performing a very simple search. Two more specialized choices are grep with the -F (or --fixed-strings) option or the fgrep command, which treats your search terms as literal strings rather than regex and matches any one of them, and grep with the -P (or --perl-regexp) option, which allows Perl programming mavens to use that language's sometimes unique approach to regex.
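As a rough illustration of how the first three rows of Table 9.1 differ, the sketch below runs a similar search in each mode. The file words.txt and its contents are made up for this example, and the variable names are only there to hold the results.

```shell
# Hypothetical sample data, created in a throwaway directory.
dir=$(mktemp -d)
printf '%s\n' 'cat' 'cot' 'a|b' 'dog' > "$dir/words.txt"

# Basic regex (plain grep): an unescaped | is an ordinary character,
# so this matches only the line that literally contains "a|b".
basic=$(grep 'a|b' "$dir/words.txt")

# Extended regex (-E): | means "or", so this matches any line
# containing "cat" or "dog".
extended=$(grep -E 'cat|dog' "$dir/words.txt")

# Fixed strings (-F): no character is special, and each
# newline-separated string is searched for literally.
fixed=$(grep -F 'cot
a|b' "$dir/words.txt")

printf 'basic:\n%s\n' "$basic"
printf 'extended:\n%s\n' "$extended"
printf 'fixed:\n%s\n' "$fixed"

rm -r "$dir"
```

The same pattern text can therefore mean different things depending on the option you pick, which is why it pays to decide on a mode before writing the pattern.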

Note

In this book, unless otherwise stated, we're using just plain grep for basic regex.


A few possible points of confusion need to be covered before you continue. If you're unclear about any of these, use the listed resources as a jumping-off point to learn more.

Wildcards are not equivalent to regex. Yes, both wildcards and regular expressions use the * character, for instance, but the character means completely different things in each. As wildcards, ? and * stand in for characters: ? for exactly one, * for any number. In regex, the same characters indicate how many times the preceding item is to be matched. For instance, with wildcards, the ? in c?t replaces one and only one character, matching cat, cot, and cut, but not ct. With regex, the ? in c[a-z]?t indicates that a lowercase letter from a through z is to be matched either zero or one time, thereby corresponding to cat, cot, cut, and also ct. (Note that ? works as a quantifier only in extended regex, so this pattern needs grep -E or egrep; in basic regex, an unescaped ? is an ordinary character.)
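The difference can be seen side by side in a small sketch. The empty files cat, cot, cut, and ct and the file words.txt are invented for this example; the shell expands the wildcard against file names, while grep -E applies the regex to lines of text.

```shell
# Hypothetical files in a throwaway directory.
dir=$(mktemp -d)
touch "$dir/cat" "$dir/cot" "$dir/cut" "$dir/ct"
printf '%s\n' 'cat' 'cot' 'cut' 'ct' > "$dir/words.txt"

# Wildcard: the shell expands c?t to file names with exactly one
# character between c and t, so ct is left out.
globbed=$(cd "$dir" && echo c?t)

# Regex: with grep -E, [a-z]? means "zero or one lowercase letter",
# so the line ct matches as well.
matched=$(grep -E 'c[a-z]?t' "$dir/words.txt")

printf 'wildcard: %s\n' "$globbed"
printf 'regex:\n%s\n' "$matched"

rm -r "$dir"
```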

Tip

To learn more about differences between wildcards and regular expressions, see "What Is a Regular Expression" (http://docs.kde.org/stable/en/kdeutils/KRegExpEditor/whatIsARegExp.html), "Regular Expressions Explained" (www.castaglia.org/proftpd/doc/contrib/regexp.html), and "Wildcards Gone Wild" (www.linux-mag.com/2003-12/power_01.html).


Another potentially confusing thing about grep is that you need to be aware of special characters in your grep regex. For instance, in regular expressions, the string [a-e] indicates a regex range, and means any one character matching a, b, c, d, or e. When using [ or ] with grep, you need to make it clear to your shell whether the [ and ] are there to delimit a regex range or are part of the words for which you're searching. Special characters of which you need to keep aware include the following:

. ? [ ] ^ $ | \
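One way to see why these characters matter is to search for text that contains them. This is only a sketch: the file notes.txt and its lines are made up, and it shows two common ways of getting a literal match, a backslash in front of the special character or the -F option from Table 9.1.

```shell
# Hypothetical sample data in a throwaway directory.
dir=$(mktemp -d)
printf '%s\n' 'pi is 3.14' 'no 3914' 'use [a-e] here' 'bread' 'xyz' > "$dir/notes.txt"

# Unescaped, . matches any single character, so 3914 matches too.
dot_any=$(grep '3.14' "$dir/notes.txt")

# With a backslash, \. matches only a literal period.
dot_literal=$(grep '3\.14' "$dir/notes.txt")

# Unescaped, [a-e] is a range: any line containing a, b, c, d, or e.
as_range=$(grep '[a-e]' "$dir/notes.txt")

# With -F, the brackets are ordinary characters and the whole
# string [a-e] must appear in the line.
as_literal=$(grep -F '[a-e]' "$dir/notes.txt")

printf '%s\n' "$dot_any" "$dot_literal" "$as_range" "$as_literal"

rm -r "$dir"
```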

Finally, there is a big difference between the use of single quotes and double quotes around a search pattern. Single quotes (' and ') tell the shell to pass everything between them to grep exactly as written, while double quotes (" and ") still let the shell interpret some characters, most notably $ for shell variables. For instance, using grep in the following way to look for all usages of the phrase "hey you!" in a friend's poetry wouldn't work:

$ grep hey you! *
grep: you!: No such file or directory
txt/pvzm/8 hours a day.txt:hey you! let's run!
txt/pvzm/friends & family.txt:in patience they wait
txt/pvzm/speed of morning.txt:they say the force


Because you simply wrote out "hey you!" with nothing around it, grep was confused. It first looked for the search term hey in a file called "you!" but it was unsuccessful, as that isn't the actual name of a file. Then it searched for hey in every file contained in the current working directory, as indicated by the * wildcard, with three good results. It's true that the first of those three contained the phrase you were searching for, so in that sense your search worked, but not really. This search was crude and does not always deliver the results you'd like. Let's try again.

This time you'll use double quotes around your search term. That should fix the problem you had when you didn't use anything at all.

$ grep "hey you!" *
bash: !" *: event not found


Even worse! The double quotes cause a different problem and give no results at all. What happened? In bash, the ! triggers history expansion, which references your command history. Normally you'd use the ! by following it with the history number of an earlier command you ran, like !264, or with the first few characters of that command.

Here, though, bash sees the !, looks for a history reference after it, and then complains that it can't find an earlier command beginning with " * (a double quote, a space, and an asterisk), which would be a very weird command indeed.

It turns out that double quotes tell the shell to keep interpreting certain special characters inside your search term, such as $ for shell variables and, in bash, ! for history expansion, which is in fact not what you wanted at all. So double quotes don't work. Let's try single quotes.

$ grep 'hey you!' *
txt/pvzm/8 hours a day.txt:hey you! let's run!


Much better results! The single quotes told the shell that your search term didn't contain any shell variables or history references and was just a string of characters to hand to grep unchanged. Lo and behold, there was a single result, the exact one you wanted.

The lesson? Know when to use single quotes, when to use double quotes, and when to use nothing. If you're searching for an exact match, use single quotes, but if you want to incorporate shell variables into your search term (which will be rare indeed), use double quotes. If you're searching for a single word that contains just numbers and letters, though, it's safe to leave off all quotes entirely. If you want to be safe, go ahead and use single quotes, even around a single word; it can't hurt.
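The quoting rules can be sketched in a short script. Everything here is hypothetical: the file poem.txt, its three lines, and the variable named term exist only for the example. The -F option is used in the first search so that the $ is certain to be taken literally by grep as well as by the shell.

```shell
# Hypothetical sample data in a throwaway directory.
dir=$(mktemp -d)
printf '%s\n' 'hey you! let us run!' 'they say the force' 'cost: $term' > "$dir/poem.txt"
term='hey you'

# Single quotes: the shell leaves $term alone, so grep searches
# for the five literal characters $term.
literal=$(grep -F '$term' "$dir/poem.txt")

# Double quotes: the shell replaces $term with its value first,
# so grep searches for "hey you".
expanded=$(grep "$term" "$dir/poem.txt")

# Single quotes also keep the ! away from the shell.
exact=$(grep 'hey you!' "$dir/poem.txt")

printf 'single quotes: %s\n' "$literal"
printf 'double quotes: %s\n' "$expanded"
printf 'exact phrase:  %s\n' "$exact"

rm -r "$dir"
```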



Linux Phrasebook
ISBN: 0672328380
Year: 2007
Pages: 288
