File and Text Processing


As you have probably noticed, the Unix command line is very file-oriented. Almost everything you do involves at least one file, and often several. Here are a few of the most commonly used command-line tools for processing files. They work equally well in pipelines, processing text that comes directly from other commands without being saved to a file first.

wc: Counting lines, words, and bytes

The wc command displays a count of lines, words, and bytes contained in its input. Input to wc can be one or more files specified as arguments; wc also takes input from stdin (see Chapter 2 to review stdin).
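Because wc reads stdin, it is handy at the end of a pipeline. As a quick sketch, this counts how many items ls lists in /usr/bin (the exact number will vary from system to system):

    ls /usr/bin | wc -l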

To count the number of lines, words, and bytes in a file:

  • wc filename

    For example,

    wc /etc/hostconfig

    counts the contents of the file

    /etc/hostconfig

    showing 32 lines, 45 words, and 540 bytes (Figure 4.5). (The file /etc/hostconfig is one of the many system-configuration files in the /etc directory.)

    Figure 4.5. Counting the number of lines, words, and bytes in a file with wc. Your output may have different numbers.

     localhost:~ vanilla$ wc /etc/hostconfig
           32      45     540 /etc/hostconfig
     localhost:~ vanilla$

Tips

  • The -l option displays only the number of lines, -w the number of words, and -c the number of bytes. The default behavior is the same as when using the -lwc options together. Figure 4.6 shows a comparison of output from each option.

  • If you give wc more than one file as an argument, it gives you a line for each and a summary line adding up the contents of all of them ( Figure 4.7 ).


Figure 4.6. Comparing the results of using different options with wc.

 localhost:~ vanilla$ wc /etc/hostconfig
       32      45     540 /etc/hostconfig
 localhost:~ vanilla$ wc -lwc /etc/hostconfig
       32      45     540 /etc/hostconfig
 localhost:~ vanilla$ wc -l /etc/hostconfig
       32 /etc/hostconfig
 localhost:~ vanilla$ wc -w /etc/hostconfig
       45 /etc/hostconfig
 localhost:~ vanilla$ wc -c /etc/hostconfig
      540 /etc/hostconfig
 localhost:~ vanilla$

Figure 4.7. Counting lines, words, and bytes in several files at once.

 localhost:~ vanilla$ wc /etc/*.conf
       20      90     753 /etc/6to4.conf
       22      47     576 /etc/gdb.conf
       57     361    2544 /etc/inetd.conf
        0       0       0 /etc/kern_loader.conf
       46     199    1160 /etc/named.conf
        1       6      44 /etc/ntp.conf
        2       4      44 /etc/resolv.conf
       21     144     983 /etc/rtadvd.conf
        0      12      52 /etc/slpsa.conf
       50     273    1602 /etc/smb.conf
       18      66     724 /etc/syslog.conf
       12      29     238 /etc/xinetd.conf
      249    1231    8720 total
 localhost:~ vanilla$

sort: Alphabetical or numerical sorting

The sort command takes its input either from files named in its arguments or from stdin, and produces sorted output on stdout. The default sorting is in alphabetical order. You can also sort numerically.

For the following tasks, create two plain-text files (you may use the nano editor as described in Chapter 2). The first file should be called "data" and should contain three lines:

 100 pears
 2 apples
 1 orange

The second file should be called "data2" and contain three lines:

 1 dog
 1 cat
 10 fish

To sort alphabetically:

  • sort data data2

    This produces the output shown in Figure 4.8. The results are correct, but something looks wrong. The problem is that 100 is alphabetically lower than 2. The next task shows you how to sort numerically.

    Figure 4.8. Sorting alphabetically. Note how the line starting with 100 appears before the line starting with 2.

     localhost:~ vanilla$ sort data data2
     1 cat
     1 dog
     1 orange
     10 fish
     100 pears
     2 apples
     localhost:~ vanilla$

To sort numerically:

  • sort -n data data2

    This produces the output shown in Figure 4.9 (n stands for numeric).

    Figure 4.9. Sorting numerically with sort.

     localhost:~ vanilla$ sort -n data data2
     1 cat
     1 dog
     1 orange
     2 apples
     10 fish
     100 pears
     localhost:~ vanilla$

Tips

  • Add the -r option to reverse the sort order (a sketch of the expected output appears after these tips):

    sort -nr data data2

  • You can, of course, use sort in a pipeline:

    ls -s /usr/bin | sort -n

  • To save the output, use output redirection:

    sort -n data > sorted

  • If you give the uniq command one argument, it will use that as a filename to take its input from. If you give it two arguments, uniq assumes that the second one is the name of an output file, and it will copy its output to that second file, overwriting its contents:

    uniq infile outfile
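Following the -r tip above, here is a sketch of the output you should expect from the reversed numeric sort, assuming the data and data2 files created earlier (the relative order of the three lines tied at 1 may differ on your system):

 localhost:~ vanilla$ sort -nr data data2
 100 pears
 10 fish
 2 apples
 1 orange
 1 dog
 1 cat
 localhost:~ vanilla$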


uniq: Only one of each line

The uniq command takes sorted text as input and produces output with duplicate lines removed. (uniq compares only adjacent lines, which is why its input should be sorted first.)

To display only the unique lines:

1.
Create a file called data containing the following lines:

 dog
 cat
 mongoose
 cat
 bird
 dog
 snake
 cat
 bird

2.
sort data | uniq

This produces the output shown in Figure 4.10 .

Figure 4.10. uniq deletes duplicates so that there is only one of each line.

 localhost:~ vanilla$ sort data | uniq
 bird
 cat
 dog
 mongoose
 snake
 localhost:~ vanilla$

3.
But what if you want to know how many of each line there were? Adding the -c (count) option to uniq:

sort data | uniq -c

produces the output shown in Figure 4.11 .

Figure 4.11. Getting a count of each entry, using uniq.

 localhost:~ vanilla$ sort data | uniq -c
    2 bird
    3 cat
    2 dog
    1 mongoose
    1 snake
 localhost:~ vanilla$

4.
But now the output is no longer sorted numerically. So:

sort data | uniq -c | sort -rn

Pipe the output through sort again, with the -rn options for a reverse numerical sort, and you get output as shown in Figure 4.12.

Figure 4.12. Sorting the counted entries using sort -rn.

 localhost:~ vanilla$ sort data | uniq -c | sort -rn
    3 cat
    2 dog
    2 bird
    1 snake
    1 mongoose
 localhost:~ vanilla$

Tips

  • If you want to have uniq act on more than one file, use sort to feed their combined, sorted contents to uniq:

    sort file1 file2 file3 | uniq
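As noted above, uniq compares only adjacent lines, so running it on unsorted input leaves non-adjacent duplicates in place. A quick sketch: running uniq directly on the unsorted data file created in this section prints all nine lines unchanged, because no two identical lines happen to sit next to each other:

 localhost:~ vanilla$ uniq data
 dog
 cat
 mongoose
 cat
 bird
 dog
 snake
 cat
 bird
 localhost:~ vanilla$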


cut

The cut command is used to extract parts from each line of a file or pipeline of data and send the result to stdout. (See Chapter 2 for more about standard output, or stdout.) For example, the log files created by Web servers record several fields of data, including the date/time of the request, what the user requested, and how many bytes were sent to the browser. Figure 4.13 shows a sample from /var/log/httpd/access_log (this is your Web-server access log if you have enabled Personal Web Sharing by going to the Apple menu and choosing System Preferences and then Sharing).

Figure 4.13. Example of contents of a Web-server log from /var/log/httpd/access_log.

 66.47.69.205 - - [14/Mar/2002:14:53:48 -0800] "GET /~matisse/images/macosxlogo.gif HTTP/1.1" 200 2829
 66.47.69.205 - - [14/Mar/2002:14:53:48 -0800] "GET /~matisse/images/apache_pb.gif HTTP/1.1" 200 2326
 66.47.69.205 - - [14/Mar/2002:14:53:48 -0800] "GET /~matisse/images/web_share.gif HTTP/1.1" 200 13370
 66.47.69.205 - - [14/Mar/2002:14:54:00 -0800] "GET /~matisse/cgi-bin/test HTTP/1.1" 200 25
 66.47.69.205 - - [14/Mar/2002:14:54:29 -0800] "GET /~matisse/cgi-bin/test HTTP/1.1" 200 25
 66.47.69.205 - - [14/Mar/2002:15:08:27 -0800] "GET /~matisse/upload.html HTTP/1.1" 200 397
 66.47.69.205 - - [14/Mar/2002:15:33:48 -0800] "GET /~matisse/images/web_share.gif HTTP/1.1" 200 13370
 66.47.69.205 - - [14/Mar/2002:15:33:50 -0800] "GET /~matisse/images/web_share.gif HTTP/1.1" 200 13370

Let's say you want to extract only the URLs that were requested, in order to determine the most popular pages. cut allows you to specify what separates (or delimits) the fields in each line. The default separator is a tab character (commonly used in tab-delimited files). In the example below, you will instead use the space character. If you look at each line of the data in Figure 4.13, you'll see that the field you want is field number 7; that is, if you break a line into pieces wherever a space occurs, the seventh piece has the URL in it (for example, in the first line, the seventh field contains /~matisse/images/macosxlogo.gif).

To print only one field from a file using spaces as a separator:

  • cut -d " " -f 7

    /var/log/httpd/access_log

    This produces output like that shown in Figure 4.14 .

    Figure 4.14. Using cut to print one field from a file, using the space character as the field separator.

     localhost:~ vanilla$ cut -d " " -f 7 /var/log/httpd/access_log
     /~matisse/images/macosxlogo.gif
     /~matisse/images/apache_pb.gif
     /~matisse/images/web_share.gif
     /~matisse/cgi-bin/test
     /~matisse/cgi-bin/test
     /~matisse/upload.html
     /~matisse/images/web_share.gif
     /~matisse/images/web_share.gif
     localhost:~ vanilla$

    Notice how we used quotes around a space character; otherwise, the shell would not pass the space character to cut as our choice for the -d option. (See "About Spaces in the Command Line" in Chapter 2.)

    Each line in the original file represents one request made to your Web server, so each URL is from one request. You might already have realized that we could get a quick count of the requests by using the sort and uniq commands covered earlier in this chapter:

     cut -d " " -f 7  /var/log/httpd/access_log  sort  uniq -c  sort -nr 

    gives us output like that in Figure 4.15.

    Figure 4.15. Using cut to feed a pipeline.

     localhost:~ vanilla$ cut -d " " -f 7 /var/log/httpd/access_log | sort | uniq -c | sort -nr
        3 /~matisse/images/web_share.gif
        2 /~matisse/cgi-bin/test
        1 /~matisse/upload.html
        1 /~matisse/images/macosxlogo.gif
        1 /~matisse/images/apache_pb.gif
     localhost:~ vanilla$

Tips

  • You can print more than one field:

     cut -d " " -f 1,7 /var/log/httpd/  access_log 

    Figure 4.16 shows the result.

  • You specify the field separator in cut with the -d option (for delimiter). This allows you to specify a different field separator than the default (a tab character). If you use a separator of ", then cut will break the line into fields wherever a " appears. For example, in

     cut -d\" -f 2 /var/log/httpd/  access_log 

    the \ is required before the " to remove its special meaning to the shell.

    Figure 4.17 shows the result. See how field 2 contains a chunk of information that is enclosed in quotes. Field 1 is everything before the first " in the line, field 2 is the text between the first and second ", and field 3 is everything after that second ". Try using different delimiters on different files and see what you get.

  • The cut command has several useful options, including the ability to extract only specific character positions (for example, only characters 42 through 54). This is very useful when dealing with fixed-length records, like those produced by older databases, and with the output of many Unix commands. For example, piping the output of ls -l through cut -c42-54 will display only the date/time portion of each line. See man cut for more information.


Figure 4.16. Printing two fields with cut.

 localhost:~ vanilla$ cut -d " " -f 1,7 /var/log/httpd/access_log
 66.47.69.205 /~matisse/images/macosxlogo.gif
 66.47.69.205 /~matisse/images/apache_pb.gif
 66.47.69.205 /~matisse/images/web_share.gif
 66.47.69.205 /~matisse/cgi-bin/test
 66.47.69.205 /~matisse/cgi-bin/test
 66.47.69.205 /~matisse/upload.html
 66.47.69.205 /~matisse/images/web_share.gif
 66.47.69.205 /~matisse/images/web_share.gif
 localhost:~ vanilla$

Figure 4.17. Using a different field separator with cut.

 localhost:~ vanilla$ cut -d\" -f 2 /var/log/httpd/access_log
 GET /~matisse/images/macosxlogo.gif HTTP/1.1
 GET /~matisse/images/apache_pb.gif HTTP/1.1
 GET /~matisse/images/web_share.gif HTTP/1.1
 GET /~matisse/cgi-bin/test HTTP/1.1
 GET /~matisse/cgi-bin/test HTTP/1.1
 GET /~matisse/upload.html HTTP/1.1
 GET /~matisse/images/web_share.gif HTTP/1.1
 GET /~matisse/images/web_share.gif HTTP/1.1
 localhost:~ vanilla$
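To follow the tip above about trying different delimiters on different files, here is one more sketch: /etc/passwd uses a colon between fields, so cutting field 1 with -d: lists the account names on your system (the exact list will vary):

    cut -d: -f 1 /etc/passwd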

awk

The awk program, a multifeatured text-processing tool, gets its name from the initials of its three inventors (Aho, Kernighan, and Weinberger; Kernighan is Brian Kernighan, coauthor of the classic book The C Programming Language).

The basic idea behind awk is that it looks at each line of an input file and performs some action on each line. awk has its own language for defining the actions, and the resulting scripts can be quite complex. See man awk.
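To give a small taste of that language, here is a sketch that adds up the first field of every input line, reporting the combined size (in kilobytes) of everything in /usr/bin; the total it prints will vary from system to system:

    ls -sk /usr/bin | awk '{ total += $1 } END { print total }'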

A common use of awk is to send to stdout only certain fields from a file or pipeline of data, just like the simpler cut command described above. In the case of awk, the default separator is whitespace. Whitespace is any blank space in a line, including any run of the space character and/or tabs. This is different from what we saw above with cut, where we set the delimiter to a single space character: by default, awk treats any number of spaces or tabs as a delimiter, instead of treating each single space as a delimiter. Also, awk ignores whitespace at the very beginning of a line, and this fact can make awk a better choice than cut in some cases.

In the first task below, we show how awk does the same job as cut in extracting one field from a file. In the task after that, we compare how awk and cut handle whitespace.

More About sed and awk

The standard reference for these two venerable Unix utilities is sed & awk, by Dale Dougherty and Arnold Robbins (O'Reilly; www.oreilly.com/catalog/sed2).


To use awk to print only one field from a file:

  • awk '{print $7}' /var/log/httpd/access_log

    This produces the same output that cut produced in Figure 4.14, in the previous section of this chapter; likewise,

     awk '{print $7}' /var/log/httpd/access_log | sort | uniq -c | sort -nr

    gives us the same output as shown for cut in Figure 4.15.

Tips

  • You can print more than one field:

     awk '{print $1,$7}' /var/log/httpd/access_log

    The output should be the same as in Figure 4.16.

  • You can vary the field separator in awk with the -F option (for field). This allows you to specify a different field separator than the default (whitespace). If you use a separator of ", then awk will break the line into fields wherever a " appears. For example, in

     awk -F\" '{print }'  /var/log/httpd/access_log 

    the \ is required before the " to remove its special meaning to the shell. The output should be the same as in Figure 4.17.


To compare how cut and awk process whitespace:

1.
ls -sk /usr/bin | less

This shows the sizes (in kilobytes) of all the files in /usr/bin; refer to Figure 4.18. Notice how the lines actually begin with two or more space characters.

Figure 4.18. Comparing how awk and cut deal with whitespace.

 localhost:~ vanilla$ ls -sk /usr/bin | less
 104 a2p
 156 acid
  16 aclocal
  16 aclocal-1.6
  60 addftinfo
 164 afmtodit
 . . . (output abbreviated for space) . . .
 localhost:~ vanilla$ ls -sk /usr/bin | awk '{print $1}'
 104
 156
 16
 16
 60
 164
 . . . (output abbreviated) . . .
 localhost:~ vanilla$ ls -sk /usr/bin | cut -d " " -f 1
 (many blank lines)
 localhost:~ vanilla$

The output of ls is piped through the less command. See "Viewing the Contents of Text Files" in Chapter 5, "Using Files and Directories." You can proceed to the next screenful of output by pressing the Spacebar, or return to the shell prompt by pressing q.

2.
ls -sk /usr/bin | awk '{print $1}'

This pipes the output of the ls command through awk, printing only the first field, which in this case consists of the sizes (in kilobytes) of the files in /usr/bin.

3.
ls -sk /usr/bin | cut -d " " -f 1

This pipes the output from ls through cut, with the delimiter set to a single space character. The result is that we get no numbers at all in the output, because all the lines from ls begin with a space, and so cut considers the first field to be empty: there is nothing before the first occurrence of the delimiter on each line. Figure 4.18 shows the results of all three command lines above.
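One common workaround, if you want to stick with cut, is to squeeze each run of spaces down to a single space first, using the standard tr utility (not otherwise covered in this section). In this sketch the number then lands in field 2, because each line still begins with one space:

ls -sk /usr/bin | tr -s " " | cut -d " " -f 2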

sed

sed is a stream editor; that is, a tool for editing streams of data, whether in a file or in the output of some other command.

Create a file called sedtest, using the text in Figure 4.19. You can use sed to make it rhyme by changing all the occurrences of love to amore.

Figure 4.19. A sample data file. But it doesn't rhyme very well.

 There's just no better foray then the one they call love.
 When the moon hits your eye, like a big pizza pie, that's love.
 When an eel grabs your hand, and won't let go, that's love.

To convert all occurrences of love to amore:

  • sed "/love/s//amore/" sedtest

    produces output as shown in Figure 4.20 .

    Figure 4.20. Using sed to make the data rhyme. (Given the prevalence of puns in Unix, it seems only natural that we should add a few.)

     localhost:~ vanilla$ sed "/love/s//amore/" sedtest
     There's just no better foray then the one they call amore.
     When the moon hits your eye, like a big pizza pie, that's amore.
     When an eel grabs your hand, and won't let go, that's amore.
     localhost:~ vanilla$

Tip

  • If you are using sed to change a file, redirect its output to a new file, check the new file to make sure it's correct, and then replace the old file with the new one.
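    For example, a sketch of that workflow using the sedtest file from this section (sedtest.new is just an example name):

    sed "/love/s//amore/" sedtest > sedtest.new

    Then examine sedtest.new (for example, with less sedtest.new), and if it looks right:

    mv sedtest.new sedtest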


textutil

The textutil command appears only in Mac OS X/Darwin and allows you to easily convert files from one format to another. Introduced in Mac OS X 10.4, textutil can convert to and from plain text, RTF, RTFD, HTML, Microsoft Word, Microsoft Word XML (wordml), and webarchive.

One especially cool textutil feature is that the HTML files it creates from Word documents are remarkably "clean" and meet strict HTML 4.01 specifications, way better than the HTML created by Word's Save As feature. Another great thing about textutil is that you can create Microsoft Word documents from other formats without having Word installed on your machine. (Note that the Mac OS X application TextEdit can read Word files.) However, textutil is not perfect. If you convert from format X to Y to Z and back to X, the final result will be pretty close to the original, but not exactly the same.

To convert a file format using textutil:

  • textutil -convert format oldfile

    Where format is one of txt, html, rtf, rtfd, doc, wordml, or webarchive. The command produces a new file whose name is based on the new format: for example, file.html if converted to HTML. The new file is placed in the same directory as the old file (which may be overridden with the -output option). The old file must be in one of the above formats. Here's an example of converting a file to Microsoft Word format:

    textutil -convert doc resume.html

    That creates a file called resume.doc in your current directory (notice how the .doc extension was added).
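    If you want the converted file to land somewhere else, or under a different name, the -output option mentioned above takes an explicit path; in this sketch the filenames are just examples:

    textutil -convert html -output ~/Sites/resume.html resume.doc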

Tips

  • As always, read the man page for the full list of available options, such as combining multiple files into a single converted file by using the -cat (concatenate) option in place of the -convert option. You can also specify the output filename or extension with other options. For example,

     textutil -cat html -output all.html *.doc

    would create a file called all.html containing the content from the .doc files in the current directory.

  • Some options work only if the command is run from a shell that has access to the Aqua user interface (technically, to a program called Window Server). Basically, that means the command must be run from a shell inside the Terminal application, rather than a shell that you logged in to using ssh or telnet (as described in Chapter 10, "Connecting over the Internet").

  • The Mac OS X Spotlight system and the mdfind command (described later in this chapter) search metadata that can be set or viewed with textutil .

  • This section deals with textutil(1), the textutil command covered in section 1 of the Unix manual, not textutil(n), the one covered in section n. See Chapter 3 for more on textutil(n).


Perl

Perl is a programming language equally suited to small tasks and large, complex programs. (See www.perl.org.)

In this chapter, we are showing Perl in its role as a text-processing utility: a general-purpose tool in the same category as awk, cut, and sed.

Following is an example of creating a very small utility program, written in the Perl language, that provides a feature not available from any existing command.

Earlier in this chapter, you learned how to sort data alphabetically or numerically, but what if you simply want to look at a file backward, seeing the last lines in the file first? This might come up if you are looking at a file consisting of date/time entries and you want to see the latest ones first.

Although Unix provides numerous tools for situations like this, there are still times when you want to do something beyond what the current tools provide. Users of other operating systems look to a catalog of software to see if they can buy a tool that will serve the purpose, or, failing that, wait for someone else to create a new tool. Unix users build their own. And often Unix users choose Perl as the programming language in which to build their new tools.

Perl excels at text processing and allows you to do some things very easily that would be difficult or impossible using other tools.

Perl is a complete programming language used for everything from simple utility scripts to very large and complex programs containing tens of thousands of lines of code. Teaching Perl is beyond the scope of this book, but we do want to give you some sense of how useful it is and pique your interest in learning more (see the sidebar "Resources for Learning Perl"). And we cover the basics of Unix shell scripts in Chapter 9, "Creating and Using Scripts."

Our example here is a short script that reverses the order of its input: the last line of input comes out first, and the first line comes out last, regardless of alphabetical or numerical sort order.

The steps are similar to the ones in Chapter 2, in the section "Creating a Simple Unix Shell Script":

 #!/usr/bin/perl
 # script to reverse input
 @input = <>;
 while ( @input ) {
     print pop(@input);
 }

Resources for Learning Perl

  • If you have never heard of Perl, a good place to start is the Perl Directory About Perl (www.perl.org/press/fast_facts.html).

  • If you are looking for online documentation, mailing lists, and other support resources, then go to the Perl Directory Online Documentation (www.perl.org/support/online_support.html).

  • You can start right on your Mac with man perl .

  • A primary resource for beginning Perl programmers is Learning Perl, 3rd Edition , by Randal L. Schwartz and Tom Phoenix (O'Reilly; www.oreilly.com/catalog/lperl3).

  • The definitive programmer's guide is Programming Perl, 3rd Edition , by Larry Wall, Tom Christiansen, and Jon Orwant (O'Reilly; www.oreilly.com/catalog/pperl3).

  • If you are looking to start learning how to use Perl for CGI programming, then check out Perl and CGI for the World Wide Web: Visual QuickStart Guide, 2nd Edition , by Elizabeth Castro (Peachpit Press; www.peachpit.com/title/0201735687).


To create a script that will reverse its input:

1.
nano ~/bin/reverse

This opens the nano editor and creates the script (called reverse ) in the bin directory of your home directory, as we did in Chapter 2 with the system-status script. (Make sure you have added ~/bin to your PATH as described in Chapter 2 in "Creating a Simple Unix Shell Script.")

The following lines are entered into the nano editor:

2.
#!/usr/bin/perl

This must be the first line. It tells Unix to use Perl when running this script.

3.
# script to reverse input, by JME 2005-04-21

This line is just a comment, to remind us what the script does. Include your initials or e-mail address and the date. Comments are good. We love comments.

4.
@input = <>;

This line causes all input to go into a variable called @input. Each line of input is stored as a separate item in the variable. (Think of this variable as a stack of plates: each time we add an item, we are adding a plate to the stack. In Perl, the @ indicates a list of things, or "stack of plates," which is also called an array.)

5.
while ( @input ) {

This is the start of a loop that will continue as long as there is anything left in the @input variable.

6.
print pop( @input );

This line removes the most recently added item ( pop ) from the @input array and sends it to stdout (the print function). (The pop function "pops" the last "plate" off the "stack.")

7.
}

This ends the loop. While the script is running, Perl checks here to see if there is anything still in @input , and if there is, it executes the print pop(@input); line again; otherwise, Perl proceeds to the next line.

There isn't a next line in this case, so the script stops running when @input is empty.

8.
You can stop entering text, and exit from the nano editor, by pressing Control-X.

9.
nano will ask you if you want to save the changes; you say yes by pressing Y.

10.
nano asks you to confirm the name you are using to save the file. You confirm by pressing Return.

You'll be back at the shell prompt.

11.
chmod 755 ~/bin/reverse

That makes the script executable so that you can run it later (remember that chmod means change mode; the 755 refers to the mode that sets permissions; see Chapter 8 for more details).

You're done.

Congratulations, you've created a new Unix command by writing a Perl script.

To display input backward using reverse:

  • reverse filename

    This sends the contents of the file filename to stdout, last line first. Figure 4.21 shows a comparison between using cat and reverse on the same file.

    Figure 4.21. Comparing the output of cat and reverse.

     localhost:~ vanilla$ cat poetry.txt
     And the coming wind did roar more loud,
     And the sails did sigh like sedge;
     And the rain poured down from one black cloud;
     The Moon was at its edge.
     localhost:~ vanilla$ reverse poetry.txt
     The Moon was at its edge.
     And the rain poured down from one black cloud;
     And the sails did sigh like sedge;
     And the coming wind did roar more loud,
     localhost:~ vanilla$

    You can use multiple files; for example,

    reverse file1 file2 file3

    You can use reverse in a pipeline:

    ls -l | reverse
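    Incidentally, once you are comfortable with Perl, you don't even need a script file for this particular job. A one-liner sketch using Perl's built-in reverse function on the lines read from its input (poetry.txt is the sample file from Figure 4.21):

    perl -e 'print reverse <>' poetry.txt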


