As you have probably noticed, the Unix command line is very file oriented. Almost everything you do involves at least one file, and often several. Here are a few of the most commonly used command-line tools for processing files. They function equally well in pipelines, processing text that comes directly from other commands without being saved to a file first.

wc: Counting lines, words, and bytes

The wc command displays a count of the lines, words, and bytes contained in its input. Input to wc can be one or more files specified as arguments; wc also takes input from stdin (see Chapter 2 to review stdin).

To count the number of lines, words, and bytes in a file:

Tips

-
 The  -l  option displays only the number of lines,  -w  the number of words, and  -c  the number of bytes. The default behavior is the same as when using the  -lwc  options together.  Figure 4.6  shows a comparison of output from each option.     -  
If you give wc more than one file as an argument, it gives you a line for each and a summary line adding up the counts for all of them (Figure 4.7).

Figure 4.6. Comparing the results of using different options with wc.

localhost:~ vanilla$ wc /etc/hostconfig
      32      45     540 /etc/hostconfig
localhost:~ vanilla$ wc -lwc /etc/hostconfig
      32      45     540 /etc/hostconfig
localhost:~ vanilla$ wc -l /etc/hostconfig
      32 /etc/hostconfig
localhost:~ vanilla$ wc -w /etc/hostconfig
      45 /etc/hostconfig
localhost:~ vanilla$ wc -c /etc/hostconfig
     540 /etc/hostconfig
localhost:~ vanilla$

Figure 4.7. Counting lines, words, and bytes in several files at once.

localhost:~ vanilla$ wc /etc/*.conf
      20      90     753 /etc/6to4.conf
      22      47     576 /etc/gdb.conf
      57     361    2544 /etc/inetd.conf
       0       0       0 /etc/kern_loader.conf
      46     199    1160 /etc/named.conf
       1       6      44 /etc/ntp.conf
       2       4      44 /etc/resolv.conf
      21     144     983 /etc/rtadvd.conf
       0      12      52 /etc/slpsa.conf
      50     273    1602 /etc/smb.conf
      18      66     724 /etc/syslog.conf
      12      29     238 /etc/xinetd.conf
     249    1231    8720 total
localhost:~ vanilla$

sort: Alphabetical or numerical sorting

The sort command takes its input either from files named in its arguments or from stdin, and produces sorted output on stdout. The default sorting is in alphabetical order. You can also sort numerically.

For the following tasks, create two plain-text files (you may use the nano editor as described in Chapter 2). The first file should be called "data" and should contain three lines:

100 pears
2 apples
1 orange

The second file should be called "data2" and contain three lines:

1 dog
1 cat
10 fish

To sort alphabetically:

To sort numerically:

Tips

-
 Add the  -r  option to  reverse  the sort order:     sort -nr data data2      -  
You can, of course, use sort in a pipeline:

ls -s /usr/bin | sort -n

-
 To save the output, use output redirection:     sort -n data > sorted      -  
If you give the uniq command one argument, it will use that as a filename to take its input from. If you give it two arguments, uniq assumes that the second one is the name of an output file, and it will copy its output to that file, overwriting its contents:

uniq infile outfile

uniq: Only one of each line

The uniq command takes sorted text as input and produces output with duplicate lines removed.

To display only the unique lines:

1. Create a file called data containing the following lines:

dog
cat
mongoose
cat
bird
dog
snake
cat
bird

2. sort data | uniq

This produces the output shown in Figure 4.10.

Figure 4.10. uniq deletes duplicates so that there is only one of each line.

localhost:~ vanilla$ sort data | uniq
bird
cat
dog
mongoose
snake
localhost:~ vanilla$

3. But what if you want to know how many of each line there was? Adding the -c (count) option to uniq:

sort data | uniq -c

produces the output shown in Figure 4.11.

Figure 4.11. Getting a count of each entry, using uniq.

localhost:~ vanilla$ sort data | uniq -c
   2 bird
   3 cat
   2 dog
   1 mongoose
   1 snake
localhost:~ vanilla$

4. But now the output is no longer sorted numerically. So:

sort data | uniq -c | sort -rn

Pipe the output through sort again, with the -rn option for reverse numerical sorting, and you get output as shown in Figure 4.12.

Figure 4.12. Sorting the counted entries using sort -rn.

localhost:~ vanilla$ sort data | uniq -c | sort -rn
   3 cat
   2 dog
   2 bird
   1 snake
   1 mongoose
localhost:~ vanilla$

Tips

-
If you want to have uniq act on more than one file, use sort to combine them first:

sort file1 file2 file3 | uniq

cut

The cut command is used to extract parts from each line of a file or pipeline of data and send the result to stdout. (See Chapter 2 for more about standard output, or stdout.) For example, the log files created by Web servers record several fields of data, including the date/time of the request, what the user requested, and how many bytes were sent to the browser. Figure 4.13 shows a sample from /var/log/httpd/access_log (this is your Web-server access log if you have enabled Personal Web Sharing by going to the Apple menu and choosing System Preferences and then Sharing).

Figure 4.13. Example of contents of a Web-server log from /var/log/httpd/access_log.

66.47.69.205 - - [14/Mar/2002:14:53:48 -0800] "GET /~matisse/images/macosxlogo.gif HTTP/1.1" 200 2829
66.47.69.205 - - [14/Mar/2002:14:53:48 -0800] "GET /~matisse/images/apache_pb.gif HTTP/1.1" 200 2326
66.47.69.205 - - [14/Mar/2002:14:53:48 -0800] "GET /~matisse/images/web_share.gif HTTP/1.1" 200 13370
66.47.69.205 - - [14/Mar/2002:14:54:00 -0800] "GET /~matisse/cgi-bin/test HTTP/1.1" 200 25
66.47.69.205 - - [14/Mar/2002:14:54:29 -0800] "GET /~matisse/cgi-bin/test HTTP/1.1" 200 25
66.47.69.205 - - [14/Mar/2002:15:08:27 -0800] "GET /~matisse/upload.html HTTP/1.1" 200 397
66.47.69.205 - - [14/Mar/2002:15:33:48 -0800] "GET /~matisse/images/web_share.gif HTTP/1.1" 200 13370
66.47.69.205 - - [14/Mar/2002:15:33:50 -0800] "GET /~matisse/images/web_share.gif HTTP/1.1" 200 13370

Let's say you want to extract only the URLs that were requested, in order to determine the most popular pages. cut allows you to specify what separates (or delimits) the fields in each line. The default separator is a tab character (commonly used in tab-delimited files). In the example below, you will instead use the space character.
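Before moving to the space-separated log example, a tiny tab-delimited file shows cut's default behavior. This sketch is not from the book; the file name and contents are invented for illustration:

```shell
# Hypothetical tab-delimited file: name<TAB>age on each line.
printf 'alice\t30\nbob\t25\n' > /tmp/ages.tsv

# With no -d option, cut splits fields on tab characters,
# so -f 2 prints the second field of each line:
cut -f 2 /tmp/ages.tsv
# prints:
# 30
# 25
```

Because tab is already the default, no quoting gymnastics are needed here; the -d option only becomes necessary when the data uses some other separator, as in the log-file task that follows.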
If you look at each line of the data in Figure 4.13, you'll see that the field you want is field number 7; that is, if you break a line into pieces wherever a space occurs, the seventh piece has the URL in it (for example, in the first line, the seventh field contains /~matisse/images/macosxlogo.gif).

To print only one field from a file using spaces as a separator:

- cut -d " " -f 7
/var/log/httpd/access_log

This produces output like that shown in Figure 4.14.

Figure 4.14. Using cut to print one field from a file, using the space character as the field separator.

localhost:~ vanilla$ cut -d " " -f 7 /var/log/httpd/access_log
/~matisse/images/macosxlogo.gif
/~matisse/images/apache_pb.gif
/~matisse/images/web_share.gif
/~matisse/cgi-bin/test
/~matisse/cgi-bin/test
/~matisse/upload.html
/~matisse/images/web_share.gif
/~matisse/images/web_share.gif
localhost:~ vanilla$

Notice how we used quotes around the space character; otherwise, the shell would not pass the space character to cut as our choice for the -d option. (See "About Spaces in the Command Line" in Chapter 2.)

Each line in the original file represents one request made to your Web server, so each URL is from one request. You might already have realized that we could get a quick count of the requests by using the sort and uniq commands covered earlier in this chapter:

cut -d " " -f 7 /var/log/httpd/access_log | sort | uniq -c | sort -nr

gives us output like that in Figure 4.15.

Figure 4.15. Using cut to feed a pipeline.

localhost:~ vanilla$ cut -d " " -f 7 /var/log/httpd/access_log | sort | uniq -c | sort -nr
   3 /~matisse/images/web_share.gif
   2 /~matisse/cgi-bin/test
   1 /~matisse/upload.html
   1 /~matisse/images/macosxlogo.gif
   1 /~matisse/images/apache_pb.gif
localhost:~ vanilla$

Tips

-
You can print more than one field:

cut -d " " -f 1,7 /var/log/httpd/access_log

Figure 4.16 shows the result.

-
You specify the field separator in cut with the -d option (for delimiter). This allows you to specify a different field separator than the default (a tab character). If you use a separator of ", then cut will break the line into fields wherever a " appears. For example, in

cut -d\" -f 2 /var/log/httpd/access_log

the \ is required before the " to remove its special meaning to the shell. Figure 4.17 shows the result. See how field 2 contains a chunk of information that is enclosed in quotes. Field 1 is everything before the first " in the line, field 2 is the text between the first and second ", and field 3 is everything after that second ". Try using different delimiters on different files and see what you get.

-
The cut command has several useful options, including the ability to extract only specific character positions, for example only characters 42-54. This is very useful when dealing with fixed-length records, like those produced by older databases, and with the output of many Unix commands. For example, piping the output of ls -l through cut -c42-54 will display only the date/time portion of each line. See man cut for more information.

Figure 4.16. Printing two fields with cut.

localhost:~ vanilla$ cut -d " " -f 1,7 /var/log/httpd/access_log
66.47.69.205 /~matisse/images/macosxlogo.gif
66.47.69.205 /~matisse/images/apache_pb.gif
66.47.69.205 /~matisse/images/web_share.gif
66.47.69.205 /~matisse/cgi-bin/test
66.47.69.205 /~matisse/cgi-bin/test
66.47.69.205 /~matisse/upload.html
66.47.69.205 /~matisse/images/web_share.gif
66.47.69.205 /~matisse/images/web_share.gif
localhost:~ vanilla$

Figure 4.17. Using a different field separator with cut.

localhost:~ vanilla$ cut -d\" -f 2 /var/log/httpd/access_log
GET /~matisse/images/macosxlogo.gif HTTP/1.1
GET /~matisse/images/apache_pb.gif HTTP/1.1
GET /~matisse/images/web_share.gif HTTP/1.1
GET /~matisse/cgi-bin/test HTTP/1.1
GET /~matisse/cgi-bin/test HTTP/1.1
GET /~matisse/upload.html HTTP/1.1
GET /~matisse/images/web_share.gif HTTP/1.1
GET /~matisse/images/web_share.gif HTTP/1.1
localhost:~ vanilla$

awk

The awk program, a multifeatured text-processing tool, gets its name from the initials of its three inventors (Aho, Kernighan, and Weinberger; Kernighan is Brian Kernighan, coauthor of the book The C Programming Language).

The basic idea behind awk is that it looks at each line of an input file and performs some action on each line. awk has its own language for defining the actions, and the resulting scripts can be quite complex. See man awk.
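To make the per-line model concrete, here is a minimal sketch (the input text is invented for illustration). awk runs the action in braces once for every input line; the built-in variable NF holds the number of fields on the current line, and $1 is the first field:

```shell
# For each line, print the field count and the first field.
printf 'one two\nthree four five\n' | awk '{ print NF, $1 }'
# prints:
# 2 one
# 3 three
```

Full awk programs can add patterns in front of the braces and many more built-in variables; this sketch shows only the one-action-per-line core that the rest of this section relies on.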
A common use of awk is to send to stdout only certain fields from a file or pipeline of data, just like the simpler cut command described above. In the case of awk, the default separator is whitespace. Whitespace is any blank space in a line, including any run of the space character and/or tabs. This is different from what we saw above with cut, where we set the delimiter to a single space character: by default, awk treats any number of spaces or tabs as a single delimiter, instead of treating each space as a delimiter. Also, awk ignores whitespace at the very beginning of a line, and this fact can make awk a better choice than cut in some cases.

In the next task, we first show how awk does the same job as cut in extracting one field from a file. In the task after that, we show you a comparison between how awk and cut handle whitespace.

More About sed and awk

The standard reference for these two venerable Unix utilities is sed & awk, by Dale Dougherty and Arnold Robbins (O'Reilly; www.oreilly.com/catalog/sed2).

To use awk to print only one field from a file:

Tips

-
You can print more than one field:

awk '{print $1, $7}' /var/log/httpd/access_log

The output should be the same as in Figure 4.16.

-
You can vary the field separator in awk with the -F option (for field). This allows you to specify a different field separator than the default (whitespace). If you use a separator of ", then awk will break the line into fields wherever a " appears. For example, in

awk -F\" '{print $2}' /var/log/httpd/access_log

the \ is required before the " to remove its special meaning to the shell.

To compare how cut and awk process whitespace:

1. ls -sk /usr/bin | less

This shows the sizes (in kilobytes) of all the files in /usr/bin; refer to Figure 4.18. Notice how the lines actually begin with two or more space characters.

Figure 4.18. Comparing how awk and cut deal with whitespace.

localhost:~ vanilla$ ls -sk /usr/bin | less
104 a2p
156 acid
 16 aclocal
 16 aclocal-1.6
 60 addftinfo
164 afmtodit
. . . (output abbreviated for space) . . .
localhost:~ vanilla$ ls -sk /usr/bin | awk '{print $1}'
104
156
16
16
60
164
. . . (output abbreviated) . . .
localhost:~ vanilla$ ls -sk /usr/bin | cut -d " " -f 1
(many blank lines)
localhost:~ vanilla$

The output of ls is piped through the less command. See "Viewing the Contents of Text Files" in Chapter 5, "Using Files and Directories." You can proceed to the next screenful of output by pressing the spacebar, or return to the shell prompt by pressing q.

2. ls -sk /usr/bin | awk '{print $1}'

This pipes the output of the ls command through awk, printing only the first field, which in this case consists of the sizes (in kilobytes) of the files in /usr/bin.

3. ls -sk /usr/bin | cut -d " " -f 1

This pipes the output from ls through cut, with the delimiter set to a single space character.
The result is that we get no numbers at all in the output, because all the lines from ls begin with a space, and so cut considers the first field to be empty: there is nothing before the first occurrence of the delimiter on each line. Figure 4.18 shows the results of all three command lines above.

sed

sed is a stream editor; that is, a tool for editing streams of data, whether in a file or the output of some other command.

Create a file called sedtest, using the text in Figure 4.19. You can use sed to make it rhyme by changing all the occurrences of love to amore.

Figure 4.19. A sample data file. But it doesn't rhyme very well.

There's just no better foray then the one they call love.
When the moon hits your eye, like a big pizza pie, that's love.
When an eel grabs your hand, and won't let go, that's love.

To convert all occurrences of love to amore:

- sed "/love/s//amore/" sedtest
produces output as shown in Figure 4.20.

Figure 4.20. Using sed to make the data rhyme. (Given the prevalence of puns in Unix, it seems only natural that we should add a few.)

localhost:~ vanilla$ sed "/love/s//amore/" sedtest
There's just no better foray then the one they call amore.
When the moon hits your eye, like a big pizza pie, that's amore.
When an eel grabs your hand, and won't let go, that's amore.
localhost:~ vanilla$

Tip

textutil

The textutil command appears only in Mac OS X/Darwin and allows you to easily convert files from one format to another. Introduced in Mac OS X 10.4, textutil can convert to and from plain text, RTF, RTFD, HTML, Microsoft Word, Microsoft Word XML (wordml), and webarchive.

One especially cool textutil feature is that the HTML files it creates from Word documents are remarkably "clean" and meet strict HTML 4.01 specifications, way better than the HTML created by Word's Save As feature. Another great thing about textutil is that you can create Microsoft Word documents from other formats without having Word installed on your machine. (Note that the Mac OS X application TextEdit can read Word files.) However, textutil is not perfect. If you convert from format X to Y to Z and back to X, the final result will be pretty close to the original, but not exactly the same.

To convert a file format using textutil:

- textutil -convert format oldfile
Where format is one of txt, html, rtf, rtfd, doc, wordml, or webarchive. The command produces a new file whose name is based on the new format: for example, file.html if converted to HTML. The new file is placed in the same directory as the old file (this may be overridden with the -output option). The old file must be in one of the above formats. Here's an example of converting a file to Microsoft Word format:

textutil -convert doc resume.html

That creates a file called resume.doc in your current directory (notice how the .doc extension was added).

Tips

-
 As always, read the  man  page for the full list of available options, such as combining multiple files into a single converted file using the  -cat  (  concatenate  ) option in place of the  -convert  option. You can also specify the output filename or extension with other options. For example,    textutil -cat html -output all.html   *.doc      would create a file called  all.html  containing the content from the .doc files in the current directory.     -  
Some options work only if the command is run from a shell that has access to the Aqua user interface (technically, to a program called Window Server). Basically, that means the command is run from a shell inside the Terminal application, rather than a shell that you logged in to using ssh or telnet (as described in Chapter 10, "Connecting over the Internet").

-
 The Mac OS X Spotlight system and the  mdfind  command (described later in this chapter) search metadata that can be set or viewed with  textutil  .     -  
This section deals with textutil(1), the textutil command covered in section 1 of the Unix manual, not textutil(n), the one covered in section n. See Chapter 3 for more on textutil(n).

Perl

Perl is a programming language equally suited to small tasks and to large, complex programs. (See www.perl.org.)

In this chapter, we are showing Perl in its role as a text-processing utility: a general-purpose tool in the same category as awk, cut, and sed.

Following is an example of creating a very small utility program, written in the Perl language, that provides a feature not available from any existing command.

Earlier in this chapter, you learned how to sort data alphabetically or numerically, but what if you simply want to look at a file backward, seeing the last lines in the file first? This might come up if you are looking at a file consisting of date/time entries and you want to see the latest ones first.

Although Unix provides numerous tools for situations like this, there are still times when you want to do something beyond what the current tools provide. Users of other operating systems look to a catalog of software to see if they can buy a tool that will serve the purpose, or, failing that, wait for someone else to create a new tool. Unix users build their own. And often Unix users choose Perl as the programming language in which to build their new tools.

Perl excels at text processing and allows you to do some things very easily that would be difficult or impossible using other tools.

Perl is a complete programming language, used for everything from simple utility scripts to very large and complex programs containing tens of thousands of lines of code. Teaching Perl is beyond the scope of this book, but we do want to give you some sense of how useful it is and pique your interest in learning more (see the sidebar "Resources for Learning Perl").
And we cover the basics of Unix shell scripts in Chapter 9, "Creating and Using Scripts."

Our example here is a short script that reverses the order of its input: the last line of input comes out first, and the first line comes out last, regardless of alphabetical or numerical sort order.

The steps are similar to the ones in Chapter 2, in the section "Creating a Simple Unix Shell Script":

#!/usr/bin/perl
# script to reverse input
@input = <>;
while ( @input ) {
    print pop(@input);
}

Resources for Learning Perl

-
 If you have never heard of Perl, a good place to start is the Perl Directory About Perl (www.perl.org/press/fast_facts.html).     -  
 If you are looking for online documentation, mailing lists, and other support resources, then go to the Perl Directory Online Documentation (www.perl.org/support/online_support.html).     -  
 You can start right on your Mac with  man perl  .     -  
 A primary resource for beginning Perl programmers is  Learning Perl, 3rd Edition  , by Randal L. Schwartz and Tom Phoenix (O'Reilly; www.oreilly.com/catalog/lperl3).     -  
 The definitive programmer's guide is  Programming Perl, 3rd Edition  , by Larry Wall, Tom Christiansen, and Jon Orwant (O'Reilly; www.oreilly.com/catalog/pperl3).     -  
If you are looking to start learning how to use Perl for CGI programming, then check out Perl and CGI for the World Wide Web: Visual QuickStart Guide, 2nd Edition, by Elizabeth Castro (Peachpit Press; www.peachpit.com/title/0201735687).

To create a script that will reverse its input:

1. nano ~/bin/reverse

This opens the nano editor and creates the script (called reverse) in the bin directory of your home directory, as we did in Chapter 2 with the system-status script. (Make sure you have added ~/bin to your PATH as described in Chapter 2 in "Creating a Simple Unix Shell Script.") The following lines are entered into the nano editor:

2. #!/usr/bin/perl

This must be the first line. It tells Unix to use Perl when running this script.

3. # script to reverse input, by JME 2005-04-21

This line is just a comment, to remind us what the script does. Include your initials or e-mail address and the date. Comments are good. We love comments.

4. @input = <>;

This line causes all input to go into a variable called @input. Each line of input is stored as a separate item in the variable. (Think of this variable as a stack of plates: each time we add an item, we are adding a plate to the stack. In Perl, the @ indicates a list of things, or "stack of plates," which is also called an array.)

5. while ( @input ) {

This is the start of a loop that will continue as long as there is anything left in the @input variable.

6. print pop( @input );

This line removes the most recently added item (pop) from the @input array and sends it to stdout (the print function). (The pop function "pops" the last "plate" off the "stack.")

7. }

This ends the loop.
While the script is running, Perl checks here to see if there is anything still in @input, and if there is, it executes the print pop(@input); line again; otherwise, Perl proceeds to the next line. There isn't a next line in this case, so the script stops running when @input is empty.

8. You can stop entering text, and exit from the nano editor, by pressing Ctrl-X.

9. nano will ask you if you want to save the changes; you say yes by pressing Y.

10. nano asks you to confirm the name you are using to save the file. You confirm by pressing Return. You'll be back at the shell prompt.

11. chmod 755 ~/bin/reverse

That makes the script executable so that you can run it later (remember that chmod means change mode; the 755 refers to the mode that sets permissions; see Chapter 8 for more details). You're done.

Congratulations! You've created a new Unix command by writing a Perl script.

To display input backward using reverse:

- reverse filename
This sends the contents of the file filename to stdout, last line first. Figure 4.21 shows a comparison between using cat and reverse on the same file.

Figure 4.21. Comparing the output of cat and reverse.

localhost:~ vanilla$ cat poetry.txt
And the coming wind did roar more loud,
And the sails did sigh like sedge;
And the rain poured down from one black cloud;
The Moon was at its edge.
localhost:~ vanilla$ reverse poetry.txt
The Moon was at its edge.
And the rain poured down from one black cloud;
And the sails did sigh like sedge;
And the coming wind did roar more loud,
localhost:~ vanilla$

You can use multiple files, for example:

reverse file1 file2 file3

You can use reverse in a pipeline:

ls -l | reverse