File and Text Processing
As you have probably noticed, the Unix command line is very file oriented. Almost everything you do involves at least one file, and often several. Here are a few of the most commonly used command-line tools for processing files. They function equally well in pipelines for processing text that comes directly from other commands without being saved to a file first.
wc: Counting lines, words, and bytes
The wc command displays a count of lines, words, and bytes contained in its input. Input to wc can be one or more files specified as arguments; wc also takes input from stdin (see Chapter 2 to review stdin).
To count the number of lines, words, and bytes in a file:
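The step for this task amounts to typing wc followed by a pathname. Here is a minimal sketch, assuming a small throwaway file instead of a real system file (the name sample.txt and its contents are made up for illustration):

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# sample.txt is a hypothetical three-line file created only for this demo.
printf 'one two\nthree\nfour five six\n' > sample.txt

# Default output: lines, words, bytes, then the filename.
wc sample.txt

# Reading from stdin instead of a named file; no filename is printed.
# tr strips the padding that some versions of wc put before the number.
line_count=$(wc -l < sample.txt | tr -d ' ')
word_count=$(wc -w < sample.txt | tr -d ' ')
byte_count=$(wc -c < sample.txt | tr -d ' ')
echo "$line_count lines, $word_count words, $byte_count bytes"
```

The exact column spacing of the wc output varies between systems, which is why the sketch strips spaces before using the numbers.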
Tips
- The -l option displays only the number of lines, -w the number of words, and -c the number of bytes. The default behavior is the same as when using the -lwc options together. Figure 4.6 shows a comparison of output from each option.
- If you give wc more than one file as an argument, it gives you a line for each and a summary line adding up the contents of all of them (Figure 4.7).
Figure 4.6. Comparing the results of using different options with wc.
localhost:~ vanilla$ wc /etc/hostconfig
32 45 540 /etc/hostconfig
localhost:~ vanilla$ wc -lwc /etc/hostconfig
32 45 540 /etc/hostconfig
localhost:~ vanilla$ wc -l /etc/hostconfig
32 /etc/hostconfig
localhost:~ vanilla$ wc -w /etc/hostconfig
45 /etc/hostconfig
localhost:~ vanilla$ wc -c /etc/hostconfig
540 /etc/hostconfig
localhost:~ vanilla$
Figure 4.7. Counting lines, words, and bytes in several files at once.
localhost:~ vanilla$ wc /etc/*.conf
20 90 753 /etc/6to4.conf
22 47 576 /etc/gdb.conf
57 361 2544 /etc/inetd.conf
0 0 0 /etc/kern_loader.conf
46 199 1160 /etc/named.conf
1 6 44 /etc/ntp.conf
2 4 44 /etc/resolv.conf
21 144 983 /etc/rtadvd.conf
0 12 52 /etc/slpsa.conf
50 273 1602 /etc/smb.conf
18 66 724 /etc/syslog.conf
12 29 238 /etc/xinetd.conf
249 1231 8720 total
localhost:~ vanilla$
sort: Alphabetical or numerical sorting
The sort command takes its input either from files named in its arguments or from stdin and produces sorted output on stdout. The default sorting is in alphabetical order. You can also sort numerically.
For the following tasks, create two plain-text files (you may use the nano editor as described in Chapter 2). The first file should be called "data" and should contain three lines:
100 pears
2 apples
1 orange
The second file should be called "data2" and contain three lines:
1 dog
1 cat
10 fish
To sort alphabetically:
To sort numerically:
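Both tasks come down to running sort with or without the -n option. The sketch below builds the data and data2 files described above in a scratch directory and sorts the first one both ways. The LC_ALL=C setting is an assumption added here so that the alphabetical ordering is the same on every system; it is not part of the original tasks.

```shell
# Work in a scratch directory so existing files are not overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The two files described above.
printf '100 pears\n2 apples\n1 orange\n' > data
printf '1 dog\n1 cat\n10 fish\n' > data2

# Alphabetical (default) sort: lines are compared character by character,
# so "100 pears" sorts before "2 apples".
alpha=$(LC_ALL=C sort data)
echo "$alpha"

# Numerical sort: the leading number is compared by its value.
numeric=$(LC_ALL=C sort -n data)
echo "$numeric"

# Naming both files sorts their combined contents as one stream.
LC_ALL=C sort -n data data2
```

Notice that the alphabetical sort places 100 ahead of 2 because it compares the characters "1" and "2", not the numbers.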
Tips
- Add the -r option to reverse the sort order:
sort -nr data data2
- You can, of course, use sort in a pipeline:
ls -s /usr/bin | sort -n
- To save the output, use output redirection:
sort -n data > sorted
- If you give the uniq command one argument, it uses that as the name of the file to read its input from. If you give it two arguments, uniq assumes that the second one is the name of an output file and writes its results to that file, overwriting its contents:
uniq infile outfile
uniq: Only one of each line
The uniq command takes sorted text as input and produces output with duplicate lines removed.
To display only the unique lines:
1. Create a file called data containing the following lines:
dog
cat
mongoose
cat
bird
dog
snake
cat
bird
2. sort data | uniq
This produces the output shown in Figure 4.10.
Figure 4.10. uniq deletes duplicates so that there is only one of each line.
localhost:~ vanilla$ sort data | uniq
bird
cat
dog
mongoose
snake
localhost:~ vanilla$
3. But what if you want to know how many of each line there were? Adding the -c (count) option to uniq:
sort data | uniq -c
produces the output shown in Figure 4.11.
Figure 4.11. Getting a count of each entry, using uniq.
localhost:~ vanilla$ sort data | uniq -c
2 bird
3 cat
2 dog
1 mongoose
1 snake
localhost:~ vanilla$
4. But now the output is no longer sorted numerically. So:
sort data | uniq -c | sort -rn
Pipe the output through sort again, with the -rn option for reverse numerical sorting, and you get output as shown in Figure 4.12.
Figure 4.12. Sorting the counted entries using sort -rn.
localhost:~ vanilla$ sort data | uniq -c | sort -rn
3 cat
2 dog
2 bird
1 snake
1 mongoose
localhost:~ vanilla$
Tips
- If you want to have uniq act on more than one file, use sort to act on them simultaneously:
sort file1 file2 file3 | uniq
cut
The cut command is used to extract parts from each line of a file or pipeline of data and send the result to stdout. (See Chapter 2 for more about standard output, or stdout.) For example, the log files created by Web servers record several fields of data, including date/time of request, what the user requested, and how many bytes were sent to the browser. Figure 4.13 shows a sample from /var/log/httpd/access_log (this is your Web-server access log if you have enabled Personal Web Sharing by going to the Apple menu and choosing System Preferences and then Sharing).
Figure 4.13. Example of contents of a Web-server log from /var/log/httpd/access_log.
66.47.69.205 - - [14/Mar/2002:14:53:48 -0800] "GET /~matisse/images/macosxlogo.gif HTTP/1.1" 200 2829
66.47.69.205 - - [14/Mar/2002:14:53:48 -0800] "GET /~matisse/images/apache_pb.gif HTTP/1.1" 200 2326
66.47.69.205 - - [14/Mar/2002:14:53:48 -0800] "GET /~matisse/images/web_share.gif HTTP/1.1" 200 13370
66.47.69.205 - - [14/Mar/2002:14:54:00 -0800] "GET /~matisse/cgi-bin/test HTTP/1.1" 200 25
66.47.69.205 - - [14/Mar/2002:14:54:29 -0800] "GET /~matisse/cgi-bin/test HTTP/1.1" 200 25
66.47.69.205 - - [14/Mar/2002:15:08:27 -0800] "GET /~matisse/upload.html HTTP/1.1" 200 397
66.47.69.205 - - [14/Mar/2002:15:33:48 -0800] "GET /~matisse/images/web_share.gif HTTP/1.1" 200 13370
66.47.69.205 - - [14/Mar/2002:15:33:50 -0800] "GET /~matisse/images/web_share.gif HTTP/1.1" 200 13370
Let's say you want to extract only the URLs that were requested in order to determine the most popular pages. cut allows you to specify what separates (or delimits) the fields in each line. The default separator is a tab character (commonly used in tab-delimited files). In our example below, you will instead use the space character. If you look at each line of the data in Figure 4.13, you'll see that the field you want is field number 7; that is, if you break a line into pieces wherever a space occurs, the seventh piece has the URL in it (for example, in the first line, the seventh field contains /~matisse/images/macosxlogo.gif).
To print only one field from a file using spaces as a separator:
- cut -d " " -f 7 /var/log/httpd/access_log
This produces output like that shown in Figure 4.14.
Figure 4.14. Using cut to print one field from a file, using the space character as the field separator.
localhost:~ vanilla$ cut -d " " -f 7 /var/log/httpd/access_log
/~matisse/images/macosxlogo.gif
/~matisse/images/apache_pb.gif
/~matisse/images/web_share.gif
/~matisse/cgi-bin/test
/~matisse/cgi-bin/test
/~matisse/upload.html
/~matisse/images/web_share.gif
/~matisse/images/web_share.gif
localhost:~ vanilla$
Notice how we used quotes around a space character; otherwise, the shell would not pass the space character to cut as our choice for the -d option. (See "About Spaces in the Command Line" in Chapter 2.)
Each line in the original file represents one request made to your Web server, so each URL is from one request. You might already have realized that we could get a quick count of the requests by using the sort and uniq commands covered earlier in this chapter:
cut -d " " -f 7 /var/log/httpd/access_log | sort | uniq -c | sort -nr
gives us output like that in Figure 4.15.
Figure 4.15. Using cut to feed a pipeline.
localhost:~ vanilla$ cut -d " " -f 7 /var/log/httpd/access_log | sort | uniq -c | sort -nr
3 /~matisse/images/web_share.gif
2 /~matisse/cgi-bin/test
1 /~matisse/upload.html
1 /~matisse/images/macosxlogo.gif
1 /~matisse/images/apache_pb.gif
localhost:~ vanilla$
Tips
- You can print more than one field:
cut -d " " -f 1,7 /var/log/httpd/access_log
Figure 4.16 shows the result.
- You specify the field separator in cut with the -d option (for delimiter). This allows you to specify a different field separator than the default (a tab character). If you use a separator of ", then cut will break the line into fields wherever a " appears. For example, in
cut -d\" -f 2 /var/log/httpd/access_log
the \ is required before the " to remove its special meaning to the shell. Figure 4.17 shows the result. See how field 2 contains a chunk of information that is enclosed in quotes. Field 1 is everything before the first " in the line, field 2 is the text between the first and second ", and field 3 is everything after that second ". Try using different delimiters on different files and see what you get.
- The cut command has several useful options, including the ability to extract only specific character positions, for example, only characters 42-54. This is very useful when dealing with fixed-length records, like those produced by older databases, and the output of many Unix commands. For example, piping the output of ls -l through cut -c42-54 will display only the date/time portion of each line. See man cut for more information.
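The character-position form of cut can be tried without depending on the column layout of ls -l, which differs from system to system. Here is a minimal sketch, assuming a made-up fixed-width record (the field widths and the values Ada, London, and 1815 are invented for illustration):

```shell
# Build a hypothetical fixed-width record: columns 1-10 hold a name,
# columns 11-20 a city, and columns 21-24 a year.
record=$(printf '%-10s%-10s%s' Ada London 1815)

# -c selects character positions instead of delimited fields.
name=$(printf '%s\n' "$record" | cut -c1-10)
city=$(printf '%s\n' "$record" | cut -c11-20)
year=$(printf '%s\n' "$record" | cut -c21-24)

# Brackets make the padding spaces visible.
echo "name=[$name] city=[$city] year=[$year]"
```

Because the positions are fixed, the padding spaces come along with each value, which is usually what you want when the next tool also expects fixed-width input.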
Figure 4.16. Printing two fields with cut.
localhost:~ vanilla$ cut -d " " -f 1,7 /var/log/httpd/access_log
66.47.69.205 /~matisse/images/macosxlogo.gif
66.47.69.205 /~matisse/images/apache_pb.gif
66.47.69.205 /~matisse/images/web_share.gif
66.47.69.205 /~matisse/cgi-bin/test
66.47.69.205 /~matisse/cgi-bin/test
66.47.69.205 /~matisse/upload.html
66.47.69.205 /~matisse/images/web_share.gif
66.47.69.205 /~matisse/images/web_share.gif
localhost:~ vanilla$
Figure 4.17. Using a different field separator with cut.
localhost:~ vanilla$ cut -d\" -f 2 /var/log/httpd/access_log
GET /~matisse/images/macosxlogo.gif HTTP/1.1
GET /~matisse/images/apache_pb.gif HTTP/1.1
GET /~matisse/images/web_share.gif HTTP/1.1
GET /~matisse/cgi-bin/test HTTP/1.1
GET /~matisse/cgi-bin/test HTTP/1.1
GET /~matisse/upload.html HTTP/1.1
GET /~matisse/images/web_share.gif HTTP/1.1
GET /~matisse/images/web_share.gif HTTP/1.1
localhost:~ vanilla$
awk
The awk program, a multifeatured text-processing tool, gets its name from the initials of its three inventors (Aho, Weinberger, and Kernighan; Kernighan is Brian Kernighan, coauthor of The C Programming Language).
The basic idea behind awk is that it looks at each line of an input file and performs some action on each line. awk has its own language for defining the actions, and the resulting scripts can be quite complex. See man awk.
A common use of awk is to send to stdout only certain fields from a file or pipeline of data, just like with the simpler cut command described above. In the case of awk the default separator is whitespace. Whitespace is any blank space in a line, including any run of the space character and/or tabs. This is different from what we saw above with cut, where we set the delimiter to a single space character: by default, awk treats any number of spaces or tabs as a delimiter, instead of treating each single space as a delimiter. Also, awk ignores whitespace at the very beginning of a line, and this fact can make awk a better choice than cut in some cases.
In the next task, we first show how awk does the same job as cut in extracting one field from a file. In the following task we show you a comparison between how awk and cut handle whitespace.
More About sed and awk
The standard reference for these two venerable Unix utilities is sed & awk, by Dale Dougherty and Arnold Robbins (O'Reilly; www.oreilly.com/catalog/sed2).
To use awk to print only one field from a file:
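By analogy with the cut example earlier, the awk command for this task would presumably be awk '{print $7}' /var/log/httpd/access_log. The sketch below tries the same idea on a single line copied from Figure 4.13, so it does not need a real log file; the variable names are made up for illustration.

```shell
# One line copied from Figure 4.13, used here instead of the real log file.
line='66.47.69.205 - - [14/Mar/2002:14:54:00 -0800] "GET /~matisse/cgi-bin/test HTTP/1.1" 200 25'

# $7 is the seventh whitespace-separated field: the requested URL.
url_awk=$(printf '%s\n' "$line" | awk '{print $7}')
echo "$url_awk"

# The cut equivalent from the previous section, for comparison.
url_cut=$(printf '%s\n' "$line" | cut -d " " -f 7)
echo "$url_cut"
```

Both commands agree here because the fields in this line are separated by exactly one space each.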
Tips
- You can print more than one field:
awk '{print $1,$7}' /var/log/httpd/access_log
The output should be the same as in Figure 4.16.
- You can vary the field separator in awk with the -F option (for field). This allows you to specify a different field separator than the default (whitespace). If you use a separator of ", then awk will break the line into fields wherever a " appears. For example, in
awk -F\" '{print $2}' /var/log/httpd/access_log
the \ is required before the " to remove its special meaning to the shell.
To compare how cut and awk process whitespace:
1. ls -sk /usr/bin | less
This shows the sizes (in kilobytes) of all the files in /usr/bin (refer to Figure 4.18). Notice how the lines actually begin with two or more space characters.
Figure 4.18. Comparing how awk and cut deal with whitespace.
localhost:~ vanilla$ ls -sk /usr/bin | less
104 a2p
156 acid
16 aclocal
16 aclocal-1.6
60 addftinfo
164 afmtodit
. . . (output abbreviated for space) . . .
localhost:~ vanilla$ ls -sk /usr/bin | awk '{print $1}'
104
156
16
16
60
164
. . . (output abbreviated)
localhost:~ vanilla$ ls -sk /usr/bin | cut -d " " -f 1
(many blank lines)
localhost:~ vanilla$
The output of ls is piped through the less command. See "Viewing the Contents of Text Files" in Chapter 5, "Using Files and Directories." You can proceed to the next screenful of output by pressing the spacebar, or return to the shell prompt by pressing q.
2. ls -sk /usr/bin | awk '{print $1}'
This pipes the output of the ls command through awk, printing only the first field, which in this case consists of the sizes (in kilobytes) of the files in /usr/bin.
3. ls -sk /usr/bin | cut -d " " -f 1
This pipes the output from ls through cut, with the delimiter set to a single space character. The result is that we get no numbers at all in the output, because all the lines from ls begin with a space, and so cut considers the first field to be empty: there is nothing before the first occurrence of the delimiter on each line. Figure 4.18 shows the results of all three command lines above.
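The contents of /usr/bin differ from machine to machine, so the comparison can also be reproduced with fixed input. This is a minimal sketch, assuming two made-up lines shaped like ls -sk output (leading spaces, a size, and a name):

```shell
# Two hypothetical lines in the style of "ls -sk": note the leading spaces.
sample=$(printf '  104 a2p\n   16 aclocal\n')

# awk skips leading whitespace, so field 1 is the size.
awk_out=$(printf '%s\n' "$sample" | awk '{print $1}')

# cut treats every single space as a delimiter, so field 1 is the empty
# string that precedes the first space on each line.
cut_out=$(printf '%s\n' "$sample" | cut -d " " -f 1)

printf 'awk: [%s]\n' "$awk_out"
printf 'cut: [%s]\n' "$cut_out"
```

The awk result holds the two sizes, while the cut result is empty, matching the blank lines in Figure 4.18.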
sed
sed is a stream editor, that is, a tool for editing streams of data, whether in a file or the output of some other command.
Create a file called sedtest, using the text in Figure 4.19. You can use sed to make it rhyme by changing all the occurrences of love to amore.
Figure 4.19. A sample data file. But it doesn't rhyme very well.
There's just no better foray then the one they call love.
When the moon hits your eye, like a big pizza pie, that's love.
When an eel grabs your hand, and won't let go, that's love.
To convert all occurrences of love to amore:
- sed "/love/s//amore/" sedtest
produces output as shown in Figure 4.20.
Figure 4.20. Using sed to make the data rhyme. (Given the prevalence of puns in Unix, it seems only natural that we should add a few.)
localhost:~ vanilla$ sed "/love/s//amore/" sedtest
There's just no better foray then the one they call amore.
When the moon hits your eye, like a big pizza pie, that's amore.
When an eel grabs your hand, and won't let go, that's amore.
localhost:~ vanilla$
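The command in Figure 4.20 has two parts: /love/ selects the lines that contain love, and s//amore/ substitutes amore for the pattern just matched (the empty // reuses the last regular expression). As written, it changes only the first match on each line, which is enough for the sample file because love appears once per line. The sketch below, using a made-up line with two matches, compares it with the shorter s/love/amore/ form and with the g flag, which replaces every match:

```shell
# A hypothetical line containing the pattern twice.
line="love is love"

# Address form used above: select lines matching /love/, then substitute.
# The empty pattern // reuses the last regular expression (love).
addr_form=$(printf '%s\n' "$line" | sed "/love/s//amore/")

# Shorter equivalent: try the substitution on every line, first match only.
short_form=$(printf '%s\n' "$line" | sed "s/love/amore/")

# With the g flag, every match on the line is replaced.
global_form=$(printf '%s\n' "$line" | sed "s/love/amore/g")

echo "$addr_form"
echo "$short_form"
echo "$global_form"
```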
Tip
textutil
The textutil command appears only in Mac OS X/Darwin and allows you to easily convert files from one format to another. Introduced in Mac OS X 10.4, textutil can convert to and from plain text, RTF, RTFD, HTML, Microsoft Word, Microsoft Word XML (wordml), and webarchive.
One especially cool textutil feature is that it creates HTML files converted from Word documents that are remarkably "clean" and meet strict HTML 4.01 specifications, way better than HTML created by Word's Save As feature. Another great thing about textutil is that you can create Microsoft Word documents from other formats without having Word installed on your machine. (Note that the Mac OS X application TextEdit can read Word files.) However, textutil is not perfect. If you convert from format X to Y to Z and back to X, the final result will be pretty close to the original, but not exactly the same.
To convert a file format using textutil:
- textutil -convert format oldfile
Where format is one of txt, html, rtf, rtfd, doc, wordml, or webarchive. The command produces a new file whose name is based on the new format, for example, file.html if converted to HTML. The new file is placed in the same directory as the old file (which may be overridden with the -output option). The old file must be in one of the above formats. Here's an example of converting a file to Microsoft Word format:
textutil -convert doc resume.html
That creates a file called resume.doc in your current directory (notice how the .doc extension was added).
Tips
- As always, read the man page for the full list of available options, such as combining multiple files into a single converted file using the -cat (concatenate) option in place of the -convert option. You can also specify the output filename or extension with other options. For example,
textutil -cat html -output all.html *.doc
would create a file called all.html containing the content from the .doc files in the current directory.
- Some options only work if the command is run from a shell that has access to the Aqua user interface (technically to a program called Window Server). Basically, that means the command is run from a shell inside the Terminal application, rather than a shell that you logged in to using ssh or telnet (as described in Chapter 10, "Connecting over the Internet").
- The Mac OS X Spotlight system and the mdfind command (described later in this chapter) search metadata that can be set or viewed with textutil.
- This section deals with textutil(1), the textutil command covered in section 1 of the Unix manual, not textutil(n), the one covered in section n. See Chapter 3 for more on textutil(n).
Perl
Perl is a programming language equally suited to small tasks and large, complex programs. (See www.perl.org.)
In this chapter, we are showing Perl in its role as a text-processing utility, a general-purpose tool in the same category as awk, cut, and sed.
Following is an example of creating a very small utility program written in the Perl language that provides a feature not available from any existing command.
Earlier in this chapter, you learned how to sort data alphabetically or numerically, but what if you simply want to look at a file backward, seeing the last lines in the file first? This might come up if you are looking at a file consisting of date/time entries and you want to see the latest ones first.
Although Unix provides numerous tools for situations like this, there are still times when you want to do something beyond what the current tools provide. Users of other operating systems look to a catalog of software to see if they can buy a tool that will serve the purpose, or, failing that, wait for someone else to create a new tool. Unix users build their own. And often Unix users choose Perl as the programming language in which to build their new tools.
Perl excels at text processing and allows you to do some things very easily that would be difficult or impossible using other tools.
Perl is a complete programming language used for everything from simple utility scripts to very large and complex programs containing tens of thousands of lines of code. Teaching Perl is beyond the scope of this book, but we do want to give you some sense of how useful it is and pique your interest in learning more (see the sidebar "Resources for Learning Perl"). And we cover the basics of Unix shell scripts in Chapter 9, "Creating and Using Scripts."
Our example here is a short script that reverses the order of its input: the last line of input comes out first, and the first line comes out last, regardless of alphabetical or numerical sort order.
The steps are similar to the ones in Chapter 2, in the section "Creating a Simple Unix Shell Script":
#!/usr/bin/perl
# script to reverse input
@input = <>;
while ( @input) {
print pop(@input);
}
Resources for Learning Perl
- If you have never heard of Perl, a good place to start is the Perl Directory About Perl (www.perl.org/press/fast_facts.html).
- If you are looking for online documentation, mailing lists, and other support resources, then go to the Perl Directory Online Documentation (www.perl.org/support/online_support.html).
- You can start right on your Mac with man perl.
- A primary resource for beginning Perl programmers is Learning Perl, 3rd Edition, by Randal L. Schwartz and Tom Phoenix (O'Reilly; www.oreilly.com/catalog/lperl3).
- The definitive programmer's guide is Programming Perl, 3rd Edition, by Larry Wall, Tom Christiansen, and Jon Orwant (O'Reilly; www.oreilly.com/catalog/pperl3).
- If you are looking to start learning how to use Perl for CGI programming, then check out Perl and CGI for the World Wide Web: Visual QuickStart Guide, 2nd Edition, by Elizabeth Castro (Peachpit Press; www.peachpit.com/title/0201735687).
To create a script that will reverse its input:
1. nano ~/bin/reverse
This opens the nano editor and creates the script (called reverse) in the bin directory of your home directory, as we did in Chapter 2 with the system-status script. (Make sure you have added ~/bin to your PATH as described in Chapter 2 in "Creating a Simple Unix Shell Script.")
The following lines are entered into the nano editor:
2. #!/usr/bin/perl
This must be the first line. It tells Unix to use Perl when running this script.
3. # script to reverse input, by JME 2005-04-21
This line is just a comment, to remind us what the script does. Include your initials or e-mail address and the date. Comments are good. We love comments.
4. @input = <>;
This line causes all input to go into a variable called @input. Each line of input is stored as a separate item in the variable. (Think of this variable as a stack of plates: each time we add an item, we are adding a plate to the stack. In Perl, the @ indicates a list of things, or "stack of plates," which is also called an array.)
5. while ( @input ) {
This is the start of a loop that will continue as long as there is anything left in the @input variable.
6. print pop( @input );
This line removes the most recently added item (pop) from the @input array and sends it to stdout (the print function). (The pop function "pops" the last "plate" off the "stack.")
7. }
This ends the loop. While the script is running, Perl checks here to see if there is anything still in @input, and if there is, it executes the print pop(@input); line again; otherwise, Perl proceeds to the next line.
There isn't a next line in this case, so the script stops running when @input is empty.
8. You can stop entering text, and exit from the nano editor, by pressing Control-X.
9. nano will ask you if you want to save the changes; you say yes by pressing Y.
10. nano asks you to confirm the name you are using to save the file. You confirm by pressing Return.
You'll be back at the shell prompt.
11. chmod 755 ~/bin/reverse
That makes the script executable so that you can run it later (remember that chmod means change mode; the 755 refers to the mode that sets permissions; see Chapter 8 for more details).
You're done.
Congratulations, you've created a new Unix command by writing a Perl script.
To display input backward using reverse:
- reverse filename
This sends the contents of the file filename to stdout, last line first. Figure 4.21 shows a comparison between using cat and reverse on the same file.
Figure 4.21. Comparing the output of cat and reverse.
localhost:~ vanilla$ cat poetry.txt
And the coming wind did roar more loud,
And the sails did sigh like sedge;
And the rain poured down from one black cloud;
The Moon was at its edge.
localhost:~ vanilla$ reverse poetry.txt
The Moon was at its edge.
And the rain poured down from one black cloud;
And the sails did sigh like sedge;
And the coming wind did roar more loud,
localhost:~ vanilla$
You can use multiple files, for example,
reverse file1 file2 file3
You can use reverse in a pipeline:
ls -l | reverse