Communication Between Processes: Redirection, Pipes
Building an operating system out of a multitude of small, cooperating processes would not provide such flexibility and power to the user were it not for a simple method of making all of these processes speak to each other. At the heart of the interprocess communications model of Unix is a simple but amazingly effective abstraction of the idea of input and output.
To paraphrase the model on which Unix bases input and output, you can imagine that Unix thinks of user input to a program as a stream of information. Output from the program back to the user can be thought of in the same way. A stream of information is simply a collection of information that flows in or out of the program in a serial (ordered) fashion. A user can't send two pieces of information to a program at the same time: two key presses, no matter how closely they occur, are ordered, one first and one second. A cursor moving across a screen provides information serially as to where it is now, and where it was then. Even if two events manage to occur simultaneously, the electronics of the machine can't really deal with simultaneous events, and so they end up being registered as separate events occurring very close in time. Output must be similarly serially ordered. Whether you are drawing data to the screen or sending data over an Internet connection, no two data items leave a program at exactly the same time; therefore, they are also a serial stream of information.
Because both input and output from processes are streams of information, and every function of the system from user programs to reading files to parts of the operating system is a running process, Unix models the implementation of communication between the processes as simply tying the output stream of one process to another's input stream. Tying the standard output stream (named STDOUT) from one process to the standard input stream (named STDIN) of another is called creating a pipe between them. When you understand the view of data moving into or out of a process as being a data stream, it is immediately obvious that there is no need for the system to concern itself over the endpoints of the stream. Data simply moves about the system between programs in streams, as though each program had input and output spigots, and someone had connected garden hoses between them. The input spigots all look the same, and the output spigots all look the same, so the operating system can tie any output into any input, and let the programs worry about whether they know what to do with the data in the stream.
For example, one endpoint of a stream might be connected to the output (STDOUT) of a process that is taking input from a user at a keyboard, and the other endpoint might be connected to the input (STDIN) of a process manipulating that information and writing it into a file. On the other hand, the same information could be placed in a file, and we could replace the user entering information with a process that could read the file and write the same information onto its STDOUT. If we tie the stream created in this fashion into the STDIN of the same manipulation program, there would be absolutely no difference between these two situations from the operating system's point of view.
In short, this abstraction provides that so long as the input coming to a process looks like the input the process expects, it does not matter to the process or the operating system where that input comes from. Likewise, provided that the destination of the output from the process acts as expected, it does not matter where the output is actually going.
Redirection: STDIN, STDOUT, STDERR
Unix makes this input/output model available to the user through a concept known as redirection. This is implemented as a requirement that all processes adhere to certain conventions regarding input and output.
At the base is the notion that input and output from programs is generally from, and to, a user typing information at the command line. Even programs that are not intended to be used by a person at a command line are expected to adhere to the model that input comes from a user, and output goes to a user.
This might seem counterintuitive, but further conventions make this seeming restriction less restrictive, while generalizing the input/output model sufficiently that it can be applied to almost any need. Two of these are the idea of input arriving in a program through a virtual interface known as STDIN (standard input), and output leaving the program through a virtual interface known as STDOUT (standard output). A third convention adds another virtual interface by which error messages can be conveyed, which is STDERR (standard error).
Redirection is accomplished by attaching these virtual interfaces to each other in various combinations, essentially redirecting the input or output of a process to or from somewhere other than a user.
Standard In: STDIN
The virtual input interface to programs is called STDIN, for standard input. A program can expect the incoming data stream from the user (or any other source) to arrive at STDIN.
When you interact with a command-line program, the program is reading the data you are entering from STDIN. If you prefer not to enter the data by hand, you can put it in a file and redirect the contents of the file into the program's STDIN; the program will not know the difference.
A program that you can use for an example is the spell program. Apple hasn't distributed spell with Mac OS X as of this writing, but we've provided instructions on how to install it in Chapter 13, "Using Common Command-Line Applications and Application Suites."
If you're using a system on which it's already been installed, follow along here. The installation isn't too difficult, either, if you care to glance ahead and trust us on the commands you don't recognize yet. If not, spell still makes a good program for explanation because it has exactly the features we want to exhibit; just read along and imagine that it's really working until you get to Chapter 13.
The spell command finds misspellings. Given input from STDIN, spell parses through it, checks the input against a dictionary, and returns any misspellings it finds. To issue the spell command from the command line, you might type something like the following:
brezup:ray testing $ spell
Now is the tyem for all good authors to come to thie ayde of some very good Unix users
Ctrl+D
Pressing Ctrl+D finishes the input, sending an end-of-data signal into STDIN, effectively telling the program that no further information is to come. The spell program goes to work, and returns the following:
tyem
thie
ayde
Each of the misspelled words (or at least words that aren't in the dictionary) is displayed, exactly as expected.
This might not seem to be a particularly useful program at first glance. How often do you want to type a sentence, just to find out what words are misspelled in it? The key to its usefulness, however, is that the spell program does not care whether you typed the input, or whether the input came from a file.
Now try it with data from a file. Fire up your favorite text editor, and create a file containing the same text you typed to spell previously. Then try spell by redirecting this file into its STDIN interface. If you named your file reallydumbfile, you can run spell on it by typing the following:
brezup:ray testing $ spell < reallydumbfile
tyem
thie
ayde
This looks a little more useful. The < character redirects STDIN for the program to its left to come from the file named to its right. Here, it redirects STDIN for the spell program so that it comes from the file reallydumbfile rather than from your keyboard.
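Because spell might not be installed yet, here is a minimal sketch of the same kind of redirection using tr, a standard command that reads only from STDIN and accepts no filename argument. The scratch directory and the file name sample.txt are placeholders invented for this illustration.

```shell
# Scratch directory and file name are placeholders for this sketch.
workdir=$(mktemp -d)

# Create a small sample file (the > used here is described in the next section).
printf 'Now is the time\n' > "$workdir/sample.txt"

# tr takes no filename argument, so the file has to arrive on STDIN.
shouted=$(tr '[:lower:]' '[:upper:]' < "$workdir/sample.txt")
echo "$shouted"    # prints: NOW IS THE TIME

rm -r "$workdir"
```

Running tr with the same arguments and typing the sentence by hand, followed by Ctrl+D, should give the same result.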
Standard Out: STDOUT
The virtual output interface that Unix provides to programs is called STDOUT, for standard output. Just as you can redirect STDIN from a file, if you want to store the output of a command in a file, you can redirect STDOUT from the program into the file. The > character directs the STDOUT of the program to its left into the file named to its right. For example, if you want to collect the last 20 lines of /etc/services into a file in your home directory, you could type
brezup:ray testing $ tail -20 /etc/services > ~/my-output
This command directs the shell to create a file named my-output in your home directory, and to redirect STDOUT from the tail command (that is, the data tail would print if you just issued the command tail -20 /etc/services) into the file. If my-output already exists in your home directory, it will be overwritten by the output from tail.
If you want to collect and archive the data, by appending it to my-output instead of overwriting it, the shell can be directed to append rather than replace the data. In this case, STDOUT is redirected with >> rather than the single >. The >> character pair appends the STDOUT of the program to the left into the file named on its right.
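The difference between > and >> can be sketched with a few echo commands. This is only an illustration: the scratch directory is our own choice, and the file name simply echoes the my-output example above.

```shell
# Work in a scratch directory so no real file is overwritten.
workdir=$(mktemp -d)
log="$workdir/my-output"

echo "first run"  > "$log"     # creates the file
echo "second run" > "$log"     # overwrites it: "first run" is gone
echo "third run" >> "$log"     # appends after "second run"

contents=$(cat "$log")
echo "$contents"
# prints:
# second run
# third run

rm -r "$workdir"
```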
You can also simultaneously redirect STDOUT and STDIN, like this:
[localhost:~/Documents] nermal% spell < reallydumbfile > reallydumbspelling
[localhost:~/Documents] nermal% ls
get_termcap         lynx.cfg            reallydumbspelling  termcap-1.3.tar
lynx                reallydumbfile      termcap-1.3         test
[localhost:~/Documents] nermal% cat reallydumbspelling
tyem
thie
ayde
Standard Error: STDERR
To make your life easier, Unix actually has two different output interfaces that it defines for programs. The first, STDOUT, has just been covered. The second, STDERR, is used to allow the program to provide error and diagnostic information to the user. This is done for two reasons. First, it allows error information to be reported in such a way that it does not interfere with data on the STDOUT interface. Second, if you are redirecting STDOUT from a program to another program or to a file, you would not see error messages if they were carried on STDOUT. By providing a separate error channel, Unix gives the user the choice of how and where error and diagnostic information should be displayed, independent of information that is actually correct output data.
tcsh and bash syntax disagree rather significantly here. In tcsh if you want to redirect STDERR into the same stream as STDOUT, effectively combining these two different pieces of information, you can do so by using the character pair >& to indicate redirection in the command, instead of >. bash allows this syntax for combining the streams (though it prefers the use of &>), but in bash you also have the option of redirecting STDERR independently of STDOUT. To redirect just STDERR, use 2> as the redirection specifier rather than >.
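In bash (this sketch does not apply to tcsh), sending the two output streams to separate files might look like the following. The names out.txt and err.txt are placeholders, and the exact wording of the error message from ls varies between systems, so it is not shown.

```shell
workdir=$(mktemp -d)
touch "$workdir/present"

# ls reports the existing file on STDOUT and complains about the missing one
# on STDERR. The "|| true" keeps the sketch going despite ls's nonzero exit status.
ls "$workdir/present" "$workdir/missing" > "$workdir/out.txt" 2> "$workdir/err.txt" || true

out=$(cat "$workdir/out.txt")
err=$(cat "$workdir/err.txt")
echo "STDOUT captured: $out"
echo "STDERR captured: $err"

rm -r "$workdir"
```

Only the name of the existing file should land in out.txt, while the complaint about the missing file lands in err.txt.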
As mentioned earlier, both bash and tcsh are vastly more complex than can be completely covered in a book of this size, and input/output redirection is one of the principal areas of complexity. If you want to perform more complex manipulations of your command's input and output, see your online man pages to learn how the shell of your preference behaves.
Finally, nothing in the input/output model restricts redirection to files or users: if every endpoint looks like a user or a file, software can talk to anything else just as easily. STDIN and STDOUT can be tied directly to each other instead of being tied into files or the command line.
A useful way to picture this is that the operating system never really redirects to or from files. When you redirect into a file, imagine that the system invisibly creates a process that writes into the file, and redirects your output to the STDIN of that process. Likewise, when you redirect a file into a program's STDIN, imagine an invisible process that opens and reads the file, with its STDOUT tied into your process's STDIN (and now you see why we said the model was based on input and output being attached to users, rather than to files). In practice, the shell simply opens the file and attaches it to the program's STDIN or STDOUT before the program starts, so no helper process is needed, but the picture is a convenient one. For the user's convenience, these common actions are abbreviated into the < and > redirection characters.
Programs, on the other hand, are connected by directly redirecting their STDOUT and STDIN interfaces with a pipe. To create a pipe in Unix, you simply use a | character between the programs on the command line.
Again, an example is more illustrative than a considerable amount of explanation. Consider a situation in which you want to examine the content of a file that is larger than will fit on one screen. You can accomplish this easily by piping the output from the cat command into a pager, such as the more command.
brezup:ray testing $ cat /usr/share/file/magic | more
# Magic
# Magic data for file(1) command.
# Machine-generated from src/cmd/file/magdir/*; edit there only!
# Format is described in magic(files), where:
# files is 5 on V7 and BSD, 4 on SV, and ?? in the SVID.
#------------------------------------------------------------------------------
# Localstuff: file(1) magic for locally observed files
#
# $Id: Localstuff,v 1.1 2003/07/02 18:00:17 eseidel Exp $
# Add any locally observed files here. Remember:
# text if readable, executable if runnable binary, data if unreadable.
#------------------------------------------------------------------------------
# acorn: file(1) magic for files found on Acorn systems
#
# RISC OS Chunk File Format
# From RISC OS Programmer's Reference Manual, Appendix D
# We guess the file type from the type of the first chunk.
0 lelong 0xc3cbc6c5 RISC OS Chunk data
>12 string OBJ_ \b, AOF object
>12 string LIB_ \b, ALF library
byte 920
Of course, you already know that you could have accomplished this by just using more /usr/share/file/magic. The point, though, is that although we told you how to use more to read a file before, more actually wants to take its input from STDIN and uses a file specified as an argument only as a last resort.
Knowing this, you now know how to make any other output from any other program viewable with the more pager. This enables you to do things such as look at the full contents of your filesystem without needing an immensely large scroll buffer in your terminal:
brezup:ray testing $ ls -lRaF / | more
total 8721
drwxrwxr-t  39 root  wheel     1326 16 Aug 17:16 ./
drwxrwxr-t  39 root  wheel     1326 16 Aug 17:16 ../
-rwxrwxr-x   1 ray   unknown   6148 16 Aug 17:15 .DS_Store*
d-wx-wx-wt   4 ray   admin      136 12 Aug 01:09 .Trashes/
-rw-r--r--   1 ray   admin    39568 11 Aug 22:42 .VolumeIcon.icns
-r--r--r--   1 root  wheel      156 29 Jul 14:15 .hidden
dr--r--r--   2 root  wheel      256 16 Aug 17:16 .vol/
drwxrwxr-x  28 root  admin      952 11 Aug 23:50 Applications/
drwxr-xr-x   2 ray   unknown     68 11 Aug 23:54 Calendars/
drwxr-xr-x   4 ray   unknown    136 11 Aug 23:54 Contacts/
-rw-r--r--   1 root  admin     1024 11 Aug 23:50 Desktop DB
-rw-r--r--   1 root  admin        2 11 Aug 22:43 Desktop DF
drwxr-xr-x   2 ray   unknown     68 13 Aug 10:58 Desktop Folder/
drwxrwxr-x  13 root  wheel      442  2 Jul 17:22 Developer/
-rw-r--r--   1 ray   admin        0 11 Aug 22:42 Icon
drwxrwxr-x  35 root  wheel     1190 13 Aug 20:45 Library/
drwxr-xr-x   1 root  wheel      512 16 Aug 23:19 Network/
drwxr-xr-x   5 root  wheel      170  2 Jul 17:22 System/
drwxr-xr-x   3 ray   unknown    102 13 Aug 10:58 TheVolumeSettingsFolder/
drwxr-xr-x   2 ray   unknown     68 13 Aug 10:58 Trash/
drwxrwxr-t   9 root  admin      306 16 Aug 16:28 Users/
drwxrwxrwt   8 root  admin      272 16 Aug 17:16 Volumes/
byte 1356
One particularly valuable application of piping commands together comes when you want to filter the output of a command so that you see only the most interesting parts. For example, if you want to find files in your current directory that were edited in August, you could refer to Chapter 9, "Accessing the BSD Subsystem," and look up how to filter dates in ls, or you could use what you probably already remember about grep. Adding a pipe from ls -l into a grep command looking for the string Aug is as simple as entering both commands on the command line, separated by a | character.
brezup:ray Documents $ ls -l
total 16456
-rw-r--r--    1 ray  staff  4925440 Jun 11 00:11 BTS.tar
drwxr-xr-x   17 ray  staff      578 Jun 13 08:57 BTS_folder
drwxr-xr-x   28 ray  staff      952 Jun 12 11:27 Core
drwxr-xr-x    4 ray  staff      136 Mar  9 02:13 DnD_data
drwxr-xr-x   10 ray  staff      340 Jul  2 10:43 Mailsmith User Data Backup
drwxr-xr-x   17 ray  staff      578 Aug 15 23:58 Microsoft User Data
drwxr-xr-x    7 ray  staff      238 Jul 16  2002 Software_Docs
drwxr-xr-x   48 ray  staff     1632 Aug 15 00:19 buying_the_farm
drwxrwxrwx   39 ray  staff     1326 Apr 24 19:02 dna_demos_2002
drwxr-xr-x    4 ray  staff      136 May 29 13:12 games
-rw-r--r--    1 ray  staff   610678 Aug 15 02:24 hd_genes_aa.fa
-rw-r--r--    1 ray  staff  1645031 Aug 15 02:24 hd_genes_nt.fa
-rw-r--r--    1 ray  staff   609071 Jun  9 21:06 lumberjk.mp3
drwxr-xr-x   17 ray  staff      578 May 21 16:56 openGL
drwxr-xr-x   36 ray  staff     1224 Aug 15 00:19 research
drwxr-xr-x  104 ray  staff     3536 Jun 26 10:39 security
drwxr-xr-x    5 ray  staff      170 Dec 27  2002 source
-rw-r--r--    1 ray  staff   623484 Jan 18  2002 squirrel.mpg
drwxrwxrwx   16 ray  staff      544 Sep 16  2002 stylewriter
drwxrwxrwx    7 ray  staff      238 Aug 16 15:56 unleashed
brezup:ray Documents $ ls -l | grep "Aug"
drwxr-xr-x   17 ray  staff      578 Aug 15 23:58 Microsoft User Data
drwxr-xr-x   48 ray  staff     1632 Aug 15 00:19 buying_the_farm
-rw-r--r--    1 ray  staff   610678 Aug 15 02:24 hd_genes_aa.fa
-rw-r--r--    1 ray  staff  1645031 Aug 15 02:24 hd_genes_nt.fa
drwxr-xr-x   36 ray  staff     1224 Aug 15 00:19 research
drwxrwxrwx    7 ray  staff      238 Aug 16 15:56 unleashed
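A pipeline is not limited to two commands. As a rough sketch, the following chains four processes to count the August entries rather than list them. The sample listing is made up for this illustration and stands in for real ls -l output so that the result is predictable.

```shell
# Sample lines standing in for ls -l output (invented for this sketch).
listing='-rw-r--r-- 1 ray staff 100 Jun 11 00:11 notes.txt
-rw-r--r-- 1 ray staff 200 Aug 15 02:24 genes.fa
drwxr-xr-x 2 ray staff  68 Aug 16 15:56 drafts'

# printf emits the lines, grep keeps those containing Aug, wc -l counts them,
# and tr strips the padding that some versions of wc put around the number.
august_count=$(printf '%s\n' "$listing" | grep "Aug" | wc -l | tr -d ' ')
echo "Entries from August: $august_count"    # prints: Entries from August: 2
```

With a real directory, replacing the printf stage with ls -l should give the count for that directory.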
Sometimes, you want to filter both STDOUT from the command and STDERR. find, for example, has an annoying tendency to report all manner of errors about directories that you're not allowed to look in. This often clutters up the output such that you can't actually find what it was that you wanted to find. Here, redirecting STDOUT using only the | character is insufficient because find helpfully puts the error messages on STDERR, where they won't be captured by the | pipe (which carries only STDOUT).
In tcsh, redirecting both STDOUT and STDERR simultaneously is simply a matter of using the |& pipe combination instead of the single | character. In bash, it's just slightly more complex. bash has traditionally had no single pipe operator for both streams (versions 4.0 and later do accept |& as a shorthand), but it can redirect STDERR into STDOUT. The syntax for that looks like 2>&1. After STDERR has been redirected into STDOUT, the combined stream can then be redirected into grep with the | character.
brezup:ray testing $ find / -name 702_fall_grades -print
find: cannot read dir /lost+found: Permission denied
find: cannot read dir /usr/lost+found: Permission denied
find: cannot read dir /usr/local/lost+found: Permission denied
. . .
^C
brezup:ray testing $ find / -name 702_fall_grades -print 2>&1 | grep "702_fall_grades"
/Users/ray/Biophysics/702_fall_grades
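The effect of placing 2>&1 before a pipe can be sketched in bash with a small shell function standing in for find. The function noisy and its two messages are hypothetical, chosen so that the line counts are predictable. The last case, discarding errors by sending STDERR to /dev/null, is an alternative not covered in the text above.

```shell
# noisy is a hypothetical stand-in for a command such as find:
# it writes one result to STDOUT and one complaint to STDERR.
noisy() {
    echo "result: /Users/ray/found_it"
    echo "error: Permission denied" >&2
}

# Plain pipe: only STDOUT reaches wc, so one line is counted.
# (The error line bypasses the pipe and goes to the terminal.)
plain=$(noisy | wc -l | tr -d ' ')

# With 2>&1, STDERR joins STDOUT before the pipe, so both lines are counted.
merged=$(noisy 2>&1 | wc -l | tr -d ' ')

# Alternative: throw the errors away instead of merging them.
quiet=$(noisy 2>/dev/null)

echo "plain pipe saw $plain line, merged pipe saw $merged lines"
echo "with errors discarded: $quiet"
```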
These are, of course, simplistic examples of connecting programs, but keep an eye out for how pipes are used throughout the rest of the book. The ability to create small programs with small functions and to tie these together into arbitrarily large programs with arbitrarily complex behaviors is powerful. This is one of the main reasons that having access to the BSD half of your new operating system is so valuable.
Think back to programs such as grep, and you can probably begin to see how you could apply this to creating custom solutions to problems that you might have encountered. You should also begin to see why this functionality cannot be conveniently duplicated with a GUI-only interface.
Joints in Pipes: tee
On occasion, you might want to redirect STDOUT to both a file and another program at the same time. In such a case, you can use the tee command. This command accepts data on STDIN, writes it to a filename specified on the command line, and continues to send the data, unaltered, on STDOUT.
Consider an example in which you want to search through your files, looking for files that match a particular name pattern. You want to both browse the found names as they appear, and collect the names into a log file so that you can use the information again later. In this example, we will look in a rather inefficient fashion for files with names that contain java. Because many of them are probably on the system, we want the output piped through a pager (more). We also want to collect the filenames into a file in our home directory named my_output.
brezup:ray testing $ find / -name \*java\* -print | tee ~/my_output | more
/Applications/Utilities/Java/Java Web Start.app/Contents/MacOS/javaws.cfg
/Applications/Utilities/Java/Java Web Start.app/Contents/MacOS/javaws.jar
/Applications/Utilities/Java/Java Web Start.app/Contents/MacOS/javaws.policy
. . .
It might take some time for this to start printing output to the screen because it could take find a while to start finding appropriately named files. If you let this run to completion, you can then look at the file my_output, and it will have all the stuff you just scrolled through with more. If you press Ctrl+C to stop the find and the listing, you'll kill the tee process as well, and my_output will hold at most the names found up to that point. Because this isn't a valuable listing, you might prefer to kill it off rather than actually wait for this to finish just to see the output, but you wouldn't want to do this if it were important to capture the complete output.
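For a quicker experiment than searching the whole filesystem, the behavior of tee can be sketched with a few made-up file names. Everything here other than tee and grep themselves (the scratch directory, the names fed in by printf) is an assumption for illustration.

```shell
workdir=$(mktemp -d)

# printf produces three lines; tee saves a copy of all of them in my_output
# and passes them along unaltered; grep then filters what continues downstream.
matches=$(printf 'javaws.cfg\nnotes.txt\njavaws.jar\n' | tee "$workdir/my_output" | grep java)

saved=$(cat "$workdir/my_output")
echo "passed down the pipe:"
echo "$matches"
echo "saved by tee:"
echo "$saved"

rm -r "$workdir"
```

The file written by tee should contain all three names, while only the two containing java make it past grep, which shows that tee records the stream as it was at that joint in the pipe.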
The tee command is invaluable if you need to split one STDOUT stream to be used by multiple different processes, or if you need to collect logging or partial output from intermediate steps in a large, multiprogram piped command. Table 12.8 shows syntax and options for tee.