Working with Columns and Fields | UNIX: The Complete Reference, Second Edition (Complete Reference Series)

Many files contain information that is organized in terms of position within a line. These include tables, which organize text in columns, and files such as /etc/passwd that consist of lines made up of fields. The UNIX System includes a number of tools designed specifically to work with files organized in columns or fields. You can use the commands described in this section to extract, modify, or rearrange information in field-structured or column-structured files.

cut allows you to select particular columns or fields from files.
colrm deletes one or more columns from a file or set of files.
paste glues together columns or fields from existing files.
join merges information from two database files.

cut

Often you are interested in only some of the fields or columns contained in a table or file. For example, you may want to get a list of e-mail addresses from a personnel file that contains names, employee numbers, e-mail addresses, telephone numbers, and so forth. cut allows you to extract from such files only the fields or columns you want.

When you use cut, you have to tell it how to identify fields and which fields to select. You can identify fields either by character position or by the use of field separator characters. You must specify either the -c or the -f option and the field or fields to select.

Using cut with Fields

Many files can be thought of as a list of records, each consisting of several fields, with a specific kind of information in each field. An example is the file contact-info shown here, which contains names, usernames, phone numbers, and office numbers:

 $ cat contact-info
 Barker-Plummer,D   dbp     555-1111   1J333
 Etchemendy,J       etch    555-2222   2F328
 Liu, A             a-liu   555-3333   1J322

Field-structured files like this are used often in the UNIX System, both for personal databases like this one and to hold system information.

A field-structured file uses a field separator or delimiter to separate the different fields. In the preceding example, the field separator is the tab character, but any other character, such as a colon (:) or a percent sign (%), could be used.

To retrieve a particular field from each record of a file, you tell cut the number of the field you want. For example, the following command uses cut to retrieve the usernames from contact-info by cutting out the second field from each line or record:

 $ cut -f2 contact-info
 dbp
 etch
 a-liu

You can use cut to select any set of fields from a file. The following command uses cut to produce a list of names and telephone numbers from contact-info by selecting the first and third fields from each record:

 $ cut -f1,3 contact-info > phone-list

You can also specify a range of adjacent fields, as in the following example, which includes each person's name, username, and telephone number in the output:

 $ cut -f1-3 contact-info > contact-short

If you omit the last number from a range, it means “to the end of the line.” The following command copies everything except field two from contact-info to contact-short:

 $ cut -f1,3- contact-info > contact-short
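The field selections described above can be tried on a small tab-separated sample. What follows is a minimal sketch rather than the book's own session: the file name sample.tsv is made up for illustration, the two records are copied from the contact-info example, and printf is used so that the tab separators are explicit.

```shell
# Build a throwaway tab-separated file (sample.tsv is a hypothetical name).
dir=$(mktemp -d)
printf 'Barker-Plummer,D\tdbp\t555-1111\t1J333\n' >  "$dir/sample.tsv"
printf 'Etchemendy,J\tetch\t555-2222\t2F328\n'    >> "$dir/sample.tsv"

# Second field only (the usernames).
users=$(cut -f2 "$dir/sample.tsv")

# First and third fields; the field list must not contain spaces.
phones=$(cut -f1,3 "$dir/sample.tsv")

# Field one, then field three through the end of the line.
no_user=$(cut -f1,3- "$dir/sample.tsv")

printf '%s\n' "$users" "$phones" "$no_user"
rm -r "$dir"
```

Note that cut always emits the selected fields in their original order, whatever order they are listed in after -f.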

Using cut with Multiple Files

You can use cut to select fields from several files at once. For example, if you have two files of contact information, one containing personal contacts and one for work-related contacts, you could create a list of all the names and phone numbers in both files with the following command:

 $ cut -f1,3 contacts.work contacts.home > contacts.all

Of course, the files must share the same formatting, so that the command cut -f1,3 works correctly on both of them.

Specifying Delimiters

Fields are separated by delimiters. The default field delimiter is a tab, as in the preceding example. This is a convenient choice because when you print out a file that uses tabs to separate fields, the fields automatically line up in columns. However, for files containing many fields, the use of tabs often causes individual records to run over into two lines, which can make the display confusing or unreadable. The use of tabs as a delimiter can also cause confusion because a tab looks just like a collection of spaces. As a result, sometimes it is better to use a different character as the field separator.

To tell cut to treat some other character as the field separator, use the -d (delimiter) option, followed by the character. Separators are often infrequently used characters like the colon (:), percent sign (%), and caret (^).

The /etc/passwd file contains information about users in records using : as the field separator. This example shows how you could use cut to select the login name, user name, and home directory (the first, fifth, and sixth fields) from the /etc/passwd file:

 $ cat /etc/passwd
 root:x:0:0:root:/root:/bin/bash
 dbp:x:944:100:Dave Barker-Plummer:/home/dbp:/bin/bash
 etch:x:945:100:John Etchemendy:/home/etch:/bin/bash
 a-liu:x:946:100:Albert Liu:/home/a-liu:/bin/bash
 $ cut -d: -f1,5-6 /etc/passwd
 root:root:/root
 dbp:Dave Barker-Plummer:/home/dbp
 etch:John Etchemendy:/home/etch
 a-liu:Albert Liu:/home/a-liu

If the delimiter has special meaning to the shell, it should be enclosed in quotes. For example, the following tells cut to print all fields from the second one on, using a space as the delimiter:

 $ cut -d' ' -f2- file
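The two delimiter cases above can be checked without touching any system file. This is a small sketch under stated assumptions: the record is a sample line copied from the /etc/passwd listing above rather than read from the real file, and the space-separated text is invented for illustration.

```shell
# One passwd-style record (sample text, not read from the real /etc/passwd).
record='dbp:x:944:100:Dave Barker-Plummer:/home/dbp:/bin/bash'

# Login name, user name, and home directory: fields 1, 5, and 6.
summary=$(printf '%s\n' "$record" | cut -d: -f1,5-6)

# A space delimiter must be quoted so the shell passes it to cut intact.
rest=$(printf '%s\n' 'alpha beta gamma' | cut -d' ' -f2-)

printf '%s\n' "$summary" "$rest"
```

The output keeps the input delimiter between the selected fields, which is why the colons reappear in the first result.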

Using cut with Columns

Some files arrange information into columns with fixed widths. For example, the long form of the ls command uses spaces to align its output:

 $ ls -l
 -rw-rw-r--    1 jmf      users        2958 Oct  8 13:02 inbox
 -rw-rw-r--    1 jmf      users         553 Oct  8 12:32 save
 -rw-rw-r--    1 jmf      users      464787 Oct  8 13:03 sent

Each of the types of information in this output is assigned a fixed number of characters. In this example, the permissions field consists of the characters in positions 1–10, the size is contained in characters 35–42, and the name field is characters 56 and following. (The size of the columns may vary on different systems.)

The -c (column) option tells cut to identify fields in terms of character positions within a line. The following command selects the size (positions 35–42) and name (positions 56 to end) for each file in the long output of ls:

 $ ls -l | cut -c35-42,56-
     2958 inbox
      553 save
   464787 sent
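Because ls -l column widths differ between systems, character positions are easiest to experiment with on a fixed string. The sketch below assumes a line laid out with the widths quoted in the text (permissions in 1–10, size in 35–42, name from 56); the line is sample text, not live ls output.

```shell
# A fixed-width line with the column layout described in the text (sample text).
line='-rw-rw-r--    1 jmf      users        2958 Oct  8 13:02 inbox'

# Characters 1-10 hold the permissions.
perms=$(printf '%s\n' "$line" | cut -c1-10)

# Characters 35-42 hold the right-aligned size; 56 to the end holds the name.
size_name=$(printf '%s\n' "$line" | cut -c35-42,56-)

# Brackets make the leading padding visible.
printf '[%s]\n' "$perms" "$size_name"
```

With -c, cut joins the selected ranges with nothing between them, so the space that separates the size from the name in the result is simply character 56 of the input.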

colrm

The colrm command is a specialized command that you can use to remove one or more columns from a file or set of files. Although you can use the cut command to do this, colrm is a simple alternative when that is exactly what you need to do. colrm reads standard input, and you specify the range of character positions to remove. For example, the following command deletes the characters in columns 8–12 from the file pangrams:

 $ cat pangrams
 The quick brown fox jumps over the lazy dog.
 The five boxing wizards jump quickly.
 Sphinx of black quartz, judge my vow.
 $ cat pangrams | colrm 8 12
 The quiown fox jumps over the lazy dog.
 The fiving wizards jump quickly.
 Sphinx ack quartz, judge my vow.
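As the text notes, cut can do the same job, which is useful on systems where colrm is not installed. A minimal sketch of the equivalence, assuming the goal is to delete columns 8–12: keep columns 1–7 and everything from column 13 onward.

```shell
# Deleting columns 8-12 is the same as keeping columns 1-7 and 13 onward.
line='The quick brown fox jumps over the lazy dog.'
trimmed=$(printf '%s\n' "$line" | cut -c1-7,13-)
printf '%s\n' "$trimmed"
```

The colrm form states the columns to drop, while the cut form states the columns to keep; which reads better depends on which set is smaller.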

paste

The paste command joins files together line by line. You can use it to create new tables by gluing together fields or columns from two or more files. In this example, paste creates a new file by combining the information in states and state_abbr:

 $ cat states
 Alabama
 Alaska
 Arizona
 Arkansas
 California
 $ cat state_abbr
 AL
 AK
 AZ
 AR
 CA
 $ paste states state_abbr > states.comp
 $ cat states.comp
 Alabama            AL
 Alaska             AK
 Arizona            AZ
 Arkansas           AR
 California         CA

Of course, if the contents of the files do not line up correctly (e.g., if they are not in the same order) the output from paste may not be what you were expecting.

Specifying the paste Field Separator

The paste command separates the parts of the lines it pastes together with a field separator. The default delimiter is tab, but as with cut, you can use the -d (delimiter) option to specify another one if you want. The following command combines the states files with a third file containing the capitals, using a colon as the separator:

 $ paste -d: states state_abbr capitals
 Alabama:AL:Montgomery
 Alaska:AK:Juneau
 Arizona:AZ:Phoenix
 Arkansas:AR:Little Rock
 California:CA:Sacramento
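To compare the default tab separator with -d, the example can be reproduced in a scratch directory. This is a sketch only: it uses the first two lines of each file from the text, and the temporary directory is an assumption made to keep the example self-contained.

```shell
# Recreate shortened versions of the three input files in a scratch directory.
dir=$(mktemp -d)
printf 'Alabama\nAlaska\n'    > "$dir/states"
printf 'AL\nAK\n'             > "$dir/state_abbr"
printf 'Montgomery\nJuneau\n' > "$dir/capitals"

# Default: corresponding lines are joined with a tab.
tabbed=$(paste "$dir/states" "$dir/state_abbr")

# With -d: the same pairing, joined with a colon instead.
coloned=$(paste -d: "$dir/states" "$dir/state_abbr" "$dir/capitals")

printf '%s\n' "$tabbed" "$coloned"
rm -r "$dir"
```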

Using paste with Standard Input

You can use the minus sign (-) to tell paste to use standard input as one of its input “files.” This feature allows you to paste information from a command pipeline or from the keyboard.

For example, the following command will add a new field to each line of the addresses file.

 $ paste addresses - > addresses.new

Here, paste reads each line of addresses and then waits for you to type a line from your keyboard. paste prints the output line to the file addresses.new and then goes on to read the next line of input from addresses.
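The same mechanism works when standard input comes from a pipeline instead of the keyboard. In the sketch below the contents of addresses and the numbers fed through the pipe are invented for illustration.

```shell
# A two-line stand-in for the addresses file (hypothetical contents).
dir=$(mktemp -d)
printf 'alice\nbob\n' > "$dir/addresses"

# The "-" operand makes paste take its second column from the pipeline.
combined=$(printf '101\n102\n' | paste "$dir/addresses" -)

printf '%s\n' "$combined"
rm -r "$dir"
```

Each line read from the pipe is appended, after a tab, to the corresponding line of the file.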

Using cut and paste to Reorganize a File

You can use cut and paste together to reorder the contents of a structured file. A typical use is to switch the order of some of the fields in a file. The following commands switch the second and third fields of the contact-info file:

 $ cut -f1,3 contact-info > temp
 $ cut -f4- contact-info > temp2
 $ cut -f2 contact-info | paste temp - temp2 > contacts.new

The first command cuts fields one and three from contact-info and places them in temp. The second command cuts out the fourth field from contact-info and puts it in temp2. Finally, the last command cuts out the second field and uses a pipe to send its output to paste, which creates a new file, contacts.new, with the fields in the desired order. The result is to change the order of fields from name, username, phone number, room number to name, phone number, username, room number. Note the use of the minus sign to tell paste to put the standard input (from the pipeline) between the contents of temp and temp2.
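The three-step reordering can be verified on a single record. This sketch assumes a scratch directory for the intermediate files and uses one tab-separated line modeled on the contact-info example.

```shell
# One tab-separated record: name, username, phone number, room number.
dir=$(mktemp -d)
printf 'Liu,A\ta-liu\t555-3333\t1J322\n' > "$dir/contact-info"

# Fields 1 and 3 go to temp; field 4 onward goes to temp2.
cut -f1,3 "$dir/contact-info" > "$dir/temp"
cut -f4-  "$dir/contact-info" > "$dir/temp2"

# Field 2 arrives on standard input and is pasted between temp and temp2.
reordered=$(cut -f2 "$dir/contact-info" | paste "$dir/temp" - "$dir/temp2")

printf '%s\n' "$reordered"
rm -r "$dir"
```

The resulting order is name, phone number, username, room number: fields two and three have traded places.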

There is a much easier way to do the swapping of fields illustrated here, using the awk language. You’ll see how in Chapter 21.

join

The join command joins together two existing files on the basis of a key field that contains entries common to both of them. It is similar to paste, but join matches lines according to the key field, rather than simply gluing them together. The key field appears only once in the output.

For example, a jewelry store might use two files to keep information about merchandise, one named merch containing the stock number and description of each item, and one, costs, containing the stock number and cost of each item. The following uses join to create a single file from these two, listing stock numbers, descriptions, and costs. (Here the first field is the key field.)

 $ cat merch
 63A457       man's gold watch
 73B312       garnet ring
 82B119       sapphire pendant
 $ cat costs
 63A457       125.50
 73B312       255.00
 82B119       534.75
 $ join merch costs
 63A457 man's gold watch 125.50
 73B312 garnet ring 255.00
 82B119 sapphire pendant 534.75

The join command requires that both input files be sorted according to the common field on which they are joined.
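When the input files are not already in key order, they can be sorted first. The following is a minimal sketch, assuming shortened one-word descriptions and deliberately shuffled records; the .sorted file names are made up for illustration.

```shell
# Two unsorted files that share a stock number in field one (sample data).
dir=$(mktemp -d)
printf '82B119 pendant\n63A457 watch\n73B312 ring\n'   > "$dir/merch"
printf '73B312 255.00\n82B119 534.75\n63A457 125.50\n' > "$dir/costs"

# Sort both files on the key field before joining them.
sort -k 1,1 "$dir/merch" > "$dir/merch.sorted"
sort -k 1,1 "$dir/costs" > "$dir/costs.sorted"

joined=$(join "$dir/merch.sorted" "$dir/costs.sorted")

printf '%s\n' "$joined"
rm -r "$dir"
```

Both files should be sorted under the same locale settings that join runs with, since the two commands need to agree on the ordering of keys.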

Specifying the join Field

By default, join uses the first field of each input file as the common field. You can specify which fields to use with the -j (join) option: -j1 names the field in the first file and -j2 the field in the second. (POSIX spells these options -1 and -2.) The following command tells join to join the files using field 2 in the first file and field 3 in the second file:

 $ join -j1 2 -j2 3 ss_no personnel > new_data
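A small sketch of joining on non-default fields, written with the POSIX options -1 and -2, which select the join field in the first and second file respectively. The file names a and b and their contents are invented for illustration; both files are already sorted on their key fields.

```shell
dir=$(mktemp -d)
# First file: the key is in field 2.
printf 'Ann 1001\nBob 1002\n' > "$dir/a"
# Second file: the key is in field 3.
printf 'sales NY 1001\nops LA 1002\n' > "$dir/b"

# Join field 2 of the first file against field 3 of the second.
joined=$(join -1 2 -2 3 "$dir/a" "$dir/b")

printf '%s\n' "$joined"
rm -r "$dir"
```

The key is printed first, followed by the remaining fields of the first file and then those of the second.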

Specifying Field Separators

The join command treats any white space (a space or tab) in the input as a field separator and uses the space character as the default delimiter in the output. You can change the field separator with the -t (tab) option. The following command joins the data in the system files /etc/passwd and /etc/group, both of which use a colon as their field separator. The colon is also used as the delimiter for the output.

 $ join -t: /etc/passwd /etc/group > all_data
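The effect of -t can be seen on two small colon-separated files. This is a sketch with made-up file names (uids and homes) and sample records, so that it does not depend on the contents of the real system files.

```shell
dir=$(mktemp -d)
# Two colon-separated files keyed on login name, already in sorted order.
printf 'dbp:944\netch:945\n'              > "$dir/uids"
printf 'dbp:/home/dbp\netch:/home/etch\n' > "$dir/homes"

# -t: sets the separator for both the input and the output.
joined=$(join -t: "$dir/uids" "$dir/homes")

printf '%s\n' "$joined"
rm -r "$dir"
```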

Unfortunately, the option letter that join uses to specify the delimiter (-t) is different from the one (-d) that is used by cut, paste, and several other UNIX System commands.