Sorting the Contents of Files | UNIX: The Complete Reference, Second Edition (Complete Reference Series)

The UNIX command sort is a powerful, general-purpose tool for sorting information in a file or as part of a command pipeline. It is sometimes used with uniq, a command that identifies and removes duplicate lines from sorted data. The sort and uniq commands can operate on either whole lines or specific fields.

sort

The sort command orders or reorders the lines of a file. In the simplest form, all you need to do is give it the name of the file to sort, and it will print the lines from the file in ASCII order. This example shows how you could use sort to put a list of names into alphabetical order:

 $ sort names
 cunningham, j.p.
 lewis,s.h.
 long, s.
 rosen,k.h.
 rosinski,r.r.
 wiseman, s.

You can use sort to combine the contents of several files into a single sorted file. The following command creates a file names.all containing all of the names in three input files, sorted in alphabetical order:

 $ sort names.work names.class names.personal > names.all

The -o (output) option tells sort to save the results to a file. For example, this command will sort commandlist and replace its contents with the sorted output:

 $ sort -o commandlist commandlist

Be careful: you cannot just redirect the output of sort to the original file. Because the shell creates (and so truncates) the output file before it runs sort, the following command would empty the original file before sort had a chance to read it:

 $ sort commandlist > commandlist         # File will be emptied!
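The difference between the two commands can be checked safely on a scratch copy. The following is a minimal sketch, not part of the original example: the file contents and the name clobbered are made up for illustration, and mktemp -d is assumed to be available.

```shell
# Scratch directory so that no real file is touched (mktemp -d is assumed to be available).
dir=$(mktemp -d)

# Safe: sort reads all of its input before it writes the -o file.
printf 'who\nls\ncat\n' > "$dir/commandlist"
sort -o "$dir/commandlist" "$dir/commandlist"
safe_result=$(cat "$dir/commandlist")

# Unsafe: the shell truncates the file before sort gets to read it.
printf 'who\nls\ncat\n' > "$dir/clobbered"
sort "$dir/clobbered" > "$dir/clobbered"
if [ -s "$dir/clobbered" ]; then clobbered=no; else clobbered=yes; fi

printf '%s\n' "$safe_result"
printf 'file emptied by redirect: %s\n' "$clobbered"
rm -r "$dir"
```

The first file ends up holding its three lines in sorted order, while the second is left empty.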

Alternative Sorting Rules

By default, sort sorts its input according to the order of characters in the ASCII character set. This is similar to alphabetical order, with the difference that all uppercase letters precede any lowercase letters. In addition, numbers are sorted by their ASCII representation, not their numerical value, so 100 precedes 20, and so forth.
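On many current systems sort follows the collation rules of the user's locale rather than strict ASCII order, so the behavior described here is easiest to reproduce by setting LC_ALL=C for the command. A small sketch with made-up input:

```shell
# LC_ALL=C requests plain byte (ASCII) ordering regardless of the user's locale.
# The input lines are invented for illustration.
ascii_order=$(printf 'holmdel\nSummit\n20\n100\nLincroft\n' | LC_ALL=C sort)
printf '%s\n' "$ascii_order"
```

The digits come first (with 100 ahead of 20), then the capitalized names, then the lowercase name.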

Several options allow you to change the rule that sort uses to order its output. These include options to ignore case, sort in numerical order, and reverse the order of the sorted output. You can also tell sort which column or field of a file to act on, and whether or not to include duplicate lines in the output.

Table 19–3 summarizes the most common options for sort.

Table 19–3: Options for sort
Option	Mnemonic	Effect
-d	Dictionary	Sort on letters, digits, and blanks only.
-f	Fold	Ignore uppercase and lowercase distinctions.
-n	Numeric	Sort by numeric value, in ascending order.
-r	Reverse	Reverse order of output.
-o filename	Output	Send output to a file.
-u	Unique	Eliminate duplicate lines in output.

Ignore Case. You can get a more normal alphabetical ordering with the -f (fold) option that tells sort to ignore the differences between uppercase and lowercase versions of the same letter. The following example shows how the output of sort changes when you use the -f option:

 $ sort locations
 Lincroft
 Summit
 holmdel
 middletown
 $ sort -f locations
 holmdel
 Lincroft
 middletown
 Summit

Numerical Sorting. To tell sort to sort numbers by their numerical value, use the -n (numeric) option. Here's an example of how the -n option changes the output of sort. This uses wc -c to get the size of each file named in the output from ls (the backquotes substitute the output of ls into the wc command line) and then pipes the list of sizes and files to sort.

 $ wc -c `ls` | sort
 100             Palo Alto
 12              Fox Island
 130             Seattle
 22              Rumson
 4               Santa Monica
 $ wc -c `ls` | sort -n
 4               Santa Monica
 12              Fox Island
 22              Rumson
 100             Palo Alto
 130             Seattle

Reverse Order. The -r (reverse) option tells sort to reverse the order of its output. In the previous example, the -r option could be combined with -n to list the largest files first, like this:

 $ wc -c `ls` | sort -rn
 130             Seattle
 100             Palo Alto
 22              Rumson
 12              Fox Island
 4               Santa Monica

Sorting by Column or Field. The sort command provides a way for you to specify the field or column to use for its comparisons. You do this by telling sort to skip one or more fields or columns. For example, the following command ignores the first column of the output from file, so it sorts by the second column, which is the file type.

 $ file * | sort +1
 notes:      ASCII English text
 tmp:        ASCII English text
 mbox:       ASCII mail text
 bin:        directory
 Desktop:    directory
 Mail:       directory
 zwrite:     symbolic link to /home/raf/scripts/Python/zwrite.py

Like cut, sort allows you to specify an alternative field separator. You do this with the -t (tab) option. The following command tells sort to skip the first four fields in a file that uses a colon (:) as a field separator:

 $ sort -t: +4 /etc/passwd
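The +N notation is the historical way to skip fields, and some current versions of sort no longer accept it. POSIX specifies the -k option instead, which names the field where the sort key starts, counting from 1, so skipping four fields corresponds to -k 5. The following sketch uses three invented passwd-style records rather than the real /etc/passwd:

```shell
# Three invented records in /etc/passwd format; field 5 holds the full name.
records='raf:x:1001:100:Ray Fox:/home/raf:/bin/sh
liz:x:1002:100:Liz Adams:/home/liz:/bin/sh
sam:x:1003:100:Sam Cole:/home/sam:/bin/sh'

# -t: sets the field separator; -k 5 starts the key at the fifth field,
# the same position that +4 reaches by skipping four fields.
by_name=$(printf '%s\n' "$records" | LC_ALL=C sort -t: -k 5)
printf '%s\n' "$by_name"
```

The records come out ordered by the name field (Liz Adams, Ray Fox, Sam Cole) rather than by login name.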

Suppressing Repeated Lines. Sorting often reveals that a file contains multiple copies of the same line. The next section describes the uniq command, which is designed to remove repeated lines from input files. But because this is such a common sorting task, sort also provides an option, -u (unique), that removes repeated lines from its output. Repeated lines are likely to occur when you combine and sort data from several different files into a single file. For example, if you have several lists of e-mail addresses, you may want to create a single file containing all of them. The following command uses the -u option to ensure that the resulting file contains only one copy of each address:

 $ sort -u names.* > uniq-names

uniq

The uniq command filters or removes repeated lines from files. It is usually used with files that have first been sorted by sort. In its simplest form it has the same effect as the -u option to sort, but uniq also provides several useful options of its own.

The following example illustrates how you can use uniq as an alternative to the -u option of sort:

 $ sort names.* | uniq > names
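That the two forms give the same result can be checked on a small list. This is only a sketch; the addresses are invented and are passed in directly rather than read from the names.* files:

```shell
# Invented address list with one repeated entry.
addresses='raf@example.com
liz@example.com
raf@example.com'

# Remove duplicates in one step with sort -u, and in two steps with sort | uniq.
with_u=$(printf '%s\n' "$addresses" | sort -u)
with_uniq=$(printf '%s\n' "$addresses" | sort | uniq)

# Both variables hold the same two lines.
printf '%s\n' "$with_u"
```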

Counting Repetitions

One of the most valuable uses of uniq is in counting the number of occurrences of each line. This is a very convenient way to collect frequency data. The following illustrates how you could use uniq along with cut and sort to produce a listing of the number of entries for each ZIP code in a mailing list:

 $ cut -f6 mail.list
 07760
 07733
 07733
 07760
 07738
 07760
 07731
 $ cut -f6 mail.list | sort | uniq -c | sort -rn
 3 07760
 2 07733
 1 07738
 1 07731

The preceding pipeline uses four commands: The first cuts the ZIP code field from the mailing list file. The second uses sort to group identical lines together. The third uses uniq -c to remove repeated lines and add a count of how many times each line appeared in the data. The final sort -rn arranges the lines numerically (n) in reverse order (r), so that the data is displayed in order of descending frequency.
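The counting stages can be tried without a mailing list file. The sketch below feeds in the seven ZIP codes from the example directly, so the cut step is left out, and then uses awk (an addition for illustration, not part of the original pipeline) to pick out the most frequent value:

```shell
# The seven ZIP codes from the example, supplied directly instead of cut from mail.list.
zips=$(printf '%s\n' 07760 07733 07733 07760 07738 07760 07731)

# sort groups equal lines, uniq -c counts each group, sort -rn ranks by count.
counts=$(printf '%s\n' "$zips" | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"

# The most frequent ZIP code is the second column of the first line.
top=$(printf '%s\n' "$counts" | awk 'NR == 1 { print $2 }')
printf 'most frequent: %s\n' "$top"
```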

Finding Repeated and Nonrepeated Lines

uniq can also be used to show which lines occur more than once and which occur only once. The -d (duplicate) option tells uniq to show only repeated lines, and the -u (unique) option prints only lines that appear exactly once. Because uniq compares only adjacent lines, the input must be sorted first. For example, the following shows ZIP codes that appear only once in the mailing list from the preceding example:

 $ cut -f6 mail.list | sort | uniq -u
 07731
 07738
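The -d option can be tried the same way. The following sketch supplies the seven ZIP codes from the earlier example directly (rather than reading mail.list) and also shows what uniq -u reports when the sort step is left out:

```shell
# The seven ZIP codes from the earlier example, in their original (unsorted) order.
zips=$(printf '%s\n' 07760 07733 07733 07760 07738 07760 07731)

# Without sort, the three copies of 07760 are never adjacent,
# so uniq -u reports each of them as if it appeared only once.
unsorted_once=$(printf '%s\n' "$zips" | uniq -u)

# With sort, equal lines are adjacent and both options behave as described.
once=$(printf '%s\n' "$zips" | sort | uniq -u)
repeated=$(printf '%s\n' "$zips" | sort | uniq -d)

printf 'appear once:\n%s\n' "$once"
printf 'repeated:\n%s\n' "$repeated"
```

On sorted input, -u lists 07731 and 07738 and -d lists 07733 and 07760. On the unsorted input, -u prints five lines, three of them 07760.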