

Project 57. Edit Text Files

"How do I quickly strip blank lines from hundreds of files?"

Learn More

Project 6 explains the concept of input/output redirection.


This project introduces utilities that apply simple transformations to text files, such as translating case, removing excessive white space, stripping blank lines, folding long lines, and converting between space and tab characters. It covers the commands tr, expand, unexpand, fold, and fmt. When you need to modify a file in a more complex manner, consider using either sed, covered in Projects 59 and 61, or awk, covered in Projects 60 and 62.

Change File Content

The tr command searches its input for specific characters and translates them into other characters. It takes as its arguments two strings, translating characters found in the first string into the corresponding characters from the second string. Rather oddly, tr does not take filename arguments but always reads its standard input and writes to its standard output. We'll get around this by using input/output redirection.

Let's convert the contents of the file jill.txt to be all uppercase, reading jill.txt as standard input and writing to loud.txt as standard output.

$ cat jill.txt
She likes black. I'm not sure if she has ambitions to be a Goth, or an undertaker. Just a passing phase I suspect - hearse today, gone tomorrow.
$ tr 'abcdefghijklmnopqrstuvwxyz' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' <jill.txt >loud.txt
$ cat loud.txt
SHE LIKES BLACK. I'M NOT SURE IF SHE HAS AMBITIONS TO BE A GOTH, OR AN UNDERTAKER. JUST A PASSING PHASE I SUSPECT - HEARSE TODAY, GONE TOMORROW.


The tr command understands character classes such as "all lowercase characters" or "all printable characters." Therefore, we can shorten our command to

$ tr '[:lower:]' '[:upper:]' <jill.txt >loud.txt
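A quick way to experiment with character classes is to feed tr a line of text from printf instead of a file. The following is a minimal sketch, assuming a POSIX shell; the sample text and the variable names are invented for illustration.

```shell
# Sample text is invented for illustration; any input works.
sample='Hearse today, gone tomorrow.'

# Translate lowercase to uppercase using character classes.
upper=$(printf '%s\n' "$sample" | tr '[:lower:]' '[:upper:]')
printf '%s\n' "$upper"

# Swap the two classes to translate in the other direction.
lower=$(printf '%s\n' "$sample" | tr '[:upper:]' '[:lower:]')
printf '%s\n' "$lower"
```

Characters that belong to neither class, such as spaces and punctuation, pass through unchanged.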


Read the man page for tr for a list of the classes it recognizes.

Convert from Mac to Unix

A typical use for tr is to convert an old-style Mac file to a Unix-compliant file. Mac OS 9 used a Return character (ASCII code 13) to mark the end of a line, whereas Unix uses a Newline character (ASCII code 10). A Mac-style text file appears to consist of one very long line when viewed by a Unix text-processing utility or editor.

Suppose that we have files imported from Mac OS 9 and wish to make them play nice in a Unix environment. We use tr to translate Return, represented by the special sequence \r, into Newline, represented by \n.

$ tr '\r' '\n' < mac-file > unix-file
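To see the effect without an old Mac file at hand, you can build one with printf. This sketch works in a scratch directory created by mktemp; the file contents are invented, and the variable names are assumptions made for the example.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# Build a small file whose lines end in Return (\r) only.
printf 'line one\rline two\rline three\r' > "$workdir/mac-file"

# wc -l counts Newline characters, so it finds none in the Mac-style file.
before=$(wc -l < "$workdir/mac-file" | tr -d ' ')

# Translate every Return into a Newline.
tr '\r' '\n' < "$workdir/mac-file" > "$workdir/unix-file"

# The translated file now has three Newline-terminated lines.
after=$(wc -l < "$workdir/unix-file" | tr -d ' ')
echo "lines before: $before, lines after: $after"
```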


Write Back to the Original File

You cannot redirect output back to the file being read: the shell empties the output file before the command starts reading it, so the contents would be lost. However, the following trick, which uses a semicolon to separate two commands on a single line, will produce that effect. The command before the semicolon redirects translated output from mac-file to a new file called tmp. When that command completes, the mv command renames tmp to mac-file, overwriting the original file with a translated replacement.

$ tr '\r' '\n' < mac-file > tmp; mv tmp mac-file
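The same trick extends to many files with a shell loop. The sketch below is one possible variation, not the command shown above: it asks mktemp for a unique temporary name instead of using a fixed file called tmp, and it replaces the original only if tr succeeds. The function name convert_in_place and the sample files are hypothetical.

```shell
# convert_in_place is a hypothetical helper name.
# It translates Return to Newline in each file given as an argument,
# replacing the original only when tr exits successfully.
convert_in_place() {
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        if tr '\r' '\n' < "$f" > "$tmp"; then
            mv "$tmp" "$f"
        else
            rm -f "$tmp"
            return 1
        fi
    done
}

# Demonstration in a scratch directory with two invented files.
demo=$(mktemp -d)
printf 'a\rb\r' > "$demo/one.txt"
printf 'c\rd\r' > "$demo/two.txt"
convert_in_place "$demo"/*.txt
cat "$demo/one.txt" "$demo/two.txt"
```

Because mv replaces the file rather than rewriting it, the result takes on the permissions of the temporary file, which may differ from those of the original.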


Strip Lines and Characters

In this section, we look at ways to tidy up files. You might want to strip out nonprinting characters, for example, or remove excessive white space.

Tip

If you wish to know exactly which characters are included in a particular class, check out the Section 3 man page for the corresponding library function. The library function is named like its class but starts with is. To read about character class [:space:], for example, look at the man page for isspace by typing

$ man 3 isspace



Let's start by removing excessive white space, which we define as two or more consecutive spaces, tabs (\t), or Newlines (\n), from the file spaced. We employ the tr command again, with option -s to squeeze repeated occurrences of selected characters into a single occurrence, and direct the translated output to the file squashed.

Tip

Employ commands such as grep to edit files. Here, we search for lines that are empty, using the regular expression ^$, and report on all lines that do not match (are not empty). The results are written back to the original file via the temporary-file trick shown in the main text.

$ grep -v '^$' space > tmp; mv tmp space


Project 23 covers grep, and Projects 77 and 78 explain regular expressions.


$ tr -s ' \t\n' <spaced >squashed


Alternatively, if we accept the definition of white space given by man 3 isspace, we can achieve the same effect by typing

$ tr -s '[:space:]' <spaced >squashed
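A small experiment shows what squeezing does. This is a sketch with invented sample text; the variable names squashed and noblank are assumptions for the example.

```shell
# Invented sample: doubled spaces, a run of tabs, and two blank lines.
squashed=$(printf 'one  two\t\t\tthree\n\n\nfour\n' | tr -s ' \t\n')
printf '%s\n' "$squashed"

# Squeezing only Newline is one way to strip blank lines
# (lines that are truly empty, with no spaces or tabs on them).
noblank=$(printf 'alpha\n\n\nbeta\n\ngamma\n' | tr -s '\n')
printf '%s\n' "$noblank"
```

Note that tr -s collapses only runs of the same character, so a space followed by a tab is left as it is. A blank line at the very start of a file also survives, because one Newline always remains from each run.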


Next, suppose that you have a file containing control and other nonprinting characters that you wish to remove. Let's view the file by using cat and option -v to display nonprinting characters visibly (for example, Control-a is displayed by ^A).

$ cat -v control
abc^A^B^C   def^D^E^F   ghi^G^H jkl uvw^U^V^W   xyz


To remove all nonprinting characters, use tr. As a first attempt, try applying it with option -d, which deletes specified characters. Use the class [:print:], which specifies all printing characters (the ones you want to preserve), but then use option -c to specify the inverse of the class (everything that isn't in the class):

$ tr -cd '[:print:]' <control
abc   def   ghijkluvw   xyz$


This deletes all nonprinting characters, all right, but unfortunately, those characters include some, such as Tab and Newline, that are essential to text formatting. We can get around this problem by adding the class [:space:] to the selected characters. Our next attempt deletes all characters that are not printable but leaves behind the "white space" that provides formatting.

$ tr -cd '[:print:][:space:]' <control
abc   def   ghi   jkl uvw   xyz
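The difference between the two attempts is easy to reproduce on a short string. The sketch below uses invented input containing Control-A (\001), Control-B (\002), a Tab, and two Newlines; the variable names are assumptions.

```shell
# First attempt: keep only printing characters.
# The Tab and both Newlines are deleted along with the control characters.
raw=$(printf 'abc\001\002 def\tghi\njkl\n' | tr -cd '[:print:]')
printf '%s\n' "$raw"

# Second attempt: keep printing characters and white space.
# Only Control-A and Control-B are deleted.
kept=$(printf 'abc\001\002 def\tghi\njkl\n' | tr -cd '[:print:][:space:]')
printf '%s\n' "$kept"
```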


Expand and Unexpand

The command expand expands tab characters into the appropriate number of spaces; the command unexpand does the reverse. Pass option -a to unexpand to ensure that all spaces are converted; otherwise, only leading spaces are converted.
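As a rough illustration of the difference that -a makes, the sketch below expands a line containing two tabs and then converts it back both ways. The sample line and variable names are invented, and the default tab stops of every eight columns are assumed.

```shell
# Invented sample line: a leading tab and a tab between words.
line=$(printf '\tif x\ty')

# expand replaces each tab with spaces up to the next tab stop.
expanded=$(printf '%s\n' "$line" | expand)

# unexpand without options converts only the leading spaces back to a tab.
leading=$(printf '%s\n' "$expanded" | unexpand)

# unexpand -a also converts the run of spaces in the middle of the line.
all=$(printf '%s\n' "$expanded" | unexpand -a)

# Show each tab as '>' so the difference is visible on screen.
printf '%s\n' "$expanded" "$leading" "$all" | tr '\t' '>'
```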

Files containing long lines, such as those often found in HTML source code, can have their contents broken into shorter lines with the fold command. Here's the original file (which is one long line, but shown split across four lines in the book).

$ cat count
He puzzled at my counting. Did I exceed his range? Apparently, there is an African tribe who count one, two, many. So perhaps he's related. More likely he's related to the African tribe who have "one too many".


We specify that the output lines should be no more than 50 characters in length by using the option -w50. The output is shown displayed on the Terminal screen, but we could redirect it to a file or back to the original file by using the technique described in "Write Back to the Original File" earlier in this project.

$ fold -w50 count
He puzzled at my counting. Did I exceed his range? Apparently, there is an African tribe who count o ne, two, many. So perhaps he's related. More likely he's related to the African tribe who have "one too many".


Alternatively, we might use the command fmt, which breaks lines at spaces instead of midword.

$ fmt -50 count
He puzzled at my counting. Did I exceed his range? Apparently, there is an African tribe who count one, two, many. So perhaps he's related. More likely he's related to the African tribe who have "one too many".
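To compare the two commands side by side, try them on a short line with a narrow width. This is a sketch with invented text and a width of 20 chosen purely for the demonstration.

```shell
# Invented sample: one 48-character line of short words.
text='one two three four five six seven eight nine ten'

# fold cuts at exactly 20 characters, even in the middle of a word.
folded=$(printf '%s\n' "$text" | fold -w20)
printf '%s\n' "$folded"

echo '---'

# fmt keeps words whole and fills each line up to at most 20 characters.
formatted=$(printf '%s\n' "$text" | fmt -20)
printf '%s\n' "$formatted"
```

The fold output begins with "one two three four f", splitting the word five, whereas every line of the fmt output ends at a word boundary.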


Just for Fun

Here's a command that parses file.txt and produces a count of the ten most-used words. See Project 26 for examples of how to use the sort and uniq commands.

$ tr -cs "[:alpha:]\'" '\n' < file.txt | sort | uniq -c | sort -nr | head -10
75 the
46 and
45 to
...
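The pipeline is easier to follow on a tiny input. The sketch below wraps a variation of it in a hypothetical function named top_words. It adds two steps that are not in the command above: a first tr that folds everything to lowercase so that "The" and "the" are counted together, and a final awk that trims the padding uniq -c puts before each count.

```shell
# top_words is a hypothetical helper: it reads text on standard input
# and prints the N most frequent words, where N is the first argument.
top_words() {
    # 1. fold case, 2. put one word per line, 3. count distinct words,
    # 4. sort by count and keep the top N, 5. normalize the spacing.
    tr '[:upper:]' '[:lower:]' |
    tr -cs "[:alpha:]'" '\n' |
    sort | uniq -c |
    sort -nr | head -n "$1" |
    awk '{ print $1, $2 }'
}

# Invented one-line sample; a real run would redirect input from a file.
sample='The cat and the dog and the bird'
top=$(printf '%s\n' "$sample" | top_words 2)
printf '%s\n' "$top"
```

For this sample the two lines printed are "3 the" and "2 and".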



Tip

The fmt command does much more than just break lines. Check its man page by typing

$ man fmt






Mac OS X UNIX 101 Byte-Sized Projects
ISBN: 0321374118
EAN: 2147483647
Year: 2003
Pages: 153
Authors: Adrian Mayo
