Project 29. File-Content Tips | Mac OS X Unix 101 Byte-Sized Projects

Project 29. File-Content Tips

"Is there an easy way to format the contents of text files?"

This project gives you tips for detecting the type of content a file contains and introduces some handy text-processing utilities.

Tip

If a file's content is human-readable, the file command always includes the word text somewhere in its description. You can use this fact to filter a list of files (with grep, for example), leaving all and only those that are human-readable.

Determine File Content

The file command tells you the type of content a file contains.

$ file *
about-html.txt:  ASCII text
fake.html:       empty
index.html:      ASCII HTML document text
letter.doc:      ASCII English text
nodif:           a /bin/tcsh script text executable
smtp-auth-plain: a /usr/bin/perl script text executable
unix2mac:        a /bin/bash script text executable
week1:           directory
week1.tar:       POSIX tar archive
week1.tbz2:      bzip2 compressed data, block size = 900k

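Specify option -i if you would like the file type displayed in MIME format. (Some later versions of file on Mac OS X use -I or --mime for this instead; check man file if -i gives unexpected output.)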

$ file -i *
about-html.txt:  text/plain; charset=us-ascii
fake.html:       application/x-empty
index.html:      text/html; charset=us-ascii
letter.doc:      text/plain, English; charset=us-ascii
nodif:           application/x-shellscript
smtp-auth-plain: application/x-perl
unix2mac:        application/x-shellscript
week1:           application/x-not-regular-file
week1.tar:       application/x-tar, POSIX
week1.tbz2:      application/octet-stream

It's Magic

The file command determines a file's type by examining its magic number, a value stored near the beginning of the file that is intended to identify the file type to the Unix operating system. If a file does not have a recognizable magic number, file scans its content instead. In both cases, it consults the magic file, /usr/share/file/magic, which maps magic numbers and content patterns to file types.

In Mac OS X, Mach-O executables begin with the hexadecimal number feedface.

Can you find the file type that begins cafebabe? (cafe gives us a clue.)
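To look at a magic number yourself, you can dump the first few bytes of a file in hexadecimal. The following is a minimal sketch that assumes the standard od utility; the function name magic_hex and the sample file are made up for the illustration.

```shell
# magic_hex FILE: print the first four bytes of FILE as hexadecimal digits.
# (magic_hex is a hypothetical helper name, not a standard command.)
magic_hex() {
    # -A n: no offset column; -t x1: one hex byte at a time; -N 4: read 4 bytes
    od -A n -t x1 -N 4 "$1" | tr -d ' \n'
    echo
}

# Build a small sample file that starts with the bytes fe ed fa ce,
# written as octal escapes so that any printf understands them.
sample="${TMPDIR:-/tmp}/magic-sample.$$"
printf '\376\355\372\316rest of file' > "$sample"

magic_hex "$sample"    # prints feedface
rm -f "$sample"
```

Running magic_hex on a file of your own shows the bytes that file compares against its magic file.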

Search for Files with a Specific Type of Content

We can pipe the results from file to grep to look for files with specific content.

$ file * | grep -i html
about-html.txt: ASCII text
fake.html:      empty
index.html:     ASCII HTML document text

Learn More

Refer to Project 23 to learn more about grep.

Project 77 covers regular expressions.

This simple approach suffers from a problem: If the filename contains the search term, it will match too, regardless of the content. We must add a little sophistication to the search term to absorb everything from the beginning of the line to the colon after the filename, using a regular expression such as "^.*:", and then search for html.

$ file * | grep -i "^.*:.*html"
index.html:      ASCII HTML document text

The regular expression searches from the start of a line (^) for anything (.*) followed by a colon and then anything followed by html.
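The difference between the two searches is easy to see on canned input. This sketch feeds grep three lines in the same shape as the output of file, so it does not depend on any real files; the variable name sample is just for the illustration.

```shell
# Three lines in the same shape as the output of file, used as canned input
# so the two patterns can be compared without depending on real files.
sample='about-html.txt: ASCII text
fake.html:      empty
index.html:     ASCII HTML document text'

# Plain search: matches every line, because each filename contains "html".
printf '%s\n' "$sample" | grep -i 'html'

# Anchored search: "html" must appear after the colon, so only the
# description is tested and only the index.html line is printed.
printf '%s\n' "$sample" | grep -i '^.*:.*html'
```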

Process Files with a Specific Content Type

It's easy to extend the pipeline above so that it passes the list of filenames to a command, such as one that opens them in Apple's TextEdit.

To do this, we use awk to pass on just the filename, which is the first field of each line.

$ file * | grep -i "^.*:.*html" | awk '{print $1}'
index.html:

Then we use sed to chop off the colon.

$ file * | grep -i "^.*:.*html" | awk '{print $1}' ¬
    | sed 's/://'
index.html

Finally, we use xargs to form a command line from the list of files.

$ file * | grep -i "^.*:.*html" | awk '{print $1}' ¬
    | sed 's/://' | xargs open -a textedit

In this example, the command line will be

open -a textedit index.html

The command open -a launches the specified GUI application, so TextEdit opens index.html.
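Before letting xargs launch anything, it can be useful to preview the command line it will build. A minimal sketch, assuming canned input in place of file * so that it runs anywhere: put echo in front of the command, and xargs prints the command line instead of running it.

```shell
# Canned lines standing in for the output of: file *
sample='about-html.txt: ASCII text
fake.html:      empty
index.html:     ASCII HTML document text'

# Same pipeline as above, but echo makes xargs print the command line
# instead of running open.
printf '%s\n' "$sample" \
    | grep -i '^.*:.*html' \
    | awk '{print $1}' \
    | sed 's/://' \
    | xargs echo open -a textedit
# prints: open -a textedit index.html
```

Note that awk '{print $1}' splits on whitespace, so a filename that contains a space would be cut short at the first space.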

Learn More

Project 18 explains the use of xargs.

Projects 59 and 60 cover sed and awk.

An alternative approach uses option -F, telling file to separate the filename from the content type with space-colon instead of just colon. Consequently, the first field seen by awk will be the filename without the colon.

$ file -F " :" * | grep -i "^.*:.*html" ¬
    | awk '{print $1}' | xargs open -a textedit
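A further variation, offered as a sketch rather than as the method described here: awk can split on the colon itself by using its own -F option, so neither the sed step nor file -F is needed, and a filename containing spaces stays in one piece. It assumes that filenames contain no colon, and it uses canned input (the sample variable and its filenames are made up) so that it runs without real files.

```shell
# Canned lines standing in for the output of: file *
sample='my page.html:  ASCII HTML document text
notes.txt:     ASCII text'

# awk -F: makes the colon the field separator, so $1 is the whole filename
# (spaces included) and $2 is the description.
printf '%s\n' "$sample" | awk -F: 'tolower($2) ~ /html/ {print $1}'
# prints: my page.html
```

Passing such names on to xargs still needs care, because xargs also splits its input on whitespace by default.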

Search Compressed Files

Option -z tells file to look inside compressed files. Compare the output of the next two examples.

$ file week1.tbz2
week1.tbz2: bzip2 compressed data, block size = 900k
$ file -z week1.tbz2
week1.tbz2: POSIX tar archive (bzip2 compressed data, block size = 900k)

Expand and Unexpand Tabs

The expand command expands tab characters to the appropriate number of spaces, and unexpand does the reverse. Pass option -a to unexpand to ensure that all spaces are converted; otherwise, only leading spaces are converted.
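A short sketch of both commands, using printf to generate the input so that no sample file is needed. The od -c step is only there to make the tab character visible.

```shell
# expand: the leading tab becomes eight spaces (tab stops every 8 columns).
printf '\tindented\n' | expand

# -t 4 sets tab stops every four columns instead.
printf '\tindented\n' | expand -t 4

# unexpand: eight leading spaces become one tab; od -c shows the tab as \t.
printf '        indented\n' | unexpand | od -c
```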

Fold Long Lines

Long lines can be broken into shorter lines by the fold command. In this example, the output has lines of no more than 40 characters. Output is displayed on the terminal screen; to save the results, simply redirect output to a file by using > name-of-output-file.

$ cat longlines
this is a file with one very long line and no linefeeds in it to demonstrate the use of fold to break long lines into the specified width
$ fold -w40 longlines
this is a file with one very long line a
nd no linefeeds in it to demonstrate the
 use of fold to break long lines into th
e specified width
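If the midword breaks are a problem, fold also accepts option -s, which breaks each line after the last space that fits. The sketch below recreates the longlines file and saves the folded result; the output name folded.txt is made up for the example.

```shell
# Recreate the sample file used in the example above.
echo 'this is a file with one very long line and no linefeeds in it to demonstrate the use of fold to break long lines into the specified width' > longlines

# -s breaks each line after the last space that fits within the width,
# so words are not cut in half.
fold -s -w 40 longlines

# Redirect the output to save it (folded.txt is a made-up name).
fold -s -w 40 longlines > folded.txt
```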

Tip

The fmt command does much more than break lines. Read its man page by typing

$ man fmt

The fmt command is more sophisticated and breaks lines at spaces instead of midword.

$ fmt -40 longlines
this is a file with one very long line
and no linefeeds in it to demonstrate
the use of fold to break long lines into
the specified width
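One thing fmt does beyond breaking lines is join short lines and refill the paragraph. The following sketch compares it with fold on a few ragged lines; the sample text is made up, and exact break positions can differ between versions of fmt.

```shell
# Three short, ragged lines that belong to one paragraph.
ragged='this is a file
with some very short lines that
should be joined and then refilled to a sensible width'

# fold only breaks long lines; it never joins short ones.
printf '%s\n' "$ragged" | fold -w 40

# fmt joins the lines of the paragraph and refills them, breaking only
# between words; -w 40 asks for lines of at most 40 characters.
printf '%s\n' "$ragged" | fmt -w 40
```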

Split Large Files

Use the split command to split a long file into many smaller files. By default, each piece is 1,000 lines long and the pieces are named xaa, xab, and so on. Specify option -l followed by a line count to change the size of the pieces.
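A small sketch of split at work, using a five-line file so that the result is easy to check. The directory split-demo, the file numbers.txt, and the prefix part_ are all made-up names for the illustration.

```shell
# Work in a scratch directory so the pieces are easy to find and remove.
mkdir -p split-demo

# Create a five-line sample file.
printf 'one\ntwo\nthree\nfour\nfive\n' > split-demo/numbers.txt

# Split it into pieces of two lines each. The last argument is the
# filename prefix; split appends aa, ab, ac, ... to it.
split -l 2 split-demo/numbers.txt split-demo/part_

# List the pieces: part_aa, part_ab, and part_ac.
ls split-demo/part_*

# Joining the pieces in order gives back the original (cmp prints nothing).
cat split-demo/part_* | cmp - split-demo/numbers.txt
```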