Project 29. File-Content Tips

"Is there an easy way to format the contents of text files?"

This project gives you tips for detecting the type of content a file contains and introduces some handy text-processing utilities.
Determine File Content

The file command tells you the type of content a file contains.

$ file *
about-html.txt: ASCII text
fake.html: empty
index.html: ASCII HTML document text
letter.doc: ASCII English text
nodif: a /bin/tcsh script text executable
smtp-auth-plain: a /usr/bin/perl script text executable
unix2mac: a /bin/bash script text executable
week1: directory
week1.tar: POSIX tar archive
week1.tbz2: bzip2 compressed data, block size = 900k

Specify option -i if you would like the file type displayed in MIME format.

$ file -i *
about-html.txt: text/plain; charset=us-ascii
fake.html: application/x-empty
index.html: text/html; charset=us-ascii
letter.doc: text/plain, English; charset=us-ascii
nodif: application/x-shellscript
smtp-auth-plain: application/x-perl
unix2mac: application/x-shellscript
week1: application/x-not-regular-file
week1.tar: application/x-tar, POSIX
week1.tbz2: application/octet-stream
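To try the command without depending on whatever happens to be in the current directory, the following sketch builds a few throwaway files and runs file on them twice. The directory and file names are invented for illustration, and the exact descriptions printed vary between versions of file. It uses the long option --mime rather than -i, on the assumption that the long form is the more portable spelling; some versions of file (recent Mac OS X releases among them, as far as I know) use -I for MIME output.

```shell
# Scratch directory with a few sample files (all names are made up).
scratch=$(mktemp -d)
printf 'just some plain text\n' > "$scratch/notes.txt"
printf '#!/bin/sh\necho hello\n' > "$scratch/hello"
: > "$scratch/empty.html"    # zero-length file with a misleading name

# Human-readable descriptions, one line per file.
file "$scratch"/*

# The same files described as MIME types.
file --mime "$scratch"/*

# When finished: rm -rf "$scratch"
```

Note that empty.html is reported as empty rather than as HTML: file looks at the content, not at the name.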
Search for Files with a Specific Type of Content

We can pipe the results from file to grep to look for files with specific content.

$ file * | grep -i html
about-html.txt: ASCII text
fake.html: empty
index.html: ASCII HTML document text
This simple approach suffers from a problem: If the filename contains the search term, it will match too, regardless of the content. We must add a little sophistication to the search term to absorb everything from the beginning of the line to the colon after the filename, using a regular expression such as "^.*:", and then search for html.

$ file * | grep -i "^.*:.*html"
index.html: ASCII HTML document text

The regular expression searches from the start of a line (^) for anything (.*) followed by a colon and then anything followed by html.

Process Files with a Specific Content Type

It's easy to extend the pipeline example given above, making it pass the list of filenames to a command like Apple's TextEdit. To do this, we use awk to pass on just the filename, which is the first field of the line.

$ file * | grep -i "^.*:.*html" | awk '{print $1}'
index.html:

Then we use sed to chop off the colon.

$ file * | grep -i "^.*:.*html" | awk '{print $1}' \
    | sed 's/://'
index.html

Finally, we use xargs to form a command line from the list of files.

$ file * | grep -i "^.*:.*html" | awk '{print $1}' \
    | sed 's/://' | xargs open -a textedit

In this example, the command line will be

open -a textedit index.html

The command open -a runs the specified GUI program, so TextEdit opens index.html.
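The pipeline above treats the first whitespace-separated field as the filename, so a name containing spaces would be cut short. One possible way around both that and the filename-matching problem is to ask file for brief output (option -b, which omits the filename) and test one file at a time. This is a sketch rather than the text's method; the helper name list_by_type and the sample files are hypothetical.

```shell
# list_by_type TERM FILE...
# Hypothetical helper (not from the text): print each FILE whose type
# description, as reported by file, contains TERM (case-insensitive).
list_by_type() {
    term=$1
    shift
    for f in "$@"; do
        # -b (brief) omits the filename, so only the description is searched.
        if file -b "$f" | grep -qi -e "$term"; then
            printf '%s\n' "$f"
        fi
    done
}

# Scratch files for a quick trial; the names are invented for the demo.
demo=$(mktemp -d)
printf 'notes about markup\n' > "$demo/about-html.txt"
printf '<!DOCTYPE html>\n<html><head><title>t</title></head><body><p>hi</p></body></html>\n' \
    > "$demo/my page.html"

# Should list only the file whose content is HTML, spaces in the name intact.
list_by_type html "$demo"/*
```

On Mac OS X the result could then be handed to TextEdit with something like list_by_type html * | tr '\n' '\0' | xargs -0 open -a TextEdit, which keeps filenames containing spaces in one piece (names containing newlines would still be split).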
An alternative approach uses option -F, telling file to separate the filename from the content type with space-colon instead of just colon. Consequently, the first field seen by awk will be the filename without the colon.

$ file -F " :" * | grep -i "^.*:.*html" \
    | awk '{print $1}' | xargs open -a textedit

Search Compressed Files

Option -z tells file to look inside compressed files. Compare the output of the next two examples.

$ file week1.tbz2
week1.tbz2: bzip2 compressed data, block size = 900k
$ file -z week1.tbz2
week1.tbz2: POSIX tar archive (bzip2 compressed data, block size = 900k)

Expand and Unexpand Tabs

The expand command expands tab characters to the appropriate number of spaces, and unexpand does the reverse. Pass option -a to unexpand to ensure that all spaces are converted; otherwise, only leading spaces are converted.

Fold Long Lines

Long lines can be broken into shorter lines by the fold command. In this example, the output has lines of no more than 40 characters. Output is displayed on the terminal screen; to save the results, simply redirect output to a file by using > name-of-output-file.

$ cat longlines
this is a file with one very long line and no linefeeds in it to demonstrate the use of fold to break long lines into the specified width
$ fold -w40 longlines
this is a file with one very long line a
nd no linefeeds in it to demonstrate the
 use of fold to break long lines into th
e specified width

Tip
The fmt command is more sophisticated and breaks lines at spaces instead of midword.

$ fmt -40 longlines
this is a file with one very long line
and no linefeeds in it to demonstrate
the use of fold to break long lines
into the specified width

Split Large Files

Use the split command to split a long file into many smaller files. By default, each piece is 1,000 lines long. Specify option -l to change the number of lines in each piece.
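As a quick illustration of the default and of option -l, the sketch below splits a generated 2,500-line file both ways and then checks that the pieces join back into the original. The file name numbers.txt and the output prefixes are made up for the example; split appends suffixes such as aa, ab, and ac to whatever prefix it is given.

```shell
# Scratch directory and a 2,500-line sample file (names are illustrative).
work=$(mktemp -d)
seq 1 2500 > "$work/numbers.txt"

# Default behavior: pieces of 1,000 lines each.
split "$work/numbers.txt" "$work/part-default-"

# Option -l sets the number of lines per piece, here 500.
split -l 500 "$work/numbers.txt" "$work/part-500-"

# Line counts for every piece.
wc -l "$work"/part-default-* "$work"/part-500-*

# Concatenating the pieces in name order reproduces the original file.
cat "$work"/part-500-* | cmp - "$work/numbers.txt" \
    && echo "pieces reassemble to the original"

# When finished: rm -rf "$work"
```

Because the suffixes sort alphabetically, a plain cat with a wildcard is enough to put the pieces back in order.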